Archives for August 2006

A Tale of Two Formats

2006/08/22 By Rob 16 Comments

As he stood staring at them, they asked him no questions, for his face told them everything.

‘I cannot find it,’ said he, ‘and I must have it. Where is it?’

His head and throat were bare, and, as he spoke with a helpless look straying all around, he took his coat off, and let it drop on the floor.

‘Where is my bench? I’ve been looking everywhere for my bench, and I can’t find it. What have they done with my work? Time presses: I must finish those shoes.’

They looked at one another, and their hearts died within them.

Charles Dickens, a careful student of human nature, provides us here a vivid portrait of Dr. Alexandre Manette, who, after being held 18 years in the Bastille, is released, but is unable to adjust to his new freedom, and in times of stress lapses back to the familiarity of his prison labors, making shoes.

We all have been prisoners of Microsoft Office and their proprietary file formats. You may no longer recognize it as a prison, because this cell has been your home for the past 15 years, but here is what it looks like:

Editing a document requires Microsoft Office.
Since Office runs only on Windows, you also require Windows.
These restrictions lead to a purely heavy-client view of document processing.
This also leads to a model of programmability that emphasizes storing executable code (macros/script) inside of the document, resulting in years of security nightmares. Here is a typical recital of the known dangers.
If you don’t want to put script inside your document, you could access the data via Office automation APIs, but this again requires a machine running Windows and Office.
It also promotes a view of WYSIWYG that emphasizes early formatting and layout decisions and de-emphasizes semantic richness in documents. For example, see “What has WYSIWYG Done to Us?”.
The tools that were created for us to record our thoughts instead now constrain or even substitute for our thoughts. For example, “PowerPoint Panders to our Weaker Points” in the Guardian, and Tufte’s “PowerPoint is Evil”.
The above also led to a stifled market for 3rd party document processing tools. We will never see the value of what was never allowed to occur, but the opportunity cost of the innovation that did not happen in this single-vendor world is enormous.
This also led to a general lack of competition in the productivity editor market, leading to a decade of buggy products with little innovation. Is the “Ribbon” the most we can look forward to?
We’ve been locked into one-size-fits-all offerings of bloated applications. Many people are over-served by Office and therefore are over-paying for functionality they do not need, while others are under-served by the resulting products they cannot afford.
Functionality has been arbitrarily segregated into three and only three application classes, “Spreadsheet”, “Word Processor” and “Presentation Graphics”.

The move from proprietary binary formats to new standard formats, like OpenDocument Format (ODF), is a movement from imprisonment to freedom. The technical constraints have been lifted, but have we really made the mental adjustments necessary to engage our new freedom? Or are we still silently pacing a 10-foot cell in our minds? If we merely recreate our cell walls in XML, then we are still prisoners.

I am a creature of habit and have been as much a prisoner as you have, so don’t look to me for all the answers. But I do have a few thoughts on what this new freedom might look like.

Instead of being opaque black boxes that can only be used on one vendor’s system, documents will be transparent. Anyone can access them using whatever operating system and whatever tools they want, and for any purpose they want. Python on Linux, REXX on AS/400, and C# on Windows will all have equal opportunity.

This also implies that document processing will no longer be restricted, technically or by license, to the desktop. Innovative things will occur on servers. We’re starting to see some of that with Google Docs and wikiCalc. But that is only the beginning. We will see search engines that can intelligently search content for specific MathML expressions, spiders that will collect and aggregate slides from presentations and allow you to share them, document repositories that will automatically check citations in papers and calculate the intellectual social networks these imply, stock brokers that will allow you to download your statements formatted in a spreadsheet, with additional analytics calculated via spreadsheet formulas. Creating, editing, reading, viewing, storing, collaborating will be able to be done anywhere, from your cellphone to the largest servers.

Since the server typically has access not only to your own documents but to your organization’s as well, along with easy access to other information about the users, such as your role and group via LDAP, an application can drive workflows that relate the contents of the document to similar content, as well as to your organizational role, and to your business. The companies that unlock the knowledge stored by your knowledge workers in your organization’s documents will be the companies leading us into the next decade.

The old walls will fall that once segregated functionality into the arbitrarily defined boundaries of “Spreadsheet”, “Word processor”, and “Presentation graphics”. Dan Bricklin is leading the way with his wikiCalc. Is it a Spreadsheet or is it a Wiki? If you have to ask the question then you are still a prisoner. The point is wikiCalc is whatever Dan Bricklin wants it to be. That is freedom to innovate. We will see the arbitrary divisions between application genres become fuzzy and fall away as we all recognize our new freedom.

Document programmability will be turned inside-out. Instead of putting code inside of the document, turning documents into virus vectors, the code will be carefully segregated. Once the code and the data are distinct, we can put the code on the server, where it can be more easily managed, maintained, and secured. This clean separation of code and data will be as important to system stability and security as was protected-mode in the 80286 processor when it first enforced this data/code separation at operating system level. I see macro viruses becoming a thing of the past, like smallpox, because the importance of data/code separation will finally be enforced, and users will not be emailing around code disguised in documents.

We will start thinking of documents as data, and as inputs to modules that process data. I see visual design tools that will allow you to drag and drop a document template onto a design surface and expose various fields in the document which can be wired up to databases, web services or other data sources.

I see financial analysts creating financial models in spreadsheets, then converting the spreadsheet into a web application that can be deployed anywhere to provide access to, and execution of, the model via any browser.

I see a variety of productivity editors available at a variety of price points, from free, open source ones, to commercial offerings for desktop and other devices, to specialized offerings with extra features for vertical markets, like legal, medical, academic, or scientific uses.

I see an escape from documents-as-pictures, where users sweat over pixel-perfection and pray that the applications don’t screw them up. Today the end user doesn’t worry about font kerning. We rely on the font managers to get this right, and we accept the results, and concentrate on what we, the authors, add to the document. We are freed from that mental burden of kerning. But why stop there? With smarter applications, we will be freed of most or all formatting burdens. We will concentrate on writing, not on styling, and rely on the applications to get the appearance right. This will free our time to give an increased emphasis on semantic richness, putting our knowledge and experience and outlooks and opinions into the document, and encoding it in a way that allows new modes of collaboration and redefines what a document is.

That is a glimpse at what freedom looks like to me. But let’s not forget that being freed is not the same as being free. There are those out there who are attempting to merely recreate the same single-vendor closed system we’ve had for the past 10 years, and recoding it in XML. This may be a comfortable choice to those who have known no other way. But is it really freedom? I look out and see the jailer offering to sell 10-foot apartments to those just released from their 10-foot prison cells. Will you follow?

Change Log

1/30/2007 — updated wikiCalc link, made other assorted wording changes at my whim, corrected a spelling error, changed to curly quotes.

The 96.97 percent problem

2006/08/21 By Rob 6 Comments

The press release puts out numbers of awesome import. We finally have the answers we seek, the science of web analytics and super-duper tools has laid all doubts to rest:

Amsterdam – August 14 2006 – OneStat.com, the number one provider of real-time intelligence web analytics, today reported that Microsoft’s Windows dominates the operating system market with a global usage share of 96.97 percent. The leading operating system on the web is Microsoft’s Windows XP with a global usage share of 86.80 percent. Microsoft’s Windows 2000 has a global usage share of 6.09 percent and is the second most popular OS on the web.

The global usage share of Apple’s Macintosh is 2.47 percent and the global usage share of Linux is 0.36 percent.

So what’s wrong with this picture? The first thing that hits me is that the survey quotes results to four significant digits. This is unusual in a survey of this kind, since it implies error bars of only +/- 0.005%. Now, what probably really happened here is that 96.97 % of the sampled users were running Windows. But to apply that level of precision to the entire population as they do when they call it “a global usage share of 96.97 percent”, that is something else altogether. Just because you can calculate a number does not mean that you know a number.

According to their press release, OneStat sampled 2 million users from those who visited their customers’ web sites. We’ll deal with the potential bias issues later. But first let’s settle a statistical question: what sample size would be required to know results to 0.005%? This depends on the population size, the number of internet users, which in 2004 was estimated to be 840,000,000, so I’ll use a nice round billion (1,000,000,000) as an estimate for 2006.

There are a number of survey calculators on the web. I use this one from Creative Research Systems. Plug in the numbers into the Determine Sample Size form:

Confidence level = 95%
Confidence interval = 0.005
Population = 1000000000

Press Calculate and you will see that the required sample size is around 280 million. So a sample of only 2 million users, even if perfectly sampled, will not allow you to state numbers like 96.97%. It is off by a factor of 100.

So the question then is, how accurate are the results one can expect from “only” 2 million users? You can use the second calculator on that page, and get an answer of around 0.07%. That isn’t bad at all and may allow you to say 97.0 +/- 0.1%, which is nothing to sneeze at.

(You can also use that form to discover some interesting facts, like a random sample of 384 people is enough to represent a population of any size to within a 5% confidence interval at the 95% confidence level. It is this type of asymptotic behavior which allows market research firms to make predictions about the preferences of people all over the world, doing many small surveys, though you may find that you yourself may never be surveyed in your entire life.)
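The calculator's figures can be roughly reproduced with the textbook formulas for a proportion. The sketch below assumes a z-score of 1.96 for 95% confidence, the worst-case proportion p = 0.5, and a finite-population correction. Whether the web calculator uses exactly these formulas is an assumption, and the function names are made up for illustration.

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level (assumed)


def required_sample_size(margin, population, p=0.5, z=Z_95):
    """Sample size needed for a +/- margin (given as a fraction, so 5% is 0.05)."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population estimate
    return n0 / (1 + (n0 - 1) / population)      # finite-population correction


def margin_of_error(sample, population, p=0.5, z=Z_95):
    """Half-width of the confidence interval, as a fraction, for a given sample."""
    fpc = math.sqrt((population - sample) / (population - 1))
    return z * math.sqrt(p * (1 - p) / sample) * fpc


POPULATION = 1_000_000_000  # the round billion used in the text

# Sample needed to claim +/- 0.005%
print(f"{required_sample_size(0.00005, POPULATION):,.0f}")
# Margin actually supported by 2 million users
print(f"{margin_of_error(2_000_000, POPULATION):.4%}")
# Sample needed for +/- 5%
print(f"{required_sample_size(0.05, POPULATION):.0f}")
```

Under these assumptions the three results come out at roughly 278 million, about 0.07%, and 384, in line with the numbers quoted above.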

Now all of this is moot if the 2 million user sample is not representative of the total population. The results may be precise to one decimal place, but are they accurate? Are the people who visit the web sites of OneStat’s customers reflective of all web users? Are they typical in terms of country, language, income, age, gender, etc.? No supporting info is given.

Sampling bias can be a treacherous thing. For example, let’s look at this blog. Over the past few weeks I’ve received 30,807 visitors, of which 6,512 were running Linux and 14,335 were running Firefox. Based on those numbers, and assuming a world-wide web population of 1 billion, I can issue a press release stating the following:

With 95% confidence Linux has a global usage share of 21.1% (+/- 0.1%) and Firefox has a worldwide usage share of 46.5% (+/- 0.1%).

Based purely on the numbers, I have a sample size sufficient to support the stated precision. But do I think those numbers accurately reflect all web users?

In the end, it is a waste of time to do a survey of 2 million users unless you are rock solid sure that they are randomly selected and representative of the entire population. On the other hand, if you have a truly unbiased sample, you could tell the OS breakdown of the web to 1% precision with a sampling under 40,000 users.

The lesson? Don’t be awed by numbers. There is often less there than meets the eye.

Four Shorts

2006/08/21 By Rob 2 Comments

I. The OpenOffice.org Conference (OOoCon 2006) is coming up, September 11-13th in Lyon, France. The last day starts with a panel discussion of ODF topics, and follows with a track dedicated to ODF. I’m on at 14:00 with a presentation with the exciting title, “A Technical Comparison: ISO/IEC 26300 vs Microsoft Office Open XML (Ecma International TC45 OOXML WD 1.3)”.

The abstract is:

Two XML office file formats have been pressing upon our attention, the OASIS OpenDocument Format, recently standardized by ISO, and the Draft Ecma Office Open XML. This presentation will review the history of each, the process that created them, and examine each format to compare and contrast how they deal with issues such as extensibility, modularization, expressivity, performance, reuse of standards, programmability, ease of use, and application/OS neutrality.

II. KDE enthusiasts get together two weeks later, in Dublin, for their aKademy 2006. Tuesday the 26th will be OpenDocument Day. I’ll be there, and will give a lightning talk on something, probably related to some ODF programmability API ideas I’ve been having.

III. If you didn’t see it yet, Rick “Schematron” Jelliffe has an interesting post over at O’Reilly looking at ODF and OOXML documents from the perspective of XML complexity metrics. This is a topic which Rick has done a good deal of work with in the past, so it is interesting to hear what he has to say. Did I see something there about OpenOffice loading documents faster than Office?

IV. The ODF Formula Subcommittee has set up a wiki page on our work defining OpenFormula. A lot of good information is there. This page will be updated with the latest status, so you’ll want to make it the first place to go for the latest info on our progress.

A Demo: Mathematica, MathML and ODF

2006/08/20 By Rob 6 Comments

Here’s a short tutorial on exchanging MathML between Mathematica and OpenOffice, showing what is possible today, and offering some suggestions for closer integration.

First, start with a new ODF document in OpenOffice. It is often easier to modify an existing document, inheriting its structure and default styles, than to create a new document from scratch. So I believe that a lot of interesting projects with ODF will start with an existing document as a template, and then add or replace content in it.

So, here’s what I made, a simple file with a formula describing the Euclidean metric, our old friend the Pythagorean Theorem. Click the image to load the ODF file.

If you rename the ODF file to a .zip extension, and unzip it, you can see the XML files it contains. Always start with the manifest.xml, where I draw your attention to the entry with the type “application/vnd.oasis.opendocument.formula”. This, according to Appendix C of the ODF 1.0 specification, is the registered MIME type of an ODF formula document. So that sounds like what we want. Let’s replace that equation with something else.
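The same inspection can be scripted instead of renaming and unzipping by hand. Here is a minimal Python sketch that reads META-INF/manifest.xml from an ODF package and picks out the formula entries. The sample archive is built in memory as a stand-in for a real file, and its entry names (such as "Object 1/") are hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
FORMULA_TYPE = "application/vnd.oasis.opendocument.formula"

# Hypothetical manifest standing in for the one inside a real .odt file.
SAMPLE_MANIFEST = f"""<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest xmlns:manifest="{MANIFEST_NS}">
 <manifest:file-entry manifest:media-type="application/vnd.oasis.opendocument.text" manifest:full-path="/"/>
 <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="content.xml"/>
 <manifest:file-entry manifest:media-type="{FORMULA_TYPE}" manifest:full-path="Object 1/"/>
 <manifest:file-entry manifest:media-type="text/xml" manifest:full-path="Object 1/content.xml"/>
</manifest:manifest>"""


def build_sample_odt() -> bytes:
    """Create a tiny in-memory zip that mimics the layout of an ODF package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
        archive.writestr("META-INF/manifest.xml", SAMPLE_MANIFEST)
    return buffer.getvalue()


def manifest_entries(odf_bytes: bytes) -> dict:
    """Map each path listed in the manifest to its declared media type."""
    with zipfile.ZipFile(io.BytesIO(odf_bytes)) as archive:
        manifest = ET.fromstring(archive.read("META-INF/manifest.xml"))
    entries = {}
    for entry in manifest.iter(f"{{{MANIFEST_NS}}}file-entry"):
        path = entry.get(f"{{{MANIFEST_NS}}}full-path")
        entries[path] = entry.get(f"{{{MANIFEST_NS}}}media-type")
    return entries


entries = manifest_entries(build_sample_odt())
formulas = [path for path, media in entries.items() if media == FORMULA_TYPE]
print(formulas)
```

For a real document, pass the bytes of the file itself (read in binary mode) to manifest_entries instead of the sample.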

So into Mathematica we go. Suppose I want to calculate the indefinite double integral of the Euclidean metric. Why not? This is something I’d rather not do by hand, but I know Mathematica can quickly give me the answer:

Now I really don’t want to retype that result into OpenOffice. So, what can I do? I can use Mathematica’s ExpressionToMathML function to turn the above into MathML. When I do that I get MathML like this.

Let’s see now what happens if I simply drop that content in as a replacement for the original content.xml in the ODF file. Here’s what I get (click the image to open the ODF file):

So we got something, but it is not quite right. I’m seeing some little hollow boxes, usually an indication of an unprintable character. What’s up with this?

A closer look at the XML generated from Mathematica shows that these boxes are being displayed whenever the MathML uses the XML character entities corresponding to section 6.2.4 “Non-Marking Characters” of the MathML specification. This includes things like “InvisibleTimes” which handles cases where adjacency represents multiplication (xy == x*y). Using these characters provides hints to the application that can help it optimize its rendering and editing, but they should not be displayed.

In any case there appears to be a bug in OpenOffice 2.0.3 where it tries to display these characters and finds they don’t map to any printable Unicode character. No big deal, I will enter a bug report on that later. But for now I can easily clean this up by defining a new function in Mathematica, ExpressionToOO, defined as follows:

(Note I didn’t name this “ExpressionToODF”, since strictly speaking the ODF specification allows MathML 2.0, including the non-marking characters. This function is specifically to work around an OpenOffice bug. It outputs valid MathML, simply removing the non-marking characters which OO doesn’t understand.)
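For readers without Mathematica, the same cleanup idea can be sketched in Python. This is not the ExpressionToOO definition, only an illustration of the technique. It assumes the MathML uses numeric character references, and it drops only operator elements made up of the three invisible operators (function application, invisible times, invisible comma), which is a narrower set than the full list of non-marking characters.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# Assumption: only these three invisible operators are treated as removable.
INVISIBLE = {"\u2061", "\u2062", "\u2063"}

# Hypothetical input: "x y" with an invisible-times operator between the terms.
SAMPLE = (
    f'<math xmlns="{MATHML_NS}"><mrow>'
    "<mi>x</mi><mo>&#8290;</mo><mi>y</mi>"
    "</mrow></math>"
)


def strip_invisible_operators(mathml: str) -> str:
    """Remove <mo> elements whose only content is an invisible operator."""
    ET.register_namespace("math", MATHML_NS)  # serialize with a math: prefix
    root = ET.fromstring(mathml)
    # Materialize the element list first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            text = (child.text or "").strip()
            if child.tag == f"{{{MATHML_NS}}}mo" and text in INVISIBLE:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


cleaned = strip_invisible_operators(SAMPLE)
print(cleaned)
```

Removing the whole operator element, rather than just the character inside it, avoids leaving empty operators behind in the markup.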

So, back to Mathematica, I run ExpressionToOO, grab that XML and inject that XML into the ODF document, and we get the following (click to open the ODF file):

That’s what we want! For those who are interested, the complete Mathematica notebook is here: Session.nb.
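The inject step can also be scripted. Python's zipfile module cannot overwrite a member in place, so this sketch copies the package into a new archive and swaps in the new content along the way. The member name "Object 1/content.xml" and the sample contents are hypothetical, and by ODF packaging convention the "mimetype" entry is kept first and stored uncompressed.

```python
import io
import zipfile


def replace_member(odf_bytes: bytes, member: str, new_data: bytes) -> bytes:
    """Return a copy of the archive with one member's contents replaced."""
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(odf_bytes)) as source, \
            zipfile.ZipFile(output, "w") as target:
        if member not in source.namelist():
            raise KeyError(member)
        for info in source.infolist():  # preserves the original entry order
            if info.filename == member:
                data = new_data
            else:
                data = source.read(info.filename)
            # Keep "mimetype" uncompressed; deflate everything else.
            if info.filename == "mimetype":
                method = zipfile.ZIP_STORED
            else:
                method = zipfile.ZIP_DEFLATED
            target.writestr(info.filename, data, compress_type=method)
    return output.getvalue()


# Build a tiny stand-in package, then swap the formula content.
original = io.BytesIO()
with zipfile.ZipFile(original, "w") as archive:
    archive.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    archive.writestr("Object 1/content.xml", b"<math>old</math>")

updated = replace_member(
    original.getvalue(), "Object 1/content.xml", b"<math>new</math>"
)

with zipfile.ZipFile(io.BytesIO(updated)) as check:
    names = check.namelist()
    content = check.read("Object 1/content.xml")
print(names, content)
```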

As you can see, this isn’t rocket science, though no doubt it may be useful to rocket scientists. Consider this a little “proof of concept”. Real end users will not be going around unzipping ODF documents and copying XML around. There needs to be some additional integration work to make this process simple and joyful. For example:

A Mathematica function that automatically inserts a formula into an ODF document
An OpenOffice add-in that lets the user browse formulas from Mathematica and insert them into the current working document.
Clipboard level exchange of MathML between OpenOffice and Mathematica
An export filter from OpenOffice to export to the XHTML+MathML+SVG profile defined by the W3C. This, combined with Firefox, would provide kickass scientific publishing using open standards and tools.

Note that I’m using Mathematica here just as an example. There are over 100 MathML supporting applications out there, both commercial and open source. I’d be interested in hearing what other ideas people have for workflows involving ODF editors and other tools that work with the standards ODF includes, not just MathML, but SVG, XForms, etc. Let’s demonstrate the value of open standards working together.

Math You Can’t Use

2006/08/06 By Rob 13 Comments

Summary: In this post I will look at MathML, a web standard for displaying mathematical equations. I will show how well established it is on the web, how it is integrated into ODF, and how Microsoft has decided to go off in another direction with OMML, another “stealth” standard hidden in their 4,000 page Office Open XML specification, but little mentioned. As I did with my prior analysis of their reliance on the rejected VML specification, I will show why this is a bad thing.

I’ve been reading Math You Can’t Use: Patents, Copyright, and Software, a book by Ben Klemens, Guest Scholar at the Brookings Institution. It examines the current state of software patents in the U.S. and the abuses thereof. He blends his legal and economic policy background with his insights as a programmer to give a perspective worth hearing. Mind you, I don’t agree with him on many points, and in fact I found the book infuriating at times, but he does make a serious argument and I respect that. In any case I like to have my opinions challenged every now and then. It keeps the mind limber.

Although I am not going to talk about patents and copyrights today, I will steal the title of this book and talk a bit about math, the kind you can use as well as the type you can’t. The topic for today is MathML.

MathML is a web standard from the W3C, an XML vocabulary for representing the structure and content of mathematical expressions. In other words, it represents equations for display, especially complicated expressions with integrals, summations, products, limits and all the Greek you can throw at it.
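To give a feel for the markup, here is a small Python sketch that builds presentation MathML for the Pythagorean relation a² + b² = c² and prints it. The helper names are made up for illustration. The element names (math, mrow, msup, mi, mn, mo) come from the MathML presentation vocabulary.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def el(tag, text=None, *children):
    """Create a MathML element with optional text and child elements."""
    node = ET.Element(f"{{{MATHML_NS}}}{tag}")
    node.text = text
    node.extend(children)
    return node


def squared(name):
    """<msup> holds a base and a superscript: here, an identifier squared."""
    return el("msup", None, el("mi", name), el("mn", "2"))


def pythagoras():
    """Build <math><mrow> a^2 + b^2 = c^2 </mrow></math>."""
    row = el(
        "mrow", None,
        squared("a"), el("mo", "+"), squared("b"), el("mo", "="), squared("c"),
    )
    return el("math", None, row)


ET.register_namespace("", MATHML_NS)  # emit MathML as the default namespace
markup = ET.tostring(pythagoras(), encoding="unicode")
print(markup)
```

The output is ordinary XML text, which is what lets browsers, editors and other tools exchange the same equation.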

If you are running Firefox and have installed the math fonts then you can get an idea of its capabilities by loading MathML-enabled pages right now, like this one. If you are running Internet Explorer, then sadly you lack native support for MathML, but a browser plugin is available.

MathML 1.0 dates back to 1999, and has been revised through MathML 2.0 (second edition) in 2003.

There are about 100 implementations of MathML if you count producers, consumers and editors, including the powerful software used by working mathematicians and scientists like Maple and Mathematica.

The W3C has made a special effort to get the various MathML vendors together to evaluate how well they handle MathML and this is reported out in their Implementation and Interoperability Report .

Where MathML is supported natively, such as in Firefox, it will render along with the text, and not merely as an embedded GIF image. So, it will scale to different screen resolutions and print well. In theory, since it is just text markup in the page, it can be indexed by an intelligent search engine, though I am aware of none that do this currently. (Is there any use for a Google search of all web pages that include a 3rd degree polynomial inequality? I wouldn’t want to be the first to say “No”.)

MathML also is the key to enabling better support for mathematics via screen readers and other assistive agents. When a visually impaired user is presented an equation in the form of a GIF or other image format, they are left out. But put the formula in MathML and the possibilities look better. The work is not complete yet, but progress is being made. For example, see this report from CSUN 2004 and NIDE’s MathML Accessibility Project.

Further innovations are seen at sites like Wolfram’s MathMLCentral where we see web services for creating, displaying, or even integrating MathML expressions, using their Mathematica program as the backend.

For the above, and many other reasons, MathML was the only logical choice for us to use to support equations in OpenDocument Format (ODF). With such a thriving ecosystem of producers and consumers, with support in the tools used by academia and industry like Mathematica and Maple, strong support in web browsers like Firefox, with the accessibility initiatives around it, I don’t see how you could argue otherwise. MathML is the way the web does math.

But the choice of MathML is more than just a fashion statement. It has practical significance and enables opportunities for innovative workflows around mathematical document production. If you create an equation block in OpenOffice, it saves the equation as a standalone MathML XML document in the ODT document archive. This makes it very easy to access, read, replace, etc.

We should be thinking about workflows like the following:

1. Do your complicated calculations in a tool like Mathematica
2. When you get the final results you want, export them to MathML, for example, using Mathematica’s MathMLForm[ ] function.
3. Copy the MathML into an ODF document archive
4. Take the ODF document and complete the prose write-up of the document in OpenOffice
5. Share the draft with colleagues, review, etc., in the editable ODF format
6. When ready to publish, export to XHTML with embedded MathML preserved for the equations, and embedded SVG for the charts.
7. Users can then view in Firefox or Internet Explorer (with extra plugin)

We’re not quite there yet, end to end. Step #6 in particular is not working as I’d expect in OpenOffice 2.0.3. But you get the idea. There is opportunity for fame, glory and perhaps some profit to the person or company who provides an end-to-end mathematical editing and publishing solution based on open standards.

So, in this happy world I’ve described, what is missing? If you guessed “Microsoft Office” then you guessed correctly! Even though MathML is a 7-year-old standard, widely implemented, supported by the leading mathematical tools, the preferred format for publishing math on the web, etc., etc. (the mantra should be familiar), Microsoft has ignored it and instead is pushing forward a new competing format in their Office Open XML (OOXML) specification rushing through Ecma.

The new math markup format is called OMML and you’ve probably never heard of it. You can check Google, you can check Wikipedia, you can check MSDN. You won’t find it. In fact, I’m not even sure what OMML stands for since the acronym is not defined in the spec. But it is there, nestled away in the 4,081 page draft OOXML specification as the markup that “specifies the structures and appearance of equations in the document”, Section 25.1, all 93 pages of it.

OMML is not MathML, though it solves the same problem. But if you use OMML, it will not work with Firefox, with Mathematica, with OpenOffice or with any of the other 100 applications that support MathML. OMML works with Office, and that’s it. One door in, no doors out.

Consider that Ecma TC45’s Programme of Work included the goal of:

….enabling the implementation of the Office Open XML Formats by a wide set of tools and platforms in order to foster interoperability across office productivity applications and with line-of-business systems.

How exactly does the OOXML specification foster this interoperability when it ignores relevant web standards like MathML (and SVG and XForms)?

Microsoft’s typical argument is to say that the existing standards are inadequate, that Microsoft users expect more, that they need more features, that this is because they need to deal with billions of documents and trillions of dollars, etc. But this rings hollow when talking about math. An examination of the history of mathematical notation demonstrates, as you may already know, that mathematical notation is not exactly experiencing a high rate-of-change. Equations, as used in math and sciences, for the most part use the same notation they did 100 years ago, and many parts of notation are 200-300 years old. Certainly there is no essential change in notation since 1999, when MathML was created.

Now if Microsoft had merely wanted to create a proprietary format for equations and use that in Word in order to trap their customers onto that platform, then I’d simply say that’s not my concern and I’d blog about my heirloom tomatoes or something else. But when this shows up in a nominally open standard destined for approval by ISO, then this raises my eyebrows a little. The obvious choice would have been to simply reuse MathML. So, why are they creating and standardizing a whole new math markup language? Are there no standards worth reusing? Will XPS replace PDF, VML replace SVG, Windows Media Photo format replace PNG, OMML replace MathML, and OOXML replace ODF? Let’s say “No” to OMML and “Yes” to MathML, the math you can use.