ODF

Unlocking the Wordhord

2006/11/01 By Rob

I have a backlog of shorter items that I’ve accumulated in recent weeks that I’d like to share with you. I hope you find something here interesting.

First, congratulations to OpenOffice.org and KOffice, which both recently announced new releases. To my mind the notable features include an improved extensions framework in OpenOffice.org 2.0.4, and leading MathML conformance scores and command-line (UI-less) scripting in KOffice 1.6. Combined with the recent release of Firefox 2.0, it feels like Christmas has come early this year!

I get the feeling that there are more good things to come. Eike Rathke blogs about order of magnitude performance improvements in load time for large spreadsheets, a fix targeted for OpenOffice.org 2.1.

Also worth watching is some emerging technology at Adobe: a project codenamed “Mars”, which appears to be a reformulation of PDF based on open standards such as SVG, PNG, JPG, JPG2000, OpenType, XPath and XML, all sitting in a Zip container file. There is a voice in my head saying, “This is important”. For example, could we have a single container file that included both ODF editable content and Mars/PDF for high-fidelity presentation? That way you can hand a document to someone and they can either view/edit it in a full heavy-weight editor, or get a fast high-fidelity read-only rendering. Both modes of use from the same file. To make this, and other cool things, happen, Mars and ODF will want to sync up on things like packaging, manifests and metadata. Adobe, call me ;-)

Two new ODF whitepapers to note. J. David Eisenberg looks at ODF and XForms and how they work together in OpenOffice.org, using a wrestling club application form as an example. Of course, source code is included. “Opportunities for innovation with OpenDocument Format XML” is the title of a new IBM whitepaper also just posted.

A couple of weeks ago I participated in a roundtable discussion on ODF at the Berkman Center at Harvard Law School, held by the TransAtlantic Consumer Dialogue forum. You’ve probably already read James Love’s post on it on The Huffington Post. If not, take a look. Since I tend to spend my days with two kinds of people, the technical and the very technical, it was good to get out and hear a different perspective on the issues.

A familiar face at the Berkman Center was Sam Hiser, who has a new post, at once both visceral and witty, called “Pretending Interoperability”.

Finally, in order to increase the signal-to-noise ratio in this blog, I’ve instituted a new comment policy. Those comments which are outside of the prescribed bounds will not be published.

Why is OOXML Slow?

2006/10/19 By Rob 5 Comments

Of course, one could simply dismiss this question, saying that a specification for an XML vocabulary does not have performance as such, since a specification cannot be executed. However, the choices one makes in designing an XML language will impact the performance of the applications that work with the format. For example, both ODF and OOXML store their XML in compressed Zip files. This will cause reading and writing of a document to be faster in cases where memory is plentiful and computation is much faster than storage and retrieval, which is to say on most modern desktops. But this same scheme may be slower in other environments, say PDAs. In the end, the performance characteristics of a format cannot be divorced from the operational profile and environmental assumptions.

When comparing formats, it is important to isolate the effects of the format versus the application. This is important from the analysis standpoint, but also for legal reasons. Remember that the only implementation of (draft) OOXML is (beta) Office 2007, and the End User Licence Agreement (EULA) has this language:

7. SCOPE OF LICENSE. …You may not disclose the results of any benchmark tests of the software to any third party without Microsoft’s prior written approval

So let’s see what I can do while playing within those bounds. I started with a sample of 176 documents, randomly selected from Ecma TC45’s document library. I’m hoping therefore that Microsoft will be less likely to argue that these are not typical. These documents are all in the legacy binary DOC format and include agendas, meeting minutes, drafts of various portions of the specification, etc.

Some basic statistics on this set of documents:

Min length = 1 page
Mode = 2 pages
Median length = 7 pages
Mean length = 34 pages
Max length = 409 pages

Min file size = 27,140 bytes
Median file size = 159,000 bytes
Mean file size = 749,000 bytes
Max file size = 15,870,000 bytes

So rather than pick a single document and claim that it reflects the whole, I looked at a wide range of document sizes in use within a specific document-centric organization.

I converted each document into ODF format as well as OOXML, using OpenOffice.org 2.0.3 and Office 2007 beta 2 respectively. As has been noted before, both ODF and OOXML formats are XML inside of a Zip archive. The compression from the zipping not only counters the expansion factor of the XML, but in fact results in files which are smaller than the original DOC files. The average OOXML document was 50% the size of the original DOC file, and the average ODF document was 38% the size of the DOC. So the net result is that the ODF documents came out smaller, averaging 72% of the size of their OOXML equivalents.

A quick sanity check of this result is easy to perform. Create an empty file in Word in OOXML format, and an empty file in OpenOffice in ODF format. Save both. The OOXML file ends up being 10,001 bytes, while the ODF file is only 6,888 bytes, or 69% of the OOXML file.

Here is a histogram of the ODF/OOXML size ratios for the sampled files. As you can see, there is a wide range of behaviors here, with some files even ending up larger in ODF format. But on average the ODF files were smaller.

What about the contents of the Zip archives? The OOXML documents tended to contain more XML files (on average 6 more) than the parallel ODF documents, but these XML files were individually smaller, averaging 32,080 bytes versus 66,490 bytes for ODF. However, the net effect is that the average total size of the XML in the OOXML documents is greater than in the ODF documents (684,856 bytes versus 401,406 bytes).
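A measurement of this kind can be sketched with Python’s standard zipfile module. This is only an illustration, not the script used for the numbers above: the function names are made up here, only members ending in “.xml” are treated as XML parts (an assumption), and make_demo_package builds a tiny stand-in archive rather than a real ODF or OOXML file.

```python
import io
import zipfile


def xml_part_sizes(package):
    """Map each .xml member of a Zip package to its uncompressed size in bytes."""
    with zipfile.ZipFile(package) as zf:
        return {
            info.filename: info.file_size
            for info in zf.infolist()
            if info.filename.lower().endswith(".xml")  # assumption: only *.xml counts
        }


def summarize(package):
    """Return (number of XML parts, total XML bytes, mean bytes per part)."""
    sizes = xml_part_sizes(package)
    count = len(sizes)
    total = sum(sizes.values())
    return count, total, (total / count if count else 0.0)


def make_demo_package():
    """Build a tiny in-memory stand-in for a document package (hypothetical data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("content.xml", "<a>" + "x" * 97 + "</a>")  # 104 bytes of XML
        zf.writestr("styles.xml", "<s/>")                       # 4 bytes of XML
        zf.writestr("mimetype", "application/vnd.oasis.opendocument.text")
    return buf


if __name__ == "__main__":
    print(summarize(make_demo_package()))
```

Passing a file path instead of the in-memory buffer works the same way, so the summary can be collected for each converted document and then averaged per format.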

Here’s part 2 of the experiment. The proposal is that many (perhaps most) tools that deal with these formats will need to read and parse all of the XML files within the archive. So a core part of performance that these apps will share is how long it takes to unzip and parse these XML files. Of course this is only part of the performance story. What the application does with the parsed data is also critical, but that is application-dependent and hard to generalize. But the basic overhead of parsing is universal.

To test this out, I wrote a Python script to time how long it takes to unzip and parse (with Python 2.4 minidom) all the XML files in these 176 documents. I repeated each measurement 10 times and averaged. And I did this for both the OOXML and the ODF variants.
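The original script is not reproduced here, but a minimal sketch of such a timing harness might look like the following. It is written for current Python 3 rather than Python 2.4, the function names are invented for illustration, and the demo package is a tiny hypothetical archive, so any timings it yields say nothing about real documents.

```python
import io
import time
import zipfile
from xml.dom import minidom


def parse_all_xml(package):
    """Unzip every .xml member of the package, DOM-parse it, and return the count."""
    parsed = 0
    with zipfile.ZipFile(package) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".xml"):
                # Build the DOM, then release it so trees do not pile up in memory.
                minidom.parseString(zf.read(name)).unlink()
                parsed += 1
    return parsed


def mean_parse_seconds(package, repeats=10):
    """Average wall-clock seconds for one full unzip-and-parse pass."""
    start = time.perf_counter()
    for _ in range(repeats):
        parse_all_xml(package)
    return (time.perf_counter() - start) / repeats


def make_demo_package():
    """Build a tiny in-memory stand-in for a document package (hypothetical data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("content.xml", "<doc><p>hello</p></doc>")
        zf.writestr("meta.xml", "<meta/>")
        zf.writestr("mimetype", "text/plain")
    return buf


if __name__ == "__main__":
    print(mean_parse_seconds(make_demo_package(), repeats=10))
```

Running mean_parse_seconds on the ODF and OOXML versions of the same document and dividing one result by the other gives a per-document ratio of the kind discussed here.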

The results indicate that the ODF documents were parsed, on average, 3.6x faster than the equivalent OOXML files. Here is a plot showing the ratio of OOXML parse time to ODF parse time as a function of page count:

As you can see, there is a wide variation in this ratio, especially with shorter documents. In some cases the OOXML document took 8x or more longer to parse than the equivalent ODF document. But with longer documents the variation settles down and converges on the 3.6x factor mentioned above.

Now how do we explain this? A rough model of XML parsing performance is that it has a fixed overhead to start up, initialize data structures, parse tables, etc., and then some incremental cost dependent on the size and complexity of the XML document. Most systems in the world work like this, fixed overhead plus incremental cost per unit of work. This is true whether we’re talking about XML parsing, HTTP transfers, cutting the lawn or giving blood at a blood bank. The general insight into these systems is that where the fixed overhead is significant, you want to batch up your work. Doing many small transactions will kill performance.
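The model can be put into a few lines of Python. The cost figures below are invented purely to show the shape of the effect; they are not measurements.

```python
def total_parse_time(file_sizes, fixed_overhead, cost_per_unit):
    """Fixed start-up cost for every parse plus a cost proportional to its size."""
    return sum(fixed_overhead + cost_per_unit * size for size in file_sizes)


# Hypothetical numbers: 5 time units of overhead per parse, 1 unit per KB parsed.
one_big_file = total_parse_time([400], fixed_overhead=5, cost_per_unit=1)
ten_small_files = total_parse_time([40] * 10, fixed_overhead=5, cost_per_unit=1)

print(one_big_file, ten_small_files)
```

The same 400 KB costs more when split across ten files, because the fixed overhead is paid ten times instead of once.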

So one theory is that OOXML is slower because of the cost of initializing more XML parses. But it could also be because the aggregate size of the XML files is larger. More testing would be required to gauge the relative contribution of these two factors. However, one thing is clear. Although this test was done with minidom on Python, the results are of wide applicability. I can think of no platform and no XML parser for which a larger document comprised of more XML files would be faster than a smaller document made up of fewer XML files. Parsing ODF word processing documents should be faster than OOXML versions everywhere.

I’m not the first one to notice some of these differences. Rick Jelliffe did some analysis of the differences between OOXML and ODF back in August. He approached it from a code complexity view, but in passing noted that the same word processor document loaded faster in ODF format in OpenOffice compared to the same document in OOXML format in Office 2007 beta. On the complexity side he noted that the ODF markup was more complex than the parallel OOXML document. So if ODF is more complex but also smaller, this may amount to higher information density, compactness of expression, etc., and that could certainly be a factor in performance.

So what’s your theory? Why do you think ODF word processing documents are faster than OOXML’s?

The Celerity of Verbosity

2006/10/17 By Rob 16 Comments

I’ve been hearing some rumblings from the north-west that Ecma Office Open XML (OOXML) format has better performance characteristics than OpenDocument Format (ODF), specifically because OOXML uses shorter tag names. Putting aside for the moment the question of whether OOXML is in fact faster than ODF (something I happen not to believe), let’s take a look at this reasonable question: What effect does using longer, humanly readable tags have on performance compared to using more cryptic terse names?

Obviously there are a number of variables at play here:

What XML API are you using? DOM or SAX? The overhead of holding the entire document in memory at once would presumably cause DOM to suffer more from tag length than SAX.
What XML parser implementation are you using? The use of internal symbol tables might make tag length less important or even irrelevant in some parsers.
What language are you programming in? Some languages, like Java, have string interning features which can conflate all identical strings into a single instance.
What size document are you working with? Document parsing has fixed overhead as well as overhead proportionate to document size. A very short document will be dominated by fixed costs.
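As a small illustration of the interning point above, Python’s sys.intern is a comparable facility to the Java feature: equal strings can be collapsed into one shared object, so a repeated tag name need only be stored once regardless of its length. The helper name and the sample tag name are made up for this sketch.

```python
import sys


def intern_all(names):
    """Return the names with equal strings collapsed to one shared object."""
    return [sys.intern(name) for name in names]


# Build two equal strings at run time so they begin life as separate values.
first = "".join(["text:", "paragraph"])
second = "".join(["text:", "paragraph"])

shared = intern_all([first, second])
print(first == second)         # compares contents
print(shared[0] is shared[1])  # compares object identity after interning
```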

So there may not be a single answer for all users with all tools in all situations.

First, let’s talk a little about the tag length issue. It is important to note that the designer of an XML language has control over some, but not all names. For example take a namespace declaration:

xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"

The values of namespace URIs are typically predetermined and are often long in order to reduce the chance of accidental collisions. But the namespace prefix is usually chosen to be quite short, and is under the control of the application writing the XML, though a specific prefix is typically not mandated by the language designer.

Element and attribute names can certainly be set by the language designer.

Attribute values may or may not be determined by the language designer. For example:

val="Heading1"

Here the name of the style may be determined by the template, or even directly by the user if he is entering a new named style. So the language designer and the application may have no control over the length of attribute values. Other attribute values may be fixed as part of the schema, and the lengths of those are controlled by the language designer.

Similarly, the length of character content is also typically determined by the user, since this is typically how free-form user content is entered, i.e., the text of the document.

Finally, note that the core XML markup for beginning and ending elements, delimiting attribute values, character entities etc., are all non-negotiable. You can’t eliminate them to save space.

Now for a little experiment. For the sake of this investigation, I decided to explore the performance of a DOM parse in Python 2.4 of a medium-sized document. The document I picked was a random, 60 page document selected from Ecma TC45’s XML document library which I converted from Microsoft’s binary DOC format into OOXML.

As many of you know, an OOXML document is actually multiple XML documents stored inside a Zip archive file. The main content is in a file called “document.xml” so I restricted my examination to that file.

So, how much overhead is there in our typical OOXML document? I wrote a little Python script to count up the size of all of the element names and attribute names that appeared in the document. I counted only the characters which were controllable by the language designer. So w:pPr counts as three characters, counting only “pPr”, since the namespace prefix and XML delimiters cannot be removed. “pPr” is what the Namespaces in XML specification calls an NCName, a “non-colonized name”, that is, a name with no namespace prefix attached. There were 51,800 NCNames in this document, accounting for 16% of the overall document size. The average NCName was 3.2 characters long.
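The counting script itself is not shown, but the idea can be sketched with the standard xml.sax module. Two simplifications here are assumptions rather than the method actually used: each element name is counted once per start tag (end tags are ignored), and namespace declarations (xmlns attributes) are skipped. The class and function names are hypothetical.

```python
import xml.sax


class NCNameCounter(xml.sax.ContentHandler):
    """Tally the local (prefix-free) parts of element and attribute names."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.chars = 0

    def _add(self, qname):
        local = qname.split(":", 1)[-1]  # "w:pPr" -> "pPr"
        self.count += 1
        self.chars += len(local)

    def startElement(self, name, attrs):
        self._add(name)
        for attr_name in attrs.getNames():
            if attr_name == "xmlns" or attr_name.startswith("xmlns:"):
                continue  # namespace declarations are not designer-chosen names
            self._add(attr_name)


def ncname_stats(xml_bytes):
    """Return (NCName count, total NCName characters, mean NCName length)."""
    handler = NCNameCounter()
    xml.sax.parseString(xml_bytes, handler)
    mean = handler.chars / handler.count if handler.count else 0.0
    return handler.count, handler.chars, mean


sample = (b'<w:document xmlns:w="urn:example">'
          b'<w:p><w:pPr w:val="Heading1"/></w:p></w:document>')
print(ncname_stats(sample))
```

In the sample, the names counted are document, p, pPr and val, so the statistics cover four names and fifteen characters.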

For comparison, a comparably sized ODF document had an average NCName length of 7.7, and NCNames represented 24% of the document size.

So, ODF certainly uses longer names than OOXML. Personally I think this is a good thing, from the perspective of readability, a concern of particular interest to the application developer. Machines will get faster, memory will get cheaper, bandwidth will increase and latency will decrease, but programmers will never get any smarter and schedules will never allow enough time to complete the project. Human evolution progresses at too slow a speed. So if you need to make a small trade-off between readability and performance, I usually favor readability. I can always tune the code to make it faster. But the developers are at a permanent disadvantage if the language uses cryptic names. I can’t tune them.

But let’s see if there is really a trade-off to be made here at all. Let’s measure, not assume. Do longer names really hurt performance as Microsoft claims?

Here’s what I did. I took the original document.xml and expanded the NCNames for the most commonly-used tags. Simple search and replace. First I doubled them in length. Then quadrupled. Then 8x longer. Then 16x and even 32x longer. I then timed 1,000 parses of these XML files, choosing the files at random to avoid any bias over time caused by memory fragmentation or whatever. The results are as follows:

Expansion Factor	NCName Count	Total NCName Size (bytes)	File size (bytes)	NCName Overhead	Average NCName Length (bytes)	Average Parse Time (seconds)
1 (original)	51,800	166,898	1,036,393	16%	3.2	3.3
2	51,800	187,244	1,056,739	18%	3.6	3.2
4	51,800	227,936	1,097,443	21%	4.4	3.2
8	51,800	309,320	1,178,827	26%	6.0	3.2
16	51,800	472,088	1,341,595	35%	9.1	3.3
32	51,800	797,624	1,667,131	48%	15.4	3.3
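The procedure behind a table like this can be sketched as follows. This is a reconstruction under assumptions, not the original script: names are lengthened by repeating them, the search and replace is done with a regular expression that only touches names directly inside tags, and the 1,000 parses are treated as a total spread randomly across the variants. The function names and the seed parameter are invented for illustration.

```python
import random
import re
import time
from xml.dom import minidom


def expand_names(xml_text, prefix, names, factor):
    """Lengthen the given local names by repeating each one `factor` times."""
    # Try longer names first so "pPr" is not mistaken for "p".
    alternatives = "|".join(
        re.escape(n) for n in sorted(names, key=len, reverse=True))
    pattern = re.compile(
        rf"(</?|\s){re.escape(prefix)}:({alternatives})(?=[\s/>=])")
    return pattern.sub(
        lambda m: f"{m.group(1)}{prefix}:{m.group(2) * factor}", xml_text)


def time_random_parses(variants, total_parses=1000, seed=0):
    """DOM-parse randomly chosen variants; return {label: mean seconds per parse}."""
    rng = random.Random(seed)
    labels = list(variants)
    totals = {label: 0.0 for label in labels}
    counts = {label: 0 for label in labels}
    for _ in range(total_parses):
        label = rng.choice(labels)  # random order spreads any drift over time
        start = time.perf_counter()
        minidom.parseString(variants[label]).unlink()
        totals[label] += time.perf_counter() - start
        counts[label] += 1
    return {l: totals[l] / counts[l] for l in labels if counts[l]}


original = ('<w:document xmlns:w="urn:example">'
            '<w:p><w:pPr w:val="Heading1"/></w:p></w:document>')
variants = {f: expand_names(original, "w", ["p", "pPr"], f) for f in (1, 2, 4)}

if __name__ == "__main__":
    print(time_random_parses(variants, total_parses=100))
```

Because start tags, end tags and attributes are rewritten by the same pattern, each expanded variant stays well-formed and differs from the original only in name length.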

If you like box-and-whisker plots (I sure do!) then here you go:

What does this all mean? Even though we expanded some NCNames to 32 times their original length, making a 5x increase in the average NCName length, it made no significant difference in parse time. There is no discernible slowdown in parse time as the element and attribute names increase.

Keep in mind again that the typical ODF document shows an average NCName length of 7.7. The above tests dealt with lengths twice that amount, and still no slowdown.

“Myth Busted”. I revert this topic to the spreaders of such FUD to substantiate their contrary claims.

When language goes on holiday

2006/10/15 By Rob 4 Comments

This apt phrase is from Wittgenstein, Philosophical Investigations, section 38, “Philosophical problems arise when language goes on holiday”. One cannot be sloppy in language without at the same time being sloppy in thought.

Of course, this thought is not new. In Analects 13:3, Confucius is given a hypothetical question by a disciple: “If the ruler of Wei put the administration of his state in your hands, what would you do first?”. Confucius replied, “There must be a Rectification of Names,” explaining:

If language is not correct, then what is said is not what is meant; if what is said is not what is meant, then what must be done remains undone; if this remains undone, morals and art will deteriorate; if justice goes astray, the people will stand about in helpless confusion. Hence there must be no arbitrariness in what is said. This matters above everything.

In that spirit, let us talk of “choice”, a word loaded with meaning. Choice is good, right? Who would voluntarily give up his god-given right to choose for himself? Reducing choice is immoral. A central role of government is to ensure that we can choose freely. For a market to thrive it must be free of every regulation that reduces our ability to choose. These are all self-evident truths.

Or are they?

Let me set you a problem. I place before you a glass of water. Whether it is half full or half empty I leave to your imagination. What use is this glass of water to you? Certainly you can drink it. Or you could sell it to someone else. Or you could create a derivative option to buy the water, and sell this option to someone else. Or you could pledge the water as collateral for some other purchase. You have several options, several choices. But suppose you are thirsty. Then what do you do with this nice, cold glass of water? If you drink it, then you can no longer sell it, sell options on it, or pledge it. Drinking the water eliminates choice. So better not to drink it. Just let it sit there, on the table. But still you get thirstier and thirstier.

What a cruel dilemma I’ve given you! You cannot drink without reducing your future options, without eliminating choice. Of course, the water slowly gets warmer and evaporates. Even not choosing is itself a choice.

The Moving Finger writes; and, having writ,
Moves on: nor all your Piety nor Wit
Shall lure it back to cancel half a Line,
Nor all your Tears wash out a Word of it.
— Omar Khayyam

How are we to make sense of this paradox? The fact is that every decision, every choice you make, commits you and eliminates some other choices. We choose because without choosing we cannot claim the value in a single path among alternatives. If you want to quench your thirst then you must drink the water. It is that simple.

So I’ve found it amusing to see how Microsoft and their supporters constantly attack open source and open standards on the grounds that they reduce choice. For example, Microsoft’s lobbying arm, with the Orwellian doublespeak name “The Freedom to Innovate Network” lists this among its policy talking points:

[G]overnments should not freeze innovation by mandating use of specific technology standards

This talking point is picked up and repeated. Open Malaysia picks up on a local news article which quoted a Microsoft director speaking on Malaysia’s move toward favoring Free and Open Source Software (FOSS) in government procurements:

My opinion is that it [the policy] limits choice as the country has a software procurement preference policy

The Initiative For Software Choice is the latest face on the hundred-headed hydra spreading FUD around the world. However they have recently had the embarrassment of seeing an example of their handiwork leaked to the press which is worth a read in full.

This in itself is neither new nor news, but it just recently occurred to me that this is all just an abuse of language, with no substance behind it. When one adopts a technology standard one does it with some desired outcome in mind. One chooses this path in order to receive that benefit. Adopting a standard is like drinking a glass of water. You do it because you are thirsty.

A recent Danish report (the “Rambøll Report”) looked at the significant cost savings of moving the Danish government to OpenOffice/ODF compared to using MS Office with OOXML. Is it wrong to choose a less expensive alternative? Or is it better not to choose at all, and forgo the cost savings?

I think we need to all ask ourselves what we thirst for. Are you suffering from vendor lock-in? Are your documents tied to a single platform and vendor? Are you overpaying for software of which you use only a fraction of the functionality? Are you unable to move to a more robust desktop platform because your application vendor has tied its applications to a single platform? If you are thirsty, I have one word of advice: “Drink”.

Lingua franca, lingua exposita

2006/10/05 By Rob

Via Bob Sutor’s Open Blog, news that a French Government report is recommending that all government publications be made available in ODF format. It also encourages their European partners to do the same when exchanging documents.

More, from InfoWorld.