I’ve been hearing some rumblings from the north-west that the Ecma Office Open XML (OOXML) format has better performance characteristics than OpenDocument Format (ODF), specifically because OOXML uses shorter tag names. Putting aside for the moment the question of whether OOXML is in fact faster than ODF (something I happen not to believe), let’s take a look at this reasonable question: What effect does using longer, human-readable tags have on performance, compared to using terse, more cryptic names?
Obviously there are a number of variables at play here:
- What XML API are you using? DOM or SAX? The overhead of holding the entire document in memory at once would presumably cause DOM to suffer more from tag length than SAX.
- What XML parser implementation are you using? The use of internal symbol tables might make tag length less important or even irrelevant in some parsers.
- What language are you programming in? Some languages, like Java, have string interning features which can collapse all identical strings into a single instance.
- What size document are you working with? Document parsing has fixed overhead as well as overhead proportionate to document size. A very short document will be dominated by fixed costs.
So there may not be a single answer for all users with all tools in all situations.
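To make the symbol-table and string-interning points above concrete, here is a minimal Python sketch. The `SymbolTable` class, the `simulate` helper, and the sample tag name are all hypothetical and are not taken from any particular parser; the sketch only illustrates why, once names are interned, the storage cost of a long name is paid once per distinct name rather than once per occurrence.

```python
class SymbolTable:
    """Hypothetical parser-side symbol table: one stored string per distinct name."""

    def __init__(self):
        self._names = {}

    def intern(self, name):
        # Return the first-seen instance of an equal string; store it if new.
        return self._names.setdefault(name, name)

    def __len__(self):
        return len(self._names)

    def stored_chars(self):
        # Characters actually retained, independent of how often names recur.
        return sum(len(n) for n in self._names)


def simulate(tag, occurrences):
    """Intern `occurrences` equal copies of `tag`; return (table, nodes)."""
    table = SymbolTable()
    # Each occurrence arrives as an equal string; only the first is retained.
    nodes = [table.intern(tag[:1] + tag[1:]) for _ in range(occurrences)]
    return table, nodes


table, nodes = simulate("paragraph-properties", 1000)
print(len(table), table.stored_chars())
```

Under this assumption, a thousand occurrences of a twenty-character name cost the same name storage as a single occurrence; whether a given real parser behaves this way is exactly the kind of variable listed above.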
First, let’s talk a little about the tag length issue. It is important to note that the designer of an XML language has control over some, but not all names. For example take a namespace declaration:
The values of namespace URIs are typically predetermined and are often long in order to reduce the chance of accidental collisions. But the namespace prefix is usually chosen to be quite short, and is under the control of the application writing the XML; a specific prefix is typically not mandated by the language designer.
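As a rough illustration of the prefix point, the following Python sketch serializes the same element twice with different prefixes using the standard `xml.etree.ElementTree` module. The namespace URI, the element name, and the `serialize` helper are made up for this example; the only thing it is meant to show is that the writing application picks the prefix, and that the choice changes the byte count but not the meaning.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, made up for this illustration.
NS = "urn:example:names:document:1.0"


def serialize(prefix):
    """Serialize one empty element in NS using the given prefix."""
    # register_namespace sets the process-wide prefix used when serializing NS.
    ET.register_namespace(prefix, NS)
    root = ET.Element("{%s}paragraph" % NS)
    return ET.tostring(root, encoding="unicode")


short = serialize("t")
long_ = serialize("text")

# Both forms parse back to the same namespace-qualified element name.
same_meaning = ET.fromstring(short).tag == ET.fromstring(long_).tag
print(short)
print(long_)
print(same_meaning, len(long_) - len(short))
```

The long URI appears once per declaration regardless of the prefix, while the prefix is repeated on every element and attribute that uses it, which is why applications tend to keep it short.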
Element and attribute names can certainly be set by the language designer.
Attribute values may or may not be determined by the language designer. For example:
Here the name of the style may be determined by the template, or even directly by the user when entering a new named style. So the language designer and the application may have no control over the length of attribute values. Other attribute values may be fixed as part of the schema, and the length of those is controlled by the language designer.
Similarly, the length of character content is also typically determined by the user, since this is how free-form user content is entered, i.e., the text of the document.
Finally, note that the core XML markup for beginning and ending elements, delimiting attribute values, character entities, etc., is non-negotiable. You can’t eliminate it to save space.
Now for a little experiment. For the sake of this investigation, I decided to explore the performance of a DOM parse in Python 2.4 of a medium-sized document. The document I picked was a random 60-page document selected from Ecma TC45’s XML document library, which I converted from Microsoft’s binary DOC format into OOXML.
As many of you know, an OOXML document is actually multiple XML documents stored inside a Zip archive file. The main content is in a file called “document.xml” so I restricted my examination to that file.
So, how much overhead is there in our typical OOXML document? I wrote a little Python script to count up the size of all of the element and attribute names that appeared in the document. I counted only the characters which were controllable by the language designer. So w:pPr counts as three characters, counting only “pPr”, since the namespace and XML delimiters cannot be removed. “pPr” is what the Namespaces in XML specification calls an NCName, or non-colonized name: a name that contains no colon and therefore carries no namespace prefix. There were 51,800 NCNames in this document, accounting for 16% of the overall document size. The average NCName was 3.2 characters long.
For comparison, a comparably sized ODF document had an average NCName length of 7.7 characters, and NCNames represented 24% of the document size.
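The counting script itself is not reproduced here, so what follows is only a sketch of how such a count might be done in present-day Python, not the original code. It makes several assumptions: names are counted as they appear in the serialized text (so a start tag and its end tag count separately, and an empty-element tag counts once), `xmlns` declarations are excluded, and the input is well-formed with no DOCTYPE internal subset. The function names and the sample fragment are hypothetical.

```python
import re

# Comments, CDATA sections, processing instructions and declarations hold no countable names.
_SKIP = re.compile(r"<!--.*?-->|<!\[CDATA\[.*?\]\]>|<\?.*?\?>|<![^>]*>", re.S)
# A start, end, or empty-element tag: name, then a region that may hold quoted '>'.
_TAG = re.compile(r"""<(/?)([^\s<>/"'=]+)((?:[^<>"']|"[^"]*"|'[^']*')*)>""")
# One attribute inside a tag: name = quoted value.
_ATTR = re.compile(r"""([^\s<>/"'=]+)\s*=\s*(?:"[^"]*"|'[^']*')""")


def local_part(qname):
    """Drop the namespace prefix, if any: 'w:pPr' -> 'pPr'."""
    return qname.rpartition(":")[2]


def ncname_stats(xml_text):
    """Return (count, total_chars, average_length, share_of_document)."""
    stripped = _SKIP.sub("", xml_text)
    names = []
    for _closing, qname, attr_region in _TAG.findall(stripped):
        names.append(local_part(qname))
        for attr in _ATTR.findall(attr_region):
            # Namespace declarations are not element or attribute names proper.
            if attr == "xmlns" or attr.startswith("xmlns:"):
                continue
            names.append(local_part(attr))
    total = sum(len(n) for n in names)
    count = len(names)
    average = total / count if count else 0.0
    share = total / len(xml_text) if xml_text else 0.0
    return count, total, average, share


# Hypothetical fragment, not taken from the test document.
sample = '<w:p xmlns:w="urn:x"><w:pPr w:val="a>b"/><!-- <w:zzz/> --></w:p>'
count, total, average, share = ncname_stats(sample)
print(count, total, average, round(share, 3))
```

Counting on the raw text rather than on parser events is a deliberate choice in this sketch: it measures the bytes that actually sit in the file, which is what the percentage-of-document figure is about.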
So, ODF certainly uses longer names than OOXML. Personally I think this is a good thing from the perspective of readability, a concern of particular interest to the application developer. Machines will get faster, memory will get cheaper, bandwidth will increase and latency will decrease, but programmers will never get any smarter and schedules will never allow enough time to complete the project. Human evolution progresses at too slow a speed. So if I need to make a small trade-off between readability and performance, I usually favor readability. I can always tune the code to make it faster. But developers are at a permanent disadvantage if the language uses cryptic names. I can’t tune them.
But let’s see if there is really a trade-off to be made here at all. Let’s measure, not assume. Do longer names really hurt performance as Microsoft claims?
Here’s what I did. I took the original document.xml and expanded the NCNames for the most commonly used tags. Simple search and replace. First I doubled them in length. Then quadrupled. Then 8x longer. Then 16x and even 32x longer. I then timed 1,000 parses of these XML files, choosing the files at random to avoid any bias over time caused by memory fragmentation or whatever. The results are as follows:
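For readers who want to try something similar, here is a hedged sketch of the procedure in present-day Python using `xml.dom.minidom`. It is not the script used for the measurements in this post. The helper names, the sample document, and the choice to lengthen a name by repeating its local part are assumptions made for illustration, and the sketch assumes an expanded name does not collide with a name already present in the document.

```python
import random
import re
import time
from xml.dom import minidom


def expand_names(xml_text, qnames, factor):
    """Lengthen each element name in `qnames` by repeating its local part `factor` times."""
    out = xml_text
    for qname in qnames:
        prefix, colon, local = qname.rpartition(":")
        longer = prefix + colon + local * factor
        # Match the name only directly after '<' or '</' and before whitespace, '/' or '>'.
        pattern = re.compile(r"(</?)" + re.escape(qname) + r"(?=[\s/>])")
        out = pattern.sub(lambda m, longer=longer: m.group(1) + longer, out)
    return out


def time_parses(variants, total_parses=1000, seed=0):
    """DOM-parse randomly chosen variants; return {label: [seconds, ...]}."""
    rng = random.Random(seed)
    labels = list(variants)
    timings = {label: [] for label in labels}
    for _ in range(total_parses):
        # Random order spreads any drift over time evenly across the variants.
        label = rng.choice(labels)
        start = time.perf_counter()
        minidom.parseString(variants[label])
        timings[label].append(time.perf_counter() - start)
    return timings


# Hypothetical miniature document; a real run would load document.xml instead.
base = '<w:doc xmlns:w="urn:example"><w:p><w:r>text</w:r></w:p></w:doc>'
variants = {f: expand_names(base, ["w:p", "w:r"], f) for f in (1, 2, 4)}
timings = time_parses(variants, total_parses=30)
```

Each list in `timings` can then be averaged or plotted per expansion factor. On a document this small the fixed parsing overhead dominates, so meaningful numbers need a realistically sized file.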
| Expansion Factor | NCName Count | Total NCName Size (bytes) | File Size (bytes) | NCName Overhead | Average NCName Length (bytes) | Average Parse Time (seconds) |
If you like box-and-whisker plots (I sure do!) then here you go:

[Figure: box-and-whisker plot of the parse times]

What does this all mean? Even though we expanded some NCNames to 32 times their original length, making a 5x increase in the average NCName length, it made no significant difference in parse time. There is no discernible slowdown in parse time as the element and attribute names increase.
Keep in mind again that the typical ODF document shows an average NCName length of 7.7. The above tests dealt with lengths twice that amount, and still no slowdown.
“Myth Busted”. I return this topic to the spreaders of such FUD to substantiate their contrary claims.