XML

Of course, one could simply dismiss this question, saying that a specification for an XML vocabulary does not have performance as such, since a specification cannot be executed. However, the choices one makes in designing an XML language will impact the performance of the applications that work with the format. For example, both ODF and OOXML store their XML in compressed Zip files. This will cause reading and writing of a document to be faster in cases where memory is plentiful and computation is much faster than storage and retrieval, which is to say on most modern desktops. But this same scheme may be slower in other environments, say PDA’s. In the end, the performance characteristics of a format cannot be divorced from operational profile and environmental assumptions.

When comparing formats, it is important to isolate the effects of the format versus the application. This is important from the analysis standpoint, but also for legal reasons. Remember that the only implementation of (draft) OOXML is (beta) Office 2007, and the End User Licence Agreement (EULA) has this language:

7. SCOPE OF LICENSE. …You may not disclose the results of any benchmark tests of the software to any third party without Microsoft’s prior written approval

So let’s see what I can do while playing within those bounds. I started with a sample of 176 documents, randomly selected from the Ecma TC45’s document library. I’m hoping therefore that Microsoft will be less likely to argue that these are not typical. These documents are all in the legacy binary DOC format and include agendas, meeting minutes, drafts of various portions of the specification, etc.

Some basic statistics on this set of documents:

Min length = 1 page
Mode = 2 pages
Median length = 7 pages
Mean length = 34 pages
Max length = 409 pages

Min file size= 27,140 bytes
Median file size= 159,000 bytes
Mean file size= 749,000 bytes
Max file size= 15,870,000 bytes

So rather than pick a single document and claim that it reflects the whole, I looked at a wide range of document sizes in use within a specific document-centric organization.

I converted each document into ODF format as well as OOXML, using OpenOffice 2.03 and Office 2007 beta 2 respectively. As has been noted before, both ODF and OOXML formats are XML inside of a Zip archive. The compression from the zipping not only counters the expansion factor of the XML, but in fact results in files which are smaller than the original DOC files. The average OOXML document was 50% the size of the original DOC file, and the average ODF document was 38% the size of the DOC. So net result is that the ODF documents came out smaller, averaging 72% of their OOXML equivalents.

A quick sanity check of this result is easy to perform. Create an empty file in Word in OOXML format, and an empty file in OpenOffice in ODF format. Save both. The OOXML file ends up being 10,001 bytes, while the ODF file is only 6,888 bytes, or 69% of the OOXML file.

Here is a histogram of the ODF/OOXML size ratios for the sampled files. As you can see, there is a wide range of behaviors here, with some files even ending up larger in ODF format. But on average the ODF files were smaller.

What about the contents of the Zip archives? The OOXML documents tended to contain more XML files (on average 6 more) than the parallel ODF document, but these XML files were individually smaller, average 32,080 bytes versus 66,490 for ODF. However the net effect is that the average total size of the XML in the OOXML is greater than in ODF (684,856 bytes versus 401,406 bytes).

Here’s part 2 of the experiment. The proposal is that many (perhaps most) tools that deal with these formats will need to read and parse all of the XML files within the archive. So a core part of performance that these apps will share is how long it takes to unzip and parse these XML files. Of course this is only part of the performance story. What the application does with the parsed data is also critical, but that is application-dependent and hard to generalize. But the basic overhead of parsing is universal.

To test this out wrote a Python script to time how long it takes to unzip and parse (Python 2.4 minidom) all the XML’s in these 176 documents. I repeated each measurement 10 times and averaged. And I did this for both the OOXML and the ODF variants.

The results indicate that the ODF documents were parsed, on average 3.6x faster than the equivalent OOXML files. Here is a plot showing the ratio of OOXML parse time to ODF parse time as a function of page size:As you can see there is a wide variation in this ratio, especially with shorter documents. In some case the OOXML document took 8x or more longer time to parse than the equivelant ODF document. But with longer documents the variation settles out and settles on the 3.6x factor mentioned

Now how do we explain this? A rough model of XML parsing performance is that it has a fixed overhead to start up, initialize data structures, parse tables, etc., and then some incremental cost dependent on the size and complexity of the XML document. Most systems in the world work like this, fixed overhead plus incremental cost per unit of work. This is true whether we’re talking about XML parsing, HTTP transfers, cutting the lawn or giving blood at a blood bank. The general insight into these systems is that where the fixed overhead is significant, you want to batch up your work. Doing many small transactions will kill performance.

So one theory is that OOXML is slower because of the cost of initializing more XML parses. But it could also be because the aggregate size of the XML files are larger. More testing would be required to gauge the relative contribution of these two factors. However one thing is clear. Although this test was done with minidom on Python, the results are of wide applicability. I can think of no platform and no XML parser for which a larger document comprised of more XML files would be faster than a smaller document made up of fewer XML files. Parsing ODF word processing documents should be faster than OOXML versions everywhere.

I’m not the first one to notice some of these difference. Rick Jelliffe did some analysis of the differences between OOXML and ODF back in August. He approached it from a code complexity view, but in passing noted that the same word processor document loaded faster in ODF format in OpenOffice compared to the same document in OOXML format in Office 2007 beta. On the complexity side he noted that the ODF markup was more complex than the parallel OOXML document. So if ODF is more complex but also smaller, this may amount to higher information density, compactness of expression, etc., and that could certainly be a factor in performance.

So what’s your theory? Why do you think ODF word processing documents are faster than OOXML’s?

Expansion Factor	NCName Count	Total NCName Size (bytes)	File size (bytes)	NCName Overhead	Average NCName Length (bytes)	Average Parse Time (seconds)
1 (original)	51,800	166,898	1,036,393	16%	3.2	3.3
2	51,800	187,244	1,056,739	18%	3.6	3.2
4	51,800	227,936	1,097,443	21%	4.4	3.2
8	51,800	309,320	1,178,827	26%	6.0	3.2
16	51,800	472,088	1,341,595	35%	9.1	3.3
32	51,800	797,624	1,667,131	48%	15.4	3.3

Expansion Factor

NCName Count

Total NCName Size (bytes)

File size (bytes)

NCName Overhead

Average NCName Length (bytes)

Average Parse Time (seconds)

1 (original)

51,800

166,898

1,036,393

16%

3.2

3.3

51,800

187,244

1,056,739

18%

3.6

3.2

51,800

227,936

1,097,443

21%

4.4

3.2

51,800

309,320

1,178,827

26%

6.0

3.2

51,800

472,088

1,341,595

35%

9.1

3.3

51,800

797,624

1,667,131

48%

15.4

3.3

XML

Why is OOXML Slow?

The Celerity of Verbosity