
An Antic Disposition


Performance

ODF Performance Tip #1: Don’t re-compress images

2010/04/07 By Rob 4 Comments

ODFDOM is an open source (Apache 2.0) Java library for reading, writing and modifying ODF documents.  It runs standalone, not requiring OpenOffice.org or any other editor to be installed.  It operates directly on the ODF document itself.

One of the things we’re focusing on in the next release of ODFDOM (the 0.9 release) is optimizing performance: getting ODFDOM to read and write ODF documents as fast as possible, and with as low a memory footprint as possible. The aim is to make it optimal for concurrent use, say in a Java servlet.

Some of the things we’re finding as we profile ODFDOM are worth sharing, since they are not specific to this library. They are tips and techniques that are applicable more broadly, potentially to all applications that work with ODF documents. I’ll do a series of posts on these ideas. Hopefully you will find them useful, and maybe you can share your own tricks as well.

The first thing I’ll point out concerns documents with many image resources, such as large presentation files with a lot of graphics.  We found that writing these documents was rather slow.  The problem was in how the images were stored in the ZIP archive.  As you may know, ZIP allows a file to be compressed (most commonly using the DEFLATE algorithm).  Most ZIP libraries will, by default, compress every file you add to the archive.  However, for many common media types, like PNG and JPG images, the data has already been compressed, at the level of the image encoding.  So if you have your ZIP library try to compress the images a second time, you will typically waste time with very little incremental savings in storage.

Most ZIP libraries have an alternative way to store files in their original, uncompressed form, a method called STORE.  What we found in ODFDOM was that if we store images rather than compress them, the time needed to save our large presentation was reduced by 20%, while the size of the archive increased only 3%.  So this was a good trade-off.

I think this technique would be applicable to other libraries and editors.
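
To make the idea concrete, here is a minimal sketch of the technique in Python, using the standard zipfile module rather than ODFDOM itself. The package name, extension list and byte strings are just placeholders, and the output is not a complete ODF package; it only demonstrates the per-entry STORE-versus-DEFLATE choice.

import zipfile

# File types that are already compressed by their own codec; running DEFLATE
# over them again mostly burns CPU time for very little size benefit.
PRECOMPRESSED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".gif"}

def add_part(zf, name, data):
    """Write one package part, storing pre-compressed media without DEFLATE."""
    dot = name.rfind(".")
    ext = name[dot:].lower() if dot != -1 else ""
    method = zipfile.ZIP_STORED if ext in PRECOMPRESSED_EXTENSIONS else zipfile.ZIP_DEFLATED
    zf.writestr(name, data, compress_type=method)

if __name__ == "__main__":
    # "example.odp" and the byte strings are placeholders for illustration only.
    with zipfile.ZipFile("example.odp", "w") as zf:
        add_part(zf, "content.xml", b"<office:document-content/>")        # deflated
        add_part(zf, "Pictures/photo.jpg", b"\xff\xd8\xff\xe0 fake jpeg")  # stored as-is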


Filed Under: ODF, Performance

Spreadsheet file format performance

2008/05/13 By Rob 19 Comments

I’ve been doing some performance timings of file format support, comparing MS Office and OpenOffice. Most of the results are as expected, but some are surprising, and one in particular is quite disappointing.

But first, a few details of my setup. All timings, done by stopwatch, were from Office 2003 and OpenOffice 2.4.0 running on Windows XP, with all current service packs and patches. The machine is a Lenovo T60p, dual-core Intel 2.16 GHz with 2 GB of RAM. I took all the standard precautions: the disk was defragmented, and the test files were confirmed as defragmented using contig. No other applications were running, and background tasks were all shut down.

For test files, I went back to an old favorite, George Ou’s (at the time with ZDNet) monster 50MB XLS file from his series of tests back in 2005. This file, although very large, is very simple. There are no formulas, and indeed no formatting or styles. It is just text and numbers, treating a spreadsheet like a giant data table. So tests of this file will emphasize the raw throughput of the applications. Real-world spreadsheets will typically be worse than this, due to the additional overhead of processing styles, formulas, etc.

A test of a single file is not really that interesting. We want to see trends, see patterns. So I made a set of variations on George’s original file, converting it into ODF, XLS and OOXML formats, as well as making scaled-down versions of it. In total I made 12 different-sized subsets of the original file, ranging down to a 437KB version, and created each file in all three formats. I then tested how long it took to load each file in each of the applications. In the case of MS Office, I installed the current versions of the translators for those formats: the Compatibility Pack for OOXML support, and the ODF Add-in for ODF support.

I find it convenient to report numbers per 100,000 spreadsheet cells. You could equally well normalize by the original XLS file size, the number of rows of data, or any other correlated variable, but values per 100K cells are simple for anyone to understand.

I’ll spare you all the pretty pictures. If you want to make some, here is the raw data (CSV format). But I will give some summary observations.

For document sizes, the results are as follows:

  • Binary XLS format = 1,503 KB per 100K cells
  • OOXML format = 491 KB per 100K cells
  • ODF format = 117 KB per 100K cells

So the XML formats are far smaller than the legacy binary format. This is due to the added Zip compression that both XML formats use. Also, note that the ODF files are significantly smaller than the OOXML files, less than 1/4 the size on average. Upon further examination, the XML document representing the ODF content is larger than the corresponding XML in OOXML, as expected, due to its use of longer, more descriptive markup tags. However the ODF XML compresses far better than the OOXML version, enough to overcome its greater verbosity and result in files smaller than OOXML. The compression ratio (original/zipped) for ODF’s content.xml is 87, whereas the compression ratio for OOXML’s sheet1.xml is only 12. We could just mumble something about entropy and walk away, but I think this area could bear further investigation.
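
If you want to poke at the numbers yourself, the per-entry compression ratios can be read straight out of the archive directory. Here is a small sketch in Python; the file names passed on the command line are whatever test files you have on hand.

import sys
import zipfile

# Print per-entry compression ratios (uncompressed / compressed) for any
# ODF or OOXML package named on the command line, e.g.:
#   python ratios.py test.ods test.xlsx
for path in sys.argv[1:]:
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            if info.compress_size:  # skip empty entries to avoid dividing by zero
                ratio = info.file_size / float(info.compress_size)
                print("%-12s %-24s %9d -> %9d bytes  (%.0fx)"
                      % (path, info.filename, info.file_size, info.compress_size, ratio))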

Any ideas?

For load time, the times for processing the binary XLS files were:

  • Microsoft Office 2003 = 0.03 seconds per 100K cells
  • OpenOffice 2.4.0 = 0.4 seconds per 100K cells

Not too surprising. These binary formats are optimized for the guts of MS Office. We would expect them to load faster in their native application.

So what about the new XML formats? There has been recent talk about the “Angle Bracket Tax” for XML formats. How bad is it?

  • Microsoft Office 2003 with OOXML = 1.5 seconds per 100K cells
  • OpenOffice 2.4.0 with ODF = 2.7 seconds per 100K cells

For typically sized documents, you probably will not notice the difference. However, with the largest documents, like the 16-page, 3-million-cell monster sheet, the OOXML document took 40 seconds to load in MS Office and the ODF sheet took 90 seconds to load in OpenOffice, whereas the XLS binary took less than 2 seconds to load in MS Office.

OK, so what are we missing? Ah, yes: ODF format in MS Office, using their ODF Add-in.

  • Microsoft Office 2003 with ODF, using the ODF Add-in = 74.6 seconds per 100K cells

Yup. You read that right. To put this in perspective, let’s look at a single test file, a 600K-cell file, as we load it in the various formats and editors:

  • Microsoft Office 2003 in XLS format = 0.75 seconds
  • OpenOffice 2.4.0 in XLS format = 3.03 seconds
  • Microsoft Office 2003 in OOXML format = 8.28 seconds
  • OpenOffice 2.4.0 in ODF format = 14.09 seconds
  • Microsoft Office 2003 in ODF format = 515.60 seconds

Can someone explain to me why Microsoft Office needs almost 10 minutes to load an ODF file that OpenOffice can load in 14 seconds?

(I was not able to test files larger than this using the ODF Add-in since they all crashed.)

(Update: Since it is the question everyone wants to know, the beta version of OpenOffice 3.0 opens the OOXML version of that file in 49.4 seconds and Sun’s ODF Plugin for Microsoft Office loads this file in 30.03 seconds.)

This is one reason why I think file format translation is a poor engineering approach to interoperability. When OpenOffice wants to read a legacy XLS file, it does not approach the problem by translating the XLS into an ODF document and then loading the ODF file. Instead, it simply loads the XLS file, via a file filter, into the internal memory model of OpenOffice.

What is a file filter? It is like half of a translator. Instead of translating from one disk format to another disk format, it simply loads the disk format and maps it into an application-specific memory model that the application logic can operate on directly. This is far more efficient than translation. This is the untold truth that the layperson does not know. But this is how everyone does it. That is how we support formats in SmartSuite. That is how OpenOffice does it. And that is how MS Office does it for the file formats they care about. In fact, that is the way Novell is doing it now, since they discovered that the Microsoft approach is doomed to performance hell.

So it is with some amusement that I watch Microsoft and others propose translation as a solution to interoperability, creating reports about translation, even a proposal for a new work item in JTC1/SC34 concerning file format translation, when the single concrete attempt at translation is such an abysmal failure. It may look great on paper, but it is an engineering disaster. What customers need is direct, internal support for ODF in MS Office, via native code, in a file filter, not a translator that takes 10 minutes to load a file.

The astute engineer will agree with the above, but will also feel some discomfort at the numbers. There is more here than can be explained simply by the use of translators versus import filters. That choice might explain a 2x difference in performance. A particularly poor implementation might explain a 5x difference. But none of this explains why MS Office is almost 40x slower in processing ODF files. Being that much slower is hard to do accidentally. Other forces must be at play.

Any ideas?


Filed Under: ODF, OOXML, Performance

Why is OOXML Slow?

2006/10/19 By Rob 5 Comments

Of course, one could simply dismiss this question, saying that a specification for an XML vocabulary does not have performance as such, since a specification cannot be executed. However, the choices one makes in designing an XML language will impact the performance of the applications that work with the format. For example, both ODF and OOXML store their XML in compressed Zip files. This will cause reading and writing of a document to be faster in cases where memory is plentiful and computation is much faster than storage and retrieval, which is to say on most modern desktops. But this same scheme may be slower in other environments, say PDAs. In the end, the performance characteristics of a format cannot be divorced from operational profile and environmental assumptions.

When comparing formats, it is important to isolate the effects of the format from the effects of the application. This is important from an analysis standpoint, but also for legal reasons. Remember that the only implementation of (draft) OOXML is (beta) Office 2007, and its End User License Agreement (EULA) has this language:

7. SCOPE OF LICENSE. …You may not disclose the results of any benchmark tests of the software to any third party without Microsoft’s prior written approval

So let’s see what I can do while playing within those bounds. I started with a sample of 176 documents, randomly selected from Ecma TC45’s document library. I’m hoping therefore that Microsoft will be less likely to argue that these are not typical. These documents are all in the legacy binary DOC format and include agendas, meeting minutes, drafts of various portions of the specification, etc.

Some basic statistics on this set of documents:

Min length = 1 page
Mode = 2 pages
Median length = 7 pages
Mean length = 34 pages
Max length = 409 pages

Min file size = 27,140 bytes
Median file size = 159,000 bytes
Mean file size = 749,000 bytes
Max file size = 15,870,000 bytes

So rather than pick a single document and claim that it reflects the whole, I looked at a wide range of document sizes in use within a specific document-centric organization.

I converted each document into ODF format as well as OOXML, using OpenOffice 2.0.3 and Office 2007 beta 2 respectively. As has been noted before, both ODF and OOXML formats are XML inside of a Zip archive. The compression from the zipping not only counters the expansion factor of the XML, but in fact results in files which are smaller than the original DOC files. The average OOXML document was 50% the size of the original DOC file, and the average ODF document was 38% the size of the DOC. So the net result is that the ODF documents came out smaller, averaging 72% of the size of their OOXML equivalents.

A quick sanity check of this result is easy to perform. Create an empty file in Word in OOXML format, and an empty file in OpenOffice in ODF format. Save both. The OOXML file ends up being 10,001 bytes, while the ODF file is only 6,888 bytes, or 69% of the OOXML file.

Here is a histogram of the ODF/OOXML size ratios for the sampled files. As you can see, there is a wide range of behaviors here, with some files even ending up larger in ODF format. But on average the ODF files were smaller.


What about the contents of the Zip archives? The OOXML documents tended to contain more XML files (on average 6 more) than the parallel ODF documents, but these XML files were individually smaller, averaging 32,080 bytes versus 66,490 bytes for ODF. However, the net effect is that the average total size of the XML in an OOXML document is greater than in the ODF equivalent (684,856 bytes versus 401,406 bytes).

Here’s part 2 of the experiment. The premise is that many (perhaps most) tools that deal with these formats will need to read and parse all of the XML files within the archive. So a core part of the performance that these apps will share is how long it takes to unzip and parse these XML files. Of course, this is only part of the performance story. What the application does with the parsed data is also critical, but that is application-dependent and hard to generalize. But the basic overhead of parsing is universal.

To test this out, I wrote a Python script to time how long it takes to unzip and parse (with Python 2.4’s minidom) all the XML files in these 176 documents. I repeated each measurement 10 times and averaged. And I did this for both the OOXML and the ODF variants.
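
For anyone who wants to reproduce this kind of measurement, here is a rough modern re-sketch of such a script, not the original (which ran on Python 2.4); the corpus directory name is a placeholder.

import glob
import time
import zipfile
from xml.dom import minidom

REPEATS = 10  # repeat each measurement and average, as described above

def parse_all_xml(path):
    """Unzip a document package and DOM-parse every XML part inside it."""
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.endswith(".xml") or name.endswith(".rels"):
                minidom.parseString(zf.read(name))

def average_parse_time(path):
    start = time.perf_counter()
    for _ in range(REPEATS):
        parse_all_xml(path)
    return (time.perf_counter() - start) / REPEATS

if __name__ == "__main__":
    # "corpus/" is a placeholder directory holding the converted test files.
    for path in sorted(glob.glob("corpus/*.odt") + glob.glob("corpus/*.docx")):
        print("%-50s %7.3f s" % (path, average_parse_time(path)))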

The results indicate that the ODF documents were parsed, on average, 3.6x faster than the equivalent OOXML files. Here is a plot showing the ratio of OOXML parse time to ODF parse time as a function of page count. As you can see, there is a wide variation in this ratio, especially with shorter documents. In some cases the OOXML document took 8x or more longer to parse than the equivalent ODF document. But with longer documents the variation settles out and converges on the 3.6x factor mentioned above.

Now how do we explain this? A rough model of XML parsing performance is that it has a fixed overhead to start up, initialize data structures, parse tables, etc., and then some incremental cost dependent on the size and complexity of the XML document. Most systems in the world work like this, fixed overhead plus incremental cost per unit of work. This is true whether we’re talking about XML parsing, HTTP transfers, cutting the lawn or giving blood at a blood bank. The general insight into these systems is that where the fixed overhead is significant, you want to batch up your work. Doing many small transactions will kill performance.

So one theory is that OOXML is slower because of the cost of initializing more XML parses. But it could also be because the aggregate size of the XML files is larger. More testing would be required to gauge the relative contribution of these two factors. However, one thing is clear. Although this test was done with minidom in Python, the results are of wide applicability. I can think of no platform and no XML parser for which a larger document comprised of more XML files would be faster than a smaller document made up of fewer XML files. Parsing ODF word processing documents should be faster than parsing their OOXML versions everywhere.
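
One way to gauge the relative contribution of those two factors would be an ordinary least-squares fit of parse time against the number of XML parts and the total XML bytes per document. Here is a sketch; the numbers in the arrays are placeholders for illustration, not measurements from the tests above.

import numpy as np

# Placeholder numbers for illustration only, not measurements from the tests
# above: per-document XML part counts, total uncompressed XML bytes, and
# observed parse times in seconds.
num_parts   = np.array([ 9.0, 12.0, 15.0, 11.0, 14.0, 18.0])
total_bytes = np.array([4.0e5, 6.8e5, 9.1e5, 5.2e5, 7.7e5, 1.2e6])
parse_time  = np.array([0.21, 0.34, 0.45, 0.27, 0.39, 0.58])

# Model: time ~ a*(number of parses) + b*(bytes parsed) + c, so that 'a'
# estimates the fixed start-up cost per parse and 'b' the incremental cost
# per byte of XML.
X = np.column_stack([num_parts, total_bytes, np.ones_like(num_parts)])
(a, b, c), *_ = np.linalg.lstsq(X, parse_time, rcond=None)
print("fixed cost per parse: %.4f s, cost per byte: %.2e s" % (a, b))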

I’m not the first one to notice some of these differences. Rick Jelliffe did some analysis of the differences between OOXML and ODF back in August. He approached it from a code complexity view, but in passing noted that the same word processor document loaded faster in ODF format in OpenOffice than in OOXML format in Office 2007 beta. On the complexity side, he noted that the ODF markup was more complex than the parallel OOXML document. So if ODF is more complex but also smaller, this may amount to higher information density, compactness of expression, etc., and that could certainly be a factor in performance.

So what’s your theory? Why do you think ODF word processing documents are faster than their OOXML equivalents?


Filed Under: ODF, OOXML, Performance, XML

The Celerity of Verbosity

2006/10/17 By Rob 16 Comments

I’ve been hearing some rumblings from the north-west that Ecma Office Open XML (OOXML) format has better performance characteristics than OpenDocument Format (ODF), specifically because OOXML uses shorter tag names. Putting aside for the moment the question of whether OOXML is in fact faster than ODF (something I happen not to believe), let’s take a look at this reasonable question: What effect does using longer, humanly readable tags have on performance compared to using more cryptic terse names?

Obviously there are a number of variables at play here:

  • What XML API are you using? DOM or SAX? The overhead of holding the entire document in memory at once would presumably cause DOM to suffer more from tag length than SAX.
  • What XML parser implementation are you using? The use of internal symbol tables might make tag length less important or even irrelevant in some parsers.
  • What language are you programming in? Some languages, like Java, have string interning features which can conflate all identical strings into a single instance.
  • What size document are you working with? Document parsing has fixed overhead as well as overhead proportionate to document size. A very short document will be dominated by fixed costs.

So there may not be a single answer for all users with all tools in all situations.

First, let’s talk a little about the tag length issue. It is important to note that the designer of an XML language has control over some, but not all, names. For example, take a namespace declaration:

xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"

The values of namespace URIs are typically predetermined and are often long in order to reduce the chance of accidental collisions. But the namespace prefix is usually chosen to be quite short, and is under the control of the application writing the XML, though a specific prefix is typically not mandated by the language designer.

Element and attribute names can certainly be set by the language designer.

Attribute values may or may not be determined by the language designer. For example:

val="Heading1"

Here the name of the style may be determined by the template, or even directly by the user if he is entering a new named style. So the language designer and the application may have no control over the length of attribute values. Other attribute values may be fixed as part of the schema, and the length of those are controlled by the language designer.

Similarly, the length of character content is also typically determined by the user, since this is typically how free-form user content is entered, i.e., the text of the document.

Finally, note that the core XML markup for beginning and ending elements, delimiting attribute values, character entities etc., are all non-negotiable. You can’t eliminate them to save space.

Now for a little experiment. For the sake of this investigation, I decided to explore the performance of a DOM parse, in Python 2.4, of a medium-sized document. The document I picked was a random 60-page document selected from Ecma TC45’s XML document library, which I converted from Microsoft’s binary DOC format into OOXML.

As many of you know, an OOXML document is actually multiple XML documents stored inside a Zip archive file. The main content is in a file called “document.xml” so I restricted my examination to that file.

So, how much overhead is there in our typical OOXML document? I wrote a little Python script to count up the size of all of the element names and attribute names that appeared in the document. I counted only the characters which were controllable by the language designer. So w:pPr counts as three characters, counting only “pPr”, since the namespace prefix and XML delimiters cannot be removed. “pPr” is what the XML Namespaces specification calls an NCName (a non-colonized name), i.e., a name with no namespace prefix qualifying it. There were 51,800 NCNames in this document, accounting for 16% of the overall document size. The average NCName was 3.2 characters long.

For comparison, a comparably sized ODF document had an average NCName length of 7.7, and NCNames represented 24% of the document size.
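
For reference, here is a rough re-sketch of that kind of counting script, not the original (whose exact counting rules may have differed); it tallies element and attribute local names in document.xml.

import xml.etree.ElementTree as ET

def local_name(name):
    """Strip any {namespace} prefix; only the local NCName is designer-chosen."""
    return name.rsplit("}", 1)[-1]

def ncname_stats(path):
    """Count element and attribute NCNames in one XML part, plus their total length."""
    count, total_chars = 0, 0
    for _, elem in ET.iterparse(path, events=("start",)):
        names = [local_name(elem.tag)] + [local_name(a) for a in elem.attrib]
        count += len(names)
        total_chars += sum(len(n) for n in names)
    return count, total_chars

if __name__ == "__main__":
    count, chars = ncname_stats("document.xml")
    print("NCNames: %d  total characters: %d  average length: %.1f"
          % (count, chars, chars / float(count)))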

So, ODF certainly uses longer names than OOXML. Personally, I think this is a good thing from the perspective of readability, a concern of particular interest to the application developer. Machines will get faster, memory will get cheaper, bandwidth will increase and latency will decrease, but programmers will never get any smarter and schedules will never allow enough time to complete the project. Human evolution progresses too slowly. So if I need to make a small trade-off between readability and performance, I usually favor readability. I can always tune the code to make it faster. But developers are at a permanent disadvantage if the language uses cryptic names. I can’t tune them.

But let’s see if there is really a trade-off to be made here at all. Let’s measure, not assume. Do longer names really hurt performance as Microsoft claims?

Here’s what I did. I took the original document.xml and expanded the NCNames for the most commonly-used tags. Simple search and replace. First I doubled them in length. Then quadrupled. Then 8x longer. Then 16x and even 32x longer. I then timed 1,000 parses of these XML files, choosing the files at random to avoid any bias over time caused by memory fragmentation or whatever. The results are as follows:

Expansion Factor  NCName Count  Total NCName Size (bytes)  File Size (bytes)  NCName Overhead  Average NCName Length (bytes)  Average Parse Time (seconds)
1 (original)      51,800        166,898                    1,036,393          16%              3.2                            3.3
2                 51,800        187,244                    1,056,739          18%              3.6                            3.2
4                 51,800        227,936                    1,097,443          21%              4.4                            3.2
8                 51,800        309,320                    1,178,827          26%              6.0                            3.2
16                51,800        472,088                    1,341,595          35%              9.1                            3.3
32                51,800        797,624                    1,667,131          48%              15.4                           3.3
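
For anyone who wants to repeat the experiment, here is a rough sketch of the expand-and-time procedure described above; the tag list is a placeholder, and the simple regex-based renaming stands in for the search and replace step.

import random
import re
import time
from xml.dom import minidom

# A placeholder list of frequently used OOXML local names to lengthen; the
# original experiment chose its own set of common tags.
COMMON_NAMES = ["pPr", "rPr", "val", "p", "r", "t"]
FACTORS = [1, 2, 4, 8, 16, 32]
RUNS_PER_VARIANT = 1000

def expand_ncnames(xml_text, factor):
    """Lengthen the selected NCNames by repetition, e.g. pPr -> pPrpPr at 2x."""
    for name in COMMON_NAMES:
        xml_text = re.sub(r"(?<=w:)%s\b" % re.escape(name), name * factor, xml_text)
    return xml_text

def main():
    original = open("document.xml", encoding="utf-8").read()
    variants = {f: expand_ncnames(original, f) for f in FACTORS}

    # Interleave the runs at random so drift over time cannot bias one variant.
    schedule = [f for f in FACTORS for _ in range(RUNS_PER_VARIANT)]
    random.shuffle(schedule)
    totals = {f: 0.0 for f in FACTORS}
    for f in schedule:
        start = time.perf_counter()
        minidom.parseString(variants[f])
        totals[f] += time.perf_counter() - start

    for f in FACTORS:
        print("expansion %2dx: average parse time %.3f s" % (f, totals[f] / RUNS_PER_VARIANT))

if __name__ == "__main__":
    main()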

If you like box-and-whisker plots (I sure do!) then here you go.

What does this all mean? Even though we expanded some NCNames to 32 times their original length, making a 5x increase in the average NCName length, it made no significant difference in parse time. There is no discernible slowdown as the element and attribute names increase in length.

Keep in mind again that typical ODF documents show an average NCName length of 7.7. The above tests dealt with lengths twice that amount, and still showed no slowdown.

“Myth Busted.” I return this topic to the spreaders of such FUD to substantiate their contrary claims.


Filed Under: ODF, OOXML, Performance, XML

