I’ve been doing some performance timings of file format support, comparing MS Office and OpenOffice. Most of the results are as expected, but some are surprising, and one in particular is quite disappointing.
But first, a few details of my setup. All timings, done by stopwatch, were from Office 2003 and OpenOffice 2.4.0 running on Windows XP, with all current service packs and patches. The machine is a Lenovo T60p, a dual-core 2.16 GHz Intel with 2 GB of RAM. I took all the standard precautions — the disk was defragmented, and the test files were confirmed as defragmented using contig. No other applications were running and background tasks were all shut down.
For test files, I went back to an old favorite, George Ou’s (at the time with ZDNet) monster 50MB XLS file from his series of tests back in 2005. This file, although very large, is very simple. There are no formulas, indeed no formatting or styles. It is just text and numbers, treating a spreadsheet like a giant data table. So tests of this file will emphasize the raw throughput of the applications. Real-world spreadsheets will typically do worse than this due to the additional overhead of processing styles, formulas, etc.
A test of a single file is not really that interesting. We want to see trends, see patterns. So I made a set of variations on George’s original file, converting it into ODF, XLS and OOXML formats, as well as making scaled-down versions of it. In total I made 12 different-sized subsets of the original file, ranging down to a 437KB version, and created each file in all three formats. I then tested how long it took to load each file in each of the applications. In the case of MS Office, I installed the current versions of the translators for those formats: the Compatibility Pack for OOXML, and the ODF Add-in for ODF support.
I find it convenient to report numbers per 100,000 spreadsheet cells. You could equally well normalize by the original XLS file size, the number of rows of data, or any other correlated variable, but values per 100K cells are simple for anyone to understand.
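To be explicit, the normalization is nothing fancier than this (a trivial sketch; the function name is just something I made up here):

```python
def per_100k(measurement, cell_count):
    """Normalize a per-file measurement (KB, seconds, ...) to per 100,000 cells."""
    return measurement / cell_count * 100_000

# e.g. a hypothetical 600,000-cell file that loads in 3 seconds:
# per_100k(3.0, 600_000)  ->  0.5 seconds per 100K cells
```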
I’ll spare you all the pretty pictures. If you want to make some, here is the raw data (CSV format). But I will give some summary observations.
For document sizes, the results are as follows:
- Binary XLS format = 1,503 KB per 100K cells
- OOXML format = 491 KB per 100K cells
- ODF format = 117 KB per 100K cells
So the XML formats are far smaller than the legacy binary format. This is due to the added Zip compression that both XML formats use. Also, note that the ODF files are significantly smaller than the OOXML files, less than 1/4 the size on average. Upon further examination, the XML document representing the ODF content is larger than the corresponding XML in OOXML, as expected, due to its use of longer, more descriptive markup tags. However the ODF XML compresses far better than the OOXML version, enough to overcome its greater verbosity and result in files smaller than OOXML. The compression ratio (original/zipped) for ODF’s content.xml is 87, whereas the compression ratio for OOXML’s sheet1.xml is only 12. We could just mumble something about entropy and walk away, but I think this area could bear further investigation.
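If you want to check those ratios on your own files, here is a rough sketch using Python’s standard zipfile module. The file names below are placeholders, and while content.xml and xl/worksheets/sheet1.xml are the usual entry names for ODF and OOXML spreadsheets, confirm them against your own packages:

```python
import zipfile

def entry_ratio(package_path, entry_name):
    """Return (uncompressed, compressed, ratio) for one entry in a Zip package."""
    with zipfile.ZipFile(package_path) as z:
        info = z.getinfo(entry_name)
        return info.file_size, info.compress_size, info.file_size / info.compress_size

# Placeholder file names -- substitute your own test files.
for path, entry in [("monster.ods", "content.xml"),
                    ("monster.xlsx", "xl/worksheets/sheet1.xml")]:
    raw, packed, ratio = entry_ratio(path, entry)
    print(f"{path} : {entry}  {raw:,} -> {packed:,} bytes  (ratio {ratio:.1f})")
```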
Any ideas?
For load times, the binary XLS files came in at:
- Microsoft Office 2003 = 0.03 seconds per 100K cells
- OpenOffice 2.4.0 = 0.4 seconds per 100K cells
Not too surprising. These binary formats are optimized for the guts of MS Office. We would expect them to load faster in their native application.
So what about the new XML formats? There has been recent talk about the “Angle Bracket Tax” for XML formats. How bad is it?
- Microsoft Office 2003 with OOXML = 1.5 seconds per 100K cells
- OpenOffice 2.4.0 with ODF = 2.7 seconds per 100K cells
For typical-sized documents, you probably will not notice the difference. However, with the largest documents, like the 16-page, 3-million-cell monster sheet, the OOXML document took 40 seconds to load in Office, and the ODF sheet took 90 seconds to load in OpenOffice, whereas the XLS binary took less than 2 seconds to load in MS Office.
OK. So what are we missing? Ah, yes: ODF format in MS Office, using their ODF Add-in.
- Microsoft Office 2003 with ODF, using the ODF Add-in = 74.6 seconds per 100K cells
Yup. You read that right. To put this in perspective, let’s look at a single test file, a 600K-cell file, as we load it in the various formats and editors:
- Microsoft Office 2003 in XLS format = 0.75 seconds
- OpenOffice 2.4.0 in XLS format = 3.03 seconds
- Microsoft Office 2003 in OOXML format = 8.28 seconds
- OpenOffice 2.4.0 in ODF format = 14.09 seconds
- Microsoft Office 2003 in ODF format = 515.60 seconds
Can someone explain to me why Microsoft Office needs almost 10 minutes to load an ODF file that OpenOffice can load in 14 seconds?
(I was not able to test files larger than this using the ODF Add-in since they all crashed.)
(Update: Since it is the question everyone wants to know, the beta version of OpenOffice 3.0 opens the OOXML version of that file in 49.4 seconds and Sun’s ODF Plugin for Microsoft Office loads this file in 30.03 seconds.)
This is one reason why I think file format translation is a poor engineering approach to interoperability. When OpenOffice wants to read a legacy XLS file, it does not approach the problem by translating the XLS into an ODF document and then loading the ODF file. Instead, it simply loads the XLS file, via a file filter, into OpenOffice’s internal memory model.
What is a file filter? It is like half of a translator. Instead of translating from one disk format to another disk format, it simply loads the disk format and maps it into an application-specific memory model that the application logic can operate on directly. This is far more efficient than translation. It is a truth the layperson rarely hears, but it is how everyone does it. That is how we support formats in SmartSuite. That is how OpenOffice does it. And that is how MS Office does it for the file formats they care about. In fact, that is the way Novell is now doing it, since they discovered that the Microsoft approach is doomed to performance hell.
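To make the distinction concrete, here is a toy sketch. Nothing below corresponds to either real code base (the formats and names are invented), but it shows the shape of the two approaches: a filter makes one pass from disk into the memory model, while a translator reads and rewrites the whole document as a second on-disk format that then still has to be filtered in.

```python
import re

# Toy "legacy" format: one cell per line, tab-separated -> row \t col \t value

class SheetModel:
    """Stand-in for an application's in-memory spreadsheet model."""
    def __init__(self):
        self.cells = {}                      # (row, col) -> value

def legacy_filter(path):
    """File filter: one pass, straight from the on-disk format into the model."""
    model = SheetModel()
    with open(path) as f:
        for line in f:
            row, col, value = line.rstrip("\n").split("\t")
            model.cells[(int(row), int(col))] = value
    return model

def legacy_via_translation(path):
    """Translator route: rewrite the document as a second on-disk format
    (a toy XML-ish format here), then run *that* format's filter on it."""
    translated = path + ".translated"
    with open(path) as src, open(translated, "w") as dst:
        for line in src:                                    # full read ...
            row, col, value = line.rstrip("\n").split("\t")
            dst.write(f'<cell row="{row}" col="{col}">{value}</cell>\n')  # ... full rewrite ...
    return xmlish_filter(translated)                        # ... and a full second parse

def xmlish_filter(path):
    """Filter for the toy XML-ish intermediate format."""
    model = SheetModel()
    pattern = re.compile(r'<cell row="(\d+)" col="(\d+)">(.*)</cell>')
    with open(path) as f:
        for line in f:
            row, col, value = pattern.match(line).groups()
            model.cells[(int(row), int(col))] = value
    return model
```

Even in this toy form, the translation route reads, rewrites, and re-parses the entire document and leaves a temporary file behind; with two real XML formats and an XSLT engine in the middle, the overhead multiplies.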
So it is with some amusement that I watch Microsoft and others propose translation as a solution to interoperability, creating reports about translation, even a proposal for a new work item in JTC1/SC34 concerning file format translation, when the single concrete attempt at translation is such an abysmal failure. It may look great on paper, but it is an engineering disaster. What customers need is direct, internal support for ODF in MS Office, via native code, in a file filter, not a translator that takes 10 minutes to load a file.
The astute engineer will agree with the above, but will also feel some discomfort at the numbers. There is more here than can be explained simply by the use of translators versus import filters. That choice might explain a 2x difference in performance. A particularly poor implementation might explain a 5x difference. But none of this explains why MS Office is almost 40x slower in processing ODF files. Being that much slower is hard to do accidentally. Other forces must be at play.
Any ideas?
“Being that much slower is hard to do accidentally. Other forces must be at play.”
Never attribute to malice that which can be adequately explained by stupidity.
Come on, given that MS weren’t even capable of spell-checking or copy-editing the OOXML spec to a level that high school students would consider acceptably unembarrassing, I consider it entirely possible that they’re incompetent enough to have this happen accidentally.
I know they won’t do it to you, but why not be kind, give them the benefit of the doubt, and just assume they’re a complete bunch of retards?
I bet that ODF does better by not having 30 ways to specify the same thing. That should help compression a lot, even if individual tags are longer and more readable.
As for why Office sucks in terms of performance, that’s probably because there’s no money in it for them, so they let their worst coders work on it. They’re just stingy that way.
> Never attribute to malice that which can be adequately explained by stupidity.
Don’t forget that the corollary to the law you quote is that sufficiently stupid actions are indistinguishable from malice …
Only guesses about the CleverAge ODF add-in, but still:
– written in .NET, so it incurs a few seconds just to load
– not written with performance in mind, meaning bad String use all over the place. Effect: many garbage collections, a big time consumer.
– inefficiency of the resulting translated file. To store strings and numbers, there are ways Excel optimizes them (shared strings, factorized floats, single-precision versus double-precision floats). This greatly impacts both the file size and the time to open it in its native application.
-Stephane Rodriguez
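For what it’s worth, here is a rough sketch of the shared-strings idea Stephane mentions (an invented structure, not Excel’s actual on-disk layout), showing how repeated cell text can be stored once and referenced by index:

```python
def build_shared_strings(cell_values):
    """Store each distinct string once; cells keep only an index into the table."""
    table, index_of, cell_refs = [], {}, []
    for value in cell_values:
        if value not in index_of:
            index_of[value] = len(table)
            table.append(value)
        cell_refs.append(index_of[value])
    return table, cell_refs

table, refs = build_shared_strings(["Yes", "No", "Yes", "Yes", "No"])
print(table)   # ['Yes', 'No']      -- each string stored once
print(refs)    # [0, 1, 0, 0, 1]    -- cells store small indexes
```

A translator that does not rebuild this kind of table (or the equivalent number optimizations) produces a much fatter file that the native application then has to work harder to open.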
It’s hard to be shocked by a factor of 40; I don’t think it’s that uncommon to see order-of-magnitude variability in performance for the same task implemented by different programmers.
Heck, even considering a simple matrix multiplication, which is conceptually just three loops and four lines of code, implementations can vary in speed by an order of magnitude, mainly due to how efficiently the code exploits caches and pipelines.
Everyone should realize that the Microsoft-sponsored ODF plugin is just a hoax to drive people away from ODF with its lousy performance and lack of features. A poorly behaving MS Office ODF plugin is a benefit for OOXML, clearly, as it gives the casual user the impression that “OOXML seems to be faster and better than ODF, I’ll surely use OOXML!”.
However, I’d encourage everyone to try out Sun ODF Plugin for Office, much better:
http://www.sun.com/software/star/odf_plugin/index.jsp
Office has always been abysmally and unreasonably slow at loading anything but the current version of the document format it works with – even just the previous version will load an order of magnitude slower than the current one.
Although it quite probably *is* part of their marketing engine (e.g. along with all the warnings about format conversion), it could just as easily be an economic or technical reason. That is, the first version that does the job might not be the fastest, but there is no incentive to put any more work into it. It may also be that the internals are so strictly optimised for the current version that it is just difficult to make other formats work well – but an order of magnitude difference or more stretches this explanation well past breaking point.
With the ODF plugin, perhaps they ARE using XSLT translation. XSLT is a dreadfully slow and terrible technical solution to the problem it solves – it easily runs orders of magnitude slower than custom code (which in many cases isn’t much harder to write in the first place). A typical computer-science solution to an engineering problem – sounds good but just doesn’t work in practice.
Many years ago, I tested the Microsoft connector that allowed Outlook to retrieve email from a Lotus Domino server. Domino, of course, supports a hierarchical naming standard that Outlook knows nothing about, typically providing names in this format…
CN=Rob Weir/OU=US/O=IBM
The connector, needing to deal with emails where the From and CC lists were all in this format, obviously needed to translate those names into a simple common name format.
A no-brainer, right? You just take everything to the left of the first “/” character, and to the right of the “=” character, right? This would take about 2 minutes to code.
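Something along these lines, say (a rough sketch for illustration, not the connector’s actual code):

```python
def common_name(hierarchical_name):
    """Pull the common name out of a Domino-style hierarchical name."""
    first = hierarchical_name.split("/")[0]    # "CN=Rob Weir"
    return first.split("=", 1)[1]              # "Rob Weir"

print(common_name("CN=Rob Weir/OU=US/O=IBM"))  # -> Rob Weir
```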
Not according to Microsoft.
MSFT’s thinking was “the common name portion of a person’s name is typically the second entry in their alias list in the Domino directory. So we’ll look it up.”
For each name listed in the From or CC list, Outlook queried the directory on the server for the complete alias list, and then took the second element in that list. This meant that reading an email from a Domino server in the Outlook client could easily have 10 times the network impact of reading that same email from an Exchange server.
You can imagine how performance suffered as well.
Of course, there’s no rule that says the second entry in the alias list actually has to be the common name for that user. That’s how I discovered the behavior in the first place — because one particular user had an alias of a former employee in the alias list, so all messages appeared in Outlook as if they were coming from that former employee.
Who in their right mind would choose to perform a database query for each and every name in an email instead of simply parsing a string?
The answer: no one. It was obviously a feeble attempt to cripple the performance of the connector in order to make Domino look bad.
MSFT is, of course, free to code their own applications as badly as they wish. But I’ve always considered that example to be indicative of whose interests MS holds dear.
C# itself and garbage collection may explain part of this. But it isn’t like small-object allocations are problem-free in C++ either. I typically find myself writing a custom allocator in that case, keeping a large-object heap and a small-object heap to reduce heap fragmentation.
In any case, my understanding is the Add-in is a lot of XSLT with a little C#.
There also appears to be a lot of hits to temporary files on disk. That can’t be helping.
> I bet that ODF does better by not having 30 ways to specify the same thing. That should help compression a lot, even if individual tags are longer and more readable.
I would bet that the real reason ODF compresses better is that tags like o:p (IIRC) don’t make it into the Huffman coding (which typically has a minimum-character cutoff), while the more verbose tags of ODF will be Huffman coded.
Of course, I’m using terminology from rusty memory, so I might not have the terms right, but the idea is the same :)
Also, Microsoft has some of the best developers in the business (along with some bad ones, I’m sure). Unfortunately they are usually tied down by corporate policy and thus can’t always work to the best of their ability. So I wouldn’t, in general, blame MS developers for problems like this.
How about comparing the time to import MS 2007 XML into OpenOffice.org? Version 3.0b has this capability.
Just for a direct comparison, how does the Sun ODF filter for MS Office perform?
I’ve added results for Sun’s ODF Plugin for Microsoft Office, and for OpenOffice.org 3.0’s beta support for OOXML.
The interesting fact is that Sun’s ODF Plugin for Microsoft Office loads the test file 17x faster than Microsoft’s Add-in for ODF (30 seconds versus 516 seconds). Same file, same version of Microsoft Office, yet Microsoft’s Add-in is 17x slower. What’s up with that?
Perhaps RTF might be a better format for interoperability. See this posting and comments at the Microsoft Office Word Team’s Blog.
Reports on RTF format file size, speed, compatibility would be welcome… just ask Novell how it feels about having RTF as an interoperability format.
“Perhaps RTF might be a better format for interoperability. “
RTF has a standards history that looks like a rehearsal of OOXML.
Quote from GrokDoc:
http://www.grokdoc.net/index.php/Dirty_Tricks_history#RTF
[
Sadly, RTF is not a reliable standard. Why? Because its main purpose seems to be to provide seamless compatibility between versions and OS platforms of Microsoft Word itself. It is routinely updated with every new version of Microsoft Word. It has never been proposed to ANSI or the ISO or any other standards group for peer review. It is 100% Microsoft.
]
Winter
Rob,
Do you have a link to a development page of the Novell converter not using XSLT? Do you know how far they are in the development?
Thanks,
Jesper
Jesper, sorry, I do not have that info. Maybe one of our Novell readers can fill in the details.
Hi Rob,
Ok – in the mean time I’ll write to Florian to see if he knows something.
:o)
/Jesper
The great Arthur C Clarke said it best – giafly.
http://en.wikipedia.org/wiki/Clarke's_three_laws
“Any sufficiently awful stupidity is indistinguishable from malice.”