I’ve been doing some performance timings of file format support, comparing MS Office and OpenOffice. Most of the results are as expected, but some are surprising, and one in particular is quite disappointing.
But first, a few details of my setup. All timings were done by stopwatch, with Office 2003 and OpenOffice 2.4.0 running on Windows XP, with all current service packs and patches. The machine is a Lenovo T60p, dual-core Intel 2.16 GHz with 2 GB of RAM. I took all the standard precautions: the disk was defragmented, and the test files were confirmed as defragmented using contig. No other applications were running, and background tasks were all shut down.
For test files, I went back to an old favorite: George Ou’s (at the time with ZDNet) monster 50MB XLS file from his series of tests back in 2005. This file, although very large, is very simple. There are no formulas, and indeed no formatting or styles. It is just text and numbers, treating a spreadsheet like a giant data table. So tests of this file will emphasize the raw throughput of the applications. Real-world spreadsheets will typically be worse than this, due to the additional overhead of processing styles, formulas, etc.
A test of a single file is not really that interesting. We want to see trends, see patterns. So I made a set of variations on George’s original file, converting it into ODF, XLS, and OOXML formats, as well as making scaled-down versions of it. In total I made 12 different-sized subsets of the original file, ranging down to a 437KB version, and created each file in all three formats. I then tested how long it took to load each file in each of the applications. In the case of MS Office, I installed the current versions of the translators for those formats: the Compatibility Pack for OOXML, and the ODF Add-in for ODF support.
I find it convenient to report numbers per 100,000 spreadsheet cells. You could equally well normalize by the original XLS file size, the number of rows of data, or any other correlated variable, but values per 100K cells are simple for anyone to understand.
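The normalization itself is trivial. For anyone reworking the raw data, here is a minimal Python sketch; the 90-second, 3-million-cell example anticipates the monster file discussed below:

```python
def per_100k_cells(value, total_cells):
    """Normalize a raw measurement (seconds, KB, ...) to a
    per-100,000-cell rate."""
    return value / (total_cells / 100_000)

# Example: a 3,000,000-cell sheet that takes 90 seconds to load
# works out to 3.0 seconds per 100K cells.
print(per_100k_cells(90, 3_000_000))
```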
I’ll spare you all the pretty pictures. If you want to make some, here is the raw data (CSV format). But I will give some summary observations.
For document sizes, the results are as follows:
- Binary XLS format = 1,503 KB per 100K cells
- OOXML format = 491 KB per 100K cells
- ODF format = 117 KB per 100K cells
So the XML formats are far smaller than the legacy binary format. This is due to the Zip compression that both XML formats use. Also, note that the ODF files are significantly smaller than the OOXML files, less than 1/4 the size on average. Upon further examination, the XML document representing the ODF content is larger than the corresponding XML in OOXML, as expected, due to its use of longer, more descriptive markup tags. However, the ODF XML compresses far better than the OOXML version, enough to overcome its greater verbosity and result in files smaller than the OOXML ones. The compression ratio (original/zipped) for ODF’s content.xml is 87, whereas the compression ratio for OOXML’s sheet1.xml is only 12. We could just mumble something about entropy and walk away, but I think this area could bear further investigation.
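If you want to poke at this yourself, both ODF and OOXML packages are ordinary Zip archives, so the per-entry compression ratios are easy to inspect. Here is a minimal Python sketch; the file names are placeholders for whatever test files you have at hand:

```python
import zipfile

def compression_ratios(path):
    """Print the compression ratio (uncompressed / compressed) for each
    entry in an ODF or OOXML package; both are ordinary Zip archives."""
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            if info.compress_size > 0:  # skip zero-length entries
                print(f"{info.filename}: "
                      f"{info.file_size / info.compress_size:.1f}")

compression_ratios("monster.ods")   # look for content.xml
compression_ratios("monster.xlsx")  # look for xl/worksheets/sheet1.xml
```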
Any ideas?
For load time, the results for the binary XLS files were:
- Microsoft Office 2003 = 0.03 seconds per 100K cells
- OpenOffice 2.4.0 = 0.4 seconds per 100K cells
Not too surprising. These binary formats are optimized for the guts of MS Office. We would expect them to load faster in their native application.
So what about the new XML formats? There has been recent talk about the “Angle Bracket Tax” for XML formats. How bad is it?
- Microsoft Office 2003 with OOXML = 1.5 seconds per 100K cells
- OpenOffice 2.4.0 with ODF = 2.7 seconds per 100K cells
For typically sized documents, you probably will not notice the difference. However, with the largest documents, like the 16-page, 3-million-cell monster sheet, the OOXML document took 40 seconds to load in MS Office and the ODF sheet took 90 seconds to load in OpenOffice, whereas the XLS binary took less than 2 seconds to load in MS Office.
OK. So what are we missing? Ah, yes: ODF format in MS Office, using their ODF Add-in.
- Microsoft Office 2003 with ODF, using the ODF Add-in = 74.6 seconds per 100K cells
Yup. You read that right. To put this in perspective, let’s look at a single test file, a 600K-cell file, as we load it in the various formats and editors:
- Microsoft Office 2003 in XLS format = 0.75 seconds
- OpenOffice 2.4.0 in XLS format = 3.03 seconds
- Microsoft Office 2003 in OOXML format = 8.28 seconds
- OpenOffice 2.4.0 in ODF format = 14.09 seconds
- Microsoft Office 2003 in ODF format = 515.60 seconds
Can someone explain to me why Microsoft Office needs almost 10 minutes to load an ODF file that OpenOffice can load in 14 seconds?
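If you prefer the ratios spelled out, the arithmetic on those five numbers is simple:

```python
# Load times (seconds) for the 600K-cell file, from the list above.
times = {
    "MS Office / XLS":    0.75,
    "OpenOffice / XLS":   3.03,
    "MS Office / OOXML":  8.28,
    "OpenOffice / ODF":  14.09,
    "MS Office / ODF":  515.60,
}
baseline = times["OpenOffice / ODF"]
for combo, seconds in times.items():
    print(f"{combo}: {seconds / baseline:5.1f}x the OpenOffice/ODF time")
# MS Office / ODF comes out to roughly 36.6x, the "almost 40x" cited below.
```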
(I was not able to test files larger than this using the ODF Add-in, since they all crashed.)
(Update: Since it is the question everyone wants answered, the beta version of OpenOffice 3.0 opens the OOXML version of that file in 49.4 seconds, and Sun’s ODF Plugin for Microsoft Office loads it in 30.03 seconds.)
This is one reason why I think file format translation is a poor engineering approach to interoperability. When OpenOffice wants to read a legacy XLS file, it does not approach the problem by translating the XLS into an ODF document and then loading that ODF file. Instead, it simply loads the XLS file, via a file filter, into the internal memory model of OpenOffice.
What is a file filter? It is like half of a translator. Instead of translating from one disk format to another disk format, it simply loads the disk format and maps it into an application-specific memory model that the application logic can operate on directly. This is far more efficient than translation. This is the untold truth that the layperson does not know. But this is how everyone does it. That is how we support formats in SmartSuite. That is how OpenOffice does it. And that is how MS Office does it for the file formats they care about. In fact, that is the way Novell is now doing it too, since they discovered that the Microsoft approach is doomed to performance hell.
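To make the distinction concrete, here is an illustrative Python sketch of the two architectures. This is not any product’s actual code; the parse_xls, serialize_odf, and parse_odf helpers are hypothetical stand-ins for real format parsers:

```python
# Hypothetical stand-ins for real parsers/serializers; sketch only.
def parse_xls(data): ...       # disk bytes -> in-memory document model
def serialize_odf(model): ...  # document model -> ODF bytes
def parse_odf(data): ...       # ODF bytes -> in-memory document model

def load_via_translator(xls_bytes):
    """Translator approach: convert one disk format into another disk
    format, then load the result. Every load pays for two full parses
    plus an intermediate serialization."""
    model = parse_xls(xls_bytes)
    odf_bytes = serialize_odf(model)
    return parse_odf(odf_bytes)

def load_via_filter(xls_bytes):
    """File filter approach: map the disk format straight into the
    application's own memory model. One parse, no intermediate file."""
    return parse_xls(xls_bytes)
```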
So it is with some amusement that I watch Microsoft and others propose translation as a solution to interoperability, creating reports about translation, and even a proposal for a new work item in JTC1/SC34 concerning file format translation, when the single concrete attempt at translation is such an abysmal failure. It may look great on paper, but it is an engineering disaster. What customers need is direct, internal support for ODF in MS Office, via native code, in a file filter, not a translator that takes 10 minutes to load a file.
The astute engineer will agree with the above, but will also feel some discomfort at the numbers. There is more here than can be explained simply by the choice of translators versus import filters. That choice might explain a 2x difference in performance. A particularly poor implementation might explain a 5x difference. But none of this explains why MS Office is almost 40x slower at processing ODF files. Being that much slower is hard to do accidentally. Other forces must be at play.
Any ideas?