Archives for October 2006

Ass-backwards Compatibility

2006/10/27 By Rob Leave a Comment

I refer you to Ben Langhinrichs at Genii Software with his post on “Self deprecating standards”.

After you’ve read that, then I hope you’ll return back here for some additional thoughts.

From what Ben writes, it seems that one source of OOXML’s length and complexity is that it had grandfathered in all sorts of special cases and exceptions for older versions of Office. So an implementor not only must worry about how Office 2007 does footnote placement, but also needs to worry about how Word 6 and Word 95 did it differently. Many other examples in his post. (Ben, you should take a look at the VML section — that entire markup is deprecated in the spec. )

We call this “crud”, the accumulated debris from old implementations, old platforms, old formats, old mistakes. Don’t get me wrong. As a forensic exercise in documenting the byzantine complexity of this arcane format the draft OOXML specification is a considerable achievement. The 6,000 pages testify to the industry of its authors as well as the thanklessness of their task.

But we need to keep in mind that there is a difference between a specification and a standard. A specification tells the plain facts of what a particular technology does without comment as to whether it is good or bad. I could drop a box of toothpicks on my desk and write up a detailed specification on how they landed. An open standard, on the other hand, goes beyond mere specification, and promotes a preferred way of achieving cross-vendor and cross-application interoperability. Ideally a standard says, “This specification is good, we thought of the wider needs of the market, including other vendors and the consumer, and if we all implement this technically elegant specification then we all win.”

However one thinks of the OOXML specification, it lacks the quality, the consideration and the perspective of an open standard. It was doomed from day 1, when its charter limited TC45 to not making any changes that would be incompatible with Microsoft’s existing proprietary binary formats. Think of it this way — The binary formats were certainly never designed to be a standard, right? They were developed in-house at Microsoft, kept from public scrutiny for 15 years, never brought before a standards organization, licensed on terms that prevented competition with Office, etc. Now, 15 years later they take the accumulated crud from 15 years of a proprietary binary format, convert it into XML and expect to call that an “open standard”?

From an architectural and design perspective this is all ass-backwards. You can build complexity on top of simplicity, but you can’t easily build simplicity on top of complexity. The recommended approach when designing a standard-aspiring document format is to survey the range of functionality in existing and anticipated implementations, map out the intersection of this complexity and then generalize it. The generalization of complexity often leads to simplification. A good example is the art page borders, where the OOXML specification enumerates hundreds of specific images that can be used in page borders, an approach which is at once limiting and complex. The generalization of having the document include the clipart would both simplify the specification as well as increase its functionality. Instead of this approach, Microsoft has squandered their opportunity to improve document-centric computing and interoperability and instead produced 6,000 pages of indigestible complexity, an approach that cuts off alternate implementations and props up their monopoly.

It is time to move on.

The Chernobyl Design Pattern

2006/10/26 By Rob 14 Comments

In 1994, the world learned that the Intel Pentium chip had a bug. In certain cases it gave the wrong answer when calculating floating-point division. These cases were rare, only 1 in 9 billion divisions, and typically only resulted in errors past the 8th decimal place.

What did Intel do about this? Well, there was denial at first, and then dismissal of the problem as being trivial and unimportant. But eventually they saw the light and offered a no-questions-asked replacement policy for defective processors. No doubt this was expensive for Intel, but this preserved their good name and reputation.

It could have been different. For example, they could have simply kept the bug. They could have preserved that bug in future versions of the Pentium for backwards compatibility, arguing that there was some software out there that may have worked around the original defect, and for Intel to fix the bug now would only break the software that worked around the bug. This is a dangerous line of reasoning. What bug can’t be excused by that argument?

Intel could have further decided to turn their bug into a standard, and get it blessed by a standards development organization and maybe even ISO. “It’s not a bug, it’s a standard”.

But Intel is not Microsoft, so they don’t have quite the audacity to turn a bug into a standard, which is what Microsoft is attempting to do by declaring in Office Open XML (OOXML) that the the year 1900 should be treated as a leap year, in contradiction of the Gregorian Calendar which has been in use almost 500 years. (Years divisible by 100 are leap years only if they are also divisible by 400)

By mandating the perpetuation of this bug, we are asking for trouble. Date libraries in modern programming languages like C, C++, Java, Python, Ruby all calculate dates correctly according to the Gregorian Calendar. So any interpretation of dates in OOXML files in these languages will be off by one day unless the author of the software adds their own workaround to their code to account for Excel’s bug. Certainly some will make the “correction” properly, at their own expense. But many will not, perhaps because they did not see it deep within the 6,000 page specification.

There is something I call the “Chernobyl Design Pattern”, where you take your worst bug, the ugliest part of your code, the part that is so bad, so radioactive that no one can touch it without getting killed, and you make it private and inaccessible, and then put a new interface around it, essentially entomb it in concrete so that no one can get close to it. In other words, if you can’t fix it, at least contain the damage, prevent it from spreading.

Microsoft has taken another approach here. Instead of containment, they are propagating the bug even further. We need to think beyond Excel and think as well of other applications that work with OOXML data, and other applications that work with those apps and so on, the entire network of data dependencies. The mere existence of this bug in a standard will lead to buggy implementations, poor interoperability, and general chaos around dates. The fallout of this bug should have been contained within the source code of Excel. For this to leak out, into a specification, then a standard and then into other implementations, contradicting both the civil calendar and every other tool that deals with dates, will pollute the entire ecosystem.

This is bad news. Just say no.

Why is OOXML Slow?

2006/10/19 By Rob 5 Comments

Of course, one could simply dismiss this question, saying that a specification for an XML vocabulary does not have performance as such, since a specification cannot be executed. However, the choices one makes in designing an XML language will impact the performance of the applications that work with the format. For example, both ODF and OOXML store their XML in compressed Zip files. This will cause reading and writing of a document to be faster in cases where memory is plentiful and computation is much faster than storage and retrieval, which is to say on most modern desktops. But this same scheme may be slower in other environments, say PDA’s. In the end, the performance characteristics of a format cannot be divorced from operational profile and environmental assumptions.

When comparing formats, it is important to isolate the effects of the format versus the application. This is important from the analysis standpoint, but also for legal reasons. Remember that the only implementation of (draft) OOXML is (beta) Office 2007, and the End User Licence Agreement (EULA) has this language:

7. SCOPE OF LICENSE. …You may not disclose the results of any benchmark tests of the software to any third party without Microsoft’s prior written approval

So let’s see what I can do while playing within those bounds. I started with a sample of 176 documents, randomly selected from the Ecma TC45’s document library. I’m hoping therefore that Microsoft will be less likely to argue that these are not typical. These documents are all in the legacy binary DOC format and include agendas, meeting minutes, drafts of various portions of the specification, etc.

Some basic statistics on this set of documents:

Min length = 1 page
Mode = 2 pages
Median length = 7 pages
Mean length = 34 pages
Max length = 409 pages

Min file size= 27,140 bytes
Median file size= 159,000 bytes
Mean file size= 749,000 bytes
Max file size= 15,870,000 bytes

So rather than pick a single document and claim that it reflects the whole, I looked at a wide range of document sizes in use within a specific document-centric organization.

I converted each document into ODF format as well as OOXML, using OpenOffice 2.03 and Office 2007 beta 2 respectively. As has been noted before, both ODF and OOXML formats are XML inside of a Zip archive. The compression from the zipping not only counters the expansion factor of the XML, but in fact results in files which are smaller than the original DOC files. The average OOXML document was 50% the size of the original DOC file, and the average ODF document was 38% the size of the DOC. So net result is that the ODF documents came out smaller, averaging 72% of their OOXML equivalents.

A quick sanity check of this result is easy to perform. Create an empty file in Word in OOXML format, and an empty file in OpenOffice in ODF format. Save both. The OOXML file ends up being 10,001 bytes, while the ODF file is only 6,888 bytes, or 69% of the OOXML file.

Here is a histogram of the ODF/OOXML size ratios for the sampled files. As you can see, there is a wide range of behaviors here, with some files even ending up larger in ODF format. But on average the ODF files were smaller.

What about the contents of the Zip archives? The OOXML documents tended to contain more XML files (on average 6 more) than the parallel ODF document, but these XML files were individually smaller, average 32,080 bytes versus 66,490 for ODF. However the net effect is that the average total size of the XML in the OOXML is greater than in ODF (684,856 bytes versus 401,406 bytes).

Here’s part 2 of the experiment. The proposal is that many (perhaps most) tools that deal with these formats will need to read and parse all of the XML files within the archive. So a core part of performance that these apps will share is how long it takes to unzip and parse these XML files. Of course this is only part of the performance story. What the application does with the parsed data is also critical, but that is application-dependent and hard to generalize. But the basic overhead of parsing is universal.

To test this out wrote a Python script to time how long it takes to unzip and parse (Python 2.4 minidom) all the XML’s in these 176 documents. I repeated each measurement 10 times and averaged. And I did this for both the OOXML and the ODF variants.

The results indicate that the ODF documents were parsed, on average 3.6x faster than the equivalent OOXML files. Here is a plot showing the ratio of OOXML parse time to ODF parse time as a function of page size:As you can see there is a wide variation in this ratio, especially with shorter documents. In some case the OOXML document took 8x or more longer time to parse than the equivelant ODF document. But with longer documents the variation settles out and settles on the 3.6x factor mentioned

Now how do we explain this? A rough model of XML parsing performance is that it has a fixed overhead to start up, initialize data structures, parse tables, etc., and then some incremental cost dependent on the size and complexity of the XML document. Most systems in the world work like this, fixed overhead plus incremental cost per unit of work. This is true whether we’re talking about XML parsing, HTTP transfers, cutting the lawn or giving blood at a blood bank. The general insight into these systems is that where the fixed overhead is significant, you want to batch up your work. Doing many small transactions will kill performance.

So one theory is that OOXML is slower because of the cost of initializing more XML parses. But it could also be because the aggregate size of the XML files are larger. More testing would be required to gauge the relative contribution of these two factors. However one thing is clear. Although this test was done with minidom on Python, the results are of wide applicability. I can think of no platform and no XML parser for which a larger document comprised of more XML files would be faster than a smaller document made up of fewer XML files. Parsing ODF word processing documents should be faster than OOXML versions everywhere.

I’m not the first one to notice some of these difference. Rick Jelliffe did some analysis of the differences between OOXML and ODF back in August. He approached it from a code complexity view, but in passing noted that the same word processor document loaded faster in ODF format in OpenOffice compared to the same document in OOXML format in Office 2007 beta. On the complexity side he noted that the ODF markup was more complex than the parallel OOXML document. So if ODF is more complex but also smaller, this may amount to higher information density, compactness of expression, etc., and that could certainly be a factor in performance.

So what’s your theory? Why do you think ODF word processing documents are faster than OOXML’s?

The Celerity of Verbosity

2006/10/17 By Rob 16 Comments

I’ve been hearing some rumblings from the north-west that Ecma Office Open XML (OOXML) format has better performance characteristics than OpenDocument Format (ODF), specifically because OOXML uses shorter tag names. Putting aside for the moment the question of whether OOXML is in fact faster than ODF (something I happen not to believe), let’s take a look at this reasonable question: What effect does using longer, humanly readable tags have on performance compared to using more cryptic terse names?

Obviously there are a number of variables at play here:

What XML API are you using? DOM or SAX? The overhead of holding the entire document in memory at once would presumably cause DOM to suffer more from tag length than SAX.
What XML parser implementation are you using? The use of internal symbol tables might make tag length less important or even irrelevant in some parsers.
What language are you programming in? Some language, like Java have string internalization features which can conflate all identical strings into a single instance.
What size document are you working with? Document parsing has fixed overhead as well as overhead proportionate to document size. A very short document will be dominated by fixed costs.

So there may not be a single answer for all users with all tools in all situations.

First, let’s talk a little about the tag length issue. It is important to note that the designer of an XML language has control over some, but not all names. For example take a namespace declaration:

xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006"

The values of namespace URI’s are typically predetermined and are often long in order to reduce the chance of accidental collisions. But the namespace prefix is usually chosen to be quite short, and is under the control of the application writing the XML, though a specific prefix is typically not mandated by language designer.

Element and attribute names can certainly be set by the language designer.

Attribute values may or may not be determined by the language designer. For example:

val="Heading1"

Here the name of the style may be determined by the template, or even directly by the user if he is entering a new named style. So the language designer and the application may have no control over the length of attribute values. Other attribute values may be fixed as part of the schema, and the length of those are controlled by the language designer.

Similarly, the length of character content is also typically determined by the user, since this is typically how free-form user content is entered, i.e., the text of the document.

Finally, note that the core XML markup for beginning and ending elements, delimiting attribute values, character entities etc., are all non-negotiable. You can’t eliminate them to save space.

Now for a little experiment. For the sake of this investigation, I decided to explore the performance of a DOM parse in Python 2.4 of a medium-sized document. The document I picked was a random, 60 page document selected from Ecma TC45’s XML document library which I converted from Microsoft’s binary DOC format into OOXML.

As many of you know, an OOXML document is actually multiple XML documents stored inside a Zip archive file. The main content is in a file called “document.xml” so I restricted my examination to that file.

So, how much overhead is there in a our typical OOXML document? I wrote a little Python script to count up the size of all of the element names and attributes names that appeared in the document. I counted only the characters which were controllable by the language designer. So w:pPr counts as three characters, counting only “pPr” since the namespace and XML delimiters cannot be removed. “pPr” is what the XML specification calls an NCName, also called a non-qualified name, since it is not qualified or limited by a namespace. There were 51,800 NCName’s in this document, accounting for 16% of the overall document size. The average NCName was 3.2 characters long.

For comparison, a comparably sized ODF document had an average NCName length of 7.7 and an NCName’s represented 24% of the document size.

So, ODF certainly uses longer names than OOXML. Personally I think this is a good thing, from the perspective of readability, a concern of particular interest to the application developer. Machines will get faster, memory will get cheaper, bandwidth will increase and latency will decrease, but programmers will never get any smarter and schedules will never allow enough time to complete the project. Human Evolution progresses at too slow a speed. So if you need to make a small trade-off between readability and performance, I usually favor readability. I can always tune the code to make it faster. But the developers are at a permanent disadvantage if the language uses cryptic. I can’t tune them.

But let’s see if there is really a trade-off to be made here at all. Let’s measure, not assume. Do longer names really hurt performance as Microsoft claims?

Here’s what I did. I took the original document.xml and expanded the NCNames for the most commonly-used tags. Simple search and replace. First I doubled them in length. Then quadrupled. Then 8x longer. Then 16x and even 32x longer. I then timed 1,000 parses of these XML files, choosing the files at random to avoid any bias over time caused by memory fragmentation or whatever. The results are as follows:

Expansion Factor	NCName Count	Total NCName Size (bytes)	File size (bytes)	NCName Overhead	Average NCName Length (bytes)	Average Parse Time (seconds)
1 (original)	51,800	166,898	1,036,393	16%	3.2	3.3
2	51,800	187,244	1,056,739	18%	3.6	3.2
4	51,800	227,936	1,097,443	21%	4.4	3.2
8	51,800	309,320	1,178,827	26%	6.0	3.2
16	51,800	472,088	1,341,595	35%	9.1	3.3
32	51,800	797,624	1,667,131	48%	15.4	3.3

If you like box-and-whisker plots (I sure do!) then here you go:What does this all mean? Even though we expanded some NCNames to 32-times their original length, making a 5x increase in the average NCName length, it made no significant difference in parse time. There is no discernible slow down in parse time as the element and attribute names increase.

Keep in mind again that the typical ODF documents shows an average NCName length of 7.7 . The above tests dealt with lengths twice that amount, and still no slowdown.

“Myth Busted”. I revert this topic to the spreaders of such FUD to substantiate their contrary claims.

A bit about the bit with the bits

2006/10/15 By Rob 16 Comments

I had an interesting meal in Paris a few weeks ago at a small bistro. I like Louisiana Cajun-style food, especially spicy andouille sausage, so when I saw “andouillette” on the menu, my stomach grumbled in anticipation. Certainly, the word ended in “ette”, but even my limited knowledge of French told me that this is just a diminutive ending. So maybe these sausages were small. No big deal, right?

When my lunch arrived, something was not quite right. First, this did not smell like any andouille sausage I had ever had. It was a familiar scent, but I couldn’t quite place it. But as soon as I cut into the sausage, and the filling burst out of the casing, it was clear what I had ordered. Tripe. Chitterlings . Pig intestines. With french fries.

I then knew where I had smelt this before. My grandfather, a Scotsman, was fond of his kidney pies and other dishes made of “variety meats”. This is food from an earlier time. The high fat content, and (in earlier days at least) cheaper prices of these cuts of meat provided essential meals for the poor. Although my grandfather ate these dishes out of preference, I’m pretty sure that his grandfather ate them out of necessity. How times change.

This was brought to mind recently as was reading the “final draft” of the Ecma Office Open XML (OOXML), something that was probably once done out of necessity in the memory-poor world of 1985, but now looks like an anachronism in the modern world of XML markup.

I’m talking about bitmasks. If you are a C programmer then you know already what I am talking about.

In C, imagine you want to store values for a number of yes/no (Boolean) type questions. C does not define a Boolean type, so the convention is to use an integer type and set it to 1 for true, and 0 for false. (Or in some conventions, 0 for true and anything else for false. Long story.) The smallest variable you can declare in C is a “char” (character) type, on most systems 8 bits (1 byte long) or even padded to a full 16 bits. But the astute reader will notice that a yes/no boolean question is really expressing only 1 bit of information, so storing it in an 8 bit character is a waste of space.

Thus the bitmask, a technique used by C programmers to encode multiple values into a single char (or int or long) variable by ascribing meaning to individual bits of the variables. For example, an 8-bit char can actually store the answer to 8 different yes/no questions, if we think of it in binary. So 10110001 is Yes/No/Yes/Yes/No/No/No/Yes. Expressed as an integer, it can be stored in a single variable, with the value of 177 (the decimal equivalent of 10110001).

The C language does not provide a direct way to set or query the values of an individual bit, but it does provide some “bitwise” operators that can be used to indirectly set and query bits in a bitmask. So if you want to test to see if the fifth (counting from the right) bit is true, then you do a bitwise AND with the number 16 and see if it is anything other than zero. Why 16? Because 16 in binary is 00010000, so doing a bitwise AND will extract just that single bit. Similarly you get set a single bit by doing a bitwise OR with the right value. This is one of the reasons why facility with binary and hexadecimal representations are important for C programmers.

So what does this all have to do with OOXML?

Consider this C-language declaration:
typedef struct tagLOCALESIGNATURE { DWORD lsUsb[4]; DWORD lsCsbDefault[2]; DWORD lsCsbSupported[2]; } LOCALESIGNATURE, *PLOCALESIGNATURE;

This, from MSDN is described as a memory structure for storing:

…extended font signature information, including two code page bitfields (CPBs) that define the default and supported character sets and code pages. This structure is typically used to represent the relationships between font coverage and locales.

Compare this data structure to the XML defined in section 2.8.2.16 (page 759) of Volume 4 the OOXML final draft:

The astute reader will notice that this is pretty much a bit-for-bit dump of the Windows SDK memory structure. In this case the file format specification provides no abstraction or generalization. It merely is a memory dump of a Windows data structure.

This is one example of many. Other uses of bitmasks in OOXML include things such as:

paragraph conditional formatting
table cell conditional formatting
table row conditional formatting
table style conditional formatting settings exception
pane format filter

If this all sounds low-level and arcane, the you perceive correctly. I like the obscure as much as the next guy. I can recite Hammurabi in Old Babylonian, Homer in Greek, Catullus in Latin and Anonymous in Old English. But when it comes to an XML data format, I seek to be obvious, not obscure. Manipulating bits, my friends, is obscure in the realm of XML.

Why should you care? Bitmasks are use by C programmers, so why not in XML? One reason is addressing bits within an integer runs into platform-specific byte ordering difference. Different machine processors (physical and virtual) make different assumptions. Two popular conventions are go by the names of Big-endian and Little-endian. It would divert me too far from my present argument to explain the significance of that, so if you want more detail on that I suggest you seek out a programmer with grey hairs and ask him about byte-ordering conventions.

A second reason to avoid bitmasks in XML is that avoids being part of the XML data model. You’ve created a private data model inside an integer and it cannot be described or validated by XML Schema, RELAX NG, Schematron, etc. Even XSLT, the most-used method of XML transformation today, lacks functions for bit-level manipulations. TC45’s charter included the explicit goal of:

…enabling the implementation of the Office Open XML Formats by a wide set of tools and platforms in order to foster interoperability across office productivity applications and with line-of-business systems

I submit that the use of bitmasks is the not the thing to do if you want support in a “wide set of tools and platforms”. It can’t be validated and it can’t be transformed.

Thirdly, the reasons for using bitmasks in the first place are not relevant in XML document processing. Don’t get me wrong. I’m not saying bit-level data structures are always wrong on all occasions. They are certainly the bread and butter of systems programmers, even today, and they was truly needed in the days where data was transferred via XModem on 12kbps lines. But in XML, when the representation of the data is already in an expansive text representation to facilitate cross-platform use, trying to save a byte of storage here or there, at the expense of the additional code and complexity required to deal with bitmasks, that the wrong trade-off. Remember in the end, the XML gets zipped up anyways, and will typically end up to be 10-20% the size of the same document in DOC format. So, these bitmasks aren’t really saving you much, if any, storage.

Fourthy, bitmasks are not self-describing. If I told you the “table style conditional formatting exception” had the value of 32, would that mean anything to you? Or would it send you hunting through a 6,000+ page specification in search for a meaning? But what if I told you that the value was “APPLY_FIRST_ROW”, then what would you say? A primary virtue of XML is that it is humanly readable. Why throw that advantage away?

Finally, there are well supported alternatives to bitmasks in standard XML, such as enumeration types on XML Schema. Why avoid a data representation that allows both validation and manipulation by common XML tools?

It seems to me that the only reason that bitmasks were used here is that the Excel application already used them. Much easier for Microsoft to make the specification match the source code than to make a standard that is good, platform and application neutral XML.

So, for the second time in a month the thought enters my mind: “You expect me to eat this tripe ?!”