
A bit about the bit with the bits

I had an interesting meal in Paris a few weeks ago at a small bistro. I like Louisiana Cajun-style food, especially spicy andouille sausage, so when I saw “andouillette” on the menu, my stomach grumbled in anticipation. Certainly, the word ended in “ette”, but even my limited knowledge of French told me that this is just a diminutive ending. So maybe these sausages were small. No big deal, right?

When my lunch arrived, something was not quite right. First, this did not smell like any andouille sausage I had ever had. It was a familiar scent, but I couldn’t quite place it. But as soon as I cut into the sausage, and the filling burst out of the casing, it was clear what I had ordered. Tripe. Chitterlings. Pig intestines. With french fries.

I then knew where I had smelt this before. My grandfather, a Scotsman, was fond of his kidney pies and other dishes made of “variety meats”. This is food from an earlier time. The high fat content, and (in earlier days at least) cheaper prices of these cuts of meat provided essential meals for the poor. Although my grandfather ate these dishes out of preference, I’m pretty sure that his grandfather ate them out of necessity. How times change.

This was brought to mind recently as I was reading the “final draft” of the Ecma Office Open XML (OOXML) specification. In it I found something that was probably once done out of necessity in the memory-poor world of 1985, but now looks like an anachronism in the modern world of XML markup.

I’m talking about bitmasks. If you are a C programmer then you know already what I am talking about.

In C, imagine you want to store values for a number of yes/no (Boolean) type questions. C does not define a Boolean type, so the convention is to use an integer type and set it to 1 for true, and 0 for false. (Or in some conventions, 0 for true and anything else for false. Long story.) The smallest variable you can declare in C is a “char” (character) type, on most systems 8 bits (1 byte long) or even padded to a full 16 bits. But the astute reader will notice that a yes/no boolean question is really expressing only 1 bit of information, so storing it in an 8 bit character is a waste of space.

Thus the bitmask, a technique used by C programmers to encode multiple values into a single char (or int or long) variable by ascribing meaning to individual bits of the variables. For example, an 8-bit char can actually store the answer to 8 different yes/no questions, if we think of it in binary. So 10110001 is Yes/No/Yes/Yes/No/No/No/Yes. Expressed as an integer, it can be stored in a single variable, with the value of 177 (the decimal equivalent of 10110001).
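
To make the arithmetic concrete, here is a minimal sketch of the packing step. The function name pack_answers is my own invention for illustration, and I assume the first answer lands in the leftmost bit, which is how the binary string above reads.

```c
#include <assert.h>
#include <stddef.h>

/* Pack eight yes/no answers into one byte.
   answers[0] becomes the leftmost (most significant) bit, so the
   array reads in the same order as the binary string in the text.
   pack_answers is a hypothetical helper, not a standard function. */
unsigned char pack_answers(const int answers[8])
{
    unsigned char packed = 0;

    assert(answers != NULL);
    for (size_t i = 0; i < 8; i++) {
        /* Shift the bits collected so far one place to the left,
           then drop the next answer into the lowest bit. */
        packed = (unsigned char)((packed << 1) | (answers[i] ? 1 : 0));
    }
    return packed;
}
```

Feeding it Yes/No/Yes/Yes/No/No/No/Yes, that is {1, 0, 1, 1, 0, 0, 0, 1}, gives back 177.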

The C language does not provide a direct way to set or query the value of an individual bit, but it does provide some “bitwise” operators that can be used to indirectly set and query bits in a bitmask. So if you want to test whether the fifth bit (counting from the right) is true, you do a bitwise AND with the number 16 and see if the result is anything other than zero. Why 16? Because 16 in binary is 00010000, so doing a bitwise AND will extract just that single bit. Similarly, you can set a single bit by doing a bitwise OR with the right value. This is one of the reasons why facility with binary and hexadecimal representations is important for C programmers.
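
A small sketch of those two operations, for the curious. The helper names (bit_mask, test_bit, set_bit) are hypothetical, and I count bits from 1 at the right-hand end, as in the paragraph above.

```c
#include <assert.h>

/* Mask with only the n-th bit set, counting from 1 at the right-hand
   (least significant) end. bit_mask(5) is 16, binary 00010000.
   All three helpers are illustrative names, not library functions. */
unsigned int bit_mask(unsigned int n)
{
    assert(n >= 1 && n <= 8);
    return 1u << (n - 1);
}

/* Bitwise AND keeps only the bit we ask about; nonzero means "yes". */
int test_bit(unsigned char flags, unsigned int n)
{
    return (flags & bit_mask(n)) != 0;
}

/* Bitwise OR turns the chosen bit on and leaves the others alone. */
unsigned char set_bit(unsigned char flags, unsigned int n)
{
    return (unsigned char)(flags | bit_mask(n));
}
```

With the earlier value of 177 (10110001), test_bit(177, 5) is true because 177 AND 16 leaves 16, while test_bit(177, 2) is false. Calling set_bit(177, 2) turns that second bit on and yields 179.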

So what does this all have to do with OOXML?

Consider this C-language declaration:

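typedef struct tagLOCALESIGNATURE {
    DWORD lsUsb[4];           /* Unicode subset bitfield, 128 bits */
    DWORD lsCsbDefault[2];    /* default code page bitfield, 64 bits */
    DWORD lsCsbSupported[2];  /* supported code page bitfield, 64 bits */
} LOCALESIGNATURE, *PLOCALESIGNATURE;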

This, from MSDN, is described as a memory structure for storing:

…extended font signature information, including two code page bitfields (CPBs) that define the default and supported character sets and code pages. This structure is typically used to represent the relationships between font coverage and locales.

Compare this data structure to the XML defined in section (page 759) of Volume 4 of the OOXML final draft:

The astute reader will notice that this is pretty much a bit-for-bit dump of the Windows SDK memory structure. In this case the file format specification provides no abstraction or generalization. It merely is a memory dump of a Windows data structure.
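
To give a feel for what a consumer of such a dump has to do, here is a rough sketch of testing one bit in the 128-bit lsUsb field. It is not taken from the Windows SDK or from the OOXML draft. The locale_signature type is my own portable stand-in that uses uint32_t in place of DWORD, usb_bit_is_set is a hypothetical helper, and I assume bit 0 is the least significant bit of the first array element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for the structure quoted above; uint32_t replaces
   DWORD so the sketch compiles without <windows.h>. */
typedef struct {
    uint32_t lsUsb[4];
    uint32_t lsCsbDefault[2];
    uint32_t lsCsbSupported[2];
} locale_signature;

/* Report whether bit `bit` (0..127) of the lsUsb field is set.
   Assumption: bit 0 is the lowest bit of lsUsb[0], bit 32 is the
   lowest bit of lsUsb[1], and so on. */
int usb_bit_is_set(const locale_signature *sig, unsigned int bit)
{
    assert(sig != NULL && bit < 128);
    /* bit / 32 picks the array element, bit % 32 the position in it. */
    return (int)((sig->lsUsb[bit / 32] >> (bit % 32)) & 1u);
}
```

None of that index arithmetic is visible to an XML tool. It lives only in the head of whoever reads the integers back out.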

This is one example of many. Other uses of bitmasks in OOXML include things such as:

  • paragraph conditional formatting
  • table cell conditional formatting
  • table row conditional formatting
  • table style conditional formatting settings exception
  • pane format filter

If this all sounds low-level and arcane, then you perceive correctly. I like the obscure as much as the next guy. I can recite Hammurabi in Old Babylonian, Homer in Greek, Catullus in Latin and Anonymous in Old English. But when it comes to an XML data format, I seek to be obvious, not obscure. Manipulating bits, my friends, is obscure in the realm of XML.

Why should you care? Bitmasks are used by C programmers, so why not in XML? One reason is that addressing bits within an integer runs into platform-specific byte ordering differences. Different machine processors (physical and virtual) make different assumptions. Two popular conventions go by the names of Big-endian and Little-endian. It would divert me too far from my present argument to explain the significance of that, so if you want more detail I suggest you seek out a programmer with grey hairs and ask him about byte-ordering conventions.
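
For readers without a grey-haired programmer at hand, here is a minimal sketch of the two conventions. The function names are made up for illustration. The first two functions lay out the same 32-bit value in each byte order, and the third asks which order the machine running the code happens to use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write value into out[] most significant byte first (big-endian). */
void store_big_endian(uint32_t value, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)((value >> 24) & 0xFFu);
    out[1] = (unsigned char)((value >> 16) & 0xFFu);
    out[2] = (unsigned char)((value >> 8) & 0xFFu);
    out[3] = (unsigned char)(value & 0xFFu);
}

/* Write value into out[] least significant byte first (little-endian). */
void store_little_endian(uint32_t value, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)(value & 0xFFu);
    out[1] = (unsigned char)((value >> 8) & 0xFFu);
    out[2] = (unsigned char)((value >> 16) & 0xFFu);
    out[3] = (unsigned char)((value >> 24) & 0xFFu);
}

/* Return 1 if this machine keeps the low-order byte of an integer
   at the lowest address, 0 otherwise. */
int host_is_little_endian(void)
{
    uint32_t probe = 1;
    unsigned char first_byte;

    memcpy(&first_byte, &probe, 1);
    return first_byte == 1;
}
```

Storing 177 big-endian gives the bytes 00 00 00 B1, while little-endian gives B1 00 00 00. It is the same number, but the byte holding my eight answers sits at opposite ends.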

A second reason to avoid bitmasks in XML is that a bitmask sits outside the XML data model. You’ve created a private data model inside an integer and it cannot be described or validated by XML Schema, RELAX NG, Schematron, etc. Even XSLT, the most-used method of XML transformation today, lacks functions for bit-level manipulations. TC45’s charter included the explicit goal of:

…enabling the implementation of the Office Open XML Formats by a wide set of tools and platforms in order to foster interoperability across office productivity applications and with line-of-business systems

I submit that the use of bitmasks is not the thing to do if you want support in a “wide set of tools and platforms”. It can’t be validated and it can’t be transformed.

Thirdly, the reasons for using bitmasks in the first place are not relevant in XML document processing. Don’t get me wrong. I’m not saying bit-level data structures are always wrong on all occasions. They are certainly the bread and butter of systems programmers, even today, and they were truly needed in the days when data was transferred via XModem on 12kbps lines. But in XML, where the data is already in an expansive text representation to facilitate cross-platform use, trying to save a byte of storage here or there, at the expense of the additional code and complexity required to deal with bitmasks, is the wrong trade-off. Remember, in the end the XML gets zipped up anyways, and will typically end up being 10-20% the size of the same document in DOC format. So these bitmasks aren’t really saving you much, if any, storage.

Fourthly, bitmasks are not self-describing. If I told you the “table style conditional formatting exception” had the value of 32, would that mean anything to you? Or would it send you hunting through a 6,000+ page specification in search of a meaning? But what if I told you that the value was “APPLY_FIRST_ROW”, then what would you say? A primary virtue of XML is that it is humanly readable. Why throw that advantage away?
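
Here is a sketch of the lookup table every consumer ends up writing just to turn such a number back into words. Only the pairing of 32 with APPLY_FIRST_ROW comes from my example above. The other two entries, and the function describe_flags itself, are hypothetical and are not taken from the OOXML draft.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

struct flag_name {
    unsigned int mask;
    const char *name;
};

/* Only 32 <-> APPLY_FIRST_ROW comes from the text above.
   The remaining entries are hypothetical, for illustration only. */
static const struct flag_name FLAG_NAMES[] = {
    { 32u,  "APPLY_FIRST_ROW" },
    { 64u,  "APPLY_LAST_ROW" },      /* hypothetical */
    { 128u, "APPLY_FIRST_COLUMN" },  /* hypothetical */
};

/* Write the space-separated names of every known flag set in value
   into out. Returns the number of characters written; if the buffer
   is too small the list is cut short but stays NUL-terminated. */
size_t describe_flags(unsigned int value, char *out, size_t out_size)
{
    size_t used = 0;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t i = 0; i < sizeof FLAG_NAMES / sizeof FLAG_NAMES[0]; i++) {
        if (value & FLAG_NAMES[i].mask) {
            int n = snprintf(out + used, out_size - used, "%s%s",
                             used > 0 ? " " : "", FLAG_NAMES[i].name);
            if (n < 0 || (size_t)n >= out_size - used) {
                break;  /* out of room: stop adding names */
            }
            used += (size_t)n;
        }
    }
    return used;
}
```

Given 32, this fills the buffer with “APPLY_FIRST_ROW”. The point is that the table has to be copied out of the specification by hand, whereas a named value in the XML would have carried its meaning with it.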

Finally, there are well-supported alternatives to bitmasks in standard XML, such as enumeration types in XML Schema. Why avoid a data representation that allows both validation and manipulation by common XML tools?

It seems to me that the only reason that bitmasks were used here is that the Excel application already used them. Much easier for Microsoft to make the specification match the source code than to make a standard that is good, platform and application neutral XML.

So, for the second time in a month the thought enters my mind: “You expect me to eat this tripe?!”

This work, unless otherwise expressly stated, is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License.

15 comments

  • ray 2006/10/17, 04:09

    Hi Rob

    Would it not have been a good idea for IBM to have joined TC45 and brought all these issues up before the spec reached a final stage?

  • Rob 2006/10/17, 07:57

    Ray, I certainly have not been holding back on my criticism of the draft OOXML. A read of previous entries in this blog reveals my criticism of OOXML is not exactly a new avocation. I’ve been pointing out its deficiencies since the first drafts.

    In any case, I don’t believe that my meager talents would have been useful to any TC whose charter constrained them to producing a specification that preferentially advantages a single vendor to the exclusion of all others. That’s a “sweetheart deal” of a charter for Microsoft to have, and on principle I’d decline to participate.

  • hAl 2006/10/17, 09:31

    [quote]A primary virtue of XML is that it is humanly readable. Why throw that advantage away?[/quote]

    Because it is useless in an office file format. Anybody that is reading the file from the XML using only the tags as a describing guide should be locked up in an asylum.

  • Rob 2006/10/17, 09:59

    Maybe I should be in a straight jacket then. I read XML all the time. Of course, the average end user won’t touch XML, but the average developer will. And I don’t mean just people who write file open code for Excel. Anyone who writes a program, from a script, to a full GUI for reading, writing, modifying, scanning, indexing, filtering, etc., OOXML files will need to deal with the XML.

    Having a humanly readable format is invaluable for debugging, testing, etc. It shortens the mental distance from the output to the interpretation of the output.

    This is similar to the reason why choosing good variable names in a program leads to easier comprehension and maintenance, and hopefully higher quality and lower cost. Ditto for well designed functions, classes, modules, libraries, etc. We’ve acquired the wisdom over the years to realize the importance of well-designed and well-documented code libraries. Although the end-user does not see the code directly, the end user should care, because the quality, innovation and cost of the product they use is partially determined by the quality of the libraries it is built upon.

    A well-designed XML language is very similar in this regard.

  • hAl 2006/10/17, 11:28

    [quote]Maybe I should be in a straight jacket then. I read XML all the time[/quote]
    I also often read xml.
    But not the xml of a standard file format. Because it is standardised there is no need for descriptive tags. The standard is the descriptive part.
    Descriptive tagging is useful for non-standard xml like in x-html webpages where each page can have different new xml tags.
    In standardized xml formatting there is much room for efficiency and optimising the xml as xml generally creates a very bloated inefficient format.

  • Rob 2006/10/17, 12:10

    The fact that something is a standard makes readable XML no less important. A developer should not need to hunt through a 6,000 page specification every time they want to find out what an element is. The names of elements and attributes should give a good indication of their meaning.

    Ditto for things like the standard C library. The fact that it is a standard does not eliminate the need for choosing good names. Sure, I bet a C compiler would be slightly faster if printf() was just called p() and fopen() called f(), but the gain is minuscule.

    Here is the simple argument for choosing intelligibility over compaction for things that developers need to work with on a daily basis: machines will get faster, storage will get cheaper, bandwidth will increase, latency will decrease. But developers are not going to get any smarter than they are today. So a trade-off that sacrifices comprehension for minuscule performance benefits is usually the wrong decision.

    Also, keep in mind that ODF, with longer, more descriptive names results in smaller sized documents which parse faster than the same document expressed as OOXML. Choosing smaller element names, as OOXML does, cannot make up for the fact that OOXML requires far more XML documents to describe the same document. That is where the time is spent, parsing many small XML files. It is sad that OOXML has ended up both slower, as well as more obscure. Not much of a trade-off, eh?

  • hAl 2006/10/18, 09:14

    Funnily enough I do not find C lib routine names descriptive enough. Those are objects with names created for everyday use by programmers, so they should be descriptive, and the names have no need for optimising as they are removed by compilers anyways.
    Those names are created for interpretation by humans (mostly programmers only).
    XML tags in an office format are mostly interpreted by applications like office tools. Their use for human interpretation is extremely limited compared to the interpreting done by programs. With C routine names this is just the opposite.

  • Steve 2006/10/23, 01:48

    HAL, you’re approaching this from the wrong direction. Identifiers and data types should be as human-readable as they can. C etc. has many limits with identifier names and data types, but XML does not. Whether you, or one person, or a million people look at the XML or not; XML is designed to be wordy. The idea of self-describing data is at the heart of XML.

    You should never throw that away unless
    1) Brevity is an issue
    2) Technical limitations preclude it
    Neither is the case with XML.

    You want to make more work on yourself, fine, but don’t use XML. Since “no one will read it” anyway, you might as well use a binary format. The rest of the world recognizes the value of self-documenting data.

  • hAl 2006/10/23, 09:10

    The purpose of self-documenting data in XML is the flexibility that it should give you in creating different content and the ability to interpret content based on different tags.
    However, that ability of XML is completely useless for standardised formats like ODF or OOXML.
    Most of the interpretation (99.9% or more) is done via an application which is made against the standard.
    A programmer of such an application will work from the specs (which is the standard) and not from the descriptive tag.

  • hAl 2006/10/23, 09:22

    “you might as well use a binary format.”
    There isn’t any standardized form of binary format for this level of information. Else that would indeed be a lot more efficient.
    Where I work we exchange about 1 million real-time remote interface transactions. Only 2% of those are done in XML and those cause 31% of all online waiting time.

  • Rob 2006/10/23, 09:31

    “A programmer of such an application will work from the specs (which is the standard) and not from the descriptive tag.”

    Think of it this way. Searching through a 6,000 page specification to find the meaning of an element name will take you how long? 30 seconds? 2 minutes? Keep in mind that short names make searching more difficult. Good luck searching the 6,000 page specification for the meaning of an element called “t”.

    On the other hand, having a name that is self-describing will take you how long to interpret? 0.1 seconds? 0.5 seconds?

    This doesn’t mean the names need to be long. But they should be long enough to mean something to the reader without having to search through a 6,000 page specification.

    The fact that 99.9% of the interpretation is done by machines is irrelevant. The important fact is that 100% of the bugs are caused by humans, humans who are writing code under short schedules, pressure from their bosses and not enough time to test adequately. Using intuitive names in XML is for their benefit.

  • hAl 2006/10/23, 16:17

    I have been a programmer for about 8 years, but when writing an interpreter you create a good technical analysis document to program the code from.
    The code can contain self-describing names for procedures and/or functions and variables, but the xml tagging is just constant data for a programmer.
    A good programmer does not use [long-name] or [l] in his programming. A programmer creates constants for all tags.
    For instance:
    tag_open_table_cell = “[table-cell]”
    tag_close_table_cell = “[/c]”

    (changed tags to ‘[’ and ‘]’ due to blog limitations)

  • Steve 2006/10/24, 21:38

    It seems, HAL, that you’re not against verbose XML per se, but against any verbosity in open standards. Well, different strokes for different folks. XML is not a panacea–it is not the answer to all file formats.

    I think that with preservation and semantic content as 2 key goals of ODF, the benefits of XML outweigh the drawbacks. A standard is a great thing to have, yet ODF also uses human-readable XML because standards may be inconvenient or impossible to use far in the future. ODF is a totally free standard (i.e. it will last a long time), so we are doubly blessed.

  • hAl 2006/10/25, 14:43

    Verbose tagging is useful when the tags need to mean something. When creating xhtml documents with your own variable tags it is handy to know what the data between the tags represents.
    However, when the tags themselves become the data that is not so relevant. When you program to build standardized xml formatting then the need is even lower. Then it might be a consideration to take into account other things like the performance, the memory use and the disk space. Fortunately the disk space issues are resolved immediately by placing this particular format into zip containers.
    The suggestion here that verbose tagging improves the programming surrounding the format I find mainly a sign of using poor programmers. Tags in a standard file are static data to a programmer. All tags are known beforehand and cannot change. It would be weird using them in a program like is suggested here.

    If it is proven that the verbose tags perform exactly the same as the non-verbose tags I would certainly not object to using them.
    But claiming there is a need for verbose tagging should have some decent basis and I cannot find that when verbose tags are used in a standard.
    If people were actually creating odf documents using an ascii text editor and typing the tags together with the office data then verbose tagging would have a reasonable use.

  • Arne Vogel 2006/12/13, 18:35

    hAl (funny, I know a Microsoft fanboy who calls himself H A L), of course any programmer worth his money will be able to figure out and handle a complex file format such as OOXML, especially given the specification (even if it’s longish). However, the point that you are missing is that it will still take him much longer than figuring out and using a simpler and more self-explanatory format such as ODF. What this boils down to is less work for highly qualified people, or lower costs for the customer. I’m inclined to believe though that ignoring the customer will give Microsoft more than short-term benefits.

    “you might as well use a binary format. There isn’t any standardized form of binary format for this level of information. Else that would indeed be a lot more efficient.”

    hAl, this is true, but if Office Open XML is standardized, then only in the sense of having an “ECMA standard” label attached to it. XML is standardized of course, but it’s not a format, it’s a meta-format. While XML nicely handles syntactical and some structural issues, it cannot cover semantic problems because it was never designed to. Auxiliary standards are necessary to achieve this, and by that I mean standards that actually try to satisfy the immediate needs of their users, and not the needs of isolated application maintainers who have to deal with legacy cruft like incorrect date handling for the year 1900. Developers matter, but in this case, instead of giving developers something that works well for most, Microsoft is asking the majority of developers to bend over backwards so that a few guys at Microsoft itself have fewer worries. This is certainly not what I intended to express.

    Also, what’s efficient is a question of criteria. Zipped ODF is a lot more space-efficient than binary MS Office files.

    “Suggestion here that verbose tagging improves the programming surrounding the format I find mainly a sign of using poor programmers. Tags in a standard file are static data to a programmer. All tags are known beforehand and cannot change.”

    hAl, are you kidding or where can I buy the stuff you seem to be smoking? Of course tags are irrelevant if all you are doing is dump data to and from XML in a single application. You could just as well enumerate them. But we are talking about a format whose stated purpose is interoperability between different applications, possibly even in different programming environments, where you cannot even re-use libraries easily if you had the source code. This means there are going to be two to many (given the baseless insults you sputter, I am however no longer sure whether you can count that far) implementations, probably written by many different people. This means a training effort that someone, anyone has to pay for, even if you might not, and obviously less effort per person saves actual hard-earned money. If you don’t care, then it’s either because you don’t care about money at all (I’ll be glad to give you my IBAN), or because it’s simply not yours. The latter would be called egocentrism.

    “If people were actually creating odf documents using an ascii tekst editor and typing the tags together with the office data then verbose tagging would have a reasonable use.”

    Really? Have you ever written an XSLT stylesheet? Sure, there may eventually be a reliable, efficient and cheap Java library, with a flexible license, an easy to use API, excellent documentation and a vibrant user community that is eager to provide support, which transparently handles OOXML documents. Also, there may be a Perl module, and a Ruby module and whatever. Just as there might have been such libraries which directly read and write MS Office binary formats, but which in the real world never materialized, because no one had both the ability and the inclination to tackle this beast. In an ideal world, such libraries will be growing on source trees. My prediction, however, is that lack of suitable libraries will again be a constraint that greatly reduces the de facto interoperability of the new MS Office formats. While library support for ODF is at present inadequate, too, lack of libraries is much less of a problem given a simple and self-explanatory format than given OOXML.
