
No Representation Without Specification

Maybe I just have an ear for this, but whenever I hear a number of people saying the same odd thing, using the same strained phrase, it catches my attention and makes me take a closer look. Individuals naturally have a great diversity of expression and phrasing, so where this is lacking, and the Borg starts speaking as one, it is good to pay it some heed.

The word for today is “represents”. A few exemplary quotations to demonstrate a particular pattern of use that attracted my attention:

From Microsoft’s Open XML Community:

Open XML was designed to provide users the benefits of: faithfully representing in an open format existing office documents, interoperability, support across platforms and applications, integration with business data, internationalization, support for accessibility and assistive technologies, and long-term document preservation.

Microsoft’s Jean Paoli as quoted by Tim Anderson:

As a design goal, we said that those formats have to represent all the information that enables high-fidelity migration from the binary formats.

And Paoli again in a Microsoft press release:

So the Office Open XML file formats represent all the characteristics of the Office binary file formats, while making it easier for people to connect to the different islands of data in the enterprise.

Microsoft’s Brian Jones in a comment response on his blog:

We had to leave some legacy behaviors in place because the goal of our work was to create an XML format that could represent our existing base of Office documents.

From the OOXML Overview whitepaper [pdf] presented to JTC1:

OpenXML was designed from the start to be capable of faithfully representing the pre-existing corpus of word-processing documents, presentations, and spreadsheets that are encoded in binary formats defined by Microsoft Corporation.

From Ecma’s response to the JTC1 NB contradiction objections:

OpenXML has been designed to be capable of faithfully representing the majority of existing office documents in form and functionality.

Microsoft’s Stephen McGibbon:

I represent Microsoft at all kinds of meetings and my firm understanding is that one of the things that differentiates OpenXML and ODF is OpenXML’s ability to faithfully represent all of the previously created Microsoft Office binary format documents.

So what are we to make of this? They are being very specific about their choice of words, aren’t they? I wonder why…

A file format represents data. It stores data. It encodes data. These are all synonymous. But merely being able to represent data is trivial. For example, here is a markup language that can also represent all legacy Microsoft Office documents:


Since the above markup directly maps to binary, it can faithfully represent 100% of existing Office documents with 100% backwards compatibility. It can also represent perfectly the documents of every other vendor, past, present and future.
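
To make the point concrete, here is a minimal Python sketch of such a bit-level markup. It assumes elements named <one/> and <zero/> and a hypothetical <document> wrapper; the function names are illustrative only and come from no standard. It round-trips any byte string perfectly and tells you nothing about what the bytes mean.

```python
import re


def to_bit_markup(data: bytes) -> str:
    """Encode arbitrary bytes as a flat run of <one/> and <zero/> elements."""
    bits = "".join(f"{byte:08b}" for byte in data)
    body = "".join("<one/>" if bit == "1" else "<zero/>" for bit in bits)
    # The <document> root element is an assumption made for this sketch.
    return f"<document>{body}</document>"


def from_bit_markup(markup: str) -> bytes:
    """Decode the markup back into the original bytes."""
    tokens = re.findall(r"<(one|zero)/>", markup)
    bits = "".join("1" if token == "one" else "0" for token in tokens)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


if __name__ == "__main__":
    legacy = b"\xd0\xcf\x11\xe0"  # any bytes at all will do
    markup = to_bit_markup(legacy)
    # The round trip is lossless, yet the markup carries no structure or semantics.
    assert from_bit_markup(markup) == legacy
```

The round trip is lossless for every possible input, which is exactly why "faithful representation" by itself is such a weak claim.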

But before Ecma gets all excited that they may soon have another standard to Fast Track, I must admit the obvious. This markup is not all that useful as an interoperable document format. Why? Because although it can represent 100% of legacy documents, it does not specify how to do anything with them. Except at the level of a bit, the format does not express any structure or semantics. Although you can express anything you want with 1’s and 0’s, the format provides for no common, interoperable use above the level of 1’s and 0’s. My binary document means something only to me, and unless I go outside of the standard and share additional information with you, you will not be able to understand my binary document.

Interoperability comes not from representation, but from specification.

(An aside — There is however speculation that it is possible to transmit information via a binary code in a way that presupposes no other prior agreement or knowledge other than universals like mathematical and physical laws. It would require a bootstrapping approach where very basic elements of notation and mathematical logic are transmitted, followed by increasingly more complex concepts. By this theory it would be possible to communicate with alien intelligences without any prior conventions. See, for example, Carl Sagan’s novel, Contact. But this is probably overkill for an office document format, unless your workplace is a lot stranger than mine.)

So what is the difference between representing and specifying? When you represent, it means that you can map from the features of the legacy format to the new format. When you specify, it means that you provide the map, and enough detail so that others can read and write that same representation. That is a big difference.

Of course, OOXML is more than 1’s and 0’s. But when you see attributes with names like, “useWord97LineBreakRules,” with no additional specification, then you know that the fix is in. My guess is that MS Word has code someplace that looks like this:

if (useWord97LineBreakRules)
    doCrappyOldWayOfLineBreaking(); // reuse legacy code from Word 97
else
    doNewWayOfLineBreaking(); // use new rules

If this is true, then MS Word can implement this feature trivially. But no one else can make sense of it, because we lack a specification of its behavior. They might as well have called the attribute “Fred.” It is just as useful.

Another example is how OOXML deals with PowerPoint slide transitions, the things that people use in an attempt to make a boring presentation seem more interesting. Microsoft has ensured that they can represent all of the transitions. They are all there listed in Section blinds, checker, circle, comb, cover, cut, etc. But when you drill down into the definitions, this is what you find:

wheel (Wheel Slide Transition)

This element describes a wheel slide transition effect.

[Example: Consider we have a slide with a wheel slide transition. The <wheel> element should be used as follows:

End example]

That’s it. Ditto for all of the other slide transitions. Not exactly specified fully, is it? Although the text claims that it “describes a wheel slide transition effect,” in truth it merely labels it. There is no specification, only representation. And that curious little example — is this some sort of joke? Did someone really think that attributes with no definition are improved by trivial examples? It reminds me of the old spelling bee joke:

Judge: The word is “synecdoche.”
Student: Could you use that in a sentence?
Judge: Certainly. “Synecdoche” is a very hard word to spell.

100% correct, but also 100% useless. As I read through the OOXML specification I am finding hundreds of places like this where things are labeled, but no definition is given.

So I think we need to ask more questions when we hear the claims that OOXML was designed to faithfully represent 100% of the legacy documents. We need to respond that representation is not enough for an open format. Even an XML format of just <one>’s and <zero>’s can do that. To be of use to anyone other than Microsoft we need more than just representation. We need specification, and we need the map to the legacy formats. To accept anything else is to embark on a voyage with a foreign dictionary missing the definitions. It can represent everything that you want to say, but you’ll be unable to say any of it.

ISO defines a standard as a:

…document, established by consensus and approved by a recognized body, that provides, for common and repeated use, rules, guidelines or characteristics for activities or their results, aimed at the achievement of the optimum degree of order in a given context

A key clause there is the requirement to provide for “common and repeated use.” Providing explicit representation for a single vendor’s legacy formats, without providing for common use of that ability, is not the purpose of an ISO standard, and to my eyes it appears to be an abuse of the standardization process.

{ 7 comments… add one }
  • Anonymous 2007/06/19, 8:06 pm

    Didn’t I see that in a movie somewhere? I think they were transmitting in unary, though. They first transmitted two pulses, then three, then five, etc. and went through a bunch of prime numbers. Not sure what they did after that.

    Oh, there’s also been a mention that ANSI is getting form letters, see the Groklaw news picks, it’s somewhere on the sidebar. I liked the “although this is a form letter” type bit on one of them :-)

  • Rob 2007/06/19, 9:14 pm

    Carl Sagan’s novel Contact, later made into a movie starring Jodie Foster and Matthew McConaughey, had that idea. Stanislaw Lem’s novel His Master’s Voice is similar.

    I’ve seen those form letters. We’re getting 3 or 4 a day in INCITS, but I’ve heard that they are arriving in other countries as well.

  • Anonymous 2007/06/19, 10:44 pm

    Speaking of hearing the same thing over and over again…

    How about all the duplicate letters at INCITS, with more and more every day (tenfold duplicates!)

    ” Dear Lisa Rajchel,

    We are writing to voice our strong support for the approval of Ecma’s Office Open XML File Formats as an ISO/IEC International Standard. We strongly urge the American National Standards Institute to communicate its support for the ISO/IEC ratification of this standard to the JTC1 Secretariat.
    Open XML represents an important advance in document standards that offers benefits to technology users, the technology industry, consumers, businesses and governments worldwide. The standard received a strong endorsement when it was approved by Ecma in December 2006 and submitted to JTC1 for fast-track approval.

    Open XML will enable backward compatibility with billions of archived documents, and the extensive standard accommodates a wide range of languages and cultures, as well as assistive technologies that help people with disabilities. Governments and businesses will both benefit from the standard itself, as well as from the range of new products that implement the standard. Furthermore, Open XML in no way contradicts any other international document standard.

    Thank you for your support for Open XML. If you have any questions, please contact ____ ____ “


  • Anonymous 2007/06/20, 3:48 am
  • Anonymous 2007/06/20, 4:00 am

    Heh – it’s not just when the same phrase keeps popping up that things get interesting. I always like to look at how well MS stacks up when it starts attacking people on a certain front.

    Like when they were making all the noise about ODF spreadsheets not being portable because formulae weren’t specified, when their own formats didn’t either.

    With the patent saber-rattling about Linux, it’s interesting to look at how many patent lawsuits have been brought against MS and how often they’ve settled, and then to look at how many patent lawsuits have been brought against the OSS products they claim infringe upon theirs.

    I’ve often found when MS start bad-mouthing another company/group, they’re normally /more/ guilty of the thing they’re screaming about than the person they’re accusing.

  • Rob 2007/06/20, 7:49 am

    The spreadsheet formula issue is an interesting one. Microsoft never documented theirs before. So back when they were scared by ODF in Massachusetts they recommended the Office 2003 Reference Schemas for state use, even though they lacked spreadsheet formulas. It wasn’t until the ODF TC started writing a formula specification that Microsoft suddenly got religion and decided that they needed one too.

    Certainly, they completed their version faster. All they really needed to do was to transcribe some internal Excel documentation. That isn’t much of a standardization activity. The ODF TC, however, sat down and looked at the wide range of spreadsheets in use today, commercial and open source, involved multiple vendors, brought in experts (a professor of statistics, for example), identified what was in common and what the conventions were, categorized the functions by frequency of use, etc., and wrote up a much more detailed specification.

    When you compare the two, you’ll see the difference. For example, the Ecma version doesn’t even state whether the SIN() and COS() functions take their arguments in degrees or radians. You would think that this is something important to know and would have been noticed in even a cursory review by Ecma.

  • Anonymous 2007/06/21, 1:34 am

    Negotiating a communication code goes in stages. First you need to establish a self delimiting code (see Chaitin, of IBM)

    Say, you get the following sequences:
    0, 1010, 1011, 1101000, …, 1101011,
    11011000, …., 11011111, 11101000000, …, 11101001111
    Guessing these should be self delimiting sequences, you can work out the code (prepend a binary of the length in bits, eg, 100 = 4, and prepend that with a 1 for every bit in the length code followed by a 0, eg, 11110).

    Having established this code, start defining a universal turing machine using code -> output relations. Eg, “A” “B” is a turing machine that on input A generates B and halts. Then, work up to more complex machines, enumerate symbols and state change tables.

    If the other side knows their mathematics and assumes you want to communicate too, this should eventually work.
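
The self-delimiting code described in the last comment can be sketched in Python. This is one reading of the commenter's description, not a definitive implementation: the function names are hypothetical, payloads are strings of '0' and '1' characters, the empty payload is assumed to encode as a lone "0", and the decoder assumes a well-formed stream.

```python
def encode(payload: str) -> str:
    """Prefix a bit string so a receiver can tell where it ends.

    Layout: one '1' per bit of the length code, then '0', then the
    payload length in binary, then the payload itself.
    """
    if not payload:
        return "0"  # assumption: a zero-length length code, then the 0 marker
    length_code = format(len(payload), "b")
    return "1" * len(length_code) + "0" + length_code + payload


def decode_stream(stream: str) -> list:
    """Split a concatenation of encoded words back into their payloads."""
    payloads = []
    i = 0
    while i < len(stream):
        n = 0
        while stream[i] == "1":  # count the unary prefix
            n += 1
            i += 1
        i += 1  # skip the terminating '0'
        length = int(stream[i:i + n], 2) if n else 0
        i += n
        payloads.append(stream[i:i + length])
        i += length
    return payloads


if __name__ == "__main__":
    # Reproduces sequences quoted in the comment.
    assert encode("0") == "1010"
    assert encode("1111") == "11101001111"
    assert decode_stream(encode("0") + encode("11")) == ["0", "11"]
```

Because every word announces its own length, words can be concatenated with no separator between them, which is the first stage the commenter describes.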

