
Calling Captain Kirk

I suppose I was the odd child in my neighborhood. While the other boys were playing with light sabers and phasers, I wanted only one thing from the future: the Universal Translator, the ultimate piece of linguistic technology which would immediately translate from all alien tongues. Captain Kirk had one, and one day, I vowed, I would have one as well. Who wouldn’t want one? It certainly beats spending hours memorizing vocabulary and conjugations and declensions.

Flash forward now to the 21st Century, the present. We have Babelfish and Google, and they do a fair job at text translation, but the Universal Translator is still science fiction.

Or is it?

The Ecma Office Open XML (OOXML) specification seems to presuppose the existence of a Universal Translator of sorts. Take a look at section 11.3.1 “Alternative Format Import Part” (Page 38):

An alternative format import part allows content specified in an alternate format (HTML, MHTML, RTF, earlier versions of WordprocessingML, or plain text) to be embedded directly in a WordprocessingML document in order to allow that content to be migrated to the WordprocessingML format.

According to the schema, these alternate formats may be the main content of the document, or specifically applied to comments, endnotes, footer, footnotes or headers.

Let’s parse the original more closely, starting by defining some terms:

  • The term “part” in OOXML refers to the individual items (XML documents, images, scripts, other binary blobs, etc.) contained in the OOXML Zip file, which they call a “package”. So a package is made up of one or more parts.
  • HTML should be self-evident. But does this also include the HTML-like output from earlier versions of Word, which wasn’t always well-formed?
  • MHTML is what you get when you save a “complete web page” within Internet Explorer. It is a MIME-encoded version of the HTML page plus the embedded images. MHTML is listed as having a status of “Proposed Standard” in the IETF, but it appears to have been held at that state since 1999. (Does anyone know why it never advanced to the Standard status?)
  • RTF – Rich Text Format is a proprietary document format occasionally updated by Microsoft. As one wag quipped, “RTF is defined as whatever Microsoft Word exports when it exports to RTF”.
  • WordProcessingML – I’ve seen this term used to refer to the XML format of Word 2003 as well as Word 2007. Presumably the 2003 version is intended here?
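
To make the package and part terminology concrete, here is a minimal sketch that lists the items inside a package using nothing more than a Zip reader. The demo package is built in memory, and its item names (word/document.xml, word/media/image1.png) and placeholder contents are illustrative assumptions, not output from a real copy of Word.

```python
import io
import zipfile


def list_parts(package_bytes: bytes) -> list[str]:
    """Return the names of the items stored in an OOXML package (a Zip file)."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


def build_demo_package() -> bytes:
    """Build a tiny stand-in package in memory (hypothetical contents)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("word/document.xml", "<document/>")
        package.writestr("word/media/image1.png", b"\x89PNG")
    return buffer.getvalue()


if __name__ == "__main__":
    for name in list_parts(build_demo_package()):
        print(name)
```

Pointing list_parts at the bytes of a real .docx file would show the same kind of listing, only longer.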

As you can see, we have several problems here from a specification standpoint.

First, no versions are specified for HTML, MHTML, RTF or WordProcessingML. Are we supposed to support all versions of these? Only some? Does this include WordProcessingML from beta versions of Office 2007 as well?

Second, the specification provides no normative references for MHTML, RTF or “earlier versions of WordProcessingML”.

Third, this is a closed list of formats that seems biased toward Microsoft’s legacy formats. Why not XHTML? Why not DocBook? Why not TeX or troff? Why not ODF? Is there a legitimate reason to restrict the set of supported formats in this way?

Fourth, “plain text” is not a phrase I like to see in a file format specification, since it is undefined. No encoding is mentioned. What is meant here? ASCII, Latin-1, UTF-8, UTF-16, EBCDIC? Some of the above? All of the above? What encodings are included under the name “plain text”?
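
A short sketch shows why the encoding question matters: the same bytes yield different text depending on which encoding the consumer assumes. The sample strings and the helper name decode_as are my own illustrative choices; nothing here comes from the specification.

```python
def decode_as(data: bytes, encoding: str) -> str:
    """Decode the same byte sequence under a caller-chosen encoding."""
    return data.decode(encoding)


# Bytes written by a producer that assumed UTF-8.
accented = "naïve café".encode("utf-8")

# Eighteen bytes of pure ASCII, an even count, so they also parse as UTF-16.
ascii_bytes = b"john the ceo cries"

if __name__ == "__main__":
    # A consumer guessing Latin-1 sees different characters than one guessing UTF-8.
    print(decode_as(accented, "utf-8"))
    print(decode_as(accented, "latin-1"))

    # A consumer guessing UTF-16 (little-endian) turns each pair of bytes
    # into a single, unrelated character.
    print(decode_as(ascii_bytes, "ascii"))
    print(decode_as(ascii_bytes, "utf-16-le"))
```

Without a stated encoding, two consumers can both claim to read “plain text” and still disagree about every character in the file.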

Reading further we have:

A WordprocessingML consumer shall treat the contents of such legacy text files as if they were formatted using equivalent WordprocessingML, and if that consumer is also a WordprocessingML producer, it shall emit the legacy text in WordprocessingML format.

Three words should raise an eyebrow. The first is the use of the word “equivalent” and the other two are the instances of the word “shall”. “Shall” is spec talk for a requirement, something a conformant application must do. According to Annex H of ISO Directives Part 2, “Rules for the Structure and Drafting of International Standards”, the word “shall” is used “to indicate requirements strictly to be followed in order to conform to the document and from which no deviation is permitted.”

So, compliant consumers are required to take input from a variety of formats and convert it into the “equivalent” WordProcessingML. Putting aside the question as to what version or versions of HTML are intended, there is nothing here that defines the mapping between any version of HTML and WordProcessingML. So the conversion is application-defined. Considering that this is indicated to be a required feature of a conformant application, I find the lack of specificity here disturbing. How can there ever be interoperable processing of OOXML documents if this is not defined?

Reading the OOXML specification a little further down:

This Standard does not specify how one might create a WordprocessingML package that contains Alternative Format Import relationships and altChunk elements.

However, a conforming producer shall not create a WordprocessingML package that contains Alternative Format Import relationships and elements.

“Shall not” is another one of the special specification words. So, essentially, we’re not allowed, in a conforming application, to create a document with Alternative Format Import Parts, but if we read a document that has one, then we are required to process it, transforming it into equivalent WordProcessingML.

Further, we get this informative note:

Note: The Alternative Format Import machinery provides a one time conversion facility. A producer could have an extension that allows it to generate a package containing these relationships and elements, yet when run in conforming mode, does not do so.

Putting on my tinfoil hat for a moment, I find this all rather fishy. The OOXML specification, at 6,000+ pages, has now just sucked in the complexity of one or more versions of HTML, MHTML, RTF and WordProcessingML. It requires that a conformant application understand these formats, but forbids a conformant application from producing them.

This is another example of how you never know what you’re getting when you get an OOXML file. To support OOXML is not to support a single format, or even a single family of formats. To fully support OOXML requires that you support OOXML plus a motley hodgepodge of various other formats, deprecated, abandoned and proprietary. The cost of compatibility with billions of legacy Microsoft documents is that you must support their legacy of years of false starts and restarts in the file format arena.

When you get an OOXML document, you don’t know what is inside. It might use the deprecated VML specification for vector graphics, or it might use DrawingML. It might use the line spacing defined in WordProcessingML, or it might have undefined legacy compatibility overrides for Word 95. It might have all of its content in XML, or it might have it mostly in RTF, HTML, MHTML, or “plain text”. Or it may have any mix of the above. Even the most basic application that reads OOXML will also need to be conversant in RTF, HTML and MHTML.
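
As a rough sketch of what finding out what is inside involves, the code below scans a package's main document part for altChunk elements and reports their relationship ids. The namespace URIs, the part name word/document.xml, the id rId5, and the function name find_alt_chunks are assumptions on my part for illustration; check them against the specification before relying on them.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumed namespace URIs for WordprocessingML and for relationship ids.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def find_alt_chunks(package_bytes: bytes,
                    part_name: str = "word/document.xml") -> list:
    """Return the relationship ids of every altChunk element in one part."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read(part_name))
    return [
        element.get(f"{{{R_NS}}}id")
        for element in root.iter(f"{{{W_NS}}}altChunk")
    ]


def build_demo_package() -> bytes:
    """Build an in-memory package whose body holds one altChunk (hypothetical)."""
    document_xml = (
        f'<w:document xmlns:w="{W_NS}" xmlns:r="{R_NS}">'
        '<w:body><w:altChunk r:id="rId5"/></w:body>'
        "</w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document_xml)
    return buffer.getvalue()


if __name__ == "__main__":
    print(find_alt_chunks(build_demo_package()))
```

Detecting the chunk is the easy part. The id only points at an embedded HTML, MHTML, RTF or text part, and a consumer still has to convert that content by rules the specification does not define.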

Captain Kirk, where are you? I need a Universal Translator!

9 comments
  • Anonymous 2007/01/17, 1:39 am

    Well said, and that applies not just to introspect an OOXML file, but also to generate one : for instance what you put in the dually conflicting [Content_Types].xml and relationship type for those “legacy” formats.

    In addition, while you seem to concentrate on Word related features, the same holds true for Excel as well. In fact, it’s worse. For instance, sticky notes in Excel 2007 are now wrapped in cryptic VML. It was not the case in earlier versions. And when you take a look at how the OPC rules are violated when defining such thing, the actual comment part is inferred, you get the feeling that the barrier to entry to do whatever with OOXML : read, write, calculate, render is incredibly high and this gives Microsoft not only many years of first-mover advantage, but a vendor should budget for many years.

    This revelation I had when I tried to support the Excel charts in their new XML franca. This is so convoluted that I estimate it will take more time to decipher and abstract away than the BIFF8 records for charts. I have given up since I won’t be spending another two years chasing the tail light just because Microsoft intentionally crippled the file formats.

  • Jeff Kaplan 2007/01/17, 1:44 am

    Instead of a Universal Translator, maybe we just need a giant eraser.

    Given all the flotsam and jetsam in OOXML (i.e., the mass of legacy formats), maybe we just need a clean slate that will henceforth be easier and cheaper to implement.

    Or to keep with your Star Trek theme, we need something to break the giant tractor beam locking us into proprietary black holes.

    Enter ODF.

    Founder & Director
    Open ePolicy Group

  • Stephen Samuel 2007/01/17, 2:28 am

    As I understand it, the ‘definition’ for the old formats (like Word-97) simply suggests that an implementor reverse-engineer the appropriate program.

    I bet, however, that MS’s EULA for many (if not all) of those versions of Office and/or Word explicitly forbids this suggested reverse engineering.

    This means that — although MS might not be able to sue you for using their patents in reading an OOXML they might still be able to sue someone who (miraculously) succeeds in producing a compliant third-party OOXML reader including old formats for breach of contract.

  • Rob 2007/01/17, 8:44 am

    Aside from anti-reverse-engineering clauses, it is important to go back to Microsoft’s Open Specification Promise (OSP) to see what areas are outside of its protection. According to the OSP, things that are not described in detail, but are merely referenced in the OOXML specification, are not covered.

    So beyond the technical ambiguities and difficulties in the specification, and the economic mountain you must climb to compete against Office, there are several layers of legal hurdles whenever you try to go beyond what is detailed in the spec: 1) the OSP no longer applies, so you risk patent infringement, 2) any reverse engineering restrictions in the EULA, 3) in some areas, like DRM and encryption, the DMCA may apply.

    Of course, I am not a lawyer. But if I were implementing this stuff, I would be sure to talk to one.

  • Ed 2007/01/17, 9:38 am

    It has been my understanding that when one converts a legacy format to a new format, one reformats everything in the legacy format to fit the new format. Certainly, some elements of Microsoft’s “spec” suggest they are of the same mind, while still not wanting other people to play. Yet they also allow legacy formats to be embedded, rather than converted? This is crazy.

    Basically, what this means is that Office 2007 will need to contain all of the old code for reading and writing. Given that we understand the old writing code to be a memory dump, this suggests using the old routines as much as is possible – routines which have been demonstrated to be so full of exploitable bugs that many people conscious of security switched to OpenOffice as soon as it was available, since the unknown was statistically speaking safer.

    It also sounds like their “spec” is vague enough that not even Microsoft can produce a conformant product – they define the formats that can be embedded so vaguely that it seems *every* older format is potentially included. I don’t think they can render all of those – certainly, there are many older formats that Office 2000 cannot render.

  • Anonymous 2007/01/17, 10:02 am

    Weirdly enough, both ODF and OOXML have no real conformance clauses on how a document would classify as being an ODF or OOXML document.
    In fact, when you check the actual conformance rules in both specs, any zip file will qualify as an ODF or OOXML file.
    The conformance clauses in both specs refer mostly to applications and not to any minimum requirements for the actual document format.

    So I now classify Winzip as both an ODF and an OOXML producing application.

    I would have expected a standard to at least state that a document should meet minimum requirements, and to require certain parts of its content to validate against XML schemas.
    However, no such conformance is required within the document standards, which in general allows just about any kind of weird file to be sent as a document.

  • Rob 2007/01/17, 10:43 am

    Regarding document conformance versus application conformance — I think you’ll find that document conformance is a bit better defined than you suggest. ODF specification section 1.4 says, “The normative XML Schema for the OpenDocument format is embedded within this specification”. So that clearly defines document conformance. The Ecma OOXML also defines document conformance in section 2.4 (page 13) of that specification.

    Also, since these are both file format specifications, the normative text in its entirety defines document conformance. So I don’t think either specification can be interpreted as allowing “any zip file” to be conformant.

    Application conformance is a different story. This is much more loosely defined, and perhaps it should be since there are many different kinds of applications that might handle ODF or OOXML documents, from the very simple to the very complex. So WinZip might be a perfectly reasonable OOXML consumer. I certainly use it for that task frequently.

  • Albert 2007/01/24, 12:40 am

    “plain text” is what notepad.exe

    Windows apps auto-detect the encoding by
    looking at the data. When the result is
    ambiguous, Windows prefers to choose
    UTF-16 or that non-ISO Windows code

    This leads to some interesting bugs.
    Type “john the ceo cries” into notepad,
    without quotes or a newline. Save it.
    Restart notepad with that file, and…
    you see a row of missing-character

  • Yuhong Bao 2007/03/11, 12:39 am

    actually you see chinese characters:
