I suppose I was the odd child in my neighborhood. While the other boys were playing with light sabers and phasers, I wanted only one thing from the future: the Universal Translator, the ultimate piece of linguistic technology which would immediately translate from all alien tongues. Captain Kirk had one, and one day, I vowed, I would have one as well. Who wouldn’t want one? It certainly beats spending hours memorizing vocabulary and conjugations and declensions.
Or does it?
The Ecma Office Open XML (OOXML) specification seems to presuppose the existence of a Universal Translator of sorts. Take a look at section 11.3.1 “Alternative Format Import Part” (Page 38):
An alternative format import part allows content specified in an alternate format (HTML, MHTML, RTF, earlier versions of WordprocessingML, or plain text) to be embedded directly in a WordprocessingML document in order to allow that content to be migrated to the WordprocessingML format.
According to the schema, these alternate formats may be the main content of the document, or specifically applied to comments, endnotes, footers, footnotes or headers.
Let’s parse the original more closely, starting by defining some terms:
- The term “part” in OOXML refers to the individual items (XML documents, images, scripts, other binary blobs, etc.) contained in the OOXML Zip file, which they call a “package”. So a package is made up of one or more parts.
- HTML should be self-evident. But does this also include the HTML-like output from earlier versions of Word, which wasn’t always well-formed?
- MHTML is what you get when you save a “complete web page” within Internet Explorer. It is a MIME-encoded version of the HTML page plus the embedded images. MHTML is listed as having a status of “Proposed Standard” in the IETF, but it appears to have been held at that state since 1999. (Does anyone know why it never advanced to the Standard status?)
- RTF – Rich Text Format is a proprietary document format occasionally updated by Microsoft. As one wag quipped, “RTF is defined as whatever Microsoft Word exports when it exports to RTF”.
- WordprocessingML – I’ve seen this term used to refer to the XML format of Word 2003 as well as Word 2007. Presumably the 2003 version is intended here?
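To make the package/part terminology concrete, here is a minimal sketch in Python that treats a package as an ordinary Zip file and lists its entries. The part names in the demo package (including `word/afchunk.html`) are hypothetical, chosen only for illustration, and the sketch ignores the relationship and content-type machinery that a real package also carries.

```python
import io
import zipfile


def list_parts(package_bytes: bytes) -> list[str]:
    """Return the names of the Zip entries in a package, in stored order."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


def build_demo_package() -> bytes:
    """Build a tiny in-memory Zip with hypothetical part names."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", "<document/>")
        package.writestr("word/media/image1.png", b"\x89PNG")
        # Hypothetical alternative-format part holding HTML content.
        package.writestr("word/afchunk.html", "<html><body>legacy</body></html>")
    return buffer.getvalue()


parts = list_parts(build_demo_package())
for name in parts:
    print(name)
```

Nothing in the Zip container itself says which entries are XML and which are HTML, RTF or something else; that only becomes clear once you look inside each part.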
As you can see, we have several problems here from a specification standpoint.
First, no versions are specified for HTML, MHTML, RTF or WordprocessingML. Are we supposed to support all versions of these? Only some? Does this include WordprocessingML from beta versions of Office 2007 as well?
Second, the specification provides no normative references for MHTML, RTF or “earlier versions of WordprocessingML”.
Third, this is a closed list of formats that seems biased toward Microsoft’s legacy formats. Why not XHTML? Why not DocBook? Why not TeX or troff? Why not ODF? Is there a legitimate reason to restrict the set of supported formats in this way?
Fourth, “plain text” is not a phrase I like to see in a file format specification, since it is undefined. No encoding is mentioned. What is meant here? ASCII, Latin-1, UTF-8, UTF-16, EBCDIC? Some of the above? All of the above? What encodings are included under the name “plain text”?
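The encoding question is not academic. As a small illustration (the sample bytes are my own, not from the specification), the same five bytes of “plain text” decode to different strings depending on which encoding the consumer guesses:

```python
# Five bytes that a producer might call "plain text".
raw = b"caf\xc3\xa9"

# The same bytes under three of the encodings named above.
as_utf8 = raw.decode("utf-8")      # two bytes combine into one accented letter
as_latin1 = raw.decode("latin-1")  # every byte becomes its own character
as_ebcdic = raw.decode("cp037")    # cp037 is one EBCDIC code page among many

print(repr(as_utf8), len(as_utf8))
print(repr(as_latin1), len(as_latin1))
print(repr(as_ebcdic), len(as_ebcdic))
```

None of the three decodes raises an error, so a consumer that guesses wrong gets no warning, only different text.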
Reading further we have:
A WordprocessingML consumer shall treat the contents of such legacy text files as if they were formatted using equivalent WordprocessingML, and if that consumer is also a WordprocessingML producer, it shall emit the legacy text in WordprocessingML format.
Three words should raise an eyebrow. The first is the use of the word “equivalent” and the other two are the instances of the word “shall”. “Shall” is spec talk for a requirement, something a conformant application must do. According to Annex H of the ISO/IEC Directives, Part 2, “Rules for the Structure and Drafting of International Standards”, the word “shall” is used “to indicate requirements strictly to be followed in order to conform to the document and from which no deviation is permitted.”
So, compliant consumers are required to take input from a variety of formats and convert it into the “equivalent” WordprocessingML. Putting aside the question as to what version or versions of HTML are intended, there is nothing here that defines the mapping between any version of HTML and WordprocessingML. So the conversion is application-defined. Considering that this is indicated to be a required feature of a conformant application, I find the lack of specificity here disturbing. How can there ever be interoperable processing of OOXML documents if this is not defined?
Reading the OOXML specification a little further down:
This Standard does not specify how one might create a WordprocessingML package that contains Alternative Format Import relationships and altChunk elements.
However, a conforming producer shall not create a WordprocessingML package that contains Alternative Format Import relationships and elements.
“Shall not” is another one of the special specification words. So, essentially, we’re not allowed, in a conforming application, to create a document with Alternative Format Import Parts, but if we read a document that has one, then we are required to process it, transforming it into equivalent WordprocessingML.
Further, we get this informative note:
Note: The Alternative Format Import machinery provides a one time conversion facility. A producer could have an extension that allows it to generate a package containing these relationships and elements, yet when run in conforming mode, does not do so.
Putting on my tinfoil hat for a moment, I find this all rather fishy. The OOXML specification, at 6,000+ pages, has now just sucked in the complexity of one or more versions of HTML, MHTML, RTF and WordprocessingML. It requires that a conformant application understand these formats, but forbids a conformant application from producing them.
This is another example of how you never know what you’re getting when you get an OOXML file. To support OOXML is not to support a single format, or even a single family of formats. To fully support OOXML requires that you support OOXML plus a motley hodgepodge of various other formats, deprecated, abandoned and proprietary. The cost of compatibility with billions of legacy Microsoft documents is that you must support their legacy of years of false starts and restarts in the file format arena.
When you get an OOXML document, you don’t know what is inside. It might use the deprecated VML specification for vector graphics, or it might use DrawingML. It might use the line spacing defined in WordprocessingML, or it might have undefined legacy compatibility overrides for Word 95. It might have all of its content in XML, or it might have it mostly in RTF, HTML, MHTML, or “plain text”. Or it may have any mix of the above. Even the most basic application that reads OOXML will also need to be conversant in RTF, HTML and MHTML.
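As a rough sketch of what “finding out what is inside” involves, the Python below scans a document part for altChunk elements and returns the relationship ids they point at. Several things here are assumptions on my part rather than anything quoted above: that the main document part lives at `word/document.xml`, that the two namespace URIs match the edition of the specification in use, and the demo document itself. A real consumer would locate the main part through the package relationships and would then still have to resolve each id to an HTML, MHTML, RTF or plain text part and convert it.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Assumed namespace URIs; check them against the edition you are reading.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"


def find_alt_chunks(package_bytes: bytes,
                    main_part: str = "word/document.xml") -> list[str]:
    """Return the relationship ids of every altChunk element in main_part."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read(main_part))
    # An altChunk without an r:id attribute is reported as an empty string.
    return [el.get(f"{{{R_NS}}}id", "") for el in root.iter(f"{{{W_NS}}}altChunk")]


def build_demo_package() -> bytes:
    """Build a hypothetical package whose body is one altChunk plus one paragraph."""
    document = (
        f'<w:document xmlns:w="{W_NS}" xmlns:r="{R_NS}">'
        '<w:body><w:altChunk r:id="rId5"/><w:p/></w:body>'
        '</w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document)
    return buffer.getvalue()


chunk_ids = find_alt_chunks(build_demo_package())
print(chunk_ids)
```

Detecting the chunk is the easy part. The id tells you only that some foreign content exists; the specification leaves the conversion to “equivalent” WordprocessingML up to you.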
Captain Kirk, where are you? I need a Universal Translator!