The Carolino Effect

“There is it some game in this wood?”

Pedro Carolino wanted to write and publish a Portuguese/English phrase book.

“Another time there was plenty some black beasts and thin game, but the poachers have killed almost all.”

But there was one small problem: Carolino did not know English.

“Look a hare who run! let do him to pursue for the hounds! it go one’s self in the plonghed land.”

Undeterred, Carolino hatched a clever plan.

“Here that it rouse. let aim it! let make fire him!”

He had a copy of a Portuguese/French phrasebook, O Novo guia da conversação em francês e português by José da Fonseca. And he had a French/English dictionary.

“I have put down killed.”

With these two resources, writing his phrase book would be easy. Or so he thought.

“Me, i have failed it; my gun have miss fixe.”

Starting from the French half of the text in da Fonseca’s book, Carolino dutifully used his dictionary to translate, word-for-word, the French into English.

The result, O Novo Guia da Conversação, em Português e Inglês, em Duas Partes, was published in Paris in 1855 and is now considered a classic of unintentional humor.

“Here certainly a very good hunting.”

A similar problem occurs in DIS 29500 “Office Open XML”. The scope of OOXML, as amended by the BRM, is stated as:

This International Standard defines a set of XML vocabularies for representing word-processing documents, spreadsheets and presentations. The goal of this standard is, on the one hand, to represent faithfully the existing corpus of word-processing documents, spreadsheets and presentations that have been produced by Microsoft Office applications (from Microsoft Office 97 to Microsoft Office 2008 inclusive). It also specifies requirements for Office Open XML consumers and producers , and on the other hand, to facilitate extensibility and interoperability by enabling implementations by multiple vendors and on multiple platforms.

Faithful representation of Microsoft Office 97-2008. I’ve learned it is rarely polite to ask a man what he means by “faithful”, but let me make an exception here. We now have the binary Office format specifications, not part of the standard, but posted by Microsoft. And we have the OOXML specification. In what way does OOXML “represent faithfully” the “existing corpus” of legacy documents?

Does OOXML tell you how to translate a binary document into OOXML? No. Does it tell you how to map the features of legacy documents into OOXML? No. Does it give an implementor any guidance whatsoever on how to “represent faithfully” legacy documents? No. So it is both odd and unsatisfactory that the primary goal of the OOXML standard is so tenuously supported by its text.

Now, certainly, someone using the binary format specifications and the OOXML specification could string them together and attempt a translation, but the results would be neither consistent nor satisfactory. This is the Carolino Effect. Knowing the two endpoints is not the same as knowing how to correctly map between them. A faithful mapping requires knowledge not only of the two vocabularies, but also of how they interact.
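The word-for-word pivot can be sketched in a few lines of Python. The two tiny dictionaries below are hypothetical, invented purely for illustration and not taken from Carolino's sources, and the function name is likewise an assumption. The point is only that composing two lookups keeps the vocabulary and loses everything that depends on context.

```python
# Hypothetical toy dictionaries, invented for illustration only.
# "chove a cântaros" is a Portuguese idiom meaning roughly "it is raining heavily".
PT_TO_FR = {"chove": "il pleut", "a": "à", "cântaros": "cruches"}
FR_TO_EN = {"il": "he", "pleut": "rains", "à": "at", "cruches": "jugs"}


def pivot_translate(sentence, first, second):
    """Translate word by word through an intermediate language.

    Each source word is looked up in `first`; each word of that result is
    then looked up in `second`. Unknown words pass through unchanged.
    """
    out = []
    for word in sentence.split():
        middle = first.get(word, word)
        for piece in middle.split():
            out.append(second.get(piece, piece))
    return " ".join(out)


result = pivot_translate("chove a cântaros", PT_TO_FR, FR_TO_EN)
print(result)  # he rains at jugs
```

Both lookups succeed for every word, yet the idiom does not survive the trip. Nothing in either dictionary records how the words combine.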

Also, having the two specifications does not help with the 77 features in OOXML which are declared to be “implementation-defined” or “application-defined”. How are these translated from the binary formats?

Note that DIS 29500 bears the obvious marks of its legacy roots, from the use of VML and non-hierarchical run structures in WordprocessingML, to bit fields and idiosyncratic leap year calculations in SpreadsheetML. This suggests that the authors of this standard did not just sit down and design it from scratch, but that they in fact had access to the binary format specification and mapped it into XML as a preparatory step. It is difficult to explain the presence of elements such as “lineWrapLikeWord6” without positing the presence of such a mapping.
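The leap year quirk is a concrete case of legacy behaviour that a converter has to know about. As a minimal sketch, assuming the widely documented 1900 date system in which the nonexistent date 29 February 1900 is assigned serial number 60, a serial-to-date conversion might look like this. The function name is hypothetical and is not taken from the specification.

```python
from datetime import date, timedelta
from typing import Optional


def serial_to_date(serial: int) -> Optional[date]:
    """Convert a 1900-date-system serial number to a calendar date.

    Assumes serial 1 is 1 January 1900 and serial 60 is the phantom
    29 February 1900, for which None is returned because no real
    calendar date corresponds to it.
    """
    if serial < 1:
        raise ValueError("serial must be >= 1")
    if serial == 60:
        return None
    # After the phantom day, every serial is one higher than the true day count.
    offset = serial if serial < 60 else serial - 1
    return date(1899, 12, 31) + timedelta(days=offset)


print(serial_to_date(59))  # 1900-02-28
print(serial_to_date(61))  # 1900-03-01
```

An implementor who used an ordinary calendar library without this adjustment would be off by one day for every date after February 1900. That is the kind of detail a published mapping would state outright.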

Microsoft should simply publish this mapping. Without such a mapping, conversions will be inconsistent, interoperability will suffer and a primary goal of the standard will not be met. Given the same binary document, Microsoft Office, Apple iWork, OpenOffice.org, etc., will all produce different OOXML documents. How is this “faithfully representing” existing documents? What is needed is a canonical mapping.

Note that the initiation of an open source project to develop a converter between the binary formats and OOXML is insufficient. What is required is a canonical mapping. Otherwise we are faced with the reality that the true goal of OOXML is more accurately stated as:

To allow Microsoft the ability to represent their legacy documents in XML and pretend that it is a capability that other vendors can practice as well.

Though this issue was of great interest to several NBs, it could not be raised at the BRM for lack of time.

11 comments
  • Anonymous 2008/03/04, 17:34

    Submitted it to Slashdot. I wonder if this has any relation to that ‘My hovercraft is full of eels’ sketch?

  • Rob 2008/03/04, 17:39

    Hmmm… I wonder. Manuel, on Fawlty Towers, talks as if he could have been reading from this phrase book. So maybe there is a Python connection.

  • Rob Brown 2008/03/04, 17:39

    I’d go further, and say that with one endpoint (the binary file) and the mapping, the other endpoint becomes arbitrary.

    I fail to understand how the “existing corpus” of documents could not be represented 100% faithfully in Braille, Morse code, cave drawings, or even (shock) ISO26300 ODF, as long as the mappings are available, complete, and unambiguous.

  • Rob Brown 2008/03/04, 17:55

    Oh, crud… that last comment came out all wrong. It should have ended with something like “as long as any complete and unambiguous mapping is available, then any other should be easy to create. And by not providing any usable mapping, OOXML prevents any faithful representation.”

    /RobBrown crawls back under his rock

  • Anonymous 2008/03/05, 08:06

    In addition to the missing feature mapping, no information has been provided on which Office versions support which features.

    For example, it would be impossible to round-trip a file with DrawingML to an Office 2000 binary as there is no equivalent.

    NZ-0051 said that “The Specification contains the commands for every single feature that any Office version ever had, but does not tell the user which version.”

    In response Ecma said “it is not possible to create a simple one-to-one mapping between specific Office Open XML functionalities and specific versions of Microsoft Office applications because the features have evolved over time with each version of each application.”

    Now this statement of Ecma’s is just factually incorrect.

    Here’s the simple logic:

    1. All features appeared at a particular time in Microsoft Office files. These can be documented, categorised.

    2. If a file-format feature’s usage changed (using the same semantics) then this can be documented too. E.g., in Office ’97 font weight “bold” meant an OpenType weight of 300, but in Office 2003 it meant an OpenType weight of 400 (this is a hypothetical example).

    Without such a mapping the stated goals of OOXML appear more as advertising fluff than engineering criteria.

  • James 2008/03/05, 08:55

    Or that other classic Monty Python sketch “your wife has lovely boobies… bouncy bouncy” perhaps?

    Certainly the contents of that poor owner’s dictionary were about as much use to him as the current DIS29500 is to any office application implementor.

  • Matthew Holloway 2008/03/05, 10:42

    Oh, by the way. If you’re looking to visualize voting scenarios on OOXML try my http://iso-vote.com site. It lets you toggle votes on different countries to see how this might play out.

    (I’m the same guy who talked about the NZ-0051 comment)

  • Anonymous 2008/03/05, 13:02

    Other modern examples of this effect may come from poorly translated user manuals of imported products.

    In Canada it is frequent that the French portion of the manual is so totally mangled that French Canadians must rely on the English version to make any sense of it. In many cases some French Canadians are offended and refuse to purchase the product.

    It is also a huge source of unintentional humor. A classic is the “Made in Turkey” label translated as “Fabriqué en dinde”, confusing the country with the well-known edible bird.

  • Anonymous 2008/03/05, 16:05

    Brazil shares these concerns and provides context on what happened at the BRM regarding this mapping:

    “I have never seen a person so nervous and ashamed in my life… He said that Microsoft should have this mapping and if we want, we can ask it to Microsoft but not ask it to ECMA.”

    http://homembit.com/2008/03/at-the-end-what-we-did-in-geneva.html

  • John Mann 2008/03/05, 23:37

    I think that the MS binary file documentation gives you a forward mapping .doc => description.

    And MSOOXML gives you a forward mapping .docx => description.

    This isn’t sufficient to create a quality forward+reverse mapping from .doc => anything => .docx.

    Similarly, if Pedro Carolino only had a Portuguese=>French phrasebook and an English=>French phrasebook (but no French=>English phrasebook) his task would have been even harder.

  • Jay, writer Memberspeed.com 2008/03/10, 20:26

    Well, you certainly put that into terms people would understand. Honestly, I’m not really familiar with computer codes and the like but the story has sort of created an idea inside my head. I’m still trying to grasp the concepts. Still, good introduction.
