Lost in Translation

In the last installment I looked at the way the ODF Add-in for Word 2007 integrates into the Word UI. Now let’s drill down into an actual conversion and see what fidelity we get.

I downloaded the code from SourceForge and installed it on a machine running Office 2007 beta 2. The Add-in pre-reqs the .NET 2.0 runtime, an additional 22MB download. The current version only supports reading ODF documents, not writing, and only handles the word processor ODF format.

Now for fidelity. Since you may not all have Office 2007 beta 2 installed, I’m going to show you the fidelity via PDF exports. In all cases I manually verified that the PDF output was identical to what I saw on the screen. Every error is real; nothing was introduced by the PDF export process.

First up is a document I call “the sampler”. It has a little bit of all the basic word processor formatting: fonts, alignment, nested tables, graphics, other character sets, headers/footers, images, captions, etc. It is not intended to be a particularly hard test of document conversion, but a basic test of core functionality.

So, here is the sampler, in the original ODF format, as well as the PDF rendering of it in OpenOffice 2.0.3, where it was originally created.

I then exported that file from OpenOffice to Word format. This demonstrates the quality of conversion users already get when running OpenOffice. Here it is in DOC, and as a PDF exported after loading the DOC file in Word 2007 beta 2.

Good, but not perfect. Some differences:

the bullet point size is larger in Word than in OpenOffice
the nested table is collapsed into the main table in Word
the above table problem causes the table to take up more vertical space, pushing the graphic onto a second page

Again, that is the OpenOffice –> Word conversion we all have available for free today in open source code. Since DOC is a proprietary binary format with inadequate publicly-available documentation, this level of fidelity is impressive. So moving from ISO ODF to Draft Office Open XML should be that much easier, especially since the target format is voluminously documented (4,000 pages and growing), and the writers of the translator are receiving technical assistance from Microsoft.

Let’s take a look. From within Word 2007 (beta 2) I use the ODF Add-in to load the sampler ODF file, and get something that looks like this PDF.

I won’t characterize it but to say it fared less well than I expected. Problems include:

headers/footers dropped (data loss)
bullet list indentation ignored
number list indentation ignored
table dimensions messed up
caption for the graphics sized and positioned incorrectly

Whether these are all bugs or merely functional limitations is an interesting question. There is a Functional Specification document available on SourceForge for the Add-in which lists these requirements:

2.1.1.1. Basic Formatting

Here is the list of formatting items that the Add-in and command line translator would keep intact. The first 10 in the list are must haves and the last 4 (number 11 to 14) are good to have items of formatting.

1. Bold
2. Italics
3. Underline
4. Bulleting
5. Numbering
6. Indentation
7. Alignment (Left, Center, Right)
8. Font size
9. Font face
10. Tabs
11. Tables
12. Font color
13. Highlights
14. Background colors

Tables are “nice to have”? I’d hope so! This does not give me the impression that full fidelity is in their plans. Forget about scripts and macros. They are not even planning on tables or font colors. I hope I am wrong or misinterpreting their plans here, but that is the requirements document they have posted.

Comments

Ben Langhinrichs says

2006/07/16 at 8:29 am

Excellent post. Although, having spent a lot of my recent years writing conversions (Notes rich text to HTML/XHTML and back, etc.), I can tell you that it can be a very difficult task even for those who want to make it happen. For those who don’t…

By the way, your RSS feed seems to be going to a different blog, perhaps a personal blog?

Rob says

2006/07/16 at 3:56 pm

Thanks. Looks like Blogger’s default site link, done via a meta <$BlogSiteFeedUrl$> resolves to some poor soul’s personal blog of the same name. I’ve hardcoded the link to point to the correct one. If anyone has subscribed in the last couple of days, you’ll want to resubscribe.

You are correct that rich text is difficult to convert, especially where conceptual models differ. I did some work processing Notes RTF via the C++ API years ago.

However I don’t have much patience for those who blame it on unstated deficiencies of the format, and then fail to accomplish the basic level of fidelity already achieved by open source software.
