ODF

The OOXML Compatibility Pack

2006/09/06 By Rob

Just saw something worth noting. I was on a machine running Office XP and tried to open an Office Open XML (OOXML) formatted document. I don’t know why I tried that, but I did.

Word was smart enough to put up the following dialog:

Now, that is something I hadn’t seen before. I think we all knew that Microsoft was planning a compatibility pack for enabling OOXML on Office 2003 and Office XP. But my 2002 version of Windows XP knows about OOXML? I guess this wisdom must have come down in a previously downloaded Office patch.

In any case, if you click yes, you are directed to this page where you are offered a download of “Microsoft Office Compatibility Pack for Word, Excel, and PowerPoint 2007 File Formats (Beta 2)”. I had the pre-req’s, which included Windows XP SP2 and Office XP SP 3. So downloading a few file conversion filters should be simple and small, right?

Well, simple, but not so small. I was suprised to see that the convertors download was 43MB. That seems a bit large. In comparison, you can download a complete copy of OpenOffice.org, with included support for ODF documents and the Office binary formats, and the entire product is only a 93MB download. The 0.2 ODF Add-in for Word is only 1MB in size. So why does adding OOXML support to Office XP require a 43MB download?

In any case, once it is downloaded and installed, the integration with Office appears seemless. You can open OOXML files from the Windows Explorer by double-clicking on them, you can browse and load them as expected from the File Open dialog in Office, you can re-save files in OOXML format via the File Save, you can create a new document and save it as OOXML, you can even configure Word XP so the OOXML formats are the default format for all saved documents in Word. In fact, you can do all of those things that the Microsoft-supported ODF Add-in is not doing.

As reported earlier, the Microsoft support for ODF puts this ISO standard at a distinct disadvantage, providing no shell integration, removing it from its expected place in the File/Open and File/Save menus, and preventing users from making it the default format in Office.

So, let’s update the file format support matrix:

Criterion	DOC Format in OpenOffice	ODF Format in Word 2007	OOXML Format in Word XP
1. Format supported in default install	Yes.	No. Requires a download and install of separate, unsupported Add-in.	No, but you are prompted to download a free converter pack the first time you attempt to open an OOXML file
2. File Open integration	Yes.	No. ODF is not listed in the default File Open dialog and doing a Control-O will not show ODF documents. However, ODF import is available in a separate menu item elsewhere in the menu system.	Yes.
3. Save new document integration	Yes.	No. In fact no ODF save ability exists in the current version of the Add-in. There is a place holder for the ODF save operation, though it is on its own menu, and would not be shown when doing a simple Control-S to save a new document.	Yes.
4. Can be made the default format	Yes.	No. Although other non-Microsoft formats, such as “Plain Text” can be made the default format, ODF cannot.	Yes.
5. Simple round-tripping	Yes.	No. When an ODF document is loaded, its name is automatically changed and it is made read-only. So loading sampler.odt results in Word having a read-only version of sampler_tmp.docx. Attempting a simple Control-S to save will give an error.	Yes.
6. Shell integration	Yes.	No.	Yes.

I tip my hat to Microsoft for the way they have provided OOXML support in earlier versions of Office. Aside from the size of the download, the process was simple and the integration was seamless. That’s the way it should be. But what makes them think that customers using ODF format would want anything less than this? That fact that they’ve been able to integrate OOXML so well only increases the shame in having integrated ODF so poorly.

A Tale of Two Formats

2006/08/22 By Rob 16 Comments

As he stood staring at them, they asked him no questions, for his face told them everything.

‘I cannot find it,’ said he, ‘and I must have it. Where is it?’

His head and throat were bare, and, as he spoke with a helpless look straying all around, he took his coat off, and let it drop on the floor.

‘Where is my bench? I’ve been looking everywhere for my bench, and I can’t find it. What have they done with my work? Time presses: I must finish those shoes.’

They looked at one another, and their hearts died within them.

Charles Dickens, a careful student of human nature, provides us here a vivid portrait of Dr. Alexandre Manette, who, after being held 18 years in the Bastille, is released, but is unable to adjust to his new freedom, and in times of stress lapses back to the familiarity of his prison labors, making shoes.

We all have been prisoners of Microsoft Office and their proprietary file formats. You may no longer recognize it as a prison, because this cell has been your home for the past 15 years, but here is what it looks like:

Editing a document requires Microsoft Office.
Since Office runs only on Windows, you also require Windows
These restrictions lead to a purely heavy-client view of document processing.
This also leads to a model of programmability that emphasizes storing executable code (macros/script) inside of the document, resulting in years of security nightmares. Here is a typical recital of the known dangers.
If you don’t want to put script inside your document, you could access the data via Office automation API’s, but this again required a machine running Windows and Office.
It also emphasizes a view of WYSIWYG which emphasizes early formatting and layout decisions and de-emphasized semantic richness in documents. For example, see “What has WYSIWYG Done to Us?”.
The tools that were created for us to record our thoughts instead now constrain or even substitute for our thoughts. For example, “PowerPoint Panders to our Weaker Points” in the Guardian, and Tufte’s “PowerPoint is Evil”.
The above also lead to a stifled the market for 3rd party document processing tools. We will never see the value of what was never allowed to occur, but the opportunity cost of the innovation that did not happen in this single-vendor world is enormous.
This also lead to general lack of competition in the productivity editor market, leading to a decade of buggy products with little innovation. Is the “Ribbon” the most we can look forward to?
We’ve been locked into a one-size-fits-all offerings of bloated applications. Many people are over-served by Office and therefor are over-paying for functionality they do not need, while others are under-served by the resulting products they cannot afford.
Functionality has been arbitrarily segregated into three and only three application classes, “Spreadsheet”, “Word Processor” and “Presentation Graphics”.

The move from proprietary binary formats to new standard formats, like OpenDocument Format (ODF), is a movement from imprisonment to freedom. The technical constraints have been lifted, but have we really made the mental adjustments necessary to engage our new freedom? Or are we still silently pacing a 10-foot cell in our minds? If we merely recreate our cell walls in XML, then we are still prisoners.

I am a creature of habit and have been as much a prisoner as you have, so don’t look to me for all the answers. But I do have a few thoughts on what this new freedom might look like.

Instead of being opaque black boxes that can only be used on one vendor’s system, documents will be transparent. Anyone can access them using whatever operating system and whatever tools they want, and for any purpose they want. Python on Linux, REXX on AS/400, and C# on Windows will all have equal opportunity.

This also implies that document processing will no longer be restricted, technically or by license, to the desktop. Innovative things will occur on servers. We’re starting to see some of that with Google Docs and wikiCalc. But that is only the beginning. We will see search engines that can intelligently search content for specific MathML expressions, spiders that will collect and aggregate slides from presentations and allow you to share them, document repositories that will automatically check citations in papers and calculate the intellectual social networks these imply, stock brokers that will allow you to download your statements formatted in a spreadsheet, with additional analytics calculated via spreadsheet formulas. Creating, editing, reading, viewing, storing, collaborating will be able to be done anywhere, from your cellphone to the largest servers.

Since the server typically has access not only to your own documents, but your organization’s as well, as well as easy access to other information about the users, such as your role and group via LDAP, an application can drive workflows that relate the contents of the document to similar content, as well as to you organizational role, and to your business. The companies that unlock the knowledge stored by your knowledge workers in your organization’s documents will be the companies leading us into the next decade.

The old walls will fall that once segregated functionality into the arbitrarily defined boundaries of “Spreadsheet”, “Word processor”, and “Presentation graphics”. Dan Bricklin is leading the way with his wikiCalc. Is it a Spreadsheet or is it a Wiki? If you have to ask the question then you are still a prisoner. The point is wikiCalc is whatever Dan Bricklin wants it to be. That is freedom to innovate. We will see the arbitrary divisions between application genres become fuzzy and fall away as we all recognize our new freedom.

Document programmability will be turned inside-out. Instead of putting code inside of the document, turning documents into virus vectors, the code will be carefully segregated. Once the code and the data are distinct, we can put the code on the server, where it can be more easily managed, maintained, and secured. This clean separation of code and data will be as important to system stability and security as was protected-mode in the 80286 processor when it first enforced this data/code separation at operating system level. I see macro viruses becoming a thing of the past, like smallpox, because the importance of data/code separation will finally be enforced, and users will not be emailing around code disguised in documents.

We will start thinking of documents as data, and as inputs to modules that process data. I see visual design tools that will allow you to drag and drop a document template onto a design surface and expose various fields in the document which can be wired up to databases, web services or other data sources.

I see financial analysts creating financial models in spreadsheets, then converting the spreadsheet into a web application that can then be deployed anywhere to provide browser-based access and execution of the model via any browser.

I see a variety of productivity editors available at a variety of price points, from free, open source ones, to commercial offerings for desktop and other devices, to specialized offerings with extra features for vertical markets, like legal, medical, academic, or scientific uses.

I see an escape from documents-as-pictures, where users sweat over pixel-perfection and pray that the applications don’t screw them up. Today the end user doesn’t worry about font kerning. We rely on the font managers to get this right, and we accept the results, and concentrate on what we, the authors, add to the document. We are freed from that mental burden of kerning. But why stop there? With smarter applications, we will be freed of most or all formatting burdens. We will concentrate on writing, not on styling, and rely on the applications to get the appearance right. This will free our time to give an increased emphasis on semantic richness, putting our knowledge and experience and outlooks and opinions into the document, and encoding it in an way that allows new modes of collaboration and redefines what a document is.

That is a gimpse at what freedom looks like to me. But let’s not forget that being freed is not the same as being free. There are those out there who are attempting to merely recreate the same single-vendor closed system we’ve had for the past 10 years, and recoding it in XML. This may be a comfortable choice to those who have known no other way. But is it really freedom? I look out and see the jailer offering to sell 10-foot apartments to those just released from their 10-foot prison cells. Will you follow?

Change Log

1/30/2007 — updated wikiCalc link, made other assorted wording changes at my whim, corrected a spelling error, changed to curly quotes.

Four Shorts

2006/08/21 By Rob 2 Comments

I. OpenOffice.org Conference (OOOoCon 2006) in comming up, September 11-13th in Lyon, France. The last day starts with a panel discussion of ODF topics, and follows with a track dedicated to ODF. I’m on at 14:00 with a presentation with the exciting title, “A Technical Comparison: ISO/IEC 26300 vs Microsoft Office Open XML (Ecma International TC45 OOXML WD 1.3)”.

The abstract is:

Two XML office file formats have been pressing upon our attention, the OASIS OpenDocument Format, recently standardized by ISO, and the Draft Ecma Office Open XML. This presentation will review history of each, the process that created them, and examine each format to compare and contrast how they deal with issues such extensibility, modularization, expressivity, performance, reuse of standards, programability, ease of use, and application/OS neutrality.

II. KDE enthusiasts get together two weeks later, in Dublin, for their aKademy 2006. Tuesday the 26th will be OpenDocument Day. I’ll be there, and will give a lighting talk on something, probably related to some ODF programmability API ideas I’ve been having.

III. If you didn’t see it yet, Rick “Schemetron” Jelliffe has an interesting post over at O’Reilly looking at ODF and OOXML documents from the perspective of XML complexity metrics. This is a topic which Rick has done a good deal of work with in the past, so it is interesting to hear what he has to say. Did I see something there about OpenOffice loading documents faster than Office?

IV. The ODF Formula Subcommittee has set up a wiki page on our work defining OpenFormula. A lot of good information is there. This page will be updated with the latest status, so you’ll want to make it the first place to go for the latest info on our progress.

A Demo: Mathematica, MathML and ODF

2006/08/20 By Rob 6 Comments

Here’s a short tutorial on exchanging MathML between Mathematica and OpenOffice, showing what is possible today, and offering some suggestions for closer integration.

First, start with a new ODF document in OpenOffice. It is often easier to modify an existing document, inheriting its structure and default styles, than to create a new document from scratch. So I believe that a lot of interesting projects with ODF will start with an existing document as a template, and then add or replace content in it.

So, here’s what I made, a simple file with a formula describing the Euclidean metric, our old friend the Pythagorean Theorom. Click the image to load the ODF file.

If you rename the ODF file to a .zip extension, and unzip it, you can see the XML files it contains. Always start with the manifest.xml , for your convenience here, to which I draw your attention to the entry with the type “application/vnd.oasis.opendocument.formula”. This, according to Appendix C of the ODF 1.0 specification, is the registered MIME type of an ODF formula document. So that sounds like what we want. Let’s replace that equation with something else.

So into Mathematica we go. Suppose I want to calculate the indefinite double integral of the Euclidean metric. Why not? This is something I’d rather not do by hand, but I know Mathematica can quickly give me the answer:

Now I really don’t want to retype that result into OpenOffice. So, what can I do? I can use Mathematica’s ExpressionToMathML function to turn the above into MathML. When I do that I get MathML like this.

Let’s see now what happens if I simply drop that content in as a replacement for the original content.xml in the ODF file. Here’s what I get (click the image to open the ODF file):

So we got something, but it is not quite right. I’m seeing some little hollow boxes, usually an indication of an unprintable character. What’s up with this?

A closer look at the XML generated from Mathematica shows that these boxes are being displayed whenever the MathML uses the XML character entities corresponding to section 6.2.4 “Non-Marking Characters” of the MathML specification. This includes things like “InvisibleTimes” which handles cases where adjacency represents multiplication (xy == x*y). Using these characters provides hints to the application that can help it optimize its rendering and editing, but they should not be displayed.

In any case there appears to be a bug in OpenOffice 2.0.3 where it tries to display these characters and finds they don’t map to any printable Unicode character. No big deal, I will enter a bug report on that later. But for now I can easily clean this up by defining a new function in Mathematica, ExpressionToOO, defined as follows:

(Note I didn’t name this “ExpressionToODF”, since strictly speaking the ODF specification allows MathML 2.0, including the non-marking characters. This function is specifically to work around an OpenOffice bug. It outputs valid MathML, simply removing the non-marking characters which OO doesn’t understand.)

So, back to Mathematica, I run ExpressionToOO, grab that XML and inject that XML into the ODF document, and we get the following (click to open the ODF file):

That’s what we want! For those who are interested, the complete Mathematica notebook is here: Session.nb.

As you can see, this isn’t rocket science, though no doubt it may be useful to rocket scientists. Consider this a little “proof of concept”. Real end users will not be going around unzipping ODF documents and copying XML around. There needs to be some additional integration work to make this process simple and joyful. For example:

A Mathematica function that automatically inserts a formula into an ODF document
A OpenOffice add-in that lets the user automatically browser formulas from Mathematica and insert them into the current working document.
Clipboard level exchange of MathML between OpenOffice and Mathematica
An export filter from OpenOffice to export to the XHTML+MathML+SVG profile defined to the W3C. This, combined with Firefox, would provide kickass scientific publishing using open standards and tools.

Note that I’m using here Mathematica just as an example. There are over 100 MathML supporting applications out there, both commercial and open source. I’d be interested in hearing what other ideas people have for workflows involving ODF editors and other tools that work with the standards ODF includes, not just MathML, but SVG, XForms, etc. Let’s demonstrate the value of open standards working together.

Follow the Leader

2006/08/03 By Rob 13 Comments

David Wheeler, the chair of the OASIS ODF Formula Subcommittee has a good status update on our work defining the details of the expression language and supporting functions used in spreadsheet formulas. I’d also like to point out some cool work by Daniel Carrera, who put together some code that post-processes the OpenFormula specification (in ODF format, of course), extracts the details of the embedded test cases, then automatically generates an ODF spreadsheet file which executes the spreadsheet functions and verifies correct results. This resulting spreadsheet allows an implementation to automatically test their compliance to the spec. This gives us a self-testing specification, a great labor savings, as well as a demonstration of the innovative things you can do with ODF. Details are here.

(I note in passing that although the OASIS ODF TC does all of its working documents in ODF format, the Ecma TC45 does none of its working documents in OOXML. They continue to use the old proprietary Microsoft binary formats as their working format on the TC. A suggestion — If they are unable for some reason to use OOXML, then I encourage TC45 to use ISO ODF. They can then download Daniel’s code to help generate test cases from their spreadsheet formula documentation and this, I promise you, will save implementors a lot of time.)

Of course, malcontents will never be pleased by our progress, and will portray this progress as proof that we are yet imperfect, and therefore not useful. The first point is obvious, but the second is dubious.

Stephen McGibbon’s blog entry of a couple weeks ago seems to be the Urquelle of this particular line of reasoning. Here’s one small quote:

I mentioned that in my opinion, Sun were completely aware that ODF wasn’t sufficiently defined to support spreadsheet interoperability as long ago as February 2005 and that the realpolitik inside OASIS was to take advantage of the EU IDA’s request to standardise by rushing to be first despite knowing the ODF specification was deficient in at least this area.

Read the rest of his article and you’ll walk away with two misconceptions:

The lack of a spreadsheet formula definition in a file format documentation is unusual, defective and prevents interoperability
Spreadsheet formulas were left out because ODF standardization was rushed, for political reasons

Let’s take a look at each of these in turn.

First, let’s look at the state of the art in spreadsheet file format documentation over the years, with particular attention to how spreadsheet formulas have been documented. As the following table shows, Excel formulas have never been publicly specified, even though Microsoft has been producing file format documentation for various binary, HTML, XHTML and XML Excel formats for over 9 years. It was only after the ODF TC decided to document our spreadsheet formulas and formed a Subcommittee to do so that Ecma TC45 decided to follow. The FUD followed soon after.

Date	Format version	Formula status
1997	Excel 97 Developers Kit (Microsoft Press, 1997)	not defined
ca 1998	MSDN CD’s in this era had Office file format documentation	not defined
Jan 1999	Office 2000’s XHTML formats for Excel	not defined
May 2001	Office XP’s XMLSS format for spreadsheets	not defined
Nov 2003	Office 2003’s XML Schemas	not defined
Dec 2005	Microsoft submits initial “base document” to Ecma	not defined
January 2006	Ecma TC45’s Working Draft 1.1	not defined
February 2006	The OASIS ODF Formula Subcommittee is formed to add formula definition to the ODF specification
April 2006	Ecma TC45’s Working Draft 1.2	not defined
May 2006	Ecma TC45’s Working Draft 1.3	Mirabile dictu! After 9 years of ignoring it, Microsoft finally decides to start defining their spreadsheet formula language.

So the statement that the lack of a formula language specification is unusual or makes interoperability impossible falls down in the face of 9 years of contrary evidence. Over the years, the industry has managed to have interoperable spreadsheet formulas between different versions of Office as well as between Excel and competing spreadsheets, including 1-2-3, Quattro Pro, OpenOffice, StarOffice, etc., all without ever having a formula specification.

Even though every other spreadsheet file format specification in the past decade failed to document a spreadsheet formula language, the ODF TC knew that we could and should do better. That is why we took the lead and formed a Subcommittee to define, in great detail, with test cases, how spreadsheet formulas, expressions and functions should be interpreted. This is not fixing a problem. This is advancing the state of the art in file format specifications.

They say that imitation is the sincerest form of flattery. If so the ODF community should be blushing with all of this flattery heaped on it. If it wasn’t for the continual market pressure that our innovations bring, Microsoft would never have 1) issued a patent covenant for OOXML, 2)brought OOXML before a standards body, 3) started to document their spreadsheet formula language or 4) started to create an ODF Add-in for Office.

So, then what about the statement that ODF was rushed through the standardization process?

Let’s look at the numbers. Both ODF and OOXML are derived from pre-existing formats . This is not necessarily a bad thing. This is one source of “implementation experience” and this is beneficial to any standard to have this. But only once the “base document” is submitted to a multi-vendor open standards development organization (SDO) does the true work of standardization begin, including deep technical review of the specification to confirm completeness, conciseness, lack of ambiguity, correct use of formal specification language, ensuring platform independence, encourage flexibility and extensibility, etc. So, I’ll start the clock when the base specification is submitted to the SDO, and stop the clock when the SDO approves the standard.

The ODF numbers are clear enough since the 1.0 version is complete. The OOXML numbers require some estimation, since they are not complete, but I’ll justify my estimates this way:

The OOXML Working Draft 1.3 is currently 4,081 pages long. At the SC34 meeting in June we were told by the Ecma Secretary General that more material was coming and that this draft was only 2/3 complete. By my calculations, this gives a final size estimate of around 6,000 pages.
Predicting the completion date is harder. But we do know that Ecma specifications can only be approved twice a year at Ecma General Assembly which are in June and December. If I were Microsoft I’d really really really want OOXML approved in time for the Office 2007 launch, so I’m predicting Ecma approval will be sought at the December Ecma General Assembly.

Of course I could be wrong on either or both of those estimates, but let’s see where the logic takes us. The following table summarizes the time under standardization as well as the rate of standardization (pages/day) for each specification.

Standard	Submitted to SDO	Standard issued	Days elapsed	Standard length	Rate of work
ODF	12 Dec 2002	1 May 2005	867	706 pages	0.8 pages/day
OOXML	15 Dec 2005	31 Dec 2006 (est)	381 (est)	6000 pages (est)	15.6 pages/day (est)

Now I ask you, who is rushing? ODF took 2 ½ years to standardize 700 pages. Microsoft is trying to standardize a 6,000 page behemoth in just 1 year. I think the argument that ODF was rushed through under political pressure just doesn’t stand up to even cursory examination. Honestly, I think this FUD is being spread around as a smoke screen to hide the fact that OOXML is the one that is really being rushed.