ODF

ODF Plugfest — Brussels

2010/10/28 By Rob 4 Comments

A couple of weeks ago I was in Brussels to participate in the 4th ODF Plugfest. I planned on writing up a nice long post about it. But right when I started to draft this blog post, I came across an excellent article in LWN.net by Koen Vervloesem (Twitter @koenvervloesem): ODF Plugfest: Making office tools interoperable. Since his article is far better than what I would have written, I recommend that you go and read that article first, and then come back here for what meager additional scraps of insight I can add.

Go ahead. I can wait. I’ll be here when you get back.

The ODF Plugfest format is a two-day event. On one day engineers from the vendors work together, peer-to-peer, on interoperability testing, debugging, resolving issues, etc. This is done in a closed session, with no press present, and with a gentlemen’s agreement not to use information from this session to attack other vendors. We want the Plugfest to be a “safe zone” where vendors can do interoperability work where it is most needed, using unreleased software, alpha or beta code in some cases. For this to work we need an environment where engineers can do this work without fear that each bug in their beta product will be instantly maligned on the web. That would be counterproductive, since it would repel the very products we most need to have at Plugfests.

I think customers would be proud to see their vendors putting their differences aside for the day to work on interoperability. Although among the vendors and organizations present there were several fierce competitors, two parties to a prominent patent infringement lawsuit and both sides of a prominent fork of a popular open source project, you would not have guessed this if you watched the engineers collaborating at the Plugfest. A key part of this neutrality is that the Plugfests are sponsored and hosted by public sector parties, universities and non-profits. In this case we were hosted by the Flemish government.

So that was the first day of the Plugfest, and for the details I can say no more, for the reasons I’ve stated.

On the 2nd day we have a public session, with vendors, but also the press, local public sector IT people, local IT companies, etc. The program and presentations are posted. My presentation on ODF 1.2 is also up on my publications page.

A few leftover notes that I have not seen mentioned elsewhere:

We had great participation from AbiWord, where developers apparently have funding to work on their ODF 1.2 support.
DIaLOGIKa announced that after their next release they will no longer have funding from Microsoft to continue work on their ODF Add-in for Office. The code, however, will remain as open source. Since Oracle has commercialized the previously free Sun ODF Plugin, this means that there is no longer any free, actively developed means of getting ODF support on Office 2003. If you want ODF support on Office, you must upgrade to Office 2007 or Office 2010.
We saw some good demos of new ODF-supporting software, including LetterGen, OFS Collaboration Suite, ODT2EPub and odt2braille.
Itaapy announced that they were close to releasing a C++ version of their popular lpOD library (already available in Python).

In standards work, on committees with endless conference calls and endless draft specifications and the minutiae of clause and phrase, it is too easy to mistakenly view that narrow world as your customer. So when I attend events like this and see the rapid growth of ODF-supporting software and the innovative work that is happening among implementors, I return reinvigorated. These are the real customers. This is what it is all about. I’m already looking forward to the next ODF Plugfest.

ODF Ingredients

2010/10/05 By Rob 2 Comments

I think you will enjoy this graphic. Click for a larger view. This is a chart of all of the standards that ODF 1.2 refers to, what we standards geeks call “normative references”. A normative reference takes definitions and requirements from one standard and uses them, by reference, in another. It is a form of reuse, reusing the domain analysis, specification and review work that went into creating the other standard. Each reference is color coded and grouped by the organization that owns the referenced standard (W3C, IETF, ISO, etc.) and placed on a time line according to when that standard was published.

I’m sure each reader will note interesting patterns on their own, but a few things stood out in my mind when looking at this chart:

ODF is very much built on top of web and internet standards from the W3C and IETF. That is where the bulk of our references are from. This is true not only of the older stuff from the web’s initial standardization effort in 1998-2000, but also for more recent work like GRDDL, RDFa and XForms 1.1. As documents start living more of a dual-life, on the desktop and on the web (and even mobile), this web standards heritage of ODF will continue to open new doors for ODF implementors and users.
Except for a few bedrock standards like Unicode, ISO just doesn’t register. They simply are not doing a lot of relevant work in this area.
The chart offers a good response when you are faced with critics who claim that ODF is just based on what OpenOffice.org does. You can point out that OpenOffice.org was first released as open source in 2000 and via StarOffice had a proprietary history going back to 1984. So if ODF is merely a dump of what OpenOffice.org does, then why is ODF built on so many standards that did not exist in 2000? Does time travel explain it? Or maybe clairvoyance? Or maybe, just maybe, it is simply good engineering to reference relevant standards in your domain rather than reinvent a proprietary version of everything?

Is ODF Green?

2010/10/03 By Rob 11 Comments

Green IT is concerned with approaches to information technology that reduce the environmental impact from the manufacture, use and disposal of computers and peripherals. Occasionally I am asked whether Open Document Format (ODF) has any relationship to “Green IT”. This is an interesting question, and the fact that the question is asked at all suggests that Green IT goals are increasingly playing a central role in decision making.

When an organization migrates from Microsoft Office and its binary file formats (DOC/XLS/PPT) to ODF, it will immediately notice that ODF documents are much smaller than the corresponding Microsoft format documents. This is a benefit of the ZIP compression applied to the contents of ODF documents. It also reflects the fact that Microsoft-format documents, especially ones that have been edited and saved many times, tend to accumulate unused blocks that bloat the file’s storage.

As an experiment I went to a prominent government web site (the US President’s www.whitehouse.gov) and downloaded all DOC files that were at the site, 293 documents total. Then I converted each document into ODF format. The percent reduction from moving to ODF was 66% on average. Smaller documents mean less disk storage required, less bandwidth required to transfer documents, less bloating of mail files with document attachments, etc.
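For readers who want to repeat this kind of measurement on their own documents, here is a minimal sketch in Python. It is not the script behind the numbers above: the function names and the sample sizes are hypothetical, and it assumes that "average" means the mean of the per-document reductions rather than the reduction in total bytes.

```python
from statistics import mean


def percent_reduction(original_bytes: int, converted_bytes: int) -> float:
    """Percent by which the converted file is smaller than the original."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return 100.0 * (original_bytes - converted_bytes) / original_bytes


def average_reduction(size_pairs):
    """Mean of the per-document reductions for (original, converted) pairs."""
    return mean(percent_reduction(o, c) for o, c in size_pairs)


# Hypothetical file sizes in bytes: (DOC size, ODT size).
sample = [(100_000, 20_000), (50_000, 25_000), (200_000, 190_000)]
print(round(average_reduction(sample), 1))  # 45.0
```

Note that a mean of per-document percentages weights a small memo the same as a large report, so it can differ noticeably from the reduction in total storage.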

Looking at the results in more detail, however, shows a more complex picture. The following chart shows that although the average size reduction from moving to ODF was 66%, some documents were compressed 80% or more, while others were hardly compressed at all:

What is going on here? A look at a scatter plot of original DOC size versus ODF size more clearly shows the pattern:

You can see here two trend lines, one of documents that are barely compressed at all, and another one where the compression rate is high. Manual inspection of the poorly compressed documents indicates what is going on. Some of the documents are dominated by the size of embedded image files with high color depth and resolution. These images were already compressed, and so could not be compressed further, at least not by ODF’s ZIP compression. However, in some cases the image files were of a resolution unnecessary for screen or casual print output. Screen resolution is typically only 75 dpi. Attaching images at 300 dpi or more wastes space, unless you know you are targeting high-resolution photo-quality output. I think we’ve all been on the receiving end of an improbably large document, that when loaded contains relatively little content. Often the culprit is a multi-megabyte image, with only a small cropped portion showing, but the entire image is stored. There is nothing a document format can do to prevent user actions like this, but an intelligent editor (or plugin) could detect this and prompt the user to convert the image to a more appropriate resolution when saving.
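The cost of excess resolution is easy to quantify, because pixel count grows with the square of the dpi. The sketch below is only illustrative arithmetic: the function names and the 4 x 3 inch picture are made up, and the size of a compressed image on disk will not scale exactly with its pixel count.

```python
def pixel_dimensions(width_in: float, height_in: float, dpi: float):
    """Pixel width and height needed to cover a printed area at a given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def pixel_ratio(source_dpi: float, target_dpi: float) -> float:
    """How many times more pixels an image holds at source_dpi than at target_dpi."""
    return (source_dpi / target_dpi) ** 2


# A hypothetical 4 x 3 inch picture at print and at screen resolution.
print(pixel_dimensions(4, 3, 300))  # (1200, 900)
print(pixel_dimensions(4, 3, 75))   # (300, 225)
print(pixel_ratio(300, 75))         # 16.0
```

So an image stored at 300 dpi carries sixteen times the pixels that a 75 dpi screen can show for the same area.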

So in summary, yes, a move to ODF will cause your documents to be far smaller than they were before, and that has advantages in terms of storage and bandwidth consumption. But let’s be honest: when it comes to disk storage and bandwidth, documents are not your biggest problem. Graphics and video are far larger.

But if we look more broadly we see that the bigger Green advantage of ODF comes not only from the document size reduction, but from the alternatives ODF enables:

Replace a paper-based workflow with an all-electronic workflow
Replace a car or plane trip with electronic document-based collaboration
Use a word processor that can run on your existing hardware rather than upgrading everyone to new hardware so they can run the latest MS Windows/MS Office.
Use a less expensive word processor and by doing so free up resources to fund other Green initiatives in your workplace.

Postscript

So what about OOXML? Honestly, no one asked me that question before. I think that is a testament to the intelligence of my associates. “Is it Green to throw out your 2005 laptop, buy a new, likely high-energy consumption one, pay for Windows 7 and Office 2010, just so you can do the same work you did before?” I think the answer is obvious. Of course not. For 99% of us the limitation on our productivity is not whether we have the latest software and hardware. The limitation is our own skills and our working habits. A word processor with a flashier interface doesn’t make you write better or write faster. To think otherwise is to be like the amateur golf player who thinks that their game will improve, if only they have the latest (and most expensive) gear.

But to satisfy the curiosity of those who care about OOXML, let me give you the results of the same documents, as converted to the DOCX format. ODF still wins in this case. The ODF files are 18% smaller on average than the equivalent OOXML ones.

LibreOffice: The newest member of the ODF family

2010/09/28 By Rob 3 Comments

By now I’m sure you have all heard the news of the Document Foundation and LibreOffice. Personally, I’m still sorting this out. I have good friends, as well as good professional relations, on both sides of this split. They’re all “good guys” in my book and I’m proud to have worked with all of them over the years. I hope we can figure out some way for this collaboration to continue well into the future. But if forced to take sides, then my loyalties are clearly going to fall to ODF rather than to any one implementation. The ODF open standard transcends implementations and code bases. It is bigger than any one product. ODF is what enables the user to have choice.

So I am very pleased to read in their press release that the Document Foundation is firmly committed to the ODF standard. I encourage them to turn those words into actions and to join the OASIS ODF TC and to participate in the ODF Plugfests. As OASIS ODF TC Chair, I extend to them a warm welcome.

Both OpenOffice.org and LibreOffice are open source products under LGPL and like any fork there will initially be little difference between the products. But the open source communities behind them are very different. The Document Foundation has announced a more open community. This increased openness could enable great things, for example a better product, but this is not guaranteed. The challenge for the Document Foundation will be to take their greater openness and to rapidly grow a diverse membership of talented contributors and to evolve their open source product in a way that distinguishes itself from alternatives — open source and proprietary — on the market today. The key milestone I think will be if someday the Document Foundation can claim a headcount of developers that equals or exceeds that which Oracle has working on OpenOffice.org. In the end code talks, and developers write code.

This will be an interesting test of openness in action. This is as close as we have seen to “twins separated at birth”, a rare but key subject for studying the relative contribution of hereditary and environmental factors on the development of personal traits. With LibreOffice and OpenOffice.org we have a similar “experiment”, a separation of identical code bases, with the same license, only varying the openness of the community. However this may turn out we will learn much from it.

On the other hand, I am also mindful that behind every set of twins separated at birth there is a sad story, and science’s gain sometimes comes from misfortune. This is true as well for LibreOffice and OpenOffice.org. Although we will learn much from the parallel evolution of these two projects, I think it would have been far better if this split had not been necessary, if circumstances had allowed us to all work together on the goals that we, for the most part, all share.

ODF 1.2 Word Clouds

2010/07/29 By Rob 3 Comments

I’ve been playing around today with a preview build of the ODF Java API ODFDOM 0.9. One of the capabilities we’re adding is a simple text extraction API.

The idea is to have a very simple API, a single function call in fact, that will allow you to extract the plain text from an ODF document. So strip all formatting, all layout and just return the text. At first you might think this is rather useless, but further reflection shows that it has myriad uses, including accessibility, search indexing, collaborative filtering, and text analytics in general.

Extracting text from ODF is pretty simple. There are a handful of special cases to watch out for. One example is a single word that has mixed styles, e.g.: ODFDOM. In ODF this looks like:
<text:span text:style-name="style1">ODF</text:span><text:span text:style-name="style2">DOM</text:span>

We want text extraction to come out as “ODFDOM” not “ODF DOM” with a space.

On the other hand, there are other examples of adjacent elements, like with footnote citations, where we need to insert a space to prevent two adjacent strings from being conflated.
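To make the mixed-style case concrete, here is a minimal sketch using only the Python standard library. It is not the ODFDOM API, which is Java and is not shown in this post; the function name `extract_text` and the sample paragraph are made up for illustration. It deliberately ignores the harder cases, such as footnote citations, `text:s` space elements, tabs and line breaks.

```python
import xml.etree.ElementTree as ET

# Namespace of the ODF text elements (text:p, text:h, text:span).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
PARAGRAPH_TAGS = {f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"}


def extract_text(content_xml: str) -> str:
    """Return one line of plain text per paragraph or heading."""
    root = ET.fromstring(content_xml)
    lines = []
    for elem in root.iter():
        if elem.tag in PARAGRAPH_TAGS:
            # itertext() yields the text of nested spans in document order;
            # joining with "" keeps adjacent styled runs together as one word.
            lines.append("".join(elem.itertext()))
    return "\n".join(lines)


# Hypothetical paragraph: one word split across two differently styled spans.
sample = (
    f'<text:p xmlns:text="{TEXT_NS}">'
    '<text:span text:style-name="style1">ODF</text:span>'
    '<text:span text:style-name="style2">DOM</text:span>'
    " is a Java API</text:p>"
)
print(extract_text(sample))  # ODFDOM is a Java API
```

For a real document, the same function could be fed the `content.xml` stream read from the ODF package with `zipfile.ZipFile(path).read("content.xml")`. A paragraph nested inside another paragraph (for example in a footnote) would be emitted twice by this naive loop, which is exactly the kind of special case a real extraction API has to handle.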

Overall, the build I used looks pretty good, and works the same across text, spreadsheets and presentations.

So I was looking this afternoon for something I could use to demo this new capability. I thought of using Jonathan Feinberg’s excellent Wordle applet (which I wrote about a while back). This applet creates a word cloud, based on word frequency of text you feed it. As a torture test I decided to feed it the text of ODF 1.2 Committee Draft 05, the version that is currently out for public review.

This is what I got for results.

Part 1 is the annotated schema for ODF. As expected, the key words are those referring to XML markup concepts like “attribute” and “element”:

Part 2 is OpenFormula, the spreadsheet formula expression language. No XML in this part. In fact, this looks more like what I’d expect from an excerpt from a programming language specification, which is pretty much what OpenFormula is.

And Part 3 is the packaging specification.

In the end text extraction is just the data preparation step. The real fun happens after, with the analysis and visualization techniques that can be applied to the text once extracted.
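As a small taste of that analysis step, the sketch below counts word frequencies, which is the raw input a word cloud needs. It is only an approximation of the idea: the stop-word list and the function name are invented here, and it makes no claim about how Wordle itself tokenizes or weights words.

```python
import re
from collections import Counter

# A tiny, hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({"the", "a", "an", "of", "and", "is", "in", "to", "for"})


def word_frequencies(text: str, top_n: int = 10):
    """Return the top_n most frequent non-stop words as (word, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(top_n)


sample = (
    "The attribute of the element. "
    "An element has an attribute and a child element."
)
print(word_frequencies(sample, top_n=2))  # [('element', 3), ('attribute', 2)]
```

Fed the extracted text of a schema specification, a count like this would be expected to surface the same markup vocabulary that dominates the Part 1 cloud.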

If anyone is interested in trying out the text extraction module, please let me know. We’re aiming for a release of ODFDOM 0.9 toward the end of August, but I can probably get you a preview, if you are interested in testing. And let me know if you have any brilliant ideas of what to do with the extracted text. I’m always looking for good demo material.