Tuesday, May 13, 2008
Spreadsheet file format performance
I've been doing some performance timings of file format support, comparing MS Office and OpenOffice. Most of the results are as expected, but some are surprising, and one in particular is quite disappointing.
But first, a few details of my setup. All timings, done by stopwatch, were from Office 2003 and OpenOffice 2.4.0 running on Windows XP, with all current service packs and patches. The machine is a Lenovo T60p, dual-core Intel 2.16 GHz and 2 GB of RAM. I took all the standard precautions: the disk was defragmented, and test files were confirmed as defragmented using contig. No other applications were running and background tasks were all shut down.
For test files, I went back to an old favorite, George Ou's (at the time with ZDNet) monster 50MB XLS file from his series of tests back in 2005. This file, although very large, is very simple. There are no formulas, indeed no formatting or styles. It is just text and numbers, treating a spreadsheet like a giant data table. So tests of this file will emphasize the raw throughput of the applications. Real world spreadsheets will typically be worse than this due to additional overhead from processing styles, formulas, etc.
A test of a single file is not really that interesting. We want to see trends, see patterns. So I made a set of variations on George's original file, converting it into ODF, XLS and OOXML formats, as well as making scaled down versions of it. In total I made 12 different sized subsets of the original file, ranging down to a 437KB version, and created each file in all three formats. I then tested how long it took to load each file in each of the applications. In the case of MS Office, I installed the current versions of the translators for those formats, the Compatibility Pack for OOXML, and the ODF Add-in for the ODF support.
I find it convenient to report numbers per 100,000 spreadsheet cells. You could equally well use the original XLS spreadsheet size, or the number of rows of data, or any other correlated variable as the independent variable, but values per 100K cells are simple for anyone to understand.
I'll spare you all the pretty pictures. If you want to make some, here is the raw data (CSV format). But I will give some summary observations.
For document sizes, the results are as follows:
- Binary XLS format = 1,503 KB per 100K cells
- OOXML format = 491 KB per 100K cells
- ODF format = 117 KB per 100K cells
Any ideas?
For load time, the times for processing the binary XLS files were:
- Microsoft Office 2003 = 0.03 seconds per 100K cells
- OpenOffice 2.4.0 = 0.4 seconds per 100K cells
So what about the new XML formats? There has been recent talk about the "Angle Bracket Tax" for XML formats. How bad is it?
- Microsoft Office 2003 with OOXML = 1.5 seconds per 100K cells
- OpenOffice 2.4.0 with ODF = 2.7 seconds per 100K cells
OK. So what are we missing? Ah, yes, ODF format in MS Office, using their ODF Add-in.
- Microsoft Office 2003 with ODF, using the ODF Add-in = 74.6 seconds per 100K cells
In absolute terms, the load times for a single one of the test files were:
- Microsoft Office 2003 in XLS format = 0.75 seconds
- OpenOffice 2.4.0 in XLS format = 3.03 seconds
- Microsoft Office 2003 in OOXML format = 8.28 seconds
- OpenOffice 2.4.0 in ODF format = 14.09 seconds
- Microsoft Office 2003 in ODF format = 515.60 seconds
(I was not able to test files larger than this using the ODF Add-in since they all crashed.)
(Update: Since it is the question everyone wants to know, the beta OpenOffice 3.0 opens the OOXML version of that file in 49.4 seconds, over 10x faster than MS Office loads the ODF document.)
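The ratios behind these comparisons can be recomputed from the figures quoted above. This is only a minimal Python sketch: the dictionary keys are labels made up for the illustration, and the numbers are copied straight from the lists and the update note.

```python
# Load times reported above, in seconds per 100K cells.
per_100k_cells = {
    "MS Office 2003, XLS": 0.03,
    "OpenOffice 2.4.0, XLS": 0.4,
    "MS Office 2003, OOXML": 1.5,
    "OpenOffice 2.4.0, ODF": 2.7,
    "MS Office 2003, ODF Add-in": 74.6,
}

# Absolute load times reported above for one test file, in seconds.
single_file_seconds = {
    "MS Office 2003, XLS": 0.75,
    "OpenOffice 2.4.0, XLS": 3.03,
    "MS Office 2003, OOXML": 8.28,
    "OpenOffice 2.4.0, ODF": 14.09,
    "MS Office 2003, ODF Add-in": 515.60,
    "OpenOffice 3.0 beta, OOXML": 49.4,
}


def ratio(slower, faster):
    """How many times longer the slower load took."""
    return slower / faster


# ODF in MS Office (via the Add-in) versus ODF in OpenOffice, per 100K cells.
addin_vs_openoffice_per_cell = ratio(
    per_100k_cells["MS Office 2003, ODF Add-in"],
    per_100k_cells["OpenOffice 2.4.0, ODF"],
)

# The same comparison for the single test file.
addin_vs_openoffice_file = ratio(
    single_file_seconds["MS Office 2003, ODF Add-in"],
    single_file_seconds["OpenOffice 2.4.0, ODF"],
)

# ODF in MS Office versus OOXML in the OpenOffice 3.0 beta, same file.
addin_vs_ooo3_beta = ratio(
    single_file_seconds["MS Office 2003, ODF Add-in"],
    single_file_seconds["OpenOffice 3.0 beta, OOXML"],
)

print(f"per 100K cells: {addin_vs_openoffice_per_cell:.1f}x")
print(f"single file:    {addin_vs_openoffice_file:.1f}x")
print(f"vs OOo 3 beta:  {addin_vs_ooo3_beta:.1f}x")
```

The per-cell and single-file ratios differ because the per-100K figures summarize the whole series of file sizes, while the absolute times come from one file.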
This is one reason why I think file format translation is a poor engineering approach to interoperability. When OpenOffice wants to read a legacy XLS file, it does not approach the problem by translating the XLS into an ODF document and then loading the ODF file. Instead it simply loads the XLS file, via a file filter, into the internal memory model of OpenOffice.
What is a file filter? It is like 1/2 of a translator. Instead of translating from one disk format to another disk format, it simply loads the disk format and maps it into an application-specific memory model that the application logic can operate directly on. This is far more efficient than translation. This is the untold truth that the layperson does not know. But this is how everyone does it. That is how we support formats in SmartSuite. That is how OpenOffice does it. And that is how MS Office does it for the file formats they care about. In fact, that is the way that Novell is doing it now, since they discovered that the Microsoft approach is doomed to performance hell.
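To make the difference concrete, here is a toy Python sketch. Everything in it is hypothetical: the "foreign" format is a made-up `cell=value` text format, the "native" format is JSON, and the memory model is a plain dictionary. It does not reflect how any real office suite is implemented; it only shows that the translator path does the same parsing work as the filter and then adds a serialize-to-disk and re-parse step on top.

```python
import json
import os
import tempfile


def parse_foreign(text):
    """Parse the toy foreign format (one 'cell=value' pair per line) into the memory model."""
    model = {}
    for line in text.splitlines():
        if line.strip():
            cell, value = line.split("=", 1)
            model[cell.strip()] = value.strip()
    return model


def load_native(path):
    """Load the toy native format (JSON on disk) into the memory model."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def load_via_filter(text):
    """File filter: foreign disk format -> memory model, in one step."""
    return parse_foreign(text)


def load_via_translator(text):
    """Translator: foreign format -> native file on disk -> memory model."""
    model = parse_foreign(text)                 # same work the filter does
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(model, f)                 # extra step: serialize to disk
        return load_native(path)                # extra step: parse it again
    finally:
        os.remove(path)
```

Both functions return the same model for the same input, for example `load_via_filter("A1=1\nB1=hello")`. The only difference is the amount of work done on the way.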
So it is with some amusement that I watch Microsoft and others propose translation as a solution to interoperability, creating reports about translation, even a proposal for a new work item in JTC1/SC34 concerning file format translation, when the single concrete attempt at translation is such an abysmal failure. It may look great on paper, but it is an engineering disaster. What customers need is direct, internal support for ODF in MS Office, via native code, in a file filter, not a translator that takes 10 minutes to load a file.
The astute engineer will agree with the above, but will also feel some discomfort at the numbers. There is more here than can be explained simply by the use of translators versus import filters. That choice might explain a 2x difference in performance. A particularly poor implementation might explain a 5x difference. But none of this explains why MS Office is almost 40x slower in processing ODF files. Being that much slower is hard to do accidentally. Other forces must be at play.
Any ideas?
Labels: ODF, OOXML, Performance
Wednesday, May 07, 2008
Achieving the impossible

Unadulterated copy of James Clark's Relax NG validator jing. Unadulterated copy of Kohsuke Kawaguchi's Sun Multi-Schema Validator msv. Unadulterated copy of the ODF 1.0 Relax NG schema. Unadulterated copy of the ODF 1.0 Standard, in ODF format.
No errors from either validator.
msv is so good as to tell us "the document is valid". But jing indicates success with only silence. So will I.
Labels: ODF
Monday, May 05, 2008
The Challenge
<?xml version="1.0" encoding="UTF-8"?>
<office:document-content
xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
office:version="1.0">
<office:body>
<office:text>
<text:p>Dear Alex Brown. Please prove that I am invalid ODF 1.0 (ISO 26300:2006). I do not think that I am. In fact I think that your statement that there are no valid ISO ODF documents in the world, and that there cannot be, is a brash, irresponsible and indefensible piece of bombast that you should retract.</text:p>
<text:p>(Please note that this document contains no ID, IDREF or IDREFS attributes. Nor does it contain custom content.)</text:p>
</office:text>
</office:body>
</office:document-content>
Friday, May 02, 2008
ODF Validation for Dummies
[Updated 4 May 2008, with additional rebuttal at the end]
Alex Brown has a problem. He can't figure out how to validate ODF documents. Unfortunately, when he couldn't figure it out, he didn't ask the OASIS ODF TC for help, which would have been the normal thing to do. Indeed, the ODF TC passed a resolution back in February 2007 that said, in part:
So it is rather uncollegial of Alex to refuse such an open, transparent way of getting his questions answered. But Alex didn't avail himself of that avenue. He just assumed if he couldn't figure out how to validate ODF then it simply couldn't be done, and that ODF was to blame. This is presumptuous. Does he think that in the three years since ODF 1.0 became a standard, that no one has tried to validate a document?
Alex is so sure of himself that he publicly exults on the claimed significance of his findings:
Of course, I've been known to pontificate as well. There is nothing necessarily wrong with that. The difference here is that Alex Brown is totally wrong.
But let's see if we can help show Alex, or anyone else similarly confused, the correct way to validate an ODF document.
First start with an ODF document. When Alex tested OOXML, he used the Ecma-376 OOXML specification. Let's do the analogous test and validate the ODF 1.0 text. You can download it from the OASIS ODF web site. You'll want this version of the text, ODF 1.0 (second edition), which is the source document for the ISO version of ODF.
You'll also want to download the Relax NG schema files for OASIS ODF 1.0, which you can download in two pieces: the main schema, and the manifest schema.
Next you'll need to get a Relax NG validator. Alex recommends James Clark's jing, so we'll use that. I downloaded jing-20030619.zip, the main distribution for use with the Java Runtime Environment. Unzip that to a directory and we're almost there.
Since jing operates on XML files and knows nothing about the Zip package structure of an ODF file, you'll need to extract the XML contents of the ODF file. There are many ways to do this. My preference, on Windows, is to associate WinZip with the ODF file extensions (ODT, ODS and ODP) so I can right-click on these files and unzip them. When you unzip you will have the following XML files (content.xml, styles.xml, meta.xml, settings.xml and META-INF/manifest.xml), along with directories for image files and other non-XML resources you can ignore. To validate content.xml, run:
java -jar c:/jing/bin/jing.jar OpenDocument-schema-v1.0-os.rng content.xml
(Your command may vary, depending on where you put jing, the ODF schema files and the unzipped ODF files)
The result is a whole slew of error messages:
C:\temp\odf\OpenDocument-schema-v1.0-os.rng:17658:18: error: conflicting ID-types for attribute "targetElement" from namespace "urn:oasis:names:tc:opendocument:xmlns:smil-compatible:1.0" of element "command" from namespace "urn:oasis:names:tc:opendocument:xmlns:animation:1.0"
C:\temp\odf\OpenDocument-schema-v1.0-os.rng:10294:22: error: conflicting ID-types for attribute "targetElement" from namespace "urn:oasis:names:tc:opendocument:xmlns:smil-compatible:1.0" of element "command" from namespace "urn:oasis:names:tc:opendocument:xmlns:animation:1.0"
Oh no! Emergency, emergency, everyone to get from street!
I wonder if this is one of the things that tripped Alex up? Take a deep breath. These in fact are not Relax NG (ISO/IEC 19757-2) errors at all, but errors generated by jing's default validation of a different set of constraints, defined in the Relax NG DTD Compatibility specification which has the status of a Committee Specification in OASIS. It is not part of ISO/IEC 19757-2.
Relax NG DTD Compatibility provides three extensions to Relax NG: default attribute values, ID/IDREF constraints and a documentation element. The Relax NG DTD Compatibility specification is quite clear in section 2 that "Conformance is defined separately for each feature. A conformant implementation can support any combination of features." And in fact, ODF 1.0, in section 1.2 does just that: "The schema language used within this specification is Relax-NG (see [RNG]). The attribute default value feature specified in [RNG-Compat] is used to provide attribute default values".
It is best to simply disable the checking of Relax NG DTD Compatibility constraints by using the documented "-i" flag in jing. If you want to validate ID/IDREF cross-references, then you'll need to do that in application code, and not using jing in Relax NG DTD Compatibility mode. Note that jing was not complaining about any actual ID/IDREF problem in the ODF document.
So, false alarm. You can walk safely on the streets now.
(That said, if we can make some simple changes to the ODF schemas that will allow it to work better with the default settings of jing, or other popular tools, then I'm certainly in favor of that. Alex's proposed changes to the schema are reasonable and should be considered.)
So, let's repeat the validation with the -i flag:
java -jar c:/jing/bin/jing.jar -i OpenDocument-schema-v1.0-os.rng content.xml
Zero errors, zero warnings.
java -jar c:/jing/bin/jing.jar -i OpenDocument-schema-v1.0-os.rng styles.xml
Zero errors, zero warnings.
java -jar c:/jing/bin/jing.jar -i OpenDocument-schema-v1.0-os.rng meta.xml
Zero errors, zero warnings.
java -jar c:/jing/bin/jing.jar -i OpenDocument-schema-v1.0-os.rng settings.xml
Zero errors, zero warnings.
java -jar c:/jing/bin/jing.jar -i OpenDocument-manifest-schema-v1.0-os.rng META-INF/manifest.xml
Zero errors, zero warnings.
So, there you have it, an example that shows that there is at least one document in the universe that is valid to the ODF 1.0 schema, disproving Alex's statement that "there are no XML documents in existence which are valid to ISO ODF."
The directions are complete and should allow anyone to validate the ODF 1.0 specification, or any other ODF 1.0 document. Now that we have the basics down, let's work on some more advanced topics.
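The manual steps above (unzip the package, then run jing with -i on each XML part) could be scripted roughly as follows. This is a sketch, not a tested tool: the jing path is the one from my command lines, the schema file names are the ones downloaded earlier, and the function names are made up for the example.

```python
import zipfile
from pathlib import Path

MAIN_SCHEMA = "OpenDocument-schema-v1.0-os.rng"
MANIFEST_SCHEMA = "OpenDocument-manifest-schema-v1.0-os.rng"

# The XML parts validated above, mapped to the schema each one is checked against.
PARTS = {
    "content.xml": MAIN_SCHEMA,
    "styles.xml": MAIN_SCHEMA,
    "meta.xml": MAIN_SCHEMA,
    "settings.xml": MAIN_SCHEMA,
    "META-INF/manifest.xml": MANIFEST_SCHEMA,
}


def extract_xml_parts(odf_path, out_dir):
    """Unzip only the XML parts listed in PARTS; return the names that were found."""
    out_dir = Path(out_dir)
    found = []
    with zipfile.ZipFile(odf_path) as package:
        names = set(package.namelist())
        for part in PARTS:
            if part in names:
                package.extract(part, out_dir)
                found.append(part)
    return found


def jing_command(part, jing_jar="c:/jing/bin/jing.jar"):
    """Build the command line for one part, with DTD Compatibility checks disabled (-i)."""
    return ["java", "-jar", jing_jar, "-i", PARTS[part], part]
```

Each list returned by `jing_command` can be handed to `subprocess.run` with the working directory set to the folder that holds both the extracted parts and the schema files, which is the layout the command lines above assume.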
First, the reader should note that there are two versions of the ODF schema, the original 1.0 from 2005, and the updated 1.1 from 2007. (There is also a third version underway, ODF 1.2, but that needn't concern us here.)
An application, when it creates an ODF document, indicates which version of the ODF standard it is targeting. You can find this indication if you look at the office:version attribute on the root element of any ODF XML file. The only values I would expect to see in use today would be "1.0" and "1.1". Eventually we'll also see "1.2".
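Reading that indication takes only a few lines. A minimal Python sketch, using the standard library's ElementTree; the function name is made up, and the namespace URI is the office namespace used in ODF documents.

```python
import xml.etree.ElementTree as ET

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def odf_version(xml_text):
    """Return the office:version attribute of the root element, or None if it is absent."""
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced attribute names as "{namespace}local".
    return root.get(f"{{{OFFICE_NS}}}version")
```

Pass it the text of content.xml, styles.xml, meta.xml or settings.xml from the unzipped package.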
It is important to use the appropriate version of the ODF schema to validate a particular document. Our goal, as we evolve ODF, is that an application that knows only about ODF 1.0 should be able to adapt and "degrade gracefully" when given an ODF 1.1 document, by ignoring the features it does not understand. But an application written to understand ODF 1.1 should be able to fully understand ODF 1.0 documents without any additional accommodation.
Put differently, from the document perspective, a document that conforms to ODF 1.0 should also conform to ODF 1.1. But the reverse direction is not true.
To accomplish this, as we evolve ODF, within the 1.x family of revisions, we try to limit ourselves to changes that widen the schema constraints, by adding new optional elements, or new attribute values, or expanding the range of values permitted. Constraint changes that are logically narrowing, like removing elements, making optional elements mandatory, or reducing the range of allowed values, would break this kind of document compatibility.
Now of course, at some point we may want to make bolder changes to the schema, but this would be in a major release, like a 2.0 version. But within the ODF 1.x family we want this kind of compatibility.
The net of this is, an ODF 1.1 document should only be expected to be valid to the ODF 1.1 schema, but an ODF 1.0 document should be valid to the ODF 1.0 and the ODF 1.1 schemas.
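Stated as code, the rule is simply "this version and every later one in the 1.x family". A small Python sketch; the list covers only the two versions in use today and the function name is made up.

```python
# Schema versions in release order within the ODF 1.x family discussed here.
SCHEMA_VERSIONS = ["1.0", "1.1"]


def schemas_expected_to_accept(document_version):
    """Schema versions that a valid document of the given version should validate against."""
    start = SCHEMA_VERSIONS.index(document_version)
    return SCHEMA_VERSIONS[start:]
```

So `schemas_expected_to_accept("1.0")` gives both schemas, while `schemas_expected_to_accept("1.1")` gives only the 1.1 schema.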
That's enough theory! Let's take a look now at the test that Alex actually ran. It is a rather curious, strangely biased kind of test, but the bad thinking is interesting enough to devote some time to examine in some detail.
When he earlier tested OOXML, Alex used the OOXML standard itself, a text on which Microsoft engineers had lavished many person-years of attention for the past 18 months, and he validated it with the current version of the OOXML schema. That is pretty much the best case, testing a document that has never been out of Microsoft's sight for 18 months and testing it with the current version of the schema. I would expect that this document would have been a regular test case for Microsoft internally, and that its validity has been repeatedly and exhaustively tested over the past 18 months. I know that I personally tested it when Ecma-376 was first released, since it was the only significant OOXML document around. So, essentially Alex gave OOXML the softest of all soft pitches.
I think Microsoft's response, that the validity errors detected by Alex are due to changes made to the schema at the BRM, is a reasonable and accurate explanation. The real story on OOXML standardization is not how many changes were made that were incompatible with Office 2007, but how few. It appears that very few changes, perhaps only one, will be required to make Office 2007's output be valid OOXML.
So when testing ODF, what did Alex do? Did he use the ODF 1.0 specification as a test case, a document that the OASIS TC might have had the opportunity to give a similar level of attention to? No, he did not, although that would have validated perfectly, as I've demonstrated above. Instead, Alex uses the OOXML specification, a document which by his own testing is not valid OOXML, then converts it into the proprietary .DOC binary format, then translates that binary format into ODF and then tries to validate the results with the ODF 1.0 schema (i.e., the wrong version of the ODF schema since OpenOffice 2.4.0's output is clearly declared as ODF 1.1), and then applies a non-applicable, non-standard DTD Compatibility constraint test during the Relax NG validation.
Does anyone see something else wrong with this testing methodology?
Aside from the obvious bias of using an input document that Microsoft has spent 18 months perfecting, and using the wrong schemas and validator settings, there is another, more subtle problem.
Alex's test of OOXML and ODF are testing entirely different things. With OOXML, he took a version N (Ecma-376) OOXML document and tried to validate it with a version N+1 (ISO/IEC 29500) version of the OOXML schema.
But what he did with ODF was take a version N+1 (ODF 1.1) document and try to validate it with a version N (ODF 1.0) ODF schema.
These are entirely different operations. One test is testing the backwards compatibility of the schema, the other is testing the backwards compatibility of document instances. It takes no genius to figure out that if ODF 1.1 adds new elements, then an ODF 1.1 document instance will not validate with the ODF 1.0 schema. We don't ordinarily expect backwardly compatible validity of document instances. Again, Alex's tests are biased in OOXML's favor, giving ODF a much more difficult, even impossible task, compared to the test run for OOXML.
If we want to compare apples to apples, it is quite easy to perform the equivalent test with ODF. I gave it a try, taking a version N document (the ODF 1.0 standard itself, per above) and validating it with the version N+1 schema (ODF 1.1 in this case). It worked perfectly. No warnings, no errors.
In any case, in his backwards test Alex reports 7,525 errors, "mostly of the same type (use of an undeclared soft-page-break element)" when validating the OOXML text with ODF 1.0 schema. Indeed, all but 39 of these errors are reports of soft-page-break.
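A breakdown like this (7,525 errors, of which 7,525 - 39 = 7,486 are soft-page-break reports) is easy to get by grouping jing's output by message. A rough Python sketch, assuming the output has been captured to a string and that each line has the `file:line:column: severity: message` shape seen in the error listing earlier; the function name is made up.

```python
import re
from collections import Counter

# One report per line: <file>:<line>:<column>: <severity>: <message>
# The optional "[A-Za-z]:" prefix allows for a Windows drive letter in the file name.
REPORT_LINE = re.compile(
    r"^(?P<file>(?:[A-Za-z]:)?[^:]+):(?P<line>\d+):(?P<col>\d+): "
    r"(?P<severity>\w+): (?P<message>.*)$"
)


def tally_errors(report):
    """Count error reports by message text, ignoring lines that do not match."""
    counts = Counter()
    for raw in report.splitlines():
        match = REPORT_LINE.match(raw.strip())
        if match and match.group("severity") == "error":
            counts[match.group("message")] += 1
    return counts
```

`tally_errors(output).most_common()` then lists the distinct messages from most to least frequent.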
Soft page breaks are a new feature introduced in ODF 1.1. It has two primary advantages for accessibility. First it allows easier collaboration between people using different technologies to read a document. Not all documents are deeply structured, with formal divisions like section 3.2.1, etc. Most business documents are loosely structured, and collaboration occurs by referring to "2nd paragraph on page 23" or "the bottom of page 18". But when using different assistive technologies, from larger fonts, to braille, to audio renderings, the page breaks (if the assistive technology even has the concept of a page break) are usually located differently from the page breaks in the original authoring tool. This makes collaboration difficult. So, ODF 1.1 added the ability for applications to write out "soft" page breaks, indicating where the page breaks occurred when the original source document was saved.
Although this feature was added for accessibility reasons, like curb cuts, its likely future applications are more general. We will all benefit. For example, a convertor for translating from ODF to HTML would ordinarily only be able to calculate the original page breaks by undertaking complex layout calculations. But with soft page breaks recorded, even a simple XSLT script can use this information to insert indications of page breaks, or to generate accurate page numbering, etc. Although the addition of this feature hinders Alex's idiosyncratic attempt to validate ODF 1.1 documents with the ODF 1.0 schema, I think the fact that this feature helps blind and visually impaired users, and generally improves collaboration makes it a fair trade-off.
Wouldn't you agree?
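The page-numbering use mentioned above really does need nothing more than a walk over the markup. I described it in terms of XSLT, but the same idea in a short Python sketch looks like this. It assumes the document records its soft page breaks as text:soft-page-break elements and simply counts them in document order; the function name is made up, and real documents have more structure (tables, lists, frames) than this handles.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SOFT_BREAK = f"{{{TEXT_NS}}}soft-page-break"
PARAGRAPH = f"{{{TEXT_NS}}}p"


def paragraph_start_pages(content_xml):
    """Return (page_number, paragraph_text) pairs, one per text:p, in document order."""
    root = ET.fromstring(content_xml)
    page = 1
    result = []
    for element in root.iter():        # pre-order, i.e. document order
        if element.tag == SOFT_BREAK:
            page += 1                  # everything after this point is on the next page
        elif element.tag == PARAGRAPH:
            # Record the page on which the paragraph starts.
            result.append((page, "".join(element.itertext())))
    return result
```

No layout engine is involved: the page numbers come entirely from where the authoring application said the breaks fell when it saved the document.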
That leaves 39 validation errors in Alex's test. 12 of them are reports of invalid values in an xlink:href attribute value. This appears to be an error in the original DOCX file. Garbage In, Garbage Out. For example, in one case the original document has a HYPERLINK field that contains a link to content in Microsoft's proprietary CHM format (Compiled HTML). The link provided in the original document does not match the syntax rules required for an XML Schema anyURI (the URL ends with "##" rather than "#"). Maybe it is correct for markup like this, with non-standard, non-interoperable URIs, to give validation errors. This is not the first time that OOXML has been found polluting XML with proprietary extensions. But realize that OpenOffice 2.4.0 did not create this error. OpenOffice is just passing the error along, as Office 2007 saved it. It is interesting to note that this error was not caught in MS Office, and indeed is undetectable with OOXML's lax schema. But the error was caught with the ODF schema. This is a good thing, yes? It might be a good idea for OpenOffice to add an optional validation step after importing Microsoft Office documents, to filter out such data pollution.
For the remaining validation errors, they are 27 instances of style:with-tab. Honestly, I have no explanation for this. This attribute does not exist in ODF 1.0 or ODF 1.1. That it is written out appears to be a bug in OpenOffice. Maybe someone there can tell us what the story is on this? But I don't see this problem in all documents, or even most documents.
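The "##" problem is the kind of thing a trivial check can flag. In generic URI syntax the "#" character may appear only once, as the delimiter that introduces the fragment. The Python sketch below tests just that one rule. It is nowhere near a full anyURI validator, and the function name and sample links are made up.

```python
def has_single_fragment_delimiter(uri):
    """True if the URI reference contains at most one '#' character."""
    return uri.count("#") <= 1


# Hypothetical link values, for illustration only.
print(has_single_fragment_delimiter("guide.html#intro"))
print(has_single_fragment_delimiter("guide.chm##"))
```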
For fun I tried processing this OOXML document another way. Instead of the multi-hop OOXML-to-DOC-to-ODF conversion Alex did, why not go directly from OOXML to ODF in one step, using the convertor that Microsoft/CleverAge created? This should be much cleaner, since it doesn't have all the legacy code or messiness of the binary formats or legacy application code. It is just a mapping from one markup to another markup, written from scratch. Getting the output to be valid should be trivial.
So I download the "OpenXML/ODF Translator Command Line Tools" from SourceForge. According to their web page, this tool targets ODF 1.0, so we'll be validating against the ODF 1.0 schemas.
This tool is very easy to use once you have the .NET prerequisites installed. The command line was:
odfconvertor /I "Office Open XML Part 4 - Markup Language Reference.docx"
The convertor then chugs along for a long, long, long time. I mean a long time. The conversion from OOXML to ODF eventually finished, after 11 hours, 10 minutes and 41 seconds! And this was on a Thinkpad T60p with dual-core Intel 2.16 GHz processor and 2.0 GB of RAM.
I then ran jing, using the validation command lines from above. It reported 376 validation errors, which fell into several categories:
In the end we should put this in perspective. Can OpenOffice produce valid ODF documents? Yes, it can, and I have given an example. Can OpenOffice produce invalid documents? Yes, of course. For example when it writes out a .DOC binary file, it is not even well-formed XML. And we've seen one example, where via a conversion from OOXML, it wrote out an ODF 1.1 document that failed validation. But conformance for an application does not require that it is incapable of writing out an invalid document. Conformance requires that it is capable of writing out a valid document. And of course, success for an ODF implementation requires that its conformance to the standard is sufficient to deliver on the promises of the standard, for interoperability.
It is interesting to recall the study that Dagfinn Parnas did a few years ago. He analyzed 2.5 million web pages. He found that only 0.7% of them were valid markup. Depending on how you write the headlines, this is either an alarming statement on the low formal quality of web content, or a reassuring thought on the robustness of well-designed applications and systems. Certainly the web seems to have thrived in spite of the fact that almost every web page is in error according to the appropriate web standards. In fact I promise you that the page you are reading now is not valid, and neither is Alex Brown's, nor SC34's, nor JTC1's, nor Ecma's, nor ISO's, nor the IEC's.
So I suggest that ODF has a far better validation record than HTML and the web have, and that is an encouraging statement. In any case, Alex Brown's dire pronouncements on ODF validity have been weighed in the balance and found wanting.
4 May 2008
Alex has responded on his blog with "ODF validation for cognoscenti". He deals purely with the ID/IDREF/IDREFS questions in XML. He does not justify his biased and faulty testing methodology, nor does he reiterate his bold claims that there are no valid ODF 1.0 documents in existence.
Since Alex's blog does not seem to be allowing me to comment, I'll put here what I would have put there. I'll be brief because I have other fish to fry today.
Alex, no one doubts that ID/IDREF/IDREFS constraints must be respected by valid ODF document instances. I never suggested otherwise. But what I do state is that this is not a concern of a Relax NG validator. You can read James Clark saying the same thing in his 2001 "Guidelines for using W3C XML Schema Datatypes with RELAX NG", which says in part:
Validation of ID/IDREF/IDREFS cross-reference semantics is not the job of Relax NG, and you are incorrect to suggest otherwise. Your logic is also deficient when you take my statement of that fact and derive the false statement that I believe that ID/IDREF semantics do not apply to ODF. One does not follow from the other.
You know, as much as anyone, that conformance is a complex topic. One does not ordinarily expect, except in trivial XML formats, that the complete set of conformance constraints will be expressed in the schema. Typically a multi-layered approach is used, with some syntax and structural constraints expressed in XML Schema or Relax NG, some business constraints in Schematron, and maybe even some deeper semantic constraints that are expressed only in the text of the standard and can only be tested by application logic.
For example, a document that defines a cryptographic algorithm might need to store a prime number. The schema might define this as an integer. The fact that the schema does not state or guarantee that it is a prime number is not the fault of the schema. And the inability of a Relax NG validator to test primality is not a defect in Relax NG. The primality test would simply need to be carried out at another level, with application logic. But the requirement for primality in document instances can still be a conformance requirement and it is still testable, albeit with some computational effort, in application logic.
I believe that is the source of your confusion. The initial errors you saw when running jing with the Relax NG DTD Compatibility flag enabled were not errors in the ODF document instances. What you saw was jing reporting that it could not apply the Relax NG DTD Compatibility ID/IDREF/IDREFS constraint checks using the ODF 1.0 schema. That in no way means that the constraints defined in XML 1.0 are not required on ODF document instances. It simply indicates that you would need to verify these constraints using means other than Relax NG DTD Compatibility.
So I wonder, have you actually found ODF document instances, say written from OpenOffice 2.4.0, which have ID/IDREF/IDREFS usage which violates the constraints expressed in ODF 1.0?
Finally, in your professional judgment, do you maintain that this is a accurate statement: "For ISO/IEC 26300:2006 (ODF) in general, we can say that the standard itself has a defect which prevents any document claiming validity from being actually valid. Consequently, there are no XML documents in existence which are valid to ISO ODF."
Alex Brown has a problem. He can't figure out how to validate ODF documents. Unfortunately, when he couldn't figure it out, he didn't ask the OASIS ODF TC for help, which would have been the normal thing to do. Indeed, the ODF TC passed a resolution back in February 2007 that said, in part:
That the ODF TC welcomes any questions from ISO/IEC JTC1/SC34 and
member NB's regarding OpenDocument Format, the functionality it
describes, the planned evolution of this standard, and its relationship
to other work on the technical agenda of JTC1/SC34. Questions and
comments can be directed to the TC chair and secretary whose email
addresses are given at
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office
or through the comments facility at
http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office
So it is rather uncollegial of Alex to pass over such an open, transparent way of getting his questions answered. But Alex didn't avail himself of that avenue. He just assumed if he couldn't figure out how to validate ODF then it simply couldn't be done, and that ODF was to blame. This is presumptuous. Does he think that in the three years since ODF 1.0 became a standard, no one has tried to validate a document?
Alex is so sure of himself that he publicly exults on the claimed significance of his findings:
- For ISO/IEC 26300:2006 (ODF) in general, we can say that the standard itself has a defect which prevents any document claiming validity from being actually valid. Consequently, there are no XML documents in existence which are valid to ISO ODF.
- Even if the schema is fixed, we can see that OpenOffice.org 2.4.0 does not produce valid XML documents. This is to be expected and is a mirror-case of what was found for MS Office 2007: while MS Office has not caught up with the ISO standard, OpenOffice has rather bypassed it (it aims at its consortium standard, just as MS Office does).
I think you agree that these are bold pronouncements, especially coming from someone so prominent in SC34, the Convenor of the ill-fated OOXML BRM, someone who is currently arguing that SC34 should own the maintenance of OOXML and ODF, indeed someone who would be well served if he could show that all consortia standards are junk, and that only SC34 (and he himself) could make them good.
Of course, I've been known to pontificate as well. There is nothing necessarily wrong with that. The difference here is that Alex Brown is totally wrong.
But let's see if we can help show Alex, or anyone else similarly confused, the correct way to validate an ODF document.
First start with an ODF document. When Alex tested OOXML, he used the Ecma-376 OOXML specification. Let's do the analogous test and validate the ODF 1.0 text. You can download it from the OASIS ODF web site. You'll want this version of the text, ODF 1.0 (second edition), which is the source document for the ISO version of ODF.
You'll also want to download the Relax NG schema files for OASIS ODF 1.0, which you can download in two pieces: the main schema, and the manifest schema.
Next you'll need to get a Relax NG validator. Alex recommends James Clark's jing, so we'll use that. I downloaded jing-20030619.zip, the main distribution for use with the Java Runtime Environment. Unzip that to a directory and we're almost there.
Since jing operates on XML files and knows nothing about the Zip package structure of an ODF file, you'll need to extract the XML contents of the ODF file. There are many ways to do this. My preference, on Windows, is to associate WinZip with the ODF file extensions (ODT, ODS and ODP) so I can right-click on these files and unzip them. When you unzip you will have the following XML files, along with directories for image files and other non-XML resources you can ignore:
- content.xml
- styles.xml
- meta.xml
- settings.xml
- META-INF/manifest.xml
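If you would rather script the extraction than right-click, something like the following Python sketch should do the same job. It uses only the standard zipfile module; the function name extract_odf_xml and the constant ODF_XML_MEMBERS are my own inventions for illustration, and the member list is simply the five files named above.

```python
import zipfile

# The five XML members listed above. Anything else in the package
# (images, thumbnails, and so on) is deliberately left alone.
ODF_XML_MEMBERS = [
    "content.xml",
    "styles.xml",
    "meta.xml",
    "settings.xml",
    "META-INF/manifest.xml",
]


def extract_odf_xml(package, dest_dir):
    """Extract the XML members found in an ODF package.

    `package` is a path or file-like object for an .odt/.ods/.odp file.
    Returns the names of the members that were actually extracted.
    """
    extracted = []
    with zipfile.ZipFile(package) as zf:
        present = set(zf.namelist())
        for name in ODF_XML_MEMBERS:
            if name in present:
                zf.extract(name, dest_dir)
                extracted.append(name)
    return extracted
```

Calling extract_odf_xml("spec.odt", "unzipped") (both names hypothetical) would leave the XML files under the unzipped directory, ready for jing.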
Now run jing against the first of these, content.xml, using the main schema:
java -jar c:/jing/bin/jing.jar OpenDocument-schema-v1.0-os.rng content.xml
(Your command may vary, depending on where you put jing, the ODF schema files and the unzipped ODF files.)
The result is a whole slew of error messages:
C:\temp\odf\OpenDocument-schema-v1.0-os.rng:17658:18: error: conflicting ID-types for attribute "targetElement" from namespace "urn:oasis:names:tc:opendocument:xmlns:smil-compatible:1.0" of element "command" from namespace "urn:oasis:names:tc:opendocument:xmlns:animation:1.0"
C:\temp\odf\OpenDocument-schema-v1.0-os.rng:10294:22: error: conflicting ID-types for attribute "targetElement" from namespace "urn:oasis:names:tc:opendocument:xmlns:smil-compatible:1.0" of element "command" from namespace "urn:oasis:names:tc:opendocument:xmlns:animation:1.0"
Oh no! Emergency, emergency, everyone to get from street!
I wonder if this is one of the things that tripped Alex up? Take a deep breath. These in fact are not Relax NG (ISO/IEC 19757-2) errors at all, but errors generated by jing's default validation of a different set of constraints, defined in the Relax NG DTD Compatibility specification which has the status of a Committee Specification in OASIS. It is not part of ISO/IEC 19757-2.
Relax NG DTD Compatibility provides three extensions to Relax NG: default attribute values, ID/IDREF constraints and a documentation element. The Relax NG DTD Compatibility specification is quite clear in section 2 that "Conformance is defined separately for each feature. A conformant implementation can support any combination of features." And in fact, ODF 1.0, in section 1.2 does just that: "The schema language used within this specification is Relax-NG (see [RNG]). The attribute default value feature specified in [RNG-Compat] is used to provide attribute default values".
It is best to simply disable the checking of Relax NG DTD Compatibility constraints by using the documented "-i" flag in jing. If you want to validate ID/IDREF cross-references, then you'll need to do that in application code, and not using jing in Relax NG DTD Compatibility mode. Note that jing was not complaining about any actual ID/IDREF problem in the ODF document.
So, false alarm. You can walk safely on the streets now.
(That said, if we can make some simple changes to the ODF schemas that will allow it to work better with the default settings of jing, or other popular tools, then I'm certainly in favor of that. Alex's proposed changes to the schema are reasonable and should be considered.)
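To give a feel for what "do that in application code" might look like, here is a minimal Python sketch of an ID/IDREF cross-reference check. It is not how any particular ODF application does it. The function name is made up, and the caller must say which attributes are of type ID and which are IDREF or IDREFS; the attribute names in the usage note are placeholders, not names taken from the ODF schema.

```python
import xml.etree.ElementTree as ET


def check_id_references(xml_text, id_attrs, idref_attrs):
    """Report duplicate IDs and references that point at no ID.

    `id_attrs` and `idref_attrs` are sets of attribute names as
    ElementTree spells them ("{namespace}local" for namespaced ones).
    Which attributes belong in each set is an assumption the caller
    supplies; this sketch does not read it from a schema.
    """
    root = ET.fromstring(xml_text)
    seen = set()
    refs = []
    problems = []
    for elem in root.iter():
        for name, value in elem.attrib.items():
            if name in id_attrs:
                if value in seen:
                    problems.append("duplicate ID: " + value)
                seen.add(value)
            elif name in idref_attrs:
                # An IDREFS value is a whitespace-separated list of IDREFs.
                refs.extend(value.split())
    for ref in refs:
        if ref not in seen:
            problems.append("dangling reference: " + ref)
    return problems
```

With a toy document such as `<doc><a id="x"/><c ref="x y"/></doc>` and the placeholder sets {"id"} and {"ref"}, the function would flag only the reference to "y".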
So, let's repeat the validation with the -i flag:
java -jar c:/jing/bin/jing.jar -i OpenDocument-schema-v1.0-os.rng content.xml
Zero errors, zero warnings.
java -jar c:/jing/bin/jing.jar -i OpenDocument-schema-v1.0-os.rng styles.xml
Zero errors, zero warnings.
java -jar c:/jing/bin/jing.jar -i OpenDocument-schema-v1.0-os.rng meta.xml
Zero errors, zero warnings.
java -jar c:/jing/bin/jing.jar -i OpenDocument-schema-v1.0-os.rng settings.xml
Zero errors, zero warnings.
java -jar c:/jing/bin/jing.jar -i OpenDocument-manifest-schema-v1.0-os.rng META-INF/manifest.xml
Zero errors, zero warnings.
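Typing five nearly identical commands gets old quickly, so here is a rough Python wrapper that loops over the same files and schemas. The jing invocation is exactly the one shown above, including the -i flag; the function names, the TARGETS table and the choice to return jing's captured standard output are my own assumptions, and I am treating empty output as "nothing reported" rather than relying on any particular exit code.

```python
import subprocess

MAIN_SCHEMA = "OpenDocument-schema-v1.0-os.rng"
MANIFEST_SCHEMA = "OpenDocument-manifest-schema-v1.0-os.rng"

# (schema, XML file) pairs, matching the commands shown above.
TARGETS = [
    (MAIN_SCHEMA, "content.xml"),
    (MAIN_SCHEMA, "styles.xml"),
    (MAIN_SCHEMA, "meta.xml"),
    (MAIN_SCHEMA, "settings.xml"),
    (MANIFEST_SCHEMA, "META-INF/manifest.xml"),
]


def jing_command(jing_jar, schema, xml_file):
    """Build the argument list for one jing run with ID checks disabled."""
    return ["java", "-jar", jing_jar, "-i", schema, xml_file]


def validate_all(jing_jar, run=subprocess.run):
    """Run jing on every target and return {xml_file: captured output}.

    `run` is injectable so the loop can be exercised without Java installed.
    """
    results = {}
    for schema, xml_file in TARGETS:
        completed = run(
            jing_command(jing_jar, schema, xml_file),
            capture_output=True,
            text=True,
        )
        results[xml_file] = completed.stdout
    return results
```

Run from the directory holding the schemas and the unzipped files, validate_all("c:/jing/bin/jing.jar") should come back with five entries for you to inspect.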
So, there you have it, an example that shows that there is at least one document in the universe that is valid to the ODF 1.0 schema, disproving Alex's statement that "there are no XML documents in existence which are valid to ISO ODF."
The directions are complete and should allow anyone to validate the ODF 1.0 specification, or any other ODF 1.0 document. Now that we have the basics down, let's work on some more advanced topics.
First, the reader should note that there are two versions of the ODF schema, the original 1.0 from 2005, and the updated 1.1 from 2007. (There is also a third version underway, ODF 1.2, but that needn't concern us here.)
An application, when it creates an ODF document, indicates which version of the ODF standard it is targeting. You can find this indication if you look at the office:version attribute on the root element of any ODF XML file. The only values I would expect to see in use today would be "1.0" and "1.1". Eventually we'll also see "1.2".
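If you want to check this without opening the file in an editor, a few lines of Python will read the attribute for you. This is only a sketch: the function name is mine, and it assumes the attribute sits in the standard ODF office namespace, urn:oasis:names:tc:opendocument:xmlns:office:1.0.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "office" prefix in ODF XML files.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def declared_odf_version(xml_text):
    """Return the office:version value on the root element, or None."""
    root = ET.fromstring(xml_text)
    return root.get("{%s}version" % OFFICE_NS)
```

Feed it the text of content.xml and it should hand back "1.0" or "1.1", or None if the application did not declare a version at all.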
It is important to use the appropriate version of the ODF schema to validate a particular document. Our goal, as we evolve ODF, is that an application that knows only about ODF 1.0 should be able to adapt and "degrade gracefully" when given an ODF 1.1 document, by ignoring the features it does not understand. But an application written to understand ODF 1.1 should be able to fully understand ODF 1.0 documents without any additional accommodation.
Put differently, from the document perspective, a document that conforms to ODF 1.0 should also conform to ODF 1.1. But the reverse direction is not true.
To accomplish this, as we evolve ODF, within the 1.x family of revisions, we try to limit ourselves to changes that widen the schema constraints, by adding new optional elements, or new attribute values, or expanding the range of values permitted. Constraint changes that are logically narrowing, like removing elements, making optional elements mandatory, or reducing the range of allowed values, would break this kind of document compatibility.
Now of course, at some point we may want to make bolder changes to the schema, but this would be in a major release, like a 2.0 version. But within the ODF 1.x family we want this kind of compatibility.
The net of this is, an ODF 1.1 document should only be expected to be valid to the ODF 1.1 schema, but an ODF 1.0 document should be valid to the ODF 1.0 and the ODF 1.1 schemas.
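The rule is simple enough to write down as code. The following Python sketch just encodes the statement above for the two published 1.x versions; the names are hypothetical, and it says nothing about ODF 1.2 or any future 2.0.

```python
# Published ODF 1.x schema versions, oldest first.
ODF_1X_VERSIONS = ["1.0", "1.1"]


def schemas_expected_to_accept(document_version):
    """List the schema versions a document of this version should validate against.

    Within the 1.x family a document is expected to be valid to the
    schema of its own version and of every later one, never an earlier one.
    """
    index = ODF_1X_VERSIONS.index(document_version)
    return ODF_1X_VERSIONS[index:]
```

So a document declaring "1.0" should pass both schemas, while one declaring "1.1" is only expected to pass the 1.1 schema.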
That's enough theory! Let's take a look now at the test that Alex actually ran. It is a rather curious, strangely biased kind of test, but the bad thinking is interesting enough to devote some time to examine in some detail.
When he earlier tested OOXML, Alex used the OOXML standard itself, a text on which Microsoft engineers had lavished many person-years of attention for the past 18 months, and he validated it with the current version of the OOXML schema. That is pretty much the best case, testing a document that has never been out of Microsoft's sight for 18 months and testing it with the current version of the schema. I would expect that this document would have been a regular test case for Microsoft internally, and that its validity has been repeatedly and exhaustively tested over the past 18 months. I know that I personally tested it when Ecma-376 was first released, since it was the only significant OOXML document around. So, essentially Alex gave OOXML the softest of all soft pitches.
I think Microsoft's response, that the validity errors detected by Alex are due to changes made to the schema at the BRM, is a reasonable and accurate explanation. The real story on OOXML standardization is not how many changes were made that were incompatible with Office 2007, but how few. It appears that very few changes, perhaps only one, will be required to make Office 2007's output be valid OOXML.
So when testing ODF, what did Alex do? Did he use the ODF 1.0 specification as a test case, a document that the OASIS TC might have had the opportunity to give a similar level of attention to? No, he did not, although that would have validated perfectly, as I've demonstrated above. Instead, Alex uses the OOXML specification, a document which by his own testing is not valid OOXML, then converts it into the proprietary .DOC binary format, then translates that binary format into ODF and then tries to validate the results with the ODF 1.0 schema (i.e., the wrong version of the ODF schema since OpenOffice 2.4.0's output is clearly declared as ODF 1.1), and then applies a non-applicable, non-standard DTD Compatibility constraint test during the Relax NG validation.
Does anyone see something else wrong with this testing methodology?
Aside from the obvious bias of using an input document that Microsoft has spent 18 months perfecting, and using the wrong schemas and validator settings, there is another, more subtle problem.
Alex's tests of OOXML and ODF are testing entirely different things. With OOXML, he took a version N (Ecma-376) OOXML document and tried to validate it with a version N+1 (ISO/IEC 29500) version of the OOXML schema.
But what he did with ODF was take a version N+1 (ODF 1.1) document and try to validate it with a version N (ODF 1.0) of the ODF schema.
These are entirely different operations. One test is testing the backwards compatibility of the schema, the other is testing the backwards compatibility of document instances. It takes no genius to figure out that if ODF 1.1 adds new elements, then an ODF 1.1 document instance will not validate with the ODF 1.0 schema. We don't ordinarily expect backwardly compatible validity of document instances. Again, Alex's tests are biased in OOXML's favor, giving ODF a much more difficult, even impossible task, compared to the version run for OOXML.
If we want to compare apples to apples, it is quite easy to perform the equivalent test with ODF. I gave it a try, taking a version N document (the ODF 1.0 standard itself, per above) and validated it with the version N+1 schema (ODF 1.1 in this case). It worked perfectly. No warnings, no errors.
In any case, in his backwards test Alex reports 7,525 errors, "mostly of the same type (use of an undeclared soft-page-break element)" when validating the OOXML text with ODF 1.0 schema. Indeed, all but 39 of these errors are reports of soft-page-break.
Soft page breaks are a new feature introduced in ODF 1.1. It has two primary advantages for accessibility. First it allows easier collaboration between people using different technologies to read a document. Not all documents are deeply structured, with formal divisions like section 3.2.1, etc. Most business documents are loosely structured, and collaboration occurs by referring to "2nd paragraph on page 23" or "the bottom of page 18". But when using different assistive technologies, from larger fonts, to braille, to audio renderings, the page breaks (if the assistive technology even has the concept of a page break) are usually located differently from the page breaks in the original authoring tool. This makes collaboration difficult. So, ODF 1.1 added the ability for applications to write out "soft" page breaks, indicating where the page breaks occurred when the original source document was saved.
Although this feature was added for accessibility reasons, like curb cuts, its likely future applications are more general. We will all benefit. For example, a convertor for translating from ODF to HTML would ordinarily only be able to calculate the original page breaks by undertaking complex layout calculations. But with soft page breaks recorded, even a simple XSLT script can use this information to insert indications of page breaks, or to generate accurate page numbering, etc. Although the addition of this feature hinders Alex's idiosyncratic attempt to validate ODF 1.1 documents with the ODF 1.0 schema, I think the fact that this feature helps blind and visually impaired users, and generally improves collaboration makes it a fair trade-off.
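I mentioned XSLT, but the same point can be made in a few lines of Python. This sketch estimates the page count of a document by counting text:soft-page-break elements in content.xml. It assumes the writing application recorded every soft break, it ignores hard page breaks entirely, and the function name is my own.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the "text" prefix in ODF XML files.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def estimate_page_count(content_xml_text):
    """Estimate pages as (number of soft page breaks) + 1.

    No layout calculation is done; the answer is only as good as the
    soft page breaks the authoring application chose to write out.
    """
    root = ET.fromstring(content_xml_text)
    breaks = sum(1 for _ in root.iter("{%s}soft-page-break" % TEXT_NS))
    return breaks + 1
```

No font metrics, no line breaking, no layout engine, which is rather the point of recording the breaks in the first place.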
Wouldn't you agree?
That leaves 39 validation errors in Alex's test. 12 of them are reports of invalid values in an xlink:href attribute value. This appears to be an error in the original DOCX file. Garbage In, Garbage Out. For example, in one case the original document has a HYPERLINK field that contains a link to content in Microsoft's proprietary CHM format (Compiled HTML). The link provided in the original document does not match the syntax rules required for an XML Schema anyURI (the URL ends with "##" rather than "#"). Maybe it is correct for markup like this, with non-standard, non-interoperable URIs, to give validation errors. This is not the first time that OOXML has been found polluting XML with proprietary extensions. But realize that OpenOffice 2.4.0 did not create this error. OpenOffice is just passing the error along, as Office 2007 saved it. It is interesting to note that this error was not caught in MS Office, and indeed is undetectable with OOXML's lax schema. But the error was caught with the ODF schema. This is a good thing, yes? It might be a good idea for OpenOffice to add an optional validation step after importing Microsoft Office documents, to filter out such data pollution.
The remaining validation errors are 27 instances of style:with-tab. Honestly, I have no explanation for this. This attribute does not exist in ODF 1.0 or ODF 1.1. That it is written out appears to be a bug in OpenOffice. Maybe someone there can tell us what the story is on this? But I don't see this problem in all documents, or even most documents.
For fun I tried processing this OOXML document another way. Instead of the multi-hop OOXML-to-DOC-to-ODF conversion Alex did, why not go directly from OOXML to ODF in one step, using the convertor that Microsoft/CleverAge created? This should be much cleaner, since it doesn't have all the legacy code or messiness of the binary formats or legacy application code. It is just a mapping from one markup to another markup, written from scratch. Getting the output to be valid should be trivial.
So I downloaded the "OpenXML/ODF Translator Command Line Tools" from SourceForge. According to their web page, this tool targets ODF 1.0, so we'll be validating against the ODF 1.0 schemas.
This tool is very easy to use once you have the .NET prerequisites installed. The command line was:
odfconvertor /I "Office Open XML Part 4 - Markup Language Reference.docx"
The convertor then chugs along for a long, long, long time. I mean a long time. The conversion from OOXML to ODF eventually finished, after 11 hours, 10 minutes and 41 seconds! And this was on a Thinkpad T60p with dual-core Intel 2.16Ghz processor and 2.0 GB of RAM.
I then ran jing, using the validation command lines from above. It reported 376 validation errors, which fell into several categories:
- text:s element not allowed in this context
- bad value for text:style-name
- bad value for text:outline-level
- bad value for svg:x
- bad value for svg:y
- element text:tracked-changes not allowed in this context
- "text not allowed here"
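Sorting a few hundred (or a few thousand) error lines into categories by eye is tedious. A small Python sketch like this one will tally them. It assumes jing's output looks like the error lines quoted earlier in this post, that is, a location, then ": error: ", then the message; the function name is mine, and it groups by the exact message text, so messages that differ only in a quoted value will land in separate buckets.

```python
from collections import Counter


def tally_errors(jing_output):
    """Count jing error messages, ignoring the file:line:column prefix."""
    counts = Counter()
    for line in jing_output.splitlines():
        _location, separator, message = line.partition(": error: ")
        if separator:  # skip anything that is not an error line
            counts[message.strip()] += 1
    return counts
```

Calling tally_errors(output).most_common() then lists the categories from most to least frequent.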
In the end we should put this in perspective. Can OpenOffice produce valid ODF documents? Yes, it can, and I have given an example. Can OpenOffice produce invalid documents? Yes, of course. For example when it writes out a .DOC binary file, it is not even well-formed XML. And we've seen one example, where via a conversion from OOXML, it wrote out an ODF 1.1 document that failed validation. But conformance for an application does not require that it is incapable of writing out an invalid document. Conformance requires that it is capable of writing out a valid document. And of course, success for an ODF implementation requires that its conformance to the standard is sufficient to deliver on the promises of the standard, for interoperability.
It is interesting to recall the study that Dagfinn Parnas did a few years ago. He analyzed 2.5 million web pages. He found that only 0.7% of them were valid markup. Depending on how you write the headlines, this is either an alarming statement on the low formal quality of web content, or a reassuring thought on the robustness of well-designed applications and systems. Certainly the web seems to have thrived in spite of the fact that almost every web page is in error according to the appropriate web standards. In fact I promise you that the page you are reading now is not valid, and neither is Alex Brown's, nor SC34's, nor JTC1's, nor Ecma's, nor ISO's, nor the IEC's.
So I suggest that ODF has a far better validation record than HTML and the web have, and that is an encouraging statement. In any case, Alex Brown's dire pronouncements on ODF validity have been weighed in the balance and found wanting.
4 May 2008
Alex has responded on his blog with "ODF validation for cognoscenti". He deals purely with the ID/IDREF/IDREFS questions in XML. He does not justify his biased and faulty testing methodology, nor does he reiterate his bold claims that there are no valid ODF 1.0 documents in existence.
Since Alex's blog does not seem to be allowing me to comment, I'll put here what I would have put there. I'll be brief because I have other fish to fry today.
Alex, no one doubts that ID/IDREF/IDREFS constraints must be respected by valid ODF document instances. I never suggested otherwise. But what I do state is that this is not a concern of a Relax NG validator. You can read James Clark saying the same thing in his 2001 "Guidelines for using W3C XML Schema Datatypes with RELAX NG", which says in part:
The semantics defined by [W3C XML Schema Datatypes] for the ID, IDREF and IDREFS datatypes are purely lexical and do not include the cross-reference semantics of the corresponding [XML 1.0] datatypes. The cross-reference semantics of these datatypes in XML Schema comes from XML Schema Part 1. Furthermore, the [XML 1.0] cross-reference semantics of these datatypes do not fit into the RELAX NG model of what a datatype is. Therefore, RELAX NG validation will only validate the lexical aspects of these datatypes as defined in [W3C XML Schema Datatypes].
Validation of ID/IDREF/IDREFS cross-reference semantics is not the job of Relax NG, and you are incorrect to suggest otherwise. Your logic is also deficient when you take my statement of that fact and derive the false statement that I believe that ID/IDREF semantics do not apply to ODF. One does not follow from the other.
You know, as much as anyone, that conformance is a complex topic. One does not ordinarily expect, except in trivial XML formats, that the complete set of conformance constraints will be expressed in the schema. Typically a multi-layered approach is used, with some syntax and structural constraints expressed in XML Schema or Relax NG, some business constraints in Schematron, and maybe even some deeper semantic constraints that are expressed only in the text of the standard and can only be tested by application logic.
For example, a document that defines a cryptographic algorithm might need to store a prime number. The schema might define this as an integer. The fact that the schema does not state or guarantee that it is a prime number is not the fault of the schema. And the inability of a Relax NG validator to test primality is not a defect in Relax NG. The primality test would simply need to be carried out at another level, with application logic. But the requirement for primality in document instances can still be a conformance requirement and it is still testable, albeit with some computational effort, in application logic.
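To make the layering concrete, here is a toy Python sketch of the prime number example. The first check stands in for what a schema datatype can tell you (the value is lexically an integer); the second is the application-level test. Both function names are invented for illustration, and trial division is used only because it is short, not because anyone should test cryptographic primes this way.

```python
def is_prime(n):
    """Trial-division primality test; fine for a sketch, slow for big numbers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True


def check_prime_field(text):
    """Apply the two layers in order and say which one, if any, objected."""
    # Layer 1: what a schema datatype can express (is it an integer?).
    try:
        value = int(text)
    except ValueError:
        return "rejected at the schema layer: not an integer"
    # Layer 2: what only application logic can express (is it prime?).
    if not is_prime(value):
        return "rejected at the application layer: not prime"
    return "ok"
```

A value like "15" sails through the first layer and is stopped at the second, and nobody would call that a defect in the integer datatype.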
I believe that is the source of your confusion. The initial errors you saw when running jing with the Relax NG DTD Compatibility flag enabled were not errors in the ODF document instances. What you saw was jing reporting that it could not apply the Relax NG DTD Compatibility ID/IDREF/IDREFS constraint checks using the ODF 1.0 schema. That in no way means that the constraints defined in XML 1.0 are not required on ODF document instances. It simply indicates that you would need to verify these constraints using means other than Relax NG DTD Compatibility.
So I wonder, have you actually found ODF document instances, say written from OpenOffice 2.4.0, which have ID/IDREF/IDREFS usage which violates the constraints expressed in ODF 1.0?
Finally, in your professional judgment, do you maintain that this is an accurate statement: "For ISO/IEC 26300:2006 (ODF) in general, we can say that the standard itself has a defect which prevents any document claiming validity from being actually valid. Consequently, there are no XML documents in existence which are valid to ISO ODF."
Labels: ODF
Wednesday, April 16, 2008
Suggesting ODF Enhancements
There is a good post by Mathias Bauer on Sun Hamburg's GullFOSS blog. He deals with the practical importance of OASIS's "Feedback License" that governs any public feedback OASIS receives from non-TC members.
The ODF TC receives ideas for new features from many places. Many of the ideas come from our TC members themselves, where we have representation from most of the major ODF vendors, from open source projects, interest groups, as well as from individual contributors.
Other ideas come from other vendors or open source projects, from organizations that the TC has a liaison relationship with (like ISO/IEC JTC1/SC34), or individual members of the public.
Contributions from OASIS TC members are already covered by the OASIS IPR Policy. The TC member who contributes written proposals to the TC is obliged from the time of contribution. And other TC members are obliged if they have been TC members for at least 60 days and remain a member 7 days after approval of any Committee Draft. You can see the participation status of TC members here.
For everyone else, those who are not members of the ODF TC, the rules require that proposals, feedback, comments, ideas, etc., come through our comment mailing list. But before you can post to the comment list you must first accept the terms of the Feedback License.
Is this extra step annoying? Yes, it is. But this pain is what is necessary to keep our IP pedigree clean and protect the rights of everyone to implement and use ODF. It is part of the price we pay for open standards. Free does not mean free from vigilance.
One of my responsibilities on the ODF TC is to monitor and process the public comments we receive. Regretfully this is a duty which I've neglected for too long. So I spent some time this week getting caught up on the comments, entering them all into a tracking spreadsheet. We have a total of 180 public comments since ODF 1.0 was approved by OASIS, covering everything from new feature proposals to reports of typographical errors.
The largest single source of comments is from the Japanese JTC1/SC34 mirror committee, where they have been translating the ODF 1.0 standard into Japanese. As you know, you will get no closer reading of a text than when attempting translation, so we're glad to receive this scrutiny. I'll look forward to adding the Japanese translation of ODF alongside the existing Russian and Chinese translations soon.
For comments that are in the nature of a defect report, i.e., reporting an editorial or technical error in the standard, we will include a fix in the ODF 1.0 errata document we are preparing. For comments that are in the nature of a new feature proposal, we will discuss on a TC call, and decide whether or not to include it in ODF 1.2.
A sample of some of the feature proposals from the comment list are:
- A request to support embedded fonts in ODF documents
- A request to support multiple versions of the same document in the same file
- A request to allow vertical text justification
- A proposal for enhanced string processing spreadsheet functions
- A proposal for spreadsheet values to allow units, which would help prevent calculation errors due to mixing units, i.e., adding mm to kg would be flagged as an error.
- A proposal for allowing spreadsheet named ranges to have namespaces, with each sheet in a workbook having its own namespace.
- A proposal to allow a document to have a "portable" flag to allow it to self-identify that it contains only portable ODF content with no proprietary extensions.
- Proposal for adding FFT support to spreadsheet
- Proposal for adding overline text attribute
Of course, general comments are always welcome on this blog.
Labels: ODF
Saturday, February 16, 2008
Fast Track versus PAS
Years ago I read an interesting article about the encyclopedia entry for the keyword "Longitude". According to the article, the entry merely said "See Latitude". With that short, two-word sentence the encyclopedia author conflated these two concepts as mere orthogonal dimensions, lumped together, each as boring as the other. This ignored the fact that latitude is boring, easy, trivial, known to the ancients and as easy to calculate as measuring the altitude of Polaris. But longitude, there lies an epic adventure, something fiendishly difficult to calculate accurately, something that propelled a great seafaring nation to a search for accurate timepieces that would work at sea, just in order to more accurately calculate longitude. Books have been written about longitude, lives lost, fortunes made. But latitude -- latitude is for children.
So when I hear people lump Fast Track and PAS process in JTC1 together, I roll my eyes and think... If only they knew how different they really are.
Let's give it a try, starting with PAS.
PAS stands for "Publicly Available Specification" and the PAS process in JTC1 allows an existing standard from outside of JTC1 to be submitted, reviewed and approved in an accelerated review cycle. An organization that wishes to make a PAS submission (typically a standards consortium) must first seek recognition as a PAS Submitter. This requires that they submit to JTC1 for approval a list of standards they wish to submit, as well as documentation that explains their organizational qualifications. The long list of organizational acceptance criteria is outlined in JTC1 Directives, Annex M:
M7.3 Organisation Acceptance Criteria
M7.3.1 Co-operative Stance (M)
There should be evidence of a co-operative attitude toward open dialogue, and a stated objective of pursuing standardisation in the JTC 1 arena. The JTC 1 community will reciprocate in similar ways, and in addition, will recognise the organisation's contribution to international standards.
It is JTC 1's intention to avoid any divergence between the JTC 1 revision of a transposed PAS and a version published by the originator. Therefore, JTC 1 invites the submitter to work closely with JTC 1 in revising or amending a transposed PAS.
There should be acceptable proposals covering the following categories and topics.
M.7.3.1.1 Commitment to Working Agreement(s)
- What working agreements have been provided, how comprehensive are they?
- How manageable are the proposed working agreements (e.g. understandable, simple, direct, devoid of legalistic language except where necessary)?
- What is the attitude toward creating and using working agreements?
M.7.3.1.2 Ongoing Maintenance
- What is the willingness and resource availability to conduct ongoing maintenance, interpretation, and 5 year revision cycles following JTC 1 approval (see also M6.1.5)?
- What level of willingness and resources are available to facilitate specification progression during the transposition process (e.g. technical clarification and normal document editing)?
M.7.3.1.3 Changes during transposition
- What are the expectations of the proposer toward technical and editorial changes to the specification during the transposition process?
- How flexible is the proposing organisation toward using only portions of the proposed specification or adding supplemental material to it?
M.7.3.1.4 Future Plans
- What are the intentions of the proposing organisation toward future additions, extensions, deletions or modifications to the specification? Under what conditions? When? Rationale?
- What willingness exists to work with JTC 1 on future versions in order to avoid divergence? Note that the answer to this question is particularly relevant in cases where doubts may exist about the openness of the submitter organisation.
- What is the scope of the organisation activities relative to specifications similar to but beyond that being proposed?
M7.3.2 Characteristics of the Organisation (M)
The PAS should have originated in a stable body that uses reasonable processes for achieving broad consensus among many parties. The PAS owner should demonstrate the openness and non-discrimination of the process which is used to establish consensus, and it should declare any ongoing commercial interest in the specification either as an organisation in its own right or by supporting organisations such as revenue from sales or royalties.
M.7.3.2.1 Process and Consensus:
- What processes and procedures are used to achieve consensus, by small groups and by the organisation in its entirety?
- How easy or difficult is it for interested parties, e.g. business entities, individuals, or government representatives to participate?
- What criteria are used to determine "voting" rights in the process of achieving consensus?
M.7.3.2.2 Credibility and Longevity:
- What is the extent of and support from (technical commitment) active members of the organisation?
- How well is the organisation recognised by the interested/affected industry?
- How long has the organisation been functional (beyond the initial establishment period) and what are the future expectations for continued existence?
- What sort of legal business entity is the organisation operating under?
M7.3.3 Intellectual Property Rights: (M)
The organisation is requested to make known its position on the items listed below. In particular, there shall be a written statement of willingness of the organisation and its members, if applicable, to comply with the ISO/IEC patent policy in reference to the PAS under consideration.
Note: Each JTC 1 National Body should investigate and report the legal implications of this section.
M.7.3.3.1 Patents:
- How willing are the organisation and its members to meet the ISO/IEC policy on these matters?
- What patent rights, covering any item of the proposal, is the PAS owner aware of?
M.7.3.3.2 Copyrights:
- What copyrights have been granted relevant to the subject specification(s)?
- What copyrights, including those on implementable code in the specification, is the PAS originator willing to grant?
- What conditions, if any, apply (e.g. copyright statements, electronic labels, logos)?
M.7.3.3.3 Distribution Rights:
- What distribution rights exist and what are the terms of use?
- What degree of flexibility exists relative to modifying distribution rights; before the transposition process is complete, after transposition completion?
- Is dual/multiple publication and/or distribution envisaged, and if so, by whom?
M.7.3.3.4 Trademark Rights:
- What trademarks apply to the subject specification?
- What are the conditions for use and are they to be transferred to ISO/IEC in part or in their entirety?
M.7.3.3.5 Original Contributions:
- What original contributions (outside the above IPR categories) (e.g. documents, plans, research papers, tests, proposals) need consideration in terms of ownership and recognition?
- What financial considerations are there?
- What legal considerations are there?
Once this documentation is provided, a three-month JTC1 ballot is held on the question of whether to approve the applicant as a Recognized PAS Submitter. If approved, this status lasts for 2 years, but may be renewed by reapplying with updated organizational documentation. Renewals must also be approved by a 3-month letter ballot.
Once an organization has Recognized PAS Submitter status, it may now propose a PAS submission. Such a submission must be within scope of the Submitter's original application, and must be accompanied by an Explanatory Report that speaks to JTC1's strategic interests in Interoperability, Cultural and Linguistic Adaptability, as well as the following document-related acceptance criteria:
M7.4 Document Related Criteria
M7.4.1 Quality
Within its scope the specification shall completely describe the functionality (in terms of interfaces, protocols, formats, etc) necessary for an implementation of the PAS. If it is based on a product, it shall include all the functionality necessary to achieve the stated level of compatibility or interoperability in a product independent manner.
M.7.4.1.1 Completeness (M):
- How well are all interfaces specified?
- How easily can implementation take place without need of additional descriptions?
- What proof exists for successful implementations (e.g. availability of test results for media standards)?
M.7.4.1.2 Clarity:
- What means are used to provide definitive descriptions beyond straight text?
- What tables, figures, and reference materials are used to remove ambiguity?
- What contextual material is provided to educate the reader?
M.7.4.1.3 Testability (M)
The extent, use and availability of conformance/interoperability tests or means of implementation verification (e.g. availability of reference material for magnetic media) shall be described, as well as the provisions the specification has for testability.
The specification shall have had sufficient review over an extended time period to characterise it as being stable.
M.7.4.1.4 Stability (M):
- How long has the specification existed, unchanged, since some form of verification (e.g. prototype testing, paper analysis, full interoperability tests) has been achieved?
- To what extent and for how long have products been implemented using the specification?
- What mechanisms are in place to track versions, fixes, and addenda?
M.7.4.1.5 Availability (M):
- Where is the specification available (e.g. one source, multinational locations, what types of distributors)?
- How long has the specification been available?
- Has the distribution been widespread or restricted? (describe situation)
- What are the costs associated with specification availability?
M7.4.2 Consensus (M)
The accompanying report shall describe the extent of (inter)national consensus that the document has already achieved.
M.7.4.2.1 Development Consensus:
- Describe the process by which the specification was developed.
- Describe the process by which the specification was approved.
- What "levels" of approval have been obtained?
M.7.4.2.2 Response to User Requirements:
- How and when were user requirements considered and utilised?
- To what extent have users demonstrated satisfaction?
M.7.4.2.3 Market Acceptance:
- How widespread is the market acceptance today? Anticipated?
- What evidence is there of market acceptance in the literature?
M.7.4.2.4 Credibility:
- What is the extent and use of conformance tests or means of implementation verification?
- What provisions does the specification have for testability?
M7.4.3 Alignment
The specification should be aligned with existing JTC 1 standards or ongoing work and thus complement existing standards, architectures and style guides. Any conflicts with existing standards, architectures and style guides should be made clear and justified.
M.7.4.3.1 Relationship to Existing Standards:
- What international standards are closely related to the specification and how?
- To what international standards is the proposed specification a natural extension?
- How is the specification related to emerging and ongoing JTC 1 projects?
M.7.4.3.2 Adaptability and Migration:
- What adaptations (migrations) of either the specification or international standards would improve the relationship between the specification and international standards?
- How much flexibility do the proponents of the specification have?
- What are the longer-range plans for new/evolving specifications?
M.7.4.3.3 Substitution and Replacement:
- What needs exist, if any, to replace an existing international standard? Rationale?
- What is the need and feasibility of using only a portion of the specification as an international standard?
- What portions, if any, of the specification do not belong in an international standard (e.g. too implementation specific)?
M.7.4.3.4 Document Format and Style
- What plans, if any, exist to conform to JTC 1 document styles?
The Explanatory Report also sets the maintenance regime for the submission, if approved.
The proposed standard, along with the Explanatory Report, is then distributed to JTC1 NB's for a 6-month ballot. The approval criteria are 2/3 approval of voting P-members, and no more than 25% disapproval in total. At the end of the ballot a Ballot Resolution Meeting may be held if needed.
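The two-part approval test above is easy to misread, so here is a rough sketch of it. The function name `ballot_passes` and its parameters are my own invention, and I am assuming that abstentions are simply left out of both counts and that "in total" means all votes cast, by P-members and by other NB's alike. Check the JTC1 Directives before relying on either assumption.

```python
from fractions import Fraction


def ballot_passes(p_yes: int, p_no: int, other_yes: int = 0, other_no: int = 0) -> bool:
    """Apply the two approval criteria to a set of vote counts.

    p_yes / p_no: approve / disapprove votes cast by P-members.
    other_yes / other_no: votes cast by other NB's (assumed to count
    only toward the overall disapproval limit).
    Abstentions are assumed to be excluded before calling this function.
    """
    p_votes = p_yes + p_no
    total_votes = p_votes + other_yes + other_no
    if p_votes == 0:
        return False  # no P-member votes cast, so nothing to approve

    # Criterion 1: at least 2/3 of voting P-members approve.
    p_approval = Fraction(p_yes, p_votes)
    # Criterion 2: no more than 25% of all votes cast are disapprovals.
    disapproval = Fraction(p_no + other_no, total_votes)

    return p_approval >= Fraction(2, 3) and disapproval <= Fraction(1, 4)


# Hypothetical counts, not taken from any real ballot:
print(ballot_passes(p_yes=24, p_no=8))   # meets both criteria
print(ballot_passes(p_yes=20, p_no=10))  # 2/3 approval, but too many "no" votes
```

Exact fractions are used rather than floating point so that a ballot sitting right on the 2/3 or 25% line is decided correctly.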
So, that is PAS process, in brief. PAS process is how ODF was approved back in 2006, with OASIS as the Recognized PAS Submitter.
Fast Track process is almost the same from the time the ballot is issued. The six-month period is split into a 30-day "contradiction period" and a 5-month ballot. (That is an odd difference, with no clear reason.) But the voting criteria, the BRM process, etc., are all the same between the two. What is different (and there are critical differences) is everything that happens before the ballot.
Who can submit a Fast Track? Any JTC1 P-member, or any Class A Liaison can propose a Fast Track.
We all know about P-members. They are NB's, typically the highest standardization committee in any country. A P-member used to also mean that you had a broad interest in many or most JTC1 matters. But now it may mean merely that Microsoft asked you to join as a P-member.
Class A Liaisons are "Organisations which make an effective contribution to and participate actively in the work of JTC 1 or its SCs for most of the questions dealt with by the committee". Any organization can apply to be a Class A Liaison and be voted in via a letter ballot or at a meeting. There are no formal organization qualifications, no requirement to state an interest in eventually making Fast Tracks, or to answer any of the types of questions that PAS Submitters must answer.
Further, once approved as a Class A Liaison, the status lasts forever. There is no requirement to renew or reapply. In fact JTC1 Directives even lack a documented procedure for removing a Class A Liaison.
So what about the proposals for Fast Track submission? What is required of them? No Explanatory Report is required. No checklist of document-related criteria must be answered. JTC1 Directives say merely "The criteria for proposing an existing standard for the fast-track procedure is a matter for each proposer to decide." That's it. It is at the sole discretion of the Class A Liaison.
So you can see what great power Ecma has over JTC1 -- they can submit any standard they want for Fast Track, and no one in JTC1 can stop them, or even remove their right to submit more Fast Tracks.
This may explain why Ecma is able to command such high membership fees. A full voting membership in OASIS, which would allow a company to help produce an OASIS Standard for later submission to JTC1 under the arduous PAS process, costs $1,100 for a small company. To join the US NB and be able to lobby for a Fast Track submission from the US will cost you $9,500. But to join Ecma as a voting member (what they call an "Ordinary Member") will cost you 70,000 Swiss Francs, or $64,000. That is what no-questions-asked Fast Track service is worth. I think that, from Microsoft's perspective, the extra $62,900 over the OASIS fee is money well spent. But what about from JTC1's perspective? They don't get this extra money. So what's their excuse for having these permissive Fast Track procedures that give Ecma so much control?
In any case, that is why I roll my eyes when people lump PAS and Fast Track together, and say that they are essentially the same process. They clearly aren't. PAS Submitters like OASIS are given intense scrutiny, and are required to document in great detail how their organization and their proposals meet JTC1 criteria. The scrutiny never ends, as a new Explanatory Report is required for every submission, and their status as Recognized PAS Submitter only lasts for a few years before requiring re-approval.
Fast Track submitters, as Class A Liaisons, on the other hand, are the monarchs of JTC1. They serve for life and are answerable to no one. They can submit a Fast Track on any subject they want, at any time. So a standards consortium like Ecma, with primary expertise in optical disk standards, but never having produced an XML standard before, can rubber stamp the world's largest XML standard and submit it for Fast Track processing to JTC1. And no one can do a thing about it.
Tuesday, February 12, 2008
Punct Contrapunct
The recent Burton Group report, What's Up, .DOC? by Guy Creese and Peter O'Kelly was made available free to the public for a stated purpose:
We’ve made the overview available for free (I must admit I'm not sure for how long), as we believe this topic warrants expanded industry debate before a February, 2008 ISO ballot on OOXML, and we want to help catalyze and advance the debate.
The degree of expanded debate achieved may be estimated by noting that Microsoft is sending this report to every JTC1 national body involved in the OOXML ballot, from Pakistan to Ecuador, and has invited Peter O'Kelly to speak on this paper both at the recent OOXML press event in Washington as well as this week's Office Developers Conference.
Much could be said of this report, but I'll limit myself to commenting on a single passage:
[S]everal vendors interviewed for this overview indicated that it's essentially impossible to get ODF proposals approved if they're not also supported in OpenOffice.org, and further noted that Sun closely controls OpenOffice.org (much as it also holds control over Java).
It should be noted that, before making this statement, the authors neither contacted OASIS nor the OASIS ODF TC in order to check their facts.
The ODF Alliance published a rebuttal of this report, and in particular took umbrage at that passage, saying:
This is demonstrably false, and the use of unnamed “vendors” as sources does not eliminate the need for doing basic fact checking on such claims. Rumors and innuendo do not objective analysis make.
First, on the control aspect, note that ODF 1.0, the standard, is owned and controlled by OASIS, a standards consortium of over 600 member organizations. Sun is just one company among many members. Indeed, for most of the development of ODF, Microsoft was on the Board of Directors of OASIS.
Second, OASIS is a corporation. It is legally bound to its Bylaws. There is no arbitrary control by member corporations.
The ODF TC is co-chaired by an IBM employee and a Sun employee, and is regulated by the OASIS TC Process document, which is publicly readable by all and has clear rules of procedure and appeal.
The ODF TC has three subcommittees. The Accessibility SC is co-chaired by IBM and Sun, while the Formula Subcommittee and the Metadata Subcommittee are each chaired by individual members of OASIS who are not affiliated with any large corporations.
Voting rights in the ODF TC, for accepting or rejecting features, is currently as follows:
- Sun – 3 voting members
- IBM – 4 voting members
- Individuals – 3 voting members
This can easily be verified at the OASIS ODF TC website.
Is sharing the chair position on the TC and on 1 of 3 subcommittees considered “closely controlling”? Is having 30% of the votes considered “closely controlling”?
As for proposals being accepted into ODF, we note that all three major features for ODF 1.2, RDF metadata, OpenFormula, and enhanced accessibility, are new proposals which have not been yet implemented in OpenOffice. Moreover, the ODF TC is currently processing a set of features requested by the KOffice open source project. So the assertion that it is “essentially impossible” to get new features into ODF if they are not already supported by OpenOffice is not true. This error is unfortunate and needs correcting through rigorous fact checking, as do the others, in our opinion.
Oddly enough, this particular error occurs in several places. A search of the report for the word “control” shows it used six times, once in reference to “Chinese communists” and five times in reference to Sun Microsystems. Note, however, that no mention is ever made of the strong direct control Microsoft asserts over OOXML, its having sole chairmanship of the Ecma TC45, and its having secured a committee charter that prevents any changes to OOXML that are not compatible with Microsoft Office.
Again, we're puzzled by the inaccuracy on one hand and the lack of balance on the other.
Now, back to the Burton Group, where Guy Creese responds on the Burton Group blog:
We were not expecting to be told that Sun had significant sway over the standard, but several people told us that (spread across more than one ODF-oriented vendor), which is why we noted it in the report. As the ODF Alliance notes, IBM and Sun—two of Microsoft’s most powerful productivity application archrivals today (as well as partners to Microsoft in myriad other domains, e.g., Web services-related standards initiatives)—collectively control 70% of the votes in the ODF TC which determines if proposals will be accepted or rejected. This suggests there is ample opportunity for conflicts of interest.
Guy, excuse me, did you say "conflicts of interest"? Please explain. Or maybe when Peter O'Kelly comes back from speaking at Microsoft's Office Developers Conference he can explain it for us?
In any case, the factual errors in your report with respect to the control of ODF have been clearly demonstrated, but instead of simply admitting and correcting the error, you hide behind anonymous sources and further impugn OASIS by charging some sort of "conflict of interest".
To follow your logic further demonstrates the absurdity of it. If you believe that the fact that IBM and Sun "collectively control 70% of the votes in the ODF TC" lends weight to your argument, then what is shown by the equally true mathematical fact that IBM plus independent members also control 70% of the votes? Why is this equally true fact not mentioned? This is the nature of plurality, that there are many different combinations of votes that could make a majority position. Further, note that these groups in practice do not always vote as a bloc. We've had votes where the independent members split their vote, and we even had a vote where the IBM members did not all vote alike. So much for your simplistic control theory.
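The arithmetic is simple enough to check for yourself. Here is a small sketch that takes the voting-member counts quoted in the ODF Alliance rebuttal (Sun 3, IBM 4, Individuals 3) and lists the combined share of every pairing. The function name `coalition_shares` is just something I made up for the illustration.

```python
from itertools import combinations

# Voting members per bloc, as listed in the ODF Alliance rebuttal.
VOTES = {"Sun": 3, "IBM": 4, "Individuals": 3}


def coalition_shares(votes: dict, size: int = 2) -> dict:
    """Return the combined vote percentage of every coalition of `size` blocs."""
    total = sum(votes.values())
    return {
        " + ".join(group): round(100 * sum(votes[name] for name in group) / total)
        for group in combinations(votes, size)
    }


for name, share in coalition_shares(VOTES).items():
    print(f"{name}: {share}%")

# No single bloc holds a majority on its own.
print(max(coalition_shares(VOTES, size=1).values()))
```

Every pairing clears 50%, and two of the three pairings reach 70%, so singling out IBM plus Sun says nothing special about those two.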
I will not question whether your anonymous sources indeed misled you. For sake of argument, I will accept unquestioningly that you indeed had sources and that they said exactly what you claim they said. However, having sources does not excuse you, as an analyst, from doing basic fact checking. The rules of OASIS and the voting composition of the ODF TC are facts, not opinions, and the correct information was sitting there, on public web sites, for you to check. It is not your fault that you were misled by sources, but it is your fault that you did not verify their claims. To publish controversial statements based on anonymous sources without fact checking, this is not something that represents the Burton Group's finest work.
The Burton Group has denigrated the work and the members of the OASIS Open Document Format Technical Committee (of which I am Co-Chair) with published statements that have been shown to be false. The Burton Group owes us an apology and an immediate retraction.
Waiting until after February, after the DIS 29500 process concludes, to make corrections is unacceptable. Since your stated purpose in making this report public was to "advance the debate" in the current OOXML ISO process, withholding factual corrections until after that process concludes would imply that you and the Burton Group see no problems with knowingly persisting in influencing an ISO ballot with false information published under the Burton Group name. I don't believe that is the image that the Burton Group would want to project. So I urge that a correction is in order now.
Thursday, January 31, 2008
The Case for Harmonization
Depending on who you ask, document standard harmonization is either impossible or inevitable, anathema or nirvana. Let's dig a little into this question and see if the two sides are really that far apart.
First note that many JTC1 NB's raised the issue of harmonization in their DIS 29500 ballot comments last September. Some merely requested harmonization, such as Korea, South Africa, Belgium, Peru, Switzerland, or the Czech Republic, while others in addition outlined ways to achieve harmonization. For example, AFNOR, the French NB, stated:
After 5 months of extensive discussions between stakeholders in the field of revisable document formats, AFNOR, in the aim to obtain a single standard for XML office document formats within 3 years, makes the following proposal:
- Split the current ECMA 376 standard in 2 parts in order to differentiate the essential OOXML core functions necessary for easy implementation from those functionalities that are needed for the exchange of legacy office file formats;
- Incorporate the technical comments below and those in the attached comment table submitted to the Fast Track;
- Attribute the status of Technical Specification to both parts;
- Establish a process of convergence between ODF (already standardized as ISO/IEC 26300) and the above mentioned OOXML core. ISO/IEC shall invite parties involved to commit themselves to initiate simultaneously the revisions of the existing ODF v1.0 and the OOXML core in order to obtain at the end of the revision process a standard as universal as possible.
(Note that a Technical Specification, in the ISO process, is for proposals which lack sufficient support for approval as an International Standard, but for which publication is still desired. This may be appropriate for OOXML.)
New Zealand's proposal was similar:
- OOXML should be considered by JTC 1 for publication as a Type 2 Technical Report.
- Seek to harmonize with the existing ODF standard to reduce the cost of interoperability, cost of having two standards, and cost of support/maintenance.
- to have more than 63 columns in a table
- to have background images in tables
- to have font weights beyond “normal” and “bold”.
Ecma rejected every single one of these requests. They did not argue that the requested features were unreasonable. They did not argue that the requested feature was not needed. Their argument was that harmonization of the formats was not necessary because there exist tools that will translate between OOXML and ODF. In other words, they rejected these requests merely because they were pro-harmonization, regardless of the underlying merit or need of the feature. Ironically, Microsoft's conversion tools are restricted in their fidelity because of the lack of these very features.
On the question of harmonization, we are either moving toward it, or we are moving away. There is no time better than the present to harmonize. Waiting will only make matters worse, as we will then need to consider legacy OOXML documents as well as legacy binary and legacy ODF documents. The Ecma response does not move us toward harmonization, but starts down the road toward further divergence, a long and costly divergence.
Tim Bray made the critical observation back in 2005, “The world does not need two ways to say 'This paragraph is in 12-point Arial with 1.2em leading and ragged-right justification'.”
Microsoft likes to claim that harmonization is impossible, that slapping together the features of both standards would lead to a messy, impenetrable mess. Of course, but only an idiot would suggest that as an approach to harmonization. So why do they always bring that up as their strawman?
A look at OpenOffice and Microsoft Office shows a huge degree of functional overlap. Harmonization starts from looking at this functional overlap – and there is a significant, perhaps 90%+ area where they do overlap – and expresses the functional overlap identically, using the same XML schema. In other words, harmonization identifies the commonalities at the functional level and finds a common representation for that commonality.
It would also be expected that the common functionality between ODF and OOXML would also include a common extensibility mechanism, a way for a vendor to express application-specific features that are outside of the harmonized standard.
The remaining 10% of the functionality would be the focus of the harmonization work, the area that requires the most attention. Some portion of that 10% will represent general-purpose features that we can imagine multiple applications supporting. We take those features and add them to ODF. The remaining portion of the 10%, which only serves one vendor's needs, such as flags for deprecated legacy formatting options, could be represented using the common extensibility mechanism.
Does this sound impossible? That's not what Microsoft says. Gray Knowlton, Group Product Manager for Microsoft Office, was candid to PC World a couple of weeks ago:
Also, if individual governments mandate the use of ODF instead of Open XML, Microsoft would adapt, Knowlton said. The company would then implement the missing functionality that ODF doesn't support. However, those extensions would be custom-designed and outside of the standard, which is counter to the idea of an open document standard, Knowlton said.
So we've agreed that this approach is technically feasible. We're also agreed that extending ODF outside of the standards process is not a good idea. So the obvious solution is to extend ODF within the standards process. So, let's do it! What are we waiting for?
There is no reason why, by a harmonization process, all of the functionality of Microsoft Office cannot be represented on a base of ISO 26300 OpenDocument Format. I personally, as Co-Chair of the OASIS ODF TC, stand ready and willing to sponsor such a harmonization effort in OASIS. So let's start harmonization now, and avoid further divergence.
My read of NB comments indicates that there is a sizable bloc, perhaps even a decisive bloc, of NB's who are in favor of harmonization. Let's push on this and articulate a roadmap, along the lines of the proposals by France and New Zealand, that accomplishes this.
Wednesday, November 21, 2007
PDF, The Waste Land, and Monica's Blue Dress
Adobe's PDF Architect, James King, has recently started an "Inside PDF" blog which is well worth subscribing to. I'd especially draw your attention to his post "Submission of PDF to ISO" which has a lot of useful information on the process they are going through in ISO, a process that is slightly different than that used by ODF or OOXML in JTC1. (Note in particular that ISO Fast Track is not exactly the same as JTC1 Fast Track.)
In a more recent post, Archiving Documents, James wonders aloud why anyone would use ODF or OOXML for archiving, compared to PDF or PDF/A, saying "After all, archiving means preserving things, and usually you want to preserve the total look of a document. PDF/A does that."
I recommend reading the Archiving Documents post in full, and then return here for an alternate point of view.
.
.
.
We say the word "archive" quite easily and cover a large number of activities by that name, and in doing so risk blurring a number of different activities into one over-generalization. Before you are told that format X or format Y is best for archiving, it is fair to ask what is meant by "archiving" and ask who does the archiving, for what purpose and under what constraints.
In some cases what must be preserved, and for how long, is spelled out in detail for you, by statute, regulation or court order. Or a company, in anticipation of such requests, may require preservation as part of a corporate-wide records retention policy for certain categories of employees or certain categories of documents.
An example of the range of materials that may be included can be seen in this preservation order:
"Documents, data, and tangible things" is to be interpreted broadly to include writings; records; files; correspondence; reports; memoranda; calendars; diaries; minutes; electronic messages; voicemail; E-mail; telephone message records or logs; computer and network activity logs; hard drives; backup data; removable computer storage media such as tapes, disks, and cards; printouts; document image files; Web pages; databases; spreadsheets; software; books; ledgers; journals; orders; invoices; bills; vouchers; checks; statements; worksheets; summaries; compilations; computations; charts; diagrams; graphic presentations; drawings; films; charts; digital or chemical process photographs; video; phonographic tape; or digital recordings or transcripts thereof; drafts; jottings; and notes. Information that serves to identify, locate, or link such material, such as file inventories, file folders, indices, and metadata, is also included in this definition.
--Pueblo of Laguna v. U.S. // 60 Fed. Cl. 133 (Fed. Cir. 2004).
I would pay particular attention to the part at the end, "...drafts; jottings; and notes. Information that serves to identify, locate, or link such material, such as file inventories, file folders, indices, and metadata".
Similarly, consider government and academic archives that are preserving documents for the long term. The archivist tries to anticipate what questions future researchers will have, and then tries to preserve the document in such a way that it can best answer those questions.
A PDF version of a document answers a single question, and answers it quite well: "What did this document look like when printed?" But this is not the only question that one might have of a document. Some other questions that may be asked include:
- What was the nature of collaboration that led to this document? How many people worked on it? Who contributed what?
- How did the document evolve from revision to revision?
- In the case of a spreadsheet, what was the underlying model and assumptions? In other words, what are the formulas behind the cells?
- In the case of a presentation, how did the document interact with embedded media such as audio, animation, video?
- How was technology used to create this document? In what way did the technology help or impede the author's expression? (Note that researchers in the future may be as interested in the technology behind the document as the contents of the document itself.)
Let's take an analogous case. T.S. Eliot's 1922 poem The Waste Land is a landmark of 20th century literature. Not only is it important from an artistic and critical perspective, but it is also important from a technology perspective -- it is perhaps the first major poem to have been composed at the typewriter. What was published was, like a PDF, what the author intended, what he wanted the world to see. That is all the world knew until around 1970, after the poet's death, when the rest of the story emerged in the form of typewritten draft versions of the poem, with handwritten comments by Ezra Pound.

This provided pages and pages of marked up text that showed the nature and degree of the collaboration between Eliot and Pound far more than had been previously known. This is what researchers want to read. The final publication is great, but the working copy tells us so much more about the process. History is so much more than asking "What?". It continues by asking "How?" and eventually asking "Why?" -- this is where the real insight occurs, going beyond the mere collection of facts and moving on to interpretation. PDF answers the "What?" question admirably. I'm glad we have PDF as a tool for this purpose. But we need to make sure that when archiving documents we allow future research to ask and receive answers to the other questions as well.
Flash forward to the technology of today. We're not all writing great poetry, but we are collaborating on authoring and reviewing and commenting on documents. But instead of doing it via handwritten notes, we're doing it via review & comment features of our word processors. Although the final resulting document may be easily exportable as a PDF document, that is really just a snapshot of what the document looks like today. It loses the record of the collaboration. I don't think that is what we want to archive, or at least not exclusively. If you archive PDF, then you've lost the collaborative record.
Another example: take a spreadsheet. You have cells with formulas and these formulas calculate results which are then displayed. When you make a PDF version of the spreadsheet you have a record of what it "looked like", but this isn't the same as "what it is". You cannot look at the formulas in the PDF. They don't exist. Future researchers may want to check your spreadsheet's assumptions, the underlying model. There may also be the question of whether your spreadsheet had errors, whether from a mis-copied formula, or from an underlying bug in the application. If you archive exclusively as PDF, no one will ever be able to answer these questions.
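As a rough illustration of what the authoring format retains, here is a small Python sketch that lists the formulas stored in an ODF spreadsheet's content.xml. The sample fragment is hand-written and heavily simplified (a real content.xml has more wrapper elements and attributes), and the function name is my own invention. The namespaces and the table:formula and office:value attributes are the ones ODF uses.

```python
import xml.etree.ElementTree as ET

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

# Simplified, hand-written stand-in for a spreadsheet's content.xml.
SAMPLE = f"""<office:document-content
    xmlns:office="{OFFICE_NS}" xmlns:table="{TABLE_NS}">
  <table:table table:name="Sheet1">
    <table:table-row>
      <table:table-cell office:value-type="float" office:value="2"/>
      <table:table-cell office:value-type="float" office:value="3"/>
      <table:table-cell table:formula="of:=[.A1]+[.B1]"
                        office:value-type="float" office:value="5"/>
    </table:table-row>
  </table:table>
</office:document-content>"""


def list_formulas(content_xml):
    """Return (formula, cached value) pairs for every cell that has a formula."""
    root = ET.fromstring(content_xml)
    found = []
    for cell in root.iter(f"{{{TABLE_NS}}}table-cell"):
        formula = cell.get(f"{{{TABLE_NS}}}formula")
        if formula is not None:
            found.append((formula, cell.get(f"{{{OFFICE_NS}}}value")))
    return found


for formula, value in list_formulas(SAMPLE):
    print(formula, "->", value)
```

A PDF rendition of the same sheet would carry only the displayed "5". The formula that produced it is recoverable only from the original file.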
One more example, going back to 1998 and the Clinton/Lewinsky scandal. Kenneth Starr's report on the case was written in WordPerfect format, distributed to the House, which converted it to HTML form and released it on the web. But due to a glitch in the HTML translation process, footnotes that had been marked as deleted in the WordPerfect file reappeared in the HTML version. So we ended up with an official published Starr Report, as well as an unofficial HTML version which had additional footnotes.
Imagine you are an archivist responsible for the Starr Report. What do you do? Which version(s) do you preserve? Is your job to record the official version, as-published? Or is your job to preserve the record for future researchers? Depending on your job description, this might have a clear-cut answer. But if I were a future historian, I would sure hope that someone someplace had the foresight to archive the original WordPerfect version. It answers more questions than the published version does.
So, to sum it up: What you archive determines what questions you can later ask of a document. If you archive as PDF, you have a high-fidelity version of what the final document looked like. This can answer many, but not all, questions. But for the fullest flexibility in what information you can later extract from the document, you really have no choice but to archive the document in its original authoring format.
An intriguing idea is whether we can have it both ways. Suppose you are in an ODF editor and you have a "Save for archiving..." option that would save your ODF document as normal, but also generate a PDF version of it and store it in the zip archive along with ODF's XML streams. Then digitally sign the archive along with a time stamp to make it tamper-proof. You would need to define some additional access conventions, but you could end up with a single document that could be loaded in an ODF editor (in read-only mode) to allow examination of the details of spreadsheet formulas, etc., as well as loaded in a PDF reader to show exactly how it was formatted.
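Here is a minimal sketch, in Python, of the packaging half of that idea: copy an existing ODF package, add a PDF rendition as an extra entry, and list it in the manifest. The entry name Archive/rendition.pdf and the function names are hypothetical, since no such convention is defined by ODF, and the PDF bytes are assumed to have been produced elsewhere. Signing and time-stamping are left out entirely.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

MANIFEST_NS = "urn:oasis:names:tc:opendocument:xmlns:manifest:1.0"
PDF_ENTRY = "Archive/rendition.pdf"  # hypothetical location, not defined by ODF


def _register_pdf(manifest_xml):
    """Add a file-entry for the PDF rendition to META-INF/manifest.xml."""
    ET.register_namespace("manifest", MANIFEST_NS)
    root = ET.fromstring(manifest_xml)
    entry = ET.SubElement(root, f"{{{MANIFEST_NS}}}file-entry")
    entry.set(f"{{{MANIFEST_NS}}}full-path", PDF_ENTRY)
    entry.set(f"{{{MANIFEST_NS}}}media-type", "application/pdf")
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def add_pdf_rendition(odf_bytes, pdf_bytes):
    """Return a copy of an ODF package with a PDF rendition stored inside it."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(odf_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        # Entries are copied in their original order, so a 'mimetype' entry
        # that came first in the source stays first in the copy.
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == "META-INF/manifest.xml":
                data = _register_pdf(data)
            # ODF packages keep 'mimetype' uncompressed.
            method = (zipfile.ZIP_STORED if info.filename == "mimetype"
                      else zipfile.ZIP_DEFLATED)
            dst.writestr(info.filename, data, compress_type=method)
        dst.writestr(PDF_ENTRY, pdf_bytes)
    return out.getvalue()
```

Whether an ODF editor or a PDF reader would open such a file as-is depends on the "additional access conventions" mentioned above. This sketch only shows that the two renditions can share one zip container.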
Sunday, November 18, 2007
Document Format FUD: A Guide for the Perplexed
I've decided to put together a list of misconceptions that I hear, generally on the topic of document formats. I'll try to update this list to keep it current, with the most recent entries at the top. Readers are invited to submit the FUD they observe as comments, and I'll include it where I can.
This inaugural edition is dedicated to the fallout from the recent supernova we know as the OpenDocument Foundation, that in one final act of self-immolation swelled from obscurity to overwhelming brilliance, but then slowly faded away, ever fainter and more erratic, little more than hot gas, the dimming embers no longer sustainable.
Q: Now that the originator and primary supporter of OpenDocument Format has ended its support for ODF, does this mean the end for the ODF standard? (18 Nov 2007)
A: This question is based on a mistaken premise, namely that the OpenDocument Foundation was the originator or steward of the ODF standard. This is an erroneous notion.
The ODF standard is owned by the OASIS standards consortium, with over 600 member organizations and individual members. The committee in OASIS that does the technical work of maintaining the ODF standard is called the OpenDocument TC. It has 15 organization members as well as 7 individual members. Until recently the OpenDocument Foundation was a member of the ODF TC, one voice among many.
The adoption of the ODF standard is promoted by several organizations, most prominently the ODF Alliance (with over 400 organizational members in 52 countries), the OpenDocument Fellowship (around 100 individual members) and the OpenDoc Society (a new group with a Northern European focus, with around 50 organizational members). To put this in perspective, the OpenDocument Foundation, before it changed its mission and dissolved, had only 3 members.
When you consider the range of ODF adoption, especially in Europe and Asia, the strong continuing work on ODF 1.2 in OASIS, and the strong corporate, government and organizational participation demonstrated in the global ODF User Workshop recently held in Berlin, we seem to be making a disproportionate amount of noise over the hysterics of the disintegrating 3-person OpenDocument Foundation.
A number of analysts/journalists/bloggers didn't check their facts and seem to have fallen into the trap, and ascribed a far greater importance to the actions of the Foundation. Curiously, these articles all quoted the same Microsoft Director of Corporate Standards. I hope this correlation does not prove to be a persistent contrary indicator for accuracy in future file format stories.
Luckily for us, David Berlind over at ZDNet has penetrated the confusion and gets it right:
11/27/2009 Update: Berlind did further research and interviews on this topic and followed up with a podcast and new blog post OpenDocument Format Community steadfast despite theatrics of now impotent ‘Foundation’ on this subject.
Q: The Open Document Foundation has a document, a "Universal Interoperability Framework" that on its title page says "Submitted to the OASIS Office Technical Committee by The OpenDocument Foundation October 16, 2007". What is the status of this proposal in the ODF TC? (18 Nov 2007)
A: No such document has been submitted to the OASIS TC, on this date or any other date. OASIS policy states that "Contributions, as defined in the OASIS IPR Policy, shall be made by sending to the TC's general email list either the contribution, or a notice that the contribution has been delivered to the TC’s document repository". A look at the ODF TC's list archive for October shows that there was no such contribution.
Q: The Foundation claims that the W3C's CDF format has better interoperability with MS Office than ODF has. Is this true? (18 Nov 2007)
A: The Foundation's claims have not been demonstrated, or even competently argued at a technical level that would allow expert evaluation. I cannot fully critique what is essentially vaporware. However, those who know CDF better than I do have commented on the mismatch between CDF and office documents, for example the recent interview with the W3C's Chris Lilley in Andy Updegrove's blog.
Q: So, does IBM then oppose CDF in favor of ODF? (18 Nov 2007)
A: No. IBM supports both the development of ODF and CDF and has a leadership role in both working groups. These are two good standards for two different things.
The W3C, over the years has produced a number of reusable, modular core standards for things like vector graphics (SVG), mathematical notation (MathML), forms (XForms), etc. To use a cooking analogy, these are like ingredients that can be combined to make a dish. ODF has taken a number of W3C standards and combined them to make a format for expressing conventional office documents, the familiar word processor, spreadsheet and presentation documents. ODF is an OASIS and ISO standard.
But just as eggs, butter and flour form the base of many recipes, the core W3C standards can be assembled in different ways for different purposes. This is a good thing.
CDF is not so much a final dish, but an intermediate step, like a roux (flour + butter) is when making a sauce. You don't use a roux directly, but build upon it, e.g., add mik to make a béchamel, add cheese for a cheese sauce, etc., CDF itself s not directly consumable. You need to add a WICD profile, something like WICD Mobile 1.0, before you have something a user agent can process.
This inaugural edition is dedicated to the fallout from the recent supernova we know as the OpenDocument Foundation, that in one final act of self-immolation swelled from obscurity to overwhelming brilliance, but then slowly faded away, ever fainter and more erratic, little more than hot gas, the dimming embers no longer sustainable.
Q: Now that the originator and primary supporter of OpenDocument Format has ended its support for ODF, does this mean the end for the ODF standard? (18 Nov 2007)
A: This question is based on a mistaken premise, namely that the OpenDocument Foundation was the originator or steward of the ODF standard. This is an erroneous notion.
The ODF standard is owned by the OASIS standards consortium, with over 600 member organizations and individual members. The committee in OASIS that that does the technical working of maintaining the ODF standard is called the OpenDocument TC. It has 15 organization members as well as 7 individual members. Until recently the OpenDocument Foundation was a member of the ODF TC, one voice among many.
The adoption of the ODF standard is promoted by several organizations, most prominently the ODF Alliance (with over 400 organizational members in 52 countries), the OpenDocument Fellowship (around 100 individual members) and the OpenDoc Society (a new group with a Northern European focus, with around 50 organizational members). To put this in perspective, the OpenDocument Foundation, before it changed its mission and dissolved, had only 3 members.
When you consider the range of ODF adoption, especially in Europe and Asia, the strong continuing work on ODF 1.2 in OASIS, and the strong corporate, government and organizational participation demonstrated in the global ODF User Workshop recently held in Berlin, we seem to be making a disproportionate amount of noise over the hysterics of the disintegrating 3-person OpenDocument Foundation.
A number of analysts/journalists/bloggers didn't check their facts and seem to have fallen into this trap, ascribing far greater importance to the actions of the Foundation than they deserve. Curiously, these articles all quoted the same Microsoft Director of Corporate Standards. I hope this correlation does not prove to be a persistent contrary indicator for accuracy in future file format stories.
Luckily for us, David Berlind over at ZDNet has penetrated the confusion and gets it right:
...the future of the OpenDocument Foundation has nothing to do with the future of the OpenDocument Format. In other words, any indication by anybody that the OpenDocument Format has been vacated by its supporters is pure FUD.
11/27/2009 Update: Berlind did further research and interviews on this topic and followed up with a podcast and a new blog post, "OpenDocument Format Community steadfast despite theatrics of now impotent ‘Foundation’".
Q: The OpenDocument Foundation has a document, a "Universal Interoperability Framework", that on its title page says "Submitted to the OASIS Office Technical Committee by The OpenDocument Foundation October 16, 2007". What is the status of this proposal in the ODF TC? (18 Nov 2007)
A: No such document has been submitted to the OASIS TC, on this date or any other date. OASIS policy states that "Contributions, as defined in the OASIS IPR Policy, shall be made by sending to the TC's general email list either the contribution, or a notice that the contribution has been delivered to the TC’s document repository". A look at the ODF TC's list archive for October shows that there was no such contribution.
Q: The Foundation claims that the W3C's CDF format has better interoperability with MS Office than ODF has. Is this true? (18 Nov 2007)
A: The Foundation's claims have not been demonstrated, or even competently argued at a technical level that would allow expert evaluation. I cannot fully critique what is essentially vaporware. However, those who know CDF better than I do have commented on the mismatch between CDF and office documents, for example the recent interview with the W3C's Chris Lilley in Andy Updegrove's blog.
Q: So, does IBM then oppose CDF in favor of ODF? (18 Nov 2007)
A: No. IBM supports both the development of ODF and CDF and has a leadership role in both working groups. These are two good standards for two different things.
Over the years, the W3C has produced a number of reusable, modular core standards for things like vector graphics (SVG), mathematical notation (MathML), forms (XForms), etc. To use a cooking analogy, these are like ingredients that can be combined to make a dish. ODF has taken a number of W3C standards and combined them to make a format for expressing conventional office documents, the familiar word processor, spreadsheet and presentation documents. ODF is an OASIS and ISO standard.
But just as eggs, butter and flour form the base of many recipes, the core W3C standards can be assembled in different ways for different purposes. This is a good thing.
CDF is not so much a final dish as an intermediate step, like a roux (flour + butter) when making a sauce. You don't use a roux directly, but build upon it, e.g., add milk to make a béchamel, add cheese for a cheese sauce, etc. CDF itself is not directly consumable. You need to add a WICD profile, something like WICD Mobile 1.0, before you have something a user agent can process.
Friday, October 12, 2007
ODF enters the Semantic Web
Metadata is "data about data". Meta from the Greek, μετά, meaning with or after. I suppose if you wanted to sound grand you could pronounce it hyper-correctly with the stress on the second syllable, met-ah'. I've heard some incorrectly pronounce it meet'-ah, perhaps a false analogy with βῆτα = beta. But you never hear anyone pronounce μέγα = mega as mee-guh, do you?
Metadata is not new. It has been around for centuries. In some cases metadata applies to the overall document, while in other cases it applies to only a portion of the content. Examples of the first case include titles of books, footnotes, ISBN numbers, LOC or Dewey Decimal categorizations, keywords, etc. The various forms of scribal marginalia, whether scholia or glosses in the margins of a manuscript, or personal annotations of the owner of a document, are historic examples of the second kind of metadata.
Marginal notes are frequently used today in business forms. A printed form represents, often imperfectly, a snapshot in time of an organization's view of their own process. Maybe the process was approximated, maybe the form was imperfectly designed, maybe it quickly became outdated, but somehow reality seems to outgrow the strictures of a form's blanks and checkboxes. So what do you, as a user, do? You write notes in the margins or other places between form fields and hope that there is a human in the loop someplace to read your words.
In any case, of all documents, forms (originally called "formulary documents") have the most structured representation of data. Enter your social security number into the nine little boxes provided. Enter your date of birth here, Month first, then day, then two-digit year. Last name first, first name last. Everything is nice and simple, and provided your reality matches that which the form designer envisioned, your data will be easy to consume, whether by another person or, after data entry, by various online processes. Or maybe the form was entered online originally? Even better.
But what about all the other documents in the world, the ones that are not formally structured as forms? What sense can we make of them? Can you tell a social security number in a free-form document, or a date, or a zip code? Perhaps with pattern matching, you can find out some simple things. That is the essence of Microsoft's Smart Tags. (And we had much of this in Lotus Agenda a decade earlier.) But this only works for the most trivial cases. It only takes you so far.
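To make the pattern-matching approach concrete, here is a minimal sketch in Python. The three patterns and the sample sentence are my own illustrative assumptions, not how Smart Tags or Lotus Agenda actually work. The sketch shows both what such matching can find and how quickly it goes wrong.

```python
import re

# Illustrative patterns only (assumed for this sketch); real recognizers
# need far more care about formats, locales and context.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?\b"),
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def find_candidates(text):
    """Return (label, matched_text) pairs for every pattern hit in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits


# Hypothetical free-form text. Note the part number that merely looks like a zip code.
sample = "Born 10/12/67, SSN 123-45-6789, lives in 01886. Part no. 90210 shipped."

for label, value in find_candidates(sample):
    print(label, value)
```

The part number is reported as a zip code, because a pattern sees only the shape of the characters and nothing of what the author meant by them.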
What if I wanted to mark up an academic paper, a work-in-progress, to indicate which quotations have been verified and which ones remain to be verified? Or what if I want to annotate statements in recorded testimony according to which statements contradict and which corroborate another witness's statements? This goes far beyond pattern matching. I need a way to encode my knowledge, my view of the subject, in the document.
We have data in a document -- "Words, words, words" as Hamlet tells Polonius. But for those who work with thoughts, the present constraints of encoding our knowledge as rudimentary linear strings of characters are severe. In general, text is multi-layered and hyper-linked in strange and marvelous ways. Your father's word processor and word processor format are inadequate to the task. The concept of a document as being a single store of data that lives in a single place, entire, self-contained and complete is nearing an end. A document is a stream, a thread in space and time, connected to other documents, containing other documents, contained in other documents, in multiple layers of meaning and in multiple dimensions. What we call a traditional document is really just a snapshot in time and space, a projection into print-ready output form of what documents will soon become.
The applications of metadata to business documents are legion. Wherever you have data, you also have the questions of:
- Who entered the data?
- Where did the data come from?
- Who verified the data?
- Who approved the data? Legal? HR? Business?
- Where is this data destined?
- How old is the data? When does it expire?
- How trustworthy is this data?
- Who must we cite as an authority for this data?
- Who owns this data?
- Who has permissions to see this data?
- Who can set policy for this data?
- Who else can edit this data?
- How does this data connect with my business? Is it a part number? The name of a customer or the name of an employee?
OpenDocument Format (ODF) 1.2 will be taking a step into the world of structured metadata with an RDF/XML metadata framework. If that sounds Greek to you, then let's say that a metadata framework enables application developers to create applications that do the above things. A framework doesn't tell you how you must say "This image is provided under a Creative Commons Share-Alike license" but provides a framework for application developers to express concepts like "licensed-under" and "Creative Commons Share-Alike", as well as a formal structure for expressing subject-predicate-object relationships, where the subject can be any of around 50 ODF document elements, such as paragraphs, footnotes, images, tables, etc.
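For those who think better in code, the subject-predicate-object idea can be sketched in a few lines of plain Python. This is not RDF/XML and not the syntax of the ODF 1.2 drafts. The element identifiers and the predicate names "verified-by" and "source" are hypothetical, made up for illustration, and only "licensed-under" comes from the example above.

```python
from collections import namedtuple

# One statement: subject --predicate--> object.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical identifiers for document elements; assumed for this sketch,
# not taken from the ODF 1.2 metadata proposal.
triples = [
    Triple("content.xml#image1", "licensed-under", "Creative Commons Share-Alike"),
    Triple("content.xml#para7", "verified-by", "Legal"),
    Triple("content.xml#para7", "source", "2006 annual report"),
]


def statements_about(subject, store):
    """Return {predicate: object} for every triple with the given subject."""
    return {t.predicate: t.object for t in store if t.subject == subject}


def subjects_with(predicate, obj, store):
    """Return the subjects that carry the given predicate/object pair."""
    return [t.subject for t in store if t.predicate == predicate and t.object == obj]


print(statements_about("content.xml#para7", triples))
print(subjects_with("licensed-under", "Creative Commons Share-Alike", triples))
```

The point of the design is that the vocabulary is open: an application can add a new predicate, say for approval or expiry, without any change to the structure that holds the statements.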
To read more, here are some places to start:
For general background on the "semantic web", a good intro is the 2001 Scientific American article "The Semantic Web" by Tim Berners-Lee, et al.
For a bit more on RDF, the Wikipedia page is pretty good.
Svante Schubert at Sun, also on the ODF Metadata Subcommittee, has a recent blog post worth reading: "New Extensible Metadata Support With ODF 1.2".
Bruce D'Arcus, of the Metadata Subcommittee and co-lead of the OpenOffice.org Bibliographic Project, also contributes his thoughts on the new ODF 1.2 metadata.
If you want to delve into the particulars of ODF 1.2's new metadata support, you can read the latest draft of the proposed changes to the specification [ODF] and the examples [ODF] document. Of course, any feedback on ODF drafts and published standards is welcome on the ODF TC's comment mailing list.
For a gentle introduction to metadata, ODF, where we are coming from and where we are going, I offer this interview [MP3] with Patrick Durusau, Chair of the ODF Metadata Subcommittee, which I recorded back in July.
Sunday, October 07, 2007
Cracks in the Foundation
You must admire their tenacity. Gary Edwards and the pseudonymous "Marbux". The mythology of Silicon Valley is filled with stories of two guys and a garage founding great enterprises. And here we have two guys, and through blogs, interviews, and constant attendance at conferences, they have become some of the most-heard voices on ODF. Maybe it is partly due to the power of the name? The "OpenDocument Foundation" sounds so official. Although it has no official role in the ODF standard, this name opens doors. The