ODF Validation for Dummies

2008/05/02 By Rob 32 Comments

[Updated 4 May 2008, with additional rebuttal at the end]

Alex Brown has a problem. He can’t figure out how to validate ODF documents. Unfortunately, when he couldn’t figure it out, he didn’t ask the OASIS ODF TC for help, which would have been the normal thing to do. Indeed, the ODF TC passed a resolution back in February 2007 that said, in part:

That the ODF TC welcomes any questions from ISO/IEC JTC1/SC34 and
member NB’s regarding OpenDocument Format, the functionality it
describes, the planned evolution of this standard, and its relationship
to other work on the technical agenda of JTC1/SC34. Questions and
comments can be directed to the TC chair and secretary whose email
addresses are given at

http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office

or through the comments facility at

http://www.oasis-open.org/committees/tc_home.php?wg_abbrev=office

So it is rather uncollegial of Alex to refuse such an open, transparent way of getting his questions answered. But Alex didn’t avail himself of that avenue. He just assumed if he couldn’t figure out how to validate ODF then it simply couldn’t be done, and that ODF was to blame. This is presumptuous. Does he think that in the three years since ODF 1.0 became a standard, no one has tried to validate a document?

Alex is so sure of himself that he publicly exults on the claimed significance of his findings:

For ISO/IEC 26300:2006 (ODF) in general, we can say that the standard itself has a defect which prevents any document claiming validity from being actually valid. Consequently, there are no XML documents in existence which are valid to ISO ODF.

Even if the schema is fixed, we can see that OpenOffice.org 2.4.0 does not produce valid XML documents. This is to be expected and is a mirror-case of what was found for MS Office 2007: while MS Office has not caught up with the ISO standard, OpenOffice has rather bypassed it (it aims at its consortium standard, just as MS Office does).

I think you agree that these are bold pronouncements, especially coming from someone so prominent in SC34, the Convenor of the ill-fated OOXML BRM, someone who is currently arguing that SC34 should own the maintenance of OOXML and ODF, indeed someone who would be well served if he could show that all consortia standards are junk, and that only SC34 (and he himself) could make them good.

Of course, I’ve been known to pontificate as well. There is nothing necessarily wrong with that. The difference here is that Alex Brown is totally wrong.

But let’s see if we can help show Alex, or anyone else similarly confused, the correct way to validate an ODF document.

First start with an ODF document. When Alex tested OOXML, he used the Ecma-376 OOXML specification. Let’s do the analogous test and validate the ODF 1.0 text. You can download it from the OASIS ODF web site. You’ll want this version of the text, ODF 1.0 (second edition), which is the source document for the ISO version of ODF.

You’ll also want to download the Relax NG schema files for OASIS ODF 1.0, which you can download in two pieces: the main schema, and the manifest schema.

Next you’ll need to get a Relax NG validator. Alex recommends James Clark’s jing, so we’ll use that. I downloaded jing-20030619.zip, the main distribution for use with the Java Runtime Environment. Unzip that to a directory and we’re almost there.

Since jing operates on XML files and knows nothing about the Zip package structure of an ODF file, you’ll need to extract the XML contents of the ODF file. There are many ways to do this. My preference, on Windows, is to associate WinZip with the ODF file extensions (ODT, ODS and ODP) so I can right-click on these files and unzip them. When you unzip you will have the following XML files, along with directories for image files and other non-XML resources you can ignore:

content.xml
styles.xml
meta.xml
settings.xml
META-INF/manifest.xml
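
If you would rather script the extraction than use WinZip, a minimal Python sketch follows. It assumes only that an ODF file is an ordinary Zip package containing the parts listed above; the function name extract_xml_parts is made up for this illustration, and it skips any part that a given package does not contain.

```python
import zipfile
from pathlib import Path

# The XML parts named in the list above; images and other resources are ignored.
XML_PARTS = [
    "content.xml",
    "styles.xml",
    "meta.xml",
    "settings.xml",
    "META-INF/manifest.xml",
]

def extract_xml_parts(odf_path, out_dir):
    """Extract the XML parts of an ODF package into out_dir.

    Returns the names of the parts that were actually found and extracted.
    (Hypothetical helper, not part of any ODF toolkit.)
    """
    extracted = []
    with zipfile.ZipFile(odf_path) as package:
        present = set(package.namelist())
        for name in XML_PARTS:
            if name in present:
                package.extract(name, Path(out_dir))
                extracted.append(name)
    return extracted
```

Calling extract_xml_parts("mydoc.odt", "unzipped") would leave the XML files under the unzipped directory, ready to hand to a validator.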

So now we’re ready to validate! Let’s start with content.xml. The command line for me was:

java -jar c:/jing/bin/jing.jar OpenDocument-schema-v1.0-os.rng content.xml

(Your command may vary, depending on where you put jing, the ODF schema files and the unzipped ODF files)

The result is a whole slew of error messages:

C:\temp\odf\OpenDocument-schema-v1.0-os.rng:17658:18: error: conflicting ID-types for attribute "targetElement" from namespace "urn:oasis:names:tc:opendocument:xmlns:smil-compatible:1.0" of element "command" from namespace "urn:oasis:names:tc:opendocument:xmlns:animation:1.0"
C:\temp\odf\OpenDocument-schema-v1.0-os.rng:10294:22: error: conflicting ID-types for attribute "targetElement" from namespace "urn:oasis:names:tc:opendocument:xmlns:smil-compatible:1.0" of element "command" from namespace "urn:oasis:names:tc:opendocument:xmlns:animation:1.0"

Oh no! Emergency, emergency, everyone to get from street!

I wonder if this is one of the things that tripped Alex up? Take a deep breath. These in fact are not Relax NG (ISO/IEC 19757-2) errors at all, but errors generated by jing’s default validation of a different set of constraints, defined in the Relax NG DTD Compatibility specification which has the status of a Committee Specification in OASIS. It is not part of ISO/IEC 19757-2.

Relax NG DTD Compatibility provides three extensions to Relax NG: default attribute values, ID/IDREF constraints and a documentation element. The Relax NG DTD Compatibility specification is quite clear in section 2 that “Conformance is defined separately for each feature. A conformant implementation can support any combination of features.” And in fact, ODF 1.0, in section 1.2 does just that: “The schema language used within this specification is Relax-NG (see [RNG]). The attribute default value feature specified in [RNG-Compat] is used to provide attribute default values”.

It is best to simply disable the checking of Relax NG DTD Compatibility constraints by using the documented “-i” flag in jing. If you want to validate ID/IDREF cross-references, then you’ll need to do that in application code, and not using jing in Relax NG DTD Compatibility mode. Note that jing was not complaining about any actual ID/IDREF problem in the ODF document.

So, false alarm. You can walk safely on the streets now.

(That said, if we can make some simple changes to the ODF schemas that will allow it to work better with the default settings of jing, or other popular tools, then I’m certainly in favor of that. Alex’s proposed changes to the schema are reasonable and should be considered.)

So, let’s repeat the validation with the -i flag:

java -jar c:/jing/bin/jing.jar -i OpenDocument-schema-v1.0-os.rng content.xml

Zero errors, zero warnings.

java -jar c:/jing/bin/jing.jar -i OpenDocument-schema-v1.0-os.rng styles.xml

Zero errors, zero warnings.

java -jar c:/jing/bin/jing.jar -i OpenDocument-schema-v1.0-os.rng meta.xml

Zero errors, zero warnings.

java -jar c:/jing/bin/jing.jar -i OpenDocument-schema-v1.0-os.rng settings.xml

Zero errors, zero warnings.

java -jar c:/jing/bin/jing.jar -i OpenDocument-manifest-schema-v1.0-os.rng META-INF/manifest.xml

Zero errors, zero warnings.
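
The five invocations above can also be driven from a short script. The Python sketch below just assembles the same command lines, with the same paths and schema file names; it assumes jing signals validation errors through a non-zero exit status, so check that against your copy of jing before relying on it. The names jing_command and validate_all are invented for this example.

```python
import subprocess

JING_JAR = "c:/jing/bin/jing.jar"  # location used above; adjust to your install
MAIN_SCHEMA = "OpenDocument-schema-v1.0-os.rng"
MANIFEST_SCHEMA = "OpenDocument-manifest-schema-v1.0-os.rng"

# Each XML part paired with the schema it is validated against.
PARTS = [
    (MAIN_SCHEMA, "content.xml"),
    (MAIN_SCHEMA, "styles.xml"),
    (MAIN_SCHEMA, "meta.xml"),
    (MAIN_SCHEMA, "settings.xml"),
    (MANIFEST_SCHEMA, "META-INF/manifest.xml"),
]

def jing_command(schema, xml_file, jar=JING_JAR):
    """Build the jing command line, with -i to skip DTD Compatibility checks."""
    return ["java", "-jar", jar, "-i", schema, xml_file]

def validate_all(run=subprocess.run):
    """Run jing on every part; map each file to True if the exit status was 0.

    Assumption: a non-zero exit status means jing reported a problem.
    """
    results = {}
    for schema, xml_file in PARTS:
        completed = run(jing_command(schema, xml_file))
        results[xml_file] = completed.returncode == 0
    return results
```

Run validate_all() from the directory holding the schemas and the unzipped XML files; jing still prints any messages to the console as usual.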

So, there you have it, an example that shows that there is at least one document in the universe that is valid to the ODF 1.0 schema, disproving Alex’s statement that “there are no XML documents in existence which are valid to ISO ODF.”

The directions are complete and should allow anyone to validate the ODF 1.0 specification, or any other ODF 1.0 document. Now that we have the basics down, let’s work on some more advanced topics.

First, the reader should note that there are two versions of the ODF schema, the original 1.0 from 2005, and the updated 1.1 from 2007. (There is also a third version underway, ODF 1.2, but that needn’t concern us here.)

An application, when it creates an ODF document, indicates which version of the ODF standard it is targeting. You can find this indication if you look at the office:version attribute on the root element of any ODF XML file. The only values I would expect to see in use today would be “1.0” and “1.1”. Eventually we’ll also see “1.2”.
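
Reading that attribute takes only a few lines. This is a sketch in Python using the standard library’s ElementTree; the office namespace URI is the one ODF defines, while the function name odf_version is just a label for this example. It returns None when the attribute is absent.

```python
import xml.etree.ElementTree as ET

# Namespace that the "office" prefix is bound to in ODF files.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

def odf_version(xml_text):
    """Return the office:version value on the root element, or None if absent."""
    root = ET.fromstring(xml_text)
    return root.get("{%s}version" % OFFICE_NS)
```

For a file on disk, read its contents first and pass them in, for example odf_version(open("content.xml", "rb").read()).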

It is important to use the appropriate version of the ODF schema to validate a particular document. Our goal, as we evolve ODF, is that an application that knows only about ODF 1.0 should be able to adapt and “degrade gracefully” when given an ODF 1.1 document, by ignoring the features it does not understand. But an application written to understand ODF 1.1 should be able to fully understand ODF 1.0 documents without any additional accommodation.

Put differently, from the document perspective, a document that conforms to ODF 1.0 should also conform to ODF 1.1. But the reverse direction is not true.

To accomplish this, as we evolve ODF, within the 1.x family of revisions, we try to limit ourselves to changes that widen the schema constraints, by adding new optional elements, or new attribute values, or expanding the range of values permitted. Constraint changes that are logically narrowing, like removing elements, making optional elements mandatory, or reducing the range of allowed values, would break this kind of document compatibility.

Now of course, at some point we may want to make bolder changes to the schema, but this would be in a major release, like a 2.0 version. But within the ODF 1.x family we want this kind of compatibility.

The net of this is, an ODF 1.1 document should only be expected to be valid to the ODF 1.1 schema, but an ODF 1.0 document should be valid to the ODF 1.0 and the ODF 1.1 schemas.
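
Stated as code, the rule is simply “the document’s own version and anything newer within the 1.x family”. The Python sketch below encodes that expectation; it is an illustration of the compatibility goal described here, not something taken from the ODF specification, and the names are invented.

```python
# ODF 1.x schema versions, ordered oldest to newest.
KNOWN_VERSIONS = ["1.0", "1.1"]

def schemas_expected_to_accept(doc_version):
    """List the schema versions a document of doc_version should validate against.

    Encodes the goal that each 1.x revision only widens the schema, so a
    document stays valid against its own version and every later one.
    """
    if doc_version not in KNOWN_VERSIONS:
        raise ValueError("unknown ODF version: %r" % doc_version)
    start = KNOWN_VERSIONS.index(doc_version)
    return KNOWN_VERSIONS[start:]
```

So a “1.0” document maps to both schemas, while a “1.1” document maps only to the 1.1 schema.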

That’s enough theory! Let’s take a look now at the test that Alex actually ran. It is a rather curious, strangely biased kind of test, but the bad thinking is interesting enough to devote some time to examine in some detail.

When he earlier tested OOXML, Alex used the OOXML standard itself, a text on which Microsoft engineers had lavished many person-years of attention for the past 18 months, and he validated it with the current version of the OOXML schema. That is pretty much the best case, testing a document that has never been out of Microsoft’s sight for 18 months and testing it with the current version of the schema. I would expect that this document would have been a regular test case for Microsoft internally, and that its validity has been repeatedly and exhaustively tested over the past 18 months. I know that I personally tested it when Ecma-376 was first released, since it was the only significant OOXML document around. So, essentially Alex gave OOXML the softest of all soft pitches.

I think Microsoft’s response, that the validity errors detected by Alex are due to changes made to the schema at the BRM, is a reasonable and accurate explanation. The real story on OOXML standardization is not how many changes were made that were incompatible with Office 2007, but how few. It appears that very few changes, perhaps only one, will be required to make Office 2007’s output be valid OOXML.

So when testing ODF, what did Alex do? Did he use the ODF 1.0 specification as a test case, a document that the OASIS TC might have had the opportunity to give a similar level of attention to? No, he did not, although that would have validated perfectly, as I’ve demonstrated above. Instead, Alex uses the OOXML specification, a document which by his own testing is not valid OOXML, then converts it into the proprietary .DOC binary format, then translates that binary format into ODF and then tries to validate the results with the ODF 1.0 schema (i.e., the wrong version of the ODF schema since OpenOffice 2.4.0’s output is clearly declared as ODF 1.1), and then applies a non-applicable, non-standard DTD Compatibility constraint test during the Relax NG validation.

Does anyone see something else wrong with this testing methodology?

Aside from the obvious bias of using an input document that Microsoft has spent 18 months perfecting, and using the wrong schemas and validator settings, there is another, more subtle problem.

Alex’s tests of OOXML and ODF are testing entirely different things. With OOXML, he took a version N (Ecma-376) OOXML document and tried to validate it with a version N+1 (ISO/IEC 29500) version of the OOXML schema.

But what he did with ODF was take a version N+1 (ODF 1.1) document and try to validate it with a version N (ODF 1.0) of the ODF schema.

These are entirely different operations. One test is testing the backwards compatibility of the schema, the other is testing the backwards compatibility of document instances. It takes no genius to figure out that if ODF 1.1 adds new elements, then an ODF 1.1 document instance will not validate with the ODF 1.0 schema. We don’t ordinarily expect backwardly compatible validity of document instances. Again, Alex’s tests are biased in OOXML’s favor, giving ODF a much more difficult, even impossible task, compared to the test run for OOXML.

If we want to compare apples to apples, it is quite easy to perform the equivalent test with ODF. I gave it a try, taking a version N document (the ODF 1.0 standard itself, per above) and validated it with the version N+1 schema (ODF 1.1 in this case). It worked perfectly. No warnings, no errors.

In any case, in his backwards test Alex reports 7,525 errors, “mostly of the same type (use of an undeclared soft-page-break element)” when validating the OOXML text with ODF 1.0 schema. Indeed, all but 39 of these errors are reports of soft-page-break.
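
When a validator reports thousands of errors, grouping them by message quickly shows how few distinct problems are involved. Here is a small Python sketch that does this; it assumes each error line has the shape file:line:column: error: message, which is inferred from the jing output quoted earlier in this post rather than from jing’s documentation.

```python
import re
from collections import Counter

# Assumed shape of one report line: <file>:<line>:<column>: error: <message>
ERROR_LINE = re.compile(r"^(?P<file>.+):(?P<line>\d+):(?P<col>\d+): error: (?P<message>.*)$")

def tally_errors(report):
    """Count how often each distinct error message occurs in a validator report."""
    counts = Counter()
    for line in report.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            counts[match.group("message")] += 1
    return counts
```

Passing the saved output through tally_errors(report).most_common() lists the messages from most to least frequent.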

Soft page breaks are a new feature introduced in ODF 1.1. It has two primary advantages for accessibility. First it allows easier collaboration between people using different technologies to read a document. Not all documents are deeply structured, with formal divisions like section 3.2.1, etc. Most business documents are loosely structured, and collaboration occurs by referring to “2nd paragraph on page 23” or “the bottom of page 18”. But when using different assistive technologies, from larger fonts, to braille, to audio renderings, the page breaks (if the assistive technology even has the concept of a page break) are usually located differently from the page breaks in the original authoring tool. This makes collaboration difficult. So, ODF 1.1 added the ability for applications to write out “soft” page breaks, indicating where the page breaks occurred when the original source document was saved.

Although this feature was added for accessibility reasons, like curb cuts, its likely future applications are more general. We will all benefit. For example, a convertor for translating from ODF to HTML would ordinarily only be able to calculate the original page breaks by undertaking complex layout calculations. But with soft page breaks recorded, even a simple XSLT script can use this information to insert indications of page breaks, or to generate accurate page numbering, etc. Although the addition of this feature hinders Alex’s idiosyncratic attempt to validate ODF 1.1 documents with the ODF 1.0 schema, I think the fact that this feature helps blind and visually impaired users, and generally improves collaboration makes it a fair trade-off.
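
To give a feel for how little code is needed once the breaks are recorded, here is a rough Python sketch (standing in for the XSLT script mentioned above) that labels each paragraph in content.xml with the page it starts on. It assumes the ODF 1.1 text:soft-page-break element and treats every such element as the start of a new page; hard page breaks set through styles are ignored, so treat the numbers as approximate.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
SOFT_BREAK = "{%s}soft-page-break" % TEXT_NS
PARAGRAPH = "{%s}p" % TEXT_NS

def paragraphs_with_pages(content_xml):
    """Return (page_number, paragraph_text) pairs in document order.

    Assumption: each text:soft-page-break starts a new page, and pages
    are numbered from 1. Hard page breaks are not taken into account.
    """
    root = ET.fromstring(content_xml)
    page = 1
    result = []
    for element in root.iter():  # document order
        if element.tag == SOFT_BREAK:
            page += 1
        elif element.tag == PARAGRAPH:
            result.append((page, "".join(element.itertext())))
    return result
```

A paragraph that contains a break in its middle is reported on the page where it begins.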

Wouldn’t you agree?

That leaves 39 validation errors in Alex’s test. 12 of them are reports of invalid values in an xlink:href attribute value. This appears to be an error in the original DOCX file. Garbage In, Garbage Out. For example, in one case the original document has a HYPERLINK field that contains a link to content in Microsoft’s proprietary CHM format (Compiled HTML). The link provided in the original document does not match the syntax rules required for an XML Schema anyURI (the URL ends with “##” rather than “#”). Maybe it is correct for markup like this, with non-standard, non-interoperable URIs, to give validation errors. This is not the first time that OOXML has been found polluting XML with proprietary extensions. But realize that OpenOffice 2.4.0 did not create this error. OpenOffice is just passing the error along, as Office 2007 saved it. It is interesting to note that this error was not caught in MS Office, and indeed is undetectable with OOXML’s lax schema. But the error was caught with the ODF schema. This is a good thing, yes? It might be a good idea for OpenOffice to add an optional validation step after importing Microsoft Office documents, to filter out such data pollution.
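
A filter of that kind could begin with very simple checks. The Python sketch below tests only the one rule at issue here, namely that “#” separates a URI from its fragment and so may appear at most once (per the generic URI grammar in RFC 3986). It is nowhere near a full anyURI validator, and the function name is made up for this example.

```python
def has_at_most_one_fragment_delimiter(href):
    """Return True if the link contains no more than one '#'.

    Partial check only: '#' introduces the fragment and cannot occur
    inside it, so a link ending in '##' fails this test.
    """
    return href.count("#") <= 1
```

A link ending in “##”, like the one described above, would be flagged, while the same link with a single “#” would pass.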

The remaining validation errors are 27 instances of style:with-tab. Honestly, I have no explanation for this. This attribute does not exist in ODF 1.0 or ODF 1.1. That it is written out appears to be a bug in OpenOffice. Maybe someone there can tell us what the story is on this? But I don’t see this problem in all documents, or even most documents.

For fun I tried processing this OOXML document another way. Instead of the multi-hop OOXML-to-DOC-to-ODF conversion Alex did, why not go directly from OOXML to ODF in one step, using the convertor that Microsoft/CleverAge created? This should be much cleaner, since it doesn’t have all the legacy code or messiness of the binary formats or legacy application code. It is just a mapping from one markup to another markup, written from scratch. Getting the output to be valid should be trivial.

So I download the “OpenXML/ODF Translator Command Line Tools” from SourceForge. According to their web page, this tool targets ODF 1.0, so we’ll be validating against the ODF 1.0 schemas.

This tool is very easy to use once you have the .NET prerequisites installed. The command line was:

odfconvertor /I "Office Open XML Part 4 - Markup Language Reference.docx"

The convertor then chugs along for a long, long, long time. I mean a long time. The conversion from OOXML to ODF eventually finished, after 11 hours, 10 minutes and 41 seconds! And this was on a Thinkpad T60p with dual-core Intel 2.16Ghz processor and 2.0 GB of RAM.

I then ran jing, using the validation command lines from above. It reported 376 validation errors, which fell into several categories:

text:s element not allowed in this context
bad value for text:style-name
bad value for text:outline-level
bad value for svg:x
bad value for svg:y
element text:tracked-changes not allowed in this context
“text not allowed here”

In any case, not a lot of errors, but a handful of errors repeated. But it is surprising to see that this single-purpose tool, written from scratch, had more validation errors in it than OpenOffice 2.4.0 does.

In the end we should put this in perspective. Can OpenOffice produce valid ODF documents? Yes, it can, and I have given an example. Can OpenOffice produce invalid documents? Yes, of course. For example when it writes out a .DOC binary file, it is not even well-formed XML. And we’ve seen one example, where via a conversion from OOXML, it wrote out an ODF 1.1 document that failed validation. But conformance for an application does not require that it is incapable of writing out an invalid document. Conformance requires that it is capable of writing out a valid document. And of course, success for an ODF implementation requires that its conformance to the standard is sufficient to deliver on the promises of the standard, for interoperability.

It is interesting to recall the study that Dagfinn Parnas did a few years ago. He analyzed 2.5 million web pages. He found that only 0.7% of them were valid markup. Depending on how you write the headlines, this is either an alarming statement on the low formal quality of web content, or a reassuring thought on the robustness of well-designed applications and systems. Certainly the web seems to have thrived in spite of the fact that almost every web page is in error according to the appropriate web standards. In fact I promise you that the page you are reading now is not valid, and neither is Alex Brown’s, nor SC34’s, nor JTC1’s, nor Ecma’s, nor ISO’s, nor the IEC’s.

So I suggest that ODF has a far better validation record than HTML and the web have, and that is an encouraging statement. In any case, Alex Brown’s dire pronouncements on ODF validity have been weighed in the balance and found wanting.

4 May 2008

Alex has responded on his blog with “ODF validation for cognoscenti”. He deals purely with the ID/IDREF/IDREFS questions in XML. He does not justify his biased and faulty testing methodology, nor does he reiterate his bold claims that there are no valid ODF 1.0 documents in existence.

Since Alex’s blog does not seem to be allowing me to comment, I’ll put here what I would have put there. I’ll be brief because I have other fish to fry today.

Alex, no one doubts that ID/IDREF/IDREFS constraints must be respected by valid ODF document instances. I never suggested otherwise. But what I do state is that this is not a concern of a Relax NG validator. You can read James Clark saying the same thing in his 2001 “Guidelines for using W3C XML Schema Datatypes with RELAX NG“, which says in part:

The semantics defined by [W3C XML Schema Datatypes] for the ID, IDREF and IDREFS datatypes are purely lexical and do not include the cross-reference semantics of the corresponding [XML 1.0] datatypes. The cross-reference semantics of these datatypes in XML Schema comes from XML Schema Part 1. Furthermore, the [XML 1.0] cross-reference semantics of these datatypes do not fit into the RELAX NG model of what a datatype is. Therefore, RELAX NG validation will only validate the lexical aspects of these datatypes as defined in [W3C XML Schema Datatypes].

Validation of ID/IDREF/IDREFS cross-reference semantics is not the job of Relax NG, and you are incorrect to suggest otherwise. Your logic is also deficient when you take my statement of that fact and derive the false statement that I believe that ID/IDREF semantics do not apply to ODF. One does not follow from the other.

You know, as much as anyone, that conformance is a complex topic. One does not ordinarily expect, except in trivial XML formats, that the complete set of conformance constraints will be expressed in the schema. Typically a multi-layered approach is used, with some syntax and structural constraints expressed in XML Schema or Relax NG, some business constraints in Schematron, and maybe even some deeper semantic constraints that are expressed only in the text of the standard and can only be tested by application logic.

For example, a document that defines a cryptographic algorithm might need to store a prime number. The schema might define this as an integer. The fact that the schema does not state or guarantee that it is a prime number is not the fault of the schema. And the inability of a Relax NG validator to test primality is not a defect in Relax NG. The primality test would simply need to be carried out at another level, with application logic. But the requirement for primality in document instances can still be a conformance requirement and it is still testable, albeit with some computational effort, in application logic.
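
To make the layering concrete, here is a small Python sketch of that application-level check. The schema layer is represented by nothing more than parsing the value as an integer; the primality test is plain trial division, chosen for clarity rather than speed, and the function names are illustrative only.

```python
def is_prime(n):
    """Trial-division primality test; fine for illustration, slow for large n."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3
    if n % 2 == 0:
        return False
    divisor = 3
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 2
    return True

def check_prime_field(value_text):
    """Application-level conformance check for a field the schema types as integer.

    int() stands in for what the schema already guarantees; is_prime() is the
    extra constraint that no Relax NG validator would test.
    """
    return is_prime(int(value_text))
```

A value such as “91” would pass schema validation as an integer yet fail this check, since 91 is 7 times 13.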

I believe that is the source of your confusion. The initial errors you saw when running jing with the Relax NG DTD Compatibility flag enabled were not errors in the ODF document instances. What you saw was jing reporting that it could not apply the Relax NG DTD Compatibility ID/IDREF/IDREFS constraint checks using the ODF 1.0 schema. That in no way means that the constraints defined in XML 1.0 are not required on ODF document instances. It simply indicates that you would need to verify these constraints using means other than Relax NG DTD Compatibility.
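
One such other means is a few lines of application code. The Python sketch below checks the two XML 1.0 rules in question, that ID values are unique and that every IDREF or IDREFS value points at an existing ID. Which attributes carry IDs and references depends on the schema, so the caller supplies those names; the function and the attribute names in the usage note are hypothetical, not taken from ODF.

```python
import xml.etree.ElementTree as ET

def check_id_references(xml_text, id_attrs, idref_attrs):
    """Check ID uniqueness and IDREF/IDREFS resolution in application code.

    id_attrs and idref_attrs are attribute names (use "{namespace}local"
    form for namespaced attributes). Returns (duplicates, dangling):
    ID values seen more than once, and referenced values with no matching ID.
    """
    root = ET.fromstring(xml_text)
    seen = set()
    duplicates = []
    referenced = []
    for element in root.iter():
        for name in id_attrs:
            value = element.get(name)
            if value is not None:
                if value in seen:
                    duplicates.append(value)
                seen.add(value)
        for name in idref_attrs:
            value = element.get(name)
            if value is not None:
                # IDREFS is a whitespace-separated list; IDREF is the one-item case.
                referenced.extend(value.split())
    dangling = [value for value in referenced if value not in seen]
    return duplicates, dangling
```

With made-up attribute names, check_id_references(text, ["id"], ["ref"]) returns two empty lists for a document whose cross-references are all in order.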

So I wonder, have you actually found ODF document instances, say written from OpenOffice 2.4.0, which have ID/IDREF/IDREFS usage which violates the constraints expressed in ODF 1.0?

Finally, in your professional judgment, do you maintain that this is an accurate statement: “For ISO/IEC 26300:2006 (ODF) in general, we can say that the standard itself has a defect which prevents any document claiming validity from being actually valid. Consequently, there are no XML documents in existence which are valid to ISO ODF.”

Comments

Anonymous says

2008/05/02 at 3:50 pm

Why does OpenDocument 1.1 use the same namespaces as ODF 1.1.
This is possibly recipe for disaster in implementations as you will have the same names for different schemas.

Rob says

2008/05/02 at 4:34 pm

I think you mean to ask, why do ODF 1.0 and ODF 1.1 share the same namespace URI?

Why do XML 1.0, XML 1.0 (second edition), XML 1.0 (third edition), XML 1.0 (fourth edition) and soon XML 1.0 (fifth edition) all share the same version attribute value?

The argument, and it isn’t always an easy call to make, is this is better than the alternative. There would be more confusion and more “broken” documents if you updated the XML version attribute with every small change to the standard.

Similarly with ODF, the observation is that ODF 1.1 differs very slightly from ODF 1.0. If the namespace indicates the “type” of a document, then we’re saying that ODF 1.0 and ODF 1.1 can typically be processed as being of the same type. So a robust application that processes ODF 1.0 should be able to handle an ODF 1.1 document without even knowing that it is not ODF 1.0. However, if we changed the namespace URI, then most every ODF 1.0 application would fail when reading ODF 1.1 documents.

An application that does not feel like being robust can always check the office:version attribute and reject document versions that they don’t handle.

When we make more extensive, incompatible changes, say in ODF 2.0 then we’ll surely want to update the namespace.

As for a “recipe for disaster”, ODF 1.1 has been out for over a year now, and is written by applications like OpenOffice and Symphony, while ODF 1.0 is written by applications like Google Docs and the CleverAge Add-ins for Office. If this was disaster-prone, I think we would have seen something by now. But all I’ve seen are errors made by people who didn’t bother to read the standard.

Paul Merrell ("Marbux") says

2008/05/02 at 8:56 pm

Rob said: “First, the reader should note that there are two versions of ODF, the original OASIS 1.0 from 2005, and the updated ODF 1.1 from 2007.”

Correction here. There is also OASIS ODF 1.0 Second Edition (2006), which may be identical to ISO:26300-2006 except for the ISO/IEC document header. (I have not confirmed this with a diff.)

The second edition is supposed to reflect the changes from OASIS ODF v. 1.0 made by JTC 1, including the dropping of the only interoperability requirements in the specification through substitution of ISO/IEC Directives requirements keywords definitions for the RFC 2119 definitions used by virtually all XML languages other than ODF and OOXML.

It is the OASIS ODF v. 1.0 Second Edition version that you linked to as the appropriate download for the ISO/IEC:26300 spec. The official ISO/IEC version is a free download from this page, near the bottom. http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html (Notice that the standards are arranged numerically on that page.)

Short story: assuming ISO/IEC:26300 and OASIS ODF v. 1.0 Second Edition are in fact the same, there are three versions of ODF not counting the forthcoming ODF v. 1.2, not two.

Anonymous, a better namespace question is why ODF uses unique OASIS namespaces for incorporated subsets of XSL-FO, SMIL, and SVG (plus an extension to SVG). See ODF section 1.3 (same in all versions).

Rob and IBM were not involved with ODF when those decisions were made. But these namespace issues represent incompatibilities with W3C standards that should be repaired.

Rob says

2008/05/02 at 9:30 pm

Marbux,

Yes, there is an ODF 1.0 as well as ODF 1.0 (second edition), but there were no schema changes between the two, so in terms of XML validation they are identical.

Inigo says

2008/05/03 at 3:31 am

Hi Rob,

I suggested in the comments to Alex’s previous Office 2007 conformance post that he used the OOXML document in the test of OpenOffice conformance – while you argue above that the fairest test is to validate against the spec document for each format, I think it’s fairest to validate against the same document in each format, so the same features are exercised.

It’s ironic, but not entirely surprising, that the current best method of interoperability between the two XML formats is via the .doc binary format.

Did the doc you converted from OOXML to ODF via the odfconvertor successfully open in OpenOffice?

Inigo

Rob says

2008/05/03 at 9:13 am

Hi Inigo,

My assertion is that Alex did not perform an apples to apples test. The relationship between the document instance and the schema are reversed in his ODF test versus his OOXML test.

In the OOXML test, he was testing a more recent version of the schema than the one the document was targeting. This should ordinarily work, since we typically evolve schemas with backwards compatibility as a goal. That is why 99.9% of the changes at the BRM did not introduce incompatibilities.

However with ODF he tested an older version of the schema than the one the ODF document was targeting. This is an entirely different situation. When we evolve schemas we typically don’t expect new document instances to validate with old schemas. Would you expect a typical HTML 4.01 document to be valid according to HTML 3.2? Of course not.

I agree that the fairest test is to test the “same” document in both cases, but I think that would need to account for the relationship between the schema and the document instances as well, as part of the definition of sameness.

Also, you would want to pick a document unseen by either committee. Picking a DOCX file that Microsoft has spent the past 18 months validating is not exactly a fair test case for ODF.

As to your other question, I did load the DOC file into OpenOffice, when I reproduced Alex’s test. But I didn’t load the ODF file that was translated by CleverAge. And I’ve deleted it since. Doh! I can run the conversion overnight again and give it a try.

Anonymous says

2008/05/03 at 9:38 am

Hi Rob !
I was fascinated by this article. You have a real knack for producing ‘knock-out punches’ for the FUD spewing from the MS camp.

This article got me wondering, though….

So far testing has always been about Microsoft in that the testing I’ve read always starts with a big Microsoft document that has to be converted (at considerable risk) to ODF in order to compare OOXML & ODF validations. … And then (as you say), MS documents seem to be carefully chosen such that they are known to validate well or such that they expect to validate well, thereby giving MS the ‘home field advantage’ at all times.

What would happen if the validation test started with an ODF document downloaded more-or-less at random, included spreadsheets, powerpoints, or (if a text document) included some of the ‘more powerful’ features that MS is always bragging about and the MS translation tools were used to convert it to docx before validation of DIS29500 ? (In other words a real-world test ?)

My expectation is that this would fail miserably, but having the numbers and results available would be something that could be pointed to to contradict the MS marketing FUD about how MS is ISO-compliant, and would allow Alex to find any flaws in your procedures or to try to spin the results and dig himself further into the ‘credibility hole’ than he already is.

The major advantage I can see to this is that it would give a repeatable benchmark of how ‘interoperable’ the MS converters really are and how effective those converters can be expected to be in real life. Especially since Microsoft is espousing them so loudly…

Rob says

2008/05/03 at 10:34 am

Interesting idea. But introducing a document conversion distorts the test in its own way. If translators were perfect and did a 1-to-1 lossless conversion, then it would be OK. But as they are today, convertors filter functionality. If double underlined text is not supported in the convertor, then it will be dropped as part of the conversion process. So the converted document has typically less functionality than the original.

Also, all of the convertors today are far less mature than the editor applications, so I’d expect flaws to be introduced in the document during conversion as well.

But an interesting test would be to download a random selection of OOXML files from the web, and a random selection of ODF files from the web, and validate each file with the appropriate schema version. You can tell ODF versions from the office:version attribute. There is a similar value in OOXML files as well.
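A minimal sketch in Python of that version lookup, assuming the file is an ordinary ODF ZIP package whose content.xml carries the attribute on its root element. The function name odf_version is only illustrative, and a document that omits the attribute simply yields None.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace URI bound to the office: prefix in OpenDocument files.
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def odf_version(package_bytes, member="content.xml"):
    """Return the office:version value of one XML stream, or None if absent."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        root = ET.fromstring(package.read(member))
    return root.get("{%s}version" % OFFICE_NS)
```

Reading the package from bytes keeps everything in memory, so a crawler could call this on each downloaded file before deciding which schema to validate against.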

The argument that only the ISO versions of the schema matters or should be considered rings hollow. If approval by ISO/IEC once was an indicator of quality, the recent JTC1 corruption incidents have eroded that brand identity away. As I see it today, the title of OASIS Standard is a far greater quality mark and indicator of an open standard than the title “ISO/IEC”.

Reply
Anonymous says

2008/05/03 at 11:40 am

Hi Rob. Thank you so much for your article.

I finally was able to understand how to validate OpenDocumentFormat files and in fact I just downloaded the RNG’s for ODF 1.1 and validated a few files with no errors following your suggestions. I have two questions:

* which validators beside jing are out there good enough for the task?
* do you have any idea if there is a way in Openoffice.org 2.4 to save the document in ODF 1.0 rather than in ODF 1.1? This might be an issue for the ISO believers;)

Reply
Anonymous says

2008/05/03 at 2:01 pm

Hi Rob,

while I agree with your assessment of the ISO/IEC brand, going forward, I’d be less likely to accept the ECMA standards as ‘independent’ than ISO/IEC and less likely to accept ISO/IEC than OASIS.

The reason is that ECMA has demonstrated (even more than ISO/IEC) that it can – and has – been for sale to anyone with sufficient money.

I agree that random samplings of OOXML & ODF documents would be better than converter-output, but I’m also concerned that Microsoft’s much-ballyhooed converters be either verified to be of a usable, production quality or that they be exposed for the sham that I would expect them to be.

You see, I expect (and still believe) previous predictions that MS will intentionally corrupt translators such that ODF documents do not appear correctly after OOXML conversion as a way of discrediting ODF. I’ve so far seen no evidence to refute that view, and if it is to be confirmed, the sooner the proof gets out for dissemination, the better. (Perhaps this type of testing is not the way to validate converters, though since as you (rightly) point out, they can (and should) drop features rather than convert them with errors.)

Anyway, I’m quickly becoming a fan of your blog because of the proofs you offer that use real code and real examples to refute the MS camp’s hand-waving and misleading claims based on flawed techniques and flawed understandings.

Keep up the good work !

Reply
Rob says

2008/05/03 at 2:25 pm

I’ve started a page on the OASIS ODF TC’s wiki on how to validate ODF documents. This is a more concise, less polemical version of the instructions in this blog post.

Currently it has instructions on jing. I’ll also add instructions on how to set up the Oxygen XML editor for ODF editing. It uses jing behind the scenes.

I know other TC members use other validators like Sun’s Multi-Schema Validator. I’ve invited them to add their own instructions, tips and tricks.

The other piece that would be nice is a short piece of Java or Python code to wrap it all together: take the ODT file directly, extract the XML, perhaps in memory rather than to disk, find the office:version attribute and then invoke jing with the correct schema. Repeat for all of the XML files in an ODF document and report the results, maybe even categorized by error type, so Alex would get a single message saying there were 7,000 instances of unknown soft-page-break rather than 7,000 instances of that message repeated.
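Something along these lines might do as a starting point in Python. It is a sketch under several assumptions, not a finished tool: the schema file names are hypothetical local paths, jing is assumed to be on the PATH and to report problems as "file:line:col: error: text", and each stream is written to a temporary file because the jing command line takes file names.

```python
import collections
import io
import re
import subprocess
import tempfile
import xml.etree.ElementTree as ET
import zipfile

OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"

# Hypothetical local paths to the schemas; adjust to wherever you saved them.
SCHEMAS = {
    "1.0": "OpenDocument-schema-v1.0-os.rng",
    "1.1": "OpenDocument-schema-v1.1.rng",
}

# Assumed shape of a jing message: "<file>:<line>:<col>: error: <text>".
MESSAGE = re.compile(r"^.*?:\d+:\d+: (?:error|fatal|warning): (.*)$")


def xml_members(package_bytes):
    """Yield (name, bytes, office:version) for each top-level XML stream."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for name in package.namelist():
            if "/" in name or not name.endswith(".xml"):
                continue  # skips META-INF/manifest.xml, which has its own schema
            data = package.read(name)
            version = ET.fromstring(data).get("{%s}version" % OFFICE_NS)
            yield name, data, version


def run_jing(data, version, default="1.0"):
    """Write one stream to a temp file and validate it with jing -i."""
    schema = SCHEMAS[version or default]  # falling back to 1.0 is an assumption
    with tempfile.NamedTemporaryFile(suffix=".xml") as tmp:
        tmp.write(data)
        tmp.flush()
        result = subprocess.run(["jing", "-i", schema, tmp.name],
                                capture_output=True, text=True)
    return (result.stdout + result.stderr).splitlines()


def summarize(lines):
    """Collapse repeated messages into (message, count) pairs, most common first."""
    counts = collections.Counter()
    for line in lines:
        match = MESSAGE.match(line)
        counts[match.group(1) if match else line] += 1
    return counts.most_common()
```

A caller would loop over xml_members(open("report.odt", "rb").read()), pass each stream to run_jing, and print summarize() of the combined output, so 7,000 identical complaints show up as one line with a count.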

I’m not aware of a way to target ODF 1.0 from within OpenOffice. This sounds like a good feature request to submit to the project.

Reply
Jirka Kosek says

2008/05/03 at 3:33 pm

There is no claim whatsoever that a conformant ODF 1.0 document will conform to the ID/IDREF constraints defined in Relax NG DTD Compatibility. It is puzzle to me why anyone would automatically assume otherwise.

Hmm, interesting bending of the ODF specification. Why then does ODF use ID attributes in the schema at all, if you are telling us that they should not be treated as ID attributes?

And by the way you are completely messing conformance of RELAX NG validator implementing DTD compatibility with conformance of ODF spec, its schema and implementations of ODF near this part of your article.

Honestly, the ODF 1.0 schema is broken in the aspect pointed out by Alex, and you and the ODF TC should simply accept this fact and fix the schema instead of doing mental and word gymnastics to hide it. The bug in ODF is quite common, and I have had to correct it in many other schemas for compound documents which I have reviewed or created.

Reply
Rob says

2008/05/03 at 6:04 pm

Jirka

I do not see this as mental gymnastics. I am simply trying to explain a complex topic in a way that is rigorous as well as readable. I am sorry if you find this difficult.

In any case, you will not find any mention of ID/IDREF in the Relax NG specification. It is not a Relax NG concept. Some good background reading on the topic is James Clark’s 2001 “Guidelines for using W3C XML Schema Datatypes with RELAX NG” which says in part:

“The semantics defined by [W3C XML Schema Datatypes] for the ID, IDREF and IDREFS datatypes are purely lexical and do not include the cross-reference semantics of the corresponding [XML 1.0] datatypes. The cross-reference semantics of these datatypes in XML Schema comes from XML Schema Part 1. Furthermore, the [XML 1.0] cross-reference semantics of these datatypes do not fit into the RELAX NG model of what a datatype is. Therefore, RELAX NG validation will only validate the lexical aspects of these datatypes as defined in [W3C XML Schema Datatypes].”

You (and Alex) are conflating two different questions:

1) Are the ID/IDREF/IDREFS semantics of XML 1.0 applicable to ODF 1.0 document instances?

The answer is yes. Since ODF is XML, I would expect it to obey the syntax and semantics of the XML 1.0 (third edition) Recommendation, unlike, say, OOXML, which corrupts strings with illegal control characters.

2) Is the Relax NG schema that is included in ODF 1.0 written in such a way that will allow checking of these constraints using non-standard extensions to the Relax NG processing model, like Relax NG DTD Compatibility?

Empirically, the answer is no. But there is nothing in the ODF 1.0 standard that suggests otherwise. It isn’t required for XML conformance and it isn’t required for Relax NG conformance.

As I said in the post, I’m entirely in favor of changing this if it helps with XML tools like jing, so they can run in their default mode, thereby saving users the horrible time and expense of adding the -i flag to disable the non-standard Relax NG DTD Compatibility processing.

But the main purpose of my post is to respond to Alex’s rant that “There are no XML documents in existence which are valid to ISO ODF” and “The standard itself has a defect which prevents any document claiming validity from being actually valid”. I believe I’ve put that to rest. In the end, Alex’s post had a reasonable enhancement proposal, prefaced by a bunch of indefensible tripe.

Reply
hAl says

2008/05/04 at 4:31 am

Why, for RelaxNG, would you prefer the ISO version over an OASIS version, of which a part (DTD compatibility) was not submitted to ISO?
Why, on the other hand, would an OASIS-only version of the ODF format be your choice rather than the version that was submitted to ISO?

Reply
Rob says

2008/05/04 at 9:04 am

Actually, if you check the version of the ODF 1.0 submitted to JTC1, you’ll see that we referred to OASIS Relax NG. We changed our reference to the ISO version only at the request of the UK NB, who made this request in their ballot comments when ODF was balloted in JTC1 back in 2006.

As for which version of ODF I prefer, I’d say I prefer the latest and greatest. Remember, OASIS ODF 1.1 has all the yummy goodness of ISO-approved ODF 1.0, plus some accessibility enhancements. What’s not to love? But in the end, it is up to each vendor to decide for themselves what to support, and for each user to make their needs and wants known to their vendor.

Reply
Jirka Kosek says

2008/05/04 at 1:37 pm

“The semantics defined by [W3C XML Schema Datatypes] for the ID, IDREF and IDREFS datatypes are purely lexical and do not include the cross-reference semantics of the corresponding [XML 1.0] datatypes.

I think that you are misinterpreting ID/IDREF story in RELAX NG. But anyway, if this check should be done only in lexical space as you describe, schema is wrong. In RELAX NG schemas can be ambiguous and current ODF 1.0 schema allows validation of ID attributes against correct ID type or against text pattern. As each attribute value is valid against text pattern, even lexical tests for ID types will not be effectively performed and documents with wrong values in ID attributes will pass validation.

Reply
Rob says

2008/05/04 at 2:10 pm

Jirka, this is not my interpretation. I’m quoting James Clark here. Is there a part of what he says that is unclear?

In any case, the fact that jing does not check the syntax constraints on ID’s when using the -i flag to disable the DTD Compatibility mode is clearly a limitation of jing. There is nothing in principle that would prevent a Relax NG validator from checking these lexical constraints in default processing. But jing doesn’t do that.

The thing that is tripping you and Alex up seems to be the fact that the ODF schema will pass some uses of ID’s that are not allowed by XML. So what? The schema is not a complete statement of ODF conformance. We have attribute values that are typed string that could probably be more strictly typed with a regex. We have other attribute values typed integer that could more strictly be limited in range. So what? If the text of the ODF standard is clear that there are additional value space constraints on these attributes, then conformity fully requires these constraints.

Similarly, just because the schema doesn’t allow you to check the cross-reference semantics of ID/IDREF doesn’t mean the constraints don’t exist. It just means you need to test them in other ways. But it certainly doesn’t mean, as some have suggested, that the ODF schema is invalid. That is preposterous.

Reply
Anonymous says

2008/05/04 at 4:39 pm

Hi Rob. I must confess I’m a little confused…

“That in no way means that the constraints defined in XML 1.0 are not required on ODF document instances. It simply indicates that you would need to verify these constraints using means other than Relax NG DTD Compatibility.”

I hope it makes sense to make these questions:
* how could one verify these constraints?
* without verifying them, can we claim that the OASIS wiki instructions are enough to validate an ODF document against its specification?

Keep up the good work. I must say that I’m deeply disappointed with Alex.

Reply
Anonymous says

2008/05/04 at 4:56 pm

You state that: “The schema is not a complete statement of ODF conformance.”

The ODF specification states:
“Conforming applications…shall write documents that are valid against the OpenDocument schema if all foreign elements and attributes are removed before validation takes place”

So schema validation is actually a minimum requirement for conforming ODF writing applications.

Reply
Baud says

2008/05/04 at 5:51 pm

Rob may find interesting the following words of Mr. Kosek: “Moreover, Mr. Weir is a master of manipulation, has offended several ISO member states and the severity of OOXML defects discovered by him is quite disputable.“

Reply
Rob says

2008/05/04 at 5:54 pm

OK. Let’s back up. We need to be precise in our language here.

I sense some confusion of conformity and validity.

Conformity is a statement of the relationship between a document, or an application, and the ODF standard. If the document or the application meets the requirements stated in the ODF Standard, then it conforms to the ODF Standard.

Validity is a statement about the relationship between an XML document and an XML schema. A Relax NG schema expresses patterns that a valid XML document must match. An XML document is valid if it matches the patterns described by the schema.

As one reader mentioned, a conformant ODF document must also be valid according to the ODF schema, if the foreign elements and attributes are removed.

But I would not say that this is a “minimum” requirement. More correct would be to say that it is a necessary, but not sufficient requirement. For example, the requirements for packing the multiple XML files into a ZIP cannot be described in a Relax NG schema, nor can the relationships between style names in styles.xml and the use of those styles in content.xml, nor the hierarchy of styles, nor any of the semantic or behavior requirements of ODF. So even a minimum, “hello world” ODF document needs more than validity in order to be conformant.

In terms of ID/IDREF, I’m saying that the cross-reference semantics (for example, that ID’s are unique, and that IDREF’s contain the name of an ID) are not validity constraints that the validation model defined in the Relax NG standard checks. I quoted James Clark earlier on why that is so.

So, in terms of Relax NG, ID/IDREF consistency is not a validation criterion. A document, even with duplicate ID’s, or inconsistent use of ID/IDREF could still be valid to a Relax NG schema.

However, that doesn’t mean that the ODF standard permits inconsistency in ID/IDREF. Conformance to ODF, as I said earlier, requires more than validity. It requires other things, and this includes proper ID/IDREF semantics. This, like any other conformance constraints beyond the schema, would need to be checked by application logic. This is not difficult to do. In fact I’d expect that all ODF documents out there in fact are consistent in their use of ID/IDREF. This is a trivial consistency to maintain in an application.
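As an illustration of how little application logic this takes, here is a minimal Python sketch. It assumes the caller supplies the names of the attributes to treat as IDs and IDREFs (the names in the usage note are made up, not taken from the ODF schema), it looks at one XML file at a time, and it does not split space-separated IDREFS values.

```python
import collections
import xml.etree.ElementTree as ET


def check_id_refs(xml_text, id_attrs, idref_attrs):
    """Return (duplicate_ids, dangling_refs) for one XML document.

    id_attrs and idref_attrs hold attribute names in ElementTree's
    "{namespace}local" form; which attributes count is the caller's call.
    """
    root = ET.fromstring(xml_text)
    seen = collections.Counter()
    refs = []
    for element in root.iter():
        for name, value in element.attrib.items():
            if name in id_attrs:
                seen[value] += 1
            elif name in idref_attrs:
                refs.append(value)
    duplicates = sorted(v for v, n in seen.items() if n > 1)
    dangling = sorted({r for r in refs if r not in seen})
    return duplicates, dangling
```

Calling check_id_refs(text, {"{urn:example:t}id"}, {"{urn:example:t}ref"}) on a document returns the ID values that occur more than once and the references that point at no ID; two empty lists mean the document is consistent for those attributes.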

Back to the question “[C]an we claim that the OASIS wiki instructions are enough to validate an ODF document against its specification?” I hope you see now how that question is misstated. One does not validate against the specification. One validates against a schema. And the instructions given on the wiki are complete.

If you want to check conformance of an ODF document against the entire ODF specification, then you would need a custom tool that would check XML validation, check packaging constraints, check referential integrity of links, etc. Note that ODF makes only scant use of ID/IDREF. Most of the interesting links in ODF are not expressible using XML ID’s, since they are links from XML to non-XML resources (like images) or from one XML file to another.
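One piece of such a tool, the check that links from content.xml to other package entries actually resolve, could be sketched in Python roughly as follows. The rule used to decide which xlink:href values point inside the package is my own simplification (anchors, absolute paths, parent-relative paths and anything with a URI scheme are skipped), and percent-encoded hrefs are not decoded.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def missing_package_links(package_bytes, member="content.xml"):
    """List xlink:href targets that look package-internal but are not in the ZIP."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        names = package.namelist()
        root = ET.fromstring(package.read(member))
    missing = []
    for element in root.iter():
        href = element.get(XLINK_HREF)
        if not href or href.startswith(("#", "/", "../")) or ":" in href:
            continue  # assumed to be an internal anchor or an external target
        path = href[2:] if href.startswith("./") else href
        prefix = path.rstrip("/") + "/"
        if not any(n == path or n.startswith(prefix) for n in names):
            missing.append(href)
    return missing
```

An empty result only says that the image and object links found in that one stream resolve. It says nothing about the manifest, styles, or the other conformance requirements.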

No such tool exists that does a complete conformance test of an ODF document, at least not that I’m aware of. The ODF Fellowship has an online Validator that does a subset of tests. That is a good start.

Reply
Rob says

2008/05/04 at 6:09 pm

Baud, thanks for the quote. If I respected Mr. Kosek’s opinion, I would be worried.

But I do understand where he is coming from, and others as well. They are eager to once again criticize XML standards. After spending the past 18 months in a “hear no evil, say no evil” posture, ignoring the defects in OOXML, giving blanket approval to whatever Microsoft wanted, they are eager to once again use their critical mental faculties and try to reestablish their credibility.

As for the question “severity of OOXML defects” discovered by me, I suggest we come to that point again in around a week. I think it will be easier to answer that question in context.

Reply
Nate says

2008/05/05 at 7:58 am

One does not validate against the specification. One validates against a schema.

Rob, it’s probably fair to describe “testing conformance” as “validating.” That seems a reasonable use of language.

I think it might be more accurate to say “Using a Relax NG tool, one validates against the schema, not the specification.” As you correctly identify, there simply is no program to validate against the whole of the specification. Nor against most specifications, come to think of it.

Reply
Rob says

2008/05/05 at 8:15 am

Nate,

The difference between validation and conformance is only confused in casual use. But to an XML expert there is a world of difference.

So watch XML experts’ words carefully. If they say something is ‘broken’ or it doesn’t ‘comply’, then they are using waffling words that mean nothing firm. But when they talk about validity and conformity, then they are saying something specific.

Reply
Lucas says

2008/05/05 at 5:07 pm

I see a consistent pattern from Microsoft and their supporters of making invalid or unfair comparisons (if not completely bogus claims) when comparing whatever product they are promoting to the competition.

Take the Microsoft Windows Server product vs. LINUX, for example, as David Williams blasted Microsoft about the complete lack of “facts” in Microsoft’s Get The Facts webpage about Microsoft Windows Server in comparison to LINUX.

I’m sad to say — these tactics are nothing new for Microsoft — and even more so, for these tactics to be bled into standards-creation process and used to attack an existing standard in order to promote the standardization of a competing specification.

Reply
Anonymous says

2008/05/06 at 8:08 am

Rob, seems I’m not the only one to have failed to add a comment on Alex Brown’s blog. I apologise if this is off-topic, but I feel this is an important issue, so I’d like to follow your example and post here – Pete Austin

Alex Brown said, “Given Microsoft’s proven ability to tinker with the Office XML file format between service packs, I am hoping that MS Office will shortly be brought into line with the 29500 specification, and will stay that way.”
http://www.griffinbrown.co.uk/blog/PermaLink,guid,3e2202cd-59a3-4356-8f30-b8eb79735e1a.aspx

Alex, you may have missed how, about 18 months ago, the same Microsoft Office team made Outlook use Microsoft Word to render HTML emails, despite the fact that this significantly reduced support for HTML standards compared to earlier Outlook versions. They have still not brought it back into line. This is not a hopeful precedent.
http://www.campaignmonitor.com/blog/archives/2007/01/microsoft_takes_email_design_b.html

Reply
David says

2008/05/06 at 8:25 am

find the office:version attribute and then invoke jing with the correct schema.

Actually that may not be needed, with a bit of preparation it’s possible to write a single relax schema that validates documents according to whichever schema is specified in the version attribute.

I use this all the time for xslt1/2 discrimination (in emacs nxml mode which uses the relax schema for context sensitive editing)

The latest version of Norm’s schema is here
http://norman.walsh.name/2006/07/12/xslt20

but the first post describes the basic trick for doing this, which could be used to wrap two ODF schema, rather than two XSLT schema
is here
http://norman.walsh.name/2004/07/25/xslt20

basically you just take the two schema that you have and write a three line schema that includes them both…

Reply
Anonymous says

2008/05/06 at 9:59 am

interoperability is not proven by way of strict standards conformance. most web sites dont validate against html standards. interoperability stems from truly free standards that can be implemented by anyone. portability of a document from google docs to openoffice.org to abiword to koffice, whilst retaining formatting, is much more important than strict compliance to ODF 1.0. the standard thus allows for free competition in an open market. which is exactly why the world’s biggest monopoly does not wish to participate there. but the times are a-changing, and no amount of shills will save MS now.

the fact that there is only one non-working implementation of MSOXML is the actual scandalous embarrassment to ISO, as only established standards should be fast-tracked.

best wishes

Reply
Rob says

2008/05/06 at 10:37 am

Well, interoperability doesn’t even require a standard, does it? If you imagine an Orwellian world where everyone was forced to run exactly the same applications on the same operating system, then you would have interoperable systems.

The complexity is when you have a plurality of vendors and applications and operating systems. Then you need an engineered approach to interoperability and that can encompass standards, schemas, test suites, reference implementations, conformity assessment methodology, “plugfests”, etc. That is the cost of plurality.

Reply
Anonymous says

2008/05/07 at 3:01 pm

The thing, as far as I can see, is that even if the ODF schema included the nonstandard extensions to check the integrity of ID/IDREF, it would still not succeed with proper validation of these on the ODF document level, since their integrity must be checked over all the files included in the ODF document, and basic XML validators handle one file at a time and never external resources like embedded binary data.

It is very basic engineering design that you should not give the pretense of validating something if you know that you are not checking the essential bits of the thing that need validation. Could it be that Alex Brown doesn’t have enough real world programming experience to understand this, or is he just playing dumb to save face?

Shall we anticipate what will happen next?
I think what Alex Brown will do now is to create a set of files that demonstrate conflicts between use of ID/IDREF and start to claim that since the ODF XML schema can not detect this ODF is broken by design. The problem with this argument will of course be that his example files will not be proper ODF since they violate those parts of the ODF requirements that are not described by the XML schema.

It is good to know that you, Rob, are around to refute his baseless accusations. Keep up the good work.

Reply
Rob says

2008/05/07 at 5:29 pm

Oh, I think you do Alex a disservice there. He knows what he is doing. But it is far narrower, and far less important than he is portraying it to be. It reminds me of the “dihydrogen monoxide” scare from a few years ago. Throw around some technical jargon, make scary pronouncements, blown all out of proportion, and the people start panicking.

But dihydrogen monoxide is just water in the end, and in the end anyone who wants to can use tools like jing and msv to validate their documents. These tools will find errors where you have errors and will say your document is valid if it is valid. The ODF 1.0 schema is useful, and it is used by developers today to check the validity of their ODF output.

To your particular point about multiple XML documents, and references between them, the referential integrity of those links is not, strictly speaking, part of validity, or at least not as defined in Relax NG or ODF. These “whole document” integrity requirements should be treated as conformance constraints.

Reply

Trackbacks

ODF Lies and Whispers says:

2010/02/09 at 12:26 pm

[…] the Wikipedia articles on ODF and OOXML via paid consultants. In any case, Alex’s claims were rebutted long ago. ODF has a number (more than a hundred) of technical flaws which haven’t been addressed for 3 […]

Reply
