
An Antic Disposition


ODF

Microsoft Office document corruption: Testing the OOXML claims

2010/02/15 By Rob 22 Comments

Summary

In this post I take a look at Microsoft’s claims of robust data recovery with their Office Open XML (OOXML) file format.  I show the results of an experiment in which I introduce random errors into documents and observe whether word processors can recover from these errors.  Based on these results, I estimate data recovery rates for Word 2003 binary, OOXML and ODF documents, as loaded in Word 2007, Word 2003 and OpenOffice.org Writer 3.2.

My tests suggest that the OOXML format is less robust than the Word binary or ODF formats, with no observed basis for the contrary Microsoft claims.  I then discuss the reasons why this might be expected.

The OOXML “data recovery” claims

I’m sure you’ve heard the claim stated, in one form or another, over the past few years.  The claim is that OOXML files are more robust and recoverable than Office 2003 binary files.  For example, the Ecma Office Open XML File Formats overview says:

Smaller file sizes and improved recovery of corrupted documents enable Microsoft Office users to operate efficiently and confidently and reduces the risk of lost information.

Jean Paoli says essentially the same thing:

By taking advantage of XML, people and organizations will benefit from enhanced data recovery capabilities, greater security, and smaller file size because of the use of ZIP compression.

And we see similar claims in Microsoft case studies:

The Office Open XML file format can help improve file and data management, data recovery, and interoperability with line-of-business systems by storing important metadata within the document.

A Microsoft press release quotes Senior Vice President Steven Sinofsky:

The new formats improve file and data management, data recovery, and interoperability with line-of-business systems beyond what’s possible with Office 2003 binary files.

Those are just four examples of a claim that has been repeated dozens of times.

There are many kinds of document errors.  Some errors are introduced by logic defects in the authoring application.  Some are introduced by other, non-editor applications that might modify the document after it was authored.  And some are caused by failures in data transmission and storage.  The Sinofsky press release gives some further detail into exactly what kinds of errors are more easily recoverable in the OOXML format:

With more and more documents traveling through e-mail attachments or removable storage, the chance of a network or storage failure increases the possibility of a document becoming corrupt. So it’s important that the new file formats also will improve data recovery–and since data is the lifeblood of most businesses, better data recovery has the potential to save companies tremendous amounts of money.

So clearly we’re talking here about network and storage failures, and not application logic errors.  Good, this is a testable proposition then.  We first need to model the effect of these errors on documents.

Modeling document errors

Let’s model “network and storage failures” so we can then test how OOXML files behave when subjected to these types of errors.

With modern error-checking file transfer protocols, the days of transmission data errors are a memory.  Maybe 25 years ago, with XMODEM and other transfer mechanisms, you would see randomly introduced transmission errors in the body of a document.  But today the more likely problem would be truncation, missing the last few bytes of a file transfer.  This could happen for a variety of reasons, ranging from logic errors in application-hosted file transfer support to user-induced errors from removing a USB memory stick with uncommitted data still in the file buffer.  (I remember once debugging a program that had a bug where it would lose the last byte of a file whenever the file was an exact multiple of 1024 bytes.)  These types of errors can be particularly pernicious with some file formats.  For example, the old Lotus WordPro file format stored the table of contents for the document container at the end of the file.  This was great for incremental updating, but particularly bad for truncation errors.

For this experiment I modeled truncation errors by generating a series of copies of a reference document, each copy truncating an additional byte from the end of the document.
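The Java program I actually used is linked at the end of this post.  For readers who just want the flavor of the procedure, a minimal sketch of the truncation step might look like the following (the class name and output file naming here are illustrative, not the actual test code):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Illustrative sketch only: write 32 copies of a reference document,
    // the k-th copy missing its last k bytes.
    public class TruncateDoc {
        public static void main(String[] args) throws IOException {
            Path source = Paths.get(args[0]);      // e.g. reference.docx
            byte[] data = Files.readAllBytes(source);
            for (int k = 1; k <= 32; k++) {
                byte[] truncated = new byte[data.length - k];
                System.arraycopy(data, 0, truncated, 0, truncated.length);
                Files.write(Paths.get(source + ".trunc" + k), truncated);
            }
        }
    }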

The other class of errors — “storage errors” as Sinofsky calls them — can come from a variety of hardware-level failures, including degeneration of the physical storage medium or mechanical errors in the storage device.  The unit of physical storage — and thus of physical damage — is the sector.  For most storage media the size of a sector is 512 bytes.  I modeled storage errors by creating a series of copies of a reference document, and for each one selecting a random location within that document and then introducing a 512-byte run of random bytes.
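A corresponding sketch of the sector-damage step, again illustrative rather than the actual harness, and assuming the reference document is much larger than one 512-byte sector (which it is), could look like this:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Random;

    // Illustrative sketch only: overwrite one randomly placed 512-byte run
    // with random bytes in each of 30 copies of a reference document.
    public class CorruptSector {
        public static void main(String[] args) throws IOException {
            Path source = Paths.get(args[0]);
            byte[] original = Files.readAllBytes(source);
            Random rnd = new Random();
            int sector = 512;
            for (int i = 1; i <= 30; i++) {
                byte[] copy = original.clone();
                int offset = rnd.nextInt(copy.length - sector);
                byte[] noise = new byte[sector];
                rnd.nextBytes(noise);
                System.arraycopy(noise, 0, copy, offset, sector);
                Files.write(Paths.get(source + ".sector" + i), copy);
            }
        }
    }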

The reference document I used for these tests was Microsoft’s whitepaper, The Microsoft Office Open XML Formats.  This is a 16-page document, with title page with logo, a table of contents, a running text footer, and a text box.

Test Execution

I tested Microsoft Word 2003, Word 2007 and OpenOffice.org 3.2.   I attempted to load each test document into each editor.  Since corrupt documents have the potential to introduce application instability, I exited the editor between each test.

Each test outcome was recorded as one of:

  • Silent Recovery:  The application gave no error or warning message.  The document loaded, with partial localized corruption, but most of the data was recoverable.
  • Prompted Recovery: The application gave an error or warning message offering to recover the data.  The document loaded, with partial localized corruption, but most of the data was recoverable.
  • Recovery Failed: The application gave an error or warning message offering to recover the data, but no data was able to be recovered.
  • Failure to load: The application gave an error message and refused to load the document, or crashed or hung while attempting to load it.

The first two outcomes were scored as successes, and the last two were scored as failures.

Results: Simulated File Truncation

In this series of tests I took each reference document (in DOC, DOCX and ODT formats) and created 32 truncated copies, removing from 1 to 32 bytes from the end of the file.  The results were the same regardless of the number of bytes truncated, as shown in the following table:

[Table: recovery results for simulated file truncation, by format and application]

Results: Simulated Sector Damage:

In these tests I created 30 copies of each reference document and introduced into each a 512-byte run of random bytes at a random location, with the following summary results:

[Table: recovery results for simulated sector damage, by format and application]

Discussion

First, what do the results say about Microsoft’s claim that the OOXML format “improves…data recovery…beyond what’s possible with Office 2003 binary files”?  A look at the above two tables brings this claim into question.  With truncation errors, all three word processors scored 100% recovery using the legacy binary DOC format.  With OOXML the same result was achieved only with Office 2007; both Office 2003 and OpenOffice 3.2 failed to open any of the truncated OOXML documents.  With the simulated sector-level errors, all three tested applications did far better recovering data from legacy DOC binary files than from OOXML files.  For example, Microsoft Word 2007 recovered 83% of the DOC files but only 47% of the OOXML files.  OpenOffice 3.2 recovered 90% of the DOC files, but only 37% of the OOXML files.

In no case, out of almost 200 tested documents, did we see the data recovery of OOXML files exceed that of the legacy binary format.  This makes sense if you consider it from an information-theoretic perspective.  The ZIP compression in OOXML, while it shrinks the document, also makes the byte stream denser in terms of information encoding.  The number of physical bits per information bit is smaller in the ZIP archive than in the uncompressed DOC file. (In the limit of perfect compression, this ratio would be 1-to-1.)  Because of this, a physical error of one bit introduces more than one bit of error in the information content of the document.  In other words, a compressed document, all else being equal, will be less robust, not more robust, to “network and storage failures”.  It is therefore extraordinary that Microsoft so frequently claims that OOXML is both smaller and more robust than the binary formats, without providing details of how they managed to optimize these two opposing qualities.
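If you want to see this effect for yourself, here is a small, self-contained demonstration using Java’s built-in DEFLATE support.  It is not part of the test harness, just an illustration of the point above: a single flipped bit in plain text damages one character, while the same flip inside a compressed stream corrupts everything decoded after it, or makes the stream unreadable altogether.

    import java.util.zip.DataFormatException;
    import java.util.zip.Deflater;
    import java.util.zip.Inflater;

    // Illustration of how compression amplifies physical errors.
    public class BitFlipDemo {
        public static void main(String[] args) throws Exception {
            byte[] plain = "The quick brown fox jumps over the lazy dog. "
                    .repeat(200).getBytes("UTF-8");   // String.repeat needs Java 11+

            // Compress the text.
            Deflater def = new Deflater();
            def.setInput(plain);
            def.finish();
            byte[] compressed = new byte[plain.length];
            int clen = def.deflate(compressed);

            // Simulate a one-bit "storage failure" in the compressed stream.
            compressed[clen / 2] ^= 0x01;

            // Try to decompress the damaged stream.
            Inflater inf = new Inflater();
            inf.setInput(compressed, 0, clen);
            byte[] out = new byte[plain.length];
            try {
                int n = inf.inflate(out);
                System.out.println("Recovered " + n + " of " + plain.length + " bytes");
            } catch (DataFormatException e) {
                System.out.println("Decompression failed: " + e);
            }
        }
    }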

Although no similar claims have been made regarding ODF documents, I tested them as well.  Since ODF documents are also ZIP-compressed, we would expect them to be less robust to physical errors than DOC, for the same reasons discussed above.  This was confirmed in the tests.  However, ODF documents exhibited a higher recovery rate than OOXML.  Both OpenOffice 3.2 (60% versus 37%) and Word 2007 (60% versus 47%) had higher recovery rates for ODF documents.  If all else had been equal, we would have expected ODF documents to have lower recovery rates than OOXML.  Why?  Because the ODF documents were on average 18% smaller than the corresponding OOXML documents, so the fixed 512-byte sector errors had a proportionately larger impact on the ODF documents.

The above is explainable if we consider the general problem of random errors in markup.  There are two opposing tendencies here.  On the one hand, the greater the ratio of character data to markup, the more likely it is that any introduced error will be benign to the integrity of the document, since it will most likely occur within a block of text.  At the extreme, a plain text file, with no markup whatsoever, can handle any degree of error introduction with only proportionate data corruption.  On the other hand, one can argue that the more encoded structure there is in the document, the easier it is to surgically remove only the damaged parts of the file.  However, we must acknowledge that physical errors, the “network and storage failures” that we looked at in these tests, do not respect document structure.  Certainly the results of these tests call into question the wisdom of claiming that the complexity of the document model makes it more robust.  When things go wrong, simplicity often wins.

Finally, I should observe that application differences, as well as file format differences, play a role in determining success in recovering damaged files.  With DOC files, OpenOffice.org 3.2 was able to read more files than either version of Microsoft Word.  This confirms some of the anecdotes I’ve heard that OpenOffice will read files that Word will not.  With OOXML files, however, Word 2007 did best, though OpenOffice fared better than Word 2003.  With ODF files, both Word and OpenOffice scored the same.

Further work

Obviously document file robustness is a complex subject.  These tests strongly suggest that there are real differences in how robust document formats are with respect to corruption, and the observed differences appear to contradict claims made in Microsoft’s OOXML promotional materials.  It would require more tests to demonstrate the significance and magnitude of those differences.

With more test cases, one could also determine exactly which portions of a file are the most vulnerable.  For example, one could make a heat map visualization to illustrate this.  Are there any particular areas of a document where even a 1-byte error can cause total failures?  It appears that a single-byte truncation error on OOXML documents will cause a total failure in Office 2003, but not in Office 2007.  Are there any 1-byte errors that cause failure in both editors?

We also need to remember that neither OOXML nor ODF is a pure XML format.  Both formats involve a ZIP container file with multiple XML files and associated resources inside.  So document corruption may consist of damage to the directory or compression structures of the ZIP container, as well as errors introduced into the contained XML and other resources.  The directory of the ZIP’s contents is stored at the end of the file, so the truncation errors are damaging the directory.  However, this information is redundant, since each undamaged ZIP entry can be recovered by processing the archive sequentially.  So I would expect a near perfect recovery rate for the modest truncations exercised in these tests.  But with OOXML files in Office 2003 and OpenOffice 3.2, even a truncation of a single byte prevented the document from loading.  This should be relatively easy to fix.
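In fact, Java’s own ZipInputStream already reads an archive front-to-back from the local entry headers, without consulting the central directory at the end of the file.  A salvage pass of the kind I have in mind, again an illustrative sketch rather than production code, is almost trivial:

    import java.io.FileInputStream;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipException;
    import java.util.zip.ZipInputStream;

    // Illustrative sketch: list the entries of a ZIP package that are still
    // readable when the central directory at the end of the file is damaged.
    public class SalvageZipEntries {
        public static void main(String[] args) throws Exception {
            int readable = 0;
            try (ZipInputStream zin = new ZipInputStream(new FileInputStream(args[0]))) {
                ZipEntry entry;
                byte[] buf = new byte[8192];
                while ((entry = zin.getNextEntry()) != null) {
                    while (zin.read(buf) > 0) { /* drain the entry */ }
                    System.out.println("Readable entry: " + entry.getName());
                    readable++;
                }
            } catch (ZipException e) {
                System.out.println("Stopped at a damaged entry: " + e.getMessage());
            }
            System.out.println(readable + " entries readable");
        }
    }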

Also, the large number of tests with the “Silent Recovery” outcome is a concern.  Although this problem is in general solved with digital signatures, there should be some lightweight way, perhaps checking CRCs at the ZIP entry level, to detect and warn users when a file has been damaged.  If this is not done, the user could inadvertently continue working in the damaged document, resave it, or otherwise propagate the errors.  An early warning would give the user the opportunity, for example, to download the file again, or to seek another, hopefully undamaged, copy of the document.  But by silently recovering and loading the file, the application leaves the user unaware of their risky situation.
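As a sketch of how cheap such a check could be, here is an illustrative pass (not a proposal for any particular product) that recomputes each package entry’s CRC-32 and compares it against the value recorded in the ZIP:

    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Enumeration;
    import java.util.zip.CRC32;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipFile;

    // Illustrative sketch: flag package entries whose recomputed CRC-32 does
    // not match the CRC recorded in the archive, instead of loading silently.
    public class VerifyPackage {
        public static void main(String[] args) throws Exception {
            try (ZipFile zip = new ZipFile(args[0])) {
                Enumeration<? extends ZipEntry> entries = zip.entries();
                byte[] buf = new byte[8192];
                while (entries.hasMoreElements()) {
                    ZipEntry entry = entries.nextElement();
                    CRC32 crc = new CRC32();
                    try (InputStream in = zip.getInputStream(entry)) {
                        int n;
                        while ((n = in.read(buf)) > 0) crc.update(buf, 0, n);
                    } catch (IOException e) {
                        System.out.println("Unreadable entry: " + entry.getName());
                        continue;
                    }
                    if (entry.getCrc() != -1 && crc.getValue() != entry.getCrc()) {
                        System.out.println("Possible damage (CRC mismatch): " + entry.getName());
                    }
                }
            }
        }
    }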

Files and detailed results

If you are interested in repeating or extending these tests, here are the test files (including reference files) in DOC, DOCX and ODT formats.  You can also download a ZIP of the Java source code I used to introduce the document errors.  And you can also download the ODF spreadsheet containing the detailed results.

WARNING: The above ZIP files contain corrupted documents.  Loading them could potentially cause system instability and crash your word processor or operating system (if you are running Windows).  You probably don’t want to be playing with them at the same time you are editing other critical documents.

Updates

2010-02-15: I did an additional 100 tests of DOC and DOCX in Office 2007.  Combined with the previous 30, this gives the DOC files a recovery rate of 92% compared to only 45% for DOCX.  With that we have statistically significant results at the 99% confidence level.
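For the curious, a rough back-of-the-envelope check of that significance claim, assuming roughly 130 trials per format and an ordinary two-proportion z-test (neither assumption is spelled out above):

    pooled rate p ≈ (0.92 + 0.45) / 2 ≈ 0.685
    z ≈ (0.92 − 0.45) / sqrt(0.685 × 0.315 × (1/130 + 1/130)) ≈ 0.47 / 0.058 ≈ 8

which is far beyond the 2.58 cutoff for the 99% confidence level.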

Given that, can anyone see a basis for Microsoft’s claims?  Or is this more subtle?  Maybe they really meant to say that it is easier to recover from errors in an OOXML file, while ignoring the more significant fact that it is also far easier to corrupt an OOXML file.  If so, the greater susceptibility to corruption seems to have outpaced any purported enhanced ability of Office 2007 to recover from these errors.

It is like a car with bad brakes claiming that it has better airbags.  No thanks.  I’ll pass.

Filed Under: ODF, OOXML

ODF 1.2 Part 1 Public Review

2010/01/25 By Rob 7 Comments

Salt and Fresh-ground Pepper (photo by author)

A major milestone was reached for the OASIS ODF TC today.  The latest Committee Draft of ODF 1.2 Part 1 was sent out for a 60-day public review.

“What does this mean, and why should I care?” you might be asking.  That’s a fair question.

First, a quick review of the OASIS standards approval process.  The stages look like this:

  1. TC creates one or more Working Drafts
  2. A Working Draft may then additionally be approved as a Committee Draft
  3. A Committee Draft may then be additionally approved as a Public Review Draft
  4. After addressing received public comments, a Committee Draft may be approved as a Committee Specification
  5. Finally, a Committee Specification may be voted on by OASIS as an OASIS Standard

There is the possibility of iteration at most of these stages.  So we’re not done with ODF 1.2.   There is still work to be done, but we are certainly in the endgame now.

Also, it is important to remember that ODF 1.2 has been factored into three “parts”:

  • Part 1 specifies the core schema
  • Part 2 is OpenFormula (spreadsheet formulas)
  • Part 3 defines the packaging model of ODF, and went out for public review back in November

Part 1 is by far the largest of the 3 parts, at 838 pages.  Here is a high-level view of what is covered:

  1. Introduction
  2. Scope
  3. Document Structure
  4. Metadata
  5. Text Content
  6. Paragraph Elements Content
  7. Text Fields
  8. Text Indexes
  9. Tables
  10. Graphic Content
  11. Chart Content
  12. Database Front-end Document Content
  13. Form Content
  14. Common Content
  15. SMIL Animations
  16. Styles
  17. Formatting Elements
  18. Datatypes
  19. General Attributes
  20. Formatting Attributes
  21. Document Processing
  22. Conformance
  23. Appendix A.  OpenDocument Relax NG Schema
  24. Appendix B.  OpenDocument Metadata Manifest Ontology
  25. Appendix C.  MIME Types and File Name Extensions (Non Normative)
  26. Appendix E.  Recommended Usage of SMIL
  27. Appendix G.  Acknowledgments (Non Normative)

If any of this interests you, I’d encourage you to take a look at the draft and submit comments per the process defined in the public review announcement.  I expect few will review the entire specification, but even if you can review only a chapter of particular interest to you, or even do a random page review, that will help.  We’re looking for any reasonable feedback, from typographical errors, to ambiguities, to new feature proposals.  It is all good.

You can follow the incoming comments via the TC’s comment list, or unofficially via the ODFJIRA Twitter feed.

Now, I know that a vigorous public review, with many reviewers and many comments, is seen in some quarters as inconvenient and troublesome.  It is thought better (in those circles) for standards to sail by, unread, unchallenged and unimplemented.  I do not subscribe to that view.  I ask you to not be gentle on the ODF 1.2 public review draft.  Send us a lot of comments, so we know where we need to improve.  Send us a lot of defect reports, so we know what to fix.  Send us a lot of feature proposals, so we know what to do next.  Short of joining the ODF TC directly, this is the best opportunity to give us feedback.

Filed Under: ODF Tagged With: OASIS, OpenFormula, XML

The Relevancy of ODF 1.0

2009/12/14 By Rob 10 Comments

By the time you read this (actually probably by the time I finish writing this post) a ballot approving the Public Review Draft of ODF 1.2, Part 1 will have passed.  Part 1 is the largest of the three parts of ODF 1.2, and reaching a Public Review Draft status is a major accomplishment.  Expect to see an official notice of the start of the public review period over the next few days.

But as we look forward to ODF 1.2, and then beyond to “ODF-Next”, it is worth giving some consideration to what we do with ODF 1.1 and ODF 1.0.

Today, if you surveyed ODF implementations, you would find that the preponderance of them write ODF 1.1 documents by default. Twelve months ago many of them wrote out ODF 1.0 format, and in another 12 months I predict most will be writing out ODF 1.2 format by default.

So what does this mean for ODF the standard?

Every 5 years each ISO standard undergoes what is called “Periodic Review”.  The outcome of this review is to classify the standard as one of: confirmed, revised, stabilized or withdrawn.   If it is confirmed, it means the standard is of continued relevancy and is still undergoing maintenance.  Revised means it is currently undergoing revision and periodic review is not necessary.  Stabilized means it “has ongoing validity and effectiveness but is mature and insofar as can be determined will not require further maintenance of any sort”.  And a standard is withdrawn (the most extreme option) if it has been declared unsafe, has a non-RAND patent asserted against it, or is “no longer in use”.

Some of the nattering nabobs in SC34 (e.g., Alex Brown) are floating the idea that ODF 1.0 should be withdrawn from ISO, claiming it is not implemented and not relevant.  At the recent SC34 meeting in Paris this view was echoed by a Microsoft participant (one of many) in the meeting, who additionally urged that a motion to withdraw ODF 1.0 be brought forward at the Stockholm SC34 Plenary in March.

I think this shows an extraordinarily poor understanding of how documents and document format standards work.  ODF is not a standard for a transient phenomenon, like a network or telephone protocol standard, that is no longer relevant when the last producer of the network protocol is gone from the market and the last signal fades from the wire.  ODF specifies a document format, and documents persist and remain relevant so long as the documents and their owners remain.

Additionally, and especially in public sector use, there are regulatory or statutory requirements for how long documents (records) must be preserved.  Some for 3 years, some for 7, some for 30 years, and some records must be preserved forever.  Just because ODF 1.2 comes along does not make ODF 1.0 retention and public access requirements go away.

Although most major ODF editors now write out new documents in ODF 1.1 format by default, they all are able to read and process ODF 1.0 documents as well.  So they are all “consumers” of ODF 1.0 and conform to the ODF 1.0 standard.  This occurs at the same time they are also conforming ODF 1.1 “producers”.  So it is absolutely false to say that there are no ODF 1.0 implementations today.  There are many, including OpenOffice, Symphony, Google Docs, KOffice, even Microsoft Office.  They are all ODF 1.0 consumers.

We should also consider the needs of new word processors that implement ODF, since there are still a few that do not support ODF yet, like Apple’s iWork.  When they eventually implement ODF they will want to implement write support (“producer” conformance) for the current version of ODF, as well as read support (“consumer” conformance) for earlier versions of ODF.  So to enable competition in this space, and allow for new players, we must preserve access to the relevant legacy standards.  Otherwise we would be perpetuating the type of information exclusion we typically associate with Microsoft, in the decades when they restricted access to their legacy formats.

In any case, it is still puzzling to me why some are pushing for the very unusual and extreme action of withdrawing ODF 1.0 from ISO.  This doesn’t pass the sniff test.  Something is rotten here.  This is an anti-competition, anti-user, anti-adopter and overtly political move, led by Microsoft employees and Microsoft consultants in the Microsoft-dominated JTC1/SC34.   (I wish I had a pump big enough to drain that swamp.)  Ironically, by questioning the relevancy of ODF 1.0, they will cause many more to question the relevancy of SC34.

At some point, I agree that stabilization may be something to consider in the future.  But for now, ODF does not fit in that category because it is actively undergoing maintenance. SC34 members, including Alex Brown, have submitted defect reports against ODF 1.0, and the OASIS ODF TC is responding to them.  It is quite reasonable to expect that ODF 1.1 and ODF 1.2 will be broadly implemented at the same time as ODF 1.0 continues to undergo corrective maintenance.  That is the nature of document format standards like ODF.  Their relevancy, as perceived by users and adopters of the standard, is determined by the mass of legacy documents in the format, not on whether their current word processor saves in that format by default.

[12/15/09 Doug Mahugh today wrote to the OASIS ODF TC list, apparently concerned that this blog post might be misread as an official statement of the OASIS ODF TC.  I’ve attempted to dispel such notions in my response on that list.  As I’ve made abundantly clear on my Who is Rob Weir page, “The postings on this site are my own and don’t necessarily represent the positions, strategies or opinions of any of my employers or the organizations I’m associated with”.

My practice is simple:   I am not speaking as OASIS ODF TC Co-Chair, unless I am posting ODF TC agendas, minutes or similar official ODF TC notices to the ODF TC’s mailing list, or when I explicitly sign my name with the title, “Co-Chair, OASIS ODF TC”.]

Filed Under: ODF

ODF 1.2, Part 3 goes out for Public Review

2009/11/16 By Rob 4 Comments

A major milestone for ODF 1.2 was reached on Friday. Part 3 of ODF 1.2, which specifies document packaging (how a document’s XML, images and metadata are combined into a single file and are optionally encrypted or signed), went out for a 60-day public review period. This public review period will run through January 12th, 2010. A public review is a necessary OASIS procedure before a Committee Draft can be approved as a Committee Specification and then as an OASIS Standard.

The official announcement of the review has more information, including links to download the public review draft and information on how to submit comments on the draft.

Compared to the packaging specification used in ODF 1.0 and ODF 1.1, the main differences are:

  1. We’ve split this material into its own specification, since these packaging conventions are more widely applicable, and in fact have been more widely used than just in ODF. For example, the International Digital Publishing Forum (IDPF), who standardize the increasingly important ePub digital book format, use ODF’s packaging as the base of their Open eBook Publication Structure Container Format (OCF) 1.0 specification.
  2. We’ve added digital signature support (chapter 4) based on the W3C’s XML Digital Signature Core, including the ability to use standardized extensions such as XAdES.
  3. We now have an RDF-based metadata framework with OWL ontology for the manifest file (chapter 5).
  4. A more detailed conformance definition has been added, including conformance targets for packages, producers and consumers, as well as a separate conformance class for extended packages.
  5. Generally, the specification has been redrafted to follow ISO style guidelines.

This specification is only 34 pages long, so if you’re at all interested please give it a look  between now and January 12th, and send along any comments via the office-comment list. Anything that improves the specification is welcome, from reports of typographical errors, to technical omissions or errors, to suggestions for future features. It is all good.

And if you want to follow along, you can track the incoming comments in several ways:

  • Subscribe to the office-comment list mentioned above.
  • View the archives of the office-comment list.
  • View the public review comments we’re tracking in JIRA. I have a python script that scrapes the office-comment list and enters them into JIRA. This will be more complete than the office-comment list because it will include additional comments from the ODF TC.
  • I have another python script that takes each newly entered issue from JIRA and sends it out via Twitter. So you can follow all new ODF issues by subscribing to @ODFJIRA. Depending on your Twitter reader, you might be able to mark some issues as “favorites” and return to them later to see how they have been resolved.  (While you’re at it, you might also follow me, @RCWeir)

Also, keep your eye open for the announcement of a public review for ODF 1.2, Part 1 (ODF Schema) and Part 2 (OpenFormula), which will be ready for review soon.

Filed Under: ODF

Protocols, Formats and the Limits of Disclosure

2009/10/12 By Rob 4 Comments

A few words today on an important distinction that deserves greater appreciation, since it lies at the heart of several current interoperability debates. What I relate here will be well-known to any engineer, though I think almost anyone can understand the gist of this.

First, let’s review the basics.

Formats define how information is encoded. For example, HTML is the standard format for describing web pages.

Protocols define how encoded information is transmitted from one endpoint to another. For example, HTTP is the standard protocol for downloading web pages from web servers to web browsers.

There are other such format/protocol pairs, such as MIME and SMTP for emails. When we talk about “web standards” we talk about formats (often described by W3C Recommendations) and protocols (often described in IETF RFCs).

An instance of data that conforms to a given format standard might be given any number of names: a web page, a document, an image, a video, etc., according to the underlying standard. The instance of a format is data, bits and bytes that you can save to your hard drive, burn to a CD, email, etc. Data in a format is persistent and has a static representation.

But what is an instance of a protocol? It is a transaction. It is ephemeral. You can’t easily save an instance of HTTP or SMTP on your hard drive, or email it to someone else. A protocol is a complex dance, a set of queries and responses, often a negotiation of capabilities that preface the data transmission.

There is a key distinction between formats and protocols when it comes to interoperability. The key is that a protocol typically involves the negotiation of communication details between two identifiable parties, each of whom can state their capabilities and preferences, as well as conform to the capabilities of the datalink itself. Software running on each endpoint of the transaction can adapt as part of this negotiation.

You may be familiar with this from the modem days, where this “handshaking” procedure was audibly manifest to you whenever you connected to a remote host. But although you don’t hear or see it, this negotiation still occurs with protocols today, behind the scenes.

For example, when you request a web page, your client negotiates all sorts of parameters with the web server, from packet size and timings (at the TCP/IP level) to authentication, language, character set and cache preferences (at the HTTP level). This negotiation of capabilities is essential for handling the diversity of different web servers and web clients in existence today.
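To make that concrete, here is an illustrative snippet (example.org is only a placeholder host) showing a client stating its preferences in request headers and then reading back what the server actually chose:

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Illustration of HTTP content negotiation: the client states its
    // preferences, and the server answers with what it actually selected.
    // A static document file never gets to perform this kind of dance.
    public class NegotiationDemo {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://example.org/");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "text/html,application/xhtml+xml");
            conn.setRequestProperty("Accept-Language", "en;q=1.0, fr;q=0.5");
            conn.setRequestProperty("Accept-Encoding", "gzip");
            System.out.println("Status: " + conn.getResponseCode());
            System.out.println("Content-Type: " + conn.getHeaderField("Content-Type"));
            System.out.println("Content-Encoding: " + conn.getHeaderField("Content-Encoding"));
            System.out.println("Content-Language: " + conn.getHeaderField("Content-Language"));
        }
    }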

With a protocol, you have two technical endpoints communicating and negotiating the parameters of the data exchange. In other words, you have software on both ends of the communication able to execute logic to adapt to the needs of the other endpoint and the capabilities of the underlying datalink.

However, when it comes to formats, things are different.

Let’s use a word processor document as an example of a format instance. I author a document, and then I send it out: via email as an attachment, posted on my blog, burned onto a conference CD-ROM, uploaded to a document server, or whatever. I have no idea who the party on the receiving end will be, nor what software they will be using. They could be running Microsoft Office, but they could also be using OpenOffice, Google Docs, Lotus Symphony, WordPerfect, AbiWord, KOffice, etc. I, as the document author, have no ability to target my document to the quirks of the receiving party, since their identity and capabilities are unknown and in general unknowable.

Since a document is not executable logic, it cannot adapt to the quirks of various endpoints. A document is static. When it comes time to interpret the document, you don’t see two vendor endpoints adapting and negotiating. You see only one piece of software, the receiving party’s application, which must interpret a static data instance in a given format.

In other words, with document formats, there is no dynamic negotiation, because at the time when you write a document out, you have no idea what the reading application will be. And although the application that reads the document may know the identity of the writing application (via metadata stored in the document for example), it has no ability to negotiate with the writing application, since that application is not present when the document is being loaded.

OK. Simple enough. However, a confused understanding of this distinction will lead you to muddled reasoning about interoperability and how it is achieved.

Although it is not ideal, having Microsoft disclose the details of exactly how they implement various proprietary protocols, and even their quirky implementations of standard protocols, may enable third parties to code to those details. If the disclosure is timely, complete and accurate, this information can be useful. I think of the SAMBA work, for example.

However, no amount of disclosure from Microsoft on how they interpret the ODF standard will help. We see that today, with Office 2007 SP2, where it strips out ODF spreadsheet formulas. Having official documentation of this fact from Microsoft, in the form of “Implementation Notes” does not help interoperability. Why? Because when I create an ODF document, I do not know who the reader will be. It may be a Microsoft Office user. But maybe it won’t. It very well could be read by many different users, using many different programs. I cannot adapt my document to the quirks of all the various ODF implementations.

When you deal with formats, interoperability is achieved by converging on a common interpretation of the format. Having well-documented but divergent interpretations does not improve interoperability. Disclosure of quirks is insufficient. Disclosure presumes a document exchange universe where the writing application knows that the reader will be Microsoft Office and only Microsoft Office, and therefore the writer can adapt to Microsoft’s quirks. That is monopolist’s logic. Interoperability with competition comes only when all implementors converge in their interpretation of the format. When that happens we don’t need disclosures. We just follow the standard.

Filed Under: ODF Tagged With: File Formats, Interoperability, Protocols


