Summary
In this post I take a look at Microsoft’s claims for robust data recovery with their Office Open XML (OOXML) file format. I show the results of an experiment in which I introduce random errors into documents and observe whether word processors can recover from them. Based on these results, I estimate data recovery rates for Word 2003 binary, OOXML and ODF documents, as loaded in Word 2007, Word 2003 and OpenOffice.org Writer 3.2.
My tests suggest that the OOXML format is less robust than the Word binary or ODF formats, and I observed no basis for Microsoft’s contrary claims. I then discuss the reasons why this might be expected.
The OOXML “data recovery” claims
I’m sure you’ve heard the claim stated, in one form or another, over the past few years. The claim is that OOXML files are more robust and recoverable than Office 2003 binary files. For example, the Ecma Office Open XML File Formats overview says:
Smaller file sizes and improved recovery of corrupted documents enable Microsoft Office users to operate efficiently and confidently and reduces the risk of lost information.
Jean Paoli says essentially the same thing:
By taking advantage of XML, people and organizations will benefit from enhanced data recovery capabilities, greater security, and smaller file size because of the use of ZIP compression.
And we see similar claims in Microsoft case studies:
The Office Open XML file format can help improve file and data management, data recovery, and interoperability with line-of-business systems by storing important metadata within the document.
A Microsoft press release quotes Senior Vice President Steven Sinofsky:
The new formats improve file and data management, data recovery, and interoperability with line-of-business systems beyond what’s possible with Office 2003 binary files.
Those are just four examples of a claim that has been repeated dozens of times.
There are many kinds of document errors. Some errors are introduced by logic defects in the authoring application. Some are introduced by other, non-editor applications that might modify the document after it was authored. And some are caused by failures in data transmission and storage. The Sinofsky press release gives further detail on exactly what kinds of errors are more easily recoverable in the OOXML format:
With more and more documents traveling through e-mail attachments or removable storage, the chance of a network or storage failure increases the possibility of a document becoming corrupt. So it’s important that the new file formats also will improve data recovery–and since data is the lifeblood of most businesses, better data recovery has the potential to save companies tremendous amounts of money.
So clearly we’re talking here about network and storage failures, and not application logic errors. Good, this is a testable proposition then. We first need to model the effect of these errors on documents.
Modeling document errors
Let’s model “network and storage failures” so we can then test how OOXML files behave when subjected to these types of errors.
With modern error-checking file transfer protocols, the days of in-transit data errors are largely a memory. Twenty-five years ago, with XMODEM and other transfer mechanisms, you would see randomly introduced transmission errors in the body of a document. But today the more likely problem is truncation: losing the last few bytes of a file transfer. This can happen for a variety of reasons, from logic errors in application-hosted file transfer support to user-induced errors such as removing a USB memory stick with uncommitted data still in the file buffer. (I remember once debugging a program that would lose the last byte of a file whenever the file was an exact multiple of 1024 bytes.) These types of error can be particularly pernicious with some file formats. For example, the old Lotus WordPro file format stored the table of contents for the document container at the end of the file. This was great for incremental updating, but particularly bad for truncation errors.
For this experiment I modeled truncation errors by generating a series of copies of a reference document, each copy truncating an additional byte from the end of the document.
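The truncation model is simple enough to sketch in a few lines of Python. This is a minimal illustration of the approach described above, not the author’s original Java tool; the file names are placeholders:

```python
# Generate a series of truncated copies of a reference document,
# each copy missing one more byte from the end than the previous one.
from pathlib import Path

def make_truncated_copies(reference, out_dir, max_bytes=32):
    data = Path(reference).read_bytes()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for n in range(1, max_bytes + 1):
        # Drop the last n bytes of the original file.
        copy = out / f"trunc_{n:02d}{Path(reference).suffix}"
        copy.write_bytes(data[:-n])
```

Calling, say, `make_truncated_copies("reference.docx", "truncated")` would produce 32 test files, truncated by 1 through 32 bytes respectively.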
The other class of errors — “storage errors” as Sinofsky calls them — can come from a variety of hardware-level failures, including degeneration of the physical storage medium or mechanical errors in the storage device. The unit of physical storage — and thus of physical damage — is the sector. For most storage media the size of a sector is 512 bytes. I modeled storage errors by creating a series of copies of a reference document, and for each one selecting a random location within that document and then introducing a 512-byte run of random bytes.
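The sector-damage model can be sketched similarly. Again this is an illustrative Python sketch of the procedure described above (it assumes the file is larger than one sector), not the original test tool:

```python
# Simulate single-sector damage: overwrite a 512-byte run at a
# random offset with random bytes, leaving the file size unchanged.
import os
import random
from pathlib import Path

SECTOR = 512

def corrupt_sector(reference, out_path, rng=random):
    data = bytearray(Path(reference).read_bytes())
    # Pick a start offset so the whole damaged run fits in the file
    # (assumes len(data) > SECTOR).
    start = rng.randrange(0, len(data) - SECTOR)
    data[start:start + SECTOR] = os.urandom(SECTOR)
    Path(out_path).write_bytes(bytes(data))
    return start  # report where the damage landed
```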
The reference document I used for these tests was Microsoft’s whitepaper, The Microsoft Office Open XML Formats. This is a 16-page document with a title page and logo, a table of contents, a running text footer, and a text box.
Test Execution
I tested Microsoft Word 2003, Word 2007 and OpenOffice.org 3.2. I attempted to load each test document into each editor. Since corrupt documents have the potential to introduce application instability, I exited the editor between each test.
Each test outcome was recorded as one of:
- Silent Recovery: The application gave no error or warning message. The document loaded, with partial localized corruption, but most of the data was recoverable.
- Prompted Recovery: The application gave an error or warning message offering to recover the data. The document loaded, with partial localized corruption, but most of the data was recoverable.
- Recovery Failed: The application gave an error or warning message offering to recover the data, but no data was able to be recovered.
- Failure to load: The application gave an error message and refused to load the document, or crashed or hung while attempting to load it.
The first two outcomes were scored as successes, and the last two were scored as failures.
Results: Simulated File Truncation
In this series of tests I took each reference document (in DOC, DOCX and ODT formats) and created 32 truncated files, corresponding to truncations of 1–32 bytes. The results were the same regardless of the number of bytes truncated, as the following table shows:
[table id=3 /]
Results: Simulated Sector Damage
In these tests I created 30 copies of each reference document and introduced into each a 512-byte run of random bytes at a random location, with the following summary results:
[table id=6 /]
Discussion
First, what do the results say about Microsoft’s claim that the OOXML format “improves…data recovery…beyond what’s possible with Office 2003 binary files”? A look at the above two tables brings this claim into question. With truncation errors, all three word processors scored 100% recovery using the legacy binary DOC format. With OOXML the same result was achieved only with Office 2007. But both Office 2003 and OpenOffice 3.2 failed to open any of the truncated documents. With the simulated sector-level errors, all three tested applications did far better recovering data from legacy DOC binary files than from OOXML files. For example, Microsoft Word 2007 recovered 83% of the DOC files but only 47% of the OOXML files. OpenOffice 3.2 recovered 90% of the DOC files, but only 37% of the OOXML files.
In no case, across almost 200 tested documents, did we see the data recovery of OOXML files exceed that of the legacy binary formats. This makes sense if you consider it from an information-theoretic perspective. The ZIP compression in OOXML, while it shrinks the document, also makes the byte stream denser in terms of information encoding: the ratio of physical bits to information bits is smaller in the ZIP file than in the uncompressed DOC file. (In the limit of perfect compression, this ratio would be 1-to-1.) Because of this, a physical error of one bit introduces more than one bit of error in the information content of the document. In other words, a compressed document, all else being equal, will be less robust, not more robust, to “network and storage failures”. It is therefore extraordinary that Microsoft so frequently claims that OOXML is both smaller and more robust than the binary formats, without providing details of how they managed to optimize these two opposing qualities.
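The amplification effect is easy to demonstrate with zlib, the same DEFLATE compression used inside ZIP containers. A single flipped byte in the middle of a compressed stream will, in practice, render the rest of the stream undecodable, while the same flip in plain text would damage exactly one character:

```python
import zlib

text = b"The quick brown fox jumps over the lazy dog. " * 100
packed = bytearray(zlib.compress(text))

# Flip a single byte in the middle of the compressed stream.
packed[len(packed) // 2] ^= 0xFF

try:
    zlib.decompress(bytes(packed))
    print("decompressed despite the damage")
except zlib.error as exc:
    # A 1-byte physical error destroys far more than 1 byte of content:
    # either the DEFLATE structure breaks, or the checksum fails.
    print("decompression failed:", exc)
```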
Although no similar claims have been made regarding ODF documents, I tested them as well. Since ODF documents are compressed by ZIP, we would expect them to also be less robust to physical errors than DOC, for the same reasons discussed above. This was confirmed in the tests. However, ODF documents exhibited a higher recovery rate than OOXML in both OpenOffice 3.2 (60% versus 37%) and Word 2007 (60% versus 47%). If all else had been equal, we would have expected ODF documents to have lower recovery rates than OOXML. Why? Because the ODF documents were on average 18% smaller than the corresponding OOXML documents, so the fixed 512-byte sector errors had a proportionately larger impact on ODF documents.
The above is explainable if we consider the general problem of random errors in markup. There are two opposing tendencies here. On the one hand, the greater the ratio of character data to markup, the more likely it is that any introduced error will be benign to the integrity of the document, since it will most likely occur within a block of text. At the extreme, a plain text file, with no markup whatsoever, can absorb any degree of error introduction with only proportionate data corruption. One can also argue in the other direction: the more encoded structure there is in the document, the easier it is to surgically remove only the damaged parts of the file. But physical errors — the “network and storage failures” we looked at in these tests — do not respect document structure. Certainly the results of these tests call into question the wisdom of claiming that the complexity of the document model makes it more robust. When things go wrong, simplicity often wins.
Finally, I should observe that application differences, as well as file format differences, play a role in determining success in recovering damaged files. With DOC files, OpenOffice.org 3.2 was able to read more files than either version of Microsoft Word. This confirms some of the anecdotes I’ve heard that OpenOffice will read files that Word will not. With OOXML files, however, Word 2007 did best, though OpenOffice fared better than Word 2003. With ODF files, both Word and OpenOffice scored the same.
Further work
Obviously, document file robustness is a complex question. These tests strongly suggest that there are real differences in how robust document formats are with respect to corruption, and the observed differences appear to contradict claims made in Microsoft’s OOXML promotional materials. More tests would be needed to establish the significance and magnitude of those differences.
With more test cases, one could also determine exactly which portions of a file are the most vulnerable. For example, one could make a heat map visualization to illustrate this. Are there any particular areas of a document where even a 1-byte error can cause total failures? It appears that a single-byte truncation error on OOXML documents will cause a total failure in Office 2003, but not in Office 2007. Are there any 1-byte errors that cause failure in both editors?
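The heat-map data suggested above could be gathered with an exhaustive single-byte sweep: flip each byte of the file in turn and record whether the archive still validates. This sketch scores survival with a basic ZIP integrity check rather than a full word-processor load, which is an assumption on my part, but the shape of the experiment is the same:

```python
import io
import zipfile

def corruption_heatmap(data):
    """For each byte offset, flip that byte and record whether the
    archive still passes a basic ZIP integrity check (1 = survives,
    0 = container or entry damage detected)."""
    survives = []
    for i in range(len(data)):
        mutated = bytearray(data)
        mutated[i] ^= 0xFF
        try:
            with zipfile.ZipFile(io.BytesIO(bytes(mutated))) as z:
                survives.append(1 if z.testzip() is None else 0)
        except Exception:
            survives.append(0)  # container would not even open
    return survives
```

Plotting the resulting 0/1 vector against byte offset would show exactly which regions of the file — local headers, entry bodies, central directory — are the fragile ones.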
We also need to remember that neither OOXML nor ODF is a pure XML format. Both formats involve a ZIP container file with multiple XML files and associated resources inside. So document corruption may consist of damage to the directory or compression structures of the ZIP container as well as errors introduced into the contained XML and other resources. The directory of the ZIP’s contents is stored at the end of the file, so the truncation errors are damaging the directory. However, this information is redundant, since each undamaged ZIP entry can be recovered in a sequential processing of the archive. So I would expect a near-perfect recovery rate for the modest truncations exercised in these tests. But with OOXML files in Office 2003 and OpenOffice 3.2, even a truncation of a single byte prevented the document from loading. This should be relatively easy to fix.
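The sequential recovery described above can be sketched by scanning for local file header signatures (`PK\x03\x04`) and decompressing whatever entries survive intact, ignoring the damaged directory at the end. This is an illustrative sketch — not the algorithm any of the tested applications uses — and it assumes entry sizes are recorded in the local headers (true for most non-streaming ZIP writers):

```python
import struct
import zlib

LOCAL_SIG = b"PK\x03\x04"

def salvage_entries(data):
    """Recover whatever ZIP entries survive, ignoring the (possibly
    truncated) central directory at the end of the archive."""
    recovered = {}
    pos = 0
    while True:
        pos = data.find(LOCAL_SIG, pos)
        if pos < 0 or pos + 30 > len(data):
            break
        # Local file header: 26 fixed bytes after the 4-byte signature.
        (_, _flags, method, _, _, crc, csize, _usize,
         nlen, xlen) = struct.unpack_from("<HHHHHIIIHH", data, pos + 4)
        name = data[pos + 30:pos + 30 + nlen].decode("utf-8", "replace")
        body_start = pos + 30 + nlen + xlen
        body = data[body_start:body_start + csize]
        try:
            if method == 8:        # DEFLATE, stored as a raw stream
                raw = zlib.decompress(body, -15)
            elif method == 0:      # stored uncompressed
                raw = body
            else:
                raise ValueError("unsupported compression method")
            if zlib.crc32(raw) == crc:
                recovered[name] = raw   # entry is intact
        except Exception:
            pass                        # damaged entry: skip it
        pos += 4
    return recovered
```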
Also, the large number of tests with the “Silent Recovery” outcome is a concern. Although the problem in general is solved with digital signatures, there should be some lightweight way, perhaps checking CRCs at the ZIP entry level, to detect and warn users when a file has been damaged. Without such a warning, the user could inadvertently continue working in, and resave, the damaged document, or otherwise propagate the errors. An early warning would give the user the opportunity, for example, to download the file again, or to seek another, hopefully undamaged, copy of the document. By silently recovering and loading the file, the application leaves the user unaware of their risky situation.
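The lightweight check suggested above already exists in most ZIP libraries. Python’s `zipfile.ZipFile.testzip()`, for example, walks every entry and verifies its stored CRC-32, which is exactly the kind of early warning described. A small wrapper (the sentinel string for an unreadable container is my own convention):

```python
import zipfile

def first_damaged_entry(path):
    """Return the name of the first entry whose CRC check fails,
    None if every entry reads back cleanly, or a sentinel string
    if the container structure itself cannot be opened."""
    try:
        with zipfile.ZipFile(path) as z:
            return z.testzip()   # verifies each entry's CRC-32
    except zipfile.BadZipFile:
        return "<container structure damaged>"
```

An editor performing the equivalent check on open could prompt the user before any silent, lossy recovery takes place.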
Files and detailed results
If you are interested in repeating or extending these tests, here are the test files (including reference files) in DOC, DOCX and ODT formats. You can also download a ZIP of the Java source code I used to introduce the document errors. And you can also download the ODF spreadsheet containing the detailed results.
WARNING: The above ZIP files contain corrupted documents. Loading them could potentially cause system instability and crash your word processor or operating system (if you are running Windows). You probably don’t want to be playing with them at the same time you are editing other critical documents.
Updates
2010-02-15: I did an additional 100 tests of DOC and DOCX in Office 2007. Combined with the previous 30, this gives the DOC files a recovery rate of 92% compared to only 45% for DOCX. With that we have significant results at 99% confidence level.
Given that, can anyone see a basis for Microsoft’s claims? Or is this more subtle? Maybe they really meant to say that it is easier to recover from errors in an OOXML file, while ignoring the more significant fact that it is also far easier to corrupt an OOXML file. If so, the greater susceptibility to corruption seems to have outpaced any purported enhanced ability of Office 2007 to recover from these errors.
It is like a car with bad brakes claiming that it has better airbags. No thanks. I’ll pass.