{"id":701,"date":"2010-02-15T00:32:39","date_gmt":"2010-02-15T05:32:39","guid":{"rendered":"http:\/\/2d823b65bb.nxcli.io\/?p=701"},"modified":"2010-02-15T18:09:26","modified_gmt":"2010-02-15T23:09:26","slug":"office-document-corruption","status":"publish","type":"post","link":"https:\/\/www.robweir.com\/blog\/2010\/02\/office-document-corruption.html","title":{"rendered":"Microsoft Office document corruption: Testing the OOXML claims"},"content":{"rendered":"<h3>Summary<\/h3>\n<p>In this post I take a look at Microsoft&#8217;s claims for robust data recovery with their Office Open XML (OOXML) file format.\u00a0 I show the results of an experiment, where I introduce random errors into documents and observe whether word processors can recover from these errors.\u00a0 Based on these results, I estimate data recovery rates for Word 2003 binary, OOXML and ODF documents, as loaded in Word 2007, Word 2003 and OpenOffice.org Writer 3.2.<\/p>\n<p>My tests suggest that the OOXML format is less robust than the Word binary or ODF formats, with no observed basis for the contrary Microsoft claims.\u00a0 I then discuss the reasons why this might be expected.<\/p>\n<h3>The OOXML &#8220;data recovery&#8221; claims<\/h3>\n<p>I&#8217;m sure you&#8217;ve heard the claim stated, in one form or another, over the past few years.\u00a0 The claim is that OOXML files are more robust and recoverable than Office 2003 binary files.\u00a0 For example, the <a style=\"text-decoration: line-through;\" rel=\"nofollow\" href=\"http:\/\/office.microsoft.com\/en-us\/products\/HA102058151033.aspx\">Ecma Office Open XML File Formats overview<\/a> says:<\/p>\n<blockquote><p>Smaller file sizes and improved recovery of corrupted documents enable Microsoft Office users to operate efficiently and confidently and reduces the risk of lost information.<\/p><\/blockquote>\n<p><a style=\"text-decoration: line-through;\" rel=\"nofollow\" 
href=\"http:\/\/office.microsoft.com\/en-us\/products\/HA102139821033.aspx\">Jean Paoli<\/a> says essentially the same thing:<\/p>\n<blockquote><p>By taking advantage of XML, people and organizations will benefit from enhanced data recovery capabilities, greater security, and smaller file size because of the use of ZIP compression.<\/p><\/blockquote>\n<p>And we see similar claims in Microsoft <a style=\"text-decoration: line-through;\" rel=\"nofollow\" href=\"http:\/\/www.microsoft.com\/casestudies\/Case_Study_Detail.aspx?CaseStudyID=201097\">case studies<\/a>:<\/p>\n<blockquote><p>The Office Open XML file format can help improve file and data management, data recovery, and interoperability with line-of-business systems by storing important metadata within the document.<\/p><\/blockquote>\n<p>A Microsoft <a style=\"text-decoration: line-through;\" rel=\"nofollow\" href=\"http:\/\/www.microsoft.com\/presspass\/features\/2005\/jun05\/06-01XMLFileFormat.mspx\">press release<\/a> quotes Senior Vice President Steven Sinofsky:<\/p>\n<blockquote><p>The new formats improve file and data management, data recovery, and interoperability with line-of-business systems beyond what&#8217;s possible with Office 2003 binary files.<\/p><\/blockquote>\n<p>Those are just four examples of a claim that has been repeated dozens of times.<\/p>\n<p>There are many kinds of document errors.\u00a0 Some errors are introduced by logic defects in the authoring application.\u00a0 Some are introduced by other, non-editor applications that might modify the document after it was authored.\u00a0 And some are caused by failures in data transmission and storage.\u00a0 The Sinofsky press release gives some further detail into exactly what kinds of errors are more easily recoverable in the OOXML format:<\/p>\n<blockquote><p>With more and more documents traveling through e-mail attachments or removable storage, the chance of a network or storage failure increases the possibility of a document becoming corrupt. 
So it&#8217;s important that the new file formats also will improve data recovery&#8211;and since data is the lifeblood of most businesses, better data recovery has the potential to save companies tremendous amounts of money.<\/p><\/blockquote>\n<p>So clearly we&#8217;re talking here about network and storage failures, and not application logic errors.\u00a0 Good, this is a testable proposition then.\u00a0 We first need to model the effect of these errors on documents.<\/p>\n<h3>Modeling document errors<\/h3>\n<p>Let&#8217;s model &#8220;network and storage failures&#8221; so we can then test how OOXML files behave when subjected to these types of errors.<\/p>\n<p>With modern error-checking file transfer protocols, the days of transmission data errors are a memory.\u00a0 Maybe 25 years ago, with XMODEM and other transfer mechanisms, you would see randomly-introduced transmission errors in the body of a document.\u00a0 But today the more likely problem would be that of truncation, of missing the last few bytes of a file transfer.\u00a0 This could happen for a variety of reasons, ranging from logic errors in application-hosted file transfer support to user-induced errors from removing a USB memory stick with uncommitted data still in the file buffer.\u00a0 (I remember debugging a program once that had a bug where it would lose the last byte of a file whenever the file was an exact multiple of 1024 bytes.)\u00a0 These types of errors can be particularly pernicious with some file formats.\u00a0 For example, the old Lotus WordPro file format stored the table of contents for the document container at the end of the file.\u00a0 This was great for incremental updating, but particularly bad for truncation errors.<\/p>\n<p>For this experiment I modeled truncation errors by generating a series of copies of a reference document, each copy truncating an additional byte from the end of the document.<\/p>\n<p>The other class of errors &#8212; &#8220;storage errors&#8221; as 
Sinofsky calls them &#8212; can come from a variety of hardware-level failures, including degeneration of the physical storage medium or mechanical errors in the storage device.\u00a0 The unit of physical storage &#8212; and thus of physical damage &#8212; is the sector.\u00a0 For most storage media the size of a sector is 512 bytes.\u00a0 I modeled storage errors by creating a series of copies of a reference document, and for each one selecting a random location within that document and then introducing a 512-byte run of random bytes.<\/p>\n<p>The reference document I used for these tests was Microsoft&#8217;s whitepaper, <cite>The Microsoft Office Open XML Formats<\/cite>.\u00a0 This is a 16-page document, with a title page with logo, a table of contents, a running text footer, and a text box.<\/p>\n<h3>Test Execution<\/h3>\n<p>I tested Microsoft Word 2003, Word 2007 and OpenOffice.org 3.2.\u00a0 I attempted to load each test document into each editor.\u00a0 Since corrupt documents have the potential to introduce application instability, I exited the editor between each test.<\/p>\n<p>Each test outcome was recorded as one of:<\/p>\n<ul>\n<li>Silent Recovery:\u00a0 The application gave no error or warning message.\u00a0 The document loaded, with partial localized corruption, but most of the data was recoverable.<\/li>\n<li>Prompted Recovery: The application gave an error or warning message offering to recover the data.\u00a0 The document loaded, with partial localized corruption, but most of the data was recoverable.<\/li>\n<li>Recovery Failed: The application gave an error or warning message offering to recover the data, but no data was able to be recovered.<\/li>\n<li>Failure to Load: The application gave an error message and refused to load the document, or crashed or hung attempting to load it.<\/li>\n<\/ul>\n<p>The first two outcomes were scored as successes, and the last two were scored as failures.<\/p>\n<h3>Results: Simulated File 
Truncation<\/h3>\n<p>In this series of tests I took each reference document (in DOC, DOCX and ODT formats) and created 32 truncated files, corresponding to 1-32 bytes of truncation.\u00a0 The results were the same regardless of the number of bytes truncated, as shown in the following table:<\/p>\n<p>[table id=3 \/]<\/p>\n<h3>Results: Simulated Sector Damage<\/h3>\n<p>In these tests I created 30 copies of each reference document and introduced into each a 512-byte run of random bytes at a random location, with the following summary results:<\/p>\n<p>[table id=6 \/]<\/p>\n<h3>Discussion<\/h3>\n<p>First, what do the results say about Microsoft&#8217;s claim that the OOXML format &#8220;improves&#8230;data recovery&#8230;beyond what&#8217;s possible with Office 2003 binary files&#8221;?\u00a0 A look at the above two tables brings this claim into question.\u00a0 With truncation errors, all three word processors scored 100% recovery using the legacy binary DOC format.\u00a0 With OOXML the same result was achieved only with Office 2007.\u00a0 But both Office 2003 and OpenOffice 3.2 failed to open any of the truncated documents.\u00a0 With the simulated sector-level errors, all three tested applications did far better recovering data from legacy DOC binary files than from OOXML files.\u00a0 For example, Microsoft Word 2007 recovered 83% of the DOC files but only 47% of the OOXML files.\u00a0 OpenOffice 3.2 recovered 90% of the DOC files, but only 37% of the OOXML files.<\/p>\n<p>In no case, of almost 200 tested documents, did we see the data recovery of OOXML files exceed that of the legacy binary formats.\u00a0 This makes sense if you consider it from an information-theoretic perspective.\u00a0 The ZIP compression in OOXML, while it shrinks the document, at the same time makes the byte stream denser in terms of its information encoding.\u00a0 The number of physical bits per information bit is smaller in the ZIP than in the uncompressed DOC file. 
(In the limit of perfect compression, this ratio would be 1-to-1.)\u00a0 Because of this, a physical error of 1 bit introduces more than 1 bit of error in the information content of the document.\u00a0 In other words, a compressed document, all else being equal, will be less robust, not more robust, to &#8220;network and storage failures&#8221;.\u00a0 Because of this, it is extraordinary that Microsoft so frequently claims that OOXML is both smaller and more robust than the binary formats, without providing details of how they managed to optimize these two opposing qualities.<\/p>\n<p>Although no similar claims have been made regarding ODF documents, I tested them as well.\u00a0 Since ODF documents are compressed by ZIP, we would expect them also to be less robust to physical errors than DOC, for the same reasons discussed above.\u00a0 This was confirmed in the tests.\u00a0 However, ODF documents exhibited a higher recovery rate than OOXML.\u00a0 Both OpenOffice 3.2 (60% versus 37%) and Word 2007 (60% versus 47%) had higher recovery rates for ODF documents.\u00a0 If all else had been equal, we would have expected ODF documents to have lower recovery rates than OOXML.\u00a0 Why?\u00a0 Because the ODF documents were on average 18% smaller than the corresponding OOXML documents, so the fixed 512-byte sector errors had a proportionately larger impact on ODF documents.<\/p>\n<p>The above is explainable if we consider the general problem of random errors in markup.\u00a0 There are two opposing tendencies here.\u00a0 On the one hand, the greater the ratio of character data to markup, the more likely it is that any introduced error will be benign to the integrity of the document, since it will most likely occur within a block of text.\u00a0 At the extreme, a plain text file, with no markup whatsoever, can handle any degree of error introduction with only proportionate data corruption.\u00a0 However, one can also argue in the other direction, that 
the more encoded structure there is in the document, the easier it is to surgically remove only the damaged parts of the file.\u00a0 However, we must acknowledge that physical errors, the &#8220;network and storage failures&#8221; that we looked at in these tests, do not respect document structure.\u00a0 Certainly the results of these tests call into question the wisdom of claiming that the complexity of the document model leads it to be more robust.\u00a0 When things go wrong, simplicity often wins.<\/p>\n<p>Finally, I should observe that application differences, as well as file format differences, play a role in determining success in recovering damaged files.\u00a0 With DOC files, OpenOffice.org 3.2 was able to read more files than either version of Microsoft Word.\u00a0 This confirms some of the anecdotes I&#8217;ve heard that OpenOffice will read files that Word will not.\u00a0 With OOXML files, however, Word 2007 did best, though OpenOffice fared better than Word 2003.\u00a0 With ODF files, both Word and OpenOffice scored the same.<\/p>\n<h3>Further work<\/h3>\n<p>Obviously the question of document file robustness is a complex one.\u00a0 These tests strongly suggest that there are real differences in how robust document formats are with respect to corruption, and these observed differences appear to contradict claims made in Microsoft&#8217;s OOXML promotional materials.\u00a0 It would require more tests to demonstrate the significance and magnitude of those differences.<\/p>\n<p>With more test cases, one could also determine exactly which portions of a file are the most vulnerable.\u00a0 For example, one could make a heat map visualization to illustrate this.\u00a0 Are there any particular areas of a document where even a 1-byte error can cause total failure?\u00a0 It appears that a single-byte truncation error on OOXML documents will cause a total failure in Office 2003, but not in Office 2007.\u00a0 Are there any 1-byte errors that cause 
failure in both editors?<\/p>\n<p>We also need to remember that neither OOXML nor ODF is a pure XML format.\u00a0 Both formats involve a ZIP container file with multiple XML files and associated resources inside.\u00a0 So document corruption may consist of damage to the directory or compression structures of the ZIP container as well as errors introduced into the contained XML and other resources.\u00a0 The directory of the ZIP&#8217;s contents is stored at the end of the file.\u00a0 So the truncation errors are damaging the directory.\u00a0 However, this information is redundant, since each undamaged ZIP entry can be recovered in a sequential processing of the archive.\u00a0 So I would expect a near-perfect recovery rate for the modest truncations exercised in these tests.\u00a0 But with OOXML files in Office 2003 and OpenOffice 3.2, even a truncation of a single byte prevented the document from loading.\u00a0 This should be relatively easy to fix.<\/p>\n<p>Also, the large number of tests with the &#8220;Silent Recovery&#8221; outcome is a concern.\u00a0 Although the problem in general is solved with digital signatures, there should be some lightweight way, perhaps checking CRCs at the ZIP entry level, to detect and warn users when a file has been damaged.\u00a0 If this is not done, the user could inadvertently resave the damaged document or otherwise propagate the errors, when an early warning would potentially give the user the opportunity, for example, to download the file again, or to seek another, hopefully undamaged, copy of the document.\u00a0 But by silently recovering and loading the file, the user is not made aware of this risky situation.<\/p>\n<h3>Files and detailed results<\/h3>\n<p>If you are interested in repeating or extending these tests, here are the test files (including reference files) in <a href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/attachments\/doc-errors\/doc.zip\">DOC<\/a>, <a 
href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/attachments\/doc-errors\/docx.zip\">DOCX<\/a> and <a href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/attachments\/doc-errors\/odt.zip\">ODT<\/a> formats.\u00a0 You can also download a ZIP of the <a href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/attachments\/doc-errors\/code.zip\">Java source code<\/a> I used to introduce the document errors.\u00a0 And you can download the ODF spreadsheet containing the <a href=\"https:\/\/2d823b65bb.nxcli.io\/blog\/attachments\/doc-errors\/test-results.ods\">detailed results<\/a>.<\/p>\n<p>WARNING: The above ZIP files contain corrupted documents.\u00a0 Loading them could potentially cause system instability and crash your word processor or operating system (if you are running Windows).\u00a0 You probably don&#8217;t want to be playing with them at the same time you are editing other critical documents.<\/p>\n<h3>Updates<\/h3>\n<p>2010-02-15: I did an additional 100 tests of DOC and DOCX in Office 2007.\u00a0 Combined with the previous 30, this gives the DOC files a recovery rate of 92%, compared to only 45% for DOCX.\u00a0 With that we have significant results at the 99% confidence level.<\/p>\n<p>Given that, can anyone see a basis for Microsoft&#8217;s claims?\u00a0 Or is this more subtle?\u00a0 Maybe they really meant to say that it is easier to recover from errors in an OOXML file, while ignoring the more significant fact that it is also far easier to corrupt an OOXML file.\u00a0 If so, the greater susceptibility to corruption seems to have outpaced any purported enhanced ability of Office 2007 to recover from these errors.<\/p>\n<p>It is like a car with bad brakes claiming that it has better airbags.\u00a0 No thanks.\u00a0 I&#8217;ll pass.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Summary In this post I take a look at Microsoft&#8217;s claims for robust data recovery with their Office Open XML (OOXML) file format.\u00a0 I show the results of an experiment, where I introduce random 
errors into documents and observe whether word processors can recover from these errors.\u00a0 Based on these results, I estimate data recovery [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_genesis_hide_title":false,"_genesis_hide_breadcrumbs":false,"_genesis_hide_singular_image":false,"_genesis_hide_footer_widgets":false,"_genesis_custom_body_class":"","_genesis_custom_post_class":"","_genesis_layout":"","footnotes":""},"categories":[9,6],"tags":[],"class_list":{"0":"post-701","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"category-odf","7":"category-ooxml","8":"entry"},"_links":{"self":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts\/701","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/comments?post=701"}],"version-history":[{"count":34,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts\/701\/revisions"}],"predecessor-version":[{"id":729,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/posts\/701\/revisions\/729"}],"wp:attachment":[{"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/media?parent=701"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/categories?post=701"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.robweir.com\/blog\/wp-json\/wp\/v2\/tags?post=701"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}