<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule"
	>
<channel>
	<title>Comments on: Microsoft Office document corruption: Testing the OOXML claims</title>
	<atom:link href="http://www.robweir.com/blog/2010/02/office-document-corruption.html/feed" rel="self" type="application/rss+xml" />
	<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html?utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=office-document-corruption</link>
	<description>Thinking the unthinkable, pondering the imponderable, effing the ineffable and scruting the inscrutable</description>
	<lastBuildDate>Tue, 07 Feb 2012 11:20:47 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Julien Wajsberg</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-3208</link>
		<dc:creator>Julien Wajsberg</dc:creator>
		<pubDate>Tue, 09 Mar 2010 14:22:32 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-3208</guid>
		<description>Hi,

You said: &quot;Because of this, a physical error of 1-bit introduces more than 1-bit of error in the information content of the document.&quot; I think that is not true, as you perfectly say in the previous sentence. I think you wanted to say:
&quot;Because of this, a physical error of 1-bit introduces a bigger error in the information content of the document in the ZIP than in the uncompressed DOC file.&quot;

Thanks for your analysis which is just perfect.</description>
		<content:encoded><![CDATA[<p>Hi,</p>
<p>You said: &#8220;Because of this, a physical error of 1-bit introduces more than 1-bit of error in the information content of the document.&#8221; I think that is not true, as you perfectly say in the previous sentence. I think you wanted to say:<br />
&#8220;Because of this, a physical error of 1-bit introduces a bigger error in the information content of the document in the ZIP than in the uncompressed DOC file.&#8221;</p>
<p>Thanks for your analysis which is just perfect.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jose_X</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-3009</link>
		<dc:creator>Jose_X</dc:creator>
		<pubDate>Tue, 23 Feb 2010 04:30:34 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-3009</guid>
		<description>Michael, I followed the link.

First, that doesn&#039;t apply in the truncation case because that happens at the zip level and does not affect the XML inside for small truncations (Rob stated this was because the redundant zip toc? is what lies at the end of the file).

Second, for cases where the XML is likely affected, it&#039;s unclear from the data and discussion above (iirc) whether an error for ODF or for OOXML was a fatal error and, if so, whether that was the exact (or part of the) reason the application failed.

If it turns out that the failures were solely because of the XML standard but were otherwise recoverable, then this is something to be considered carefully. However: I&#039;m guessing that if what was lost/corrupted was solely text or various other types of data, then the error might not be defined by the standard as a fatal error. I&#039;m guessing a fatal error would include damage to the information defining the XML structure, and in most of these cases, recovery might not be possible, regardless of what the standard says.

To criticize this aspect of the standard, we should be specific (point to specific mentions of &quot;fatal error&quot;), and then come up with examples that show the fatal error requirement is unwise in such a case (eg, because faithful recovery would have been possible and hence desirable). Not taking these steps leaves us hand waving and possibly dwelling on a non-issue.</description>
		<content:encoded><![CDATA[<p>Michael, I followed the link.</p>
<p>First, that doesn&#8217;t apply in the truncation case because that happens at the zip level and does not affect the XML inside for small truncations (Rob stated this was because the redundant zip toc? is what lies at the end of the file).</p>
<p>Second, for cases where the XML is likely affected, it&#8217;s unclear from the data and discussion above (iirc) whether an error for ODF or for OOXML was a fatal error and, if so, whether that was the exact (or part of the) reason the application failed.</p>
<p>If it turns out that the failures were solely because of the XML standard but were otherwise recoverable, then this is something to be considered carefully. However: I&#8217;m guessing that if what was lost/corrupted was solely text or various other types of data, then the error might not be defined by the standard as a fatal error. I&#8217;m guessing a fatal error would include damage to the information defining the XML structure, and in most of these cases, recovery might not be possible, regardless of what the standard says.</p>
<p>To criticize this aspect of the standard, we should be specific (point to specific mentions of &#8220;fatal error&#8221;), and then come up with examples that show the fatal error requirement is unwise in such a case (eg, because faithful recovery would have been possible and hence desirable). Not taking these steps leaves us hand waving and possibly dwelling on a non-issue.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Michael</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-3008</link>
		<dc:creator>Michael</dc:creator>
		<pubDate>Tue, 23 Feb 2010 00:21:21 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-3008</guid>
		<description>I thought it was part of a conforming XML implementation that it MUST abort processing (entirely) for quite a wide range of errors:  http://www.w3.org/TR/REC-xml/#dt-fatal

IIRC they changed this, it used to be just about any error.

So aren&#039;t all these implementations that try to fuddle through strictly non-conformant?

(just yet another reason xml is a bad choice for just about anything ...)</description>
		<content:encoded><![CDATA[<p>I thought it was part of a conforming XML implementation that it MUST abort processing (entirely) for quite a wide range of errors:  <a href="http://www.w3.org/TR/REC-xml/#dt-fatal" rel="nofollow">http://www.w3.org/TR/REC-xml/#dt-fatal</a></p>
<p>IIRC they changed this, it used to be just about any error.</p>
<p>So aren&#8217;t all these implementations that try to fuddle through strictly non-conformant?</p>
<p>(just yet another reason xml is a bad choice for just about anything &#8230;)</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: What Windows Home Server and OOXML Have in Common: They Corrupt Data &#124; Boycott Novell</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-3007</link>
		<dc:creator>What Windows Home Server and OOXML Have in Common: They Corrupt Data &#124; Boycott Novell</dc:creator>
		<pubDate>Mon, 22 Feb 2010 18:18:02 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-3007</guid>
		<description>[...] is one area where the failure of Windows Home Server is similar to that of OOXML. According to this new post from Rob Weir, Microsoft Office has data corruption problems that affect OOXML.  In this post I take a look at [...]</description>
		<content:encoded><![CDATA[<p>[...] is one area where the failure of Windows Home Server is similar to that of OOXML. According to this new post from Rob Weir, Microsoft Office has data corruption problems that affect OOXML.  In this post I take a look at [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jose_X</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-3005</link>
		<dc:creator>Jose_X</dc:creator>
		<pubDate>Mon, 22 Feb 2010 03:07:23 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-3005</guid>
		<description>&gt;&gt; However, usually translated into block-level errors in the XML. 

Yes, I think that is correct (&quot;think&quot; only because I can&#039;t be sure without being sure of the algorithm, but I think it&#039;s FIFO-ish replacement more or less), thanks.

&gt;&gt; good.. if a random error makes a valid document invalid.. But we also want applications to be robust to errors

Right.. have errors produce effects that are always identifiable but never catastrophic.

&gt;&gt; maybe one could eliminate the apps from the test altogether and simply test the models in isolation, with simpler command-line apps that merely try to interpret the document

This would complement the fat app approach.

Some types of tests would not be visual and could gain a measurable passing/failing grade with minimal work on focused tools.

Also, attempting pure analysis here or there could in some cases lead to new insight and perhaps even to something approaching an &quot;undeniable proof&quot;.

&gt;&gt; You could even be methodical and introduce a block error starting at every successive offset in the file and in this way map out exactly which parts of the document are vulnerable.

.. a very large number of possibilities, and, if achieved, possibly resulting in data overload.

Perhaps this approach would be successful to stir creative juices and/or to help validate an existing theory.</description>
		<content:encoded><![CDATA[<p>&gt;&gt; However, usually translated into block-level errors in the XML. </p>
<p>Yes, I think that is correct (&#8220;think&#8221; only because I can&#8217;t be sure without being sure of the algorithm, but I think it&#8217;s FIFO-ish replacement more or less), thanks.</p>
<p>&gt;&gt; good.. if a random error makes a valid document invalid.. But we also want applications to be robust to errors</p>
<p>Right.. have errors produce effects that are always identifiable but never catastrophic.</p>
<p>&gt;&gt; maybe one could eliminate the apps from the test altogether and simply test the models in isolation, with simpler command-line apps that merely try to interpret the document</p>
<p>This would complement the fat app approach.</p>
<p>Some types of tests would not be visual and could gain a measurable passing/failing grade with minimal work on focused tools.</p>
<p>Also, attempting pure analysis here or there could in some cases lead to new insight and perhaps even to something approaching an &#8220;undeniable proof&#8221;.</p>
<p>&gt;&gt; You could even be methodical and introduce a block error starting at every successive offset in the file and in this way map out exactly which parts of the document are vulnerable.</p>
<p>.. a very large number of possibilities, and, if achieved, possibly resulting in data overload.</p>
<p>Perhaps this approach would be successful to stir creative juices and/or to help validate an existing theory.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Rob</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-3004</link>
		<dc:creator>Rob</dc:creator>
		<pubDate>Sun, 21 Feb 2010 18:21:57 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-3004</guid>
		<description>@Jose_X, correct, my tests were block errors at the ZIP package level.  However, these usually translated into block-level errors in the XML.  You can see that in the test files I posted.  I think as documents get larger this becomes even more true.  This is because, except for directory and compression dictionary areas (which are particularly vulnerable), a contiguous portion of the ZIP will usually decompress into a contiguous portion of content, especially if the block size is much less than the size of the main XML file.

To your other point, I&#039;ve done tests like that before for other formats, but more from the perspective of testing the app.  For example, when I worked on the initial port of the Xalan XSLT engine to the C++ version I used similar techniques to introduce random perturbations of XSLT scripts to see whether these errors would be handled gracefully by the code.  That was easy enough, since the engine itself was command-line driven and you could wrap it all in a WIN32 debugger session to catch and recover from all memory faults, etc.  Doing the same for a GUI app like MS Office or OpenOffice is possible in theory.  You could even be methodical and introduce a block error starting at every successive offset in the file and in this way map out exactly which parts of the document are vulnerable.

In terms of errors introduced by application bugs, I think you need to set the criteria carefully.  It is a good thing, IMHO, if a random error makes a valid document invalid, since that allows errors to be identified by common tools, e.g., schema validators.  But we also want applications to be robust to errors, so users don&#039;t lose their work.  Most of the errors I detected were things that should be recoverable by using a robust ZIP library and a robust XML parser.  Until the apps handle those more surface issues better, I don&#039;t think testing is going to tell us much about the robustness of the underlying document models.  Or, maybe one could eliminate the apps from the test altogether and simply test the models in isolation, with simpler command-line apps that merely try to interpret the document, resolve all references to content and styles, etc., but don&#039;t attempt to render anything?</description>
		<content:encoded><![CDATA[<p>@Jose_X, correct, my tests were block errors at the ZIP package level.  However, these usually translated into block-level errors in the XML.  You can see that in the test files I posted.  I think as documents get larger this becomes even more true.  This is because, except for directory and compression dictionary areas (which are particularly vulnerable), a contiguous portion of the ZIP will usually decompress into a contiguous portion of content, especially if the block size is much less than the size of the main XML file.</p>
<p>To your other point, I&#8217;ve done tests like that before for other formats, but more from the perspective of testing the app.  For example, when I worked on the initial port of the Xalan XSLT engine to the C++ version I used similar techniques to introduce random perturbations of XSLT scripts to see whether these errors would be handled gracefully by the code.  That was easy enough, since the engine itself was command-line driven and you could wrap it all in a WIN32 debugger session to catch and recover from all memory faults, etc.  Doing the same for a GUI app like MS Office or OpenOffice is possible in theory.  You could even be methodical and introduce a block error starting at every successive offset in the file and in this way map out exactly which parts of the document are vulnerable.</p>
<p>In terms of errors introduced by application bugs, I think you need to set the criteria carefully.  It is a good thing, IMHO, if a random error makes a valid document invalid, since that allows errors to be identified by common tools, e.g., schema validators.  But we also want applications to be robust to errors, so users don&#8217;t lose their work.  Most of the errors I detected were things that should be recoverable by using a robust ZIP library and a robust XML parser.  Until the apps handle those more surface issues better, I don&#8217;t think testing is going to tell us much about the robustness of the underlying document models.  Or, maybe one could eliminate the apps from the test altogether and simply test the models in isolation, with simpler command-line apps that merely try to interpret the document, resolve all references to content and styles, etc., but don&#8217;t attempt to render anything?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joe</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-3003</link>
		<dc:creator>Joe</dc:creator>
		<pubDate>Sun, 21 Feb 2010 18:04:10 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-3003</guid>
		<description>Silently recovering a document _incorrectly_ should be considered a FAIL. Thus I&#039;d break up the silent recovery into two cases - silent correct recovery vs silent incorrect recovery.</description>
		<content:encoded><![CDATA[<p>Silently recovering a document _incorrectly_ should be considered a FAIL. Thus I&#8217;d break up the silent recovery into two cases &#8211; silent correct recovery vs silent incorrect recovery.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jose_X</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-3001</link>
		<dc:creator>Jose_X</dc:creator>
		<pubDate>Sun, 21 Feb 2010 15:35:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-3001</guid>
		<description>To clarify the previous comment a little more:

I am not interested in random disk errors to files. I am thinking that modeling in some way random errors at the XML level (perhaps crudely simulated with random localized errors to XML files) models deviations from &quot;spec&quot; based on bugs. Syntax deviations are like random errors on the file stream. Semantic deviation might function similarly or not depending on the details.

[I used quotes around &quot;bugs&quot; to suggest that certain bugs might not be accidental or, otherwise, might be allowed to exist by a market dominant implementer since these bugs could promote profit margins through their effect of making it more difficult for third parties to reproduce these buggy effects in their competing products.]

It could be useful to identify which format is more amenable to having X changes in semantics/syntax have a more compounding effect more likely to lead to gibberish by third party products not able to reproduce these changes correctly. Intuition led (leads?) me to think that a &quot;delta&quot; representation of data is less robust to changes (&quot;errors&quot;) in the stream, all else being equal. Maybe I need to think about this more, but the idea is not too unlike what you mentioned (or implied) about the problem with &quot;silent&quot; errors going into a document and how these might surface as time goes on, perhaps having a compounding effect as well (through repeated document manipulations, savings, openings, etc).</description>
		<content:encoded><![CDATA[<p>To clarify the previous comment a little more:</p>
<p>I am not interested in random disk errors to files. I am thinking that modeling in some way random errors at the XML level (perhaps crudely simulated with random localized errors to XML files) models deviations from &#8220;spec&#8221; based on bugs. Syntax deviations are like random errors on the file stream. Semantic deviation might function similarly or not depending on the details.</p>
<p>[I used quotes around "bugs" to suggest that certain bugs might not be accidental or, otherwise, might be allowed to exist by a market dominant implementer since these bugs could promote profit margins through their effect of making it more difficult for third parties to reproduce these buggy effects in their competing products.]</p>
<p>It could be useful to identify which format is more amenable to having X changes in semantics/syntax have a more compounding effect more likely to lead to gibberish by third party products not able to reproduce these changes correctly. Intuition led (leads?) me to think that a &#8220;delta&#8221; representation of data is less robust to changes (&#8220;errors&#8221;) in the stream, all else being equal. Maybe I need to think about this more, but the idea is not too unlike what you mentioned (or implied) about the problem with &#8220;silent&#8221; errors going into a document and how these might surface as time goes on, perhaps having a compounding effect as well (through repeated document manipulations, savings, openings, etc).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jose_X</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-3000</link>
		<dc:creator>Jose_X</dc:creator>
		<pubDate>Sun, 21 Feb 2010 14:57:24 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-3000</guid>
		<description>&gt;&gt; As for XML formats in general, their modularity is in their abstract model. It is not necessarily in the character stream representation. The characters that follow your element X could be content of X, a sibling of X, a child of X or even stuff further up the document tree than X. So any block-level damage could take a whack out of your document tree that is very inelegant and certainly non-localized.

To be sure I read correctly, this discussion is in reply to Uri but does not correlate to the tests performed, right? In other words, there was no block level testing of XML, right? .. meaning the block damage was done against zipped files only, right? [Testing unzipped ODF/OOXML would depend on how the file system would store related but distinct XML files, and this could vary quite a bit (over time and over different file systems).]

While on this subject...

I would like to see the effects of damages to individual XML files for ODF and for OOXML. Though I am much more familiar with ODF than OOXML, there is much I don&#039;t know about either of these formats. What I am after is to see what percentage of localized errors (like the random block data test) have devastating effects on these two formats.

I wonder because I think I had heard that a greater percentage of OOXML is &quot;delta-based&quot;, meaning that the effect of errors &quot;early&quot; in the stream would compound afterward at a faster rate than &quot;nondelta-based&quot; (but still nested) XML.

Besides the statement that such a result would make about a level of robustness of each format, I&#039;m also thinking that it might be more difficult for third parties to keep up with a proprietary OOXML implementation (eg, a dominating implementation like MSOffice20xx) than it would be a proprietary ODF implementation based on &quot;bugs&quot; found respectively within such closed source implementations. Looked at from a different angle, if the results are as I imagine is the case, then it would be easier to keep a closed-source OOXML implementation from being matched sufficiently well by competing third parties (than would be the case for ODF) through the careful insertion of &quot;bugs&quot; that disagree with the spec (or that leverage ambiguous or missing components of the spec).

Regardless of which is easier to &quot;manipulate&quot;, OOXML or ODF, knowing that answer may help third parties better decide how to address a closed source market leader (in OOXML or in ODF).</description>
		<content:encoded><![CDATA[<p>&gt;&gt; As for XML formats in general, their modularity is in their abstract model. It is not necessarily in the character stream representation. The characters that follow your element X could be content of X, a sibling of X, a child of X or even stuff further up the document tree than X. So any block-level damage could take a whack out of your document tree that is very inelegant and certainly non-localized.</p>
<p>To be sure I read correctly, this discussion is in reply to Uri but does not correlate to the tests performed, right? In other words, there was no block level testing of XML, right? .. meaning the block damage was done against zipped files only, right? [Testing unzipped ODF/OOXML would depend on how the file system would store related but distinct XML files, and this could vary quite a bit (over time and over different file systems).]</p>
<p>While on this subject&#8230;</p>
<p>I would like to see the effects of damages to individual XML files for ODF and for OOXML. Though I am much more familiar with ODF than OOXML, there is much I don&#8217;t know about either of these formats. What I am after is to see what percentage of localized errors (like the random block data test) have devastating effects on these two formats.</p>
<p>I wonder because I think I had heard that a greater percentage of OOXML is &#8220;delta-based&#8221;, meaning that the effect of errors &#8220;early&#8221; in the stream would compound afterward at a faster rate than &#8220;nondelta-based&#8221; (but still nested) XML.</p>
<p>Besides the statement that such a result would make about a level of robustness of each format, I&#8217;m also thinking that it might be more difficult for third parties to keep up with a proprietary OOXML implementation (eg, a dominating implementation like MSOffice20xx) than it would be a proprietary ODF implementation based on &#8220;bugs&#8221; found respectively within such closed source implementations. Looked at from a different angle, if the results are as I imagine is the case, then it would be easier to keep a closed-source OOXML implementation from being matched sufficiently well by competing third parties (than would be the case for ODF) through the careful insertion of &#8220;bugs&#8221; that disagree with the spec (or that leverage ambiguous or missing components of the spec).</p>
<p>Regardless of which is easier to &#8220;manipulate&#8221;, OOXML or ODF, knowing that answer may help third parties better decide how to address a closed source market leader (in OOXML or in ODF).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Rob</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-2995</link>
		<dc:creator>Rob</dc:creator>
		<pubDate>Thu, 18 Feb 2010 02:31:10 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-2995</guid>
		<description>OpenOffice gets the same result as Office 2003 in truncation tests, a big 0%. Presumably they just need to use a more robust ZIP routine that scans the entries serially if the directory at the end is corrupted.  On the sector damage, OpenOffice gets better results than Microsoft Office 2003.

In any case, to your question, no, I have not entered a defect report on these issues.  Since I do not use OOXML myself, the fact that OOXML documents are so easily damaged in irrecoverable ways is not my problem.  However, I have posted the test cases and code for reproducing these results.  The vendors are welcome to deal with this bug as they wish. I&#039;d mainly hope that Microsoft would stop making baseless and false robustness claims that are easily refuted.  One can hope, at least.</description>
		<content:encoded><![CDATA[<p>OpenOffice gets the same result as Office 2003 in truncation tests, a big 0%. Presumably they just need to use a more robust ZIP routine that scans the entries serially if the directory at the end is corrupted.  On the sector damage, OpenOffice gets better results than Microsoft Office 2003.</p>
<p>In any case, to your question, no, I have not entered a defect report on these issues.  Since I do not use OOXML myself, the fact that OOXML documents are so easily damaged in irrecoverable ways is not my problem.  However, I have posted the test cases and code for reproducing these results.  The vendors are welcome to deal with this bug as they wish. I&#8217;d mainly hope that Microsoft would stop making baseless and false robustness claims that are easily refuted.  One can hope, at least.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Fuzzy</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-2994</link>
		<dc:creator>Fuzzy</dc:creator>
		<pubDate>Wed, 17 Feb 2010 22:27:22 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-2994</guid>
		<description>Although it was not the point of the experiment, I think your tests show that there is room for improvement in OpenOffice&#039;s recovery code.  Microsoft Office was able to recover the OOXML Simulated File Truncation in 100% of the cases, while OpenOffice 3.2 failed in 100% of the cases.  Similarly, Microsoft Office was able to recover 47% of the Simulated Sector Damage cases, but OpenOffice 3.2 was only 37% successful.  Did you submit a report (or two) to the OpenOffice.org Issue Tracker?</description>
		<content:encoded><![CDATA[<p>Although it was not the point of the experiment, I think your tests show that there is room for improvement in OpenOffice&#8217;s recovery code.  Microsoft Office was able to recover the OOXML Simulated File Truncation in 100% of the cases, while OpenOffice 3.2 failed in 100% of the cases.  Similarly, Microsoft Office was able to recover 47% of the Simulated Sector Damage cases, but OpenOffice 3.2 was only 37% successful.  Did you submit a report (or two) to the OpenOffice.org Issue Tracker?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jakub Narębski</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-2993</link>
		<dc:creator>Jakub Narębski</dc:creator>
		<pubDate>Wed, 17 Feb 2010 11:08:23 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-2993</guid>
		<description>@Rob: Not &quot;rewrote their DOC code&quot; (which would mean forgetting about corner-cases and bugfixes, and spec vs reality), but &lt;b&gt;refactored&lt;/b&gt; their DOC code.

Just a bit of nitpicking.</description>
		<content:encoded><![CDATA[<p>@Rob: Not &#8220;rewrote their DOC code&#8221; (which would mean forgetting about corner-cases and bugfixes, and spec vs reality), but <b>refactored</b> their DOC code.</p>
<p>Just a bit of nitpicking.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Felix</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-2992</link>
		<dc:creator>Felix</dc:creator>
		<pubDate>Tue, 16 Feb 2010 23:12:47 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-2992</guid>
		<description>Yes, that sounds right since it is the entirety of the data that makes up the document that we are aiming to protect not any particular part of it.

Does that fall outside of the spec for ODF?</description>
		<content:encoded><![CDATA[<p>Yes, that sounds right since it is the entirety of the data that makes up the document that we are aiming to protect not any particular part of it.</p>
<p>Does that fall outside of the spec for ODF?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Rob</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-2991</link>
		<dc:creator>Rob</dc:creator>
		<pubDate>Tue, 16 Feb 2010 19:28:23 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-2991</guid>
		<description>@Uri, Here is something I would believe.  By taking their old cruddy DOC reading code, that was an accumulation of a decade of hacking, and rewriting new import/export support for OOXML, and subjecting it to extensive testing, they have an input module that is higher quality and easier to maintain than what they had before.  This could lead to less corruption.  But this &quot;fresh start&quot; approach has nothing to do with OOXML.  I think they would have had the same outcome if they wrote a fresh ODF filter, or even if they rewrote their DOC code. 

As for XML formats in general, their modularity is in their abstract model.  It is not necessarily in the character stream representation.  The characters that follow your element X could be content of X, a sibling of X, a child of X or even stuff further up the document tree than X.  So any block-level damage could take a whack out of your document tree that is very inelegant and certainly non-localized.  Of course, you could, as OOXML does, move different pieces into different XML files.  But that won&#039;t have a real impact if, like OOXML, 80% of the stuff still ends up in one place, e.g., in document.xml, as it does.

But it is an interesting question: if you wanted to design an XML format to be resistant to certain kinds of damage, what would be your design points?

@Felix, it is an interesting idea, but I&#039;d still solve it at a level higher than the XML.  Why?  Because an ODF document can contain other resources, like image files, that are not in XML.  But maybe using Error Correcting Codes at the ZIP level would work?</description>
		<content:encoded><![CDATA[<p>@Uri, Here is something I would believe.  By taking their old cruddy DOC reading code, that was an accumulation of a decade of hacking, and rewriting new import/export support for OOXML, and subjecting it to extensive testing, they have an input module that is higher quality and easier to maintain than what they had before.  This could lead to less corruption.  But this &#8220;fresh start&#8221; approach has nothing to do with OOXML.  I think they would have had the same outcome if they wrote a fresh ODF filter, or even if they rewrote their DOC code. </p>
<p>As for XML formats in general, their modularity is in their abstract model.  It is not necessarily in the character stream representation.  The characters that follow your element X could be content of X, a sibling of X, a child of X or even stuff further up the document tree than X.  So any block-level damage could take a whack out of your document tree that is very inelegant and certainly non-localized.  Of course, you could, as OOXML does, move different pieces into different XML files.  But that won&#8217;t have a real impact if, like OOXML, 80% of the stuff still ends up in one place, e.g., in document.xml, as it does. </p>
<p>But it is an interesting question: if you wanted to design an XML format to be resistant to certain kinds of damage, what would be your design points?</p>
<p>@Felix, it is an interesting idea, but I&#8217;d still solve it at a level higher than the XML.  Why?  Because an ODF document can contain other resources, like image files, that are not in XML.  But maybe using Error Correcting Codes at the ZIP level would work?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Uri</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-2990</link>
		<dc:creator>Uri</dc:creator>
		<pubDate>Tue, 16 Feb 2010 16:38:04 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-2990</guid>
		<description>You&#039;re right, Microsoft&#039;s story regarding resistance to corruption is a complete fabrication. And it&#039;s not a one-off mistake as I thought, but deliberate misinformation.

I still believe that users of XML formats (ODF, OOXML, etc.) encounter much less severe document corruption, though I don&#039;t have data to back that up. Microsoft could have said: &quot;Remember all those documents our buggy software corrupted? Now we have the same number of bugs, but they are less likely to corrupt your files.&quot; But that&#039;s not good marketing, so they aligned on an alternative story about network and storage failures, which is unfortunately a lie. It&#039;s hardly commendable, but the overall claim of fewer corrupted documents probably stands.</description>
		<content:encoded><![CDATA[<p>You&#8217;re right, Microsoft&#8217;s story regarding resistance to corruption is a complete fabrication. And it&#8217;s not a one-off mistake as I thought, but deliberate misinformation.</p>
<p>I still believe that users of XML formats (ODF, OOXML, etc.) encounter much less severe document corruption, though I don&#8217;t have data to back that up. Microsoft could have said: &#8220;Remember all those documents our buggy software corrupted? Now we have the same number of bugs, but they are less likely to corrupt your files.&#8221; But that&#8217;s not good marketing, so they aligned on an alternative story about network and storage failures, which is unfortunately a lie. It&#8217;s hardly commendable, but the overall claim of fewer corrupted documents probably stands.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Felix</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-2989</link>
		<dc:creator>Felix</dc:creator>
		<pubDate>Tue, 16 Feb 2010 16:05:04 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-2989</guid>
		<description>&quot;But I think the “correct” way of handling this in today’s desktop environment is via digital signatures (or just the hash part) to verify document integrity and then using storage level redundancy (RAID). I don’t mean to suggest we solve this problem in the file format itself.&quot;

That&#039;s great for business users with RAID servers, but for everybody else an option which said:

 Use Safe File Saving (this will increase the size of your documents)?      On/Off

would be pretty useful.</description>
		<content:encoded><![CDATA[<p>&#8220;But I think the “correct” way of handling this in today’s desktop environment is via digital signatures (or just the hash part) to verify document integrity and then using storage level redundancy (RAID). I don’t mean to suggest we solve this problem in the file format itself.&#8221;</p>
<p>That&#8217;s great for business users with RAID servers, but for everybody else an option which said:</p>
<p> Use Safe File Saving (this will increase the size of your documents)?      On/Off</p>
<p>would be pretty useful.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Rob</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-2988</link>
		<dc:creator>Rob</dc:creator>
		<pubDate>Mon, 15 Feb 2010 23:29:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-2988</guid>
		<description>@Felix, that starts getting into the realm of what we call &quot;Error Correcting Codes&quot; or ECC&#039;s.  These provide optimal ways of adding redundant data to a data stream in order to be robust to a given level of noise.  But I think the &quot;correct&quot; way of handling this in today&#039;s desktop environment is via digital signatures (or just the hash part) to verify document integrity and then using storage level redundancy (RAID).  I don&#039;t mean to suggest we solve this problem in the file format itself.  I&#039;m just pointing out the idiocy of Microsoft claiming that they have done so.

@Uri, this is far more than the Sinofsky quote in a press release.  This claim has been repeated over and over again, including the claim that OOXML has magical recovery properties in the face of &quot;storage failures&quot;.  

For example &lt;a href=&quot;http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=4000000652&quot; rel=&quot;nofollow&quot;&gt;this case study&lt;/a&gt; claims: &quot;XML-based documents are less likely to lose valuable information if the file suffers a partial corruption through a transfer or storage failure, for example. This makes the new Open XML format more resilient and reliable than the binary format used by previous Microsoft Office releases. Most files can still be opened if a component within the file is damaged.&quot;

Also, the same claim appears in Microsoft&#039;s technical repository, &lt;a href=&quot;http://msdn.microsoft.com/en-us/library/aa338205.aspx&quot; rel=&quot;nofollow&quot;&gt;MSDN&lt;/a&gt;.

This is not an off-the-cuff improvisation by a single VP.  You will find this mantra repeated dozens of times, in press releases, in white papers, in training materials, in notes to ISO NBs, in Microsoft blogs, etc.</description>
		<content:encoded><![CDATA[<p>@Felix, that starts getting into the realm of what we call &#8220;Error Correcting Codes&#8221; or ECC&#8217;s.  These provide optimal ways of adding redundant data to a data stream in order to be robust to a given level of noise.  But I think the &#8220;correct&#8221; way of handling this in today&#8217;s desktop environment is via digital signatures (or just the hash part) to verify document integrity and then using storage level redundancy (RAID).  I don&#8217;t mean to suggest we solve this problem in the file format itself.  I&#8217;m just pointing out the idiocy of Microsoft claiming that they have done so.</p>
<p>@Uri, this is far more than the Sinofsky quote in a press release.  This claim has been repeated over and over again, including the claim that OOXML has magical recovery properties in the face of &#8220;storage failures&#8221;.  </p>
<p>For example <a href="http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=4000000652" rel="nofollow">this case study</a> claims: &#8220;XML-based documents are less likely to lose valuable information if the file suffers a partial corruption through a transfer or storage failure, for example. This makes the new Open XML format more resilient and reliable than the binary format used by previous Microsoft Office releases. Most files can still be opened if a component within the file is damaged.&#8221;</p>
<p>Also, the same claim appears in Microsoft&#8217;s technical repository, <a href="http://msdn.microsoft.com/en-us/library/aa338205.aspx" rel="nofollow">MSDN</a>.</p>
<p>This is not an off-the-cuff improvisation by a single VP.  You will find this mantra repeated dozens of times, in press releases, in white papers, in training materials, in notes to ISO NBs, in Microsoft blogs, etc.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Uri</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-2987</link>
		<dc:creator>Uri</dc:creator>
		<pubDate>Mon, 15 Feb 2010 22:23:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-2987</guid>
		<description>I think you&#039;re looking at the wrong problem. I would guess the vastly dominant form of document corruption is that caused by implementation errors - bugs in the editor. I find it easy to believe that it&#039;s easier to recover from a bug in an XML file than in a dump-from-memory. In Real Life, people who use XML formats will probably encounter far fewer incidents of totally corrupt files, just because bugs in editors will be easier to ignore.

True, this has nothing to do with what Sinofsky said, but he&#039;s a Senior Vice President so you can&#039;t expect him to be in touch with reality. Some manager 4 levels below him was asked to prepare a Powerpoint presentation with talking points about the new format, and the engineers explained the advantage of easier recovery. By the time it got to Sinofsky it was a single context free bullet point that said &quot;improved recovery&quot;, which he used as a theme to wax rhapsodic about without having a clue what that means.

Marketing is saying things that sound good, not correct things. Proving that some marketing claim is incorrect is too easy. It&#039;s much more interesting to check the actual results - what part of actual documents becomes corrupt, and to what degree? I&#039;m guessing that your experiment tells us very little about that, simply because it models a very small part of the prevalent corruption mechanisms.</description>
		<content:encoded><![CDATA[<p>I think you&#8217;re looking at the wrong problem. I would guess the vastly dominant form of document corruption is that caused by implementation errors &#8211; bugs in the editor. I find it easy to believe that it&#8217;s easier to recover from a bug in an XML file than in a dump-from-memory. In Real Life, people who use XML formats will probably encounter far fewer incidents of totally corrupt files, just because bugs in editors will be easier to ignore.</p>
<p>True, this has nothing to do with what Sinofsky said, but he&#8217;s a Senior Vice President so you can&#8217;t expect him to be in touch with reality. Some manager 4 levels below him was asked to prepare a Powerpoint presentation with talking points about the new format, and the engineers explained the advantage of easier recovery. By the time it got to Sinofsky it was a single context free bullet point that said &#8220;improved recovery&#8221;, which he used as a theme to wax rhapsodic about without having a clue what that means.</p>
<p>Marketing is saying things that sound good, not correct things. Proving that some marketing claim is incorrect is too easy. It&#8217;s much more interesting to check the actual results &#8211; what part of actual documents becomes corrupt, and to what degree? I&#8217;m guessing that your experiment tells us very little about that, simply because it models a very small part of the prevalent corruption mechanisms.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Felix</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-2986</link>
		<dc:creator>Felix</dc:creator>
		<pubDate>Mon, 15 Feb 2010 21:03:38 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-2986</guid>
		<description>Immediately after posting my previous comment, it occurred to me that there will of course(!) be more efficient methods of obtaining reliability than storing 2 identical copies of the file. 

Nonetheless, the file format should provide these as an option to users. 

Rob, can you put something in ODF 1.2 which says &quot;this file is X Bytes, with an additional Y Bytes of data recovery information&quot; and specify that for ODF 1.2 Y will be zero but in subsequent versions it may be greater than 0?</description>
		<content:encoded><![CDATA[<p>Immediately after posting my previous comment, it occurred to me that there will of course(!) be more efficient methods of obtaining reliability than storing 2 identical copies of the file. </p>
<p>Nonetheless, the file format should provide these as an option to users. </p>
<p>Rob, can you put something in ODF 1.2 which says &#8220;this file is X Bytes, with an additional Y Bytes of data recovery information&#8221; and specify that for ODF 1.2 Y will be zero but in subsequent versions it may be greater than 0?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Felix</title>
		<link>http://www.robweir.com/blog/2010/02/office-document-corruption.html#comment-2985</link>
		<dc:creator>Felix</dc:creator>
		<pubDate>Mon, 15 Feb 2010 20:57:10 +0000</pubDate>
		<guid isPermaLink="false">http://www.robweir.com/blog/?p=701#comment-2985</guid>
		<description>This suggests to me that there should be an option in the file format standard to store the file twice back-to-back (ABCDEABCDE), to eliminate both of these potential sources of error.
The size reduction gained in moving to compressed file formats could be spent on recoverability.

Can you give an approximate size ratio for a file in ODF, OOXML and DOC?

Thanks</description>
		<content:encoded><![CDATA[<p>This suggests to me that there should be an option in the file format standard to store the file twice back-to-back (ABCDEABCDE), to eliminate both of these potential sources of error.<br />
The size reduction gained in moving to compressed file formats could be spent on recoverability.</p>
<p>Can you give an approximate size ratio for a file in ODF, OOXML and DOC?</p>
<p>Thanks</p>
]]></content:encoded>
	</item>
</channel>
</rss>


