I noticed a curious argument in Jonathan Corbet’s LWN article “Supporting OOXML in LibreOffice” (behind a paywall). Why should we support OOXML?
…as has been pointed out in the discussion, Microsoft will, someday, phase out support for its (equally proprietary) DOC format, leaving OOXML as the only real option for document interchange. There appears to be little hope that Microsoft’s ODF support will be sufficient to make ODF a viable alternative. So any office productivity suite which aspires to millions of users, and which does not support OOXML, will find itself scrambling to add that support when DOC is no longer an option. It seems better to maintain (and improve) that support now than to be rushing to merge a substandard implementation in the future.
Really? The same company that is unable to fix a leap-year calculation bug from 20 years ago because of fears it might break backwards compatibility is going to remove support for their binary formats? Seriously, is that what people are saying? This sounds like something Microsoft would say to scare people into migrating.
But don’t listen to my opinions. Let’s look at the numbers. I’ve been tracking document counts via Google for almost four years now, looking at the relative distribution of document types, across OOXML, ODF, Legacy Binary, PDF, XPS, etc. Because the size of the web is growing, one cannot fairly compare the absolute numbers of documents from week to week. But the distribution of documents over time is something worth noting.
The following chart shows OOXML documents on the web as a percentage of all MS Office documents. Note carefully the scale of the chart. It is peaking at less than 3%. So 97+% of the Microsoft Office documents on the web today are in the legacy binary formats, even four years after Office 2007 was released.
Of course, for any given organization these numbers may vary. Some are 100% on the XML formats. Some are 0% on them. If you look at just “gov” internet domains, the percentage today is only 0.7%. If you look at only “edu” domains, the number is 4.5%. No doubt, within organizations, non-public work documents might have a different distribution. But clearly the large number of existing legacy binary documents on government web sites alone is sufficient to prove my point. DOC is not going away.
I call “FUD” on this one.
Michael Kohne says
The thought that MS would ever completely drop support for their old binary formats doesn’t pass the laugh test. They simply DON’T DO THAT. They may well at some point relegate them to translator plug-ins (in 5-10 years), and they may get slightly broken along the way (wouldn’t be the first time), but you will ALWAYS be able to read the older formats in the latest version of MS Office.
Victor Soliz says
Oh boy, that one statement really expects a lot of suspension of disbelief from readers.
Bob Jonkman says
Just because OOXML has a share of 3% does NOT mean that 97% are “in the legacy binary formats”. There are more document formats than you can shake a stick at, some open text format, some binary:
and another hundred or so that slip my recollection.
Seeing a chart that directly compares them all would be instructive.
Jonathan Corbet says
FUD or not, it’s a concern that the LibreOffice developers have, and it has been a motivating factor in their decision. It’s not something I made up…
Here’s a non-paywalled link for the curious: http://lwn.net/SubscriberLink/422367/d415cad9e4fde434/
@Bob, when I say “legacy binary formats” I’m referring to those of MS Office. I’m not sure the number of TXT or WP5 files is relevant to this particular argument.
@Jonathan, I never said you made this up. The context of my quotation of you made it clear (I believe) that you were repeating what you had heard. In any case, unrefuted FUD certainly does impact judgment, so rather than just repeating the FUD, I’m trying to shed some light on the situation.
I agree and have a quick sanity check of the relevance of XPS to the larger world vs. PDF that you might find amusing. A search for common words like “cats” in XPS turns up nothing across all major search engines, while the same thing turns up millions of pdf documents. I thought the common word filter would reflect common use, but now I’m not so sure.
Having established that most search engines return similar results to Google, if they classify results that way, I’ll use Google for the next examples for doc, docx, odt, pdf and much more!
A search for “cats” with the file type doc turns up 64,000 results, 1,400 docx, 112 odt and an astounding 13,000,000 pdf. The word happy, 240,000 to 7,000 to 383 to 4,100,000. Happy, 472,000 to 11,000 to 850 to 16,000,000. Airplane 32,8000 to 858 to 283 to 5,100,000. Hair 240,000 to 4,900 to 447 to 6,500,000.
In general, Google indexes 89,300,000 doc, 2,300,000 docx, 425,000 odt and 500,000 pdf documents without further qualification.
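As a rough cross-check, those unqualified totals can be turned into a docx share of Word documents. This is only a minimal sketch that takes the counts above at face value; the helper name `share` is made up for illustration.

```python
# Unqualified Google counts quoted above, taken at face value.
counts = {"doc": 89_300_000, "docx": 2_300_000, "odt": 425_000}


def share(part: int, *others: int) -> float:
    """Return `part` as a percentage of `part` plus the other counts."""
    total = part + sum(others)
    return 100.0 * part / total


# docx as a share of all Word documents (doc + docx).
docx_share = share(counts["docx"], counts["doc"])
print(f"docx as a share of doc + docx: {docx_share:.2f}%")
```

That works out to roughly 2.5%, which is in line with the chart peaking at less than 3%.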
A common word filter may be a measure of common use or link farming spam.
A relatively uncommon word like ampere brings 15,800 doc, 255 docx, 254 odt and 819,000 pdf. Filament 13,100 doc to 223 to 167 to 1,320,000. Stamen 1,400 to 241 to 4 to 37,000. Stilton 2,700 to 103 to 39 to 51,000. Grommet 1,700 to 112 to 8 to 260,000. Kylix 500 to 16 to 35 to 6,800.
Perhaps less spam targeted words better reflect the general population of file format use?
Given the free availability of Open Office and common use of Google Docs, I’m surprised that odt does not show up more relative to doc and docx.
Perhaps this is because all of the word processing document formats are scratch work that is shed before it is exposed. When people actually publish, they write to the web directly and sometimes make a pdf because it’s easier that way and they know that nine times out of ten that Word Doc won’t look right on the other end. The word “grommet”, for example, is indexed about 3,000,000 times with 800,000 html, 256,000 pdf, 91,000 asp, 5,000 shtml, 3,500 xls, 2,600 txt, 1,600 doc, 1,500 xml, 820 dll, 409 rtf, 300 exe, 141 ppt, 112 docx, 83 xlsx, 8 odt, 9 jpg, 7 png, 6 zip, 4 flv, 1 wpd and 1 gnumeric result. The individual file types only make up about a third of the unqualified result, but I’m at a loss for filetypes. Apparently, no one makes songs or movies about grommets, but people who care to talk about them in public do so on the web or by pdf.
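The “about a third” estimate can be checked by adding up the per-filetype counts. This is a minimal sketch, assuming the figures listed above and treating “about 3,000,000” as an approximate total; the variable names are only illustrative.

```python
# Per-filetype result counts for "grommet", as listed above.
by_type = {
    "html": 800_000, "pdf": 256_000, "asp": 91_000, "shtml": 5_000,
    "xls": 3_500, "txt": 2_600, "doc": 1_600, "xml": 1_500, "dll": 820,
    "rtf": 409, "exe": 300, "ppt": 141, "docx": 112, "xlsx": 83,
    "odt": 8, "jpg": 9, "png": 7, "zip": 6, "flv": 4, "wpd": 1,
    "gnumeric": 1,
}
unqualified = 3_000_000  # quoted as "about 3,000,000", so approximate

# Sum the typed results and compare against the unqualified total.
typed_total = sum(by_type.values())
coverage = 100.0 * typed_total / unqualified
print(f"typed results: {typed_total:,} ({coverage:.0f}% of unqualified)")
```

The typed results come to a bit under 40% of the unqualified count, so “about a third” is in the right neighborhood, and html plus pdf account for nearly all of it.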
I’m not surprised by docx being on the order of a few percent of doc and that should be the end of ooxml FUD for free software developers. The only people I’ve heard of using Microsoft’s new file formats are grade school administrators but I have never seen any myself. This is probably because anyone trying to push an ooxml document is promptly reprimanded when a large portion if not the majority of receivers are unable to read it. OOXML is rarely seen in the wild and is more likely to go the way of XPS, Silverlight, Zune, or Vista than become common. These days Microsoft has more losers than winners. Don’t knock yourself out doing Steve Ballmer’s work, folks.
MSOffice had been crashing a lot lately on one of our office machines and researching it led a coworker to an MS forum page where an MS employee said the .doc format was likely the cause of the crash and shouldn’t be used. While I don’t think this was the cause in our case and I am not sure if it’s ever the cause, it seems to be the case that MS is now spreading FUD about their old formats.
@twitter, I’m seeing the PDF:XPS ratio as 9655:1.
The important thing to distinguish here is the distribution of newly created documents that are put on the web versus the distribution of legacy documents on the web. These are two different things. Using Google we can approximate the latter. And by looking at trends I think we can even model the former, though it is trickier to do so. But we cannot say anything certain about new (or legacy) documents that are not on the web.
So the safest observation is to note that there are very many existing Office documents and the vast majority of them remain in the legacy DOC/XLS/PPT formats.
@lefty.crupps, If you are crashing on a particular document, try loading it in OpenOffice and re-saving it. That has been known to fix such problems.
Anthony Youngman says
Michael: MS don’t drop support for legacy formats? Do you really believe that?
They dropped Word 6 support – was that with O2K? And Word 6 documents can *crash* Word now :-)
Renee Marie Jones says
OOXML? Microsoft DOES NOT SUPPORT OOXML. OOXML is defined by a standard and, although Microsoft writes something they *call* ooxml, it does NOT conform to the standard.
@Renee, You are correct, of course. We need a better name for what MS Office actually writes out. Maybe Microsoft Office XML = MOX?
RE: Renee Marie Jones no OOXML
I cannot help but wonder if MSFT has ever claimed to conform with the “standard” in any contract bid.
Yeah, Microsoft do drop support for older file formats. You still can download Word 5.5 for DOS:
Try to open any file from that version in any recent version of Office. Good luck.
P.S. And of course said version of Word creates files with the .doc extension, which points out that when we talk about filetype:doc we are talking about a bunch of different incompatible formats…
The percentage of each format on the web is not a reliable metric for the percentage of formats in use in organisations. What appears on the web may have more to do with what browsers will support. My organisation uses .docx internally but .doc on our website, as many of our clients are still on IE8, which does not play nicely with the .docx format.