
An Antic Disposition


Bait and Switch

2007/12/06 By Rob 25 Comments

Promises have been made. Assurances have been given. Commitments have been proffered. But far less has been delivered.

Let’s review the record.

We start with the Ecma whitepaper, “Office Open XML Overview” [pdf] which was included in their submission to ISO:

Standardizing the format specification and maintaining it over time ensure that multiple parties can safely rely on it, confident that further evolution will enjoy the checks and balances afforded by an open standards process.

OK. So we were told that if OOXML is standardized, its future evolution will be in an open standards process, with checks and balances.

Brian Jones, from a mid 2006 blog post:

There has also been talk though of taking the formats to ISO once they have been approved by Ecma, which would mean that if ISO chooses to adopt the Open XML formats the stewardship of the formats would be theirs. We’ve had a number of governments indicate that they would like the formats to be given to ISO, and it’s likely that after the Ecma approval that will be the next step.

Again, saying that approval by ISO is tantamount to transferring custody of the format to ISO.

Six months later, Brian wrote:

Some feedback that we got primarily from governments was that they wanted to see these formats not just fully documented, but that the stewardship and maintenance of that documentation should be handed over to an international standards body.
.
.
.
Obviously, a great way to guarantee the long term availability of OpenXML, and the confidence that it won’t change is for an organization like ISO to take ownership of the spec.

OK. Not exactly a signed-in-blood promise, but still a clear, leading indication that the feedback they received from customers was for stewardship and maintenance and even ownership of OOXML to be handed over to ISO.

As the OOXML (DIS 29500) ballot drew nearer to a close, these vague intimations became outright promises. We heard over and over again that we should approve OOXML because that was the only way to ensure that the format would remain open. The first version might be a mess, but if we approve it just this once, all future versions will be developed in openness and transparency.

For example, John Scholes writes of a Microsoft promise made at a National Computing Centre (NCC) file format debate held in London on July 4th:

Would the maintenance of the standard be carried out by Ecma (assuming OpenXML became an ISO/IEC standard) or would it be carried out by JTC1? No question, JTC1. But would the detail be delegated to Ecma? No, it would all be beyond MS’ control in JTC1. Well at this point there was apparently some sotto voce discussion between Stephen and Stijn, followed by a little backtracking, but it came across loud and clear in subsequent discussions in the margins that Stephen and Jerry believed this was for real. MS was handing over control of OpenXML to JTC1 (or trying to).

I participated in this debate as well, and I can confirm that it occurred exactly as John relates. I even asked a follow-up question to make sure that I hadn’t misunderstood what Microsoft was saying. They were adamant. ISO would control OOXML.

Jerry Fishenden, Microsoft’s lead spokesman in the UK, wrote two weeks later:

There’s an easy question to consider here: would you prefer the Microsoft file formats to continue to be proprietary and under Microsoft’s exclusive control? Or would you prefer them to be under the control and maintenance of an independent, open standards organisation? I think for most users, customers and partners that’s a pretty easy question to answer: they’d prefer control and maintenance to be independent of Microsoft. And the good news is that the Open XML file formats are already precisely that: currently under the control of Ecma International (as Ecma-376) and, if the current voting process is positive, eventually under the control of ISO/IEC. Many major and significant UK organisations have already made clear that they support this move for Open XML to become an ISO/IEC standard.
.
.
.

The United States vote is one step in the direction to put Open XML under the control of the ISO/IEC standards body.

So Jerry is stating in no uncertain terms that approval of OOXML puts it under ISO control. This statement was repeated in an August 24th update on Microsoft’s “Open XML Community” web site.

(I’ve heard many second-hand reports of additional repetitions of this promise made at NB meetings around the world, in the run up to the Sept. 2nd ballot. If anyone participated in such a meeting and heard such assurances first hand, feel free to add the details as a comment.)

So much for the promises. What makes this story worthy of a blog post is that we now know that, even as these promises were being made to NB’s, Ecma was planning something that contradicted their public assurances. Ecma’s “Proposal for a Joint Maintenance Plan” [pdf] outlines quite a different vision for how OOXML will be maintained.

A summary of the proposed terms:

  • OOXML remains under Ecma (Microsoft) control under Ecma IPR policy.
  • Ecma TC45 will accept a liaison from JTC1/SC34 who can participate on maintenance activities and only maintenance activities.
  • Similarly, Ecma TC45 documents and email archives will be made available to the liaison (and through him a set of technical experts), but only the documents and emails related to maintenance.
  • No mention of voting rights for the liaison or the experts, so I must assume that normal Ecma rules apply — only Ecma members can vote, not liaisons.
  • Future revisions of OOXML advance immediately to “Stage 4” of the ISO process, essentially enshrining the idea that future versions will be given fast-track treatment.

A critical point to note is that “maintenance” in ISO terms is not the same thing as what the average software engineer thinks of as “maintenance”. The work of producing new features or enhancements is not maintenance. The act of creating OOXML 1.1 or OOXML 2.0 is not maintenance. What is maintenance is the publication of errata documents for OOXML 1.0, a task that must be completed within 3 years.

So what Ecma is offering SC34 is nothing close to what was promised. Ecma is really seeking to transfer to SC34 the responsibility of spending the next 3 years fixing errors in OOXML 1.0, while future versions of OOXML (“technical revisions”) are controlled by Microsoft, in Ecma, in a process without transparency, and as should now be obvious to all, without sufficient quality controls.

This maintenance proposal is on the agenda for the JTC1/SC34 Plenary meeting, in Kyoto on December 8th. I think this one-sided proposal should be firmly opposed.

Consider JTC1 Directives [pdf], 13.13:

If the proposed standard is accepted and published, its maintenance will be handled by JTC 1 and/or a JTC 1 designated maintenance group in accordance with the JTC 1 rules.

JTC1’s practice in such matters is to delegate to the relevant subcommittee, so read “SC34” for “JTC 1” above. So it is within the procedures for SC34 to make this decision. In fact, ownership by the SC is the norm. The clause “and/or a JTC 1 designated maintenance group” is a new addition to the Directives which was added right before the OOXML procedure in ISO began. (Curiously this was the same revision of the Directives that added the escape clause to the Contradiction phase that allowed OOXML to continue despite the numerous unresolved contradictions with existing ISO standards.)

So what does a counter-proposal look like?

First, I think we should defer decision on this until the next SC34 Plenary, presumably in Spring 2008. It is not clear whether or not OOXML will ultimately be approved as an ISO standard, and even if it is, maintenance does not need to be completed for 3 years. So I don’t think we should rush into anything.

The UK has made a proposal to create a new working group (WG) in SC34 dedicated to “Office Information Languages”:

SC34/WG4 would be responsible for languages and resources for the description and processing of digital office documents. The set of such documents includes (but is not limited to) documents describing memoranda, letters, invoices, charts, spreadsheets, presentations, forms and reports.

WG4 would be expected to work on the maintenance of, for example:

  • ISO/IEC 26300:2006
  • ISO/IEC 29500 (should it exist)

and be responsible for reviewing any future office document formats.

I think this deserves serious consideration. This may be the type of neutral venue — not Ecma and not OASIS — that would be conducive to getting the technical experts together to refactor OOXML and harmonize it with ODF. Even in the likely event that OOXML ultimately fails in its bid as an ISO standard, the draft could still be referred to a new WG4 for further work. This would also be a way for Microsoft to fulfill their promise to transfer stewardship, control and ownership of OOXML over to ISO, a promise they made publicly and repeatedly.

Filed Under: OOXML

662 resolutions, but only if you can find them

2007/12/02 By Rob

Microsoft risks a repetitive stress injury from the recent frenzy of patting themselves on the back for responding to some of the ballot comments submitted in the failed OOXML ISO ballot of Sept 2nd.

They claim to be transparent and acting so that NB’s can easily review their progress in addressing their comments.

Well, let’s take a closer look.

First, Microsoft has managed to get JTC1 to clamp down on information. What was a transparent process is now mired in multiple levels of security leading to delay, denial of information to some NB participants and total opaqueness to the public.

Let’s review how things worked with ODF.

  1. OASIS ODF TC mailing list archives are public for anyone to read
  2. OASIS ODF TC public comment list archives are public for anyone to read
  3. OASIS ODF TC meeting minutes, for every one of our weekly teleconferences going back to 2002, are all public for anyone to read.
  4. The results of ODF’s ballot in ISO are public, including all of the NB comments
  5. The comments on ODF from SC34 members are also public
  6. The ISO Disposition of Comments report for ODF is also public for anyone to read

Short of allowing the public to read my mind, there is not much more we can do in OASIS to make the process more transparent. (And if you read this blog regularly you already have a good idea of what I’m thinking.)

But what about the OOXML process? Every single one of the above items is unavailable to the public, and in many cases is not available even to the JTC1 NB’s who are deciding OOXML’s fate.

In fact, OOXML is moving in reverse. Documents that were once public, such as the Sept. 2nd ballot results and NB comments, have been taken down and replaced with password-protected versions. (Look for the DIS 29500 documents here. They all used to be available for all to download.) How do you get access to the password? The password is made available to NB points of contact “on request”. But so far few NB’s have requested it. You can see here which ones have requested the password and which have not. As of today, only 18 of 51 NB’s have requested the password. Only 35% of SC34 NB’s have access to the same information they had back in September. Indeed, we’re moving backwards.

In the particular cases of these “662 responses”, Ecma is hosting them on their web site, on a different password protected page. (Yes, the comments and the resolutions to the comments are on two different web sites with two different passwords.) I’m hearing as well that few NB’s actually have the password, and some who do are not passing it on to their own committee members. I’ve heard from a few NB members who explicitly requested access to these documents but were denied. Others are simply unaware that these comment resolutions are available. What was once an open process is now closing up.

(12/04/2007 Update: Brian Jones claims that these 662 resolutions are protected by JTC1 rules. But JTC1 rules apply to documents submitted into the JTC1 process, hosted by JTC1, assigned JTC1 “N” numbers, and archived by JTC1, as required by the JTC1 process. These 662 resolutions are not called for by the JTC1 process, are not hosted by JTC1, are not assigned JTC1 “N” numbers and are not archived by JTC1. They are Ecma documents, hosted by Ecma, assigned ID’s by Ecma, and controlled by Ecma passwords. These documents were never submitted to JTC1. Ecma is in total control over whether or not the public has access to them.

Brian highlights some rules that apply to the Disposition of Comments report, but that is not what we have before us. We won’t have the Disposition of Comments report until after the Ballot Resolution Meeting. At that point, it will be an official JTC1 document, assigned an “N” number, hosted by JTC1 and accessible via their password.

Note also that Microsoft continues to dodge how closed the Ecma TC45 process has been and remains. Why not open up the TC45 mailing list archives, Brian? Are the ISO meanies stopping you? I know that Ecma is not forcing you. Their policy is to let each TC decide for themselves. I’m sure if Microsoft took a leadership position in favor of openness that you could convince the other members of TC45 to increase their transparency. What do you say?)

(12/06/2007 Update: The former Ecma Secretary General weighs in on the topic in a blog post, confirming that the responses are not controlled by ISO access rules, though the original NB comments are:

Consequently, Ecma is not constrained in posting its interim responses on a publicly available page as long as they are not tied to specific NB comments. In other words, Ecma would have to do some work to separate the proposed responses from the specific NB comments, but then Ecma may make its work publicly visible. If there is so much interest outside the NB circuit, then Ecma will surely do something here.
.
.
.
Indeed, seen from Ecma there is nothing that forbids Ecma to distribute its proposals. But it should also be clear, in the light of the longstanding relationship, that it is not a MUST for Ecma to do this. Good habits and rules have a value, like in any great game, such as football. And also there the rules and habits don’t change overnight because somebody has another, maybe even brilliant idea.)

But suppose you get through your local NB politics and actually lay your hands on the password to the Ecma web site, what do you get then? You then have the privilege of navigating 50 or so different pages, scrolling through them, and clicking on 662 links to download 662 separate PDF files, all from a painfully slow server. Ughh… It hardly seems worth it. It is almost as if someone wants to discourage NB’s from actually reading this stuff.

Aiming to lessen the pain a little, I downloaded all 662 comments, and made a single PDF file that contains all of the comment responses. I also included the original NB comments, and cross-linked everything, so I can navigate from comment to response, and slice and dice it by similar comments, or by NB. It is full text indexed, so I can search for things like “VML” and see all comments or responses relevant to that topic. Since it is liberated from the Ecma website, I can even use it off-line.

Doesn’t my method sound easier to use than downloading 662 PDF files? If you agree, then I’ll make you an offer. If you are a JTC1 or SC34 NB member, and would like access to this consolidated document, let me know via email. (You can find my email address here.) Note that my compilation is not a formal JTC1 document, and that this is not an offer from the US NB. This is a personal offer from me to other individuals who are also JTC1 or SC34 NB members. (Of course, if Ecma wants a copy of this as well to make available for all NB’s to download, then that is even better. They know where to find me.)

So, now that I’ve read through these 662 responses, let me fill you in what we have here. First, I’d like to define some terms, so we’re all on the same page and understand the status of these 662 proposals.

At the BRM, barring any breakdown from lack of consensus, an official “Resolution of Comments” document will be issued. This is the set of textual changes that JTC1 NB’s authorize the Project Editor (Microsoft’s contractor in Ecma) to make to the DIS 29500 specification. Only the BRM can authorize these changes.

By January 14th, JTC1 NB’s will receive from the Project Editor a “Proposed Resolution of Comments” document. This will contain Ecma’s proposals for how they would like to see the Sept. 2nd ballot comments resolved. The BRM is not limited to considering Ecma’s proposals. The NB’s own comments from Sept. 2nd may also be in play, since those often came with their own proposed resolutions which differ from the ones that Ecma will propose.

So what do we have now, in this recent drop of 662 documents from Ecma? I call these by the verbose name “Ecma’s Draft Proposed Resolution of Comments”. They are not the final Resolution of Comments, and they are not even the final Proposed Resolution of Comments. They are a draft of proposed resolutions to 662 of the 3,522 comments submitted by JTC1 on Sept. 2nd.

So the time line is:

  • From now until late January we receive updates from Ecma in the form of Draft Proposed Resolution of Comments. If they continue to be posted in a user-unfriendly form, I will continue to produce updates to my consolidated report.
  • By January 14th, Ecma submits their final Proposed Resolution of Comments
  • At the adjournment of the BRM we have the approved Resolution of Comments
  • The Project Editor then has 30 days to apply the Resolution of Comments to produce the new text of DIS 29500
  • It is the above revised text that NB’s will consider whether to approve or not. Note that since the NB only has 30 days to reconsider their Sept. 2nd vote, and the revised text is not due until 30 days after the BRM, it is likely that NB’s will need to use their imagination and decide based on the approved Resolution of Comments document (perhaps 4,000+ pages in length), not having seen the actual revised text of the DIS.

So, what can we say about the recently released draft proposed resolution of comments documents?

This initial set of responses is almost entirely minor, dealing with corrections to examples, spelling errors, punctuation errors, cleanup of broken links, fixing illegible formulas, adding missing units on quantities, etc. There are also many, many duplicates in this area. In particular, the issue of spreadsheet functions missing units on their arguments (not specifying radians or degrees) was picked up by 12 NB’s. Since there are multiple instances of that defect in the OOXML specification, each one repeated by several NB’s, this single observation results in 48 proposed resolutions. Ecma appears to have concentrated on comments like this, easy to fix and duplicated, in this batch. So although there are 662 resolutions on paper, this maps to perhaps only 80 or so unique issues.

The breakdown of proposed resolutions by NB is in the table below. These numbers are a bit tricky to interpret given the duplicate comments, since one NB’s comments might have been addressed in passing while fixing another NB’s issues. So I doubt Microsoft is spending a lot of time on Colombia, since they voted yes. But there may be significant duplication between Colombia’s comments and those of another NB which Microsoft is trying to please. By looking at unique comments, those submitted by only one NB, we can get a good sense of which NB’s Ecma is trying to please most. And no, I’m not going to tell you which ones they are.

Member Comments Submitted Ecma Responses % Responded to
UK 635 218 34%
Ecma 76 23 30%
Colombia 237 71 30%
Philippines 7 2 29%
USA 288 69 24%
Chile 217 44 20%
Malta 5 1 20%
Japan 82 16 20%
Canada 79 15 19%
Czech Republic 75 13 17%
Uruguay 18 3 17%
Ireland 12 2 17%
France 592 97 16%
Australia 30 4 13%
Germany 162 20 12%
Portugal 118 14 12%
Brazil 64 7 11%
Greece 113 11 10%
Denmark 168 15 9%
Kenya 81 7 9%
Ghana 12 1 8%
India 82 5 6%
Israel 33 1 3%
Venezuela 73 2 3%
Iran 58 1 2%
Turkey 1 0 0%
Jordan 1 0 0%
Ecuador 1 0 0%
Thailand 1 0 0%
Spain 1 0 0%
Belgium 1 0 0%
Austria 1 0 0%
Argentina 1 0 0%
China 1 0 0%
Singapore 2 0 0%
Italy 2 0 0%
Tunisia 3 0 0%
Bulgaria 3 0 0%
Poland 4 0 0%
Mexico 7 0 0%
Peru 10 0 0%
Norway 12 0 0%
Finland 15 0 0%
South Africa 17 0 0%
Switzerland 19 0 0%
Malaysia 23 0 0%
Korea, Republic of 25 0 0%
New Zealand 54 0 0%

To be fair, not every resolution in this batch was editorial. There was some technical detail added. For example, the following points were clarified:

  • The SpreadsheetML AND/OR functions do not short circuit, so all parameters must be evaluated.
  • The CHAR() function converts an integer into a character, but no character set was defined in the DIS to govern this conversion. Microsoft clarified this, saying that the function uses the “Macintosh character set” on the Mac and ANSI on all other platforms.
  • Spreadsheet functions that do searches or string compares (EXACT, FIND, FINDB, SEARCH, SEARCHB, etc.) do so with lexical character comparisons, not collation-based operations.
  • Part names in an OPC package can be IRIs, not just URIs, so Unicode characters are allowed, with some restrictions on item names.
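The short-circuit point in particular is easy to miss, so here is a minimal Python sketch (my own model, not text from the DIS or any Ecma document) contrasting the usual short-circuit behavior of programming languages with SpreadsheetML's evaluate-every-argument semantics:

```python
# Model of SpreadsheetML's AND as clarified: all arguments are evaluated,
# even once the result is already determined. (Illustrative sketch only.)

def spreadsheet_and(*args):
    """Evaluate every argument thunk first, then combine the results."""
    values = [arg() for arg in args]   # every thunk runs, even after a False
    return all(values)

def failing_arg():
    raise ZeroDivisionError("1/0 inside the formula")

# In Python, `and` short-circuits: the failing operand is never evaluated.
assert (False and failing_arg()) is False

# In the SpreadsheetML model, AND(FALSE, 1/0) still evaluates the 1/0,
# so the whole formula yields an error value instead of FALSE.
try:
    spreadsheet_and(lambda: False, failing_arg)
    result = "no error"
except ZeroDivisionError:
    result = "#DIV/0!"
assert result == "#DIV/0!"
```

The practical consequence: a formula like AND(FALSE, 1/0) produces an error in a conforming SpreadsheetML evaluator, where a short-circuiting evaluator would simply return FALSE.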

However, the 662 responses carefully tip-toe around the controversial issues. I guess we’ll read proposals for those in a future update. So NB members, take the opportunity now to get access to this portal. Ask your NB head for access if you haven’t already been given the password. And if you want a copy of my consolidated PDF file, let me know.

Filed Under: OOXML

The Myth Of OOXML Adoption

2007/11/30 By Rob 17 Comments

“Politics aside, there are 400 million users of the Office Open format, and we basically just recognized reality.” So said the retired Secretary General of Ecma, Jan van den Beld, explaining why it is so important to standardize OOXML.

Anyone else want to recognize reality? Maybe I can help.

Two questions to consider: 1) What is the actual state of OOXML adoption? and 2) What influence should market adoption of a technology have on its standardization?

On the first question, we should note that the 400 million users figure quoted by van den Beld in no way concerns OOXML. That figure is merely Microsoft’s estimate of the total number of Microsoft Office users, of all versions, worldwide. Only a small percentage of them are using OOXML.

Let’s see if we can estimate the number.

How are Office 2007 sales? One (leaked) estimate (in September) was 70 million. But a follow-up statement makes it clear this is total Office licenses sold, of all versions. This is probably on the high end, not indicating installations, or even real end sales, since Microsoft typically reports sales into the channel. So that number must be reduced by some factor to account for real installations.

What percentage of Office users are running Office 2007? Joe Wilcox quotes Gartner, saying “Our Symposium survey showed Office at greater than 10 percent installed base…”

And not every Office 2007 user will use the default OOXML formats. I’ve heard that corporate installations often choose to change their configuration to default to Compatibility Mode, so that Office 2007 saves in the legacy binary formats, for the increased interoperability this offers.

How does this net out? Something more than 40 million and less than 70 million seems the right neighborhood.

Let’s look for some more data points.

Take the example of OpenOffice, which has seen over 100 million downloads, not including copies which are already included with Linux distributions. So I believe there are far more OpenOffice users than Office 2007 users. Of course, not all OpenOffice users save in ODF format. Some will change the defaults to use the legacy Microsoft binary formats.

Let’s take a look at an updated version of a chart I made back in May, with data now current through 11/27/2007.

The data here shows the number of documents reported by Google over time for ODF and OOXML documents. Hollow circles are ODF data points; solid circles are OOXML data points. (Yes, I need to figure out how to do scatterplot legends in R.) The X-axis does not show the date. That would not be fair, since ODF had a significant head start in standardization and adoption. So in order to have a fair comparison, both formats are shown against the number of “days since standardization”, which begins May 1st, 2005 for ODF, and December 7th, 2006 for OOXML, the dates the formats were approved by OASIS and Ecma respectively.

Next week is the one year anniversary of Ecma’s approval of OOXML as an Ecma Standard. The news is not good. There are fewer than 2,000 OOXML documents on the entire internet (as indexed by Google at least) and the trend is flat.

What about ODF? Almost 160,000 and growing strongly.

Now we shouldn’t be so careless as to say that there are only 2,000 OOXML documents in existence, or for that matter only 160,000 ODF documents. Not all documents are posted on the web. In fact, most of them are sitting on hard drives, in mail files, behind corporate firewalls, etc. The documents that Google sees are only a sampling of real-world documents. But this is true of both ODF and OOXML. My hard drive is loaded with ODF documents that are not included in the above sampling. But however you spin it, the minuscule number of OOXML documents and their pathetic growth rate should be a cause of concern and distress for Microsoft.

Where are all the OOXML documents? What governments have adopted OOXML? What agencies? What major companies? If there was an adoption bigger than a Cub Scout pack we would have heard it trumpeted all over the headlines. Listen. Do you hear anything? No. The silence speaks volumes.

But for sake of argument, what if the numbers were different? What if there were millions of documents on the web in OOXML format? Would that have any relevance to the JTC1 standardization process? The answer is a clear “No”. Market share, or even market domination, is not a criterion. In the US NB, INCITS, we are required to make our decision based on “objective technical factors”. Making a decision to favor a proposed standard because of the proposer’s market share would bring antitrust risks.

Consider this: In JTC1 we vote. One country one vote. We do not vote based on a nation’s GDP. Jamaica and Japan are equal in ISO. We have engineers review the standards. We do not bring in accountants to review financial statements and verify inventories. If we want to make decisions based on market share then we should scrap JTC1 altogether and hand standardization over to revenue department authorities to administer.

But that would then perpetuate a technological neo-colonialism where the developed world controls the patents, the capital and the standards, and the rest of the world licenses, pays and obeys. There’s the rub. Where standards are open, consensually developed in a transparent process and made available to all to freely implement, there we lower barriers to implementation, level the playing field and allow all nations of the world to compete based on their native genius. But where standards are bought we end up with bad standards and a worse world for it.

Filed Under: OOXML

PDF, The Waste Land, and Monica’s Blue Dress

2007/11/21 By Rob 8 Comments

Adobe’s PDF Architect, James King, has recently started an “Inside PDF” blog which is well worth subscribing to. I’d especially draw your attention to his post “Submission of PDF to ISO” which has much useful information on the process they are going through in ISO, a process that is slightly different than that used by ODF or OOXML in JTC1. (Note in particular that ISO Fast Track is not exactly the same as JTC1 Fast Track.)

In a more recent post, Archiving Documents, James wonders aloud why anyone would use ODF or OOXML for archiving, compared to PDF or PDF/A, saying, “After all, archiving means preserving things, and usually you want to preserve the total look of a document. PDF/A does that.”

I recommend reading the Archiving Documents post in full, and then return here for an alternate point of view.

.
.
.

We use the word “archive” quite easily to cover a large number of activities, and in doing so risk blurring a number of different activities into one over-generalization. Before you are told that format X or format Y is best for archiving, it is fair to ask what is meant by “archiving”, who does the archiving, for what purpose, and under what constraints.

In some cases what must be preserved, and for how long, is spelled out in detail for you, by statute, regulation or court order. Or, a company, in anticipation of such requests may require preservation as part of a corporate-wide records retention policy for certain categories of employees or certain categories of documents.

An example of the range of materials that may be included can be seen in this preservation order:

“Documents, data, and tangible things” is to be interpreted broadly to include writings; records; files; correspondence; reports; memoranda; calendars; diaries; minutes; electronic messages; voicemail; E-mail; telephone message records or logs; computer and network activity logs; hard drives; backup data; removable computer storage media such as tapes, disks, and cards; printouts; document image files; Web pages; databases; spreadsheets; software; books; ledgers; journals; orders; invoices; bills; vouchers; checks; statements; worksheets; summaries; compilations; computations; charts; diagrams; graphic presentations; drawings; films; charts; digital or chemical process photographs; video; phonographic tape; or digital recordings or transcripts thereof; drafts; jottings; and notes. Information that serves to identify, locate, or link such material, such as file inventories, file folders, indices, and metadata, is also included in this definition.
–Pueblo of Laguna v. U.S. // 60 Fed. Cl. 133 (Fed. Cir. 2004).

I would pay particular attention to the part at the end, “…drafts; jottings; and notes. Information that serves to identify, locate, or link such material, such as file inventories, file folders, indices, and metadata”.

Similarly, consider government and academic archives, that are preserving documents for the long-term. The archivist tries to anticipate what questions future researchers will have, and then tries to preserve the document in such a way that it can best answer those questions.

A PDF version of a document answers a single question, and answers it quite well: “What did this document look like when printed?” But this is not the only question that one might have of a document. Some other questions that might be asked include:

  1. What was the nature of the collaboration that led to this document? How many people worked on it? Who contributed what?
  2. How did the document evolve from revision to revision?
  3. In the case of a spreadsheet, what was the underlying model and assumptions? In other words, what are the formulas behind the cells?
  4. In the case of a presentation, how did the document interact with embedded media such as audio, animation, video?
  5. How was technology used to create this document? In what way did the technology help or impede the author’s expression? (Note that researchers in the future may be as interested in the technology behind the document as the contents of the document itself.)

The PDF answers one question — what does the document look like — but doesn’t help with the other questions. But these other, richer questions, will be the ones that may most interest historians.

Let’s take an analogous case. T.S. Eliot’s 1922 poem The Waste Land is a landmark of 20th century literature. Not only is it important from an artistic and critical perspective, but it is also important from a technology perspective — it is perhaps the first major poem to have been composed at the typewriter. What was published was, like a PDF, what the author intended, what he wanted the world to see. That is all the world knew until around 1970, after the poet’s death, when the rest of the story emerged in the form of typewritten draft versions of the poem, with handwritten comments by Ezra Pound.

These drafts provided pages and pages of marked up text that showed the nature and degree of the collaboration between Eliot and Pound far more than had been previously known. This is what researchers want to read. The final publication is great, but the working copy tells us so much more about the process. History is so much more than asking “What?”.  It continues by asking “How?” and eventually asking “Why?” — this is where the real insight occurs, going beyond the mere collection of facts and moving on to interpretation. PDF answers the “What?” question admirably. I’m glad we have PDF as a tool for this purpose. But we need to make sure that when archiving documents we allow future researchers to ask and receive answers to the other questions as well.

Flash forward to the technology of today. We are not all writing great poetry, but we are collaborating on authoring and reviewing and commenting on documents. But instead of doing it via handwritten notes, we’re doing it via review & comment features of our word processors. Although the final resulting document may be easily exportable as a PDF document, that is really just a snapshot of what the document looks like today. It loses the record of the collaboration. I don’t think that is what we want to archive, or at least not exclusively. If you archive PDF, then you’ve lost the collaborative record.

Another example: take a spreadsheet. You have cells with formulas, and these formulas calculate results which are then displayed. When you make a PDF version of the spreadsheet you have a record of what it “looked like”, but this isn’t the same as “what it is”. You cannot look at the formulas in the PDF. They don’t exist. Future researchers may want to check your spreadsheet’s assumptions, the underlying model. There may also be the question of whether your spreadsheet had errors, whether from a mis-copied formula or from an underlying bug in the application. If you archive exclusively as PDF, no one will ever be able to answer these questions.
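To make the contrast concrete: an ODF spreadsheet is just a zip package whose content.xml records each cell’s formula in a table:formula attribute, so the underlying model remains inspectable decades later. Here is a minimal sketch of that idea (the regex-based scan and file name are purely illustrative; a real tool would use a proper XML parser):

```python
import re
import zipfile

def extract_formulas(ods_path):
    """List the formula strings stored in an ODF spreadsheet's content.xml."""
    with zipfile.ZipFile(ods_path) as z:
        content = z.read("content.xml").decode("utf-8")
    # ODF records each cell formula as a table:formula attribute,
    # e.g. table:formula="of:=SUM([.A1:.A10])"
    return re.findall(r'table:formula="([^"]+)"', content)
```

No comparable extraction is possible from a PDF rendering of the same sheet, because the formulas were discarded at export time.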

One more example, going back to 1998 and the Clinton/Lewinsky scandal. Kenneth Starr’s report on the case was written in WordPerfect format, distributed to the House of Representatives, whose staff then converted it to HTML form and released it on the web. But due to a glitch in the HTML translation process, footnotes that had been marked as deleted in the WordPerfect file reappeared in the HTML version. So we ended up with an official published Starr Report, as well as an unofficial HTML version which had additional footnotes.

Imagine you are an archivist responsible for the Starr Report. What do you do? Which version(s) do you preserve? Is your job to record the official version, as-published? Or is your job to preserve the record for future researchers? Depending on your job description, this might have a clear-cut answer. But if I were a future historian, I would sure hope that someone someplace had the foresight to archive the original WordPerfect version. It answers more questions than the published version does.

So, to sum it up: What you archive determines what questions you can later ask of a document. If you archive as PDF, you have a high-fidelity version of what the final document looked like. This can answer many, but not all, questions. But for the fullest flexibility in what information you can later extract from the document, you really have no choice but to archive the document in its original authoring format.

An intriguing idea is whether we can have it both ways. Suppose you are in an ODF editor and you have a “Save for archiving…” option that would save your ODF document as normal, but also generate a PDF version of it and store it in the zip archive along with ODF’s XML streams. Then digitally sign the archive along with a time stamp to make it tamper-proof. You would need to define some additional access conventions, but you could end up with a single document that could be loaded in an ODF editor (in read-only mode) to allow examination of the details of spreadsheet formulas, etc., as well as loaded in a PDF reader to show exactly how it was formatted.
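The packaging half of that idea is easy to sketch with ordinary zip tooling. The function below is a hypothetical illustration, not a real editor feature: it copies an ODF package and embeds a pre-rendered PDF as an extra stream. A real implementation would also register the stream in META-INF/manifest.xml and apply the digital signature and timestamp, which are omitted here:

```python
import shutil
import zipfile

def add_archival_pdf(odf_path, pdf_path, out_path):
    """Copy an ODF document and embed a PDF rendering beside its XML streams.

    Hypothetical sketch: a real "Save for archiving..." feature would also
    update META-INF/manifest.xml and sign and timestamp the whole package
    to make it tamper-evident.
    """
    shutil.copyfile(odf_path, out_path)
    # ODF packages are zip archives, so we can append a new stream in place.
    with zipfile.ZipFile(out_path, "a") as z:
        with open(pdf_path, "rb") as f:
            z.writestr("Archive/rendering.pdf", f.read())
```

The result is one file that an ODF editor can open for the full detail and a PDF-aware tool can render exactly as archived.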

Filed Under: ODF

Document Format FUD: A Guide for the Perplexed

2007/11/18 By Rob 8 Comments

I’ve decided to put together a list of misconceptions that I hear, generally on the topic of document formats. I’ll try to update this list to keep it current, with the most recent entries at the top. Readers are invited to submit the FUD they observe as comments, and I’ll include it where I can.

This inaugural edition is dedicated to the fallout from the recent supernova we know as the OpenDocument Foundation, that in one final act of self-immolation swelled from obscurity to overwhelming brilliance, but then slowly faded away, ever fainter and more erratic, little more than hot gas, the dimming embers no longer sustainable.


Q: Now that the originator and primary supporter of OpenDocument Format has ended its support for ODF, does this mean the end for the ODF standard? (18 Nov 2007)

A: This question is based on a mistaken premise, namely that the OpenDocument Foundation was the originator or steward of the ODF standard. This is an erroneous notion.

The ODF standard is owned by the OASIS standards consortium, with over 600 member organizations and individual members. The committee within OASIS that does the technical work of maintaining the ODF standard is called the OpenDocument TC. It has 15 organization members as well as 7 individual members. Until recently the OpenDocument Foundation was a member of the ODF TC, one voice among many.

The adoption of the ODF standard is promoted by several organizations, most prominently the ODF Alliance (with over 400 organizational members in 52 countries), the OpenDocument Fellowship (around 100 individual members) and the OpenDoc Society (a new group with a Northern European focus, with around 50 organizational members). To put this in perspective, the OpenDocument Foundation, before it changed its mission and dissolved, had only 3 members.

When you consider the range of ODF adoption, especially in Europe and Asia, the strong continuing work on ODF 1.2 in OASIS, and the strong corporate, government and organizational participation demonstrated in the global ODF User Workshop recently held in Berlin, we seem to be making a disproportionate amount of noise over the hysterics of the disintegrating 3-person OpenDocument Foundation.

A number of analysts/journalists/bloggers didn’t check their facts and fell into the trap, ascribing far greater importance to the actions of the Foundation than it merited. Curiously, these articles all quoted the same Microsoft Director of Corporate Standards. I hope this correlation does not prove to be a persistent contrary indicator for accuracy in future file format stories.

Luckily for us, David Berlind over at ZDNet has penetrated the confusion and gets it right:

…the future of the OpenDocument Foundation has nothing to do with the future of the OpenDocument Format. In other words, any indication by anybody that the OpenDocument Format has been vacated by its supporters is pure FUD.

11/27/2007 Update: Berlind did further research and interviews on this topic and followed up with a podcast and new blog post, OpenDocument Format Community steadfast despite theatrics of now impotent ‘Foundation’, on this subject.


Q: The OpenDocument Foundation has a document, a “Universal Interoperability Framework”, whose title page says “Submitted to the OASIS Office Technical Committee by The OpenDocument Foundation October 16, 2007”. What is the status of this proposal in the ODF TC? (18 Nov 2007)

A: No such document has been submitted to the OASIS TC, on this date or any other date. OASIS policy states that “Contributions, as defined in the OASIS IPR Policy, shall be made by sending to the TC’s general email list either the contribution, or a notice that the contribution has been delivered to the TC’s document repository”. A look at the ODF TC’s list archive for October shows that there was no such contribution.


Q: The Foundation claims that the W3C’s CDF format has better interoperability with MS Office than ODF has. Is this true? (18 Nov 2007)

A: The Foundation’s claims have not been demonstrated, or even competently argued at a technical level that would allow expert evaluation. I cannot fully critique what is essentially vaporware. However, those who know CDF better than I do have commented on the mismatch between CDF and office documents, for example the recent interview with the W3C’s Chris Lilley in Andy Updegrove’s blog.


Q: So, does IBM then oppose CDF in favor of ODF? (18 Nov 2007)

A: No. IBM supports both the development of ODF and CDF and has a leadership role in both working groups. These are two good standards for two different things.

The W3C, over the years has produced a number of reusable, modular core standards for things like vector graphics (SVG), mathematical notation (MathML), forms (XForms), etc. To use a cooking analogy, these are like ingredients that can be combined to make a dish. ODF has taken a number of W3C standards and combined them to make a format for expressing conventional office documents, the familiar word processor, spreadsheet and presentation documents. ODF is an OASIS and ISO standard.

But just as eggs, butter and flour form the base of many recipes, the core W3C standards can be assembled in different ways for different purposes. This is a good thing.

CDF is not so much a final dish as an intermediate step, like a roux (flour + butter) in making a sauce. You don’t use a roux directly, but build upon it, e.g., add milk to make a béchamel, add cheese for a cheese sauce, etc. CDF itself is not directly consumable. You need to add a WICD profile, something like WICD Mobile 1.0, before you have something a user agent can process.


Filed Under: ODF Tagged With: CDF, ODF, Open Document Foundation


Copyright © 2006-2026 Rob Weir · Site Policies