
The Case for a Single Document Format: Part III

This is Part III of a four-part post.

In Part I we surveyed a number of different problem domains, some that resulted in a single standard, some that resulted in multiple standards.

In Part II, we described the forces that tend to unify or divide standards and showed in particular how network effects can drive the adoption of a single standard.

In this Part III we’ll look at the document formats in particular, how we got to the present point, and how and why historically there has been but a single universally-accepted document format.

In Part IV, we’ll tie it all together and show why there should be, and will be, only a single open digital document format.

The Meeting

It is 9:55 on an average Tuesday morning. I’m late (as usual) preparing for a meeting. With 5 minutes to go, I send out an updated meeting invite, with an updated agenda and a URL for the web conference. I also send out another email with an updated presentation attachment. It is the standard last-minute, pre-meeting shuffle that we all do. I expect that an examination of traffic statistics on IBM’s email servers shows a spike 5 minutes before every hour, as we all send out last-minute meeting updates. I log in to my web conference and dial into the call. I’ll be meeting with my teammates, some in Westford, some in Raleigh, some in Portsmouth, some in Lexington, some in Dublin and some in Shanghai, a far-flung group. I’ve worked with some of these guys for years but still have never met most of them face-to-face. This is the nature of collaboration in a modern, global company. The call starts and I take a deep breath, push off my slippers and stretch my toes. Yes, I’m leading this meeting from home today.

“Don’t be impatient, Comrade Engineer; We’ve come very far, very fast”, in the words of Yevgraf Zhivago, Alec Guinness’s character in Doctor Zhivago. Let’s flash back 10 years and remind ourselves how we worked then…

It is 9:55 on an average Tuesday morning. I’m late (as usual) preparing for a meeting. With 5 minutes to go, I print out the agenda and handouts to the laser printer down the hall. It has printed by the time I arrive, and I sort through the three or four other print jobs to find the one that is mine. I need twelve copies for the meeting, so I join the queue at the photocopier, with everyone else who also waited until the last minute to print out the materials for their meetings. It is the standard last-minute, pre-meeting shuffle that we all do. I expect that an examination of statistics on IBM’s photocopiers shows a spike 5 minutes before every hour. I head over to the conference room and start the meeting. At the end of the meeting, 80% of the printed materials will be discarded, hopefully into the recycling bin. This was the nature of collaboration in a modern, global company, circa 1995.

What has changed? Why did it change? What does this mean for document formats?

My family in documents

Let me take you on a detour, back in time, to tell a 200-year family story, illustrated with official documents of the period.

I’ll start with the following excerpt from the 1930 Federal Census returns for Abington, Massachusetts, showing my grandmother, Florence Mae Cushing, then age 18, and her parents William and Mary, and household. The columns indicate the following:

  1. Name
  2. Relationship to the head of household
  3. Whether they own or rent their dwelling
  4. Value of their dwelling
  5. Whether they own a radio
  6. Whether they own a farm
  7. Sex
  8. Race
  9. Age
  10. Marital condition
  11. Age at first marriage
  12. Whether they are in school

The thing that caught my eye about this record is that it lists a “Damon, Mary K” as William’s mother-in-law, widowed, age 73, living with them. Let’s see what we can find out about this woman. The first step is to find her maiden name. A search for her marriage record in Abington failed, so we tried for Mary E. Damon’s birth record, which we did find in Abington’s birth register for 1887, revealing her mother’s maiden name as “Chessman”:

This then allows us to find Mary K. Chessman’s birth record, also in Abington, from 1856 listing her parents as Edward and Emily:

And then from here we can go back and find the family in the 1860 Federal Census:

We see the family as owning $500 in real estate and $100 in personal property, having 5 children, the oldest 8 years old. Mary K. is only 3.

But when I skip ahead to the 1870 Census, something is clearly wrong:

As you can see above, Emily is listed as head of household, and there is no Edward. And where is our Mary K? At age 13, she has moved out and is working as a “domestic servant” with a family of factory workers. Her sister Harriet, age 15, is also living there and working in an “eyelet factory”:

So what happened? Resolving this mystery required a bit more sleuthing, but I eventually found the answer in a response to a records request to the National Archives and Records Administration (NARA):

From this I learned that Edward Blanchard Chessman, Mary K’s father, had served in the Civil War with the Massachusetts 32nd Volunteers and had died of disease in 1863 at a military hospital in Alexandria, Virginia. This, along with a dozen pages of additional documents from NARA, detailed the pension application of his widow, the depositions of witnesses who vouched for their marriage and his service, the periodic requests for pension increases, all the way to 1903 when Emily died and her pension file was closed, marked “DEAD” with a big, bold stamp.

Since I was now tipped off to the value of pension records, I next searched for Edward’s grandfather, Ziba Chessman, who I knew had served in the Revolutionary War. I was able to locate his widow’s pension application as well:

The hand of this writer is not so easy to read, but I’d transcribe the start of it as:

Commonwealth of Massachusetts. Norfolk County. On this twenty second day of August 1838 personally appeared before Herman **** The *** of Probate in **** County, Mehitable Chessman a resident in the Town of Braintree in the County of Norfolk and state of Massachusetts aged seventy three years, who being first duly sworn according to law doth on her oath make the following declaration in order to obtain the benefit of the provision made by the Act of Congress passed July 7th 1838 entitled “An Act Granting Half Pay and Pensions to Certain Widows”, that she is the widow of Ziba Chessman late of Braintree in the County of Norfolk and state aforementioned deceased, who was a Soldier in the War of the Revolution; that her said husband Ziba Chessman enlisted into Captain Isaac Thayers or Captain Nathaniel Belchers Company in the year 1775 and served a short period of time as a private with the Massachusetts Militia, around the shores of Boston, according to the best of her knowledge….

I am in awe that these records have been maintained and preserved for so long, and made available to people like me who are researching their family tree. There is a continuity of records in New England that goes back almost 400 years. Birth, education records, draft registration, military service, marriage, court appearances and eventually death and burial. Whenever your personal life crossed paths with the government, it generated a record and this record may last forever, and more importantly, once the physical preservation aspects are taken care of, these records can be read forever.

A brief history of document technology

It is somewhat odd that we’ve been debating document formats for so long and have not really said what they are. I’ll recommend the following for our discussion:

A document format consists of the conventions that allow a document to be fixed in a persistent state and then exchanged with other parties who are able to use these same conventions to read and further edit that document. If you and I understand the same document format, then you and I can exchange documents in that format and we can collaborate using that format.

Since around 1450, with Gutenberg’s first notable success of combining document production and automation, and even before (and since) with manual document production, there has been a single globally relevant interoperable document format — ink on paper. Everyone could create it, everyone could read it, everyone could exchange it. It worked then and it works now.

Some noticeable advances in documents since 1450 include the invention of pre-printed forms, around 1850. These seem obvious now, but for many years we had what were called “formulary documents” which had boilerplate text which the clerk wrote out in full for each document, in addition to the customized language for each specific instance. You can get a sense of this from Ziba Chessman’s pension application quoted earlier. From an engineering perspective you can think of this as reuse of design, but not implementation.

Having a pre-printed form was a step forward in productivity, allowing a greater degree of reuse. The Surgeon General’s form shown above is an early example. Such forms were quickly associated with bureaucracy. In fact, the first written use of the word “form” in the English language (according to the Oxford English Dictionary) was this critical view of a 19th century government office:

The waiting-rooms of that Department soon began to be familiar with his presence, and he was generally ushered into them by its janitors much as a pickpocket might be shown into a police-office; the principal difference being that the object of the latter class of public business is to keep the pickpocket, while the Circumlocution object was to get rid of Clennam. However, he was resolved to stick to the Great Department; and so the work of form-filling, corresponding, minuting, memorandum-making, signing, counter-signing, counter-counter-signing, referring backwards and forwards, and referring sideways, crosswise, and zig-zag, recommenced — Dickens, Little Dorrit (1855)

The telegraph (1837) and teletype (1910) gave new, faster ways of moving documents around. Was Morse Code a new document format? Although the telegraph operators may have worked in Morse Code, the author of the document, and the person who ultimately received and read the document still worked with ink on paper.

The typewriter (1872) increased the speed and uniformity of personal document production. This also led to a new use for carbon paper, an invention of 1806 originally created as an aid for the blind.

In the late 1880’s, Edison’s “Autographic Printing” was commercialized as the Mimeograph, giving a cheaper method of small batch document production.

Melvil Dewey (of Dewey Decimal fame) invented the hanging file folder (1893), leading to increased efficiency of document storage and retrieval.

The Harris Automatic Press Company is incorporated in 1895, ushering in the commercial use of offset printing and a 10-fold increase in document output rates.

The invention of the Soundex algorithm by Robert Russell of Pittsburgh in 1918 allowed more efficient searching of files and cards indexed by surnames, by grouping together names which were phonetically similar.
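To make the grouping idea concrete, here is a minimal sketch of the commonly described American Soundex coding. It is an assumption that this variant matches the rules used on any particular set of index cards, and it is not claimed to reproduce Russell's original 1918 method. The function name `soundex` and the spelling variants of the surname are illustrative only.

```python
# Minimal sketch of the commonly described "American Soundex" coding.
# Assumption: this is the textbook variant, not necessarily Russell's
# original method. A name becomes one letter followed by three digits.

# Consonant groups that share a digit because they sound alike.
_GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code for a surname ('' if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    result = [first]
    prev = _CODES.get(first, "")  # code of the most recent coded position
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are ignored and do not separate equal codes
        code = _CODES.get(ch, "")  # vowels and Y map to ""
        if code and code != prev:
            result.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return ("".join(result) + "000")[:4]  # pad with zeros, keep 4 characters


if __name__ == "__main__":
    # Hypothetical spelling variants of one surname land in the same group.
    for surname in ("Chessman", "Cheesman", "Chesman"):
        print(surname, soundex(surname))
```

A clerk filing cards by this code would place "Chessman", "Cheesman" and "Chesman" together, which is the kind of phonetic grouping that makes a surname search tolerant of spelling drift across records.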

In 1924 radio facsimile allows pictures, as well as text, to be transmitted long distances.

In 1948 Xerography gave us document duplication without the use of wet, messy chemicals.

In 1969, IBM’s Charles Goldfarb, Ed Mosher and Ray Lorie invented GML, the Generalized Markup Language, the ancestor of SGML, HTML and XML.

The 1970’s saw the rise of the first computer-based word processors, including Wang’s Office Information System.

In 1974 Xerox PARC engineers create Bravo, the first WYSIWYG word processor.

In 1975, with the rise of office automation systems and early word processors, Business Week boldly proclaimed the “Paperless Office”.

At this point we reach an important fork in the road of history. What would the computer and office automation mean for the future of documents? Does the paperless office become a reality? Or do we remain with paper-based documents? As Xerox PARC engineers were developing the world’s first WYSIWYG word processor, at the same time they were also developing a system for transporting documents electronically, from one computer to another. But this innovation was dropped because it went against Xerox’s core business, the creation and duplication of paper documents. So the choice was made. Paper still ruled. Paper consumption went up, not down. The word processor made it easier to produce more paper, faster. The paperless office did not happen, at least not yet. More first-hand details on this fascinating topic can be read in Sellen & Harper’s The Myth of the Paperless Office. In their words, “…paper became a surrogate for the network, enabling users with different machines to share documents…”.

And so we continued, for another 20 years of WYSIWYG word processors: WordStar, MacWrite, Writing Assistant, Manuscript, WordPerfect, Word, WordPro, etc. We all created documents and hid the files away on our hard drives in incompatible formats. When we needed to work with others we usually just printed out the document and exchanged the printout, using the 500-year-old format of ink on paper.

Let’s pause here and make some observations.

First, note the areas of sustained and recurring innovation. These have been consistent throughout the past 500 years and reflect the ongoing nature and practical concerns of business communications:

  1. Document authoring
  2. Document duplication
  3. Document distribution
  4. Filling out of forms
  5. Submission of forms
  6. Processing of forms
  7. Storage and Retrieval of documents
  8. Authentication of documents (not mentioned in the history above, but the use of notaries public and corporate seals has facilitated this with ink and paper documents, in some forms back to ancient Rome.)

Note also that the engineering progress and increases in efficiencies in these areas occurred without challenging the primacy of a single document format. The universality of ink and paper did not stifle innovation over these 500 years. On the contrary a single standard document format encouraged and focused innovation. We went from documents authored by pen, then set in moveable type, manually pressed, bound and distributed at the speed of a horse, to where we were circa 1995, when I authored documents on a computer, printed to a laser printer and then queued up at the photocopier to make copies of my agenda before the meeting started. Ink on paper — it was the standard document format for 500 years.

But of course, we don’t work this way anymore. Something changed, very recently. I don’t print out agendas any more. I send them via email. I don’t print out reports and review them with a red pen in hand. I mark them up electronically. In fact, unless I need to sign it or staple a receipt to it, I don’t print out anything. I think I can live out the remainder of my professional career on only 2 reams of paper.

What happened then to change this? Why is there less of an emphasis on printed output today? What does this mean for WYSIWYG? And what does this mean for document formats?

I’ll take up these questions and others when I finish up this series in Part IV.

20 April 2007 — Another editing pass, tightening up the language, but still too long. Added link to “The Myth of the Paperless Office”.

14 comments
  • zridling 2007/04/11, 4:28 am

    Wow, that’s some great footwork in this section of the essay. Imagine if Proust had written his grand novel in Word. The track changes features would not have been consistent enough over subsequent versions to keep up and retain his notes and changes, which would have greatly deterred translators throughout the 20th century.

    So I guess the question is, when you want to save something/anything, what will you reach for? I’ll take mine with ODF because I know how it is composed, who controls it, and that it will always have a freeware implementation of itself. At least that’s a light-year ahead of that other spec.

  • notary stamp 2007/04/11, 5:45 am

    It’s more of leading to a paperless society that makes our documentation process change. Great to know that you still have your family’s documents available. But what if we no longer keep records on hard copies? Will our emails be kept forever? We are hopeful but we’re not sure. Do you have something we can look forward to on your next posting?

  • The Wraith 2007/04/11, 6:54 am

    Where there was only a single format it was generally by choice of the people using / working with that format.

    It wasn’t just a question of who asked ISO to standardize its format first!

    Also at the moment you already have different formats like .doc and .pdf. How can this be explained in your single format theory?

  • Rob 2007/04/11, 7:33 am

    Good point, and this is one of the reasons why PDF is not the complete answer. PDF is great for capturing the final fixed presentation form of a document, but you lose the revisions, the spreadsheet formulas, the items that show how the work or collaboration was done. From an historical perspective, those details may be the most important details.

    I took a course on the history of physics from Gerald Holton years ago. One thing I learned there was the importance of getting access to the scientist’s lab notebooks. The published papers are too clean, they make things look too predictable, too obvious. You get a much better sense of how the discovery was really made when you read the notebooks and interpret every number and every symbol.

    This applies to literary works as well. When the typewritten manuscript draft of T.S. Eliot’s “The Waste Land” was discovered in 1968 we finally saw the handwritten notes, corrections and suggestions from his friend Ezra Pound, and realized how important that collaboration was to the work.

    The same thing applies to business collaboration. It would be impossible for us to accomplish our work in the OASIS ODF TC without having a single interoperable format to work with, one rich enough to handle the formatting of the ODF specification, as well as rich enough to handle the change tracking, revision tracking and associated features needed for document collaboration.

  • Rob 2007/04/11, 8:02 am

    Will our electronic documents be kept forever? My focus is on ensuring that the formats in which the documents are stored and exchanged are capable of being read for long periods of time, i.e., that they are not tied to any one application, operating system or vendor. That was always the beauty of paper. You probably don’t have personal access to a quill pen, a mimeograph machine, a radio facsimile receiver or carbon paper, but you can easily read any document produced with these technologies, because the document format, the conventions for how we read the documents, has remained the same.

    It is interesting that in 500 years, until now, no commercial interest has ever dared to carve this vast interoperable document landscape into a private proprietary fiefdom. Sure, we had intentionally closed formats even in the days of ink on paper. We had our secret codes and Enigma machines and such. But this was the realm of espionage not of business collaboration.

    But of course, in addition to the format issue, there are collection, physical media preservation, funding, privacy and other concerns about long term digital document archiving. The National Archives of Australia, in particular, has written a lot on this subject. An open document format enables, but alone is not sufficient for, long-term availability of digital records. A lot of other pieces need to come together first.

    To The Wraith’s comment about choice, we should remember that in the paper world, ISO (and ANSI and others) standards played a large part in standardizing paper sizes, necessary for efficient filing and retrieval of documents. As a variety-reducing standard, the standard paper sizes lead to economies of scale around envelopes, filing folders, filing cabinets, printer paper trays, shredders, etc.

    No one ever claimed that the exact dimensions of A4 paper were magically superior to paper that was 2% larger or 2% smaller. No one ever complained that standardizing on that paper size would eliminate user choice, and cause innovation to suffer. But as a variety-reducing standard it was good to adopt it and optimize the market in paper-related technologies around a single family of paper sizes.

  • PolR 2007/04/11, 8:51 am

    Corporations have a desire to reduce the variety of technologies in their organisations to control the TCO. This is why they tend to select corporate standards for any technology of significance.

    When multiple corporations select different standards on data they need to share, interoperability becomes a problem. This is especially true with documents because they are meant to be exchanged in the first place.

    This article illustrates magnificently the significance of time. A standard that has trouble lasting as little as a decade before it gets superseded is a major problem. Microsoft Office formats are deprecated every few years. OOXML comes with an expectation that billions of existing documents are mass converted for compatibility. Will we see a repeat at every new version of Office because Microsoft changes some details in the proprietary aspects of Office? You can’t manage historical records that way.

    When I see Microsoft promoting choice, I expect this to fall on deaf ears. Diversity is the worst choice of all and everybody including Microsoft knows it. Otherwise we wouldn’t use TCP/IP. We would all stick with SNA, DECnet, XNS, IPX, Appletalk and the likes. We had plenty of choice back then.

  • Anonymous 2007/04/11, 9:41 am

    “No one ever claimed that the exact dimensions of A4 paper were magically superior to paper that was 2% larger or 2% smaller.”

    Actually, they were :-)

    The A measures were standardized to fit the metric system.

    After the meter was standardized over continental Europe, all machinery and tools were created in easy to measure dimensions.

    Like the old buildings of the classical ages, which all had dimensions in whole local feet/thumbs etc. sizes.

    The A0 basic size was determined by the standard square meter. Paper weight and costs are by the square meter too.

    Therefore, having basic sheet sizes in integral numbers per square meter simplifies accounting tremendously.

    There are 16 A4 in a square meter. So there are 32 square meters in 512 sheets. Standard packages are 500 sheet (I didn’t count them, this could be rounded). So the weights and costs are easy to calculate.

    Making them 2% larger/smaller makes accounting really difficult.


  • Anonymous 2007/04/11, 9:48 am

    “Where there was only a single format it was generally by choice of the people using / working with that format.”

    To go back to A measures. The metrical system was forced upon continental Europe by Napoleon. We were all the better for it. Time zones were forced upon us by the railroads.

    Doc was forced upon us by Microsoft. Even as an MS Office user, I hated the format because it has made me lose data and work. Still I had to use the application so there was no choice.

    I want to be able to buy lightbulbs without having to worry over fitting them into my lamp-shades. Just as I have always hated proprietary vacuum cleaner bags, which were expensive and I kept getting the wrong ones (I finally bought a Dyson).


  • Anonymous 2007/04/11, 1:29 pm

    Just to be picky, but the meter was never standardised in Europe. The metre was.

    This is yet another issue where one standard should be “choosen”, because multiple versions of English help no one.

    I’d like to see Microsoft explain why US English and UK English (along with all the other variants) are good for the consumer.

  • Queen Elizabeth 2007/04/12, 3:54 pm

    Last anonymous–by your logic, multiple languages help no one.

    Maybe the US/UK divide is silly, but consider this: lots of us speak differently for a reason. American bureaucrats do not talk like British bureaucrats because they have to. The Justice Department is not the same as the Home Office. And attorneys/lawyers are not the same as barristers/solicitors.

    Even if we were to have a world tongue, there would still be all sorts of parlances and jargons. They all have different uses.

    Just like file formats.

  • Rob 2007/04/12, 4:42 pm

    British English versus American, Australian, Indian or whatever variety of English: this is partly a product of different environments. The settlers in North America encountered different animals and plants and sometimes gave them new names, sometimes adopted the names from the Native Americans, and sometimes applied the existing name of the closest thing they were familiar with back home.

    On top of that there was the factor of separate linguistic evolution enforced by geographic isolation.

    The English language has picked up words from the many cultures it has been in contact with over the centuries: French, Norse, Spanish, scientific terms from Greek and Latin, Arabic, Indian, Celtic, etc. This has made English richer, but has not turned it into something un-English.

    The average high school graduate can read Shakespeare with little difficulty, Chaucer with some help, and Beowulf with a semester of Anglo Saxon. This is pretty good stability, although I know there are examples of isolated languages which have been even more stable over time, such as Icelandic.

  • Anonymous 2007/04/13, 12:39 am

    The basic problem with all the language responses is that language is not a standard. National governments, which themselves are an invention of the last centuries, have tried to standardize and enforce languages, but that never worked.

    The reason is that speech is not a designed artefact, but part of our biology. Language and speech are as much part of being human as walking on two legs.

    What gets standardized is a national ORTHOGRAPHY. Writing IS a designed artefact and has to be standardized to allow communication.

    This difference is clearly visible from the fact that in English, the relation between orthography and speech is little better than in Chinese. But in many languages, there is a good correspondence between these two, eg, Spanish and Italian.

    The differences between the variants of English (and Chinese) are covered up by the orthography, where most incomprehensible variants are found in the UK.

    So, US, UK, and Aussie writings present no problems, but try to understand rural Scottish SPOKEN English (one of the oldest variants).


  • Anonymous 2007/04/13, 2:56 am

    Actually, there are competing paper standards – metric A0/A1/A2/A3/A4…; B1/B2/B3/B4…; C0/C1/C2/C3/C4…; and those used in North America Letter/Legal etc. It is not fun sharing documents with North America when everywhere else uses metric, as A4 doesn’t fit nicely on Letter. North America, as a market, is big enough to sustain a different usage to the rest of the world, and operating across the divide is painful. For an interesting read see http://www.cl.cam.ac.uk/~mgk25/iso-paper.html and especially the section labelled “Hints for North American paper users”. The link in that website to ftp://ftp.isi.edu/in-notes/rfc2346.txt “Making Postscript and PDF International” is also useful.

  • Anonymous 2009/05/10, 6:26 pm

    Gutenberg’s press of the mid-1400’s represents an inflection point. The original document standard of course was carved or painted stone (not very portable) such as decorative panels and cave paintings [see also the Rosetta stone]. This was followed by clay and wax tablets (the original scratch pads) most notably the Sumerian tax records found in the Middle East (http://lcweb2.loc.gov/intldl/cuneihtml/gazette.html). The Chinese and the Egyptians contributed paper and block printing (2nd century China; ~3500 BC Egypt [papyrus]). The Arabs contributed paper-making machinery as the technology migrated from Asia to Europe. What Gutenberg did was lower the cost of reproduction and thereby enable mass literacy.

    P.S. I ask my fellow bibliophiles to mourn the burning of the Mayan bark books (http://www.mayacodices.org/) in 1562. This and the destruction of the Library of Alexandria make me wish for some sort of time machine / viewer so that we can recover what has been lost.
