PDF, The Waste Land, and Monica’s Blue Dress

Adobe’s PDF Architect, James King, has recently started an “Inside PDF” blog which is well worth subscribing to. I’d especially draw your attention to his post “Submission of PDF to ISO”, which has much useful information on the process they are going through in ISO, a process that is slightly different from that used by ODF or OOXML in JTC1. (Note in particular that ISO Fast Track is not exactly the same as JTC1 Fast Track.)

In a more recent post, Archiving Documents, James wonders aloud why anyone would use ODF or OOXML for archiving, compared to PDF or PDF/A, saying, “After all, archiving means preserving things, and usually you want to preserve the total look of a document. PDF/A does that.”

I recommend reading the Archiving Documents post in full, and then returning here for an alternate point of view.

.
.
.

We say the word “archive” quite easily and cover a large number of activities by that name, and in doing so risk blurring a number of different activities into one over-generalization. Before you are told that format X or format Y is best for archiving it is fair to ask what is meant by “archiving” and ask who does the archiving, for what purpose and under what constraints.

In some cases what must be preserved, and for how long, is spelled out in detail for you, by statute, regulation or court order. Or a company, in anticipation of such requests, may require preservation as part of a corporate-wide records retention policy for certain categories of employees or certain categories of documents.

An example of the range of materials that may be included can be seen in this preservation order:

“Documents, data, and tangible things” is to be interpreted broadly to include writings; records; files; correspondence; reports; memoranda; calendars; diaries; minutes; electronic messages; voicemail; E-mail; telephone message records or logs; computer and network activity logs; hard drives; backup data; removable computer storage media such as tapes, disks, and cards; printouts; document image files; Web pages; databases; spreadsheets; software; books; ledgers; journals; orders; invoices; bills; vouchers; checks; statements; worksheets; summaries; compilations; computations; charts; diagrams; graphic presentations; drawings; films; charts; digital or chemical process photographs; video; phonographic tape; or digital recordings or transcripts thereof; drafts; jottings; and notes. Information that serves to identify, locate, or link such material, such as file inventories, file folders, indices, and metadata, is also included in this definition.
–Pueblo of Laguna v. U.S., 60 Fed. Cl. 133 (Fed. Cl. 2004).

I would pay particular attention to the part at the end, “…drafts; jottings; and notes. Information that serves to identify, locate, or link such material, such as file inventories, file folders, indices, and metadata”.

Similarly, consider government and academic archives that preserve documents for the long term. The archivist tries to anticipate what questions future researchers will have, and then tries to preserve the document in such a way that it can best answer those questions.

A PDF version of a document answers a single question, and answers it quite well: “What did this document look like when printed?” But this is not the only question that one might have of a document. Some other questions that might be asked include:

* What was the nature of the collaboration that led to this document? How many people worked on it? Who contributed what?
* How did the document evolve from revision to revision?
* In the case of a spreadsheet, what was the underlying model and assumptions? In other words, what are the formulas behind the cells?
* In the case of a presentation, how did the document interact with embedded media such as audio, animation, video?
* How was technology used to create this document? In what way did the technology help or impede the author’s expression? (Note that researchers in the future may be as interested in the technology behind the document as the contents of the document itself.)

The PDF answers one question (what does the document look like?) but doesn’t help with the others. Yet these other, richer questions are the ones that may most interest historians.

Let’s take an analogous case. T.S. Eliot’s 1922 poem The Waste Land is a landmark of 20th century literature. Not only is it important from an artistic and critical perspective, but it is also important from a technology perspective — it is perhaps the first major poem to have been composed at the typewriter. What was published was, like a PDF, what the author intended, what he wanted the world to see. That is all the world knew until around 1970, after the poet’s death, when the rest of the story emerged in the form of typewritten draft versions of the poem, with handwritten comments by Ezra Pound.

These drafts provided pages and pages of marked up text that showed the nature and degree of the collaboration between Eliot and Pound far more than had been previously known. This is what researchers want to read. The final publication is great, but the working copy tells us so much more about the process. History is so much more than asking “What?”. It continues by asking “How?” and eventually asking “Why?” — this is where the real insight occurs, going beyond the mere collection of facts and moving on to interpretation. PDF answers the “What?” question admirably. I’m glad we have PDF as a tool for this purpose. But we need to make sure that when archiving documents we allow future researchers to ask and receive answers to the other questions as well.

Flash forward to the technology of today. We are not all writing great poetry, but we are collaborating on authoring and reviewing and commenting on documents. But instead of doing it via handwritten notes, we’re doing it via review & comment features of our word processors. Although the final resulting document may be easily exportable as a PDF document, that is really just a snapshot of what the document looks like today. It loses the record of the collaboration. I don’t think that is what we want to archive, or at least not exclusively. If you archive PDF, then you’ve lost the collaborative record.
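To make that concrete, here is a minimal sketch of the kind of collaborative record an editable format can carry. It assumes a hand-written fragment in the shape of ODF’s `text:tracked-changes` markup; the sample authors and dates, and the `list_changes` helper, are invented for illustration, and a real document would carry more detail than this.

```python
import xml.etree.ElementTree as ET

# Namespace URIs used by ODF for these elements.
NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical sample: two tracked changes by two invented people.
SAMPLE = """\
<text:tracked-changes
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <text:changed-region text:id="ct1">
    <text:insertion>
      <office:change-info>
        <dc:creator>Author A</dc:creator>
        <dc:date>2007-11-20T09:15:00</dc:date>
      </office:change-info>
    </text:insertion>
  </text:changed-region>
  <text:changed-region text:id="ct2">
    <text:deletion>
      <office:change-info>
        <dc:creator>Reviewer B</dc:creator>
        <dc:date>2007-11-21T14:30:00</dc:date>
      </office:change-info>
      <text:p>A paragraph the reviewer struck out.</text:p>
    </text:deletion>
  </text:changed-region>
</text:tracked-changes>
"""


def list_changes(xml_text):
    """Return (kind, creator, date) for each tracked change in the fragment."""
    root = ET.fromstring(xml_text)
    changes = []
    for region in root.iter("{%s}changed-region" % NS["text"]):
        for kind in ("insertion", "deletion", "format-change"):
            node = region.find("text:" + kind, NS)
            if node is None:
                continue
            info = node.find("office:change-info", NS)
            creator = info.findtext("dc:creator", namespaces=NS) if info is not None else None
            date = info.findtext("dc:date", namespaces=NS) if info is not None else None
            changes.append((kind, creator, date))
    return changes


for kind, creator, date in list_changes(SAMPLE):
    print(kind, creator, date)
```

None of this survives a PDF export: the rendered page shows the text as it stands, not who inserted or struck what, or when.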

For another example, take a spreadsheet. You have cells with formulas and these formulas calculate results which are then displayed. When you make a PDF version of the spreadsheet you have a record of what it “looked like”, but this isn’t the same as “what it is”. You cannot look at the formulas in the PDF. They don’t exist. Future researchers may want to check your spreadsheet’s assumptions, the underlying model. There may also be the question of whether your spreadsheet had errors, whether from a mis-copied formula, or from an underlying bug in the application. If you archive exclusively as PDF, no one will ever be able to answer these questions.
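The difference between “looked like” and “is” can be sketched in a few lines. The fragment below is a hand-written row in the shape of ODF spreadsheet markup, where a cell may carry a `table:formula` attribute alongside its displayed text; the sample values and the `cells` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0"

# Hypothetical sample row: two inputs and one computed cell.
SAMPLE_ROW = """\
<table:table-row
    xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0"
    xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
    xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <table:table-cell office:value-type="float" office:value="10">
    <text:p>10</text:p>
  </table:table-cell>
  <table:table-cell office:value-type="float" office:value="20">
    <text:p>20</text:p>
  </table:table-cell>
  <table:table-cell table:formula="of:=SUM([.A1:.B1])"
                    office:value-type="float" office:value="30">
    <text:p>30</text:p>
  </table:table-cell>
</table:table-row>
"""


def cells(xml_text):
    """Return (displayed_text, formula_or_None) for each cell in the row."""
    row = ET.fromstring(xml_text)
    out = []
    for cell in row.iter("{%s}table-cell" % TABLE_NS):
        shown = "".join(cell.itertext()).strip()          # what a printout shows
        formula = cell.get("{%s}formula" % TABLE_NS)      # what the cell actually is
        out.append((shown, formula))
    return out


for shown, formula in cells(SAMPLE_ROW):
    print(shown, "<-", formula)
```

A PDF of this row would keep only the first element of each pair. The third cell would be indistinguishable from a typed-in 30.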

One more example, going back to 1998 and the Clinton/Lewinsky scandal. Kenneth Starr’s report on the case was written in WordPerfect format, distributed to the House of Representatives, whose staff then converted it to HTML form and released it on the web. But due to a glitch in the HTML translation process, footnotes that had been marked as deleted in the WordPerfect file reappeared in the HTML version. So we ended up with an official published Starr Report, as well as an unofficial HTML version which had additional footnotes.

Imagine you are an archivist responsible for the Starr Report. What do you do? Which version(s) do you preserve? Is your job to record the official version, as-published? Or is your job to preserve the record for future researchers? Depending on your job description, this might have a clear-cut answer. But if I were a future historian, I would sure hope that someone someplace had the foresight to archive the original WordPerfect version. It answers more questions than the published version does.

So, to sum it up: What you archive determines what questions you can later ask of a document. If you archive as PDF, you have a high-fidelity version of what the final document looked like. This can answer many, but not all, questions. But for the fullest flexibility in what information you can later extract from the document, you really have no choice but to archive the document in its original authoring format.

An intriguing idea is whether we can have it both ways. Suppose you are in an ODF editor and you have a “Save for archiving…” option that would save your ODF document as normal, but also generate a PDF version of it and store it in the zip archive along with ODF’s XML streams. Then digitally sign the archive along with a time stamp to make it tamper-proof. You would need to define some additional access conventions, but you could end up with a single document that could be loaded in an ODF editor (in read-only mode) to allow examination of the details of spreadsheet formulas, etc., as well as loaded in a PDF reader to show exactly how it was formatted.
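A rough sketch of what such a “Save for archiving…” step might do, using nothing but Python’s standard library. This is not an existing ODF feature: the `Archive/rendering.pdf` entry name and both helper functions are hypothetical, the manifest a real ODF package needs is omitted, and the SHA-256 digest is only a stand-in for a proper digital signature and time stamp.

```python
import hashlib
import io
import zipfile

# Hypothetical location for the embedded rendering; the "access
# conventions" mentioned above would have to define the real one.
PDF_ENTRY = "Archive/rendering.pdf"


def save_for_archiving(content_xml: bytes, pdf_bytes: bytes) -> bytes:
    """Build a zip package holding both the editable XML and a PDF rendering."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        # ODF packages store the mimetype entry first and uncompressed.
        z.writestr("mimetype", "application/vnd.oasis.opendocument.text",
                   compress_type=zipfile.ZIP_STORED)
        z.writestr("content.xml", content_xml,
                   compress_type=zipfile.ZIP_DEFLATED)
        z.writestr(PDF_ENTRY, pdf_bytes,
                   compress_type=zipfile.ZIP_DEFLATED)
        # A real package would also need META-INF/manifest.xml (omitted here).
    return buf.getvalue()


def read_views(package: bytes):
    """Return (editable XML, PDF rendering) from the same package."""
    with zipfile.ZipFile(io.BytesIO(package)) as z:
        return z.read("content.xml"), z.read(PDF_ENTRY)


# Placeholder payloads, not real documents.
package = save_for_archiving(b"<office:document-content/>", b"%PDF-1.4 placeholder")

# Stand-in for signing: a digest that changes if the package is altered.
digest = hashlib.sha256(package).hexdigest()

xml_view, pdf_view = read_views(package)
print(len(package), digest[:16], xml_view[:10], pdf_view[:8])
```

The point of the design is that one file answers both kinds of question: an editor reads `content.xml`, a PDF reader is handed the embedded rendering, and the signature covers the pair.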

Comments

Anonymous says

2007/11/22 at 7:04 am

That won’t work. When you archive originals, the legal requirement is often to keep the *original* original, and you can’t expect that all the files you get handed to archive will have been prepared for you specially. Hence the pdf/a, odf pair would be kept separately.

In fact sometimes more formats need to be kept: the original, the rendered document (maybe pdf, maybe open standards for non-paper content like audio, video), and occasionally an additional searchable format is needed (plain text ocr’d from a tiff; a timestamped transcript of a video; etc).

ODF is an improvement on the past, where we have to archive not just the document but the proprietary application that read it; but it’s not going to solve this problem in the large. PDF/A isn’t a panacea either, but it’s a reasonable alternative to paper.

Rob says

2007/11/22 at 10:39 am

The idea would be that your “original original” would be a compound document that had both ODF and PDF markups.

Also, the interest here could be beyond archiving. It would be a document that could be viewed anywhere (presumption is PDF readers are free and ubiquitous) with perfect fidelity, as well as edited anywhere (assumption that ODF editors are free and available everywhere.)

Uri says

2007/11/22 at 3:58 pm

I can’t really see an argument with two sides here. Using PDF exclusively for archiving sounds to me like a cookbook with only pictures of the finished dishes and no recipes, or an exhibition of photocopies of paintings. It’s substituting a facsimile copy of a thing for the thing itself.
If I need archived financial data, I expect it to be in a format that can be used for financial calculations. Archived computer source-code should be readable to a compiler. If I need a graph from an academic paper, I’d want to extract the full-resolution one from the original document, not the miniaturized version created for printing. A printed version of data is almost always useless compared to the data in its original format and container.
No one is seriously suggesting that graphic design studios start to archive their Photoshop projects as JPEG images. Why should office documents be treated differently?

Lucas says

2007/11/24 at 2:20 am

Not that ODF vs OOXML is in question here, but could one not do the same thing with OOXML?

There are so many issues at work when it comes to “archiving”, I can’t imagine any file format solution that is clearly any better than others. For example, are there any content recovery “mechanisms” inherent in any file formats? Suppose a file is saved/archived incorrectly because of a software bug, how easily can the content be recovered?

How can scanned elements and the text documents be “archived” without keeping two copies? Legal documents with signatures may need to be preserved with hand-written signatures but keep the text of the document in original form so as to remain searchable. One might argue, one can simply scan the signature page while keeping the original text document. However, this introduces: 1) two files, 2) legal question about whether signature corresponds to the legal text.

Thomas Downing says

2007/11/26 at 7:30 am

Although I pay the bills with technical endeavour, my personal bent is historical. This post resonates strongly with me; further, I think it paints a strong picture of one of the strengths of ODF as a standard.

The idea of storing the ‘as published’ form of the document using PDF as a part of the larger ‘document as work’ is a crucial feature to ODF as an archival tool. Great suggestion! I hope it gets added to the ODF track soon.

Anonymous says

2007/11/26 at 10:56 am

For me the idea of storing PDF is just weird. Working with data archiving for process control, I learned the hard way that the only acceptable way to archive is to store the original data. Nobody can guarantee the fidelity of the transformation. You give many examples of data loss, but the most general point holds: there is no way to guarantee that the PDF even looks exactly like what the person had in the document.

Anonymous says

2007/11/26 at 7:08 pm

What do you mean it doesn’t look like the document? Rob’s idea is to have the same program store a PDF of its own output.

In other words, the same program would be embedding the PDF and displaying the data. If those two weren’t the same, it’d be a bug. But I don’t see how it could. After all, all it would have to do is a glorified printscreen of itself and stuff that into a PDF.

Anonymous says

2007/12/05 at 6:14 pm

>> What do you mean it doesn’t look like the document? <<
Exactly what I said. Every now and again some objects are lost during transformation to pdf.

>> Rob’s idea is to have the same program store a PDF of its own output. <<
For one thing, I was talking primarily about James King’s ideas, not Rob’s. Rob’s approach is akin to caching: all data are preserved, and a quick read-only pdf is available. Which is quite acceptable if one can afford the space etc.

>> If those two weren’t the same, it’d be a bug. But I don’t see how it could. <<
What?! You’ve never seen a bug? Lucky you! Will it make anybody happier to learn of data loss because of the bug?

How a bug can happen:

* many things change. Some of them are external to the format, like the font engine. For example, Adobe Acrobat does carry its own. Is there any guarantee that three centuries in the future somebody will know all the bugs and workarounds of closed-source software?

* a printscreen is not a 100% valid copy of the screen. Try to get one of a media player. Anyway, it is quite impractical: plainly too big and not scalable.

* I have seen encoding bugs in programs, when in some specific situation the wrong translation was used. I’ve seen it happen during pdf conversion as well.

* image / graphics transformation might be lossy. It can be because of the underlying graphics model (pdf does not support arcs of circle, does it? but it is irrelevant: it does not support 5th-order splines), or because of color space. Or something else. This is an acceptable limitation for pdf; after all, it is a presentation format. But I can easily imagine a bug in the transformation algorithms, esp. connected to overflows (so it rarely happens).
