Metadata is “data about data”. Meta from the Greek, μετά, meaning with or after. I suppose if you wanted to sound grand you could pronounce it hyper-correctly with the stress on the second syllable, met-ah’. I’ve heard some incorrectly pronounce it meet’-ah, perhaps a false analogy with βῆτα = beta. But you never hear anyone pronounce μέγα = mega as mee-guh, do you?
Metadata is not new. It has been around for centuries. In some cases metadata applies to the overall document, while in other cases it applies to only a portion of the content. Examples of the first case include titles of books, footnotes, ISBN numbers, LOC or Dewey Decimal categorizations, keywords, etc. The various forms of scribal marginalia, whether scholia or glosses in the margins of a manuscript, or personal annotations of the owner of a document, are historic examples of the second kind of metadata.
Marginal notes are frequently used today in business forms. A printed form represents, often imperfectly, a snapshot in time of an organization’s view of its own process. But maybe the process was approximated or the form was imperfectly designed. Maybe it quickly became outdated. Somehow reality seems to outgrow the strictures of the form’s blanks and checkboxes. So what do you, as a customer, do? You write notes in the margins or other places between form fields and hope that there is a human in the loop to read your words.
In any case, of all documents, forms (originally called “formulary documents”) have the most structured representation of data. Enter your social security number into the nine little boxes provided. Enter your date of birth here, month first, then day, then two-digit year. Last name first, first name last. Everything is nice and simple, and provided your reality matches that which the form designer envisioned, your data will be easy to consume, whether by another person or, after data entry, by various online processes. Or maybe the form data was entered online originally? Even better.
But what about all the other documents in the world, the ones that are not formally structured as forms? What sense can we make of them? Can you write a program to detect a social security number in a free-form document, or a date, or a zip code? Perhaps with pattern matching, you can find out some simple things. That is the essence of Microsoft’s Smart Tags. (And we had much of this in Lotus Agenda a decade earlier.) But this only works for the most trivial cases. It only takes you so far.
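To make the pattern-matching idea concrete, here is a minimal sketch in Python. The patterns and the `find_candidates` helper are my own illustrative assumptions, not how Smart Tags or Lotus Agenda actually worked, and they only cover common US formats.

```python
import re

# Illustrative patterns only (assumptions, not taken from any real product).
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),                # 123-45-6789
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2}(?:\d{2})?\b"),  # 7/4/76 or 7/4/1976
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),                 # 02134 or 02134-1000
}

def find_candidates(text):
    """Return (label, matched_text, offset) tuples in document order."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group(), match.start()))
    return sorted(hits, key=lambda hit: hit[2])

sample = "Born 7/4/76, SSN 123-45-6789, ZIP 02134. Order 90210 shipped."
for label, value, offset in find_candidates(sample):
    print(label, value, offset)
```

Note that the order number 90210 gets tagged as a zip code. A pattern knows what the text looks like, not what it means, which is exactly why this approach only takes you so far.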
What if I want to mark up an academic paper, a work-in-progress, to indicate which quotations have been verified and which ones remain to be verified? Or what if I want to annotate statements in recorded testimony according to which statements contradict and which corroborate another witness’s statements? This goes far beyond pattern matching. I need a way to encode my knowledge, my view of the subject, my insights, into the document.
We have data in a document: “Words, words, words,” as Hamlet tells Polonius. But for those who work with thoughts, the present constraint of encoding our knowledge as rudimentary linear strings of characters is severe. In general, text is multi-layered and hyper-linked in strange and marvelous ways. Your father’s word processor and word processor file format are inadequate to the task. The concept of a document as being a single store of data that lives in a single place, entire, self-contained and complete, is nearing an end. A document is a stream, a thread in space and time, connected to other documents, containing other documents, contained in other documents, in multiple layers of meaning and in multiple dimensions. What we call a traditional document is really just a snapshot in time and space, a projection into a print-ready format of what documents will soon become.
The applications of metadata to business documents are legion. Wherever you have data, you also have the questions of:
- Who entered the data?
- Where did the data come from?
- Who verified the data?
- Who approved the data? Legal? HR? Business?
- Where is this data destined?
- How old is the data? When does it expire?
- How trustworthy is this data?
- Who must we cite as an authority for this data?
- Who owns this data?
- Who has permissions to see this data?
- Who can set policy for this data?
- Who else can edit this data?
- How does this data connect with my business? Is it a part number? The name of a customer or the name of an employee?
And so on.
Open Document Format (ODF) 1.2 takes a step into the world of structured metadata with an RDF metadata framework. If that sounds Greek to you, then let’s say that a metadata framework enables application developers to create applications that do the above things. A framework doesn’t tell you how you must say “This image is provided under a Creative Commons Share-Alike license” but provides a way for application developers to express concepts like “licensed-under” and “Creative Commons Share-Alike”, as well as a formal structure for expressing subject-predicate-object relationships, where the subject can be any of around 50 ODF document elements, such as paragraphs, footnotes, images, tables, etc.
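If the subject-predicate-object idea still feels abstract, a rough sketch in plain Python may help. Everything named here is hypothetical: the element identifiers (`image-1`, `paragraph-7`), the predicates, and the `about` helper are assumptions for illustration, not the ODF 1.2 syntax or any real RDF vocabulary.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str    # the document element being described
    predicate: str  # the relationship
    obj: str        # the value or thing it relates to

# Hypothetical statements about parts of a document.
triples = [
    Triple("image-1", "licensed-under", "Creative Commons Share-Alike"),
    Triple("paragraph-7", "verified-by", "Legal"),
    Triple("paragraph-7", "entered-by", "jsmith"),
]

def about(subject, store):
    """Collect every predicate/object pair recorded for one subject."""
    return {t.predicate: t.obj for t in store if t.subject == subject}

print(about("paragraph-7", triples))
```

The point is that the statements live alongside the content and point at pieces of it, so an application can answer questions like “who verified this paragraph?” without guessing from the text itself.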
To read more, here are some places to start:
For general background on the “semantic web”, a good intro is the 2001 Scientific American article “The Semantic Web” by Tim Berners-Lee, et al.
For a bit more on RDF, the Wikipedia page is pretty good.
Svante Schubert at Sun, also on the ODF Metadata Subcommittee, has a recent blog post worth reading: “New Extensible Metadata Support With ODF 1.2”.
If you want to delve into the particulars of ODF 1.2’s new metadata support, you can read the latest draft of the proposed changes to the specification [ODF] and the examples [ODF] document. Of course, any feedback on ODF drafts and published standards is welcome on the ODF TC’s comment mailing list.
For a gentle introduction to metadata, ODF, where we are coming from and where we are going, I offer this interview [MP3] with Patrick Durusau, Chair of the ODF Metadata Subcommittee, which I recorded back in July.