Metadata is “data about data”. Meta comes from the Greek μετά, meaning “with” or “after.” I suppose if you wanted to sound grand you could pronounce it hyper-correctly with the stress on the second syllable, met-ah’. I’ve heard some incorrectly pronounce it meet’-ah, perhaps by false analogy with βῆτα = beta. But you never hear anyone pronounce μέγα = mega as mee-guh, do you?
Metadata is not new. It has been around for centuries. In some cases metadata applies to the overall document, while in other cases it applies to only a portion of the content. Examples of the first case include titles of books, footnotes, ISBN numbers, LOC or Dewey Decimal categorizations, keywords, etc. The various forms of scribal marginalia, whether scholia or glosses in the margins of a manuscript, or personal annotations of the owner of a document, are historic examples of the second kind of metadata.
Marginal notes are frequently used today in business forms. A printed form represents, often imperfectly, a snapshot in time of an organization’s view of its own process. Maybe the process was only approximated, or the form was poorly designed, or it quickly became outdated; somehow reality always seems to outgrow the strictures of the form’s blanks and checkboxes. So what do you, as a customer, do? You write notes in the margins or in other places between form fields and hope that there is a human in the loop to read your words.
In any case, of all documents, forms (originally called “formulary documents”) have the most structured representation of data. Enter your social security number into the nine little boxes provided. Enter your date of birth here: month first, then day, then two-digit year. Last name first, first name last. Everything is nice and simple, and provided your reality matches what the form designer envisioned, your data will be easy to consume, whether by another person or, after data entry, by various online processes. Or maybe the form data was entered online originally? Even better.
But what about all the other documents in the world, the ones that are not formally structured as forms? What sense can we make of them? Can you write a program to detect a social security number in a free-form document, or a date, or a zip code? Perhaps with pattern matching, you can find out some simple things. That is the essence of Microsoft’s Smart Tags. (And we had much of this in Lotus Agenda a decade earlier.) But this only works for the most trivial cases. It only takes you so far.
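To make the limits concrete, here is a minimal sketch of the kind of trivial pattern matching a Smart-Tags-style feature relies on. The patterns are invented for illustration, not taken from any actual product:

```python
import re

# Illustrative patterns only; real-world detection is much messier.
PATTERNS = {
    "ssn":  re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # e.g. 123-45-6789
    "zip":  re.compile(r"\b\d{5}(?:-\d{4})?\b"),         # e.g. 02134 or 02134-1009
    "date": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),  # e.g. 7/4/2007
}

def tag_candidates(text):
    """Return (label, matched_text) pairs for anything resembling a known pattern."""
    return [(label, m.group())
            for label, pattern in PATTERNS.items()
            for m in pattern.finditer(text)]

sample = "Born 7/4/1950, SSN 123-45-6789, Boston MA 02134."
print(tag_candidates(sample))
```

Note the limits: any five-digit number “looks like” a zip code to this program, and nothing here can tell you whether a quotation has been verified or a statement contradicts testimony. That is why pattern matching only handles the most trivial cases.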
What if I wanted to mark up an academic paper, a work-in-progress, to indicate which quotations have been verified and which ones remain to be verified? Or what if I want to annotate statements in recorded testimony according to which statements contradict and which corroborate another witness’s statements? This goes far beyond pattern matching. I need a way to encode my knowledge, my view of the subject, my insights, into the document.
We have data in a document: “Words, words, words,” as Hamlet tells Polonius. But for those who work with thoughts, the present constraint of encoding our knowledge as rudimentary linear strings of characters is severe. In general, text is multi-layered and hyper-linked in strange and marvelous ways. Your father’s word processor and word processor file format are inadequate to the task. The concept of a document as a single store of data that lives in a single place, entire, self-contained, and complete is nearing an end. A document is a stream, a thread in space and time, connected to other documents, containing other documents, contained in other documents, in multiple layers of meaning and in multiple dimensions. What we call a traditional document is really just a snapshot in time and space, a projection into a print-ready format of what documents will soon become.
The applications of metadata to business documents are legion. Wherever you have data, you also have the questions of:
- Who entered the data?
- Where did the data come from?
- Who verified the data?
- Who approved the data? Legal? HR? Business?
- Where is this data destined?
- How old is the data? When does it expire?
- How trustworthy is this data?
- Who must we cite as an authority for this data?
- Who owns this data?
- Who has permissions to see this data?
- Who can set policy for this data?
- Who else can edit this data?
- How does this data connect with my business? Is it a part number? The name of a customer or the name of an employee?
And so on.
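Each of those questions is a candidate metadata attribute. As a rough illustration (not any particular standard’s schema; all field names here are invented), here is how a few of them might attach to a single piece of data:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional, Set

@dataclass
class AnnotatedValue:
    """A piece of data plus invented provenance fields; a sketch, not a standard."""
    value: str
    entered_by: str                    # Who entered the data?
    source: str                        # Where did the data come from?
    verified_by: Optional[str] = None  # Who verified the data?
    approved_by: List[str] = field(default_factory=list)  # Legal? HR? Business?
    expires: Optional[date] = None     # When does it expire?
    viewers: Set[str] = field(default_factory=set)        # Who may see it?

part_number = AnnotatedValue(
    value="X-1043",
    entered_by="jsmith",
    source="2006 parts catalog",
    verified_by="mdoe",
    approved_by=["Engineering"],
    expires=date(2008, 12, 31),
    viewers={"engineering", "purchasing"},
)
print(part_number)
```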
Open Document Format (ODF) 1.2 takes a step into the world of structured metadata with an RDF metadata framework. If that sounds Greek to you, then let’s say that a metadata framework enables application developers to create applications that do the above things. A framework doesn’t tell you how you must say “This image is provided under a Creative Commons Share-Alike license,” but provides a way for application developers to express concepts like “licensed-under” and “Creative Commons Share-Alike,” as well as a formal structure for expressing subject-predicate-object relationships, where the subject can be any of around 50 ODF document elements, such as paragraphs, footnotes, images, tables, etc.
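To make the subject-predicate-object idea concrete, here is a hypothetical sketch using the Python rdflib library. The document URI and element fragment are invented for illustration, and this is not the literal ODF 1.2 serialization, just the shape of an RDF triple:

```python
from rdflib import Graph, Namespace, URIRef

# The Creative Commons REL vocabulary; the document URI below is invented.
CC = Namespace("http://creativecommons.org/ns#")

g = Graph()
image = URIRef("http://example.org/report.odt#image1")              # subject
license = URIRef("http://creativecommons.org/licenses/by-sa/3.0/")  # object

# The triple says: "This image is licensed under Creative Commons Share-Alike."
g.add((image, CC.license, license))  # predicate: cc:license

print(g.serialize(format="turtle"))
```

The framework’s job is only to store and exchange such triples; what a predicate like “licensed-under” actually means is left to whatever vocabulary the application chooses.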
To read more, here are some places to start:
For general background on the “semantic web”, a good intro is the 2001 Scientific American article “The Semantic Web” by Tim Berners-Lee et al.
For a bit more on RDF, the Wikipedia page is pretty good.
Svante Schubert at Sun, also on the ODF Metadata Subcommittee, has a recent blog post worth reading: “New Extensible Metadata Support With ODF 1.2.”
Bruce D’Arcus, of the Metadata Subcommittee and co-lead of the OpenOffice.org Bibliographic Project, also contributes his thoughts on the new ODF 1.2 metadata.
If you want to delve into the particulars of ODF 1.2’s new metadata support, you can read the latest draft of the proposed changes to the specification [ODF] and the examples [ODF] document. Of course, any feedback on ODF drafts and published standards is welcome on the ODF TC’s comment mailing list.
For a gentle introduction to metadata, ODF, where we are coming from and where we are going, I offer this interview [MP3] with Patrick Durusau, Chair of the ODF Metadata Subcommittee, which I recorded back in July.
I’m a Brit, and I’ve never heard anybody pronounce it “meet-ah”. I pronounce it “met-uh” and so does everybody else as far as I’m aware.
Actually, I think it’s the Germans (say, on the TC) that tend to pronounce it “meet-ah.”
It’s possible to push “semantics” much further than just metadata.
There’s the question of what conclusions it should be possible to reach from a collection of facts and rules of inference.
Then there’s the question of what those conclusions actually mean in everyday English.
There’s a kind of Wiki for executable English content that combines data, inference, and English meanings.
It’s at http://www.reengineeringllc.com , and shared use is free.
Great post. In the bioinformatics world I come from, using textual analysis to try and identify relationships is a pretty big deal with approaches ranging from hiring armies of manual curators to NLP. Standardized formats can only make our life easier going forward, especially if we can combine open standards with the kinds of vocabularies that biologists have developed.
On the meta part... maybe it was someone from Australia or New Zealand. I can see someone with an Aussie accent sounding like they said meet’-ah. I haven’t heard any Brits pronounce it that way either.
I’m sorry to see that the pronunciation of the word seems to have taken precedence over the understanding of it. As for the Brits, I love the way they pronounce words, especially aluminium (I have taken up that pronunciation myself). There I go again getting off track.
I am totally fascinated by this subject. I have worked in a fair number of small businesses and larger institutions. I have seen inconsistencies, poor design (make that extremely poor design), and complete ignorance of metadata create many times the work it took to enter the data itself.
It is sad that the understanding of this subject has not filtered down to people who could really use it. Small businesses and small to medium institutions like community colleges spend a lot of time and money trying to manage their data using 19th-century concepts on 21st-century computers.
I remember reading somewhere that the accent in Classical Greek was a musical, rather than a stress accent. So presumably the alpha would have a rising tone…
You could give that a try. I’ve read that a rising fifth was the interval suggested by ancient authorities. That would sure get some looks at an XML conference!
Classical Greek had long and short vowels, heavy and light syllables, a pitch accent as well as (presumably) a stress pattern. How exactly this all worked together with pitch and stress patterns at the clause and sentence level is a matter of some speculation. W. Sidney Allen’s “Vox Graeca: The Pronunciation of Classical Greek” is a good survey of this broad field.
The essential thing to remember is that classical Greek was dead in Europe at the dawn of the Renaissance. Outside of a few scholars in the shrinking Byzantine Empire, no one could read or pronounce it. So when the study of classical Greek was revived, it took on a Byzantine accent, which since the 4th century was a stress accent, not a pitch accent. It also had a much smaller set of vowel sounds, much like modern Greek where everything ends up sounding like “ee”.
What we end up with today for Classical Greek pronunciation is a result of that reawakening, plus centuries of attempts at reforming the pronunciation of Greek (and Latin) by various scholars over the years, starting with Erasmus. But we still have that Byzantine stress accent.