I’ve been playing around today with a preview build of the ODF Java API ODFDOM 0.9. One of the capabilities we’re adding is a simple text extraction API.
The idea is to have a very simple API, a single function call in fact, that will allow you to extract the plain text from an ODF document. It strips all formatting and all layout and returns just the text. At first you might think this is rather useless, but further reflection shows that it has myriad uses, including accessibility, search indexing, collaborative filtering, and text analytics in general.
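To make the idea concrete, here is a minimal sketch of what "just return the text" means, using only the JDK rather than ODFDOM. It relies on two facts about ODF: the document is a ZIP package, and the body lives in an entry named content.xml. The class name NaiveOdfText and the method extract are made up for this illustration and are not the ODFDOM API. The approach is deliberately naive: it concatenates all character data and ignores paragraph boundaries.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper for illustration only; not part of ODFDOM.
class NaiveOdfText {

    /** Returns all character data found in content.xml, with the markup stripped. */
    static String extract(InputStream odfPackage) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(odfPackage)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("content.xml")) {
                    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                    factory.setNamespaceAware(true);
                    // The stream is positioned at the start of content.xml here.
                    Document doc = factory.newDocumentBuilder().parse(zip);
                    // getTextContent() concatenates every text node in document order.
                    return doc.getDocumentElement().getTextContent();
                }
            }
        }
        throw new IOException("no content.xml entry found in package");
    }

    public static void main(String[] args) throws Exception {
        // Usage: java NaiveOdfText.java some-document.odt
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extract(in));
        }
    }
}
```

Because everything is concatenated with no separators, adjacent paragraphs run together in the output. That is exactly the kind of detail a real extraction API has to get right.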
Extracting text from ODF is pretty simple, but there are a handful of special cases to watch out for. One example is a single word that has mixed styles, e.g.: ODFDOM. In ODF, the differently styled part of the word sits in its own text:span element inside the paragraph.
We want text extraction to come out as “ODFDOM” not “ODF DOM” with a space.
On the other hand, there are other examples of adjacent elements, like with footnote citations, where we need to insert a space to prevent two adjacent strings from being conflated.
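The two cases above can be sketched with a small DOM walk. This is an illustration under stated assumptions, not how ODFDOM does it: the class name SpanAwareExtractor, the sample markup, and the style name T1 are all invented here, and the rules are limited to three. A text:span adds no separator, a text:note-citation is padded with spaces, and each text:p or text:h ends with a newline.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch for illustration only; not part of ODFDOM.
class SpanAwareExtractor {
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    static String extract(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        StringBuilder out = new StringBuilder();
        walk(doc.getDocumentElement(), out);
        return out.toString();
    }

    static void walk(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return;
        }
        boolean inTextNs = TEXT_NS.equals(node.getNamespaceURI());
        String name = node.getLocalName();
        boolean citation = inTextNs && name.equals("note-citation");
        boolean block = inTextNs && (name.equals("p") || name.equals("h"));

        if (citation) out.append(' ');   // keep "spec" and "1" from becoming "spec1"
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, out);            // text:span falls through here: no separator added
        }
        if (citation) out.append(' ');
        if (block) out.append('\n');     // paragraphs and headings end a line
    }

    public static void main(String[] args) throws Exception {
        String ns = "xmlns:text=\"" + TEXT_NS + "\"";
        // Made-up markup: one word split across two styles.
        System.out.print(extract(
            "<text:p " + ns + ">ODF<text:span text:style-name=\"T1\">DOM</text:span></text:p>"));
    }
}
```

A fuller version would also need to handle elements such as text:s, text:tab and text:line-break, and decide what to do with whitespace between block elements.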
Overall, the build I used looks pretty good, and works the same across text, spreadsheets and presentations.
So I was looking this afternoon for something I could use to demo this new capability. I thought of using Jonathan Feinberg’s excellent Wordle applet (which I wrote about a while back). This applet creates a word cloud based on the word frequency of the text you feed it. As a torture test I decided to feed it the text of ODF 1.2 Committee Draft 05, the version that is currently out for public review.
This is what I got for results.
Part 1 is the annotated schema for ODF. As expected, the key words are those referring to XML markup concepts like “attribute” and “element”.
Part 2 is OpenFormula, the spreadsheet formula expression language. No XML in this part. In fact, this looks more like what I’d expect from an excerpt from a programming language specification, which is pretty much what OpenFormula is.
And Part 3 is the packaging specification.
In the end text extraction is just the data preparation step. The real fun happens after, with the analysis and visualization techniques that can be applied to the text once extracted.
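As a taste of the simplest possible analysis, here is a sketch of the word-frequency count that sits underneath a word cloud. It is not Wordle's actual algorithm, which I have not seen. The class name WordFrequency and its methods are made up for this example, and it skips refinements such as dropping common stop words.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch for illustration only.
class WordFrequency {

    /** Counts case-insensitive occurrences of each run of letters or digits. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns up to n entries, most frequent first; ties are broken alphabetically. */
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        String extracted = "The element and the attribute: the element.";
        for (Map.Entry<String, Integer> e : top(count(extracted), 2)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Feed it the extracted text of a specification and the top of the list should tell you, roughly, what a word cloud would draw largest.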
If anyone is interested in trying out the text extraction module, please let me know. We’re aiming for a release of ODFDOM 0.9 toward the end of August, but I can probably get you a preview if you are interested in testing. And let me know if you have any brilliant ideas of what to do with the extracted text. I’m always looking for good demo material.