Archives for July 2010
I’ve been playing around today with a preview build of the ODF Java API ODFDOM 0.9. One of the capabilities we’re adding is a simple text extraction API.
The idea is to have a very simple API, a single function call in fact, that will allow you to extract the plain text from an ODF document. So strip all formatting, all layout and just return the text. At first you might think this is rather useless, but further reflection shows that it has myriad uses, including accessibility, search indexing, collaborative filtering, and text analytics in general.
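Extracting text from ODF is pretty simple, but there are a handful of special cases to watch out for. One example is a single word that has mixed styles, e.g., ODFDOM with part of the word styled differently from the rest. In ODF markup such a word is split across separately styled text runs.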
We want text extraction to come out as “ODFDOM” not “ODF DOM” with a space.
On the other hand, there are other examples of adjacent elements, like with footnote citations, where we need to insert a space to prevent two adjacent strings from being conflated.
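The two special cases above can be sketched in a few lines of Python. This is not the ODFDOM API, just a minimal illustration using the standard library. The sample paragraph is made up, and the rule of padding footnote elements with spaces is an assumption about how one might handle notes, not a description of what ODFDOM does.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical paragraph: one word split across two styled spans,
# followed by a footnote that sits directly against the body text.
SAMPLE = (
    f'<text:p xmlns:text="{TEXT_NS}">'
    '<text:span text:style-name="T1">ODF</text:span>'
    '<text:span text:style-name="T2">DOM</text:span>'
    ' is a Java API'
    '<text:note text:note-class="footnote">'
    '<text:note-citation>1</text:note-citation>'
    '<text:note-body><text:p>See the project site.</text:p></text:note-body>'
    '</text:note>'
    ' for ODF.'
    '</text:p>'
)

# Assumption: these elements get a space on either side so their text
# is not run together with the neighboring text.
SPACED_TAGS = {
    f"{{{TEXT_NS}}}note",
    f"{{{TEXT_NS}}}note-citation",
    f"{{{TEXT_NS}}}note-body",
}


def extract_text(element):
    """Return the plain text of an element, dropping all markup."""
    parts = []

    def walk(node):
        spaced = node.tag in SPACED_TAGS
        if spaced:
            parts.append(" ")
        if node.text:
            parts.append(node.text)
        for child in node:
            walk(child)
            if child.tail:
                parts.append(child.tail)
        if spaced:
            parts.append(" ")

    walk(element)
    # Spans are concatenated as-is; runs of whitespace collapse to one space.
    return " ".join("".join(parts).split())


print(extract_text(ET.fromstring(SAMPLE)))
# ODFDOM is a Java API 1 See the project site. for ODF.
```

The adjacent spans come out as "ODFDOM" because nothing is inserted between them, while the footnote citation and body are kept apart from the surrounding sentence.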
Overall, the build I used looks pretty good, and works the same across text, spreadsheets and presentations.
So I was looking this afternoon for something I could use to demo this new capability. I thought of using Jonathan Feinberg’s excellent Wordle applet (which I wrote about a while back). This applet creates a word cloud, based on word frequency of text you feed it. As a torture test I decided to feed it the text of ODF 1.2 Committee Draft 05, the version that is currently out for public review.
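The word-frequency step that feeds a word cloud is easy to approximate once you have plain text. The sketch below is not how Wordle works internally (I don't know its algorithm); the tokenizing rule and the `min_length` filter are my own simplifications, and the sample sentence is invented.

```python
import re
from collections import Counter


def word_frequencies(text, top=5, min_length=4):
    """Count lowercase alphabetic words of at least min_length letters."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if len(word) >= min_length)
    return counts.most_common(top)


sample = (
    "The attribute of an element. Each element has an attribute; "
    "the element is defined."
)
print(word_frequencies(sample, top=2))
# [('element', 3), ('attribute', 2)]
```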
This is what I got for results.
Part 1 is the annotated schema for ODF. As expected, the key words are those referring to XML markup concepts like “attribute” and “element”.
Part 2 is OpenFormula, the spreadsheet formula expression language. No XML in this part. In fact, this looks more like what I’d expect from an excerpt from a programming language specification, which is pretty much what OpenFormula is.
And Part 3 is the packaging specification.
In the end text extraction is just the data preparation step. The real fun happens after, with the analysis and visualization techniques that can be applied to the text once extracted.
If anyone is interested in trying out the text extraction module, please let me know. We’re aiming for a release of ODFDOM 0.9 toward the end of August, but I can probably get you a preview, if you are interested in testing. And let me know if you have any brilliant ideas of what to do with the extracted text. I’m always looking for good demo material.
The language game
Microsoft’s talking points go something like this (summarized in my words):
If you adopt ODF instead of OOXML then you “restrict choice”. Why would you want to do that? You’re in favor of openness and competition, right? So naturally, you should favor choice.
You can see hundreds of variations on this theme, in Microsoft press releases, whitepapers, in press articles and blogged by astroturfers, by searching Google for “ODF restrict choice”.
This argument is quite effective, since it is plausible at first glance, and takes more than 15 seconds to refute. But the argument in the end fails by taking a very superficial view of “choice”, relying merely on the positive allure of its name, essentially using it as a talisman. But “choice” is more than just a pretty word. It means something. And if we dig a little deeper, at what the value of choice really is, the Microsoft argument falls apart.
So let’s make an attempt to show how one can be in favor of choice, but also be in favor of eliminating choice. Let’s resolve the paradox. Personally I think this argument is too long, but maybe it will prompt someone to formulate it in a briefer form.
Choice — the option to act
Choice is the option to act on one or more possibilities. Choice is the freedom to take one path or another. Choice is the ability to open one door or another. And what is the value of choice? It depends on the value of the underlying possibilities.
In some cases, the value of choice can be quantified quite precisely.
For example, imagine I have three boxes, one containing nothing, one containing $5 and another containing $10. If you have no choice, and are given one box at random, then you will get $5 on average. And if given the choice of which box to pick, also without knowing the contents, you will also get $5 on average.
Similarly, if each box contained exactly $5 and you could see inside, the value of choice would still be zero.
But if the three boxes contained nothing, $5 and $10 and you could see inside, then the value of having a choice is clear. You would naturally pick the $10 box. So having a choice is worth an additional $5.
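The arithmetic behind the three box scenarios is small enough to write down. This is just a sketch of the reasoning above; the function name and its `can_see_inside` flag are mine, and the baseline is taken to be the average of the boxes, i.e. what you get from a random or blind pick.

```python
def value_of_choice(boxes, can_see_inside):
    """Extra amount gained by choosing, compared with being handed a random box."""
    baseline = sum(boxes) / len(boxes)  # random box, or a blind pick
    chosen = max(boxes) if can_see_inside else baseline
    return chosen - baseline


print(value_of_choice([0, 5, 10], can_see_inside=False))  # 0.0
print(value_of_choice([5, 5, 5], can_see_inside=True))    # 0.0
print(value_of_choice([0, 5, 10], can_see_inside=True))   # 5.0
```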
So we see that for choice to have value, you must have two things:
- A way to estimate the value of each outcome.
- A preference for one outcome over another.
In some cases this can be done with precision. In other cases it can only be estimated or modeled. For example, trading stock options is essentially the selling and buying of the right to exercise the choice (option) to buy or sell a security at a given price within a given time period. The value of this choice can be modeled by sophisticated mathematical models like the Black-Scholes option pricing formula.
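For the curious, the Black-Scholes price of a European call option on a stock that pays no dividends fits in a few lines. This is the textbook closed-form formula, shown only to illustrate that the value of a choice can be modeled; the parameter names and the sample inputs are my own.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def black_scholes_call(spot, strike, rate, volatility, years):
    """European call price, no dividends; rate and volatility are annualized."""
    d1 = (log(spot / strike) + (rate + 0.5 * volatility ** 2) * years) / (
        volatility * sqrt(years)
    )
    d2 = d1 - volatility * sqrt(years)
    return spot * norm_cdf(d1) - strike * exp(-rate * years) * norm_cdf(d2)


# Hypothetical inputs: stock at 100, strike 100, 5% rate, 20% volatility, 1 year.
print(round(black_scholes_call(100, 100, 0.05, 0.2, 1.0), 2))  # about 10.45
```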
So going back to the boxes again. Now imagine there are two boxes: one has $10 in it, and the other has a note in it that requires that you pay me $10. You can see the contents of each box. Which one do you choose? It should be obvious: you pick the one with $10 in it.
But what if I say you are not limited to picking only one box. You can pick either box, or both boxes if you wish. You have absolute freedom to choose A, B or A+B. What do you do? Of course, you still pick the box with $10 in it.
But doesn’t that eliminate choice? Yes, of course it does. But the value of choice was only derived from the value of the underlying outcomes. By choosing, I’ve derived the full value of having a choice. If one choice is clearly more favorable than the others (it “dominates” the others) then the alternatives should be discarded.
Resolving the paradox of choice
Given the choice of A, B or A+B, each is a distinct, mutually exclusive choice. They are the three boxes with three outcomes. Each one has a value that could be estimated. When someone portrays option A+B as preserving choice, they are forgetting that this is a choice that also restricts choice, since it eliminates A or B in their exclusive, pure forms from consideration. Any choice, even the choice of A+B, restricts choice. If you choose A+B then you have not chosen A alone or B alone. You have the value of the outcome A+B, but do not have the possibly greater benefits of picking choice A alone or choice B alone.
Clear? I think this should be obvious, but I’ve seen these concepts cause much confusion.
It is also important to realize that the combination A+B may have conjoint effects, which may be neutral, synergistic or antagonistic. In other words the value of A+B is not necessarily the same as the value of A plus the value of B.
In some cases, certainly, the value of the A+B choice is the same as the sum of the individual values. For example, the boxes with money and notes are all simply additive, with no conjoint effects.
But in other cases, the value of A+B has synergistic effects. For example, the choice of diet+exercise is more salubrious than either one chosen in isolation.
And in some cases the value of A+B is less than the value of either one in isolation, as anyone who has bought both a cat and a dog knows. These choices are antagonistic.
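One way to make the three cases concrete is to treat the value of A+B as the two individual values plus an interaction term: zero for additive, positive for synergistic, negative for antagonistic. The sketch below does that and picks the best of the three mutually exclusive options. Apart from the $10 box and the pay-me-$10 note, all the numbers are invented purely for illustration.

```python
def best_option(value_a, value_b, interaction=0.0):
    """Value the options A, B and A+B, and return the best one with all values."""
    options = {
        "A": value_a,
        "B": value_b,
        "A+B": value_a + value_b + interaction,
    }
    best = max(options, key=options.get)
    return best, options


# Additive: the $10 box (A) and the "pay me $10" note (B).
print(best_option(10, -10))                # ('A', {'A': 10, 'B': -10, 'A+B': 0})

# Synergistic (made-up scores): diet + exercise beats either one alone.
print(best_option(3, 2, interaction=4)[0])   # A+B

# Antagonistic (made-up scores): cat + dog is worth less than either alone.
print(best_option(5, 4, interaction=-8)[0])  # A
```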
So back to the file format debate. The choice here is between adopting ODF, OOXML, or ODF+OOXML. These three choices are mutually exclusive. They are the three boxes, with three different outcomes. Each outcome has a value that could be estimated. But we should not fall into the trap of thinking that an ODF+OOXML decision is preserving choice. Far from it. By making that choice, one eliminates the possibility of having only ODF, or of having only OOXML, with the resulting values that those choices would bring. Choosing both formats eliminates outcomes and restricts choice just as much as choosing only ODF eliminates outcomes.
You cannot avoid eliminating the outcomes you do not choose. There are benefits that would come from having only a single standard, and there are costs and complications from maintaining multiple standards. These must all be considered.
A major milestone was reached for the OASIS Open Document Format (ODF) TC last week. The latest Committee Draft of ODF 1.2 (CD 05) was sent out for a 60-day public review.
As you may recall, ODF 1.2 is a single standard in three parts:
- Part 1 specifies the core schema, and was sent out for public review in January.
- Part 2 is OpenFormula (spreadsheet formulas)
- Part 3 defines the packaging model of ODF, and went out for public review back in November
The current public review is the first complete review, presenting all three parts of ODF 1.2, including the new Part 2, OpenFormula, which is our spreadsheet formula language.
We will accept public comments (and that includes comments from technical experts in ISO/IEC JTC1/SC34) through September 6th. Comments should be submitted via the TC’s public comment list, which you can join via these instructions. You can monitor incoming comments also by subscribing to the comment list, by searching the archives or unofficially via the ODFJIRA Twitter feed.
The OASIS ODF TC will track and review all received comments and produce a report indicating how we have resolved each comment. If we decide to make substantive changes to the specification based on comments received then we would approve such changes in a Committee Draft (CD 06) and send that out for a 15-day public review of the changes made. I expect this will occur. Then, the TC may vote to approve the public review draft as a Committee Specification. Then we can have a ballot of the OASIS membership to approve it as an OASIS Standard. And finally (after some additional administrative paperwork) we can submit ODF 1.2 to ISO/IEC JTC1 according to their PAS process.
I think we can finish up the above remaining formal steps in the 4th quarter.
As I mentioned, the biggest difference in CD 05 over previous Open Document Format public review drafts is the inclusion of the OpenFormula specification. If you are interested in contributing comments during the public review, I’d especially encourage you to review this document. The other parts have already gone through one or more cycles of public review. This part has not.
An outline of the contents of OpenFormula is:
- 1 Introduction
- 2 Expressions and Evaluators
- 3 Formula Processing Model
- 4 Types
- 5 Expression Syntax
- 6 Standard Operators and Functions
- 6.4 Standard Operators
- 6.5 Matrix Functions
- 6.6 Bit operation functions
- 6.7 Byte-position text functions
- 6.8 Complex Number Functions
- 6.9 Database Functions
- 6.10 Date and Time Functions
- 6.11 External Access Functions
- 6.12 Financial Functions
- 6.13 Information Functions
- 6.14 Lookup Functions
- 6.15 Logical Functions
- 6.16 Mathematical Functions
- 6.17 Rounding Functions
- 6.18 Statistical Functions
- 6.19 Number Representation Conversion Functions
- 6.20 Text Functions
- 7 Other Capabilities
- 8 Non-portable Features
The ideal reviewer for OpenFormula would have expertise either in formal descriptions of computer languages, e.g., know EBNF, type systems, numeric computing models, etc., or knowledge of one or more of the domains of knowledge we cover via the spreadsheet functions. Honestly, I think we have enough “language lawyers” on the TC already, so I’m not so worried about that part. And we did have direct participation by experts in some functional domains. For example, the statistical and mathematical functions have been given a good scrub already by “Dr. G.”
However, the financial functions, these I think could use a thorough review by a subject matter expert, ideally an expert in financial accounting standards, actuarial sciences, or similar. If anyone knows such an expert who is willing to contribute comments on approximately 30 pages of function definitions related to loan amortization, bond coupon and yield, rates of return, day count conventions, etc., please let me know via email.
Note finally that although OpenFormula is part of the ODF 1.2 specification, it was designed to be a portable, embeddable expression language syntax. It is a natural fit for a spreadsheet application, but it could be used wherever you need to encode a calculable expression with a rich library of domain-specific functions. It was designed so it could be used in other contexts.
I think it would be a fun project to implement OpenFormula as a standalone library, in Java or Python, where you feed it an expression, along with an “address resolver” object to resolve names (e.g., cell references) to values, and then have it calculate the output value. This could be the first step toward some interesting things. For example, I give you an ODF spreadsheet and you generate a web app that executes the same model as my spreadsheet. (Many years ago, in the 1980s, there was a “spreadsheet compiler” that did something similar with 1-2-3 files.) Or I give you a spreadsheet and indicate some variable input cells and you execute thousands of variations on it via Monte Carlo analysis. Or I give you value ranges for the input cells, and you calculate the sheet in variations via interval arithmetic. This may be interesting for sensitivity analysis, risk analysis, analysis of propagation of errors, etc.
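To give a feel for the shape of such a library, here is a toy sketch in Python. To be clear about the assumptions: it borrows Python's own expression grammar as a stand-in, so it is not OpenFormula syntax (no leading "=", no ranges like A1:B2, no "^" operator), the three functions are a tiny made-up subset, and the names `evaluate`, `resolve` and `monte_carlo` are mine. The point is only the separation between the expression and the resolver that supplies cell values.

```python
import ast
import operator
import random

BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
FUNCTIONS = {
    "SUM": lambda *args: sum(args),
    "MAX": lambda *args: max(args),
    "MIN": lambda *args: min(args),
}


def evaluate(expression, resolve):
    """Evaluate an expression, calling resolve(name) for each cell reference."""

    def visit(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.Name):
            return resolve(node.id)
        if isinstance(node, ast.BinOp) and type(node.op) in BINARY_OPS:
            return BINARY_OPS[type(node.op)](visit(node.left), visit(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -visit(node.operand)
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in FUNCTIONS
            and not node.keywords
        ):
            return FUNCTIONS[node.func.id](*(visit(arg) for arg in node.args))
        raise ValueError(f"unsupported syntax: {ast.dump(node)}")

    return visit(ast.parse(expression, mode="eval").body)


def monte_carlo(expression, fixed, sample_inputs, runs=1000, seed=0):
    """Re-evaluate the expression with randomly drawn values for some cells."""
    rng = random.Random(seed)
    results = []
    for _ in range(runs):
        cells = dict(fixed)
        cells.update({name: draw(rng) for name, draw in sample_inputs.items()})
        results.append(evaluate(expression, cells.__getitem__))
    return results


cells = {"A1": 2, "A2": 3, "A3": 4}
print(evaluate("SUM(A1, A2) * A3", cells.__getitem__))  # 20

# Hypothetical model: principal in A1, interest rate in B1 drawn from 0% to 10%.
outcomes = monte_carlo(
    "A1 * (1 + B1)", {"A1": 1000}, {"B1": lambda rng: rng.uniform(0.0, 0.1)}
)
print(min(outcomes), max(outcomes))
```

Because the evaluator only ever sees the resolver, the same expression can be fed from a dictionary, a parsed spreadsheet, or a random sampler without changing the evaluation code.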
Think: “Pluggable spreadsheet evaluation engines, all understanding a common formula expression language.”
Once you have a standardized model for a spreadsheet and that model is independent from the calculation engine, then you have the ability to plug in different calculation engines that conform to the standard, and these various calculation engines can have various strategies. This is a very powerful capability, made possible via standardization.
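A very rough illustration of the pluggable-engine idea, again in Python and again under stated assumptions: the "model" is a made-up nested-tuple expression tree rather than anything from the ODF specification, and the two engine classes and the `calculate` function are hypothetical names. One engine works on plain numbers, the other on (low, high) intervals.

```python
class PointEngine:
    """Ordinary arithmetic on single numbers."""

    def add(self, a, b):
        return a + b

    def mul(self, a, b):
        return a * b


class IntervalEngine:
    """Interval arithmetic on (low, high) pairs."""

    def add(self, a, b):
        return (a[0] + b[0], a[1] + b[1])

    def mul(self, a, b):
        products = [x * y for x in a for y in b]
        return (min(products), max(products))


def calculate(model, inputs, engine):
    """Walk the model tree, delegating every operation to the engine."""
    if isinstance(model, str):  # a named input cell
        return inputs[model]
    op, left, right = model
    left_value = calculate(left, inputs, engine)
    right_value = calculate(right, inputs, engine)
    if op == "+":
        return engine.add(left_value, right_value)
    if op == "*":
        return engine.mul(left_value, right_value)
    raise ValueError(f"unsupported operator: {op}")


# Hypothetical model: price * quantity + shipping
MODEL = ("+", ("*", "price", "quantity"), "shipping")

print(calculate(MODEL, {"price": 3, "quantity": 4, "shipping": 5}, PointEngine()))
# 17
print(calculate(
    MODEL,
    {"price": (2, 4), "quantity": (4, 4), "shipping": (5, 6)},
    IntervalEngine(),
))
# (13, 22)
```

The model never changes; only the engine and the kind of input values do.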