Archives for March 2010

Document Freedom: How to know when you have it

2010/03/31 By Rob 6 Comments

Today is Document Freedom Day. In the five years since Open Document Format (ODF) first was approved in OASIS we have certainly made progress, but there is still work remaining to be done. How will we know when we have arrived? At what point can we declare victory and say “Free at last”? I think that when we can agree that all of the following statements are true, then at that point we have achieved the substantial benefits of document freedom.

I can create documents on the platform of my choice, using the software of my choice.
I can migrate to another editing environment (application or operating system) without losing high-fidelity access to my existing documents.
I can send my documents to anyone and know that they can read them without requiring the purchase of new software.
I can receive documents from anyone and know that I can read them without requiring the purchase of new software.
I have confidence that the documents I create today can be read and understood 10, 25 or 50 years from now.
Programmers can write and distribute software that reads and writes documents without paying royalties to anyone.
I have confidence that the document format standard is being evolved in a way that guarantees the above rights equally for all users and vendors.

We’ve made substantial progress on these fronts, but I don’t think we’re there yet. We should celebrate that progress, while at the same time committing ourselves to the work that remains. For example, we still need to improve interoperability. In a few weeks we will have our next ODF Plugfest, in Granada, where ODF implementors will gather for the 3rd time to work together to improve interoperability among their implementations.

Weekly Links #4

2010/03/27 By Rob Leave a Comment

ODFDOM for Java: Simplifying programmatic control of documents and their data, Part 1

“This article is the first in a three-part series and introduces the new Open Document Format (ODF) Document Object Model (DOM) for Java™ along with the ODF Toolkit Union open source community, whose mission is to simplify the programmatic manipulation of documents and their data.”

tags: ODF

ODFDOM 0.8 – The new Release of the OpenDocument Java Library – GullFOSS
“The new version of ODFDOM – the OpenDocument Java library – has been released! Most people might know about ODFDOM, for the others: ODFDOM is an Apache 2 licensed Java library to easily create, access and manipulate the ODF documents.

In biggest feature aside of a more than a dozen patches for ODFDOM 0.8 is the complete revised new ODF table API.
The table is the first feature introducing our new layered design to ease ODF usage.”

tags: ODF, ODFDOM

CeBIT 2010: Recipe for Office Migration
“The city council’s conclusion: ‘We would do it again!’ Schiessl: ‘The office product is a key to independence. Once you’ve solved the office issue, you’re independent of any operating system.’”

tags: ODF

Posted from Diigo. The rest of my favorite links are here.

Public review of “The State of ODF Interoperability”

2010/03/14 By Rob 3 Comments

The OASIS ODF Interoperability and Conformance TC has as a primary goal to:

Initially and periodically thereafter, to review the current state of conformance and interoperability among a number of ODF implementations; To produce reports on overall trends in conformance and interoperability that note areas of accomplishment as well as areas needing improvement, and to recommend prioritized activities for advancing the state of conformance and interoperability among ODF implementations in general without identifying or commenting on particular implementations;

The initial “State of ODF Interoperability” report has now gone out for public review. It is a baseline report, surveying the context of document interoperability, the sources of interoperability problems, as well as the ways in which these problems are being addressed. Although it explicitly deals with ODF interoperability, much of the report is equally relevant to any other office document format, XML-based or binary.

If you want to participate in the public review, you can find links to the draft, as well as instructions for submitting comments, in the OASIS announcement of the review.

The New & Improved Microsoft Shuffle

2010/03/06 By Rob 27 Comments

A quick update on my post from last week on the “Microsoft Shuffle”, where I looked at how Microsoft’s “random” browser ballot was far from random.

First, I’d like to thank those who commented on that post, or sent me notes, offering additional analysis. I think we nailed this one. Within a few days of my report Microsoft updated their JavaScript on the browserchoice.eu website, fixing the error. But more on that in a minute.

Some random observations

Several commenters mentioned that if you search Google for “javascript random array sort” the first link returned will be a JavaScript tutorial that has the same offending code as Microsoft’s algorithm. This is not surprising. As I said in my original post, this is a well-known mistake. But it is no less a mistake. If you use Google Code Search for the query “0.5 - Math.random()” lang:javascript you will find 50 or so other instances of the faulty algorithm. So if anyone else is using this same algorithm, they should evaluate whether it is really sufficiently random for their needs. In some cases, such as a children’s game, it might be fine. But know that there are better and faster algorithms available that are not much more complicated to code.
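
For readers who have not seen the pattern, here is a minimal sketch of the two approaches side by side. The function names are my own, chosen for illustration; the first wraps the “0.5 - Math.random()” comparator from the search query above, and the second is a plain Fisher-Yates shuffle offered as one example of the better alternatives.

```javascript
// The faulty pattern: a comparator that answers at random.
// Array.prototype.sort assumes a consistent comparator, so the
// distribution of results depends on the engine's sort algorithm.
function comparatorShuffle(items) {
    return items.slice().sort(function () {
        return 0.5 - Math.random();
    });
}

// Fisher-Yates: walk from the end of the array, swapping each slot
// with a uniformly chosen slot at or before it. Linear time.
function fisherYatesShuffle(items) {
    const result = items.slice();
    for (let i = result.length - 1; i > 0; i--) {
        const j = Math.floor(Math.random() * (i + 1));
        const tmp = result[i];
        result[i] = result[j];
        result[j] = tmp;
    }
    return result;
}

console.log(comparatorShuffle([1, 2, 3, 4, 5]));
console.log(fisherYatesShuffle([1, 2, 3, 4, 5]));
```

Both functions return a permutation of the input and leave the original array untouched. The difference only shows up in the statistics, which is why a single run of either looks perfectly reasonable.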

Another thing to note is that the Microsoft Shuffle algorithm is bad enough with 5 elements in the array, but the non-randomness gets more pronounced as you increase the length of the array. Regardless of the size of the array, it appears that on Internet Explorer the 1st element will end up in last place 50% of the time. There are other pronounced patterns as well. You can see this yourself with this test file, which allows you to specify the size of the array as well as the number of iterations. Try a 50-element array for 10,000 iterations to get a good sense of how non-random the results can be.
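
The idea behind such a test can be sketched in a few lines. This is not the linked test file, just an illustration under my own naming: tallyPositions and comparatorShuffle are hypothetical helpers, and the shuffle under test is the random-comparator sort discussed above.

```javascript
// Hypothetical stand-in for the shuffle under test.
function comparatorShuffle(items) {
    return items.slice().sort(function () {
        return 0.5 - Math.random();
    });
}

// Runs shuffleFn `iterations` times on the array [0, 1, ..., size-1].
// counts[e][p] is the number of runs in which the element that
// started at index e finished at index p.
function tallyPositions(shuffleFn, size, iterations) {
    const counts = [];
    const start = [];
    for (let e = 0; e < size; e++) {
        counts.push(new Array(size).fill(0));
        start.push(e);
    }
    for (let run = 0; run < iterations; run++) {
        const shuffled = shuffleFn(start);
        for (let p = 0; p < size; p++) {
            counts[shuffled[p]][p]++;
        }
    }
    return counts;
}

const counts = tallyPositions(comparatorShuffle, 50, 10000);
// Fraction of runs in which the first element finished in last place.
// A fair shuffle of 50 elements would give roughly 1/50 = 0.02.
console.log(counts[0][49] / 10000);
```

The exact numbers depend on the sort implementation in the JavaScript engine you run this on, so expect different patterns in different browsers.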

I used that script to run a large test of 1,000,000 iterations of a 1024-element array. The raw results are here. I took that table, and using R’s image() function produced a rendering of that matrix. You can see here the clear over-representation at some positions, including (in the lower left) the flip of the first position to last place. (I’m not quite satisfied with this rendering. Maybe someone can get a better-looking visualization of this same data.)

Evaluating Microsoft’s new shuffle

Sometime last week (I don’t know the exact date) Microsoft updated the code for the browser choice website with a new random shuffle algorithm. You can see the code, in situ, here. The core of it is in this function:

function ArrayShuffle(a)
{
    var d, c, b=a.length;
    while(b)
    {
        c=Math.floor(Math.random()*b);
        d=a[--b];
        a[b]=a[c];
        a[c]=d;
    }
}

This looks fine to me. I created a new test driver for this routine, which you can try out here. Aside from being much faster, it gives much better results. Here is a run with a million iterations:

Raw counts

Position	I.E.	Firefox	Opera	Chrome	Safari
1	199988	200754	199944	199431	199883
2	200320	200016	199838	199752	200074
3	199702	199680	199911	200865	199842
4	200408	200286	199740	199861	199705
5	199582	199264	200567	200091	200496

Fraction of total

Position	I.E.	Firefox	Opera	Chrome	Safari
1	0.2000	0.2008	0.1999	0.1994	0.1999
2	0.2003	0.2000	0.1998	0.1998	0.2001
3	0.1997	0.1997	0.1999	0.2009	0.1998
4	0.2004	0.2003	0.1997	0.1999	0.1997
5	0.1996	0.1993	0.2006	0.2001	0.2005

And the results of the Chi-square test:

X-squared = 18.9593, df = 16, p-value = 0.2708
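
For anyone who wants to check the arithmetic, here is a rough sketch of how that statistic can be recomputed from the raw counts table. It assumes the figure above is a standard Pearson chi-square test of independence on the 5×5 table, which is consistent with the 16 degrees of freedom reported; the function name chiSquare is my own.

```javascript
// Raw counts from the table above: rows are positions 1-5, columns
// are I.E., Firefox, Opera, Chrome, Safari.
const observed = [
    [199988, 200754, 199944, 199431, 199883],
    [200320, 200016, 199838, 199752, 200074],
    [199702, 199680, 199911, 200865, 199842],
    [200408, 200286, 199740, 199861, 199705],
    [199582, 199264, 200567, 200091, 200496]
];

// Pearson chi-square statistic for a contingency table. The expected
// count for a cell is (row total * column total) / grand total.
function chiSquare(table) {
    const rowTotals = table.map(function (row) {
        return row.reduce(function (a, b) { return a + b; }, 0);
    });
    const colTotals = table[0].map(function (_, c) {
        return table.reduce(function (sum, row) { return sum + row[c]; }, 0);
    });
    const grand = rowTotals.reduce(function (a, b) { return a + b; }, 0);
    let statistic = 0;
    for (let r = 0; r < table.length; r++) {
        for (let c = 0; c < table[r].length; c++) {
            const expected = rowTotals[r] * colTotals[c] / grand;
            const diff = table[r][c] - expected;
            statistic += diff * diff / expected;
        }
    }
    const df = (table.length - 1) * (table[0].length - 1);
    return { statistic: statistic, df: df };
}

const result = chiSquare(observed);
console.log(result.statistic.toFixed(4), result.df); // prints "18.9593 16"
```

Turning the statistic into a p-value requires the chi-square distribution function, which is not part of the JavaScript standard library, so this sketch stops at the statistic and the degrees of freedom.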

Final thoughts

In the end I don’t think it is reasonable to expect every programmer to memorize the Fisher-Yates algorithm. These things belong in our standard libraries. But what I would expect every programmer to know is:

That the problem here is one that requires a “random shuffle”. If you don’t know what it is called, then it will be difficult to look up the known approaches. So this is partially a vocabulary problem. We, as programmers, have a shared vocabulary which we use to describe data structures and algorithms: binary searches, priority heaps, tries, and dozens of other concepts. I don’t blame anyone for not memorizing algorithms, but I would expect a programmer to know what types of algorithms apply to their work.
How to research which algorithm to use in a specific context, including where to find reliable information, how to evaluate the classic trade-offs of time and space, etc. There is almost always more than one way to solve a problem.
That where randomized outputs are needed, the outputs should be statistically tested. I would not expect the average programmer to know how to do a chi-square test, or even to know what one is. But I would expect a mature programmer to know either how to find this out or when to seek help.

National Grammar Day, Bah Humbug!

2010/03/04 By Rob 4 Comments

Evidently today is National Grammar Day. I am not a fan.

Like most Americans of my generation I was taught to identify parts of speech, diagram sentences and intone with the rest of the class the mysteries of the three-and-twenty most holy helping verbs: “is, am, are, was, were, be, being, been, have, has, had, do, did, does, may, must, might, can, could, will, would, shall, should”. Because I was good at it, and felt a call to the service of pedantry, I continued my novitiate in stranger accents, in German, Latin and Greek.

I was well on my path to the priesthood of a grammarian, when in 1992 I abandoned all my vows in a bus in Somerville, Massachusetts, when a drunk showed me what language was really all about.

I’m not one to start a conversation with a stranger — even a sober one — on public transportation. But in this case I had little choice in the matter, since this particular gentleman insisted on initiating a debate on the virtues of the Allman Brothers, a subject which I was neither equipped nor inclined to discuss with him.

When I expressed my disinclination to debate, and further, my ignorance of all things Allman, the dear fellow was offended and let out a string of expletives, starting with “Un-freakin’-believable” (albeit with a more emphatic, saltier interposed participial adjective than I can relate to you here) and continuing for several minutes. Nothing he said was grammatical. Little was even coherent. But what I did understand was pure genius. I wish I had a tape recorder. As my stop approached, I hesitated a moment, intending to thank the man, offer him my congratulations and laud him as a poet of the first order. But the smell, as well as my own instinct for self-preservation, held me in abeyance.

Since that day I have been an apostate to grammar. I think we should all have a range of ways to speak and write, and should be able to modulate according to circumstances. Language is like a wardrobe. A man should have jogging shorts as well as a tuxedo. In the end, language is not about rules. It is about suiting the words to the occasion, about putting the right words in the right places, and what is “right” will depend on circumstances.

So down with grammar, down with the rules! Go, split an infinitive, dangle your participles, and like my good friend on the #86 bus, even separate your inseparable prefixes. To quote Duke Ellington, “If it sounds good, it is good”. And remember that the goal in the end is expression and understanding. If you are understood, then you’ve accomplished more than many.

As Gertrude Stein wrote:

Clarity is of no importance because nobody listens and nobody knows what you mean no matter what you mean, nor how clearly you mean what you mean. But if you have vitality enough of knowing enough of what you mean, somebody and sometime and sometimes a great many will have to realize that you know what you mean and so they will agree that you mean what you know, what you know you mean, which is as near as anybody can come to understanding anyone.