The OASIS ODF Interoperability and Conformance TC has as a primary goal to:

Initially and periodically thereafter, to review the current state of conformance and interoperability among a number of ODF implementations; To produce reports on overall trends in conformance and interoperability that note areas of accomplishment as well as areas needing improvement, and to recommend prioritized activities for advancing the state of conformance and interoperability among ODF implementations in general without identifying or commenting on particular implementations;

The initial “State of ODF Interoperability” report has now gone out for public review. It is a baseline report, surveying the context of document interoperability, the sources of interoperability problems, and the ways in which these problems are being addressed. Although it explicitly deals with ODF interoperability, much of the report is equally relevant to any other office document format, XML-based or binary.

If you want to participate in the public review,  you can find links to the  draft, as well as instructions for submitting comments, in the OASIS announcement of the review.

{ 0 comments }

A quick update on my post from last week on the “Microsoft Shuffle”, where I looked at how Microsoft’s “random” browser ballot was far from random.

First, I’d like to thank those who commented on that post, or sent me notes, offering additional analysis. I think we nailed this one. Within a few days of my report Microsoft updated their Javascript on the browserchoice.eu website, fixing the error. But more on that in a minute.

Some random observations

Several commenters mentioned that if you search Google for “javascript random array sort” the first link returned will be a Javascript tutorial that has the same offending code as Microsoft’s algorithm. This is not surprising. As I said in my original post, this is a well-known mistake. But it is no less a mistake. If you use Google Code Search for the query “0.5 - Math.random()” lang:javascript you will find 50 or so other instances of the faulty algorithm. So if anyone else is using this same algorithm, they should evaluate whether it is really sufficiently random for their needs. In some cases, such as a children’s game, it might be fine. But know that there are better and faster algorithms available that are not much more complicated to code.
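For concreteness, here is a sketch of the pattern those searches turn up, next to an unbiased alternative. The function names are mine, chosen for illustration. The second function is the standard Fisher-Yates shuffle, not code taken from any particular site.

```javascript
// The faulty idiom: a comparator that answers at random.
// A sort expects consistent comparisons, so the resulting order is
// biased in ways that depend on the engine's sort implementation.
function naiveSortShuffle(items) {
  return items.slice().sort(function () {
    return 0.5 - Math.random();
  });
}

// Fisher-Yates: walk from the end, swapping each slot with a
// uniformly chosen slot at or before it. Runs in linear time.
function fisherYatesShuffle(items) {
  var result = items.slice();
  for (var i = result.length - 1; i > 0; i--) {
    var j = Math.floor(Math.random() * (i + 1));
    var tmp = result[i];
    result[i] = result[j];
    result[j] = tmp;
  }
  return result;
}

console.log(fisherYatesShuffle(["IE", "Firefox", "Opera", "Chrome", "Safari"]));
```

Both functions copy their input, so the caller's array is left untouched.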

Another thing to note is that the Microsoft Shuffle algorithm is bad enough with 5 elements in the array, but the non-randomness gets more pronounced as you increase the length of the array. Regardless of the size of the array, it appears that on Internet Explorer the 1st element will end up in last place 50% of the time. There are other pronounced patterns as well. You can see this yourself with this test file, which allows you to specify the size of the array as well as the number of iterations. Try a 50-element array for 10,000 iterations to get a good sense of how non-random the results can be.
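The kind of tally such a test produces can be sketched in a few lines. This is my own minimal reconstruction, not the test file itself. `tallyPositions` and its parameters are names chosen for illustration, and the skew you see will depend on the JavaScript engine's sort.

```javascript
// Count how often the element that starts at index 0 lands in each position.
// `shuffle` is any function that shuffles an array in place.
function tallyPositions(shuffle, size, iterations) {
  var counts = [];
  for (var p = 0; p < size; p++) counts.push(0);
  for (var n = 0; n < iterations; n++) {
    var a = [];
    for (var i = 0; i < size; i++) a.push(i);
    shuffle(a);
    counts[a.indexOf(0)]++;
  }
  return counts;
}

// The comparator-based shuffle discussed above.
function sortShuffle(a) {
  a.sort(function () { return 0.5 - Math.random(); });
}

// A uniform shuffle would put roughly iterations / size in every slot.
console.log(tallyPositions(sortShuffle, 50, 10000));
```

Passing a different shuffle function as the first argument lets you compare algorithms with the same harness.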

I used that script to run a large test of 1,000,000 iterations of a 1024-element array. The raw results are here. I took that table, and using R’s image() function produced a rendering of that matrix. You can see here the clear over-representation at some positions, including (in the lower left) the flip of the first position to last place. (I’m not quite satisfied with this rendering. Maybe someone can get a better-looking visualization of this same data.)

Evaluating Microsoft’s new shuffle

Sometime last week (I don’t know the exact date) Microsoft updated the code for the browser choice website with a new random shuffle algorithm. You can see the code, in situ, here. The core of it is in this function:

function ArrayShuffle(a)
{
    var d, c, b=a.length;
    while(b)
    {
        c=Math.floor(Math.random()*b);
        d=a[--b];
        a[b]=a[c];
        a[c]=d
    }
}

This looks fine to me. I created a new test driver for this routine, which you can try out here. Aside from being much faster, it gives much better results. Here is a run with a million iterations:

Raw counts

Position I.E. Firefox Opera Chrome Safari
1 199988 200754 199944 199431 199883
2 200320 200016 199838 199752 200074
3 199702 199680 199911 200865 199842
4 200408 200286 199740 199861 199705
5 199582 199264 200567 200091 200496

Fraction of total

Position I.E. Firefox Opera Chrome Safari
1 0.2000 0.2008 0.1999 0.1994 0.1999
2 0.2003 0.2000 0.1998 0.1998 0.2001
3 0.1997 0.1997 0.1999 0.2009 0.1998
4 0.2004 0.2003 0.1997 0.1999 0.1997
5 0.1996 0.1993 0.2006 0.2001 0.2005

And the results of the Chi-square test:

X-squared = 18.9593, df = 16, p-value = 0.2708
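For readers who have not met the test before, the statistic can be recomputed from the raw counts above with a few lines of arithmetic. This sketch is mine, written for illustration. It computes only the statistic and the degrees of freedom, and leaves the p-value to a statistics package.

```javascript
// Raw counts from the table: rows are positions 1-5,
// columns are I.E., Firefox, Opera, Chrome, Safari.
var observed = [
  [199988, 200754, 199944, 199431, 199883],
  [200320, 200016, 199838, 199752, 200074],
  [199702, 199680, 199911, 200865, 199842],
  [200408, 200286, 199740, 199861, 199705],
  [199582, 199264, 200567, 200091, 200496]
];

// Pearson's chi-square test of independence on a contingency table.
function chiSquare(table) {
  var rows = table.length, cols = table[0].length;
  var rowSums = [], colSums = [], total = 0, r, c;
  for (r = 0; r < rows; r++) rowSums.push(0);
  for (c = 0; c < cols; c++) colSums.push(0);
  for (r = 0; r < rows; r++) {
    for (c = 0; c < cols; c++) {
      rowSums[r] += table[r][c];
      colSums[c] += table[r][c];
      total += table[r][c];
    }
  }
  var statistic = 0;
  for (r = 0; r < rows; r++) {
    for (c = 0; c < cols; c++) {
      // Expected count if position and browser were independent.
      var expected = rowSums[r] * colSums[c] / total;
      var diff = table[r][c] - expected;
      statistic += diff * diff / expected;
    }
  }
  return { statistic: statistic, df: (rows - 1) * (cols - 1) };
}

var result = chiSquare(observed);
console.log(result.statistic.toFixed(4), result.df);
```

Every row and column here sums to 1,000,000, so each expected count is 200,000, and the statistic comes out to about 18.9593 with 16 degrees of freedom, matching the figures above.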

Final thoughts

In the end I don’t think it is reasonable to expect every programmer to memorize the Fisher-Yates algorithm. These things belong in our standard libraries. But what I would expect every programmer to know is:

  1. That the problem here is one that requires a “random shuffle”. If you don’t know what it is called, then it will be difficult to look up the known approaches. So this is partially a vocabulary problem. We, as programmers, have a shared vocabulary which we use to describe data structures and algorithms: binary searches, priority heaps, tries, and dozens of other concepts. I don’t blame anyone for not memorizing algorithms, but I would expect a programmer to know what types of algorithms apply to their work.
  2. How to research which algorithm to use in a specific context, including where to find reliable information, how to evaluate the classic trade-offs of time and space, etc.  There is almost always more than one way to solve a problem.
  3. That where randomized outputs are needed, the outputs should be statistically tested. I would not expect the average programmer to know how to do a chi-square test, or even to know what one is. But I would expect a mature programmer to know either to find this out or to seek help.

{ 22 comments }

Evidently today is National Grammar Day.  I am not a fan.

Like most Americans of my generation I was taught to identify parts of speech, diagram sentences and intone with the rest of the class the mysteries of the three-and-twenty most holy helping verbs: “is, am, are, was, were, be, being, been, have, has, had, do, did, does, may, must, might, can, could, will, would, shall, should”.   Because I was good at it, and felt a call to the service of pedantry, I continued my novitiate in stranger accents, in German, Latin and Greek.

I was well on my path to the priesthood of a grammarian, when in 1992 I abandoned all my vows in a bus in Somerville, Massachusetts, when a drunk showed me what language was really all about.

I’m not one to start a conversation with a stranger (even a sober one) on public transportation.  But in this case I had little choice in the matter, since this particular gentleman insisted on initiating a debate on the virtues of the Allman Brothers, a subject which I was neither equipped nor inclined to discuss with him.

When I expressed my disinclination to debate, and further, my ignorance of all things Allman, the dear fellow was offended and let out a string of expletives, starting with “Un-freakin’-believable” (albeit with a more emphatic, saltier interposed participial adjective than I can relate to you here) and continuing for several minutes.  Nothing he said was grammatical.  Little was even coherent.  But what I did understand was pure genius.  I wish I had a tape recorder.  As my stop approached, I hesitated a moment, intending to thank the man, offer him my congratulations and laud him as a poet of the first order.  But the smell, as well as my own instinct for self-preservation, held me in abeyance.

Since that day I have been an apostate to grammar.  I think we should all have a range of ways to speak and write, and should be able to modulate according to circumstances. Language is like a wardrobe.  A man should have jogging shorts as well as a tuxedo.  In the end, language is not about rules.  It is about suiting the words to the occasion, about putting the right words in the right places, and what is “right” will depend on circumstances.

So down with grammar, down with the rules! Go, split an infinitive, dangle your participles, and like my good friend on the #86 bus, even separate your inseparable prefixes.  To quote Duke Ellington, “If it sounds good, it is good”.  And remember that the goal in the end is expression and understanding.  If you are understood, then you’ve accomplished more than many.

As Gertrude Stein wrote:

Clarity is of no importance because nobody listens and nobody knows what you mean no matter what you mean, nor how clearly you mean what you mean. But if you have vitality enough of knowing enough of what you mean, somebody and sometime and sometimes a great many will have to realize that you know what you mean and so they will agree that you mean what you know, what you know you mean, which is as near as anybody can come to understanding anyone.

{ 4 comments }

Doing the Microsoft Shuffle: Algorithm Fail in Browser Ballot

February 27, 2010

March 6th Update:  Microsoft appears to have updated the www.browserchoice.eu website and corrected the error I describe in this post.  More details on the fix can be found in The New & Improved Microsoft Shuffle.  However, I think you will still find the following analysis interesting.
-Rob

Introduction
The story first hit in last week on the Slovakian [...]

163 comments Read the full article →

How to photograph an asteroid

February 22, 2010

Over the years, I’ve seen Mercury, Venus, Mars, Jupiter and Saturn with my naked eyes.  And I’ve seen Uranus and Neptune through a telescope.  But I had never seen an asteroid until last night, when I photographed the 2nd largest minor planet, Vesta.
Vesta is currently near opposition, meaning as seen from the Earth, Vesta and the [...]

0 comments Read the full article →

Weekly Links #3

February 20, 2010

Danish Open Source Vendors declares victory in open standards war
“The mood at this year’s general meeting was joyous. In late January 2010, OSL could declare victory in maybe the most important and hard fought battle that OSL has been part of since its formation.
On 29 January 2010 the Danish Parliament (Folketinget) decided unanimously to place [...]

0 comments Read the full article →

Microsoft Office document corruption: Testing the OOXML claims

February 15, 2010

Summary
In this post I take a look at Microsoft’s claims for robust data recovery with their Office Open XML (OOXML) file format.  I show the results of an experiment, where I introduce random errors into documents and observe whether word processors can recover from these errors.  Based on these results, I estimate data recovery rates [...]

22 comments Read the full article →

Weekly Links #2

February 13, 2010

Q&A: IBM’s Alistair Rennie on the big picture for Lotus | Between the Lines | ZDNet.com
“Ultimately, new types of documents are possible. I don’t see my kids creating content 5 to 10 years from now going into a dumb text editor and doing a presentation. I see them using rich media and aggregating small bits [...]

2 comments Read the full article →

Weekly Links #1

February 6, 2010

Advogato: Proprietary File Formats conflict with Equal Opportunities
tags:  standards

Bob Sutor: What would ODF support for WordPress look like?
tags: ODF

Lotus Solutions Development Lab: Lab 04: ODFDOM: Generating ODF Documents from a Notes Agent
tags: ODF, Notes

Gwennel – A WYSIWYG and WYSIWYM editor
“Gwennel is a free WYSIWYG and WYSIWYM editor for Windows supporting natively the Open Document Format.”
tags: [...]

0 comments Read the full article →

ODF 1.2 Part 1 Public Review

January 25, 2010

A major milestone was reached for the OASIS ODF TC today.  The latest Committee Draft of ODF 1.2 Part 1 was sent out for a 60-day public review.
“What does this mean, and why should I care?” you might be asking.  That’s a fair question.
First, a quick review of the OASIS standards approval process.  The stages [...]

6 comments Read the full article →