
An Antic Disposition


Document as Activity versus Document as Record

2014/07/31 By Rob 2 Comments

I’ve been thinking some more about the past, present and future of documents. I don’t know exactly where this post will end up, but I think writing it will help me clarify some of my own thoughts.

First, I think technology has clouded our thinking and we’ve been equivocating with the term “document”, using it for two entirely different concepts.

One concept is of the document as the way we do work, but not an end-in-itself.  This is the document as a “collaboration surface”,  short-lived, ephemeral, fleeting, quickly created and equally quickly forgotten.

For example, when I create a few slides for a project status report, I know that the presentation document will never be seen again once the meeting for which it was written has ended. The document serves as a tool for the activity of presenting status, of informing. Twenty years ago we would have used transparencies (“foils”) or sketched out some key points on a blackboard. And 10 years from now, most likely, we will use something else to accomplish this task. It is just a coincidence that today the tools we use for this kind of work also act like WYSIWYG editors and can print and save as “documents”. But that is not necessary, and historically was not often the case.

Similarly, take a spreadsheet. I often use a spreadsheet for a quick ad-hoc “what-if” calculation. Once I have the answer I am done. I don’t even need to save the file. In fact I probably load or save a document only 1 in 5 times that I launch the application. Sometimes people use a spreadsheet as a quick and dirty database. But 20 years ago they would have done these tasks using other tools, not document-oriented, and 10 years from now they may use other tools that are equally not document-related. The spreadsheet primarily supports the activity of modeling and calculating.

Text documents have myriad collaborative uses today, but other tools have emerged as well. Collaboration has moved to non-document interfaces: wikis, instant messaging, forums, etc. Things that would have required routing a typed inter-office memo 50 years ago are now done with blog posts.

That’s one kind of document, the “collaboration surface”, the way we share ideas, work on problems, generally do our work.

And then there is the document as the record of what we did. This is implied by the verb “to document.” This use of documents is still critical, since it is ingrained in various regulatory, legal and business processes. Sometimes you need “a document.” It won’t do to have your business contract on a wiki. You can’t prove conformance to a regulation via a Twitter stream. We may no longer print and file our “hard” documents, but there is still a need for a durable, persistable, portable, signable form of a document. PDF serves well in some instances, but not in others. What does PDF do with a spreadsheet, for example? All the formulas are lost.

This distinction between the two uses of documents seems analogous to the distinction between Systems of Engagement and Systems of Record, and can be considered in that light. It just happens that both concepts used the same technology, the same tools, circa the year 2000, but in general the two are very different.

The obvious question is: What will the future bring? How quickly does our tool set diverge? Do we continue with tools that compromise, holding back collaborative features because they must also serve as tools to author document records? Or do we unchain collaborative tools and allow them to focus on what they do best?


Filed Under: ODF, OpenOffice

Announcing OpenLibreOffice

2014/04/01 By Rob 3 Comments

2014-04-01

The Internet

The Apache OpenOffice project and The Document Foundation are pleased to announce that an agreement has been made to combine resources and jointly develop a next-generation open source office suite, to be called “OpenLibreOffice” (except in France, where it will be called “LibreOfficeOpen”). OpenLibreOffice will be quad-licensed under the ALv2, MPL, LGPL and WTFPL licenses, so programmers can maximize their ability to express fine distinctions about copyright law. Similarly, source code for OpenLibreOffice will be made available in C++, C#, Java and Ruby, for the benefit of attorneys who wish to make fine distinctions about type checking.

“Some people eat meat. Some are vegetarians. Some are vegan, and won’t even eat eggs or cheese,” said Michael Meeks of Koolibra. “These distinctions are important to how we look at ourselves. The choice of open source license gives us each an opportunity to feel morally superior, which is the primary joy of open source development.”

This new joint effort brings an end to the brief fork that had disrupted development of the decade-old OpenOffice project and led to a passionate contest to see which project would fail the slowest. As former TDF Board Member Charles Schulz recalls:

The fork originated over a disagreement over the color of icons in the toolbar.  Or something like that.  I don’t really remember.  It was 2011 and everyone was protesting for something.  ‘Occupy OpenOffice’ didn’t sound right, so we just called it ‘LibreOffice’.  It was intended to be a placeholder name.  We were hoping, after a suitable period of insults and ridicule, that Oracle would just give us the trademark for OpenOffice.  For unknown reasons, likely involving IBM, the Military-Industrial Complex and the Trilateral Commission, that plan didn’t work.  By the time we realized that no one outside of France and Spain knew how to pronounce ‘LibreOffice’, it was too late.

LibreOffice shipped 68 releases over the 4 year duration of their fork, fixing over 1673 bugs and introducing only 1532 new bugs, making it the most productive, though least efficient, open source project of all time.  Apache has made only two releases in the last year, taking the “principle of least astonishment” to new levels.

Apache OpenOffice Poo-Bah Rob Weir applauded news of the announcement:

Users will quickly benefit from the combined engineering effort on OpenLibreOffice.  But even greater things await the public when the marketing efforts combine and 100 million downloads of OpenOffice get transformed into colorful infographics showing 20 billion IP addresses or abstract videos of flashing lights accompanied by jazz flute music.

In related news, Microsoft released a new policy paper suggesting that open source software was partially responsible for European economic woes, due to the lack of VAT revenue, and proposed a special new surtax on open source software, “in the interest of fairness and open competition”.

###


Filed Under: Open Source, OpenOffice

ODF 1.2 Submitted to ISO

2014/03/31 By Rob 8 Comments

Last Wednesday, March 26th, on Document Freedom Day, OASIS submitted the Open Document Format (ODF) 1.2 standard to the ISO/IEC JTC1 Secretariat for transposition to an International Standard under the Publicly Available Specification (PAS) procedure.

If you recall, the PAS procedure is what we used back in 2005 when ODF 1.0 was submitted to ISO and approved as ISO/IEC 26300. ODF 1.1 used a different procedure and was processed as an amendment to ISO/IEC 26300. Since ODF 1.2 is a much larger delta from the previous version, it makes sense to take it through the PAS procedure again.

The PAS transposition process starts with a two-month “translation period” when National Bodies may translate the ODF 1.2 specification if they wish. This is then followed by a three-month ballot. Following a successful ballot, any comments received are reviewed by all stakeholders and resolutions determined at a Ballot Resolution Meeting (BRM).

I am notoriously bad at predicting the pace of standards development, but if you add up the steps of the process, this looks like a ballot ending in Q4 and a BRM around year’s end.


Filed Under: OASIS, ODF, Standards

The Words Democrats and Republicans Use

2014/02/07 By Rob Leave a Comment

It came to me after listening to the State of the Union Address: can we tell whether a speech was from a Democrat or a Republican President, purely based on metrics related to the words used? It makes sense that we could. After all, we can analyze emails and detect spam that way. Automatic text classification is a well-known problem. On the other hand, presidential speeches go back quite a bit. Is there commonality between the speeches of a Democrat in 2014 and one from 1950? Only one way to find out…

I decided to limit myself to State of the Union (SOTU) addresses, since they are readily available, and only those post-World War II. There has been a significant shift in American politics since WWII, so it made sense, for continuity, to look at Truman and later. If I had included all of Roosevelt’s twelve (!) SOTU speeches it might have distorted the results, giving undue weight to individual stylistic factors. So I grabbed the 71 post-WWII addresses and stuck them into a directory. I included only the annual addresses, not any exceptional ones, like G.W. Bush’s special SOTU in September 2001.

I then used R’s text mining package, tm, to load the files into a corpus, tokenize, remove punctuation, stop words, etc.  I then created a document-term matrix and removed any terms that occurred in fewer than half of the speeches.  This left me with counts of 610 terms in 71 documents.
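In case you want to reproduce this, the preprocessing looks roughly like the following (a sketch, not my exact script; the directory name is hypothetical):

```r
library(tm)

# Load the 71 speech files into a corpus (directory name is hypothetical)
corpus <- VCorpus(DirSource("sotu_speeches/", encoding = "UTF-8"))

# Standard cleanup: lower-case, strip punctuation, numbers and stop words
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# Document-term matrix; drop terms occurring in fewer than half the speeches
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.5)
```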

Then came the fun part. I decided to use Pointwise Mutual Information (PMI), a measure of association from information retrieval, to look at the association between terms in the speeches and party affiliation. PMI shows the degree of association (or “collocation”) of two terms while also accounting for the prevalence of each term individually. Wikipedia gives the formula, which is pretty much what you would expect: calculate the log probability of the collocation and subtract out the log probability of the background rate of the term. But instead of looking at the co-occurrence of two terms, I tried looking at the co-occurrence of terms with the party affiliation. For example, the PMI of “taxes” with the class Democrat would be: log p(“taxes”|Democrat) − log p(“taxes”). You can see my full script for the gory details.
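In outline, the computation looks something like this (again a sketch; it assumes a vector of party labels, here called party_labels, alongside the document-term matrix from above):

```r
# PMI of term t with class c:  pmi(t, c) = log p(t | c) - log p(t)
m     <- as.matrix(dtm)               # 71 documents x 610 terms
party <- factor(party_labels)         # hypothetical: "D" or "R" per speech

p_term <- colSums(m) / sum(m)         # background rate p(t)

m_dem <- m[party == "D", , drop = FALSE]
p_dem <- colSums(m_dem) / sum(m_dem)  # p(t | Democrat)

pmi_dem <- log(p_dem) - log(p_term)   # PMI with the class Democrat
head(sort(pmi_dem, decreasing = TRUE), 25)  # 25 highest-PMI Democrat terms
```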

Here’s what I got, listing the 25 highest PMI terms for Democrats and Republicans:

So what does this all mean?  First note the difference in scale.  The top Republican terms had higher PMI than the top Democrat terms.  In some sense it is a political Rorschach test.  You’ll see what you want to see.  But in fairness to both parties I think this does accurately reflect their traditional priorities.

From the analytic standpoint, the interesting thing I notice is how this compares to other approaches, like classification trees. For example, if I train a recursive partitioning classification tree on the original data, using rpart, I can classify the speeches with 86% accuracy by looking at the occurrences of only two terms:
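A fit along those lines looks roughly like this (a sketch, reusing the term matrix and party labels from the PMI example):

```r
library(rpart)

# Predict party from the term counts with a recursive partitioning tree
df  <- data.frame(party = party, as.matrix(dtm))
fit <- rpart(party ~ ., data = df, method = "class")

# Resubstitution accuracy (86% on my data)
mean(predict(fit, df, type = "class") == df$party)
```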

Not a lot of insight there. It essentially latched on to background noise and two semantically useless words.   So I prefer the PMI-based results since they appear to have more semantic weight.

Next steps: I’d like to apply this approach back to speeches from 1860 through 1945.


Filed Under: Language, R

First Move Advantage in Chess

2014/01/27 By Rob 8 Comments

The Elo Rating System

Competitive chess players, at the amateur club level all the way through the top grandmasters, receive ratings based on their performance in games.   The ratings formula in use since 1960 is based on a model first proposed by the Hungarian-American physicist Arpad Elo.  It uses a logistic equation to estimate the probability of a player winning as a function of that player’s rating advantage over his opponent:

E = \frac{1}{1 + 10^{-\Delta R / 400}}

So, for example, if you play an opponent who out-rates you by 200 points, your expected score is only 24%.
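That expected score is a one-liner to compute (the function name here is just for illustration):

```r
# Elo's expected score for a player with rating advantage dR
elo_expected <- function(dR) 1 / (1 + 10^(-dR / 400))

elo_expected(-200)  # ~0.24: facing an opponent rated 200 points higher
elo_expected(0)     # 0.50: evenly matched
```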

After each tournament, game results are fed back to a national or international rating agency and the ratings adjusted. If you score better than expected against the level of opposition played, your rating goes up. If you do worse, it goes down. Winning against an opponent much weaker than you will lift your rating little. Defeating a higher-rated opponent will raise your rating more.
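In its textbook form the adjustment is linear in the gap between actual and expected score, scaled by a constant K (the “K-factor”, whose value varies by rating agency and player level):

R' = R + K(S - E)

where S is the score actually achieved and E is the expected score from the formula above.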

That’s the basics of the Elo rating system, in its pure form.  In practice it is slightly modified, with ratings floors, bootstrapping new unrated  players, etc.  But that is its essence.

Measuring the First Mover Advantage

It has long been known that the player that moves first, conventionally called “white”, has a slight advantage, due to their ability to develop their pieces faster and their greater ability to coax the opening phase of the game toward a system that they prefer.

So how can we show this advantage using a lot of data?

I started with a Chessbase database of 1,687,282 chess games, played from 2000 to 2013. All games had a minimum player rating of 2000 (a good club player). I excluded all computer games. I also excluded 0- and 1-move games, which usually indicate a default (a player not showing up for an assigned game) or a bye. I exported the games to PGN format and extracted the metadata for each game to a CSV file via a Python script. Additional processing was then done in R.

Looking at the distribution of ratings differences (white Elo − black Elo), two oddities stand out. First, note the excess of games with a ratings difference of exactly zero. I’m not sure what caused that, but since only 0.3% of games had this property, I ignored it. Also, there is clearly a “fringe” of excess counts for ratings differences that are exact multiples of 5. This suggests some quantization effect in some of the ratings, but it should not harm the following analysis.


The collection has results of:

  • 1-0 (36.4%)
  • 1/2-1/2 (35.5%)
  • 0-1 (28.1%)

So the overall score, from white’s perspective, was 54.2% (counting a win as 1 point and a draw as 0.5: 0.364 + 0.355/2 ≈ 0.542).

So white has a 4.2% first-move advantage, yes? Not so fast. A look at the average ratings in the games shows:

  • mean white Elo: 2312
  • mean black Elo: 2309

So on average white was slightly higher rated than black in these games. A t-test indicated that the difference in means was significant at the 95% confidence level. So we’ll need to do some more work to tease out the actual advantage for white.

Looking for a Performance Advantage

I took the data and binned it by ratings difference, from -400 to 400, and for each difference I calculated the expected score, per the Elo formula, and the average actual score in games played with that ratings difference. The following chart shows black circles for the actual scores and a red line for the predicted score. Again, this is from white’s perspective. Clearly the actual score is above the expected score for most of the range. In fact, white appears evenly matched even when playing against an opponent rated 35 points higher.
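The binning and comparison are straightforward in R (a sketch; it assumes a data frame games with the per-game ratings and white’s score, and reuses elo_expected from above):

```r
# games: one row per game, with whiteElo, blackElo and white's score (1, 0.5 or 0)
games$diff <- games$whiteElo - games$blackElo
games      <- subset(games, abs(diff) <= 400)

bins <- aggregate(score ~ diff, data = games, FUN = mean)  # actual mean score per difference
bins$expected <- elo_expected(bins$diff)                   # Elo-predicted score
bins$excess   <- bins$score - bins$expected                # white's edge over expectation

plot(bins$diff, bins$score,
     xlab = "White rating advantage (Elo)", ylab = "Mean score for white")
lines(bins$diff, bins$expected, col = "red")

mean(bins$excess)  # roughly 0.034 on my data
```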

The trend is a bit clearer if we look at the “excess score,” the amount by which white’s results exceed the expected results. In the following chart the average excess score is indicated by a dotted line at y = 0.034. So the average performance advantage for white, accounting for the strength of opposition, was around 3.4%. But note how the advantage is strongest where white is playing a slightly stronger player.

Finally I looked at the actual game results, the distribution of wins, draws and losses, by ratings differences.  The Elo formula doesn’t speak to this.  It deals with expected scores.  But in the real world one cannot score 0.8 in a game.   There are only three options:  win, draw or lose.  In this chart you see the first mover advantage in another way.  The entire range of outcomes is essentially shifted over to the left by 35 points.


Filed Under: Chess, R Tagged With: Chess



Copyright © 2006-2023 Rob Weir · Site Policies