An Antic Disposition

Binging the Bard

Rob — Wed, 01 Jan 2020 16:58:11 +0000

With streaming video, binging has become particularly easy. This Christmas we watched all of the Die Hard movies, for example. Last year, on Boxing Day, we, of course, watched all of the Rocky movies.

For 2020 I’m planning on trying something a little different. I’m aiming to read the complete works of Shakespeare: all the comedies, tragedies and histories, all the sonnets, even all the weird odds and ends, e.g., the metaphysical poem, “The Phoenix and the Turtle.”

I’ll be following the reading plan described at the Shakespeare 2020 Project, which generally follows the chronological order of the Riverside Shakespeare edition, with a couple of modifications to align certain plays with the calendar. For example, the reading plan starts with Twelfth Night, aligning it with that holiday, also called Epiphany.

I think I’ve read around 2/3 of the plays already. It will be good to revisit old friends and get to know some new ones. I’ve barely dipped into the sonnets before, and I’m particularly looking forward to spending some time with them.

Distributed, personalized fact-checking of social network streams

Rob — Tue, 01 Oct 2019 20:44:45 +0000

The Dangers of Centralized Control

A report on CNN online today covers criticism of Facebook by the Democratic National Committee, which claims that the social media giant allows President Trump “to mislead the American people on their platform unimpeded.” The solution desired by the DNC is for Facebook to subject the president’s social media posts to “fact-checking,” something that the company is already starting to do with posts by ordinary folks, but has not yet done with politicians.

The dangers of such an approach ought to be clear to all. Quis custodiet ipsos custodes?, as the ancients wrote. Who will watch the watchmen? Do we really want such concentrated power over political speech and political campaigns? I have friends at Facebook and Google. They are very smart people. But they are not so smart that they can act as censors to the planet. No one is that smart.

Of course, there are certainly harmful social media posts out there, dangerous medical advice, hate speech, bullying, frauds, spam, and so on, things that few of us would like to see in our social media streams.

At the same time, we need to be mindful that the history of progress has been the history of unpopular ideas gaining acceptance and becoming mainstream, from religious toleration to the abolition of slavery, to women’s suffrage, to gay marriage.

These ideas, now considered sacrosanct liberties, were once suppressed by centralized control over the mass distribution of ideas, the social networking of the day. For example, early proponents of contraception, getting their ideas out peacefully via newspapers, were prosecuted under laws that made it illegal to send “obscene” materials through the U.S. Postal Service, laws which were construed to include even information about birth control.

There are dangers in doing too much, and in doing too little. If we hand central control over censorship (which is what we’re talking about) over to a single company, or a small number of companies who, because of their great size are constantly in the crosshairs of antitrust regulators in the U.S. and in Europe, we effectively hand the “kill switch” of the internet over to the present ruling class, and likely stall future social progress. On the other hand, if we do nothing, our social media accounts become unsavory places where those with the worst intentions get the most attention.

An Analogy: Microsoft Windows and Internet Explorer

Recall, years ago, when the big controversy was over Microsoft’s platform monopoly with Microsoft Windows and how they apparently were extending that monopoly to their Internet Explorer browser, and to related internet technologies like ActiveX controls.

It was reasonable for, say, a hyperlink in a Microsoft Word document to launch a browser when clicked. What else could Microsoft do but launch their own Internet Explorer browser? At the time, this was argued for on the basis of user convenience. A highly integrated solution, one the user did not need to tinker with, was said to be more friendly for the user. Allowing other browsers to insert themselves would be “messy” and lead to user confusion.

Regulators saw things differently. In order to resolve antitrust complaints, Microsoft eventually agreed to open up their Windows platform and allow internet browsers from other companies equal access for integration into their platform. So, Netscape and Opera, and eventually Firefox and Chrome, could install themselves to be the “default” browser for a user, and respond, say, to hyperlink clicks in a Word document.

This move encouraged much progress in the development of web browsers over the ensuing years, competition that Microsoft ultimately lost. But the users all won, by having more choice.

An Open Approach to Social Media Filtering

The debate seems to be focused on what Facebook ought to do, how strict Facebook ought to be with fact-checking, etc. But ask the wrong question and you’ll never get the right answer. Instead, we ought to be asking ourselves whether or not we ought to be limited to just Facebook’s filtering algorithm.

The fundamental problem here with Facebook and their censorship is the acceptance of artificial scarcity in the provision of filtering (censoring) services for social media streams, caused by the lack of an open interface.

Just as Microsoft was made to open up their browser integrations, it is worth considering the benefits of having Facebook’s monopoly on “fact checking” and similar content filtering, so much of it clearly politically charged, tempered by open competition.

Imagine a user interface by which any user could select and subscribe to one or more web services, run by 3rd parties, that would filter his social media stream, and promote, demote, hide, or annotate posts per the judgement of those services.

This could be done, for example, by associating a URI with each post and sending a list of these to the external web service, which would, consulting its own database and/or algorithms, return a list of actions for each URI: promote, demote, hide, annotate, etc., actions which Facebook would then comply with, for that user and that user only.

Web services could be run by non-profits concerned with hate groups, like the Southern Poverty Law Center, with political groups, even political parties, with churches, with schools, etc. Some would be liberal, some conservative, some associated with governments, some from newspapers, some from private associations. It would be open to any and all.

A user could select more than one filtering service and arrange them in a priority order. Since the interface would be open, a service could aggregate other services. So, just as some would specialize in filtering, others might specialize in identifying unique and useful groups doing filtering.
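The exchange described above (a list of post URIs goes out, a list of actions comes back, and several services are consulted in priority order) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `resolve_actions`, `medical_checker` and `spam_blocker`, the example.com URIs, and the "first service with an opinion wins" rule are all hypothetical, and real services would be remote web services rather than local functions.

```python
from typing import Callable, Dict, List, Optional

# The actions named in the post. A fuller design would let "annotate"
# carry a note; here every action is just a label.
ACTIONS = {"promote", "demote", "hide", "annotate"}

# A filtering service is modeled as a function from a list of post URIs
# to a partial mapping {uri: action}. URIs it has no opinion on are omitted.
FilterService = Callable[[List[str]], Dict[str, str]]


def resolve_actions(
    uris: List[str], services: List[FilterService]
) -> Dict[str, Optional[str]]:
    """Consult each subscribed service in the user's priority order and keep
    the first opinion offered for each URI (an assumed resolution rule).
    A value of None means the post is shown unchanged."""
    decided: Dict[str, Optional[str]] = {uri: None for uri in uris}
    for service in services:
        pending = [uri for uri in uris if decided[uri] is None]
        if not pending:
            break
        for uri, action in service(pending).items():
            # Ignore unknown URIs and unknown actions from a service.
            if uri in decided and decided[uri] is None and action in ACTIONS:
                decided[uri] = action
    return decided


# Two hypothetical services, stubbed with fixed lookup tables.
def medical_checker(uris: List[str]) -> Dict[str, str]:
    table = {"https://example.com/post/2": "annotate"}
    return {uri: table[uri] for uri in uris if uri in table}


def spam_blocker(uris: List[str]) -> Dict[str, str]:
    table = {
        "https://example.com/post/2": "hide",
        "https://example.com/post/3": "hide",
    }
    return {uri: table[uri] for uri in uris if uri in table}


stream = [f"https://example.com/post/{n}" for n in (1, 2, 3)]
result = resolve_actions(stream, [medical_checker, spam_blocker])
```

Because `medical_checker` is listed first, its "annotate" for post 2 takes precedence over the "hide" from `spam_blocker`. An aggregating service would simply be another `FilterService` that calls `resolve_actions` internally.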

Some services might charge for their use. Others might pay users for being used.

Some filtering might be crowd-sourced, hooked into a browser plugin that allows users to “report” objectionable material to one or more filtering services.

None of the services would be obligatory.

And, of course, Facebook would not be locked out of providing its own filtering services. But it would be just one of many, and could easily be disabled by a user if he disagreed with Facebook’s judgement.

In Closing

An absolute technological monopoly on filtering content in the world’s largest social network must not become a politically-exploitable monopoly on free speech. To allow that is to invite absolutism. The technological means of getting around this, by an open interface for user-pluggable filtering services, is in our hands. We’ve done stuff like this before. It is not all that hard. And aside from its healthy, pro-choice, pro-competition, pro-consumer aspects, I suspect Facebook itself would welcome relief from perpetually being in the hot seat with respect to content filtering. This is a good opportunity for them to wash their hands of it, and end their unasked-for and undeserved role as global censor of social media, and return to being a common carrier, freed from responsibility and liability for censoring content.

Seventh Planet from the Sun

Rob — Tue, 06 Nov 2018 19:23:24 +0000

See below, a page from a high school textbook in use in New England in 1849, Elijah Burritt’s The Geography of the Heavens. Ever hear of this planet?

The name for the seventh planet was up in the air (no pun intended) for years after its discovery in 1781. The British were calling it “Georgium Sidus” (George’s Star), after King George III. (Flattery will get you everywhere. When Galileo discovered the moons of Jupiter in 1610, he called them “Cosmica Sidera” (Cosimo’s Stars) after his patron, the Grand Duke de’Medici.)

The Americans, of course, were not huge fans of King George III. So they named the planet after its discoverer, William Herschel.

Still others called for it to be called “Neptune.” But, in the end (again, no pun intended) the scientific community adopted the name “Uranus,” much to the distress of every 7th grade science teacher.

Downsized

Rob — Wed, 27 Jun 2018 15:16:55 +0000

As of today, June 27th, I am no longer working for IBM. Last quarter’s widely-reported “resource actions” (lay-offs) hit my group and this time my number came up.

It was a good run, 27 years with one company, something that is not so common today.

Fresh out of Harvard I started working at Lotus Development Corporation in Cambridge, Massachusetts, initially doing technical support, including for the Lotus 1-2-3 C-language developer toolkit. From there I worked in support’s application development team, developing and maintaining our internal information retrieval system, a hodgepodge of a DOS user-interface, a search engine (using a Bayesian inference network) and a fax-on-demand system, all over NetBIOS.

From support I transitioned over to development, to the SmartSuite team, where I first focused on Freelance Graphics, which was transitioning from C to C++, Windows and OS/2, then on a set of Windows ActiveX controls called eSuite DevPack, some Java components and attempts at an office suite running on a Java-based “thin client” or network computer (eSuite Workplace). It was a time when the thinking, at least in my little part of the world, was that the traditional desktop applications were dead, and all future work would be done in Java running on your desktop web browser. From this came the browser wars.

Then, in 1995, IBM came a knocking and bought Lotus. Our focus, naturally, shifted from desktop to server-based computing, from Java applets to Java servlets. I worked on various projects, from the K-Station Portal (based initially on Domino) to the Apache Xalan XSLT engine to XForms to WebSphere Portal. I developed a framework for document conversions within WebSphere Portal that we called Document Conversion Services (DCS).

Then, one day, I got an odd call, out of the blue, a very senior person asking whether I was familiar with the file formats from SmartSuite and Microsoft Office. Evidently, no one else in the company would admit to having that arcane knowledge. So, I was drafted onto a “special project,” with a few other talented engineers, a real fun group working on various stealthy tasks, the details of which I am still not at liberty to discuss.

Somewhat overlapping the above, I worked on the things that readers of this blog will be more familiar with, the development of the OpenDocument Format (ODF) standard at OASIS and ISO, and the arguments against ISO ratification of Microsoft’s Office Open XML (OOXML) file format. This then overlapped, in part, with my work to establish the OpenOffice project at Apache, based on Oracle’s contribution, to get IBM Symphony contributed as well, and to bring those two efforts together.

Those years were among the most memorable of my career. I was able to work with a lot of talented and enthusiastic people, within IBM, of course, but also at other companies, with non-profits, with academia and government. I was able to travel and see parts of the world I might never have otherwise seen, and to speak to a lot of audiences about the importance of open standards. I even testified to a few legislative committees. My business travels took me to Brussels, Berlin, Budapest, Barcelona, Granada, London, Paris, Lyon, Rome, Orvieto, Geneva, Amsterdam, the Hague, Beijing, Seoul and Johannesburg. It was a lot of hard work, but it was meaningful. Open standards and open source matter. I have many fond memories of those years.

Eventually, however, corporate interest in document editors, document standards, “social documents” and similar initiatives fizzled, and I no longer had support for remaining involved in ODF and OpenOffice. I needed to move on, to find a new gig.

I looked within IBM for something that would combine my hard technical skills and my soft skills, including working closely with attorneys, an ability to “meet them halfway” when discussing complicated legal/technical topics. Since I’ve been an active inventor throughout my IBM career, with 54 patents to my name, and a good head for reading and analyzing patents, I spent a few years working as a patent engineer, helping to monetize IBM’s vast patent portfolio, developing technical evidence for infringement, identifying possibilities for patent licensing and assignment, etc.

That’s where things stood as of today, when I handed in my badge and laptop.

As for what is next, I honestly cannot yet say what “Rob 2.0” will be. I plan on taking some time to mull things over and explore my options.

One thing I do plan to do, relatively soon, is start a new blog, a fresh start, on a new path at this domain, preserving this older blog at its current (/blog) URL.

A Meditation upon Things

Rob — Tue, 01 Dec 2015 02:52:54 +0000

A Meditation upon Things in which I will briefly speak of the Icelandic Parliament, “creature features” of the 1960’s, Cicero, Duke Ellington, Shakespeare and excessive pedantry.

The Wall Street Journal yesterday had a short piece by James R. Hagerty that raised my ire: “Use More Expressive Words!” Teachers Bark, Beseech, Implore. The article describes teachers who have banned certain words from student assignments, like “go,” “said” and “good” because they are considered insufficiently expressive. This attitude is not new, of course. It certainly existed when I was a young student. I suspect it was first promoted by Peter Mark Roget’s publisher, to spur sales.

Among the condemned is the word “thing.” I’d like to make a plea, if one can be entertained at this late date, for full pardon, and show that although “thing,” like all words (even big words), can be used in insipid ways by mediocre authors, it is also capable of great delicacy, truly a word to be cherished, not discarded.

Let’s start with a simple, familiar example, the legendary song, “It don’t mean a thing (if it ain’t got that swing)” with music by Duke Ellington, words by Irving Mills. How can one write this without using “thing” or “anything”? We could try, “Your music will not have popular or critical acclaim if it lacks rhythmic syncopation in the current vernacular,” but this is hardly an improvement.

Of course, we don’t need to stray from the King’s English to have a fling with a thing. The Bard himself used this forbidden word to good effect, in Julius Caesar, Act I, Scene 1, where Marullus berates the hoi polloi for turning out for Caesar’s triumph: “You blocks, you stones, you worse than senseless things!”

It is worth mentioning, in passing, that Caesar himself would have been familiar with “thing” in its Latin form, “res”. In Latin the term has a dual meaning, “thing” but also “affair” or “deed.” So his adopted nephew, Octavian would later have inscribed in bronze his “Res Gestae Divi Augusti” (Things Accomplished of the Divine Augustus). One of Caesar’s enemies, Marcus Cicero, wrote a book called De Re Publica or “of the public thing”, maybe better translated as “concerning public affairs” or, in the word as it has come down to us, “On the Republic.”

Some of that flavor lingers on in English today. You might say you cannot accept an invitation because, “I have a thing next week.” The British English (forgive the redundancy) “husting” (what Americans might call a “stump speech”) was literally the “house thing” or a small deliberative assembly. In modern Icelandic (forgive the oxymoron) they have the Althing or “all thing”, their legislative general assembly.

“A thing may be incredible and still be true; sometimes it is incredible because it is true,” as Herman Melville said. Well before Melville, and well after, things that go bump in the night have been called…well…things.

Consider Hamlet, Act I, Scene 1: “What, ha’s this thing appear’d againe to night?” And then consider all the horrible horror movies that haunted movie screens (and UHF television channels) in decades long past, such as:

The Thing from Another World (1951)
The Thing That Couldn’t Die (1958)
Godzilla versus the Thing (1964)
Zontar: The Thing from Venus (1966)
The Thing with Two Heads (1972)

Thus the case for thing. Perhaps you care to suggest some other examples that illustrate the versatility and vigor of “thing”?

There are other words that the pedants despise that I contrariwise cherish. Perhaps, next time I will give a few thoughts on another word that has “the right stuff.”

Eldnar Randle: Did he know?

Rob — Wed, 26 Aug 2015 13:23:46 +0000

Eldnar Randle was born in Delano, California in 1892 and died in 1973 in Oregon. For most of his working years he was an auto mechanic. But he had a distinction shared by only 1 in over 700,000 Americans. Any guesses? A clue: Look closely at his name.

Yes, Eldnar Randle was given a palindromic name. It reads the same backwards and forwards. This phenomenon is quite rare. A search of the 88 million names in the Social Security Master Death File (SSMDF) shows only 119 cases, including:

Leon Noel (many examples)
Welles Sellew
Grey Yerg
Ekard Drake
Ronoel Leonor
Rello Oller
Nilrah Harlin
Nella Allen
Revilo Oliver
Ronnoc Connor
Folke Eklof
Marlys Sylram
Elah Hale
Gnal Lang
Lemar Ramel
Ecallaw Wallace
Rednal Lander
Ellen Nelle
Oirolf Florio
Italo Olati
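For readers who want to try this on their own list of names, here is a minimal Python sketch of the test. It assumes a name counts as palindromic when its letters, ignoring case, spaces and punctuation, read the same in both directions; how the original SSMDF search handled middle names, hyphens or suffixes is not stated, and the function name `is_palindromic_name` is just an illustrative choice.

```python
def is_palindromic_name(full_name: str) -> bool:
    """True if the letters of the name read the same in both directions.
    Case, spaces and punctuation are ignored (an assumed convention)."""
    letters = [ch.lower() for ch in full_name if ch.isalpha()]
    return len(letters) > 0 and letters == letters[::-1]


# A few names from the list above, plus one ordinary placeholder name.
names = [
    "Eldnar Randle",
    "Leon Noel",
    "Ecallaw Wallace",
    "Nella Allen",
    "Jane Doe",
]
palindromes = [name for name in names if is_palindromic_name(name)]

# Rough rarity implied by the figures quoted above: 119 hits in 88 million.
one_in = 88_000_000 / 119
```

Here `palindromes` keeps every name except "Jane Doe", and `one_in` comes out a little under 740,000, in line with the "1 in over 700,000" figure.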

The question that came to mind was, how many of these were intentional, picked by the parents specifically to be palindromes, and which ones were just pure chance? Given names are often picked to honor some relative, often a parent or grandparent. Picking an unusual name, never used in the family before, probably has a story behind it. Some of the names certainly look a bit far-fetched. Ecallaw Wallace? But others sound quite natural, like Nella Allen. And Eldnar Randle? It is hard to tell. Looking at the 1900 census I see his father was a farm laborer and his mother a housewife. Both were literate. None of the other children had unusual names. But somehow he received the invented name “Eldnar.”

Riew Weir? No, I don’t think that would have worked.

Are there any other examples of word play in names that are worth looking for among the 88 million names in the SSMDF? Anagrams? Something else?

Analysis of World Chess Champion Opening Repertoires

Rob — Wed, 25 Feb 2015 14:00:37 +0000

A quick test run of the FactoMineR package for R. This package focuses on multivariate exploratory data analysis, such as Principal Component Analysis (for numerical data) and Correspondence Analysis (for categorical data).

In an earlier blog post I took a look at a large collection of chess games and tried to quantify the “first move” advantage in chess, in terms of ratings. This time I’ll use the same large database of chess games, and look at opening repertoires. A chess opening is a set of moves that a player uses at the start of the game in an attempt to steer the game to positions familiar to the player, and which align with that player’s style and preferences. Such openings have descriptive, often colorful names, like King’s Gambit, Sicilian Poisoned Pawn, or Nimzo-Indian Defense, as well as a standard code, from the Encyclopedia of Chess Openings, like B07, C44 and E80. There are 500 such “ECO” codes, from A00 to E99.

I extracted games from all World Chess Champions, from Steinitz (1866) to Carlsen (2014) and calculated the percentage of the games for each player in each ECO code. So each player’s opening repertoire is represented as a vector of 500 weights, summing to 1.0. I then used FactoMineR’s PCA() method to extract principal components from this dataset. The first two components extracted together represent around 42% of the total variance.
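To make the data preparation concrete, here is a small Python sketch of turning one player's games into the 500-weight vector described above. It is an illustration only, not the code used for the post (which worked in R): the function name `repertoire_vector` and the toy list of games are made up, and the PCA step itself is left to a library such as FactoMineR.

```python
from collections import Counter
from typing import List

# The 500 ECO codes, A00 through E99, in a fixed order so that every
# player's vector has the same layout.
ECO_CODES = [f"{letter}{number:02d}" for letter in "ABCDE" for number in range(100)]


def repertoire_vector(game_codes: List[str]) -> List[float]:
    """Fraction of a player's games in each ECO code. The weights sum to 1.0.
    Codes outside A00..E99 are ignored (an assumption about data cleaning)."""
    counts = Counter(game_codes)
    total = sum(counts[code] for code in ECO_CODES)
    if total == 0:
        raise ValueError("no games with a recognized ECO code")
    return [counts[code] / total for code in ECO_CODES]


# Four made-up games for a hypothetical player.
toy_games = ["C44", "C44", "B07", "E80"]
vector = repertoire_vector(toy_games)
```

Stacking one such vector per champion gives the players-by-openings table that a PCA routine takes as input.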

Plotting the Champions against these two dimensions shows some intriguing patterns, bringing together players by era:

Further insights can be gleaned by plotting how these two components weight the various openings. To make it easier to read I grouped some of the ECO codes and used descriptive names for the better-known openings. From this we see that the first component appears to distinguish the player’s use of open games (1.e4 e5) in the positive direction versus semi-open and closed games in the negative direction. I’m having a harder time reading a real-world meaning into the second component. Maybe a reader sees something here?

Something to remember in all of this is that the choice of opening in a game is a result of the moves of both players. Players try to influence the opening, steer the game toward their advantages and preparations and against those of their opponents. But neither player has 100% control over the opening, aside from some fringe moves like 1. h4. However, players, especially world-class caliber players, do specialize in certain opening systems, and it is fair to speak of their repertoires.

Update:

The comment from Dana Mackenzie prompted me to try out another feature of FactoMineR, the ability to chart supplemental variables. These are variables that are not used in doing the underlying PCA calculation but can be shown in the charts, to see how they align with the extracted components. For example, I could add a categorical variable for each player to represent their nationality and then plot that, to see if there are national schools of practice regarding openings. Or, as I’ll do here, add a year variable, the year the individual won their world championship, to see how this aligns:

We can see by the length of the line here that the Year has a strong correlation with these two components, mostly with the 1st component.

The Power of Brand and the Power of Product Redux

Rob — Tue, 28 Oct 2014 20:36:32 +0000

Last year I did a three-part blog (“The Power of Brand and the Power of Product”) describing a simple model of product adoption and market share, and showed how the parameters of that model could be determined using a single survey question. I used the open source productivity suites, OpenOffice and LibreOffice, as examples. It is now time to update that analysis with the most-recent survey data. (If you want to look up the original posts, here are the links: part one, part two, part three).

To recap the methodology, I conducted a survey using Google’s Consumer Survey service, which uses sampling and post-stratification weighting to match the target population, which in this case was the U.S. internet population. In other words, the survey is weighted to reflect the population demographics, for age, sex, region of the country, urban versus rural, income, etc.

The question in the survey was:

What is your familiarity with the software application called “OpenOffice”?

I have never heard of it
I am aware of it but have never used it
I have tried it once
I use it only sometimes
I use it on a regular basis

With 1502 responses, the results were:

I have never heard of it	61.3%
I am aware of it but have never used it	13.3%
I have tried it once	7.6%
I use it only sometimes	10.3%
I use it on a regular basis	7.5%

The same question was asked about LibreOffice, with results:

I have never heard of it	82.3%
I am aware of it but have never used it	5.8%
I have tried it once	4.4%
I use it only sometimes	3.1%
I use it on a regular basis	4.3%

Now these numbers are somewhat interesting on their own, but what is far more interesting are the derived metrics, which look at things like:

What is the name recognition of the product?
Of those who have heard of the product, what percentage actually give it a try? This is a measure of marketing effectiveness.
Of those who have tried the product, what percentage actually continue to use it? This is a measure of user satisfaction.
What percentage of all respondents use the product? This is a measure of market share.

Full details on how these other metrics are calculated, from this single survey question, can be found in Part One of this series.
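As a rough illustration of how the four metrics can fall out of the five answer percentages, here is a short Python sketch. The formulas are an assumed reading of the descriptions above, not a copy of the Part One definitions, and the function and key names are made up for this example. Applied to the OpenOffice figures, they give about 39% name recognition and about 18% user share.

```python
from typing import Dict


def derived_metrics(
    never: float, aware: float, tried: float, sometimes: float, regular: float
) -> Dict[str, float]:
    """Derive the four metrics from the five answer percentages.
    These formulas are an assumed reading of the descriptions above."""
    total = never + aware + tried + sometimes + regular
    heard = aware + tried + sometimes + regular  # everyone who knows the name
    tried_any = tried + sometimes + regular      # everyone who gave it a try
    users = sometimes + regular                  # everyone still using it
    return {
        "name_recognition": heard / total,
        "trial_rate": tried_any / heard,      # marketing effectiveness
        "retention_rate": users / tried_any,  # user satisfaction
        "user_share": users / total,          # market share
    }


# Percentages from the two result tables above.
openoffice = derived_metrics(61.3, 13.3, 7.6, 10.3, 7.5)
libreoffice = derived_metrics(82.3, 5.8, 4.4, 3.1, 4.3)
```

Dividing by the sum of the answers, rather than by 100, keeps the ratios sensible when rounding leaves a table a tenth of a point short, as with the LibreOffice figures.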

Here are some charts to show how these metrics have evolved over the 2 1/2 years I’ve worked with this survey approach:

Those who know me know that I am partial to OpenOffice, an open source project that I contribute to. So I am extremely pleased to see it continue to advance on all fronts. Since coming to Apache, OpenOffice’s name recognition has grown from 24% to 39% and the user share has grown from 11% to 18%, while keeping user satisfaction constant. This is a testament to the hard work of the many talented volunteers at Apache.

ISO/IEC JTC1 Approves ODF 1.2 PAS Ballot

Rob — Wed, 17 Sep 2014 15:22:36 +0000

OASIS ODF 1.2, the current version of the Open Document Format standard, was approved by ISO/IEC JTC1 National Bodies after a 3-month Publicly Available Specification (PAS) ballot. The final vote for DIS 26300 was: 17-0 for Parts 1 and 2, and 18-0 for Part 3.

Of course, this is a very good result, and all those involved, whether TC members and staff at OASIS, implementors, adopters, or promoters of ODF and open standards in general, should be pleased and proud of this accomplishment.

This was a team effort, obviously, and I’d like to give special thanks to Patrick Durusau and Chris Rae on the ODF TC for their special efforts preparing the PAS submission for ballot, Jamie Clark from OASIS for putting together the submission package and Francis Cave, Alex Brown, Murata Makoto and Keld Simonsen in JTC1/SC34/WG6 for their continued advice, feedback and support.

Since comments were received from Japan and the UK, we now start the comment disposition process. The SC34 Secretariat will determine whether a Ballot Resolution Meeting (BRM) is required, or whether the comments can simply be handed to the Project Editor for application to the specification prior to publication. One way or another, there will be a little more work before publication of the ODF 1.2 International Standard.

The OASIS ODF TC continues work on ODF 1.3, with renewed vigor. After nearly a decade of involvement with ODF, and many years leading the committee, I’ve stepped down. The TC has elected Oliver-Rainer Wittmann, a long-time TC member, ODF implementor and a familiar face at ODF Plugfests, to take over. I’m currently exploring other areas related to open innovation (open standards, open source, open data, open APIs). If you know of anything interesting, let me know: https://linkedin.com/in/rcweir.

An inquiry into the topological ordering of casual American male dress

Rob — Fri, 08 Aug 2014 01:16:38 +0000

The dressing and arming of a warrior is a common set scene in epic poetry, e.g., Iliad 2:

He put on a soft khiton,
fine and newly made, and put around himself a great cloak.
Under his shining feet he fastened fine sandals
and around his shoulders he placed a silver-studded sword.
He took up the ancestral scepter which is always unwilting.

The structure and contents of such scenes have been well-studied by scholars, e.g., Armstrong 1958, and even parodied, as in Pope’s mock epic The Rape of the Lock:

Now awful Beauty puts on all its Arms;
The Fair each moment rises in her Charms,
Repairs her Smiles, awakens ev’ry Grace,
And calls forth all the Wonders of her Face;
Sees by Degrees a purer Blush arise,
And keener Lightnings quicken in her Eyes.
The busy Sylphs surround their darling Care;
These set the Head, and those divide the Hair,
Some fold the Sleeve, while others plait the Gown;
And Betty’s prais’d for Labours not her own.

However, the dressing of the 21st Century casual American male appears to lack rigorous analysis, a deficiency I hope to remedy, at least in the area of furthering understanding of the dependency constraints of this activity.

It is well-known that underpants must be donned before pants. Despite the intriguing experimentation by Rowan Atkinson no practical alternative has been found. Similarly, socks must be put on before shoes, pants before shoes, and both pants and shirt before the belt can be buckled.

Illustrating the topological ordering as a directed graph, we have the following:

Dependency analysis

Within these constraints many dress orderings are possible, some of the more common ones being:

underwear, socks, pants, shirt, shoes, belt
underwear, pants, shirt, belt, socks, shoes
underwear, shirt, pants, socks, belt, shoes

Orderings like the above are familiar to most people. However, there are many other possibilities, some perhaps worthy of further exploration:

socks, shirt, underwear, pants, shoes, belt
shirt, socks, underwear, pants, belt, shoes
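The constraints stated earlier (underwear before pants, socks and pants before shoes, pants and shirt before belt) are easy to check mechanically. The Python sketch below is a brute-force illustration, with names such as `is_valid_order` chosen for this example. It treats the two socks as a single item and simply tests each of the 720 permutations of the six garments, which is fine at this scale; a proper topological sort would be the usual tool for larger wardrobes.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

# (before, after) pairs taken from the dependencies described in the text.
CONSTRAINTS = [
    ("underwear", "pants"),
    ("socks", "shoes"),
    ("pants", "shoes"),
    ("pants", "belt"),
    ("shirt", "belt"),
]
ITEMS = sorted({item for pair in CONSTRAINTS for item in pair})


def is_valid_order(order: Sequence[str]) -> bool:
    """True if every 'before' garment precedes its 'after' garment.
    Assumes the order lists each of the six garments exactly once."""
    position = {item: index for index, item in enumerate(order)}
    return all(position[before] < position[after] for before, after in CONSTRAINTS)


def all_valid_orders() -> List[Tuple[str, ...]]:
    """Every ordering of the six garments that respects the constraints."""
    return [order for order in permutations(ITEMS) if is_valid_order(order)]


# The five single-pair-of-socks orderings listed above.
listed = [
    ("underwear", "socks", "pants", "shirt", "shoes", "belt"),
    ("underwear", "pants", "shirt", "belt", "socks", "shoes"),
    ("underwear", "shirt", "pants", "socks", "belt", "shoes"),
    ("socks", "shirt", "underwear", "pants", "shoes", "belt"),
    ("shirt", "socks", "underwear", "pants", "belt", "shoes"),
]
all_listed_valid = all(is_valid_order(order) for order in listed)
```

Each of the five orderings passes the check, and `len(all_valid_orders())` reports how many more await the adventurous dresser.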

It will also be appreciated by those practiced in the art that the two socks need not be put on together. This permits extravagant orderings like:

left sock, shirt, underwear, pants, belt, right sock, shoes
right sock, underwear, pants, left sock, shoes, shirt, belt

There is also nothing that prevents a Towers of Hanoi approach for those with time to kill, where -X indicates that X is to be removed:

pants, shoes, shirt, -shoes, socks, -pants, underwear, pants, shoes, belt

Hopefully the above gives ideas for further exploration and experimentation. Although we do not dress and arm ourselves to fight the Trojans, our morning ritual can be equally an epic experience!

Document as Activity versus Document as Record

Rob — Thu, 31 Jul 2014 20:08:35 +0000

I’ve been thinking some more on the past, present and future of documents. I don’t know exactly where this post will end up, but I think this will help me clarify some of my own thoughts.

First, I think technology has clouded our thinking and we’ve been equivocating with the term “document”, using it for two entirely different concepts.

One concept is of the document as the way we do work, but not an end-in-itself. This is the document as a “collaboration surface”, short-lived, ephemeral, fleeting, quickly created and equally quickly forgotten.

For example, when I create a few slides for a project status report, I know that the presentation document will never be seen again, once the meeting for which it was written has ended. The document serves as a tool for the activity of presenting status, of informing. Twenty years ago we would have used transparencies (“foils”) or sketched out some key points on a blackboard. And 10 years from now, most likely, we will use something else to accomplish this task. It is just a coincidence that today the tools we use for this kind of work also act like WYSIWYG editors and can print and save as “documents”. But that is not necessary, and historically was not often the case.

Similarly, take a spreadsheet. I often use a spreadsheet for a quick ad-hoc “what-if” calculation. Once I have the answer I am done. I don’t even need to save the file. In fact I probably load or save a document only 1 in 5 times that I launch the application. Sometimes people use a spreadsheet as a quick and dirty database. But 20 years ago they would have done these tasks using other tools, not document-oriented, and 10 years from now they may use other tools that are equally not document related. The spreadsheet primarily supports the activity of modeling and calculating.

Text documents have myriad collaborative uses today, but other tools have emerged as well. Collaboration has moved to other non-document interfaces, tools like wikis, instant messaging, forums, etc. Things that would have required routing a typed inter-office memo 50 years ago are now done with blog posts.

That’s one kind of document, the “collaboration surface”, the way we share ideas, work on problems, generally do our work.

And then there is a document as the record of what we did. This is implied by the verb “to document”. This use of documents is still critical, since it is ingrained in various regulatory, legal and business processes. Sometimes you need “a document.” It won’t do to have your business contract on a wiki. You can’t prove conformance to a regulation via a Twitter stream. We may no longer print and file our “hard” documents, but there is a need to have a durable, persistable, portable, signable form of a document. PDF serves well for some instances, but not in others. What does PDF do with a spreadsheet, for example? All the formulas are lost.

This distinction, between these two uses of documents, seems analogous to the distinction between Systems of Engagement and Systems of Record, and can be considered in that light. It just so happens that both concepts were served by the same technology, the same tools, circa the year 2000, but in general these two concepts are very different.

The obvious question is: What will the future bring? How quickly does our tool set diverge? Do we continue with tools that compromise, that hold back collaborative features because they must also serve as tools to author document records? Or do we unchain collaborative tools and allow them to focus on what they do best?

Announcing OpenLibreOffice

Rob — Tue, 01 Apr 2014 07:10:30 +0000

2014-04-01

The Internet

The Apache OpenOffice project and The Document Foundation are pleased to announce that an agreement has been made to combine resources and jointly develop a next-generation open source office suite, to be called “OpenLibreOffice” (except in France where it will be called “LibreOfficeOpen”). OpenLibreOffice will be quad licensed under the ALv2, MPL, LGPL and WTFPL licenses, so programmers can maximize their ability to express fine distinctions about copyright law. Similarly, source code for OpenLibreOffice will be made available in C++, C#, Java and Ruby, for the benefit of attorneys who wish to make fine distinctions about type checking.

“Some people eat meat. Some are vegetarians. Some are vegan, and won’t even eat eggs or cheese,” said Michael Meeks of Koolibra. “These distinctions are important to how we look at ourselves. The choice of open source license gives us each an opportunity to feel morally superior, which is the primary joy of open source development.”

This new joint effort brings an end to the brief fork that had disrupted development of the decade-old OpenOffice project and led to a passionate contest to see which project would fail the slowest. As former TDF Board Member Charles Schulz recalls:

The fork originated over a disagreement over the color of icons in the toolbar. Or something like that. I don’t really remember. It was 2011 and everyone was protesting for something. ‘Occupy OpenOffice’ didn’t sound right, so we just called it ‘LibreOffice’. It was intended to be a placeholder name. We were hoping, after a suitable period of insults and ridicule, that Oracle would just give us the trademark for OpenOffice. For unknown reasons, likely involving IBM, the Military-Industrial Complex and the Trilateral Commission, that plan didn’t work. By the time we realized that no one outside of France and Spain knew how to pronounce ‘LibreOffice’, it was too late.

LibreOffice shipped 68 releases over the 4 year duration of their fork, fixing over 1673 bugs and introducing only 1532 new bugs, making it the most productive, though least efficient, open source project of all time. Apache has made only two releases in the last year, taking the “principle of least astonishment” to new levels.

Apache OpenOffice Poo-Bah Rob Weir applauded news of the announcement:

Users will quickly benefit from the combined engineering effort on OpenLibreOffice. But even greater things await the public when the marketing efforts combine and 100 million downloads of OpenOffice get transformed into colorful infographics showing 20 billion IP addresses or abstract videos of flashing lights accompanied by jazz flute music.

In related news, Microsoft released a new policy paper suggesting that open source software was partially responsible for European economic woes, due to the lack of VAT revenue, and proposed a special new surtax on open source software, “in the interest of fairness and open competition”.

###

ODF 1.2 Submitted to ISO

Rob — Mon, 31 Mar 2014 15:50:42 +0000

Last Wednesday, March 26th, on Document Freedom Day, OASIS submitted the Open Document Format 1.2 standard to the ISO/IEC JTC1 Secretariat for transposition to an International Standard under the Publicly Available Specification (PAS) procedure.

If you recall, the PAS procedure is what we used back in 2005 when ODF 1.0 was submitted to ISO and was approved as ISO/IEC 26300. ODF 1.1 used a different procedure and was processed as an amendment to ISO/IEC 26300. Since ODF 1.2 is a much larger delta to the previous version it makes sense to take it through the PAS procedure again.

The PAS transposition process starts with a two-month “translation period” when National Bodies may translate the ODF 1.2 specification if they wish. This is then followed by a three-month ballot. Following a successful ballot, any comments received are reviewed by all stakeholders and resolutions are determined at a Ballot Resolution Meeting (BRM).

I am notoriously bad at predicting the pace of standards development, but if you add up the steps of the process, this looks like a ballot ending in Q4 and a BRM around year’s end.

The Words Democrats and Republicans Use

Rob — Fri, 07 Feb 2014 15:06:55 +0000

It came to me after listening to the State of the Union Address: Can we tell whether a speech was from a Democrat or a Republican President, purely based on metrics related to the words used? It makes sense that we could. After all, we can analyze emails and detect spam that way. Automatic text classification is a well-known problem. On the other hand, presidential speeches go back quite a bit. Is there any commonality between a speech by a Democrat in 2014 and one from 1950? Only one way to find out…

I decided to limit myself to State of the Union (SOTU) addresses, since they are readily available, and only those post WW II. There has been a significant shift in American politics since WW II so it made sense, for continuity, to look at Truman and later. If I had included all of Roosevelt’s twelve (!) SOTU speeches it might have distorted the results, giving undue weight to individual stylistic factors. So I grabbed the 71 post WWII addresses and stuck them into a directory. I included only the annual addresses, not any exceptional ones, like G.W. Bush’s special SOTU in September 2001.

I then used R’s text mining package, tm, to load the files into a corpus, tokenize, remove punctuation, stop words, etc. I then created a document-term matrix and removed any terms that occurred in fewer than half of the speeches. This left me with counts of 610 terms in 71 documents.
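The preprocessing described above can be sketched in a few lines. This is not the original R/tm script. It is a minimal Python illustration under stated assumptions: the stop-word list is a tiny placeholder, the four "speeches" are made-up sentences, and `tokenize` and `document_term_matrix` are hypothetical helper names.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real run needs a fuller one.
STOP_WORDS = {"the", "and", "of", "to", "a", "in", "we", "our", "is"}

def tokenize(text):
    """Lower-case the text, keep alphabetic runs only, drop stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def document_term_matrix(docs, min_doc_fraction=0.5):
    """Return (terms, rows): per-document counts for the terms that occur
    in at least min_doc_fraction of the documents."""
    counts = [Counter(tokenize(d)) for d in docs]
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for c in counts for term in c)
    keep = sorted(t for t, df in doc_freq.items()
                  if df >= min_doc_fraction * len(docs))
    rows = [[c[t] for t in keep] for c in counts]
    return keep, rows

# Made-up stand-ins for the speech files.
speeches = [
    "We must cut taxes and grow the economy.",
    "Our economy is strong; taxes fund our schools.",
    "Peace and freedom guide the economy.",
    "Freedom requires strong schools.",
]
terms, rows = document_term_matrix(speeches)
print(terms)
for row in rows:
    print(row)
```

Each row is one document and each column one surviving term, which is the shape of the 71 x 610 matrix described above.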

Then came the fun part. I decided to use Pointwise Mutual Information (PMI), an information-theoretic measure of association used in information retrieval, to look at the association between terms in the speeches and party affiliation. PMI shows the degree of association (or “co-location”) of two terms while also accounting for the prevalence of each term individually. Wikipedia gives the formula, which is pretty much what you would expect. Calculate the log probability of the co-location and subtract out the log probability of the background rate of the term. But instead of looking at the co-occurrence of two terms, I tried looking at the co-occurrence of terms with the party affiliation. For example, the PMI of “taxes” with the class Democrat would be: log p(“taxes”|Democrat) – log p(“taxes”). You can see my full script for the gory details.
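To make the term-versus-class PMI concrete, here is a minimal Python sketch. It is not the linked script. It assumes probabilities are estimated from raw token counts (one of several reasonable choices), and `pmi_by_class` and the toy token lists are hypothetical.

```python
import math
from collections import Counter

def pmi_by_class(docs):
    """docs: list of (label, tokens). Returns {label: {term: PMI}} with
    PMI = log p(term | label) - log p(term), estimated from token counts.
    Terms that never occur in a class are omitted (their PMI is -infinity)."""
    overall = Counter()
    per_class = {}
    for label, tokens in docs:
        overall.update(tokens)
        per_class.setdefault(label, Counter()).update(tokens)
    total = sum(overall.values())
    result = {}
    for label, counts in per_class.items():
        class_total = sum(counts.values())
        result[label] = {
            term: math.log(n / class_total) - math.log(overall[term] / total)
            for term, n in counts.items()
        }
    return result

# Toy data, not taken from any real speech.
docs = [
    ("Democrat", ["jobs", "taxes", "health"]),
    ("Republican", ["taxes", "freedom", "taxes"]),
]
scores = pmi_by_class(docs)
for label, terms in scores.items():
    ranked = sorted(terms.items(), key=lambda kv: kv[1], reverse=True)
    print(label, [(t, round(v, 3)) for t, v in ranked])
```

A positive score means the term is more common in that party's speeches than in the corpus as a whole. A negative score means it is less common.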

Here’s what I got, listing the 25 highest PMI terms for Democrats and Republicans:

So what does this all mean? First note the difference in scale. The top Republican terms had higher PMI than the top Democrat terms. In some sense it is a political Rorschach test. You’ll see what you want to see. But in fairness to both parties I think this does accurately reflect their traditional priorities.

From the analytic standpoint the interesting thing I notice is how this compares to other approaches, like using classification trees. For example, if I train the original data with a recursive partitioning classification tree, using rpart, I can classify the speeches with 86% accuracy by looking at the occurrences of only two terms:

Not a lot of insight there. It essentially latched on to background noise and two semantically useless words. So I prefer the PMI-based results since they appear to have more semantic weight.

Next steps: I’d like to apply this approach back to speeches from 1860 through 1945.

First Move Advantage in Chess

Rob — Mon, 27 Jan 2014 14:57:05 +0000

The Elo Rating System

Competitive chess players, at the amateur club level all the way through the top grandmasters, receive ratings based on their performance in games. The ratings formula in use since 1960 is based on a model first proposed by the Hungarian-American physicist Arpad Elo. It uses a logistic equation to estimate the probability of a player winning as a function of that player’s rating advantage over his opponent:

$$E = \frac{1}{1 + 10^{-\Delta R/400}}$$

So for example, if you play an opponent who out-rates you by 200 points then your chances of winning are only 24%.
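The 24% figure can be checked directly from the formula. A minimal sketch follows; the function name `expected_score` is my own, and strictly speaking the value is an expected score (win = 1, draw = 0.5) rather than a pure win probability.

```python
def expected_score(rating_diff):
    """Elo expected score for a player whose rating minus the opponent's
    rating equals rating_diff."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

for diff in (-200, 0, 200):
    print(diff, round(expected_score(diff), 3))
```

Note the symmetry: the expected scores of the two players always sum to 1.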

After each tournament, game results are fed back to a national or international rating agency and the ratings adjusted. If you scored better than expected against the level of opposition played your rating goes up. If you did worse it goes down. Winning against an opponent much weaker than you will lift your rating little. Defeating a higher-rated opponent will raise your rating more.

That’s the basics of the Elo rating system, in its pure form. In practice it is slightly modified, with ratings floors, bootstrapping new unrated players, etc. But that is its essence.

Measuring the First Mover Advantage

It has long been known that the player that moves first, conventionally called “white”, has a slight advantage, due to their ability to develop their pieces faster and their greater ability to coax the opening phase of the game toward a system that they prefer.

So how can we show this advantage using a lot of data?

I started with a Chessbase database of 1,687,282 chess games, played from 2000-2013. All games had a minimum rating of 2000 (a good club player). I excluded all computer games. I also excluded 0 or 1 move games, which usually indicate a default (a player not showing up for an assigned game) or a bye. I exported the games to PGN format and extracted the metadata for each game to a CSV file via a Python script. Additional processing was then done in R.
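The metadata extraction step might look something like the following. This is a sketch, not the script actually used. It relies only on the standard PGN convention that each game begins with bracketed tag pairs such as `[WhiteElo "2312"]`; the sample games, the chosen columns and the helper names are all made up for illustration.

```python
import csv
import io
import re

TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')
FIELDS = ["White", "Black", "WhiteElo", "BlackElo", "Result"]

def pgn_headers(lines):
    """Yield one dict of tag pairs per game from an iterable of PGN lines."""
    game, seen_moves = {}, False
    for line in lines:
        line = line.strip()
        match = TAG_RE.match(line)
        if match:
            if seen_moves:  # tags appearing after movetext: a new game begins
                yield game
                game, seen_moves = {}, False
            game[match.group(1)] = match.group(2)
        elif line:
            seen_moves = True
    if game:
        yield game

def headers_to_csv(lines, out):
    """Write the selected tag values of every game as one CSV row."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for game in pgn_headers(lines):
        writer.writerow({f: game.get(f, "") for f in FIELDS})

# Two invented games, for illustration only.
sample = '''[Event "Example Open"]
[White "Player, A"]
[Black "Player, B"]
[WhiteElo "2312"]
[BlackElo "2309"]
[Result "1/2-1/2"]

1. e4 e5 2. Nf3 Nc6 1/2-1/2

[Event "Example Open"]
[White "Player, C"]
[Black "Player, D"]
[WhiteElo "2400"]
[BlackElo "2250"]
[Result "1-0"]

1. d4 d5 1-0
'''
buffer = io.StringIO()
headers_to_csv(sample.splitlines(), buffer)
print(buffer.getvalue())
```

Reading only the tag section keeps the pass over a large PGN file cheap, since the moves themselves are never parsed.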

Looking at the distribution of ratings differences (white Elo-black Elo) we get this. Two oddities to note. First note the excess of games with a ratings difference of exactly zero. I’m not sure what caused that, but since only 0.3% of games had this property, I ignored it. Also there is clearly a “fringe” of excess counts for ratings that are exactly multiples of 5. This suggests some quantization effect in some of the ratings, but should not harm the following analysis.

The collection has results of:

1-0 (36.4%)
1/2-1/2 (35.5%)
0-1 (28.1%)

So the overall score, from white’s perspective was 54.2% (counting a win as 1 point and a draw as 0.5 points).
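As a quick arithmetic check, the score can be recomputed from the three percentages above. Working from the rounded figures gives about 54.15%, in line with the 54.2% quoted (which presumably came from the unrounded counts).

```python
# Shares of each result, as listed above (rounded to one decimal place).
results = {"1-0": 0.364, "1/2-1/2": 0.355, "0-1": 0.281}
# Points from white's perspective for each result.
points = {"1-0": 1.0, "1/2-1/2": 0.5, "0-1": 0.0}

white_score = sum(share * points[r] for r, share in results.items())
print(f"{white_score:.4f}")
```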

So white has a 4.2% first move advantage, yes? Not so fast. A look at the average ratings in the games shows:

mean white Elo: 2312
mean black Elo: 2309

So on average white was slightly higher rated than black in these games. A t-test indicated that the difference in means was significant at the 95% confidence level. So we’ll need to do some more work to tease out the actual advantage for white.

Looking for a Performance Advantage

I took the data and binned it by ratings difference, from -400 to 400, and for each difference I calculated the expected score, per the Elo formula, and the average actual score in games played with that ratings difference. The following chart shows black circles for the actual scores and a red line for the predicted score. Again, this is from white’s perspective. Clearly the actual score is above the expected score for most of the range. In fact white appears evenly matched even when playing against an opponent rated 35 points higher.

The trend is a bit clearer if we look at the “excess score”, the amount by which white’s results exceed the expected results. In the following chart the average excess score is indicated by a dotted line at y=0.034. So the average performance advantage for white, accounting for the strength of opposition, was around 3.4%. But note how the advantage is strongest where white is playing a slightly stronger player.
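The binning and excess-score calculation can be sketched as follows. The actual analysis was done in R. This Python version is only an illustration with four invented games, and the `bin_width` parameter and function names are assumptions of mine.

```python
import math
from collections import defaultdict

def expected_score(rating_diff):
    """Elo expected score for a rating advantage of rating_diff."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

def excess_by_difference(games, bin_width=1, max_diff=400):
    """games: iterable of (white_elo, black_elo, white_points), where
    white_points is 1.0, 0.5 or 0.0. Returns
    {bin_center: (actual, expected, excess)} from white's perspective."""
    bins = defaultdict(list)
    for white_elo, black_elo, white_points in games:
        diff = white_elo - black_elo
        if abs(diff) <= max_diff:
            # Snap the difference to the nearest bin center.
            center = math.floor(diff / bin_width + 0.5) * bin_width
            bins[center].append(white_points)
    table = {}
    for center in sorted(bins):
        actual = sum(bins[center]) / len(bins[center])
        expected = expected_score(center)
        table[center] = (actual, expected, actual - expected)
    return table

# Invented games, for illustration only.
games = [(2300, 2300, 1.0), (2300, 2300, 0.5),
         (2250, 2300, 0.5), (2250, 2300, 0.0)]
table = excess_by_difference(games)
for center, (actual, expected, excess) in table.items():
    print(center, round(actual, 3), round(expected, 3), round(excess, 3))
```

Averaging the excess column over all bins, weighted however you prefer, gives a single summary number comparable to the dotted line described above.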

Finally I looked at the actual game results, the distribution of wins, draws and losses, by ratings differences. The Elo formula doesn’t speak to this. It deals with expected scores. But in the real world one cannot score 0.8 in a game. There are only three options: win, draw or lose. In this chart you see the first mover advantage in another way. The entire range of outcomes is essentially shifted over to the left by 35 points.

The New Technology Consumers

Rob — Sun, 12 Jan 2014 16:58:54 +0000

There were those who complained about the labor conditions of those who picked grapes and sewed t-shirts. About pesticides on apples and growth hormones in milk. About genetically modified corn and soy. About how governments conduct foreign policy, how they treat prisoners of war, how they collect intelligence, how they make treaties and how they make war.

How dare mere consumers, the unwashed masses, the hoi polloi, have an opinion on such matters? Let those who know best determine what is in the public good.

I see open source and open standards activists in a similar way. Many consumers care not only about the direct good they receive from technology, but also about how that good was generated, whether from exploitative sweat labor, whether from environmentally invasive methods, and yes, whether by perpetuating software monopolies or damaging the ecosystem of open source and open standards.

What we’re seeing is a generation arising that is no longer content to worship at the altar of technology and follow the dictates of the high priests. They are not content to be fed whatever the industry gives them. They care not only about what something is and how it is used, but also about its impact on their bodies, the environment, culture and society.

To those who are unprepared this may appear confusing, irrational and even scary. Why aren’t the consumers content to accept our recommendations? Why are they complaining so much? For some kinds of business, those who do not adapt, this is a threat. And to others, this is an opportunity. Some will win and some will lose. Which will you be?

Apache OpenOffice 2013 Mailing List Review

Rob — Wed, 18 Dec 2013 16:30:58 +0000

I did a quick study of the 2013 mailing list traffic for the Apache OpenOffice project. I looked at all project mailing lists, including native language lists. I omitted the purely transactional mailing lists, the ones that merely echo code check-ins and bug reports. Altogether 14 mailing lists were included in this study.

In 2013 the OpenOffice community mailing lists saw 24,423 posts from 2,211 unique posters, in 4,819 threads.

A word cloud of the most frequent words in post titles (thanks to Jonathan Feinberg’s Wordle app) follows. As you can see, the terms used in the Propose/Approve/Code/Test/Release workflow rise to the top. That shows the project’s focus.

I thought it would also be interesting to look at this from a social network perspective, looking at the atomic unit of collaboration on a mailing list: responding to a post. Of course, not all posts involve a response. It is common for someone to post information, not requiring or expecting a response. But there are many responses. As mentioned above, there were 24,423 posts in 4,819 threads, so an average of about 4 responses per thread. We can represent this as a directed graph, with each poster treated as a node, and a directed arc to each responder node from the node of the original post author. (This might seem backwards, and you could argue for reversing the arcs, but in general in mailing lists the responder is providing value to the original poster, so the centrality of the responder will be more relevant. Consider, for example, the questions coming from random users, and the experienced project members who answer them.)
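The graph construction described above can be sketched like this. It is an illustration only: the poster names are invented, `reply_graph` and `in_degree` are hypothetical helpers, and dropping self-replies is my own assumption.

```python
from collections import defaultdict

def reply_graph(replies):
    """replies: iterable of (original_author, responder), one pair per reply.
    Returns {original_author: {responder: reply_count}}, so arcs run from
    the author of the post to each person who responded to it."""
    arcs = defaultdict(lambda: defaultdict(int))
    for author, responder in replies:
        if author != responder:  # assumption: self-replies are ignored
            arcs[author][responder] += 1
    return arcs

def in_degree(arcs):
    """For each responder, the number of distinct people they replied to."""
    degree = defaultdict(int)
    for author, responders in arcs.items():
        for responder in responders:
            degree[responder] += 1
    return dict(degree)

# Invented reply pairs, for illustration only.
replies = [
    ("newuser", "veteran"),
    ("newuser", "veteran"),
    ("other", "veteran"),
    ("veteran", "newuser"),
]
arcs = reply_graph(replies)
print(in_degree(arcs))
```

With arcs pointing at the responder, the people who answer many different posters accumulate high in-degree, which matches the reasoning given for the arc direction.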

Forming a graph in this way gives us a giant component (representing 98.84% of the whole graph) with 1,955 nodes and 7,069 arcs. Average degree (number of collaboration partners for each person) is 3.6. 46 people responded to more than 50 other people. Maximum degree is 714 (Apache OpenOffice V.P. Andrea Pescetti). A visualization of this graph (using the open source Gephi) follows. You can click on the image for a larger version. Nodes have been scaled to reflect betweenness centrality (a measure of the degree to which a node helps connect others into the graph) and colored via a modularity algorithm which finds sets of nodes that have a high degree of interconnection.
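For readers who want to reproduce the basic statistics without Gephi, here is a small sketch of finding the giant component. It treats the arcs as undirected (a weakly connected component), which is an assumption on my part, and the tiny arc list is invented.

```python
from collections import defaultdict, deque

def giant_component(arcs):
    """arcs: iterable of (source, target). Returns the largest weakly
    connected set of nodes, i.e. ignoring arc direction."""
    neighbours = defaultdict(set)
    for a, b in arcs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, best = set(), set()
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Invented arcs, for illustration only.
arcs = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
giant = giant_component(arcs)
all_nodes = {n for arc in arcs for n in arc}
print(sorted(giant), len(giant) / len(all_nodes))

# The average degree quoted above is consistent with arcs divided by nodes.
print(round(7069 / 1955, 1))
```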

What a marvelous, large and complex project we have in Apache OpenOffice!

IBM Support for Apache OpenOffice

Rob — Mon, 04 Nov 2013 13:25:38 +0000

As you probably know, IBM has been involved with the OpenOffice.org community for many years. This included collaboration on ODF and accessibility at first, as we worked on our separate Lotus Symphony fork. And then in 2011 we followed the OpenOffice.org community to Apache where Apache OpenOffice then took off. Since then we’ve been merging in features and bug fixes from Symphony, essentially ending the Symphony fork. The first results of this collaboration showed up in Apache OpenOffice 4.0, with the new side panel UI. The reception of this new release has been phenomenal. The release received great reviews, including a 2013 InfoWorld Best of Open Source (Bossie) award. The success of this release propelled us to recently hit a new download milestone: Over 75 million copies of Apache OpenOffice in less than 18 months since the first release of Apache OpenOffice.

The overall market for office productivity suites is changing. Microsoft Office 2003 is hitting End of Life in April 2014, causing companies still using it to explore other options. The introduction of new subscription models from Microsoft, as well as the emergence of new cloud-based editors from several players, including IBM, is also making customers reevaluate their dependency on Microsoft Office. Do we really need Office? For everyone? What are the alternatives?

I’m really pleased to see other parts of IBM starting to see the opportunities available with Apache OpenOffice. Integrations already publicly announced include IBM Connections, IBM SmartCloud and IBM ECM and Case Manager. (If there are other IBM products that you think would benefit greatly from integration, let me know!)

The latest, and most significant, enabler of enterprise use of Apache OpenOffice is our IBM Support for Apache OpenOffice offering. Although individual end-users and even small businesses can easily deploy Apache OpenOffice on their own (75 million downloads testifies to that), larger enterprises with more complicated and demanding needs benefit from the kind of expertise that IBM can provide. So I’m glad to see this offering available to fill out the ecosystem, so everyone can use and be successful with Apache OpenOffice, from individual university students, to small non-profits, to large international corporations.

The Power of Brand and the Power of Product, Part 3

Rob — Mon, 21 Oct 2013 15:11:22 +0000

In the previous two parts (one and two) I described a model of product adoption and market share that could be built with a single survey question. I applied this model to the open source productivity suites OpenOffice and LibreOffice, looking at adoption in September 2012 and April 2013.

The results were described in detail in the previous article in this series, but can be summarized as:

OpenOffice	September 2012	April 2013	Change
Customer Awareness	24.3%	27.6%	14% growth
Customer Motivation	63.0%	65.9%	5% growth
Customer Satisfaction	70.6%	68.7%	3% decline
Market Share	10.8%	12.5%	16% growth

Six months have now passed and it is worth taking another look to see how things have evolved. As I did previously, I used Google’s Consumer Survey service which uses sampling and post-stratification weighting to match the target population, which in this case was the US internet population. In other words, the survey is weighted to reflect the population demographics, for age, sex, region of the country, urban versus rural, income, etc. I did this survey in a personal capacity for my own interest. The Standard Disclaimer applies.

OpenOffice (N=1519)	September 2012	April 2013	September 2013	Change (September to September)
Customer Awareness	24.3%	27.6%	30.7%	26% growth
Customer Motivation	63.0%	65.9%	67.4%	7% growth
Customer Satisfaction	70.6%	68.7%	77.8%	10% growth
Market Share	10.8%	12.5%	16.1%	49% growth

So what do we see? Very nice results, indeed. The OpenOffice brand is strong and growing. Over 30% of consumers surveyed had heard of it. Of those who had heard of it, 67% had given it a try. That number has changed little. This is an opportunity for Apache OpenOffice marketing volunteers to improve both of these numbers. Of those who tried OpenOffice almost 78% continued to use OpenOffice. This is a modest increase, but there is certainly room to improve here. Put it all together, and the estimated user share, the percentage of US internet users who use OpenOffice “sometimes” or “regularly”, is 16.1%, nearly a 50% improvement year-over-year.
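The model itself is defined in the earlier parts of this series, but the figures in the tables are consistent with a simple funnel in which the three rates multiply: share = awareness x motivation x satisfaction. The sketch below assumes that reading (it is an inference from the numbers, not a restatement of the earlier articles), and `market_share` is a hypothetical helper.

```python
def market_share(awareness, motivation, satisfaction):
    """Funnel estimate: heard of it, then tried it, then kept using it."""
    return awareness * motivation * satisfaction

# (awareness, motivation, satisfaction) from the OpenOffice tables above.
surveys = {
    "September 2012": (0.243, 0.630, 0.706),
    "April 2013": (0.276, 0.659, 0.687),
    "September 2013": (0.307, 0.674, 0.778),
}
shares = {when: market_share(*rates) for when, rates in surveys.items()}
for when, share in shares.items():
    print(f"{when}: {share:.1%}")

growth = shares["September 2013"] / shares["September 2012"] - 1
print(f"Year-over-year growth: {growth:.0%}")
```

Because the three rates multiply, a modest gain in each stage compounds into a much larger gain in share, which is why the year-over-year growth is so much bigger than any single row's change.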

In any case, to summarize and to illustrate the improvements graphically, I’ve charted the growth in user share over the three surveys, including results for LibreOffice as well:

Visualizing OASIS Technical Committees

Rob — Mon, 01 Jul 2013 12:28:22 +0000

So what do we have here? This is a simple social network visualization of OASIS Technical Committees. Each circle in this graph represents a single Technical Committee (TC). The size of the circle is proportionate to how many members are on the committee. The lines between the committees have a weight that is proportionate to the overlap in membership between the TCs. In this case I used Dice’s coefficient as a metric, although any of the several set similarity metrics (Jaccard, etc.) would work here. The color of each node represents the modularity class, a measure of communities or sub-networks within the graph. The resulting graph was then run through Gephi and its Force Atlas layout algorithm, which brings together the TCs that are more closely related by overlapping membership. Click the image for a larger version.
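The edge weights described above can be sketched in a few lines. This is not the original script. The committee names and members below are invented, and `overlap_edges` is a hypothetical helper. Dice's coefficient is 2|A ∩ B| / (|A| + |B|), and Jaccard is shown alongside for comparison.

```python
def dice(a, b):
    """Dice's coefficient of two membership sets: 2|A n B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a, b):
    """Jaccard index of two membership sets: |A n B| / |A u B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def overlap_edges(committees, min_weight=0.0):
    """Return (tc_one, tc_two, weight) for every pair with overlap above
    min_weight, using Dice's coefficient as the edge weight."""
    names = sorted(committees)
    edges = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            weight = dice(committees[first], committees[second])
            if weight > min_weight:
                edges.append((first, second, weight))
    return edges

# Invented committees and members, for illustration only.
committees = {
    "TC-A": {"ann", "bob", "cy"},
    "TC-B": {"bob", "cy", "dee"},
    "TC-C": {"eve"},
}
edges = overlap_edges(committees)
print(edges)
print(jaccard(committees["TC-A"], committees["TC-B"]))
```

Dice and Jaccard always rank pairs in the same order, so the choice mainly affects the scale of the weights passed to the layout algorithm.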

(For those who are interested, the raw data for this is all publicly available, on the OASIS website. Scraping the webpages for the data, calculating the graph and outputting a GEXF format file for Gephi was accomplished in 133 lines of Python.)

Note one important fact: the graph is formed entirely on abstract concepts, the size of each committee and the overlaps in membership. It has no knowledge of what the underlying technologies are, the companies and individuals involved, or of other items of semantic value that could describe the work of the committee. The structure is essentially based on the interests and affiliations of individual committee members. Where there is common interest it is assumed that there is commonality in the work of the TCs.

So how well does this match reality? The image that follows (click for an enlarged version) is the same chart, but with each node labeled by the short name of the TC. As you can see, the above approach does a fine job bringing together related TCs. This occurs both at the fine-grained level, where the DITA TC and the DITA Adoption TC, or the SCA and SCA Assembly TCs are adjacent, and it also applies at the broader level, where we see communities for content-related standards, for privacy/identity standards, legal/emergency, etc.