
An Antic Disposition


Popular Posts

How to Crush Dissent

2010/08/15 By Rob 26 Comments

While in Berlin for the LinuxTag 2010 conference a couple of months ago, I took the opportunity for an 8-mile meandering walk across the city, from Warschauer Strasse and the East Side Gallery to Wittenbergplatz and KaDeWe, taking in the various historical sites along the way.  It was a great refresher course in 20th century European history.  I especially enjoyed the free outdoor exhibit in Alexanderplatz, which dealt with the Revolutions of 1989 with a focus on the various dissident movements and publications in the DDR.  Most were self-published, stealthily distributed samizdat newsletters, copied laboriously using typewriters and carbon paper, primitive printing presses, or toward the end, some personal computers smuggled in from the West.  They had on display an Amiga 500 and an NEC Pinwriter P6 used in 1989.  Through “advanced” technology like this, document production could be raised from a few hundred to tens of thousands of copies.

As I looked at this display of samizdat publications, each a sign of struggle, technical and political,  I was smug.   Surely, all of this is irrelevant today?  The march of technology has now put within each of our hands tools that are orders of magnitude more efficient and effective than any underground publication of 1989.  With the Web, and WordPress and Twitter and YouTube and other services, we can instantly get a message out to millions of people.  We are far more advanced now.

Or so I thought for a few brief minutes, until the horrible truth struck me as I considered the question more deeply.  No, technology has not made dissent safer.  We are merely fortunate that the political climes of 2010 permit more dissent.  But if challenged, the powers that be have far greater tools to control information than they did in 1989.  I am not certain the tools available to the individual come close to being able to withstand them.

I strongly believe that the capability for citizens to dissent is an essential complement to fallible leadership. And all leadership is fallible.  Without such capabilities, transitions of power may be less frequent, but they also may be far bloodier.

Note that I say “capability” for dissent.  I don’t mean merely that dissent should be legal.  Legal protection for dissent is certainly a good thing, and is enshrined in the constitutions of many democracies today.  But I mean something more fundamental: the capability of individuals and groups to organize and express dissent, even when this goes against the law.  It is almost axiomatic that a regime slouching toward oppression will, at an early stage, declare dissent illegal.  History has shown this to us repeatedly.  So the capability to express illegal dissent is in some sense even more important than the ability to express dissent legally.

Through the 20th century there were many attempts to reduce capabilities to express dissent, from outlawing of opposition political parties, to shuttering independent newspapers, to mandatory registration of typewriters.  These all made dissent more difficult and riskier, but they did not remove the capability.  It was still possible, for one person, or a group of people, to organize in secret and get their message out.  They did it illegally, and at their own peril.  But that was enough to start the wheels turning.  If 10 people protest, they are called insane and carted away to the hospital.  If 1,000 people protest, tear gas is used and people are sent to prison.  But if 100,000 protest, then governments fall.  In a sense everything in the gamut from civil war to an open democratic election, with a nationwide protest someplace in the middle, is a proxy for the use of force.  There are bloody and bloodless ways of determining the majority opinion, and prudence suggests not eliminating the opportunity to use bloodless methods.

My sad observation is that we are quickly reaching the point, perhaps for the first time in history, where governments will have the means to eliminate even the capability for illegal dissent.  I believe this is a destabilizing threshold to cross.

Consider the following thought experiment.  Imagine we are back in 1985, back in the DDR, but instead of typewriters, you have all the 21st century technological facilities: the internet, Twitter, YouTube, etc.  You are a dissident and I am the government.

Your two main tasks are:

  • To collaborate electronically with trusted parties, while protecting the contents of the communication, as well as the identities of the other parties.
  • To publish information anonymously or pseudonymously for public consumption.

You wouldn’t be much of a dissident leader if you didn’t attempt those two tasks, and I wouldn’t be much of an oppressive regime if I did not try to stop you!

So where should I start?

  1. A private national network.  Think North Korea.
  2. A Great Firewall.
  3. Mandatory registration of computers and internet accounts
  4. Control of DNS
  5. Control of search
  6. Control of Certificate Authorities
  7. Invisible tagging of paper/ink
  8. Software monoculture that provides a single point of government control
  9. Limits on how many emails can be sent.  One might argue in favor of this as an anti-spam measure, but it also prevents effective organization.
  10. Outlaw strong cryptography.
  11. Reduce due process, making it trivial to subpoena ISP records without judicial review
  12. Make circumvention technology illegal
  13. Copyright — prevent fair use, Creative Commons, etc., extending copyright to government records

The interesting thing is how far we’ve gone down this road, especially at the behest of the recording industry and the copyright lobby.

What capabilities do you have on the other side?  What are your abilities to express dissent?

I think the example of Wikileaks quickly comes to mind.  That is one example of a web site that, through technical and jurisdictional means, appears to have avoided take-down by a far more powerful entity, at least so far.  However, I think this is a Pyrrhic victory.  The mere existence of Wikileaks will spur governments to tighten laws and invest in additional counter-information technologies, such as the Internet “Kill Switch” proposed by the Department of Homeland Security in the U.S.  The presence of a presently uncontrollable voice will surely lead to a concentration of control of the choke points of the internet that will eventually silence that voice.

When an irrepressible force meets an immovable object, one may speculate which will win.  I put my bets on the side with the money and the guns.  The danger for the rest of us is that in their attempts to control a venue for indiscriminate, absolute free speech, they will devise choke points that give future regimes the ability to crush dissent, and by eliminating dissent also eliminate the best opportunity we have for peaceful revolutions.

Of course, I do not advocate sedition.  And I’m not an advocate of absolute free speech.  There are copyright laws, there are privacy concerns, there are military secrets, there is child pornography.   These all trump free speech.  But I think that means that we make these activities illegal and vigorously prosecute those who break these laws.  But we should be seeking the minimal technical means necessary to detect the violators, without introducing such technologies that, to the level of a mathematical certainty, eliminate the ability for these activities to take place.  Because, if we do so, we also at the same time introduce mechanisms that can also be used to crush political dissent.  These technologies may first be promoted under the banner of “national security” or “protection of intellectual property”, but that is just their purported intent, not their technological limitation.

One would need to be a rather poor student of history not to notice that several times in the past century governments have lapsed and ended up a wee bit overzealous in their attempts to secure a high degree of visible consensus among their citizens.  When this happens, it is good to have several avenues to pursue honest and forthright discourse.  Certainly one doesn’t want to make it too easy to topple an established form of government, but neither does one want to make it mathematically impossible.  You want to bias the balance of rights toward stability, while acknowledging that the forces of revolution are forces of construction as well as destruction.  We have 400 years or more of experience balancing free speech with legitimate needs of governments to declare some speech illegal.  To date this has been done without the concentration of technical and administrative control sufficient to effect absolute prior restraint.  This is changing.  The unintended consequences of having such concentrated control should give us pause and make us hesitate rather than move quickly.  The creation of the equivalent of an anti-free speech nuclear bomb, a big red button that when pressed will silence a class of speech, must be avoided.


Filed Under: Cyber Freedom, Popular Posts

Doing the Microsoft Shuffle: Algorithm Fail in Browser Ballot

2010/02/27 By Rob 189 Comments

March 6th Update:  Microsoft appears to have updated the www.browserchoice.eu website and corrected the error I describe in this post.  More details on the fix can be found in The New & Improved Microsoft Shuffle.  However, I think you will still find the following analysis interesting.

-Rob


Introduction

The story first hit last week on the Slovakian tech site DSL.sk.  Since I am not linguistically equipped to follow the Slovakian tech scene, I didn’t hear about the story until it was brought up in English on TechCrunch.  The gist of these reports is this: DSL.sk did a test of the “ballot” screen at www.browserchoice.eu, used in Microsoft Windows 7 to prompt the user to install a browser.  It was a Microsoft concession to the EU, to provide a randomized ballot screen for users to select a browser.  However, the DSL.sk test suggested that the ordering of the browsers was far from random.

But this wasn’t a simple case of Internet Explorer showing up more in the first position.  The non-randomness was pronounced, but more complicated.  For example, Chrome was more likely to show up in one of the first 3 positions.  And Internet Explorer showed up 50% of the time in the last position.  This isn’t just a minor case of it being slightly non-random.  Try this test yourself: Load www.browserchoice.eu in Internet Explorer, and press refresh 20 times.  Count how many times the Internet Explorer choice is on the far right.  Can this be right?

The DSL.sk findings have led to various theories, premised on the likely mistaken assumption that the non-randomness is intentional.  Does Microsoft have secret research showing that the 5th position is actually chosen more often?  Is the Internet Explorer random number generator not random?  There were also comments asserting that the tests proved nothing and the results were just chance, and others saying that the results are expected to be non-random because computers can only make pseudo-random numbers, not genuinely random numbers.

Maybe there was cogent technical analysis of this issue posted someplace, but if there was, I could not find it.  So I’m providing my own analysis here, a little statistics and a little algorithms 101.  I’ll tell you what went wrong, and how Microsoft can fix it.  In the end it is a rookie mistake in the code, but it is an interesting mistake that we can learn from, so I’ll examine it in some depth.

Are the results random?

The ordering of the browser choices is determined by JavaScript code on the BrowserChoice.eu web site.  You can see the core function in the GenerateBrowserOrder function.  I took that function and supporting functions, put it into my own HTML file, added some test driver code, and ran it for 10,000 iterations on Internet Explorer.  The results are as follows:

Internet Explorer raw counts
Position I.E. Firefox Opera Chrome Safari
1 1304 2099 2132 2595 1870
2 1325 2161 2036 2565 1913
3 1105 2244 1374 3679 1598
4 1232 2248 1916 590 4014
5 5034 1248 2542 571 605
Internet Explorer fraction of total
Position I.E. Firefox Opera Chrome Safari
1 0.1304 0.2099 0.2132 0.2595 0.1870
2 0.1325 0.2161 0.2036 0.2565 0.1913
3 0.1105 0.2244 0.1374 0.3679 0.1598
4 0.1232 0.2248 0.1916 0.0590 0.4014
5 0.5034 0.1248 0.2542 0.0571 0.0605

This confirms the DSL.sk results.  Chrome appears more often in one of the first 3 positions and I.E. is most likely to be in the 5th position.

You can also see this graphically in a 3D bar chart:

But is this a statistically significant result?  I think most of us have an intuitive feeling that results are more significant if many tests are run, and if the results also vary greatly from an even distribution of positions.  On the other hand, we also know that a finite run of even a perfectly random algorithm will not give a perfectly uniform distribution.  It would be quite unusual if every cell in the above table was exactly 2,000.

This is not a question one answers with debate.  To go beyond intuition you need to perform a statistical test.  In this case, a good test is Pearson’s Chi-square test, which tests how well observed results match a specified distribution.  In this test we assume the null-hypothesis that the observed data is taken from a uniform distribution.  The test then tells us the probability that the observed results can be explained by chance.  In other words, what is the probability that the difference between observation and a uniform distribution was just the luck of the draw?  If that probability is very small, say less than 1%, then we can say with high confidence, say 99% confidence, that the positions are not uniformly distributed.   However, if the test returns a larger number, then we cannot disprove our null-hypothesis.  That doesn’t mean the null-hypothesis is true.  It just means we can’t disprove it.  In the end we can never prove the null hypothesis.  We can only try to disprove it.
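To make the test concrete, here is a minimal JavaScript sketch (my own illustration, not code from the ballot site or from R) that computes Pearson’s chi-square statistic for the Internet Explorer counts above.  Since every iteration places each browser in exactly one position, every row and every column sums to 10,000, so the expected count under uniformity is 2,000 in each cell and the degrees of freedom are (5−1)×(5−1) = 16.

```javascript
// Pearson's chi-square statistic: sum of (observed - expected)^2 / expected
// over every cell, against a uniform expectation.
function chiSquareUniform(observed) {
  const flat = observed.flat();
  const total = flat.reduce((a, b) => a + b, 0);
  const expected = total / flat.length; // 50,000 counts / 25 cells = 2,000
  return flat.reduce((sum, o) => sum + (o - expected) ** 2 / expected, 0);
}

// The 5x5 raw counts from the Internet Explorer run above
// (rows are positions 1-5; columns are IE, Firefox, Opera, Chrome, Safari).
const ieCounts = [
  [1304, 2099, 2132, 2595, 1870],
  [1325, 2161, 2036, 2565, 1913],
  [1105, 2244, 1374, 3679, 1598],
  [1232, 2248, 1916,  590, 4014],
  [5034, 1248, 2542,  571,  605],
];

console.log(chiSquareUniform(ieCounts).toFixed(2)); // "13340.23", matching R
```

With 16 degrees of freedom, the critical value at the 1% level is 32.000, so a statistic in the thousands rejects the uniform hypothesis decisively.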

Note also that having a uniform distribution is not the same as having uniformly distributed random positions.  There are ways of getting a uniform distribution that are not random, for example, by treating the order as a circular buffer and rotating through the list on each invocation.  Whether or not randomization is needed is ultimately dictated by the architectural assumptions of your application.  If you determine the order on a central server and then serve out that order on each invocation, then you can use non-random solutions, like the rotating circular buffer.  But if the ordering is determined independently on each client, for each invocation, then you need some source of randomness on each client to achieve a uniform distribution overall.  But regardless of how you attempt to achieve a uniform distribution, the way to test it is the same: the Chi-square test.
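The rotating circular buffer idea can be sketched in a few lines of JavaScript (an illustration of the hypothetical server-side alternative, not anything from the ballot site).  Over any multiple of 5 invocations, every browser lands in every position equally often, with no randomness at all:

```javascript
// A server-side rotation: uniform over invocations, but deterministic.
// Each call rotates the list by one position, so over 5*k calls every
// item appears in every position exactly k times.
function makeRotator(items) {
  let offset = 0;
  return function nextOrder() {
    const order = items.map((_, i) => items[(i + offset) % items.length]);
    offset = (offset + 1) % items.length;
    return order;
  };
}

const next = makeRotator(["IE", "Firefox", "Opera", "Chrome", "Safari"]);
console.log(next()); // ["IE", "Firefox", "Opera", "Chrome", "Safari"]
console.log(next()); // ["Firefox", "Opera", "Chrome", "Safari", "IE"]
```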

Using the open source statistical package R, I ran the  chisq.test() routine on the above data.  The results are:

X-squared = 13340.23, df = 16, p-value < 2.2e-16

The p-value is much, much less than 1%.  So, we can say with high confidence that the results are not random.

Repeating the same test on Firefox gives results that are also non-random, but in a different way:

Firefox raw counts
Position I.E. Firefox Opera Chrome Safari
1 2494 2489 1612 947 2458
2 2892 2820 1909 1111 1268
3 2398 2435 2643 1891 633
4 1628 1638 2632 3779 323
5 588 618 1204 2272 5318
Firefox fraction of total
Position I.E. Firefox Opera Chrome Safari
1 0.2494 0.2489 0.1612 0.0947 0.2458
2 0.2892 0.2820 0.1909 0.1111 0.1268
3 0.2398 0.2435 0.2643 0.1891 0.0633
4 0.1628 0.1638 0.2632 0.3779 0.0323
5 0.0588 0.0618 0.1204 0.2272 0.5318

On Firefox, Internet Explorer is more frequently in one of the first 3 positions, while Safari is most often in last position.  Strange.  The same code, but vastly different results.

The results here are also highly significant:

X-squared = 14831.41, df = 16, p-value < 2.2e-16

So given the above, we know two things:  1) The problem is real.  2) The problem is not related to a flaw only in Internet Explorer.

In the next section we look at the algorithm and show what the real problem is, and how to fix it.

Random shuffles

The browser choice screen requires what we call a “random shuffle”.  You start with an array of values and return those same values, but in a randomized order. This computational problem has been known since the earliest days of computing.  There are 4 well-known approaches: 2 good solutions, 1 acceptable (“good enough”) solution that is slower than necessary, and 1 bad approach that doesn’t really work.  Microsoft appears to have picked the bad approach. But I do not believe there is some nefarious intent to this bug.  It is more in the nature of a “naive” algorithm, like the bubble sort, that inexperienced programmers will inevitably fall upon when solving a given problem.  I bet if we gave this same problem to 100 freshmen computer science majors, at least one of them would make the same mistake.  But with education and experience, one learns about these things.  And one of the things one learns early on is to reach for Knuth.


The Art of Computer Programming, Vol. 2, section 3.4.2 “Random sampling and shuffling” describes two solutions:

  1. If the number of items to sort is small, then simply put all possible orderings in a table and select one ordering at random.  In our case, with 5 browsers, the table would need 5! = 120 rows.
  2. “Algorithm P” which Knuth attributes to Moses and Oakford (1963), but is now known to have been anticipated by Fisher and Yates (1938) so it is now called the Fisher-Yates Shuffle.
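The first of these solutions can be sketched directly in JavaScript (my own illustration; with only 5 browsers, the full table of 5! = 120 orderings is tiny, so selecting one row at random is perfectly practical):

```javascript
// Build the table of all orderings once, then pick one uniformly per invocation.
function permutations(arr) {
  if (arr.length <= 1) return [arr];
  const result = [];
  arr.forEach((item, i) => {
    const rest = arr.slice(0, i).concat(arr.slice(i + 1));
    for (const p of permutations(rest)) result.push([item, ...p]);
  });
  return result;
}

const table = permutations(["IE", "Firefox", "Opera", "Chrome", "Safari"]);
console.log(table.length); // 120

function randomOrder() {
  return table[Math.floor(Math.random() * table.length)];
}
```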

Another solution, one I use when I need a random shuffle in a database or spreadsheet, is to add a new column, fill that column with random numbers and then sort by that column.  This is very easy to implement in those environments. However, sorting is an O(N log N) operation whereas the Fisher-Yates algorithm is O(N), so you need to keep that in mind if performance is critical.
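The same decorate-and-sort idea looks like this in JavaScript (a sketch of my own).  The crucial difference from the ballot site’s approach is that each item’s random key is fixed *before* the sort begins, so the comparator is self-consistent:

```javascript
// "Good enough" shuffle: tag each item with one random key, sort by the key,
// then strip the keys off. O(n log n), but the comparator is consistent.
function sortShuffle(items) {
  return items
    .map(item => ({ item, key: Math.random() })) // one fixed key per item
    .sort((a, b) => a.key - b.key)               // a well-defined ordering
    .map(pair => pair.item);
}

console.log(sortShuffle(["IE", "Firefox", "Opera", "Chrome", "Safari"]));
```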

Microsoft used none of these well-known solutions.  Instead they fell for the well-known trap.  What they did is sort the array, but with a custom-defined comparison function or “comparator”.  JavaScript, like many other programming languages, allows a custom comparator function to be specified.  In JavaScript, this function takes two of the array’s values and returns a value which is:

  • <0 if the first value should be sorted before the second
  • 0 if the two values are equal, which is to say you are indifferent as to what order they are sorted in
  • >0 if the first value should be sorted after the second

This is a very flexible approach, and allows the programmer to handle all sorts of sorting tasks, from making case-insensitive sorts to defining locale-specific collation orders, and so on.

In this case Microsoft gave the following comparison function:

function RandomSort (a,b)
{
    return (0.5 - Math.random());
}

Since Math.random() should return a random number chosen uniformly between 0 and 1, the RandomSort() function will return a random value between -0.5 and 0.5.  If you know anything about sorting, you can see the problem here.  Sorting requires a self-consistent definition of ordering. The following assertions must be true if sorting is to make any sense at all:

  1. If a<b then b>a
  2. If a>b then b<a
  3. If a=b then b=a
  4. If a<b and b<c then a<c
  5. If a>b and b>c then a>c
  6. If a=b and b=c then a=c

All of these statements are violated by the Microsoft comparison function.  Since the comparison function returns random results, a sort routine that depends on any of these logical implications would receive inconsistent information regarding the progress of the sort.  Given that, the fact that the results were non-random is hardly surprising.  Depending on the exact sort algorithm used, it may just do a few exchange operations and then prematurely stop.  Or, it could be worse.  It could lead to an infinite loop.
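The flawed approach is easy to reproduce (a sketch of the mistake for illustration; which bias you observe, if any, depends entirely on the sort algorithm your JavaScript engine happens to use, which is precisely the problem):

```javascript
// The "Microsoft Shuffle": sort with a comparator that ignores its
// arguments and returns a random value. The result is whatever the
// engine's sort algorithm happens to do with inconsistent answers.
function microsoftShuffle(items) {
  return items.slice().sort(() => 0.5 - Math.random());
}

// Tally how often each browser lands in the first position over many runs.
const counts = { IE: 0, Firefox: 0, Opera: 0, Chrome: 0, Safari: 0 };
for (let run = 0; run < 10000; run++) {
  const first = microsoftShuffle(["IE", "Firefox", "Opera", "Chrome", "Safari"])[0];
  counts[first]++;
}
console.log(counts); // the spread is typically far from 2,000 per browser
```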

Fixing the Microsoft Shuffle

The simplest approach is to adopt a well-known and respected algorithm like the Fisher-Yates Shuffle, which has been known since 1938.  I tested with that algorithm, using a JavaScript implementation taken from the Fisher-Yates Wikipedia page, with the following results for 10,000 iterations in Internet Explorer:

Internet Explorer raw counts
Position I.E. Firefox Opera Chrome Safari
1 2023 1996 2007 1944 2030
2 1906 2052 1986 2036 2020
3 2023 1988 1981 1984 2024
4 2065 1985 1934 2019 1997
5 1983 1979 2092 2017 1929
Internet Explorer fraction of total
Position I.E. Firefox Opera Chrome Safari
1 0.2023 0.1996 0.2007 0.1944 0.2030
2 0.1906 0.2052 0.1986 0.2036 0.2020
3 0.2023 0.1988 0.1981 0.1984 0.2024
4 0.2065 0.1985 0.1934 0.2019 0.1997
5 0.1983 0.1979 0.2092 0.2017 0.1929

Applying Pearson’s Chi-square test we see:

X-squared = 21.814, df = 16, p-value = 0.1493

In other words, these results are not significantly different than a truly random distribution of positions.  This is good.  This is what we want to see.

Here it is, in graphical form, to the same scale as the “Microsoft Shuffle” chart earlier:
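For reference, the Fisher-Yates shuffle is only a few lines of JavaScript.  This is my own minimal sketch, not necessarily identical to the Wikipedia implementation used in the test above, but the structure is the same: walk the array from the end, swapping each element with a uniformly chosen element at or before it.

```javascript
// In-place Fisher-Yates shuffle: O(n), and each of the n! orderings is
// equally likely (assuming Math.random() is uniform).
function fisherYates(items) {
  for (let i = items.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1)); // 0 <= j <= i
    [items[i], items[j]] = [items[j], items[i]];   // swap
  }
  return items;
}

console.log(fisherYates(["IE", "Firefox", "Opera", "Chrome", "Safari"]));
```

Note that the comparator never appears at all: the randomness drives the swaps directly, so there is no inconsistent ordering to confuse a sort routine.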

Summary

The lesson here is that getting randomness on a computer cannot be left to chance.  You cannot just throw Math.random() at a problem and stir the pot, and expect good results.  Random is not the same as being casual.  Getting random results on a deterministic computer is one of the hardest things you can do with a computer and requires deliberate effort, including avoiding known traps.  But it also requires testing.  Where serious money is on the line, such as with online gambling sites, random number generators and shuffling algorithms are audited, tested and subject to inspection.  I suspect that the stakes involved in the browser market are no less significant.  Although I commend DSL.sk for finding this issue in the first place, I am astonished that the bug got as far as it did.  This should have been caught far earlier, by Microsoft, before this ballot screen was ever made public.  And if the EC is not already demanding a periodic audit of the aggregate browser presentation orderings, I think that would be a prudent thing to do.

If anyone is interested, you can take a look at the file I used for running the tests.  You type in an iteration count and press the execute button.  After a (hopefully) short delay you will get a table of results, using the Microsoft Shuffle as well as the Fisher-Yates Shuffle.  With 10,000 iterations you will get results in around 5 seconds.  Since all execution is in the browser, use larger numbers at your own risk.  At some large value you will presumably run out of memory, time out, hang, or otherwise get an unsatisfactory experience.


Filed Under: Microsoft, Popular Posts Tagged With: algorithm, chi square test, internet explorer, javascript, random number generator, shuffling, statistics

Beautiful Word Clouds

2008/06/26 By Rob 16 Comments

We’ve all seen tag clouds by now, the visualization technique that shows the importance (however defined, but typically by prevalence) of a word by assigning a proportionately sized font.

But now comes along a tool that treats these clouds as art. Wordle’s “Beautiful Word Clouds” is quite addictive, allowing you to enter the raw text and then play around with layout algorithms, fonts and coloring schemes to produce some very nice looking clouds. The author — Jonathan Feinberg — works here at IBM, a fact I did not discover until I had already wasted hours playing with the tool. So maybe I can count this as work now?

Here are a few examples of word clouds formed by analyzing three different texts. Can you guess the identity of the three texts?

Some of my wish-list items are:

  • Apply a stemming algorithm to conflate words with the same root. So in the last example, “standard” and “standards” are counted separately, when they are probably best counted as the same word.
  • Auto generate an image map associated with the cloud
  • Export to PNG (even if just written temporarily to server, I can download it from there)
  • I’d love to read a paper on how the layout algorithm works
  • What would happen if you combined Kohonen self-organizing maps with word clouds? Arrange the words so their proximity in the cloud was correlated with co-occurrence in the text.

Filed Under: Language, Popular Posts Tagged With: Moby Dick, Shakespeare, Tag Clouds, Word Clouds, Wordle

The Right and Lawful Rood

2007/12/13 By Rob 6 Comments

So what do we have here?

Sixteen men, lined up. They seem to be having a good time. Some are older, some younger. A historian of fashion might be able to tell us their relative social status, and perhaps their trade, by looking at their clothing. In the background, three men are observing and comparing notes. To the right is a church, and to the left is the village.

So what are they doing?

Is it an early depiction of the hokey-pokey (“You put your left foot in…”)?

No.

Although the scene obviously has some social aspects, the primary activity depicted here is standards development, particularly the historically mandated procedure for determining the linear measurement known as the “rood”, related to the English “rod”, the German “rute” and the Danish “rode”.

This print, from a 16th century surveyor’s manual by Jacob Koebel, called Geometrei. Von künstlichem Feldmessen und absehen, explains the procedure:

Stand at the door of a church on a Sunday and bid 16 men to stop, tall ones and small ones, as they happen to pass out when the service is finished; then make them put their left feet one behind the other, and the length thus obtained shall be a right and lawful rood to measure and survey the land with, and the 16th part of it shall be the right and lawful foot.

From a technical point of view, you might wonder why they didn’t have a standard rule, a metal bar etched with two lines, something tangible which could be carried about and used to calibrate? But who would maintain the standard? And would you trust them? Physical objects may be counterfeited, replaced, shaved, distorted, even stolen. Those who are buying land would like a longer rood, and those selling land would like a shorter rood, so the motivation for fraud is clear.

But the average length of the feet of 16 random men — that is probably not going to change much in a given town, or even across a country. Compared to the logistics required to create, duplicate and distribute a standard rule, the described statistical approach is easier to administer and was accurate enough for the time.

But there is more to it than that. Why didn’t the surveyor just measure his own feet? Or those of his friends? And why require that it be done at church? Why not wherever the surveyor wants to do it?

There must have been something about the process itself, the lining up and being measured, publicly, neighbor beside neighbor, next to the church, that lent it legitimacy. These men are literally voting with their feet.

The transparency of the process is also notable. The rood was determined in public, at the time and place most likely to offer everyone in the town the opportunity to observe. It is hard to cheat with the public watching. Anyone there trying to wear clown shoes or going barefoot would be immediately detected.

Also, it is notable that participation was on an equal basis. No one was able to say, “I am a rich merchant, so I should be allowed to bring 5 pairs of my shoes and line them up in front of me”. And certainly no one could say, “I am the King, the standard is determined by my foot and my foot alone”. This is good, because the variation from King to King would tend to be much greater than the variation from different random samplings of 16 men.


Filed Under: Popular Posts, Standards

An ODF/OOXML File Format Timeline

2007/06/24 By Rob 31 Comments

I suppose the downside of a blog post containing only a picture is that there is nothing for anyone to quote. So here are a few themes that struck me while putting this chart together:

  1. Microsoft once made file format information on the binary formats readily available, in fact encouraged programmers to use the binary formats. But then around 1999 they reversed course, and eliminated such documentation. At the time, working at Lotus, I had no idea what motivated this change. It was only years later, when Microsoft internal memos were released in cases like Comes v. Microsoft, that the full picture emerged. The file format was viewed by Microsoft as a strategic tool, used to support the overall Microsoft platform, not the user. The format was designed to preserve their vendor lock-in. The availability of the file format documentation to competitors was limited, as a matter of corporate policy. So this reminds us that just because something is documented and available today does not prevent Microsoft from changing their mind at a later point and removing the documentation, failing to update it with new releases, or making it available only under a more restrictive license. Since Ecma owns the OOXML specification, as well as the future maintenance of it, any belief in the long-term openness of this format depends on your trust of Microsoft’s future behavior in this area.
  2. Like any durable goods monopoly (and few things are as durable as software) Microsoft’s largest competitor is their own install base. Microsoft has made many attempts at moving beyond the binary formats in the past, with Office 2000, Office XP and Office 2003. But in each case it failed. These were all false starts and abandoned attempts. So we should look for signs that OOXML is actually Microsoft’s real direction and not another false start or dead end. My guess is that OOXML is merely a transitional format, much like Windows ME was in the OS space, a temporary hybrid used to ease the transition from 16-bit to the 32-bit platform that would eventually come (Windows 2000). Microsoft doesn’t want to support all of the quirks of their legacy formats forever. That just leads to bloated, fragile code, more expensive development and support costs. They would rather have clean, structured markup, like ODF. But the question is, how do you get there? The answer is straightforward: First, eliminate the competition. Second, move users in small steps, promising the comfort of continuity and safety. Third, once you have eliminated competition and have the users on the OOXML format that no one but Microsoft fully understands, then you may have your will of them. For example, introduce a new format that drops support for legacy formats and force everyone to upgrade. They are pretty much doing this already on the Mac by dropping support for VBA in the next version of the Mac Office. Even a cursory look at OOXML shows that it was not designed for long-term use, even by Microsoft. So the question I have is, what is the real format that they are going toward?
  3. Microsoft, after pretty much ignoring document standards for over a decade, suddenly got religion in late 2005 and rushed whatever they had on hand into Ecma. Remember, just months earlier they had recommended the Office 2003 Reference Schemas to Massachusetts for official use. I’m certainly glad Massachusetts did not fall for that by putting their resources on another dead format in the Microsoft format graveyard. OOXML was not designed to be a standard. It is just a proprietary specification that Microsoft has dumped, at the last minute, into ISO’s lap, in an attempt to translate their market domination into a standards imprimatur in order to further cement their market domination. It is a win-win situation for them. Either they have an effective monopoly in office applications and an ISO standard, or they have an effective monopoly in office applications. Nice situation for them either way.

Filed Under: ODF, OOXML, Popular Posts



Copyright © 2006-2023 Rob Weir · Site Policies
