The 96.97 percent problem

2006/08/21 By Rob 6 Comments

The press release puts out numbers of awesome import. We finally have the answers we seek, the science of web analytics and super-duper tools has laid all doubts to rest:

Amsterdam – August 14 2006 – OneStat.com, the number one provider of real-time intelligence web analytics, today reported that Microsoft’s Windows dominates the operating system market with a global usage share of 96.97 percent. The leading operating system on the web is Microsoft’s Windows XP with a global usage share of 86.80 percent. Microsoft’s Windows 2000 has a global usage share of 6.09 percent and is the second most popular OS on the web.

The global usage share of Apple’s Macintosh is 2.47 percent and the global usage share of Linux is 0.36 percent.

So what’s wrong with this picture? The first thing that hits me is that the survey quotes results to four significant digits. This is unusual in a survey of this kind, since it implies error bars of only +/- 0.005%. Now, what probably really happened here is that 96.97% of the sampled users were running Windows. But to apply that level of precision to the entire population, as they do when they call it “a global usage share of 96.97 percent”, is something else altogether. Just because you can calculate a number does not mean that you know a number.

According to their press release, OneStat sampled 2 million users from those who visited their customers’ web sites. We’ll deal with the potential bias issues later. But first let’s settle a statistical question: what sample size would be required to know results to 0.005%? This depends on the population size, that is, the number of internet users, which in 2004 was estimated to be 840,000,000, so I’ll use a nice round billion (1,000,000,000) as an estimate for 2006.

There are a number of survey calculators on the web. I use this one from Creative Research Systems. Plug the numbers into the Determine Sample Size form:

Confidence level = 95%
Confidence interval = 0.005
Population = 1000000000

Press Calculate and you will see that the required sample size is around 280 million. So a sample of only 2 million users, even if perfectly sampled, will not allow you to state numbers like 96.97%. It is off by a factor of 100.
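For readers who want to check the calculator figure by hand, here is a minimal sketch. It assumes the calculator uses the textbook formula for a proportion with a worst-case split of 50% and a finite population correction; that is my assumption, not something the calculator's page is quoted as saying. The function name and parameters are made up for illustration.

```python
import math

def required_sample_size(margin, population, z=1.96, p=0.5):
    """Sample size needed to estimate a proportion.

    margin     -- half-width of the interval as a fraction (0.005% -> 0.00005)
    population -- total population size
    z          -- z-score for the confidence level (1.96 is roughly 95%)
    p          -- assumed proportion; 0.5 is the worst case (an assumption)
    """
    # Sample size for an effectively infinite population.
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    # Finite population correction shrinks it for a population of known size.
    return math.ceil(n0 / (1 + (n0 - 1) / population))

needed = required_sample_size(margin=0.00005, population=1_000_000_000)
print(f"required sample: {needed:,}")
print(f"multiple of a 2 million sample: {needed / 2_000_000:.0f}")
```

Under these assumptions the result lands a little under 280 million, on the order of a hundred times the 2 million users actually sampled.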

So the question then is, how accurate are the results one can expect from “only” 2 million users? You can use the second calculator on that page, and get an answer of around 0.07%. That isn’t bad at all and may allow you to say 97.0 +/- 0.1%, which is nothing to sneeze at.
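The reverse calculation can be sketched the same way. Again this assumes the standard margin-of-error formula with a worst-case 50% split and a finite population correction, which may not be exactly what the online calculator does; the function name is hypothetical.

```python
import math

def margin_of_error(sample, population, z=1.96, p=0.5):
    """Half-width of the confidence interval, as a fraction.

    z=1.96 corresponds to roughly 95% confidence; p=0.5 is the
    worst-case proportion (an assumption, not a measured value).
    """
    standard_error = math.sqrt(p * (1 - p) / sample)
    # Finite population correction; nearly 1 when the sample is a tiny
    # fraction of the population.
    correction = math.sqrt((population - sample) / (population - 1))
    return z * standard_error * correction

moe = margin_of_error(sample=2_000_000, population=1_000_000_000)
print(f"+/- {moe * 100:.3f}%")
```

With 2 million users out of a billion, this comes out close to the 0.07% quoted above.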

(You can also use that form to discover some interesting facts, like a random sample of 384 people is enough to represent a population of any size to within a 5% confidence interval at the 95% confidence level. It is this type of asymptotic behavior which allows market research firms to make predictions about the preferences of people all over the world, doing many small surveys, though you may find that you yourself may never be surveyed in your entire life.)
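The asymptotic behavior is easy to see by running the same formula over several population sizes. This is a sketch under the same assumptions as before (95% confidence, worst-case 50% split, finite population correction); the population sizes are arbitrary examples chosen for illustration.

```python
def sample_size(margin, population, z=1.96, p=0.5):
    """Sample size for a proportion, rounded to the nearest whole person.

    Rounding to nearest is what reproduces the 384 figure; rounding up,
    the more cautious choice, would give 385 for very large populations.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    return round(n0 / (1 + (n0 - 1) / population))

# Illustrative population sizes only, not taken from any survey.
for population in (1_000, 100_000, 10_000_000, 1_000_000_000):
    print(f"{population:>13,} -> {sample_size(0.05, population)}")
```

The required sample grows quickly for small populations and then flattens out: past a certain point, a thousand times more people requires essentially no more respondents.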

Now all of this is moot if the 2 million user sample is not representative of the total population. The results may be precise to one decimal place, but are they accurate? Are the people who visit the web sites of OneStat’s customers reflective of all web users? Are they typical in terms of country, language, income, age, gender, etc.? No supporting info is given.

Sampling bias can be a treacherous thing. For example, let’s look at this blog. Over the past few weeks I’ve received 30,807 visitors, of which 6,512 were running Linux and 14,335 were running Firefox. Based on those numbers, and assuming a world-wide web population of 1 billion, I can issue a press release stating the following:

With 95% confidence, Linux has a global usage share of 21.1% (+/- 0.5%) and Firefox has a worldwide usage share of 46.5% (+/- 0.6%)

Based purely on the numbers, I have a sample size sufficient to support the stated precision. But do I think those numbers accurately reflect all web users?

In the end, it is a waste of time to do a survey of 2 million users unless you are rock solid sure that they are randomly selected and representative of the entire population. On the other hand, if you have a truly unbiased sample, you could tell the OS breakdown of the web to 1% precision with a sample of under 40,000 users.

The lesson? Don’t be awed by numbers. There is often less there than meets the eye.

Epithets

2006/01/16 By Rob 1 Comment

A few thoughts on the Epitheton Ornans, or ornamental epithet. This is more than a nickname; it is a formalized word or phrase associated with a person. Classical epic poetry makes heavy use of this rhetorical device. For example, in Homer, Achilles is often referred to as “podas okus” or “swift-footed”, whereas Agamemnon is often “anax andron” or “ruler of men”. There is internal evidence that these poems used a stock list of epithets of different lengths and stress patterns to fit into whatever metrical context was needed. In this way, the epithets could aid improvised oral performance, much as a jazz musician has a repertoire of riffs and chord progressions at his command which can be inserted to fill out a phrase.

The Romans allowed the honor of an “agnomen” for significant military victories. So Publius Cornelius Scipio, after defeating the Carthaginian Hannibal, became Scipio Africanus. Over the centuries, this trend escalated. So, by the 4th Century A.D., we have awe-inspiring names such as “Imperator Constantinus Maximus Augustus Persicus maximus, Germanicus maximus, Sarmaticus maximus, Britannicus maximus, Adiabenicus maximus, Medicus maximus, Gothicus maximus, Cappadocicus maximus, Arabicus maximus, Armenicus maximus, Dacicus maximus”. (Today we just call him “Constantine the Great”, which is a great time-saver.)

The trend continued. If you’ve seen an old British penny from 100 years ago, you would have read the legend “VICTORIA D G BRITT REG F D”, short for “Victoria, by the Grace of God, Queen of the Britains, Defender of the Faith”.

But the use of epithets has been on the wane for many years now, at least in the optimistic parts of the world. North Korea may have its “Dear Leader” and the late “Great Leader”, but we never even considered formally naming Eisenhower “The German Slayer”. We ended up with “Ike”. I guess we like our leaders to be mere men, and not gods. The Cult of Personality is difficult to maintain in a democracy with a free press. “No man is a hero to his butler”.

Sure, we have our little nicknames, “The Artist formerly known as Prince”, “Iron” Mike Tyson or the “Scud Stud”, but that is done in jest, or in the entertainment world (which amounts to the same thing). We will never see “Scud Stud” carved in marble or engraved in brass.

But once a year, on this date (or the nearest Monday), I am reminded of the most prominent example of epitheton ornans in common use today. I refer to the ubiquitous use of the phrase “Slain Civil Rights Leader”. The fact that I do not need to name the owner of this epithet demonstrates its currency. A search of Google News shows almost 1,500 uses of this phrase in recent press clips. This epithet is so tightly associated with him that it can be used as a substitute for his name, much as a medieval scholar could speak of “the Philosopher” to refer to Aristotle without ambiguity.

I’m trying to think of any other prominent examples of such epithets in common use today. I can’t think of any. Can you?

One wonders how long this epithet will remain. Will it outlast the generation that heard his message and heeded his Dream? We can hope so. But I do note that in the generation after the assassinations of Lincoln, Garfield and McKinley, all three were popularly acclaimed with the epithet “our martyred president”. But a search of Google News shows zero hits for “martyred president”, though there are 271 hits for “President Lincoln”.

The Most Dangerous Idea

2006/01/05 By Rob

The Edge Foundation’s 2006 question is framed as:

WHAT IS YOUR DANGEROUS IDEA?
The history of science is replete with discoveries that were considered socially, morally, or emotionally dangerous in their time; the Copernican and Darwinian revolutions are the most obvious. What is your dangerous idea? An idea you think about (not necessarily one you originated) that is dangerous not because it is assumed to be false, but because it might be true?

You can read the answers from 120 luminaries from many disciplines here. Many of the respondents fled to the polar banalities of atheism, solipsism or pantheism, and there is little here that is really dangerous, subversive, or would even be unseemly at a Unitarian prayer breakfast.

But read and judge for yourself. And think of what your most dangerous idea is. I’ll share mine.

The last few years have seen great advances in genetics: the decoding of the human genome, the discovery of gene therapies, etc. The prospect of curing genetic diseases by formulating designer drugs is no longer the stuff of science fiction. That some diseases are associated with certain ethnic or racial groups is also well-established. For example, Ashkenazic Jews have a greater probability of being born with Niemann-Pick, Gaucher, or Tay-Sachs diseases. Men on the Caribbean island of Tobago have a 3-fold increase in the likelihood of getting prostate cancer due to a shared genetic mutation. Cystic fibrosis is more common among Northern Europeans. This is not to say that race or ethnicity is a genetic determination, but that certain genetic mutations associated with certain diseases are more prevalent among certain sub-populations, and these sub-populations often break along racial and ethnic lines.

For a hundred bucks or so, I can take a mail-order test in the privacy of my home to see if I have Native American ancestry, African ancestry or Jewish ancestry, including whether I have the Cohanim gene.

Think of the implications of this. We can identify specific genetic markers that can be used to distinguish members of various human sub-populations. But this ability can be used for good or bad. Put it all together and think evil. No, even more evil than that. Think Ultimate Evil. Unleash the demons of biological warfare. What in principle prevents one from creating a biological organism which targets a specific human sub-population based on their genetics? For example, a targeted virus which would attack everyone of European ancestry, but would have no effect on the Chinese? The genocidal implications of this are enormous.

Churchill spoke of the danger of losing WWII and how we could “sink into the abyss of a new Dark Age made more sinister, and perhaps more protracted, by the lights of perverted science.” Certainly there is no shortage of ways to destroy the entire world: with germs, with bombs, with climate changes, with microscopic black holes, etc. Our inability to prevent such destruction proves that Man is foolish. But our ability to destroy a fraction of our world in a clinically targeted, racially motivated way may prove that Man is Evil, and that is my most dangerous idea.