
The 96.97 percent problem

The press release puts out numbers of awesome import. We finally have the answers we seek; the science of web analytics and its super-duper tools has laid all doubts to rest:

Amsterdam – August 14 2006 – OneStat.com, the number one provider of real-time intelligence web analytics, today reported that Microsoft’s Windows dominates the operating system market with a global usage share of 96.97 percent. The leading operating system on the web is Microsoft’s Windows XP with a global usage share of 86.80 percent. Microsoft’s Windows 2000 has a global usage share of 6.09 percent and is the second most popular OS on the web.

The global usage share of Apple’s Macintosh is 2.47 percent and the global usage share of Linux is 0.36 percent.

So what’s wrong with this picture? The first thing that hits me is that the survey quotes results to four significant digits. This is unusual in a survey of this kind, since it implies error bars of only +/- 0.005%. What probably happened here is that 96.97% of the sampled users were running Windows. But applying that level of precision to the entire population, as they do when they call it “a global usage share of 96.97 percent”, is something else altogether. Just because you can calculate a number does not mean that you know that number.

According to their press release, OneStat sampled 2 million users from those who visited their customers’ web sites. We’ll deal with the potential bias issues later, but first let’s settle a statistical question: what sample size would be required to know results to +/- 0.005%? This depends on the population size, i.e. the number of internet users, which in 2004 was estimated to be 840,000,000, so I’ll use a nice round billion (1,000,000,000) as an estimate for 2006.

There are a number of survey calculators on the web. I use this one from Creative Research Systems. Plug the numbers into the “Determine Sample Size” form:

  • Confidence level = 95%
  • Confidence interval = 0.005
  • Population = 1,000,000,000

Press Calculate and you will see that the required sample size is around 280 million. So a sample of only 2 million users, even if perfectly drawn, will not allow you to state numbers like 96.97%. It is short by more than a factor of 100.
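
The calculator’s arithmetic can be sketched with the standard formula for estimating a proportion (a minimal sketch; the function name, the worst-case p = 0.5, and z = 1.96 for 95% confidence are my assumptions about what the calculator does, and its exact rounding conventions may differ). The margin is expressed as a fraction, so 0.005% becomes 0.00005:

```python
import math

def sample_size(margin, population=1_000_000_000, z=1.96, p=0.5):
    """Sample size needed to estimate a proportion to +/- margin.
    p = 0.5 is the worst case (largest variance); the division
    applies the finite population correction."""
    n0 = z**2 * p * (1 - p) / margin**2   # infinite-population requirement
    return math.ceil(n0 / (1 + (n0 - 1) / population))

# Four significant digits (96.97%) imply a margin of about +/- 0.005%:
print(f"{sample_size(0.00005):,}")  # roughly 280 million
```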

So the question then is: how accurate are the results one can expect from “only” 2 million users? You can use the second calculator on that page and get an answer of around 0.07%. That isn’t bad at all, and may allow you to say 97.0% +/- 0.1%, which is nothing to sneeze at.
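
Running the same model in reverse gives the margin of error for a fixed sample size. This is a sketch under the same assumptions as the calculator (95% confidence, worst-case p = 0.5, finite population correction, which is nearly negligible here):

```python
import math

def margin_of_error(n, population=1_000_000_000, z=1.96, p=0.5):
    """95% margin of error for a proportion estimated from n samples,
    with the (here nearly negligible) finite population correction."""
    fpc = math.sqrt((population - n) / (population - 1))
    return z * math.sqrt(p * (1 - p) / n) * fpc

print(f"{margin_of_error(2_000_000) * 100:.3f}%")  # about 0.069%
```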

(You can also use that form to discover some interesting facts, such as that a random sample of 384 people is enough to represent a population of any size to within a 5% confidence interval. It is this type of asymptotic behavior that allows market research firms to make predictions about the preferences of people all over the world by doing many small surveys, even though you yourself may never be surveyed in your entire life.)
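
You can see the asymptote directly. In this sketch (same worst-case binomial formula, my own code; I round to the nearest whole respondent, which is why it reports 384 where a strict round-up would say 385), the required sample for a +/- 5% interval barely moves while the population grows a hundred-thousand-fold:

```python
import math

def sample_size(margin, population, z=1.96, p=0.5):
    """Worst-case (p = 0.5) sample size at 95% confidence, with finite
    population correction, rounded to the nearest whole respondent."""
    n0 = z**2 * p * (1 - p) / margin**2
    return round(n0 / (1 + (n0 - 1) / population))

# Required sample for a +/- 5% interval, as the population explodes:
for pop in (10_000, 1_000_000, 1_000_000_000):
    print(pop, sample_size(0.05, pop))  # 370, then 384, then 384
```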

Now all of this is moot if the 2 million user sample is not representative of the total population. The results may be precise to one decimal place, but are they accurate? Are the people who visit the web sites of OneStat’s customers reflective of all web users? Are they typical in terms of country, language, income, age, gender, etc.? No supporting information is given.

Sampling bias can be a treacherous thing. For example, let’s look at this blog. Over the past few weeks I’ve received 30,807 visitors, of which 6,512 were running Linux and 14,335 were running Firefox. Based on those numbers, and assuming a world-wide web population of 1 billion, I can issue a press release stating the following:

With 95% confidence, Linux has a global usage share of 21.1% (+/- 0.5%) and Firefox has a worldwide usage share of 46.5% (+/- 0.6%)

Based purely on the numbers, I have a sample size sufficient to support the stated precision. But do I think those numbers accurately reflect all web users?
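
For the record, here is how figures like those fall out of the standard binomial margin-of-error formula (a sketch using this blog’s visitor counts from above; no finite population correction is needed, since 30,807 is tiny relative to a billion):

```python
import math

visitors = 30_807
hits = {"Linux": 6_512, "Firefox": 14_335}

def share_and_margin(count, n, z=1.96):
    """Observed share and its 95% binomial margin of error, as percentages."""
    p = count / n
    return round(p * 100, 1), round(z * math.sqrt(p * (1 - p) / n) * 100, 1)

for name, count in hits.items():
    print(name, share_and_margin(count, visitors))
# Linux (21.1, 0.5), Firefox (46.5, 0.6)
```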

In the end, it is a waste of time to survey 2 million users unless you are rock solid sure that they are randomly selected and representative of the entire population. On the other hand, if you have a truly unbiased sample, you could tell the OS breakdown of the web to 1% precision with a sample of under 40,000 users.
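
As a check on that last figure, the worst-case sample size for +/- 1% at 95% confidence comes out well under 40,000 (plain binomial formula; this arithmetic is mine, not from any press release):

```python
# Worst-case (p = 0.5) sample size for +/- 1% at 95% confidence; the
# finite population correction is negligible when n is this small
# relative to a billion-user population.
z, margin = 1.96, 0.01
n = round(z**2 * 0.25 / margin**2)
print(n)  # 9604
```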

The lesson? Don’t be awed by numbers. There is often less there than meets the eye.

{ 6 comments… add one }
  • Anonymous 2006/08/22, 00:52

    That’s some cool shit.

  • Anonymous 2006/08/22, 09:45

    Ok, so instead of 0.36%, Linux usage is actually 0.3% (+/- 0.1%). Is that your point? I’m impressed.

  • Rob 2006/08/22, 11:08

    No, I did not write an entire blog post just about a decimal point.

    I was illustrating a point about precision and accuracy. The numbers in the OneStat press release exhibited what is called “false precision”: giving the appearance of more precision than the data warrant. To the average reader, this false precision also carries an implication of high accuracy, which is not necessarily true. The accuracy of the survey is not substantiated; we were given no indication that the sample was representative of “global use”, although the results claimed to demonstrate exactly that. I also gave an example where I could legitimately calculate a high-precision answer claiming that Linux has a 21.1% market share.

    The point is that a company can have a slick web site, issue a slick press release with a bunch of numbers, have that press release copied and quoted all over the world, even have the word “stat” in their name, and still come up with a survey that would get a failing grade in an intro to statistics class at a community college. I don’t blame them. I blame everyone else. 98.3432% of people simply live in awe of numbers and don’t question them.

  • Anonymous 2006/09/17, 13:57

    Man, are some of your readers dumb or what? Scary that some people just will never be able to get it. Really scary. Oh, look out my window, there is the result; now I get it. Heh.

  • LB 2007/01/23, 21:21

    Whether or not the last digit in “0.36%” is significant is not important.

    Of course Rob is entitled to his opinion. Sampling bias also applies to Rob’s own site. We don’t know if Rob’s cross section of hits represents the Web as a whole either. Who knows, maybe Rob is lying, or maybe Rob has lots of friends who use FF.

    The real difference is that OneStat sampled 2 million hits while old Rob had a mere 37,000 hits. Businesses pay cold hard cash for OneStat’s services while old Rob runs a blog for free.

    You can believe whoever you want. I trust OneStat over old Rob.

  • Rob 2007/01/23, 22:17

    Hi LB,

    I have no doubts that my web site traffic is atypical. I never claimed it was. My point is that the number of hits you get (whether tens of thousands or millions) doesn’t matter so much as the way you do your sampling.

    It is well-known to survey practitioners that a survey of only 400 people can represent the opinions of an entire country to within 5%, if you pick an unbiased, representative sample. Google for “survey of 400” to see how often that magic number is used.

    Similarly, a survey of 10 million people can be useless if it is not representative.

    Also, I’m not old.
