The press release puts out numbers of awesome import. We finally have the answers we seek; the science of web analytics and its super-duper tools have laid all doubts to rest:
Amsterdam – August 14 2006 – OneStat.com, the number one provider of real-time intelligence web analytics, today reported that Microsoft’s Windows dominates the operating system market with a global usage share of 96.97 percent. The leading operating system on the web is Microsoft’s Windows XP with a global usage share of 86.80 percent. Microsoft’s Windows 2000 has a global usage share of 6.09 percent and is the second most popular OS on the web.
The global usage share of Apple’s Macintosh is 2.47 percent and the global usage share of Linux is 0.36 percent.
So what’s wrong with this picture? The first thing that strikes me is that the survey quotes results to four significant digits. This is unusual in a survey of this kind, since it implies error bars of only +/- 0.005%. Now, what probably really happened here is that 96.97% of the sampled users were running Windows. But applying that level of precision to the entire population, as they do when they call it “a global usage share of 96.97 percent”, is something else altogether. Just because you can calculate a number does not mean that you know a number.
According to their press release, OneStat sampled 2 million users from those who visited their customers’ web sites. We’ll deal with the potential bias issues later. But first let’s settle a statistical question: what sample size would be required to know results to +/- 0.005%? This depends on the population size, i.e. the number of internet users, which in 2004 was estimated at 840,000,000, so I’ll use a nice round billion (1,000,000,000) as an estimate for 2006.
There are a number of survey calculators on the web. I use this one from Creative Research Systems. Plug the numbers into the Determine Sample Size form:
- Confidence level = 95%
- Confidence interval = 0.005
- Population = 1,000,000,000
Press Calculate and you will see that the required sample size is around 280 million. So a sample of only 2 million users, even if perfectly drawn, will not allow you to state numbers like 96.97%. It falls short by a factor of about 140.
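As a sanity check on the calculator, here is a back-of-the-envelope sketch using the standard sample-size formula with the worst-case proportion p = 0.5, z = 1.96 for 95% confidence, and the finite population correction (the function and variable names are mine, not the calculator’s):

```python
def required_sample_size(margin, population, z=1.96, p=0.5):
    """Sample size needed for a +/- `margin` confidence interval,
    with `margin` expressed as a proportion (0.00005 means 0.005%)."""
    n0 = z**2 * p * (1 - p) / margin**2      # infinite-population sample size
    return n0 / (1 + (n0 - 1) / population)  # finite population correction

# +/- 0.005% on a population of one billion internet users
n = required_sample_size(margin=0.00005, population=1_000_000_000)
print(round(n))  # roughly 280 million, ~140x the 2 million actually sampled
```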
So the question then is: how accurate are the results one can expect from “only” 2 million users? You can use the second calculator on that page and get an answer of around 0.07%. That isn’t bad at all, and may allow you to say 97.0 +/- 0.1%, which is nothing to sneeze at.
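Running the same numbers the other way, the margin of error for a sample of 2 million out of a billion follows from the standard formula, again with the worst-case p = 0.5 (the function name is my own):

```python
import math

def margin_of_error(sample, population, z=1.96, p=0.5):
    """95% margin of error, as a proportion, with finite population correction."""
    fpc = math.sqrt((population - sample) / (population - 1))
    return z * math.sqrt(p * (1 - p) / sample) * fpc

m = margin_of_error(2_000_000, 1_000_000_000)
print(f"{m:.2%}")  # about 0.07%
```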
(You can also use that form to discover some interesting facts, like a random sample of 384 people being enough to pin down a proportion in a population of any size to within +/- 5% at a 95% confidence level. It is this type of asymptotic behavior that allows market research firms to make predictions about the preferences of people all over the world by doing many small surveys, even though you yourself may never be surveyed in your entire life.)
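The asymptote is easy to see: for a +/- 5% interval the infinite-population formula z²p(1−p)/e² gives about 384 regardless of population size, and the finite population correction barely moves it once the population is large. A quick sketch (function name is mine):

```python
def sample_size_5pct(population, z=1.96, p=0.5):
    n0 = z**2 * p * (1 - p) / 0.05**2        # ~384.16, independent of population
    return n0 / (1 + (n0 - 1) / population)  # finite population correction

for pop in (10_000, 1_000_000, 1_000_000_000):
    print(pop, round(sample_size_5pct(pop)))
# the required size saturates near 384 as the population grows
```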
Now all of this is moot if the 2 million user sample is not representative of the total population. The results may be precise to one decimal place, but are they accurate? Are the people who visit the web sites of OneStat’s customers reflective of all web users? Are they typical in terms of country, language, income, age, gender, etc.? No supporting information is given.
Sampling bias can be a treacherous thing. For example, let’s look at this blog. Over the past few weeks I’ve received 30,807 visitors, of which 6,512 were running Linux and 14,335 were running Firefox. Based on those numbers, and assuming a world-wide web population of 1 billion, I can issue a press release stating the following:
With 95% confidence, Linux has a global usage share of 21.1% (+/- 0.5%) and Firefox has a worldwide usage share of 46.5% (+/- 0.6%)
Based purely on the numbers, I have a sample size sufficient to support the stated precision. But do I think those numbers accurately reflect all web users?
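Those margins can be checked with the same formula, this time plugging in the observed proportions rather than the worst case (variable names are mine):

```python
import math

def margin_of_error(p, sample, z=1.96):
    """95% margin of error for an observed proportion p in a sample."""
    return z * math.sqrt(p * (1 - p) / sample)

visitors = 30_807
for name, count in (("Linux", 6_512), ("Firefox", 14_335)):
    p = count / visitors
    print(f"{name}: {p:.1%} +/- {margin_of_error(p, visitors):.1%}")
# Linux comes out near 21.1% +/- 0.5%, Firefox near 46.5% +/- 0.6%
```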
In the end, it is a waste of time to survey 2 million users unless you are rock solid sure that they are randomly selected and representative of the entire population. On the other hand, if you have a truly unbiased sample, you could tell the OS breakdown of the web to +/- 0.5% with a sample of under 40,000 users.
The lesson? Don’t be awed by numbers. There is often less there than meets the eye.