Antezeta Web Marketing

Reflections on search engine optimization, web analytics and web marketing

Web statistics for internet market research: pick a number, any number

How to perform competitor research using web statistics while avoiding lies, damned lies, and …statistics?

Comparison with competitors is a fundamental element of business; even innovators need to know how far ahead they are in their market. The Internet seems to offer fertile terrain for capturing accurate marketing statistics on website usage and position relative to other players in a given market. Indeed, most of us have often heard web statistics from Nielsen//NetRatings, Alexa or comScore cited in the press and elsewhere. Practitioners of Search Engine Optimization and web marketing know that web analytics is not just silo analysis of a company’s website: it also entails looking at how a website and its business performance metrics measure up in the overall web ecosystem.

Statistics reliability

So how valid are the web usage and site ranking statistics offered by web measurement companies? Despite the existence of organizations such as the Web Analytics Association (WAA) and the Interactive Advertising Bureau (IAB), the ugly answer is “We don’t know”. What we do have is a proliferation of the “mine’s bigger than yours” syndrome as one website cites its high ranking using one set of statistics and a competitor responds, citing a different, incompatible, source.

Is there a statistician in the House?

Most of us don’t have a background in statistics. We easily fall for the authority of pretty pictures and lines that go up and down. Darrell Huff covered the topic well in his 1954 classic, How to Lie with Statistics [UK edition]. That his book is still in print is a testament to the power of his message. More recently, Edward Tufte has tried to teach us the errors of our ways in his renowned The Visual Display of Quantitative Information [UK edition].

Methodology transparency

To be able to effectively evaluate the reliability of the Internet statistics published by the various measurement companies, we would need to have access to information on the complete methodology used to derive the final statistics. Basic elements include the sample selection, sample size, and any corrections made to offset sample bias and skew. Yet even when the major collectors of internet usage do discuss methodology, they usually don’t get beyond sample size and a very general discussion of sample sources. Indeed, the IAB has recently challenged two of the companies, Nielsen//NetRatings and comScore, to accept a standardized audit and accreditation process. In this challenge exists the risk that processes which are fundamentally flawed in their conception may end up being accredited simply because the method becomes documented.
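To make the idea of a correction for sample bias concrete, here is a minimal sketch of one textbook technique, post-stratification weighting. It is not the documented method of any company discussed in this article; the age groups, shares and visit rates are all invented for illustration.

```python
# Hypothetical illustration: every number below is invented, not vendor data.
population_share = {"18-34": 0.30, "35-54": 0.40, "55+": 0.30}  # assumed real-world mix
panel_share = {"18-34": 0.50, "35-54": 0.35, "55+": 0.15}       # assumed skewed panel mix
visit_rate = {"18-34": 0.20, "35-54": 0.10, "55+": 0.05}        # share of each panel group visiting a site


def poststratification_weights(population, panel):
    """Weight each group so the panel's mix matches the population's mix."""
    return {group: population[group] / panel[group] for group in population}


def raw_reach(panel, rate):
    """Reach estimated directly from the unweighted panel."""
    return sum(panel[group] * rate[group] for group in panel)


def weighted_reach(panel, rate, weights):
    """Reach after re-weighting each group to its population share."""
    return sum(panel[group] * weights[group] * rate[group] for group in panel)


weights = poststratification_weights(population_share, panel_share)
print(f"Unweighted reach: {raw_reach(panel_share, visit_rate):.2%}")
print(f"Weighted reach:   {weighted_reach(panel_share, visit_rate, weights):.2%}")
```

In this made-up case the panel over-represents younger users, so the unweighted estimate overstates the site's reach. The correction is only as good as the population shares fed into it, which is exactly the kind of detail a published methodology would need to disclose.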

Common measurement approaches

There are three principal ways to measure overall Internet usage. A panel of users can be measured at their computers with installed software (user-centric), marketers can monitor how visitors interact with a specific website (site-centric), or data can be collected directly from ISP networks (network-centric).

User-centric internet usage measurement

User based measurement involves convincing users to install software which will track most, if not all, their Internet usage. Browser toolbars are one of the most obvious approaches. Toolbars have many limitations. They are limited to tracking standard website traffic. They won’t know about Skype and other non browser based internet applications. Toolbar based measurement is limited to a self-selected population which decides to install the toolbar, although there are some cases where a toolbar may be bundled with a new computer. Toolbar suppliers often exclude Firefox and other non-Internet Explorer browsers.

A second approach is to convince users to install an application which tracks both browser and other Internet application usage. Panel based measurement takes this approach. In some cases the users are well aware of their participation in a panel (and may alter their behavior accordingly); in other cases, users have chosen to install software to receive a specific benefit – the user might not be aware of the software’s true purpose.

Website-centric usage measurement

Measurement at the website level involves cooperating with website owners to install a web analytics system, usually based on web server log files or on tracking code inserted in all of a site’s web pages. In Italy, Audiweb publishes website-centric data of major media companies (requires registration). Due to the need for site owners to cooperate, use of this approach is limited. Some web analytics hosting companies make limited website-centric data available as an activity secondary to their primary business.

Network-centric usage measurement

Intercepting data between users and websites at the network level potentially offers sample sizes which are much larger than traditional panels (but smaller than the data sample available through collection directly at a website).

The first consideration in evaluating the data’s potential reliability is to ask about the sampling methodology. Which ISPs have been selected? Is their typical user demographic different from those not included? Users of Telco ISPs, such as Telecom Italia’s Alice, are usually following the path of least resistance. Those who have chosen economical ISPs such as Tele2 will have a different marketing profile. At the opposite end of the spectrum are the ISPs which focus on premium high speed connections, such as Italy’s Fastweb. Do the selected ISPs include corporate traffic, or just small business and residential users?

The measurement companies

Alexa

Amazon.com’s Alexa is one of the better known suppliers of web statistics, and one of the few offering broad international data. Data is collected through the Alexa Toolbar, installed by those wishing to view site ranking data for competitor sites, rank the sites they visit and, more recently, perhaps to use the incorporated anti-phishing tools. It is not difficult to imagine that this self-selected community is composed of webmasters in the know looking to increase their own site’s ranking.

Alexa’s methodology is fairly well documented, including significant disclaimers.

Although Alexa should be lauded for documenting their methodology, the disclaimers inaccurately minimize the IE Windows centric nature of Alexa data. While Alexa states they collect data from toolbars on Macintosh and Linux platforms, Alexa fails to note that they offer just one toolbar: for Internet Explorer on MS Windows. While there are some third party tools for Firefox, it is doubtful that they have the same uptake percentage as the official toolbar for IE. One, the “About this site” extension, would only ping Alexa’s servers upon a manual rank verification request, certainly not the same as the automated data collection offered by the IE toolbar. One semi-official solution, the A9 toolbar, has been abandoned by Amazon. As Alexa’s site and toolbar are only available in English, English speaking regions may be overrepresented.

Update: Alexa has released “Sparky”, a toolbar for Firefox, 16 July 2007.

Notable has been Alexa’s commitment to web developer APIs, allowing alternative services to provide alternative visual representations of Alexa’s base data. One example is Statsaholic, although it has been accused of scraping Alexa’s data.

Alexa does offer some pretty icons to illustrate a site’s ranking in the Alexa community – particularly useful for those who have manipulated Alexa’s rankings.


corriere.it vs repubblica.it: page views of the past 6 months.

Antezeta considerations: Alexa’s data, based on a self-selected sample, lacks scientific rigor. We wouldn’t make financial investments based on Alexa rankings. Admire the pretty graphs and move on.

Compete

Founded in 2000 by Bill Gross of Overture fame, Compete employs a browser toolbar (Internet Explorer and Firefox supported), panels and ISP data to capture two million plus community members in the US. Sites whose audience is primarily outside the US are not officially supported; limited non-US tracking appears to have begun in July 2006 as seen in a comparison of visits to Italy’s two leading newspapers:

Corriere vs Repubblica

Compete’s methodology is fairly well documented; a PDF is available.

Antezeta considerations: Judicious leveraging of data from three different sources could potentially lead to more reliable data than that provided solely by small panels or self-selected toolbar audiences. With a US focus, Compete data is not reliable for sites whose main visitor population is outside the US.

comScore

comScore uses a panel of over 2 million participants who have been recruited by software offering other benefits such as an E-mail antivirus and free prizes. comScore’s methodology has proved very controversial, with some labeling the now discontinued MarketScore antivirus software spyware. In theory this somewhat self-selected sample (it is not clear how many participants signing up for free software really understand the true motive behind the free software) is normalized to represent overall Internet demographics, but, as in the case of the other services, we’ll have to take comScore at their word on this.

Through a Canadian subsidiary SurveySite, comScore conducts quantitative & qualitative research online.

Antezeta considerations: From a marketing demographic perspective, is an educated, savvy Internet user going to install software of dubious provenance? Are companies with IT processes in place going to allow their employees to install this software?

Quantcast

Quantcast was launched in September 2006 by “a team of engineers and scientists from NASA, Stanford and AltaVista”. Quantcast primarily collects data from ISPs and advertisers with a US focus. Site owners can tag their pages with Quantcast code for more accurate site level reporting. As with most web statistics suppliers, Quantcast produces some impressive reports, but the exact methodology used is not publicly disclosed. Quantcast do describe their overall approach as a mix of panel data sampling and pixel tracking. Unfortunately, the devil is, as always, in the details: how is the panel recruited? How representative is it of the public as a whole? Are panel members aware that their navigation will affect Quantcast’s data, and if so, does that influence their navigation? How is the data processed, and with what assumptions?

View Quantcast profiles for la repubblica and corriere della sera.

Antezeta considerations: Without full disclosure of Quantcast’s methodology, it is unclear if their results could withstand peer scrutiny. Benchmarking of sites using Quantcast tracking code should be fairly reliable assuming (big assumption here) site pages have been properly tagged.

Nielsen Online, formerly Nielsen//NetRatings

Nielsen//NetRatings, long established as Nielsen in the audience measurement field, is perhaps one of the most cited sources of web statistics in the general press.

Nielsen captures high level internet data using sample panels selected by the random digit dialing (RDD) method. In Italy, the sample size was just increased from 5,000 to 15,000 with a stated goal of 20,000 by the end of 2007. A press release also notes rather vaguely that the RDD selection is being augmented by on-line recruiting; it is not clear as to how nor why. It is this panel data which is usually referred to in press statements on overall Internet usage trends. Update 2009-03-28: according to a December 2008 Audiweb press release, the Nielsen panel size in Italy reached about 20,000 in October 2008. Not sure what they mean by about.

Nielsen also captures web analytics data at the website level, i.e. much more accurate, for clients who install Nielsen//NetRatings’ “SiteCensus” tracking code. SiteCensus is based on the former Red Sheriff product.

Nielsen Online offer a great sourcing Guidelines document for their clients and other consumers of their data but this document seems to be focused primarily on protecting Nielsen Online’s image and revenue streams rather than clarifying the underlying methodology and statistical error. Try to find a comprehensive discussion of Nielsen//NetRatings’ methodology used to justify their statements. Does Nielsen metering software run on Macintosh and Linux computers? Or does it track just Windows users? Does RDD call both cell and landline phone numbers? How many people refuse to participate? Are they more affluent, time pressed professionals who can’t be bothered? The best you’ll find is a marketing document touting “Nielsen//NetRatings’ precise information”.

So what does precise mean? Consider the Italian panel of 15,000 members. Italy’s population is about 59,000,000 (Source: Istat). Website X is leader in the home mortgage business, a very important sector both for home buyers and lending institutions. Website Y is a valid contender. Nielsen’s sample size is 0.025% of the entire Italian population, 0.038% of the 15-64 year olds. Yes, you understood correctly: Nielsen//NetRatings’ calculations are based on a whopping 0% of the population, after rounding. In the case of our Home Mortgage example, site X has ~120,000 unique visitors every month. Site Y, the closest competitor, attracts ~75,000 unique visitors a month. So how many of these real visitors can Nielsen//NetRatings track? About 46 and 29 respectively, assuming that both the sites’ visitors and the panel members are between 15 and 64 years old.

| | Italy Population | Nielsen Online Sample Size | Sample as % of Population | Unique Monthly Visitors Site X | Number of visitors captured by Nielsen Online | Unique Monthly Visitors Site Y | Number of visitors captured by Nielsen Online |
|---|---|---|---|---|---|---|---|
| Adult (15-64) | 39,058,000 | 15,000 | 0.038% | 120,000 | 46 | 75,000 | 29 |
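The arithmetic behind the mortgage example can be reproduced with a few lines of Python. The sketch below uses only the figures quoted above; the last function adds a rough Poisson-style estimate of sampling noise, which is our own back-of-the-envelope assumption and not something Nielsen publishes.

```python
import math


def sampling_fraction(panel_size: int, population: int) -> float:
    """Share of the population that sits on the panel."""
    return panel_size / population


def expected_panel_visitors(monthly_visitors: int, fraction: float) -> float:
    """Visitors we would expect on the panel if it mirrored the population."""
    return monthly_visitors * fraction


def relative_standard_error(expected_count: float) -> float:
    """Rough Poisson approximation; an assumption, not Nielsen's published method."""
    return 1 / math.sqrt(expected_count)


# Figures taken from the example above (Italian residents aged 15-64).
fraction = sampling_fraction(15_000, 39_058_000)

for site, visitors in (("X", 120_000), ("Y", 75_000)):
    captured = expected_panel_visitors(visitors, fraction)
    error = relative_standard_error(captured)
    print(f"Site {site}: ~{captured:.0f} panel visitors, "
          f"relative standard error ~{error:.0%}")
```

Under that simplifying assumption, counts of 46 and 29 panel members carry a relative error of roughly 15% and 19% from sampling noise alone, before any question of panel bias is considered.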

Antezeta considerations: Nielsen Online does not publicly release data for specific sites other than what appears in their press releases. Consider if a small panel is an appropriate measurement technique for the Internet.

Experian Hitwise

Australian based Hitwise uses ISP data to report on Internet use in its primary markets: the United States, United Kingdom, Australia, New Zealand, Hong Kong, Singapore, France and Brazil. ISP data is augmented with data from “opt-in panel partners”, presumably in order to add demographic data. Worldwide, Hitwise tracks data on 1 million websites across 160+ industries.

| Country | Sample Size | % Population | Population |
|---|---|---|---|
| Australia | 3,000,000 | 13.9% | 21,515,754 |
| Brazil | n.a. | | 201,103,330 |
| Canada | 100,000 | 0.3% | 33,759,742 |
| France | 90,000 | 0.1% | 64,768,389 |
| Hong Kong | 1,800,000 | 25.4% | 7,089,705 |
| New Zealand | 460,000 | 10.8% | 4,252,277 |
| Singapore | 1,500,000 | 31.9% | 4,701,069 |
| United Kingdom | 8,000,000 | 12.8% | 62,348,447 |
| United States | 10,000,000 | 3.2% | 310,232,863 |

Table: Hitwise data sample size, 12/2010. Source: Hitwise websites; Population data from CIA World Factbook.
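For readers who want to check or update the percentages, here is a short sketch that recomputes the sample share for each market from the figures in the table. Brazil is left out because no sample size is given; the function and variable names are ours.

```python
# (sample size, population) per country, copied from the table above.
# Brazil is omitted because its sample size is listed as n.a.
hitwise_samples = {
    "Australia": (3_000_000, 21_515_754),
    "Canada": (100_000, 33_759_742),
    "France": (90_000, 64_768_389),
    "Hong Kong": (1_800_000, 7_089_705),
    "New Zealand": (460_000, 4_252_277),
    "Singapore": (1_500_000, 4_701_069),
    "United Kingdom": (8_000_000, 62_348_447),
    "United States": (10_000_000, 310_232_863),
}


def sample_percentage(sample_size: int, population: int) -> float:
    """Sample size expressed as a percentage of the country's population."""
    return 100 * sample_size / population


for country, (sample, population) in hitwise_samples.items():
    print(f"{country}: {sample_percentage(sample, population):.1f}%")
```

Note that a large percentage says nothing about how the ISPs behind it were chosen, so the sampling questions raised in the network-centric section still apply.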

See the previous discussion of network-centric data collection to understand some of the advantages and limitations inherent in using ISP data.

A number of sample reports, such as top search engines, are available in Hitwise’s data center. Hitwise does not publicly release data for specific sites other than in press releases and blog entries. Hitwise does not currently have partnerships with Italian ISPs, thus Hitwise does not report on the Italian Market.

Antezeta considerations: ISP data collection is the most promising approach short of direct site measurement. Evaluate Hitwise’s sampling techniques for your markets before purchasing their data.

Netcraft

Netcraft has tracked web server statistics across the internet since 1995. In December 2004, Netcraft began to track website popularity through a user installed toolbar, promoted as an anti-phishing tool. According to Netcraft, the site ranking is based on the weekly hit rate. Netcraft estimates that the toolbar usage is in the hundreds of thousands and reflects the general population of the internet.

Antezeta considerations: Data suffers all the limitations of a toolbar based self-selected sample.

Ranking.com

Ranking.com collects data via a browser toolbar which offers a “trust gauge”, site ranking information and a “browser accelerator” (search engine search box) among its features. The toolbar is only available in English for Microsoft’s Internet Explorer on Windows. The tech savvy Firefox crowd is not part of this demographic nor are Macintosh aficionados. Data methodology discussion says updates are monthly.


Ranking.com Website

Antezeta considerations: Data limited by self-selection methodology and Internet Explorer only support.

Website ranking services at a glance

| Company | Since | Primary metrics (publicly available) | Sample Size (world-wide) | Sample Selection Methodology | Click data sources | Modifiable Site Profile? | Data API | Geographic Scope |
|---|---|---|---|---|---|---|---|---|
| Alexa | 1996 | Rank and Reach | “an installed based of millions of toolbars”; 180,000 (third party estimate) | Self selection | Browser Toolbar (IE / Windows only) | Yes | Yes. Fee based. | World-wide |
| Compete | 2000 | Visitors, Engagement | 2,000,000 | Self selection | Browser Toolbar | No | Yes. | US |
| comScore | 1999 | Visitors, Rank | 2,000,000 | Self selection | User installed software / spyware from opinion square, Permission Research and potentially others. | No | No | US, Canada, UK, France, Germany and others. |
| Hitwise | 1997 | Rank, market share | 25 million Internet users for the markets they measure. | Hitwise ISP agreements | ISPs | No | No | US, UK, Australia, New Zealand, Singapore |
| Netcraft | 1995 | Rank | not disclosed | Self selection | Browser Toolbar (IE, Firefox) | No | No | World-wide |
| Nielsen//NetRatings | 1997 | | not disclosed (15,000 in Italy) | Mostly RDD? | Opt-in Panels | No | No? | World-wide |
| Quantcast | 2006 | Rank | “1.5 million U.S. Internet users, working to grow that by an additional one million in the near term”¹ | Mixed | panels, ISPs, web site publishers | Yes – demographic profile; measurement method. | Site name and ranking for top million sites can be downloaded | US focus. International data provided for publisher submitted sites. |
| Ranking.com | 1998 | Rank, Trust Gauge, Links | 215,000 | Self selection | Browser Toolbar (IE / Windows only) | Yes | No | English centric due to English only toolbar. |

¹ E-mail from Krista Thomas, Vice President, Marketing Communications, Quantcast

This data was compiled from information on the services’ respective websites, augmented in a few cases by queries to a company. Please do contact us with updates and clarifications. Last update: April 2007.

Additional Web Statistics Data Sources

The following are additional sources of web analytics data. Bear in mind that each of these sources is probably subject to limitations based on many of the issues discussed earlier in this article.

Blog measurement

BlogBabel, covering the Italian and Spanish markets, aggregates third party blog metrics to arrive at a BlogBabel ranking.

FeedBurner

FeedBurner, now a Google property, manages the RSS feeds for many blogs. RSS Feed usage statistics are available to site owners and if enabled by the site owner, via a FeedBurner API.

Technorati

Technorati’s authority ranking of a blog is classified as the number of incoming links to a blog in the last 6 months. The 100 most popular blogs are listed in order.

Click Fraud

Click fraud, also called “invalid clicks”, is a term used in pay per click marketing to refer to clicks on links intended to defraud the advertiser paying for the promotional link. One company, Click Forensics, publishes a Click Fraud Index™. As you might imagine, not everyone thinks these numbers are accurate.

Web Analytics Suppliers

By virtue of tracking most, if not all, clicks on a client’s website, suppliers of hosted web analytics systems have an excellent view of how a segment of internet sites are performing. Data sampling is limited to companies which have selected to use one of these hosting services.

Fireclick

Fireclick provides web analytics services to clients in a variety of market segments. Selected Business and Marketing Conversion metrics derived from Fireclick customer data are published on a weekly basis. Business measures include Shopping Trolley / Cart Abandonment, global conversion rates and visit information. Marketing conversion data includes e-mail, keyword and affiliate program tracking.

Shiny Stats

A Web Analytics provider based in Italy, Shiny Stats publishes a daily ranking of top sites based on visits to sites using Shiny Stats tracking. Data can be viewed by category or by a free text search on a site’s description in the Shiny Stats system.

Audiweb

Aggregate data from major portals and media properties in Italy are published by Audiweb. (In Italian. Registration required.)

Google Analytics

Google Analytics began to support benchmarking of a site’s web statistics against various industry categories in March 2008. Benchmarking data is only available for sites which have opted in to this program. (added 2008-03-06)

Showdown: How do Web Ranking Services Rank each other?

For each of the public ranking services listed in the left column, see how they rank across different services. Sort by a column to rank the rankers by a specific ranker!

Ref.RankerAlexa RankCompeteNetcraftQuantcastRanking.com
1.Alexa Logona2763911750954
2.Compete Logo278912058305916231322387
3.comScore Logo77727978013206823325535429
4.Hitwise Logo7141589201850401325437384
5.netcraft Logo4227104099175980948963
6.Nielsen//NetRatings Logo58257329013467726477076523
7.quantacast Logo276881291355172067021534
8.Ranking.com Logo19147737923018722909414293

Survey 2007-04-16. Compete: value for March 2007.

Technical Website Performance Benchmarks

At first glance, technical benchmarks may seem to be more of an IT purview. Yet marketing professionals should be concerned as well. A slow loading website makes for a frustrating user experience. Google advises that page loading time will be considered when assigning an AdWords Quality Score to a landing page, certainly a point of acute interest to search marketing practitioners. It isn’t too hard to imagine page loading time impacting organic search results as well. Many companies provide website technical performance measurement and monitoring; the Apdex (Application Performance Index) membership is a good list. Few offer publicly available benchmark data. Section added 2008-03-23.

AlertSite Market Index

AlertSite offers limited benchmarking data for the Computer, Financial Services, Information Services, Manufacturing, Retail and Telecommunications markets in the US.

Gomez Website Performance Benchmarks

Gomez calculates technical performance benchmarks for response time, availability, and consistency for selected websites in some of the most popular business sectors on the web. Data is available for Canada, China, Germany, the UK and the US. Gomez simulates typical business transactions, such as looking for a hotel room, from diverse geographic regions. To understand the exact steps Gomez measures for a given sector, you’ll probably need to contact Gomez.

Keynote Industry Benchmarks

Keynote provides a similar service to Gomez in the US and UK. Data limited to “top firms” (mostly national) is also available for the Benelux, French, German, Portuguese and UK markets. According to Keynote, sites selected

should be part of the index based on the following criteria: online brand awareness, third-party published traffic to the site, ability to be measured by Keynotes measurement computers and percentage of revenues driven via the online channel.

Conclusion

As the above overview demonstrates, there are many sources of Internet statistics. Unfortunately, most of these numbers fail to offer any degree of confidence as to their reliability – rendering them practically useless. Few of the documented methods employ scientific rigor (e.g. they exclude all Internet users who don’t use Internet Explorer, etc.); worse still, many suppliers don’t document their methods to any degree that lends itself to outside validation.

In the absence of anything better, there’s a strong temptation to think these statistics might be better than nothing. Yet, numbers derived without scientific rigor are worse than nothing as they provide a false sense of confidence in business decision making, a confidence which lacks a solid foundation based in reality.

It is our hope that industry associations like the IAB and the Web Analytics Association (WAA) will spotlight the need for modern methods and accountability in internet statistics gathering and reporting.

Feedback and Comments

We welcome your feedback on this article. If you have information to help us clarify or improve any part of it, please contact us directly. Should you require web analytics or search engine optimization consulting, please let us know.

Originally published July 2nd, 2007

  • Sean Carlos is a web marketing consultant & teacher, assisting companies with their Search (SEO + PPC = SEM), Social Media & Digital Media Measurement strategies. Sean first worked with text indexing in 1990 in a project for the Los Angeles County Museum of Art. Since then he worked for Hewlett-Packard Consulting and later as IT Manager of a real estate website before founding Antezeta in 2006. Sean is an official instructor of the Digital Analytics Association and collaborates with the Bocconi University. He is a co-author of the Treccani encyclopedic dictionary of computer science, ICT & digital media. Born in Providence, RI, USA, Sean received Honors in Physics from Bates College, Maine. He speaks English, Italian and German.
