The Hillwatch Analytics Brief Series for Public Sector Managers

Getting Down to the Right Numbers, Part 2: Itsy Bitsy Spiders

In this brief we will:

  • Explain how automated (non-human) visitors distort the volume of web site traffic;
  • Demonstrate how web traffic software typically underreports the presence of automated traffic; and
  • Show how sites typically have one fifth to one third less traffic than they realize ... and often report.

Not everyone visiting your site is a real, living, breathing human being.  A surprisingly high number are 'programs' - automated visitors otherwise known as spiders, bots, link checkers, crawlers, etc.  While the majority of these automated visitors are sent by search engines, an increasing number are from e-mail address harvesters, accessibility checkers and, unfortunately, hackers.  Non-human traffic on the web is surprisingly high, and much higher than commonly recognized.  Additionally, as the competition for search engine dominance intensifies, we have noticed a distinct increase in the percentage of traffic driven by search engines of all stripes.

However, web traffic software programs often report visitor traffic with spider and bot traffic included.  The details surrounding non-human traffic are usually at the back of the reports, if present at all.  It is up to the end users of the Web traffic software both to recognize this and to decide whether to adjust their totals downwards to provide a more accurate visit count.  Doing so takes additional effort and processing time, and busy IT departments cannot always make it a priority.

 

Returning to the case study Web site used in our previous Analytics Brief, Table II illustrates the amount of additional traffic generated by automated visitors.  In this example, fully 14% of all traffic is non-human, a substantial figure.  Note that our figures for the non-human visitors are based directly on those identified by the leading Web traffic analysis software.  

 

Table II: Comparison data for visits, by visitor type

Metric                                                                     Traffic data
# of visitor sessions over six-month period                                381,315
# of visitor sessions less non-human visitors from Web traffic software    327,531
% of traffic driven by non-human visitors                                  14%

 

The issue of over-reporting due to non-human visitors becomes a much larger problem when comparing traffic data between organizations or between branches within an organization.  Comparability of data is only feasible if everyone agrees to a common standard - either non-human visits are included or excluded, but not both.  Further, since new automated visitors are constantly being "unleashed" due to the fluid nature of the Internet, there must be agreement as to which non-human visitors are to be excluded or included.

The spider outbreak is larger than reported!

 

Web traffic software companies create the impression that their software identifies all (or almost all) of the non-human visitors to a Web site.  Unfortunately, our experience clearly indicates that this is not the case. 

 

Hillwatch maintains and updates a filter list of non-human visitors on a monthly basis, which we use to filter Web traffic data for our analytics clients.  This involves systematically cataloguing all the search engine bots and principal e-mail harvesting programs, as well as periodically "sore thumbing" individual client logs to identify traffic patterns that suggest non-human traffic.  Whenever such patterns are found we trace the visitors back to their source.  If the source is found to be generating automated visits, it is then added to our filter list.
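As a rough illustration of the process described above, the following Python sketch applies a filter list to session records and then flags the remaining heavy visitors for manual tracing. It is a minimal sketch under stated assumptions, not Hillwatch's actual method: the patterns in FILTER_PATTERNS, the session fields ("ip", "user_agent"), the function names and the max_sessions_per_ip threshold are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical filter list: illustrative user-agent fragments only,
# NOT the actual Hillwatch filter list.
FILTER_PATTERNS = ["googlebot", "slurp", "crawler", "spider", "linkchecker"]
FILTER_RE = re.compile(
    "|".join(re.escape(p) for p in FILTER_PATTERNS), re.IGNORECASE
)


def is_filtered(user_agent):
    """Return True if the user agent matches any entry on the filter list."""
    return bool(FILTER_RE.search(user_agent))


def scrub(sessions):
    """Keep only sessions whose user agent matches no filter pattern."""
    return [s for s in sessions if not is_filtered(s["user_agent"])]


def sore_thumbs(sessions, max_sessions_per_ip=100):
    """List IP addresses that pass the filter yet have unusually many sessions.

    The threshold is an assumed, arbitrary cut-off. Flagged addresses are
    candidates for manual tracing, not confirmed automated visitors.
    """
    counts = Counter(s["ip"] for s in scrub(sessions))
    return sorted(ip for ip, n in counts.items() if n > max_sessions_per_ip)
```

A source confirmed as automated after tracing would then be added to FILTER_PATTERNS, so that later runs of scrub() exclude it.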

 

Through this process we have routinely found that 25% to 35% of all traffic to Web sites will be non-human visitors - and web traffic software consistently and grossly underreports this information.  This is underscored in Table III below, which presents the same traffic data by visitor type for the case-study Web site.  Note the significant reduction in visitor sessions following "scrubbing" of the data using the Hillwatch non-human visitor filter list - 30% for this case study!

 

Table III: Comparison data for visits, by visitor type

Metric                                                                     Traffic data
# of visitor sessions over six-month period                                381,315
# of visitor sessions less non-human visits from Web traffic software      327,531
# of visitor sessions less all non-human visits based on Hillwatch filter  265,389
% of traffic driven by non-human visitors                                  30%
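The percentages quoted for the case study can be checked directly from the session counts in the table above. The short Python sketch below does the arithmetic. The variable and function names are ours and purely illustrative; only the three session counts come from the brief.

```python
# Session counts taken from the case-study table.
total_sessions = 381_315          # all visitor sessions over six months
after_software_filter = 327_531   # less non-human visits found by the software
after_hillwatch_filter = 265_389  # less all non-human visits per Hillwatch filter


def non_human_share(total, remaining):
    """Fraction of total sessions removed as non-human."""
    return (total - remaining) / total


software_share = non_human_share(total_sessions, after_software_filter)
hillwatch_share = non_human_share(total_sessions, after_hillwatch_filter)

# Non-human sessions the software left in but the filter list removed.
missed_by_software = after_software_filter - after_hillwatch_filter

print(f"Identified by software:   {software_share:.1%}")   # 14.1%
print(f"Identified by filter:     {hillwatch_share:.1%}")  # 30.4%
print(f"Sessions missed by software: {missed_by_software:,}")  # 62,142
```

In other words, for this site the software's own report accounts for a little under half of the non-human sessions that the filter list removes.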

 

Failure to properly scrub log data of all non-human visitors inevitably results in inflated traffic numbers and spurious metric reports.  Worse, it skews and/or invalidates attempts at trend or ratio analysis that site operators would typically use to measure and improve site performance.

 

If a boutique analytics firm such as ours can fix this, it is a little unclear why large web traffic software firms cannot. This may be due to the automated nature of their software, which cannot recognize all these entities on the web. In some cases, the software appears American-centric: spiders from the main US-based search engines are identified, but many of the foreign ones are missed. As many Canadian sites are bilingual and attract French-language bots, a significant volume of automated traffic is overlooked. However, the bottom line is that most sites are receiving somewhere in the range of one fifth to one third less traffic than they realize ... and often report.

 

Previous Hillwatch Web Analytics Brief: Getting Down to the Right Numbers, Part 1: Hits are Good in Hockey, not so Good on the Web

 

 

 


Hillwatch Inc., suite 200, 334 MacLaren St., Ottawa ON K2P 0M6 tel: (613) 238-8700 fax: (613) 234-9823