Antezeta Web Marketing

Reflections on search engine optimization, web analytics and web marketing


Search Engine Crawlers: Who’s visiting my site and why?


Organizations implementing search engine optimization (SEO) strategies will sooner or later consider monitoring search engine crawling activity. Before a web page can appear in search results, its content has to be discovered through a crawling, or spidering, process. This is done by software which automatically navigates the web, finding and downloading web content for the search engine to parse, index and rank.

A “spider”, also known as a “crawler”, “robot” or simply “bot”, finds and retrieves web pages. Once a search engine finds your site, either through a link from another site or through a submission form, the “spider” will begin to crawl your site.

Search engine crawling activity is an early sign that SEO is functioning or a potential warning sign of site issues impeding content discovery.

  • Frequent and deep crawling indicates that a Search Engine has detected that content is usually up to date and is thought highly of by others. It also provides the first indication as to the success of SEO – crawling happens before index updates are visible on-line.
  • Insufficient crawling activity is an advance warning that SEO site improvements won’t be on-line anytime soon – perhaps site content hasn’t been updated in several years? Perhaps no other sites have linked in?

Some commercial web analytics tools which work from web server log files, such as ClickTracks, provide reports an organization can use to verify:

  • which search engine bots are finding a website’s content
  • the percentage of pages found
  • the recency of bot visits, which translates to page freshness in search engine results

Open source web analytics tools such as Analog and AWStats can provide a subset of this information.
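
For readers who want to see what such a report boils down to, here is a minimal Python sketch of the three checks listed above. It is not how ClickTracks, Analog or AWStats work internally. The record layout (timestamp, user agent, path), the function name summarize_bot_activity and the BOT_TOKENS list are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical list of user-agent tokens to watch; adjust to your priorities.
# Matching is by substring, so "Googlebot" also counts Googlebot-Image
# and Googlebot-Mobile hits.
BOT_TOKENS = ["Googlebot", "bingbot", "msnbot"]


def summarize_bot_activity(records, site_pages, bot_tokens=BOT_TOKENS):
    """Summarize crawler hits.

    records    -- iterable of (timestamp, user_agent, path) tuples
    site_pages -- set of all known page paths on the site
    Returns {token: {"pages": set, "coverage": float, "last_visit": datetime}}.
    """
    summary = {}
    for timestamp, user_agent, path in records:
        agent = user_agent.lower()
        for token in bot_tokens:
            if token.lower() in agent:
                entry = summary.setdefault(
                    token, {"pages": set(), "last_visit": timestamp}
                )
                entry["pages"].add(path)
                # Keep the most recent visit seen for this bot.
                entry["last_visit"] = max(entry["last_visit"], timestamp)
                break
    for entry in summary.values():
        # Coverage: percentage of known site pages the bot has requested.
        found = entry["pages"] & site_pages
        entry["coverage"] = (
            100.0 * len(found) / len(site_pages) if site_pages else 0.0
        )
    return summary
```

A low coverage figure or an old last_visit value for a bot you care about is the kind of advance warning described above.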

Major Search Engine bots

The following is a list of the major search engine and search engine related service bots we have come across, with a brief note on their usage. Use this information based on your website’s priorities to monitor specific crawling activity.

When tracking search engine robots, it is important not only to track the robot’s name in the user agent data, but to verify the robot’s host name as well. Some visitors will spoof the user agent of a well-known search engine robot when navigating your site.
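
One common way to check a host name is a reverse DNS lookup on the visiting IP address followed by a forward lookup on the resulting name. The Python sketch below illustrates the idea under stated assumptions: the function name verify_crawler_ip and its parameters are hypothetical, the list of allowed suffixes must be supplied by you, and the two resolver arguments exist only so the logic can be tried without network access. It is not taken from any search engine’s own documentation.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes,
                      reverse_lookup=socket.gethostbyaddr,
                      forward_lookup=socket.gethostbyname_ex):
    """Return True if `ip` reverse-resolves to a host name under one of
    `allowed_suffixes` and that host name resolves back to the same IP.

    Note: socket.gethostbyname_ex only handles IPv4 addresses.
    """
    try:
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False  # no reverse DNS record
    hostname = hostname.rstrip(".").lower()
    # Require a full domain match so "fakegooglebot.com" does not pass.
    if not any(hostname == s or hostname.endswith("." + s)
               for s in allowed_suffixes):
        return False
    try:
        addresses = forward_lookup(hostname)[2]
    except OSError:
        return False  # host name does not resolve
    # The forward lookup must lead back to the visiting IP.
    return ip in addresses
```

For example, verify_crawler_ip(visitor_ip, ["googlebot.com"]) would perform live DNS lookups for the address in visitor_ip. DNS lookups are slow, so in practice you would probably run this only for requests whose user agent claims to be a crawler, and cache the results.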

Google

  • AdsBot-Google · AdWords crawler. Reviews and measures ad landing pages.
  • Feedfetcher-Google · RSS Feed Crawler. Used for Google’s personalized homepage and Google Reader. (Sample IP: 72.14.199.xxx)
  • Googlebot · Standard Web Crawler (Sample host name: crawl-66-249-66-14.googlebot.com)
  • Googlebot-Image · Image Crawler (Sample host name: crawl-66-249-66-236.googlebot.com)
  • Googlebot-Mobile · Mobile Content Crawler
  • Google-Sitemaps · Looks for sitemaps authorization file. First seen August 2006.
  • gsa-crawler · Google Search Appliance Crawler. Traffic comes from Google customers who have deployed Enterprise search devices.
  • Mediapartners-Google · AdSense Crawler. Also feeds into standard web crawler database. (Sample host name: crawl-66-249-66-236.googlebot.com)

Note: Google Wireless Transcoder · Google Mobile’s phone browser proxy. Not a bot, as traffic is based on human page requests. Ref: http://www.google.com/xhtml.

All Google bots use a host name ending in googlebot.com. Google has documented how to verify googlebot is really from Google rather than a spoofed user agent.

Yahoo!

Due to the Microsoft – Yahoo Search Alliance deal, Yahoo! will most likely be phasing out its search crawlers, with the exception of possible research.

Bing, once known as Microsoft MSN / Windows Live

  • bingbot · The primary crawler as of 1 October 2010. User agent: Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)
  • msnbot · Main Web Crawler (Host name: livebot-(ip address).search.live.com; formerly msnbot.msn.com)
  • msnbot-media · Images & all other media. While Microsoft provides a reference page, this specific bot is, at the time of this writing, undocumented.
  • psbot · Picsearch Image crawler. Images are licensed by Ask and MSN
  • msnbot-Products · Microsoft Products & shopping bot
  • msnbot-news · MSN News bot
  • msnbot-NewsBlogs · News and blogs
  • msnbot-Academic · Academic search

As of 2010, all Microsoft bots use a host name of the form msnbot-<ip>.search.msn.com. Previously they used search.live.com. Microsoft has documented how to verify that msnbot is really from Live Search rather than a spoofed user agent.
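
The host name conventions noted so far for Google and Microsoft can be collected into a small lookup table. The Python sketch below is only an illustration: the names CRAWLER_HOST_SUFFIXES and engine_for_hostname are hypothetical, and the suffixes are simply those mentioned in this article, which may have changed since. A match is meaningful only when the host name comes from a reverse DNS lookup of the visiting IP address, never from anything the visitor sends.

```python
# Host name suffixes as noted in this article; treat them as a starting
# point and check each engine's current documentation.
CRAWLER_HOST_SUFFIXES = {
    "googlebot.com": "Google",
    "search.msn.com": "Microsoft",
    "search.live.com": "Microsoft (older host names)",
}


def engine_for_hostname(hostname):
    """Return the engine whose suffix matches `hostname`, or None."""
    host = hostname.rstrip(".").lower()
    for suffix, engine in CRAWLER_HOST_SUFFIXES.items():
        # Match the whole domain, not just the trailing characters.
        if host == suffix or host.endswith("." + suffix):
            return engine
    return None
```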

Ask

  • Ask Jeeves/Teoma · Standard Web Crawler (Sample host name: egspd42146.ask.com; the five-digit number varies.)
  • Bloglines · Ask’s blog and rss feed reader service.
  • psbot · Picsearch Image crawler. Images are licensed by Ask and MSN
 

Additional Robot Resources

Comprehensive Bot listings are maintained by several sites:

Note: Web analytics system configuration tip: Crawler reports are not usually predefined in web analytics systems. Fortunately, they are easy to add. The key is to report on “Pages by User Agent”, where the user agent is the name of the search engine bot. Embedded tag solutions may not be able to track non-human traffic, so check for this limitation before committing to a particular solution. While embedded tag solutions have some significant advantages over web log analysis systems, this, despite what you might be told, is not one of them.
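
If your analytics system cannot produce such a report, a rough equivalent can be built directly from the web server log. The Python sketch below assumes the log uses the Apache “combined” format; the names LOG_LINE and pages_by_user_agent are hypothetical, and the list of bot tokens is something you supply. Since matching is by substring and stops at the first hit, list more specific tokens (Googlebot-Image) before more general ones (Googlebot).

```python
import re
from collections import defaultdict

# Apache "combined" log format; an assumption about how the logs are written.
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def pages_by_user_agent(lines, bot_tokens):
    """Map each bot token to the set of paths it requested."""
    report = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent").lower()
        for token in bot_tokens:
            if token.lower() in agent:
                report[token].add(match.group("path"))
                break  # count each request for one bot only
    return dict(report)
```

Remember that this groups requests by the user agent the visitor claims, so spoofed visits are included unless the host name is verified separately.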


Originally published June 17th, 2006

  • Sean Carlos is a web marketing consultant & teacher, assisting companies with their Search (SEO + PPC = SEM), Social Media & Digital Media Measurement strategies. Sean first worked with text indexing in 1990 in a project for the Los Angeles County Museum of Art. Since then he worked for Hewlett-Packard Consulting and later as IT Manager of a real estate website before founding Antezeta in 2006. Sean is an official instructor of the Digital Analytics Association and collaborates with the Bocconi University. He is a co-author of the Treccani encyclopedic dictionary of computer science, ICT & digital media. Born in Providence, RI, USA, Sean received Honors in Physics from Bates College, Maine. He speaks English, Italian and German.

