To index and display web content in their search results, search engines first need to be able to find that content. The first generation of Internet search engines relied on webmasters to submit a site’s primary URL, the site’s “home page”, to the search engine’s crawler database. The crawler would then follow each link it found on the home page. Problems soon emerged – much site content can be inadvertently hidden from crawlers, such as content behind drop-down lists and forms.
Update: Google Sitemaps was renamed Google Webmaster Tools on 5-Aug-2005 to better reflect its more expansive role.
Fast forward to 2005. Search engine crawlers have improved their ability to find sites through links from other sites – site submission is no longer relevant. Yet many web sites are still coded in ways which impede automatic search engine discovery of the rich content often available in larger, complex web sites.
As part of its mission to index the world’s information, Google’s engineers, vexed by the problem of web site content discovery, came up with an elegant solution – like most elegant solutions, simple in nature. They asked themselves, “What if we provide a mechanism for webmasters to give us a list of crawlable URLs?”
Google engineers settled on an XML file format in which webmasters provide a list of their site’s publicly available files along with each file’s last modified date and time. Google thus has a complete view of which pages it can crawl and how fresh they are – extremely useful information for optimizing its crawling.
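A minimal sitemap file in this format might look roughly like the sketch below. The URL, date, and optional fields are placeholders, and the 0.84 schema namespace shown is the one Google documented when the format launched:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://www.mysite.com/</loc>
    <lastmod>2005-08-01</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```

Only the loc element is required for each URL; lastmod is what gives Google its view of content freshness.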
Google has made a Python sitemap generator script available to site administrators. As part of a move to encourage webmasters, and the greater search industry, to use this format, Google released the generator under an open source license; the source code is available from the SourceForge open source project repository.
Many third parties have developed programs with a graphical user interface which accomplish the same task for those uncomfortable with an operating system’s command line. Modules are also available for many popular dynamic page languages (PHP, ASP) and CMS systems. The resultant XML file, which can and should be compressed, is placed on the site’s server. The file should be regenerated every time site content is updated.
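The generate-and-compress step is simple enough to sketch in a few lines of Python. This is an illustrative sketch, not Google’s generator script; the page list and URLs are hypothetical:

```python
import gzip
from xml.sax.saxutils import escape

def build_sitemap(pages):
    """Build a sitemap XML document from (url, last-modified-date) pairs."""
    entries = []
    for loc, lastmod in pages:
        entries.append(
            "  <url>\n"
            f"    <loc>{escape(loc)}</loc>\n"
            f"    <lastmod>{lastmod}</lastmod>\n"
            "  </url>"
        )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">\n'
        + "\n".join(entries)
        + "\n</urlset>\n"
    )

# Hypothetical site content; a real generator would walk the server's files.
pages = [("http://www.mysite.com/", "2005-08-01"),
         ("http://www.mysite.com/about.html", "2005-07-15")]
sitemap_xml = build_sitemap(pages)
compressed = gzip.compress(sitemap_xml.encode("utf-8"))  # -> sitemap.xml.gz
```

Rerunning a script like this after every content update keeps the lastmod data fresh for the crawler.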
Google also supports other structured formats for sitemaps. These data formats are not to be confused with an HTML sitemap page a site may have; HTML formats are not supported.
Google won’t start looking for a sitemap until it is told to do so. This requires a Google account registration, which also provides access to the Google webmaster dashboard. Once logged in, the user is presented with a classic “Specify a URL” form; the newly uploaded sitemap.xml.gz should be entered as a complete URL, e.g. http://www.mysite.com/sitemap.xml.gz. After the file has been found, a minimalist dashboard appears which provides information on Googlebot downloads of the sitemap file. A submit option allows webmasters to actively notify Google of updated sitemap files; this is superfluous for frequently crawled sites.
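The notification for an updated sitemap can also be sent programmatically as a simple HTTP GET against Google’s ping endpoint. A minimal sketch, assuming a hypothetical sitemap location:

```python
from urllib.parse import quote

# Hypothetical sitemap location on the site's server.
sitemap_url = "http://www.mysite.com/sitemap.xml.gz"

# The sitemap URL is passed, URL-encoded, as a query parameter.
ping = ("http://www.google.com/webmasters/sitemaps/ping?sitemap="
        + quote(sitemap_url, safe=""))

# urllib.request.urlopen(ping) would send the actual notification.
```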
To get the most out of the Google Dashboard, Google requires site owners to prove their identity by placing an empty file on their web server. Google provides the filename in instructions which appear on the Verify tab. This process is fairly simple, and in many cases the verification is instantaneous, although there have been delays due to Google server overload. Google also checks for proper 404 HTTP status handling (File not found) to ensure it really found the correct file and not just a generic error page tagged with a 200 HTTP success status. If a site is improperly handling File not found statuses, the Google Dashboard won’t validate it. The Verify tab only appears when a site needs to be verified (and occasionally, reverified).
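Site owners can run the same soft-404 check themselves before attempting verification: request a page that cannot exist and see whether the server answers with a real 404 or a “200 OK” error page. A sketch of the idea, with a made-up probe filename:

```python
import urllib.request
import urllib.error

def probe_url(base_url):
    """Build a URL for a page that should not exist on the site."""
    return base_url.rstrip("/") + "/this-page-should-not-exist-42.html"

def returns_proper_404(base_url):
    """Fetch the probe URL and report whether the server answers with a
    real 404 status instead of a '200 OK' error page (a soft 404)."""
    try:
        urllib.request.urlopen(probe_url(base_url))
        return False  # 200 OK for a missing page: soft 404, won't verify
    except urllib.error.HTTPError as err:
        return err.code == 404
```

A site where `returns_proper_404` comes back False is likely to fail Google’s verification check for the same reason.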
Stats Tab – The Google Dashboard.
Several months after Google released the first Google Sitemaps functionality, they began to add additional information to the sitemaps control panel. Thus was born “The Google Dashboard”, a.k.a. the Stats Tab. It currently contains the sections described below.
Search Query stats
The Search Query stats panel may contain the following information, depending on the completeness of Google’s site profile for a given site. This appears to be related to the overall “priority” (read: PageRank) Google has assigned a site. In some cases, the message “Data is not available at this time.” appears; in others, the category doesn’t even appear.
Available categories include:
- Top 20 search queries
- Top 20 search query clicks
- Top 20 searches from mobile devices
- Top 20 searches from the mobile web
The top search queries indicate which user searches returned pages from a site – apparently within the top 100 results. A first question for those wanting to optimize their site’s search engine visibility is “Are my pages even showing up for my top keywords?”. Until recently, sites had to rely on tools such as Web Position to answer this question. Unfortunately, many didn’t realize that the tool, promoted by a leading Web Analytics company, violates Google’s Terms of Service. A second question is, “For pages appearing in the top 10 results, are users clicking through to my site?” The answer to this question is found in the second column, “Top search query clicks”. If users are not clicking through, why?
Much of the data is nuanced and will require some reflection. For example, a search phrase may appear on the left with an average top position of 31, while the same phrase appears on the right with an average top position of 28. Why? It would appear that users clicked on that phrase when it appeared within the first three pages of results (most users probably have not changed Google’s default of 10 listings per page), but not when it appeared on the fourth page or beyond. Similarly, phrases may appear in the click-through list on the right but not in the top search list on the left – because they weren’t among the top 20 queries which displayed pages from the site, but elicited click-throughs nonetheless.
The query stats offer a wealth of information, both through the data which is present and that which is not. Sites which are serious about search engine marketing should be devoting significant analysis time to this data.
The data can be downloaded for use in spreadsheet programs such as OpenOffice’s free multiplatform Calc, or for import into a database, allowing analysts to track progress over time.
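The downloaded file is plain CSV, so even a few lines of Python can slice it before it ever reaches a spreadsheet. The column layout below is hypothetical – the real export’s headers may differ:

```python
import csv
import io

# Hypothetical sample matching the shape of a query-stats export;
# the actual column names in Google's CSV may differ.
sample = """query,average top position
google sitemaps,4
xml sitemap generator,12
"""

rows = list(csv.DictReader(io.StringIO(sample)))

# Rank queries by where the site's pages appear in the results.
by_position = sorted(rows, key=lambda r: int(r["average top position"]))
```

Appending each week’s export to a database table is the simplest way to track movement over time.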
Google says the data covers the “last three weeks”. The update frequency appears to be at least daily.
Page Crawl Status
The crawl stats section is extremely useful for showing a rough percentage of the site crawled and for uncovering errors which prevent content indexing in the first place. Each errors link opens the Errors tab, which provides a detailed view of pages and sources of crawling errors. Errors attributed to the sitemap file will either indicate a real problem with the site or simply a problem in generating the sitemap (did you include content prohibited by the robots.txt file?). Errors found through web crawling will usually be due to old or incorrect links on external sites. A short-term solution is to add 301 permanent redirects to current content. The longer-term solution is to ask webmasters to correct their links to your site.
Page ranking distribution
This section shows the distribution of your pages among three gradations of PageRank: High, Medium and Low. It is probably too high-level to be of much use. Keep in mind that the update frequency is not clear – the toolbar PR is only updated every few months.
Your page with the highest PageRank (monthly)
Indicates the top page URL for a given month. In many cases, this will be your home page. Interestingly enough, the actual PR is not included. This information does not currently seem to appear for sites with a PR of 3 or less.
This panel is devoted to information processed by the Google indexer.
A Content section provides the distribution of processed files by MIME content type.
The text type encoding distribution is also displayed, e.g. ASCII and UTF-8.
While the technical details probably won’t interest most, some site data will include the top 20 single keywords found in pages (a type of keyword density analysis over the entire site) and, perhaps even more interesting, in external links to the site. Currently words like “it’s” and “with” are showing up – stop word lists, if used, seem to be limited. This data, while interesting, is already available to webmasters, who can perform this type of analysis on their own pages. For many who are new to keyword analysis, the information may be an eye-opener. Do keep in mind that search engine algorithms are interested in more complex linguistic properties such as word order and word proximity. As with the search query data, these lists can be saved in CSV format.
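The single-keyword analysis described above is straightforward to reproduce on your own pages. A minimal sketch, with a small made-up stop-word list (note that without one, words like “it’s” and “with” dominate, just as in Google’s panel):

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "with", "it's"}

def top_keywords(text, n=20):
    """Count single-word frequencies in a page's text, skipping stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

page = "Sitemaps help search engines crawl a site. A sitemap lists the site URLs."
print(top_keywords(page, 3))
```

As the article notes, this is only single-word density; the engines themselves also weigh word order and proximity, which a simple counter like this ignores.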
One panel is dedicated to advanced search queries already available through the standard Google interface. Here the queries are preset as links with the site’s name:
- Indexed pages in your site: site:www.mysite.com/
- Pages that refer to your site’s URL: allinurl:www.mysite.com/
- Pages that link to your site: link:www.mysite.com/
- The current cache of your site: cache:www.mysite.com/
- Information we have about your site: info:www.mysite.com/
- Pages that are similar to your site: related:www.mysite.com/
Keep in mind that some of these queries do not work as expected. For example, Google’s link query command only shows a small subset of inbound links known to Google. It would be better to provide no information rather than misleading information.
Google has added information about the current robots.txt file it has found, along with a simulation of robots.txt parsing. Many sites, including famous corporate sites, have botched their robots.txt deployment.
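You can simulate robots.txt parsing yourself with Python’s standard-library parser – a quick sanity check before trusting what the dashboard reports. A sketch using an inline example file:

```python
import urllib.robotparser

# Parse an inline robots.txt rather than fetching one from a server.
rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# Check what a crawler is allowed to fetch under these rules.
allowed = rp.can_fetch("Googlebot", "http://www.mysite.com/index.html")
blocked = rp.can_fetch("Googlebot", "http://www.mysite.com/private/data.html")
```

If a test like this disagrees with what you intended, the robots.txt deployment is botched – which, per the above, puts a site in famous company.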
De facto Google Webmaster Dashboard
While it is not clear that Google originally conceived of the sitemap management panel as a Google Webmaster Site Dashboard, it is becoming that. For sites actively managing their search engine optimization, signing up for Google Sitemaps is a must.
Google has launched Sitemaps as an open standard – the map generation script is available from the noted open source repository, SourceForge. Will the other search engines follow?
Yahoo!, me too!
Yahoo! responded by adding a brief note to their web site submission form, indicating that sites could also provide a text file of site URLs – a fully manual process. Fortunately, Yahoo! is starting to show signs of a desire to offer search engine optimization resources similar to Google’s Webmaster Central. While Yahoo! has not yet provided a full-blown webmaster console, in a related article we cover some of the very nice things Yahoo! provides for Search Engine Optimization using Yahoo! Site Explorer.
MSN (Windows Live), Ask, A9 / Alexa: not yet.
None of the other major search engines has developed either a webmaster dashboard or a sitemaps function. Let’s hope this changes soon.
Official Site Maps Generation Software
- Google Sitemap Standard Adopted by Leading International Search Engines
- Google Zeitgeist has a Sibling, Google Trends
- Domain & URL Strategies for Multilingual & Multinational Sites
- Microsoft / Yahoo Search Alliance: a closer look at Bing Webmaster Tools & a free data export tool
- Now there are 6 ways to keep website content out of search engines