Antezeta Web Marketing

Reflections on search engine optimization, web analytics and web marketing

How do search engines, such as Google, handle JavaScript and CSS?

A frequent question is “how do search engines, such as Google, handle JavaScript and CSS?”

Historically, search engines processed web pages much like an old text-only browser such as Lynx. A search engine only “saw” what the simplest browser could display: simple HTML.

Largely for this reason, search engine optimization consultants have long advocated that site developers keep site coding simple and avoid hiding navigation systems in JavaScript menus and the like.

Today the situation is more complex. Google and the other search engines will try to extract links from anything they can, from PDF files to JavaScript embedded in a web page. This process is not foolproof, however: a site should still avoid relying solely on a JavaScript-based navigation system, especially when CSS is a better choice.

We can verify that Google knowingly downloads JavaScript and CSS code when this code is packaged in an external include file. The verification process is fairly simple if you have access to your website’s web server log files. Some hosting companies, like Italy’s otherwise well thought of Aruba, don’t provide access to server logs with their shared hosting services. You might want to exclude companies which don’t fully support web analytics from consideration when choosing your hosting.

To verify that Google is downloading your CSS and/or JavaScript files, search the log for Googlebot and your file's extension, e.g.

grep Googlebot access.log | grep "\.js"

where access.log represents your web server log file, your external JavaScript file has a .js suffix and your operating system knows what grep is (if it doesn’t, try grep for Windows or change your operating system!).

You should see output that includes lines similar to this:

66.249.66.73 - - [04/May/2007:16:09:36-0700] "GET /j/newslink.js HTTP/1.1" 200 943 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

We only have one type of link to this file, a script declaration: <script type="text/javascript" src="/j/newslink.js"> (clearly tagged as JavaScript). The file is not listed in an XML sitemap.

So it appears that, yes, Googlebot is identifying and downloading our JavaScript files!
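One caveat: a plain grep "\.js" also matches .jsp or .json requests, and "Googlebot" could appear in a referrer rather than in the user agent. The sketch below is a slightly stricter filter. It assumes your log uses the Apache combined log format (request line and user agent each in double quotes, user agent last); the function name googlebot_requests is made up for this example.

```shell
#!/bin/sh
# googlebot_requests EXT [LOGFILE...]
# Print "client-IP request-path" for every log line whose user agent
# mentions Googlebot and whose requested path ends in .EXT.
# Assumption: Apache combined log format, user agent is the last quoted field.
googlebot_requests() {
    ext=$1
    shift
    awk -v ext="$ext" '
        {
            # Split on double quotes: quoted[2] is the request line,
            # quoted[n-1] is the user agent (the line ends with a quote).
            n = split($0, quoted, "\"")
            if (n < 3 || quoted[n-1] !~ /Googlebot/) next
            split(quoted[2], req, " ")       # method, path, protocol
            path = req[2]
            sub(/\?.*/, "", path)            # ignore any query string
            if (path ~ ("\\." ext "$")) print $1, path
        }
    ' "$@"
}
```

Calling googlebot_requests js access.log (or css instead of js) lists the matching requests. Piping the result through awk '{ print $1 }' | sort -u leaves just the distinct client IP addresses.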

But unfortunately, we’re not yet finished. We need to verify that Googlebot is really Googlebot and not someone pretending to be Googlebot. Why someone would want to spoof Googlebot is a subject for another post, but suffice it to say that it is easy to do with Firefox’s UserAgent Switcher or other tools.

So how can we verify that Googlebot came from Google? The easiest way is to ensure that the IP address resolves to a googlebot.com host name, and that this host name resolves back to the same IP address. This is an example using Linux:

$ host 66.249.66.73

73.66.249.66.in-addr.arpa domain name pointer crawl-66-249-66-73.googlebot.com.

$ host crawl-66-249-66-73.googlebot.com

crawl-66-249-66-73.googlebot.com has address 66.249.66.73

So we’ve verified Googlebot, from 66.249.66.73 aka crawl-66-249-66-73.googlebot.com, is actively looking for and downloading our JavaScript include files.
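The two lookups above can be wrapped in a small script so that several IP addresses can be checked in one go. This is only a sketch: the function names are invented for illustration, it assumes the host command prints "domain name pointer" and "has address" lines as in the example above, it accepts only the googlebot.com domain seen in that example, and it handles IPv4 addresses only.

```shell
#!/bin/sh
# is_google_crawler_host HOSTNAME
# Succeed if HOSTNAME ends in .googlebot.com (a trailing dot is allowed).
is_google_crawler_host() {
    case "${1%.}" in
        *.googlebot.com) return 0 ;;
        *) return 1 ;;
    esac
}

# verify_googlebot_ip IP
# Reverse-resolve IP, check the domain, then forward-resolve the name
# and succeed only if it maps back to the same IP.
verify_googlebot_ip() {
    ip=$1
    # Reverse lookup: take the name after "domain name pointer"
    name=$(host "$ip" | awk '/domain name pointer/ { print $NF; exit }')
    [ -n "$name" ] || return 1
    is_google_crawler_host "$name" || return 1
    # Forward lookup: one "has address" line must match the original IP
    host "${name%.}" |
        awk -v ip="$ip" '/has address/ && $NF == ip { found = 1 } END { exit !found }'
}
```

A typical call would be verify_googlebot_ip 66.249.66.73 && echo "looks like the real Googlebot". Checking the domain suffix alone is not enough, since anyone can publish a reverse DNS record that claims to be googlebot.com; the forward lookup is what closes that gap.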

For CSS,

grep Googlebot access.log | grep "\.css"

returns

66.249.66.73 - - [07/May/2007:09:10:07-0700] "GET /c/screen.css HTTP/1.1" 200 11056 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Naturally, Googlebot probably doesn’t crawl your include files every day, so you’ll need to check log files covering an appropriate time period. You can perform similar checks for Yahoo! Slurp (crawl.yahoo.net) and Ask Jeeves/Teoma (sample host name: egspd42146..com; the five-digit number varies).
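To see how often these fetches happen over a longer period, a per-day count across several log files can help. The sketch below assumes Apache combined format logs with timestamps like [04/May/2007:16:09:36...]; the function name googlebot_daily_counts is hypothetical, and the output is sorted as plain text rather than chronologically.

```shell
#!/bin/sh
# googlebot_daily_counts LOGFILE...
# Print "day count" for requests to .js or .css files whose user agent
# mentions Googlebot. Assumption: user agent is the last quoted field.
googlebot_daily_counts() {
    awk '
        {
            n = split($0, quoted, "\"")
            if (n < 3 || quoted[n-1] !~ /Googlebot/) next
            split(quoted[2], req, " ")       # method, path, protocol
            path = req[2]
            sub(/\?.*/, "", path)            # ignore any query string
            if (path !~ /\.(js|css)$/) next
            # Keep the dd/Mon/yyyy part of the bracketed timestamp
            if (match($0, /\[[0-9]+\/[A-Za-z]+\/[0-9]+/))
                count[substr($0, RSTART + 1, RLENGTH - 1)]++
        }
        END { for (day in count) print day, count[day] }
    ' "$@" | sort
}
```

For example, googlebot_daily_counts access.log access.log.1 covers the current log and one rotated file; the rotated file name is just an illustration and depends on how your server rotates its logs.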

Currently Google’s use of these files, beyond simple link extraction, is probably limited to search engine spam analysis.

We haven’t seen any sign that text embedded in JavaScript write statements, or hidden by CSS, actually shows up in search engine results. In other words, from a search engine optimization point of view, Google will see a web page much as the Lynx browser does. Until Google supports a partial-page noindex mechanism similar to Yahoo!’s robots-nocontent option, that’s probably the way it should be. Google has direct experience of what can go wrong when bots execute web code, albeit on improperly coded websites.

Do keep in mind that what is true today may not be true tomorrow. Google’s technical prowess should never be underestimated.

Originally published May 10th, 2007

  • Sean Carlos is a web marketing consultant & teacher, assisting companies with their Search (SEO + PPC = SEM), Social Media & Digital Media Measurement strategies. Sean first worked with text indexing in 1990 in a project for the Los Angeles County Museum of Art. Since then he worked for Hewlett-Packard Consulting and later as IT Manager of a real estate website before founding Antezeta in 2006. Sean is an official instructor of the Digital Analytics Association and collaborates with the Bocconi University. He is a co-author of the Treccani encyclopedic dictionary of computer science, ICT & digital media. Born in Providence, RI, USA, Sean received Honors in Physics from Bates College, Maine. He speaks English, Italian and German.

