How to Specify an HTML Web Document Language for good SEO


So you just wrote a stunning essay on James Joyce’s Ulysses – in Irish Gaelic. Will Yahoo!, Google, Microsoft and Ask recognize it as Gaelic, hosted as it is on your co.uk domain? Maybe. But you can give them a hint!

The trick is to use all of the available HTTP and HTML language code settings to your advantage, so that your documents aren’t falsely identified. This article considers the HTTP and HTML aspects of declaring a website’s language for search engine optimization.

Why is Language Recognition a Problem?

Search engines try to match a web searcher’s language (based on IP geolocation or user-specified preferences) to web documents when determining the best matches for a search query. In some cases, a user may specify that results be limited to a specific language. Left to their own devices, search engines have a few clues to determine the human language of a document:

  • The country domain of the site
  • The country where the site is hosted
  • The language of documents linking to the document
  • A pattern analysis of the document’s text

Each of these approaches is fraught with difficulties. Let’s consider a few:

A website’s country domain suffix: While it is likely that a site with a .de extension is in German, there is always the chance that a German company has published content in other languages for an international audience. Some country domains, such as Switzerland’s .ch, belong to countries with multiple official languages, in this case German, French, Italian and Romansh.

Where the site is hosted: Many sites host in geographic jurisdictions far from their target audience due to inexpensive hosting options.

The language of linked documents: While the internet is indeed a web of hubbed networks, it is rather common for web pages to cite an authority, even if the authority is in another language (English, for example).

Text pattern analysis: This is probably the most accurate method, especially for longer documents. While the search engines don’t reveal their approach(es), consider the Perl Lingua::Identify module, which currently recognizes 33 languages. Lingua::Identify uses a combination of four text pattern matching methods; here we quote from the Lingua::Identify documentation:

  1. Small Word Technique

    The “Small Word Technique” searches the text for the most common words of each active language. These words are usually articles, pronouns, etc, which happen to be (usually) the shortest words of the language; hence, the method name. This is usually a good method for big texts.

  2. Prefix Analysis

    This method analyzes text for the common prefixes of each active language.

  3. Suffix Analysis

    Similar to the Prefix Analysis but instead analyzing common suffixes.

  4. N-gram Categorization

    N-grams are sequences of tokens. You can think of them as syllables, but they are also more than that, as they are not only comprised by characters, but also by spaces (delimiting or separating words). N-gram data is available from Google Research.

    N-grams are a very good way for identifying languages, given that the most common ones of each language are not generally very common in others.
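
As a rough illustration of the first of these methods, here is a minimal Python sketch of a small-word detector. It is not how Lingua::Identify or any search engine is implemented: the word lists are tiny, hand-picked assumptions, and the name guess_language is hypothetical.

```python
import re
from collections import Counter

# Hypothetical, hand-picked word lists for illustration only; a real
# identifier would derive much larger lists from text corpora.
SMALL_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "it", "that"},
    "it": {"il", "la", "di", "che", "e", "un", "per", "non"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}


def guess_language(text):
    """Return the code of the language whose small words occur most often.

    Returns None when no known small word appears in the text.
    """
    # Split into lowercase runs of letters (Unicode-aware, no digits).
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter()
    for lang, vocabulary in SMALL_WORDS.items():
        counts[lang] = sum(1 for word in words if word in vocabulary)
    lang, hits = counts.most_common(1)[0]
    return lang if hits > 0 else None


print(guess_language("The cat is in the garden and it likes the sun"))
print(guess_language("Der Hund ist nicht im Garten und das ist gut"))
```

With lists this short the guess is only reliable for longer texts, which matches the quoted remark that the technique suits big texts.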

Yet, with a little effort, you can help a search engine make the right guess.

HTTP Header Options

At the web server level, the HTTP standard includes a Content-Language header. Apache server users can set it with the AddLanguage directive, e.g.

AddLanguage it .html

for documents in Italian. More elaborate logic can conditionally set this header based on files and directories, e.g. for content under the directory /it/ set the language to it. Documents under the directory /de/ should get the de setting. See ISO 639 Language Codes for a complete list of valid codes.
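
For sites where the header is set by application code rather than by the web server, the directory rule might be sketched as follows. This is only an illustration in Python: the directory table, the assumed default of en, and the function names are all hypothetical.

```python
# Hypothetical mapping from a site's top-level directory to a language code.
DIRECTORY_LANGUAGES = {"it": "it", "de": "de", "fr": "fr"}

# Assumed language for everything outside the directories listed above.
DEFAULT_LANGUAGE = "en"


def content_language_for(path):
    """Pick a language code from the first segment of a URL path."""
    first_segment = path.lstrip("/").split("/", 1)[0].lower()
    return DIRECTORY_LANGUAGES.get(first_segment, DEFAULT_LANGUAGE)


def language_header(path):
    """Return the (name, value) pair to add to the HTTP response headers."""
    return ("Content-Language", content_language_for(path))


print(language_header("/it/chi-siamo.html"))
print(language_header("/index.html"))
```

The first call yields the it value for a page under /it/, while the second falls back to the assumed default.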

HTML Documents Options

There are several options at the HTML document level. Building on the HTTP header option discussed above, there is an http-equiv meta tag, e.g.

<meta http-equiv="Content-Language" content="it" />

Warning: The sometimes-used <meta name="Content-Language" content="en" /> is incorrect!

As is the case with all http-equiv meta tags, the Content-Language value should really be set at the HTTP header level, as the meta equivalents are often ignored by the different consumers of internet information (e.g. browsers, caches, search engine robots). They’re not really as equivalent as you might think!
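
To check existing pages for the incorrect name variant, a small audit along these lines may help. It is a sketch built on Python's standard html.parser module; the class name ContentLanguageAudit and the function audit are hypothetical.

```python
from html.parser import HTMLParser


class ContentLanguageAudit(HTMLParser):
    """Collect Content-Language declarations made through meta tags."""

    def __init__(self):
        super().__init__()
        self.declared = []  # values from http-equiv="Content-Language"
        self.misused = []   # values from the incorrect name="Content-Language"

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attributes = {name: (value or "") for name, value in attrs}
        content = attributes.get("content", "").strip()
        if attributes.get("http-equiv", "").lower() == "content-language":
            self.declared.append(content)
        elif attributes.get("name", "").lower() == "content-language":
            self.misused.append(content)


def audit(html):
    """Return (correct declarations, incorrect name= declarations)."""
    parser = ContentLanguageAudit()
    parser.feed(html)
    parser.close()
    return parser.declared, parser.misused


page = (
    '<html><head>'
    '<meta http-equiv="Content-Language" content="it" />'
    '<meta name="Content-Language" content="en" />'
    '</head></html>'
)
print(audit(page))
```

Any value that ends up in the second list comes from a tag that should be rewritten with http-equiv or, better, replaced by a real HTTP header.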

The HTML syntax also provides a lang attribute that can be set on most tags. You can set it at the document level, e.g.

<html lang="it">

or for XHTML:

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">

This attribute can also be applied to blocks of HTML which are in other languages. So if a paragraph is in French, just add lang="fr" to the opening p tag:

<p lang="fr">
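
A lang value applies to everything nested inside the element until a descendant overrides it. The following Python sketch reports the effective language of each piece of text in a page; it assumes reasonably well-formed markup, and the names LangTracker and text_languages are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have content, so they are never pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class LangTracker(HTMLParser):
    """Record the effective lang value for each run of text."""

    def __init__(self):
        super().__init__()
        self.stack = []   # (tag, effective language) for each open element
        self.chunks = []  # (effective language, text) pairs

    def current_lang(self):
        return self.stack[-1][1] if self.stack else None

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        # An element inherits its parent's language unless it sets lang.
        lang = dict(attrs).get("lang") or self.current_lang()
        self.stack.append((tag, lang))

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing elements contain no text

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        for index in range(len(self.stack) - 1, -1, -1):
            if self.stack[index][0] == tag:
                del self.stack[index:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append((self.current_lang(), text))


def text_languages(html):
    tracker = LangTracker()
    tracker.feed(html)
    tracker.close()
    return tracker.chunks


page = ('<html lang="it"><body><p>Buongiorno</p>'
        '<p lang="fr">Bonjour</p><p>Ciao</p></body></html>')
print(text_languages(page))
```

In the sample page only the middle paragraph is reported as fr; the paragraphs around it inherit it from the html tag.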

Links to documents in other languages can use the hreflang attribute, added in HTML 4.0, to specify the language of the target document:

<a href="http://www.antezeta.it/" hreflang="it">Antezeta.it</a>
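
To see which links on a page already carry the attribute, something like the following Python sketch could be used. The names HreflangCollector and collect_links are hypothetical, and the second link in the sample is invented for illustration.

```python
from html.parser import HTMLParser


class HreflangCollector(HTMLParser):
    """Collect each link target together with its declared hreflang."""

    def __init__(self):
        super().__init__()
        self.links = []  # (href, hreflang or None) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        if "href" in attributes:
            self.links.append((attributes["href"], attributes.get("hreflang")))


def collect_links(html):
    collector = HreflangCollector()
    collector.feed(html)
    collector.close()
    return collector.links


page = ('<a href="http://www.antezeta.it/" hreflang="it">Antezeta.it</a> '
        '<a href="/about.html">About</a>')
for href, hreflang in collect_links(page):
    print(href, hreflang or "(no hreflang)")
```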

Guidance

Up until now, search engines have provided little if any guidance on the steps site owners can take to ensure the language of a site’s pages is properly detected. Yahoo! recently noted:

… we analyze the page content, the top level domain that your site is hosted on, and the language and region of the documents associated with your site. We also use the http-equiv=”Content-Language” tag as an important input. This tag lets you indicate on a web page what language it is should be in, for example:

<meta http-equiv="Content-Language" content="en" />

–indicates that the page is expected to be in English. However, this is not our only input and if this is incorrect, we will still try and make our best judgment on the actual language of the page.

More on Site Internationalization

We also covered site internationalization in our January 2006 article UK and US English Dialect Considerations for Site Internationalization and our September 2006 Accented Characters, Symbols and Special Characters in HTML Documents: Considerations for Search Engine Optimization, Usability and XML Feeds.


About Sean Carlos

Sean Carlos is a digital marketing consultant & teacher, assisting companies with their Search (SEO + SEA = SEM), Social Media & Digital Media Analytics strategies. Sean first worked with text indexing in 1990 in a project for the Los Angeles County Museum of Art. Since then he has worked for Hewlett-Packard Consulting and later as IT Manager of a real estate website before founding Antezeta in 2006. Sean is an official instructor of the Digital Analytics Association and collaborates with the Bocconi University. He is Chairman of the SMX Search and Social Media Conference, 13 & 14 November in Milan. He is also a co-author of the Treccani encyclopedic dictionary of computer science, ICT & digital media. Born in Providence, RI, USA, Sean received Honors in Physics from Bates College, Maine. He speaks English, Italian and German.
