So you just wrote a stunning essay on James Joyce’s Ulysses – in Irish Gaelic. Will Yahoo!, Google, Microsoft and Ask recognize it as Gaelic, hosted as it is on your .co.uk domain? Maybe. But you can give them a hint!
The trick is to use all of the HTTP and HTML language code settings available to your advantage to ensure your documents aren’t falsely identified. This article considers HTTP and HTML aspects of website internationalization for search engine optimization.
Why is Language Recognition a Problem?
Search engines try to match a web searcher’s language (based on IP geolocation or user-specified preferences) to web documents when determining the best matches for a search query. In some cases, a user may specify that results be limited to a specific language. Left to their own devices, search engines have only a few clues to determine the human language of a document:
- The country domain of the site
- The country where the site is hosted
- The language of documents linking to the document
- A pattern analysis of the document’s text
Each of these approaches is fraught with difficulties. Let’s consider a few:
A website’s country domain suffix: While it is likely that a site with a .de extension is in German, there is always the chance that a German company has published content in other languages for an international audience. Some country domains, such as Switzerland’s .ch, belong to countries with multiple official languages, in this case German, French, Italian and Romansh.
Where the site is hosted: Many sites host in geographic jurisdictions far from their target audience due to inexpensive hosting options.
The language of linked documents: While the internet is indeed a web of hubbed networks, it is rather common for web pages to cite an authority, even if the authority is in another language (English, for example).
Text pattern analysis: This is probably the most accurate method, especially for longer documents. While the search engines don’t reveal their approach(es), consider the Perl Lingua::Identify module, which currently recognizes 33 languages. Lingua::Identify uses a combination of four text pattern matching methods; here we quote from the Perl Lingua::Identify documentation:
- Small Word Technique
The “Small Word Technique” searches the text for the most common words of each active language. These words are usually articles, pronouns, etc, which happen to be (usually) the shortest words of the language; hence, the method name. This is usually a good method for big texts.
- Prefix Analysis
This method analyzes text for the common prefixes of each active language.
- Suffix Analysis
Similar to the Prefix Analysis but instead analyzing common suffixes.
- N-gram Categorization
N-grams are sequences of tokens. You can think of them as syllables, but they are also more than that, as they are not only comprised by characters, but also by spaces (delimiting or separating words). N-gram data is available from Google Research.
N-grams are a very good way for identifying languages, given that the most common ones of each language are not generally very common in others.
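To make the n-gram idea concrete, here is a minimal Python sketch. It is not how Lingua::Identify or any search engine actually works; the three sample sentences are toy training data invented for illustration (real profiles are built from large corpora), and the names char_ngrams and guess_language are our own.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams, keeping spaces so word boundaries matter."""
    # Lowercase, collapse whitespace, and pad with spaces so that the
    # beginnings and ends of words produce n-grams as well.
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

# Toy training samples (assumption: far too small for real-world use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog runs to the house",
    "it": "la volpe veloce salta sopra il cane pigro e poi il cane corre verso la casa",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann läuft der hund zum haus",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}

def guess_language(text):
    """Return the language whose profile shares the most n-grams with the text."""
    grams = char_ngrams(text)
    scores = {
        lang: sum(min(count, profile[gram]) for gram, count in grams.items())
        for lang, profile in PROFILES.items()
    }
    return max(scores, key=scores.get)

print(guess_language("il cane e la casa"))
print(guess_language("der hund und das haus"))
```

With such tiny profiles the guess is only reliable for text that resembles the samples, which is also why pattern analysis works better on longer documents.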
Yet, with a little effort, you can help a search engine make the right guess.
HTTP Header Options
At the web server level, the HTTP standard includes a Content-Language header. Apache users can set it with the AddLanguage directive, e.g.
AddLanguage it .html
for documents in Italian. More elaborate logic can conditionally set this header based on files and directories, e.g. for content under the directory /it/, set the language to it. Documents under the directory /de/ should get the de setting. See ISO 639 Language Codes for a complete list of valid codes.
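The directory-based logic can be sketched in a few lines of Python. This is not Apache configuration, only an illustration of the decision being made; the directory map, the English default and the function names are assumptions made for the example.

```python
# Hypothetical mapping from top-level directory to ISO 639 language code.
DIRECTORY_LANGUAGES = {"it": "it", "de": "de", "fr": "fr"}
DEFAULT_LANGUAGE = "en"  # assumption: the rest of the site is in English

def content_language_for(path):
    """Return the Content-Language value for a URL path such as '/it/index.html'."""
    # The first non-empty path segment is treated as the language directory.
    segments = [segment for segment in path.split("/") if segment]
    first = segments[0].lower() if segments else ""
    return DIRECTORY_LANGUAGES.get(first, DEFAULT_LANGUAGE)

def response_headers(path):
    """Build the headers a server might send for this path."""
    return [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Language", content_language_for(path)),
    ]

print(response_headers("/it/index.html"))
print(response_headers("/about.html"))
```

Whatever mechanism your server offers, the point is the same: the header should follow the language of the content actually served from each directory.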
HTML Documents Options
There are several options at the HTML document level. Building on the HTTP header option discussed above, there is an http-equiv meta tag, e.g.
<meta http-equiv="Content-Language" content="it" />
The sometimes used <meta name="Content-Language" content="en" /> is incorrect!
As is the case with all http-equiv meta tags, the Content-Language value should really be set at the HTTP header level, as the meta equivalent tags are often ignored by the different consumers of internet information (e.g. browsers, caches, search engine robots). They’re not really as equivalent as you might think!
The HTML syntax also provides a lang attribute that can be applied to most tags. You can set it at the document level, e.g.
<html lang="en-US">
or for XHTML:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en-US" lang="en-US">
This attribute can also be applied to blocks of HTML which are in other languages. So if a paragraph is in French, just add lang="fr" to the opening p tag:
<p lang="fr">Ceci est un paragraphe en français.</p>
Links to documents in other languages can use the hreflang attribute, added in HTML 4.0, to specify the language of the target document:
<a href="http://www.antezeta.it/" hreflang="it">Antezeta.it</a>
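If you want to check which language hints a page actually declares, a small script can collect them. The following is a sketch using Python's standard html.parser module; the class name, the language_hints helper and the sample page are invented for illustration, and it only reports what is declared, not what a search engine will conclude.

```python
from html.parser import HTMLParser

class LanguageHintParser(HTMLParser):
    """Collect the language declarations found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.html_lang = None      # lang (or xml:lang) on the html element
        self.meta_language = None  # http-equiv="Content-Language" meta tag
        self.block_langs = []      # (tag, lang) for other elements with lang
        self.link_langs = []       # (href, hreflang) for links

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html":
            self.html_lang = attrs.get("lang") or attrs.get("xml:lang")
        elif tag == "meta":
            # Only the http-equiv form counts; name="Content-Language" is ignored.
            if (attrs.get("http-equiv") or "").lower() == "content-language":
                self.meta_language = attrs.get("content")
        elif tag == "a" and "hreflang" in attrs:
            self.link_langs.append((attrs.get("href"), attrs["hreflang"]))
        elif "lang" in attrs:
            self.block_langs.append((tag, attrs["lang"]))

def language_hints(html):
    """Parse an HTML string and return its declared language hints as a dict."""
    parser = LanguageHintParser()
    parser.feed(html)
    return {
        "html_lang": parser.html_lang,
        "meta_language": parser.meta_language,
        "block_langs": parser.block_langs,
        "link_langs": parser.link_langs,
    }

# Hypothetical sample page combining the options discussed above.
SAMPLE = """<html lang="en-US">
<head><meta http-equiv="Content-Language" content="en" /></head>
<body>
<p lang="fr">Ceci est un paragraphe en français.</p>
<a href="http://www.antezeta.it/" hreflang="it">Antezeta.it</a>
</body></html>"""

print(language_hints(SAMPLE))
```

A page where html_lang and meta_language disagree, or where both are missing, is a good candidate for a closer look.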
Up until now, search engines have provided little if any guidance on the steps possible to ensure the language of a site’s pages is properly detected. Yahoo! recently noted:
… we analyze the page content, the top level domain that your site is hosted on, and the language and region of the documents associated with your site. We also use the http-equiv="Content-Language" tag as an important input. This tag lets you indicate on a web page what language it is should be in, for example:
<meta http-equiv="Content-Language" content="en" />
–indicates that the page is expected to be in English. However, this is not our only input and if this is incorrect, we will still try and make our best judgment on the actual language of the page.
More on Site Internationalization
We also covered site internationalization in our January 2006 article UK and US English Dialect Considerations for Site Internationalization and our September 2006 Accented Characters, Symbols and Special Characters in HTML Documents: Considerations for Search Engine Optimization, Usability and XML Feeds.