Accented Characters, Symbols and Special Characters in HTML Documents: Considerations for Search Engine Optimization, Usability and XML Feeds.


One issue many international webmasters face is how to properly manage documents written in languages containing accented and other special, non-English characters. Does it matter how they are written? Do HTML documents need to contain both accented and non-accented words to be found in search engines?

Continuing our series on website internationalization for search engine visibility, we’ll take a look at how special characters can be specified in a document and how these characters are managed by search engines such as Google, Yahoo, Ask and Microsoft’s MSN.

In the early days of computing, engineers mapped each of the letters of the Latin alphabet used by the English language to a specific numeric code. This mapping became known as the ASCII character set. Unfortunately, no provision was made for the accented and other special characters found in the many languages which share the Latin alphabet.

Eventually, various computer hardware manufacturers added support for special characters, each using a different mapping system. Unfortunately, these mappings are generally not compatible from one system to another. This problem is still seen today when strange characters appear in text files, messages and web pages viewed on a computer system different from the one on which they were written.
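The incompatibility is easy to reproduce. The following Python sketch decodes one and the same byte under three legacy mappings. The selection of code pages (Windows-1252, Mac OS Roman and IBM code page 437) is our own choice for illustration, not a list taken from any particular vendor.

```python
# One byte, three legacy code pages: the stored number is identical,
# but each system maps it to a different character.
RAW = bytes([0xE8])

# Illustrative selection of legacy encodings (an assumption of this sketch).
CODE_PAGES = ["cp1252", "mac_roman", "cp437"]


def decode_everywhere(raw):
    """Return the text each code page produces for the same raw bytes."""
    return {name: raw.decode(name) for name in CODE_PAGES}


for name, text in decode_everywhere(RAW).items():
    print(f"{name:10} -> {text}")
```

Only the first mapping yields the è an Italian author on Windows would have intended. The other two systems display an unrelated character.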

Tips for Inserting Special Characters in HTML Documents

Websites containing pages in languages other than English need to pay particular attention to how special characters are managed. Correct character management impacts both site usability and search engine optimization.

Several approaches are available, all of which are compatible with search engines, as we will see later. They can be grouped as:

  1. Avoid the use of special characters.
  2. Insert characters directly from the keyboard.
  3. Use HTML Entity References.

Avoid the use of Special Characters.

Instead of using an accented character, the accent is placed after the character, e.g. sara’ or sara` instead of sarà. This approach is often seen in Italy. While this approach is fine, use of the proper accented characters can give a document a more professional look.

Insert Characters Directly from the Keyboard.

Often website content is copied into HTML from word processing software, such as OpenOffice Writer, or inserted directly in an HTML form. In these situations, special characters will often be stored in an operating-system-specific encoding. If the correct character encoding is not specified in the web page or by the web server, a user on a different operating system may end up seeing lots of strange characters.

The solution is to ensure that web pages specify the character encoding used in the page. The best approach is to do this at the web server level. Apache provides the AddCharset directive for this purpose. A lesser approach is to add a meta tag in the HTML page:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1252" />

This tag should appear in the <head> section, before other tags such as the <title>, which may contain special characters. Microsoft’s developer network lists the common character set values.
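To see why the declaration matters, consider this minimal Python sketch of what a client does with it. The helper charset_from_content_type and its utf-8 fallback are hypothetical names and choices of ours. The parsing relies on the standard library's email.message module, which understands Content-Type values.

```python
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Extract the charset parameter from a Content-Type value.

    The fallback default is an assumption of this sketch.
    """
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset() or default


# Bytes as a Windows editor would store the word in windows-1252.
page_bytes = "sarà".encode("windows-1252")

# A client that honours the declaration recovers the original text.
declared = charset_from_content_type("text/html; charset=windows-1252")
correct = page_bytes.decode(declared)

# A client that ignores it and assumes Mac OS Roman shows another character.
wrong = page_bytes.decode("mac_roman")

print(declared, correct, wrong)
```

The bytes never change. Only the declared charset tells the reader's software how to turn them back into the intended characters.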

Use HTML Entity References.

The best approach is to use a special notation to specify special characters in HTML. This notation uses basic ASCII characters to refer to special characters, removing the problems associated with getting an HTML document’s character set encoding right. The basic notation is called a numeric character reference. Every special character is specified using a prefix composed of the ampersand and the number/hash sign, &#, a 3 or 4 digit number to indicate the character of interest, and a semicolon, ;, as a suffix. Thus, è is represented as &#232;. Some of the numeric character references have corresponding character entity values, e.g. è can be written as &egrave; (egrave replaces #232). Similarly, é can be written either as &#233; or as &eacute;.
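Numeric character references do not need to be typed by hand. As a minimal sketch, Python's standard library can generate them and can decode both notations. The function name to_numeric_references is our own.

```python
import html


def to_numeric_references(text):
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


encoded = to_numeric_references("perché è così")
print(encoded)  # perch&#233; &#232; cos&#236;

# Both notations decode to the same characters.
decoded = html.unescape("&#232; &egrave; &#233; &eacute;")
print(decoded)  # è è é é
```

The number in each reference is simply the character's code point, which is why the result is pure ASCII and independent of the page's declared encoding.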

While character entity values are much easier to remember and to read, we strongly recommend the use of numeric character references to avoid several potential problems:

  • Not all of the character entity values which are part of the HTML 4.0 standard are recognized by all of the software used on the World Wide Web. This is particularly true of newer symbols such as that for the Euro, €.
  • Much HTML content is reused in XML format files, such as RSS feeds and sitemaps. The XML standard only recognizes 5 character entities (&quot;, &amp;, &apos;, &lt;, &gt;), one of which, &apos;, is not even part of the HTML 4.0 standard.
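The XML restriction can be checked with a few lines of Python. This sketch uses the standard library's ElementTree parser as a stand-in for a feed reader. The helper parses_as_xml and the sample title elements are hypothetical.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(fragment):
    """Return True if the fragment is accepted by a plain XML parser."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


named = "<title>Caff&egrave;</title>"    # HTML character entity
numeric = "<title>Caff&#232;</title>"    # numeric character reference
apostrophe = "<title>L&apos;Aquila</title>"  # one of XML's five entities

print(parses_as_xml(named))       # False: undefined entity in XML
print(parses_as_xml(numeric))     # True
print(parses_as_xml(apostrophe))  # True
```

Content written with numeric references can therefore be copied into a feed unchanged, while named entities must be converted first.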

Search Engine Preferences

Search engines are designed to process any type of HTML available on the World Wide Web. As long as your website users see the right characters on Windows, Macintosh and Linux computers, you can be fairly certain search engines will not have any difficulty with how you have used special characters in your HTML. Yahoo does seem to have trouble processing some of the characters in the HTML 4.0 standard, such as the left and right angle quotation marks, « and ». However, this problem is limited to Yahoo and is independent of the use of numeric or character entity references.

What about Special Characters and Search Engine Queries?

If you’re still digesting the above look at how special characters can be specified in an HTML document, you’ll be relieved to know that search engines hide all of this complexity when a user performs a search.

In general, all of the major search engines will correctly return results for words containing special characters, even if a user did not type the special character! To illustrate this concept, we will consider a specific example.

After the German spelling reform, is street Strasse, or Straße? You need not worry. Each of the major search engines recognizes both variants. You can easily verify this by noting that both variants are highlighted in the search results.

Compare search results for Strasse and Straße on Google, Yahoo!, Ask and Microsoft MSN.
Search Engine        Simple ASCII   Special Character
Ask.com              Strasse        Straße
Ask Deutschland      Strasse        Straße

Google.de            Strasse        Straße
Google.com           Strasse        Straße

MSN                  Strasse        Straße
MSN Deutschland      Strasse        Straße

Yahoo! Deutschland   Strasse        Straße
Yahoo!               Strasse        Straße

Still not convinced? Compare Google searches for attivita and attività, the Italian word for activity. Both queries will probably list the Ministero delle Attività Produttive as the top result.

Behind the scenes, search engines have mapped accented and special characters to their plain ASCII equivalents, where possible. Thus ö is usually equivalent to oe, à to a, etc.
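We do not know how any particular search engine implements this mapping. As a rough sketch of the idea only, the following Python function folds accented characters using Unicode decomposition. The SPECIAL_CASES table and the name fold_accents are hypothetical and cover just a few lowercase German letters.

```python
import unicodedata

# Hypothetical expansions for letters whose plain equivalent is not
# simply the base letter. Only lowercase forms are listed here.
SPECIAL_CASES = {"ß": "ss", "ö": "oe", "ä": "ae", "ü": "ue"}


def fold_accents(word):
    """Map accented and special characters to plain equivalents where possible."""
    # Compose first so that the table lookup sees single characters.
    composed = unicodedata.normalize("NFC", word)
    expanded = "".join(SPECIAL_CASES.get(ch, ch) for ch in composed)
    # Decompose what remains and drop the combining accent marks.
    decomposed = unicodedata.normalize("NFKD", expanded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


for word in ["Straße", "attività", "schön", "metà"]:
    print(word, "->", fold_accents(word))
```

An index that stores the folded form next to the original can match a query typed with or without the special character.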

Slightly different emphasis may be given to words with and without special characters based on a combination of factors, including the user’s search language. A user’s search language can be detected from the user’s search interface language and the country variant of the search engine being used, e.g. www.google.it or it.ask.com.

You can usually specify the language of your search interface and the number of results to return. Yahoo also has a Show Instant Search results feature, similar to Google’s Suggest. Each of the major search engines, Google, Yahoo, Ask and Microsoft MSN, supports search interface personalization.

Disambiguation: meta vs. metà

There are many cases where an accent or special character changes a word’s meaning, as with the Italian word meta. Without an accent, meta means goal or aim. With an accent, metà means half or middle. Fortunately, you can specify your exact intent in Google by using an advanced search operator as a prefix to the word. To specify that you mean metà and not meta, just prefix metà with a +, i.e. +metà. Yahoo says it supports exact character search: just place the word in double quotes, e.g. “straßen” or “strassen”. Unfortunately, this doesn’t really seem to work. Try “metà”.

Related Resources in this Website

You may be interested in the other articles we have written on website localization and search engine optimization.

What’s your experience?

What experience have you had resolving internationalization issues?

Contact Us with feedback on your experience or to let us help you with your Search Engine Optimization and Web Analytics needs.


About Sean Carlos

Sean Carlos is a digital marketing consultant & teacher, assisting companies with their Search (SEO + SEA = SEM), Social Media & Digital Media Analytics strategies. Sean first worked with text indexing in 1990 in a project for the Los Angeles County Museum of Art. Since then he worked for Hewlett-Packard Consulting and later as IT Manager of a real estate website before founding Antezeta in 2006. Sean is an official instructor of the Digital Analytics Association and collaborates with the Bocconi University. He is Chairman of the SMX Search and Social Media Conference, 12 & 13 November in Milan. He is also a co-author of the Treccani encyclopedic dictionary of computer science, ICT & digital media. Born in Providence, RI, USA, Sean received Honors in Physics from Bates College, Maine. He speaks English, Italian and German.
