While it may seem paradoxical, there are many occasions where you may want to exclude a website or portion of a site from search engine crawling and indexing. One typical need is to keep duplicate content, such as printer friendly versions, out of a search engine’s index. The same is true for pages available both in HTML and PDF or word processor formats. Other examples include site “service pages” such as user friendly error message and activity confirmation pages. Special considerations apply for ad campaign landing pages.
There are several ways to prevent Google, Yahoo!, Bing or Ask from indexing a site’s pages. In this article, we look at the different search engine blocking methods, considering each method’s pros and cons.
1. Use a robots.txt robots exclusion file
Way back in 1994, members of a robots discussion list voluntarily agreed on a method to tell well-behaved web robots, such as search engine spiders and crawlers, that certain site content is off-limits.
The robots exclusion standard, as articulated in the robots.txt protocol, says that spiders should look for a plain text file called robots.txt in a site’s top (root) directory. To exclude all robots from crawling directories called sales and images, the following syntax is used:
User-agent: *
Disallow: /sales/
Disallow: /images/
A common error is to forget the trailing slash; we even spotted this error in a recent Google blog post. The following rule:
User-agent: googlebot
Disallow: /sales
will stop any URL whose path begins with /sales from being crawled, including /sales.html and /salesforce/ as well as /sales/, which is not usually what you want. In this case, we have limited the exclusion to googlebot. See our article on search engine spiders for a list of spider bots associated with each of the major search engines.
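To see the difference the trailing slash makes, the two rule sets above can be checked with Python's standard urllib.robotparser module. This is only a rough sketch: the www.example.com URLs and the is_allowed helper are made up for illustration, and Python's parser does simple prefix matching, which may not mirror how each search engine evaluates rules.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Rule sets mirroring the two examples in the article.
WITH_SLASH = "User-agent: *\nDisallow: /sales/\nDisallow: /images/\n"
WITHOUT_SLASH = "User-agent: googlebot\nDisallow: /sales\n"

# Hypothetical URLs used only to illustrate prefix matching.
IN_DIRECTORY = "http://www.example.com/sales/q1.html"
SIMILAR_NAME = "http://www.example.com/salesforce.html"
```

With the trailing slash, only IN_DIRECTORY is blocked. Without it, SIMILAR_NAME is blocked for googlebot too, because /salesforce.html also begins with /sales.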
We recommend using at least a default robots.txt file to avoid logging “404 file not found” errors in your server web logs every time a well-behaved bot looks for a nonexistent robots.txt file. The default file would contain the following lines:
User-Agent: *
Allow: /
Note that allow is the default; the only reason to set up such a file is to avoid triggering file not found errors.
Pattern matching
Some search engines support extensions to the original robots.txt specification which allow for URL pattern matching.
| Pattern Character | Description | Example | Search Engine Support |
|---|---|---|---|
| * | matches a sequence of characters | User-Agent: * Disallow: /print*/ | Google, Yahoo, Bing |
| $ | matches the end of a URL | User-Agent: * Disallow: /*.pdf$ | Google, Yahoo, Bing |
References: Google, Yahoo!, Bing (Sad note: Microsoft Bing’s help system is awful – they don’t allow direct linking to a topic section). At the time of this writing, Ask does not officially support these extensions.
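The two pattern characters can be approximated with a regular expression. The sketch below is our own illustration of the matching rules in the table, not any search engine's actual code; robots_pattern_to_regex and matches are hypothetical helper names, and the sample paths are invented.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern using * and $ into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each * match any run of characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # Without a trailing $, a rule is a prefix match on the URL path.
    return re.compile("^" + body + ("$" if anchored else ""))


def matches(pattern, path):
    """Return True if the URL path is covered by the robots.txt pattern."""
    return robots_pattern_to_regex(pattern).match(path) is not None
```

Under these assumptions, matches("/*.pdf$", "/docs/report.pdf") is true, while a path such as /docs/report.pdf?download=1 is not matched because the $ requires the URL to end in .pdf.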
Site directory organization considerations
When designing a new site, or revising an existing site, we recommend organizing content to be excluded from search engines in dedicated directories; otherwise a robots.txt file becomes unwieldy. At the time of this writing, the whitehouse.gov robots.txt file contains almost 2000 lines.
Using AdWords? Special considerations for ad campaign landing pages
Many sites create dedicated web pages as starting pages for traffic from specific on-line or off-line advertising campaigns. These pages, called landing pages, provide a means to offer a targeted message to visitors who have responded to a specific promotion. Landing pages also allow marketers to measure the response rate to a particular campaign.
When measuring traffic to a landing page, you should measure unique visitors by excluding internal site referrals, i.e. users who return to the page by using the back button.
Generally, a site would want to block robots from indexing landing pages – the pages should only be accessible to visitors who follow a link from a promotion.
User-agent: *
Disallow: /promo07/
There are however cases where this is not a good idea. Some search engine advertising programs, such as Google’s AdWords, consider the quality of your landing page in their ad placement algorithm. By default, Google’s AdWords quality checking robot, AdsBot-Google, WILL crawl pages unless specifically excluded by name in your robots.txt file:
User-agent: AdsBot-Google
Disallow: /promo07/
In most cases, you do not want to block quality scoring robots such as AdsBot-Google.
Pro
- The robots.txt protocol is widely adopted by web spiders and crawlers.
- Robot visits can be tracked by monitoring robots.txt accesses with a web log based Web Analytics tool (unfortunately, JavaScript based web analytics tools cannot track most robots.)
- Can apply to an entire web site or just a section.
- No extra code to add to site pages.
Con
- robots.txt is ignored by misbehaved bots.
- Anyone can read your robots.txt file – indeed there is even a robots.txt blog. Thus, this is not the place to list “secret” directories and files.
Search Engine robots.txt References
Summary of Search Engine REP robots.txt support
The following table was compiled from search engine help files and blog posts. Unfortunately the information supplied is not always complete, although Google and Bing have improved significantly over time.
| Directive | Description | Google | Bing | Yahoo | Teoma | Blekko | Naver | Yandex | Rambler | Baidu | Sogou |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Allow | Allow crawling of a particular path | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ? | ✔ | ✔ |
| Disallow | Disallow crawling of a particular path | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ |
| Crawl-delay | Controls the time between successive requests to a site. Generally in seconds. | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ |
| Pattern Match * | * used to represent multiple characters | ✔ | ✔ | ✔ | ✘ | ✔ | ✘ | ✔ | ✘ | ✔ | ? |
| Pattern Match $ | $ used to terminate the match string | ✔ | ✔ | ✔ | ✘ | ✔ | ✘ | ✔ | ✘ | ✔ | ? |
| Sitemap | Used to specify a sitemap location. Not a good idea – you’re telling the entire world where a list of your files is located. A better approach is to ping each search engine when this file changes. | ✔ | ✔ | ✔ | ✔ | ✘ | ✘ | ✔ | ✘ | ✘ | ✘ |
| searchpreview | Don’t allow search engine to present a preview image of the site in search results | ✘ | Was used by Windows Live Search to disable a page preview thumbnail | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| Clean-param | Specify one or more parameters to be removed from path URL | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✘ | ✘ | ✘ |
| Host | Used to identify mirror sites which should be excluded from indexing | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✘ | ✘ | ✘ |
| Known robots | List of robots used by each search engine. Some use separate robots to crawl images, feeds and other media types. Google uses specific bots for its AdWords and AdSense advertising programs. Some bots are region specific, such as Yahoo! Slurp China. Not all meta tags are applicable to all of the specialized crawlers. | adsbot-google, feedburner, feedfetcher-google, google wireless transcoder, google-site-verification, google-sitemaps, googlebot, googlebot-image, googlebot-mobile, googlebot-news, gsa-crawler, mediapartners-google | bingbot, bingbot-media, msnbot, msnbot-academic, msnbot-media, msnbot-newsblogs, msnbot-products, msnbot-udiscovery | | teoma | scoutjet | naverbot, yeti | Yandex | StackRambler | baiduspider, baiduspider-cpro, baiduspider-favo, baiduspider-image, baiduspider-mobile, baiduspider-news, baiduspider-video | Sogou web spider |
| Geography | International | International | Except US/Canada | US, UK, Germany, France, Italy, Japan, Netherlands, Spain | US | Korea | Russia | Russia | China | China |
2. Use “noindex” page meta tags
Pages can be tagged using “meta data” to indicate they should not be indexed by search engines. Simply add the following code to any page you do not want a search engine to index:
<meta name="robots" content="noindex" />
Keep in mind that search engines will still spider these pages on a regular basis. They continue to crawl “noindex” pages in order to check the current status of a page’s robots meta tag.
There is no need to use an index tag; index is the default option. Using a default tag just adds bloat to your web pages. The only time you might use them is to override a global setting:
<meta name="robots" content="noindex" />
<meta name="googlebot" content="index" />
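To illustrate how a crawler might read these tags, here is a minimal Python sketch built on the standard html.parser module. It assumes, as described above, that a bot-specific tag overrides the generic robots tag; real search engines may resolve conflicting tags differently (for example by applying the most restrictive value). RobotsMetaParser and is_indexable are hypothetical names of our own.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the content tokens of <meta name="..."> tags, keyed by lowercased name."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <meta ... />.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").strip().lower()
        content = attributes.get("content") or ""
        if not name:
            return
        tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
        self.directives.setdefault(name, set()).update(tokens)


def is_indexable(html, bot="googlebot"):
    """Assumed precedence: a tag named after the bot wins over name="robots"."""
    parser = RobotsMetaParser()
    parser.feed(html)
    tokens = parser.directives.get(bot.lower())
    if tokens is None:
        tokens = parser.directives.get("robots", set())
    # index is the default, so only noindex (or none) blocks indexing.
    return not ({"noindex", "none"} & tokens)
```

Fed the two tags above, this sketch reports the page as indexable for googlebot and not indexable for any other bot name.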
Pro
- Allows page level granularity of robots commands.
Con
- The use of a noindex meta tag is only possible with html pages (which includes dynamic pages such as php, jsp, asp). It is not possible to exclude other file types such as PDF, DOC, ODT which don’t support html meta tags.
- Pages will still be spidered by search engines to check the current robots meta tag settings. This additional traffic is avoided when using robots.txt file settings.
3. Password protect sensitive content
Sensitive content is usually protected by requiring visitors to enter a username and password. Such secure content won’t be crawled by search engines. Passwords can be set at the web server level or at the application level. For server level logon setup, consult the Apache Authentication Documentation or the Microsoft IIS documentation.
Pro
- An effective way to keep search engines, other robots, and the general public away from content destined for a limited audience.
Con
- Visitors will only make an effort to access protected website areas if they have a strong motivation to view that content.
4. Nofollow: tell search engines not to spider some or all links on a page
As a response to blog comment “spam”, search engines introduced a way for websites to tell a search engine spider to ignore one or more links on a page. In theory, the search engine won’t “follow”, or crawl, a link which has been “protected”. To keep all links on a page off-limits, use a nofollow meta tag:
<meta name="robots" content="nofollow" />
To specify nofollow at the link level, add the attribute rel with the value nofollow to the link:
<a href="mypage.html" rel="nofollow">My page</a>
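As an illustration of what honoring nofollow would mean for a crawler, the Python sketch below lists the links on a page that a well-behaved spider would be free to follow. It is a simplified model under our own assumptions (FollowableLinks and followable_links are hypothetical names) and does not describe what any particular search engine actually does.

```python
from html.parser import HTMLParser


class FollowableLinks(HTMLParser):
    """Record every link plus any page-level nofollow instruction."""

    def __init__(self):
        super().__init__()
        self.page_nofollow = False
        self.links = []  # list of (href, has_rel_nofollow)

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            tokens = {t.strip().lower() for t in content.split(",")}
            if tokens & {"nofollow", "none"}:
                self.page_nofollow = True
        elif tag == "a" and attributes.get("href"):
            # rel may hold several space-separated values, e.g. "external nofollow".
            rel = (attributes.get("rel") or "").lower().split()
            self.links.append((attributes["href"], "nofollow" in rel))


def followable_links(html):
    """Return hrefs not covered by a meta nofollow or a rel="nofollow" attribute."""
    parser = FollowableLinks()
    parser.feed(html)
    parser.close()
    if parser.page_nofollow:
        return []
    return [href for href, nofollow in parser.links if not nofollow]
```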
Con
- Our tests show that some search engines do crawl and index nofollow links. The nofollow tag will probably diminish the ranking value a link will provide but it cannot be reliably used to stop search engines from following a link.
5. Don’t link to pages you want to keep out of search engines
Search engines won’t index content unless they know about it. Thus, if no one links to pages nor submits them to a search engine, a search engine won’t find them. At least this is the theory. In reality, the web is so large, one can assume that sooner or later a search engine will find a page – someone will link to it.
Con
- Anyone can link to your pages at any time.
- Some search engines can monitor pages visitors view through installed toolbars. They may use this information in the future as a means to discover and index new content.
6. Use X-Robots-Tag in your http headers
In solution 1 above, we noted that use of robots.txt explicitly exposes some of your site’s structure, something you may want to avoid. Unfortunately, solution 2, use of meta tags, only works for html documents – there’s no way to specify indexing instructions for PDF, odt, doc and other non-html files.
In July 2007, Google officially introduced a solution to this problem: the ability to deliver indexing instructions in the http header information which is sent by the web server along with an object. Yahoo! joined Google by supporting this tag in December 2007. Microsoft first mentions x-robots-tag in a June 2008 blog post, although I don’t see their webmaster documentation updated. They do make one mention of X-Robots-Tag in their Bing guide for webmasters.
The web server simply needs to add X-Robots-Tag and any of the Google or Yahoo! supported meta tag values to the http header for an object:
X-Robots-Tag: noindex
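For sites served by a Python application rather than configured directly in the web server, the header could be added along the following lines. This is a sketch under our own assumptions: the extension list, x_robots_headers, and add_x_robots_tag are hypothetical, and the wrapper relies only on the standard WSGI calling convention.

```python
import posixpath

# Assumed set of non-HTML file types to keep out of the index; adjust as needed.
NOINDEX_EXTENSIONS = {".pdf", ".doc", ".odt"}


def x_robots_headers(path, directives="noindex"):
    """Return the extra header list for a URL path (empty if none applies)."""
    extension = posixpath.splitext(path)[1].lower()
    if extension in NOINDEX_EXTENSIONS:
        return [("X-Robots-Tag", directives)]
    return []


def add_x_robots_tag(app):
    """WSGI middleware: append X-Robots-Tag to responses for matching paths."""

    def wrapped(environ, start_response):
        extra = x_robots_headers(environ.get("PATH_INFO", ""))

        def patched_start_response(status, headers, exc_info=None):
            # Pass the original headers through, with ours appended.
            return start_response(status, list(headers) + extra, exc_info)

        return app(environ, patched_start_response)

    return wrapped
```

Wrapping an existing WSGI application with add_x_robots_tag(app) leaves HTML responses untouched and adds the header only to the listed document types.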
Pro
- An elegant way to specify search engine crawling instructions for non-html files without having to use robots.txt.
- Easy to configure using the Apache Header append syntax.
- X-Robots-Tag is officially supported by Google, Yahoo and Bing. Ask has not yet mentioned it.
Con
- Most webmasters are probably not comfortable setting http headers.
- Microsoft IIS support for adding http headers has traditionally been very limited.
Added 2007-07-27. Updated 2009-06-17.
Partially Stop Page Content from appearing in Search Engines
There are times where only a section of a page should be kept out of a search engine. Yahoo supports a class=”robots-nocontent” html tag attribute for this purpose. See our discussion of class=”robots-nocontent” for more details.
Removing pages which have already been indexed.
The best approach is to use one of the above methods. Over time search engines will update their indexes with regular crawling. If you want to remove content immediately, Google offers a tool specifically for this purpose. Pages will be removed for at least six months. This process is not without risk: improperly specify your URL and you may find your entire site removed. Bing has a request form you can use. Yahoo! will consider removal requests for copyright infringement and violation of their search quality guidelines. They don’t currently provide a way to expedite removal of a site owner’s pages.
Removing your content which appears on third party sites
There are occasions when you may find a site ranking in a search engine with content they have “repurposed” without permission from your site. In this case, copyright infringement procedures apply. The best approach to this problem is to directly ask the offending party to remove the copyrighted material. Should this path not prove effective, you should notify each search engine of the copyright infringement. Most of the US based search engines model their copyright violation procedures on the requirements set forth in the American “Digital Millennium Copyright Act” (pdf).
Copyright Infringement procedure
Each search engine provides for notification of copyright violations, a procedure to follow in the event the copyright violator proves non-responsive.
- Google DMCA Notification
- Yahoo Copyright Center
- Ask Copyright Information
- Microsoft Bing [link is broken as Microsoft's help system seems to be from Windows, not for the web; they don't seem to care that people cannot link directly to topic sections.]
Automated Content Access Protocol
Several commercial publishing associations have united behind a project to allow for the specification of more granular restrictions on content use by search engines. The project, Automated Content Access Protocol, appears to stem as much from a desire to share in the profits that search engines accrue when presenting abstracts of a publisher’s content as from limitations in the current robots.txt and meta tag solutions.
At the time of this writing (February 2007), no search engines have yet announced support for this project.
Additional Search Engine Content Display Control
Several search engines also support ways for webmasters to further control the use of their content by search engines.
No archive
Most search engines allow a user to view a copy of the web page that was actually indexed by the search engine. This snapshot of a page in time is called the cache copy. Internet visitors can find this functionality to be really useful if the link is no longer available or the site is down.
There are several reasons to consider disabling the cache view feature for a page or an entire website.
- Web site owners may not want visitors viewing data, such as price lists, which are not necessarily up to date.
- Web pages viewed in a search engine cache may not display properly if embedded images are unavailable and/or browser code such as CSS and JS does not properly execute.
- Cached page views will not show up in web log based web analytics systems. Reporting in tag-based solutions may be incorrect as well, because the cached view is on a third party domain, not yours.
If you want a search engine to index your page without allowing a user to view a cached copy, use the noarchive attribute which is officially supported by Google, Yahoo!, Bing and Ask:
<meta name="robots" content="noarchive" />
Microsoft also documents a nocache attribute, which is equivalent to noarchive. Since Microsoft supports noarchive as well, there is no reason to use nocache.
No abstract option: nosnippet
Google offers an option to suppress the generation of page abstracts, called snippets, in the search results. Use the following meta tag in your pages:
<meta name="googlebot" content="nosnippet" />
They note that this also sets the noarchive option. We would suggest you set it explicitly if that is what you want. In a 2008 article, Bing says it supports nosnippet too. In November 2010 Google introduced website previews like Ask’s Binoculars and Bing’s site preview. Google says the nosnippet tag will suppress instant previews.
Page title option: noodp
Search engines generally use a page’s html title when creating a search result title, the link a user clicks on to arrive at a website. In some cases, search engines may use an alternative title taken from a directory such as dmoz, the open directory, or the Yahoo! directory. Historically, many sites have had poor titles – i.e. just the company name, or worse, “default page title“. Use of a human edited title from a well known directory was often a good solution. As webmasters improve the usability of their sites, page titles have become much more meaningful – and often better choices than the open directory title. The noodp metatag, supported by Microsoft, Google and Yahoo, allows a webmaster to indicate that a page’s title should be used rather than the dmoz title.
<meta name="robots" content="noodp" />
Similarly, Yahoo! offers a “noydir” option to keep Yahoo! from using Yahoo! Directory titles in search results for a site’s pages:
<meta name="slurp" content="noydir" />
Bing Site Preview
Microsoft’s Bing offers a thumbnail preview of most search results, which Bing calls Document Preview. This isn’t new: Live Search offered a preview of the first six search results in some geographies. Ask.com also offers a similar feature called Binoculars. Bing’s preview can be disabled by specifying nopreview as a meta robots value for a page. Microsoft also notes support for x-robots-tag: nopreview in http headers, the first time I’ve noted Microsoft mentioning support for the x-robots-tag.
Microsoft previously supported different methods to disable the thumbnail previews. They were the searchpreview robot in the robots.txt file,
User-agent: searchpreview
Disallow: /
or by using a meta tag containing “noimageindex,nomediaindex”:
<meta name="robots" content="noimageindex,nomediaindex" />
This meta tag was used by AltaVista at one point; it is not known to be used by any of the other major search engines.
Expires After with unavailable_after
One problem with search engines is the delay between when content is removed from a website and when that content actually disappears from search engine results. Typical time dependent content includes event information and marketing campaigns.
Pages removed from a website which still appear in search engine results generally result in a frustrating user experience – the Internet user clicks through to the website only to find themselves landing on a “Page not found” error page.
In July 2007, Google introduced the “unavailable_after” tag which allows a website to specify in advance when a page should be removed from search engine results, i.e. when it will expire. This tag can be specified as a html meta tag attribute value:
<meta name="robots" content="unavailable_after: 21-Jul-2037 14:30:00 CET" />
or in an X-Robots-Tag http header:
X-Robots-Tag: unavailable_after: 7 Jul 2037 16:30:00 GMT
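When expiry dates come from a content management system, the value can be generated rather than typed by hand. The Python sketch below mirrors the layout of the meta tag example above (day-month-year, time, zone) and always emits GMT; which exact date layouts Google's parser accepts is not something we can confirm, and the function names are our own.

```python
from datetime import datetime, timedelta, timezone

# Fixed English month names, so the output does not depend on the server locale.
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def unavailable_after(expires):
    """Format a timezone-aware datetime as an unavailable_after directive in GMT."""
    if expires.tzinfo is None:
        raise ValueError("expires must be timezone-aware")
    utc = expires.astimezone(timezone.utc)
    stamp = "{:02d}-{}-{} {:02d}:{:02d}:{:02d} GMT".format(
        utc.day, MONTHS[utc.month - 1], utc.year,
        utc.hour, utc.minute, utc.second)
    return "unavailable_after: " + stamp


def as_meta_tag(expires):
    """Render the directive as an html meta tag."""
    return '<meta name="robots" content="{}" />'.format(unavailable_after(expires))


def as_http_header(expires):
    """Render the directive as an (name, value) http header pair."""
    return ("X-Robots-Tag", unavailable_after(expires))


# The article's example time, taking CET to mean UTC+1 (an assumption).
EXAMPLE = datetime(2037, 7, 21, 14, 30, tzinfo=timezone(timedelta(hours=1)))
```

Converting to GMT before formatting sidesteps the question of which time zone abbreviations a parser understands.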
Google says the date format should be one of those specified by the ambiguous and obsolete RFC 850. We hope Google clarifies what date formats their parser can read by referring to a current date standard, such as IETF Internet standard RFC 3339. We’d also like to see detailed page crawl information in Google’s Webmaster Tools. Not only could Google show when a page was last crawled, they could add expiration information, confirming proper use of the unavailable_after tag. At one point, Google did show an approximation of the number of pages crawled relative to the number specified in a sitemap, but that feature was removed. This is one case where Google should follow Yahoo’s example.
Pro
- A nice way to ensure search engine results are synchronized with current website content.
Con
- Old date specification RFC 850 is too ambiguous, thus subject to error.
- unavailable_after support is currently limited to Google. We do hope the other major search engines embrace this approach as well.
Added 2007-07-27.
Crawl Delay
While not directly related to content, I was asked about regulating crawling speed in an SEO class, so here’s the formal answer. Both Yahoo! and Microsoft’s Bing support the robot exclusion protocol value crawl-delay. Yahoo cites a delay value in the form x.x where 5 or 10 is “high”. While Yahoo doesn’t specify the delay units, Microsoft uses seconds.
User-agent: Slurp
Crawl-delay: 0.5

User-agent: msnbot
Crawl-delay: 4
Google does not support Crawl-delay, and neither do impostor bots. For Google, the crawl rate can be changed through a setting in Google’s Webmaster Tools for a site. Now that you know you can set a crawl delay, you probably shouldn’t. Search engine crawlers need to access your site’s contents to find any changes: new pages, deleted pages, changed pages. It is in your interest that they do this frequently. Except on rare occasions, the major search engines won’t be hammering your site. Impostors might, but they won’t look at the robots.txt content.
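For the curious, here is roughly how a polite crawler might read Crawl-delay values. We use a small hand-written parser in this sketch because Python's standard urllib.robotparser only accepts whole-number delays, so a value such as 0.5 would be ignored. The function names are our own and the parsing is deliberately simplified.

```python
def parse_crawl_delays(robots_txt):
    """Map each lowercased user-agent name to its Crawl-delay as a float."""
    delays = {}
    agents = []
    in_rules = False
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A user-agent line after rules starts a new group.
                agents = []
                in_rules = False
            agents.append(value.lower())
        else:
            in_rules = True
            if field == "crawl-delay":
                try:
                    delay = float(value)
                except ValueError:
                    continue  # ignore malformed values
                for agent in agents:
                    delays[agent] = delay
    return delays


def delay_for(delays, user_agent, default=0.0):
    """Pick the delay for a user-agent string, falling back to * or a default."""
    lowered = user_agent.lower()
    for agent, delay in delays.items():
        if agent != "*" and agent in lowered:
            return delay
    return delays.get("*", default)
```

Applied to the example above, this yields a delay of 0.5 for Slurp and 4 for msnbot, and the default of 0 for any robot not named in the file.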
Meta Tag Summary
The following table summarizes the page level meta tags which can be used to specify how a search engine crawls a page. Positive tags, such as follow, are not listed as they are the default. Tags are case insensitive and can usually be combined.
| Tag | General Description | Google | Bing | Yahoo | Ask | Blekko | Naver | Yandex | Rambler | Baidu | Sogou |
|---|---|---|---|---|---|---|---|---|---|---|---|
| noindex | Don’t index a page (implies noarchive / nocache) | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ? | ? |
| nofollow | Don’t follow, i.e. crawl, the links on the page | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ? |
| noarchive | Don’t present a cached copy of the indexed page | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✔ | ? |
| nocache | Same as noarchive | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ? |
| none | Equivalent to noindex, nofollow | ✘ | ✘ | ✘ | ✔ | ✘ | ✘ | ✔ | ✔ | ✘ | ? |
| nosnippet | Don’t display an abstract for this page. For Google, also implies noarchive and, as of Nov 2010, disables site instant preview. | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ? |
| noodp | Don’t use an Open Directory title for the page | ✔ | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ? |
| noydir | Don’t use a Yahoo! Directory title for the page | ✘ | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| nopreview | Don’t display a site preview in search results | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| noimageindex | Don’t crawl images specified in this page | ✔ | Was used by Windows Live Search to disable a page preview thumbnail | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| nomediaindex | Don’t crawl other media specified in the page | ✘ | Was used by Windows Live Search to disable a page preview thumbnail | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| unavailable_after: <date in one of the RFC 850 formats> | Don’t offer in search results after this date and time. In reality, Google says: This information is treated as a removal request: it will take about a day after the removal date passes for the page to disappear from the search results. We currently only support unavailable_after for Google web search results. | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| syndication-source | indicates the URL which is the definitive source of a syndicated article. Only used by Google News. | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| original-source | indicates the URL which first reported on a news story. Only used by Google News; not yet active (11/2010) | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| notranslate | Don’t allow automatic translation of a page. Notranslate was introduced by Google with apparently little thought. The syntax takes the noun “google” instead of “robots”, e.g. <meta name="google" content="notranslate" /> or Google bot | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| msvalidate.01 | verify ownership for Bing Webmaster Tools | na | ✔ | na | na | na | na | na | na | na | na |
| google-site-verification (was verify-v1) | verify ownership for Google Webmaster Tools | ✔ | na | na | na | na | na | na | na | na | na |
| Known robots | List of robots used by each search engine. Some use separate robots to crawl images, feeds and other media types. Google uses specific bots for its AdWords and AdSense advertising programs. Some bots are region specific, such as Yahoo! Slurp China. Not all meta tags are applicable to all of the specialized crawlers. | adsbot-google, feedburner, feedfetcher-google, google wireless transcoder, google-site-verification, google-sitemaps, googlebot, googlebot-image, googlebot-mobile, googlebot-news, gsa-crawler, mediapartners-google | bingbot, bingbot-media, msnbot, msnbot-academic, msnbot-media, msnbot-newsblogs, msnbot-products, msnbot-udiscovery | | teoma | scoutjet | naverbot, yeti | Yandex | StackRambler | baiduspider, baiduspider-cpro, baiduspider-favo, baiduspider-image, baiduspider-mobile, baiduspider-news, baiduspider-video | Sogou web spider |
| Geography | International | International | Except US/Canada | US, UK, Germany, France, Italy, Japan, Netherlands, Spain | US | Korea | Russia | Russia | China | China |
Source: search engine help files, blog posts. Last update: November 2010 (updated table).

By Jody 2009-02-08 - 7:10:37
Even though, as a newbie, much of this went over my head, it served the teaching function of alerting me to something that I should know about and even provided useful links for further exploration. Thank you!
By James 2009-02-18 - 18:22:24
Great post and exactly what I have been looking for. Thanks.
By Vilnis 2009-07-06 - 18:42:37
Hi. Thanks for he article re “noindex” etc.
We removed a pdf from our server, that had been indexed by Google. The “view as HTML” link still appears in a Google search result, and still shows an HTML rendering of the PDF – how can we get this listing removed from searches?
By Shelley 2009-07-07 - 13:53:22
Nice article!
I’m having the same issue as Vilnis..someone uploaded a PDF and some XLS to the server and they coming up in the search with the usual 2 line blurb. I’ve changed the actual document to have no information in it, but in the search results it is showing information from the old content.
How do I get rid of this, any ideas?
By bill 2009-07-08 - 18:28:51
Nice summary. I am looking to keep all the engines from indexing a certain block of text on each page. Does anyone know of a bing and google equivalent to yahoo’s class=”robots-nocontent” ?
By uttam Kadam 2009-07-23 - 7:17:32
deep and clear description is found.
Which is very useful.
Thanks for this post.
By Matt 2009-09-22 - 16:20:04
Great article and very informative. Hopefully in time Bing will improve the way it reads robots.txt
By Binary Soldier 2009-10-08 - 20:16:33
A good selection of the various methods which are often used.
I plan to stop hot-linking of my imges and it seems CPanel makes this fairly easy.
By Interface 2010-03-18 - 20:56:25
Robots.txt has always been very effective for us. That, coupled with Google’s Webmaster Tools, makes ranking management relatively straightforward.
By Corl 2010-04-19 - 21:19:23
Very informative article.
Since I can only use the meta robots tags – I am stuck with a quandary. It’s great that the bots index my site, I just don’t want my images to be indexed and to show up in the Google, Bing, Yahoo!, or others(?) image searches. Is there a tag or another way without access to the robots.txt file, to tell all the search engines not to index the images on my site?
Thanks,
Corl
By chris 2010-04-22 - 11:50:55
Guys,
I am trying to stop google indexing certain search features, as its indexing them as duplicate content, let me try and explain this.
From Webmaster tools,
Pages with duplicate title tags Pages
Thermolife Products – Sport and Supplements
Go to URL/brands/Thermolife.html
Go to URL/brands Thermolife.html?sort=alphadesc
Go to URL/brands/Thermolife.html?sort=newest Go to URL/brands/Thermolife.html?sort=priceasc
Go to URL/brands/Thermolife.html?sort=pricedesc
Sorry about the text above, but as you can see im getting all these ?String indexed, which is not great, I have tried to search google and use there help docs to restrict it in the robots txt, but Im lost and need a little help please.
Regards
Chris
By Damn Amazing Things 2010-07-07 - 14:37:18
Thanks For This information about google robots
By Magson Fernandes 2010-09-07 - 22:38:21
if i use .htaccess to protect a folder will it stop search engines from accessing my content from my folder.
By David 2010-09-10 - 7:26:27
Great and comprehensive post. Many thanks for including so much detail!
By John 2010-10-01 - 19:07:58
I am also thinking of how to block my content being crawled using .htaccess. Any one has any idea?
By Better-IT 2010-10-07 - 16:24:03
A very comprehensive post, which I’m bookmarking as a reference.
Might be good to add techniques such as javascript links which search engines still can’t follow (I think?)
By liza 2011-03-02 - 11:36:15
This is a very comprehensive resource, I must say. My question is somewhat related to this but here’s the scenario.
We have a website running on two servers. So the files/website is synchronized in these two servers. We have a second server for internal purposes. Let’s name the first server as www and ww2 for the second server. ww2 is automatically updated once the files are updated in www.
Now, Google is indexing the ww2 which I want to stop and just let www be crawled and indexed. My questions are: 1. How can I removed those crawled pages in ww2 removed from Google index? 2. How can I stop google from indexing ww2?
By Nur Mohammad 2011-05-09 - 22:04:40
Thanks for the great article!!!
i am gonna give a try to protect a Folder in my website.
By Chicago With Kids 2011-05-13 - 16:42:59
Just what I was looking for. Usually I want search engines to find my pages, but not in this case. Glad to know I can keep them at bay.
By menno 2011-07-23 - 2:38:04
about
what do you need to do to prevent cache on third party websites only but to keep them on the original domain?
By Katie @ Women Magazine 2011-08-23 - 13:29:57
Obviously, you must be overwhelmed with lots of “thank you for sharing” comments but this is something which is really good and informative.
I would request you to write how to block all unwanted spiders and allow only specific crawlers. Is it even possible?
By William Lake 2011-09-21 - 20:52:35
Hi,
I’ve got the following code in my robots.txt
User-agent: *
Disallow: /
Yet google has indexed 44 pages and directories and yahoo 1.
I’m confused, very confused. Any ideas?
By Dhika - Pang Iklan 2011-10-14 - 14:20:16
in robot.txt file, can I write like http://www.mysite.com/images/ if I want prevent just 1 folder ?
By jaime 2012-02-07 - 21:31:22
this nice
By Mick 2012-06-11 - 14:39:37
Google/Bing ignore robots.txt, so don’t bother with this…
If you want to block them, use .htaccess
By Arun rana 2013-04-12 - 18:43:30
Used Robots.txt method , thanks for the help