I first wrote and published this article in 2000/2001, but that copy no longer exists on the Web. It is still accurate: although search engines (notably Google) have taken steps to correct some of the problems described below, they can and do still arise.
There are two common protocols for the prevention of indexing of Web resources:
- The robots.txt protocol
- The robots meta tag protocol
This article describes:
- The theory and practice of these two protocols
- Anomalies and inadequacies in the protocols
The robots.txt protocol
A search engine spider is a Web robot and, as such, may choose to obey the robots.txt protocol. The robots.txt protocol was invented in 1994 and has remained as the de facto standard for controlling robots’ access to a Web site. Most search engines claim to support it, but no robot, including a search engine spider, has to support it.
The protocol is described in the document "A Standard for Robot Exclusion". That is the page that most search engines that support the robots.txt protocol will refer you to if you require more details. However, if you read that page, you will see that it contains no reference to search engines at all. The introduction to the page says:
In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).
These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.
So, the purpose of the robots.txt protocol is to provide a mechanism for WWW servers to indicate to robots which parts of their server should not be accessed, i.e. to prevent robots from reading parts of their server. How does this purpose relate to preventing a search engine from indexing a particular resource? Unfortunately, the general answer to this question is "It doesn’t".
The Disallow line in a robots.txt file means "disallow reading", but that does NOT mean "disallow indexing". In other words, a disallowed resource may be listed in a search engine’s index, even if the search engine obeys the protocol. The most obvious demonstration of this is Google. Google can add files to its index without reading them, merely by considering links to those files. In theory, Google can build an index of an entire Web site without ever visiting that site or ever retrieving its robots.txt file. In so doing it is not breaking the robots.txt protocol, because it is not reading any disallowed resources; it is simply reading other Web sites' links to those resources.
The Disallow line in a robots.txt file means "Disallow reading"; it does not mean "Disallow indexing". A resource does not necessarily need to be read in order to be indexed.
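To make the point concrete, here is a minimal Python sketch of how an index entry can be built purely from other sites' links. It is not how Google or any real engine is implemented; the class and function names, the www.example.com URL and the two linking pages are all invented for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def index_from_links(pages):
    """Build {target URL: [anchor texts]} by reading only the linking pages."""
    index = defaultdict(list)
    for html in pages:
        collector = AnchorCollector()
        collector.feed(html)
        collector.close()
        for href, text in collector.links:
            index[href].append(text)
    return dict(index)


# Hypothetical pages on OTHER sites that link to a disallowed resource.
third_party_pages = [
    '<p>See the <a href="http://www.example.com/private/report.htm">annual report</a>.</p>',
    '<p><a href="http://www.example.com/private/report.htm">Example Co. figures</a></p>',
]

link_index = index_from_links(third_party_pages)
print(link_index)
```

The resulting entry for report.htm holds both anchor texts, yet nothing in the sketch ever requests report.htm or the robots.txt file of the site that hosts it.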
Let’s return to the question of how the robots.txt file can be used to prevent a search engine from listing a particular resource in its index. In practice, most search engines have placed their own interpretation on the robots.txt file which allows it to be used to prevent them adding resources to their index, as follows. Most search engines interpret a resource being disallowed by the robots.txt file as meaning they should not add it to their index, and if it is already in their index (placed there by previous spidering activity) they remove it. This last point is important, and an example will illustrate the point.
A particular resource may have been published to a particular Web site on 1st January 2000. That resource may have been indexed by a search engine on 1st February 2000. On 1st March 2000, the site owner may have modified the site’s robots.txt file to disallow the resource from being read by the search engine spider. On 1st April 2000, the search engine spider may re-visit the Web site and note the new entry in the robots.txt file. The search engine spider may now simply choose not to read the resource but to leave the copy of the resource in its index unchanged, and this would not be breaking the robots.txt protocol. But most search engine spiders will both:
- not read the resource and
- remove the resource from their index.
In this example, note that throughout March the resource was in the search engine’s index even though it was disallowed by the robots.txt file.
In practice, most search engines interpret a Disallow line as meaning "Do not index this resource and, if you already have an index of this resource, remove it". It may take some time from the point a resource is Disallowed to the point that resource is removed from a particular search engine’s index. If you want to ensure a particular resource is never indexed, ensure it is prevented from being indexed by a Disallow line in the robots.txt file before publishing the resource for the first time.
Now let’s consider how the robots.txt protocol can be used to prevent binary resources, such as images (e.g. GIF files), from being added to a search engine’s index. Let’s suppose a particular Web site put all its images in a directory called /images, and had the following robots.txt file:
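A file of the kind described would presumably read like the one embedded in the sketch below. Its exact wording is an assumption based only on the /images directory mentioned above, and the robot name and www.example.com URLs are invented. The sketch feeds the file to Python's standard urllib.robotparser module to show what the rule does and does not say.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of the site's robots.txt: every robot is asked
# not to read anything whose path starts with /images.
ROBOTS_TXT = """\
User-agent: *
Disallow: /images
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# can_fetch() answers one question only: "may this robot READ this URL?"
image_readable = parser.can_fetch(
    "ExampleImageBot", "http://www.example.com/images/logo.gif"
)
page_readable = parser.can_fetch(
    "ExampleImageBot", "http://www.example.com/index.html"
)
print(image_readable, page_readable)  # -> False True
```

Note that the parser has no notion of indexing at all: it reports only whether a URL may be fetched.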
You might think that this would prevent the site’s images being indexed by image search engines. But think again about what we have learned about the robots.txt file. It prevents Web robots, including search engine spiders, from reading a resource. But search engines do not need to read an image before adding it to their index. Many spiders just read the ALT text of the IMG tags that refer to the image, rather than reading the image itself. Since the spiders are not reading the image, they are not in breach of the robots.txt protocol if they index the image. This scenario is analogous to Google building an index of a resource without reading that resource: an image search engine can build an index of an image without reading the image.
Once again, in practice most image search engines interpret a Disallow line referring to an image as meaning "Do not index this image and, if you already have an index of this image, remove it". It may take some time from the point an image is Disallowed to the point that image is removed from a particular image search engine’s index.
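Finally, a question that exposes the worst flaw of the robots.txt protocol: a webmaster wishes to make all pages of a Web site, EXCEPT the home page (i.e. "/"), accessible to robots; how can she do this using the robots.txt protocol? The answer - "She can't". A Disallow value matches every URL that begins with it, so "Disallow: /" excludes the whole site, and the original protocol offers no way to name "/" alone.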
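The home page problem can be checked with Python's standard urllib.robotparser module. This is a small sketch under assumptions: the robot name, the www.example.com host and the sample paths are invented, and the module also understands the later Allow extension, which the sketch deliberately does not use.

```python
from urllib.robotparser import RobotFileParser


def readable(robots_txt, path):
    """Return True if a robot obeying robots_txt may read the given path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("ExampleBot", "http://www.example.com" + path)


# The only way to name the home page is "/", but Disallow values are
# matched as prefixes, and every path on the site starts with "/".
HOME_ONLY_ATTEMPT = "User-agent: *\nDisallow: /\n"

results = {
    path: readable(HOME_ONLY_ATTEMPT, path)
    for path in ("/", "/about.html", "/products/list.html")
}
print(results)
```

Every path comes back as unreadable, not just the home page, which is exactly the opposite of what the webmaster wanted.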
The robots meta tag protocol
The robots meta tag protocol was invented after the robots.txt protocol. It was originally designed to give HTML developers who did not have permission to write the robots.txt file to the root of a server some control over the indexing of Web pages. Unlike the robots.txt protocol, the robots meta tag protocol:
- specifically states whether a resource may or may not be indexed
- can help to keep a particular resource from being read, but cannot prevent it
- does not allow large-scale (wildcard) prevention of indexing
- cannot be used to prevent anything except HTML files from being indexed, since the meta tag can only be placed in HTML files (if following the strict definition of the protocol)
Note in particular the second point: the robots meta tag protocol cannot prevent a particular resource from being read because a resource must be read in order to obtain the tag it contains. You may think that if every document that linked to a particular resource contained a robots meta tag NOFOLLOW attribute, that resource could never be read – but what if a new document is added anywhere on the Web, and that document links to the resource? Or what if somebody submits the resource directly to the Add URL page of a search engine? In both these cases, a search engine will read the resource before discovering the robots meta tag. So the problems the robots.txt protocol was designed to fix - e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting) – are not addressed by the robots meta tag protocol. In other words, there is no "NOREAD" attribute!
So, we’ve said what the robots meta tag is not, but what is it? The robots meta tag is included in an HTML file and defines separately whether the file may be indexed (using the INDEX attribute) or spidered (using the FOLLOW attribute). However, the robots meta tag enjoys less support than the robots.txt file. It is unclear how much of the standard search engines support. Would every search engine, for example, correctly interpret a "noindex, follow" set of attributes?
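As a rough illustration of what "correctly interpret" would involve, the Python sketch below extracts the tag from a page and treats indexing and following as two independent decisions. The class and function names are invented, and the default of allowing whatever is not explicitly forbidden is an assumption rather than a statement about any particular search engine.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the values of any <meta name="robots"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html):
    """The page has to be read (fed to the parser) before its tag is known."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def may_index(directives):
    # Assumed default: indexing is allowed unless the page says noindex.
    return "noindex" not in directives


def may_follow(directives):
    # Assumed default: links are followed unless the page says nofollow.
    return "nofollow" not in directives


page = '<html><head><META NAME="robots" CONTENT="noindex, follow"></head><body></body></html>'
found = robots_directives(page)
print(may_index(found), may_follow(found))  # -> False True
```

The sketch also shows why there can be no "NOREAD": robots_directives() can only answer after the whole page has been handed to it.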
Since the robots meta tag can only be used within an HTML file, and the NOINDEX attribute only refers to the file that contains it, it cannot be used to prevent binary resources (such as images) from being indexed. Some search engines have invented extensions to the protocol to overcome this problem, but the extensions are not part of the protocol. For example, AltaVista has invented its own robots meta tag attribute (NOIMAGEINDEX) to prevent images from being indexed.
The behaviour of these extension tags is not well defined. An example will illustrate the main problem:
- a particular Web site, let’s call it www.example-one.com, consists of 10 pages
- each of the 10 pages includes an image at www.example-one.com/images/example.gif
- nine of the ten pages contain a robots meta tag like this: <META NAME="robots" CONTENT="index,follow,noimageindex">
- however, www.example-one.com’s home page contains the following robots meta tag: <META NAME="robots" CONTENT="index,follow">
The "noimageindex" attribute is only understood by AltaVista’s image spider. So, when AltaVista’s image spider reads the site, will it add example.gif to AltaVista’s image index? The answer to this question is undefined – nine out of ten pages say it’s not OK to index the image, but one out of ten pages says (implicitly) that it is OK. So the image spider might, or might not, index the image. It all depends on the order the spider reads the pages, the number of pages read by the spider (it might only read the home page), and a multitude of other factors.
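The ambiguity can be shown with a toy model in Python. Both decision policies below are hypothetical, invented only to show that equally plausible spiders reach different answers; nothing here describes how AltaVista's spider actually behaved, and the page paths are made up.

```python
# Ten pages that all embed www.example-one.com/images/example.gif.
# Only the home page omits noimageindex.
pages = {"/": {"index", "follow"}}
for n in range(1, 10):
    pages[f"/page{n}.htm"] = {"index", "follow", "noimageindex"}


def first_page_wins(crawl_order):
    """Hypothetical policy: the first page read decides the image's fate."""
    return "noimageindex" not in pages[crawl_order[0]]


def any_objection_wins(crawl_order):
    """Hypothetical policy: one noimageindex among the pages read is enough."""
    return all("noimageindex" not in pages[path] for path in crawl_order)


home_first = ["/", "/page1.htm", "/page2.htm"]
home_last = ["/page1.htm", "/page2.htm", "/"]
home_only = ["/"]

print(first_page_wins(home_first))     # -> True  (image indexed)
print(first_page_wins(home_last))      # -> False (image not indexed)
print(any_objection_wins(home_only))   # -> True  (spider stopped after the home page)
print(any_objection_wins(home_last))   # -> False
```

The same ten pages and the same image produce both outcomes, depending only on the policy chosen and on which pages happen to be read first.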
To make matters worse, now suppose that there is another Web site called www.example-two.com, every page of which also includes www.example-one.com/images/example.gif. None of the pages on www.example-two.com include a robots meta tag. Would an image spider add example.gif to its index now? Again, the answer to this question is undefined.
Now a question to test the theory so far ... A site owner attempts to exclude a page from being indexed by search engines by adding both a Disallow line to the site's robots.txt file and a robots meta tag with the noindex attribute to the page itself, before publishing the resource for the first time. Is there any way that a search engine that obeys the robots.txt protocol and the robots meta tag meticulously can have a reference to the resource in its index?
Let's work this through.
- Suppose the resource is called noindex.htm and it contains the following robots meta tag: <META NAME="robots" CONTENT="noindex,nofollow">
- The file http://www.example-three.com/robots.txt is then created, containing a Disallow line for /noindex.htm.
- noindex.htm is then published to www.example-three.com/noindex.htm for the first time.
Surely noindex.htm can’t possibly be indexed by a search engine that obeys the robots.txt protocol and the robots meta tag protocol? Can it? It can. In fact, only a search engine that completely obeys both standards can index it. Here’s how.
Our very obedient search engine works a little like Google. So, while its spider is spidering the Web, it finds references to noindex.htm. Each time it finds a reference, the spider creates a better picture of noindex.htm in its index, without ever reading noindex.htm. Sooner or later, the spider visits www.example-three.com. The first thing it does is read robots.txt to find pages it is not allowed to read. The only page it is not allowed to read is noindex.htm, so it doesn’t read that page. It doesn’t remove the page from its index, because, strictly speaking, that is not what the robots.txt protocol means. Because the spider cannot read noindex.htm, it cannot find the robots tag on that page preventing it from indexing that page. Therefore, the page remains in the search engine’s index.
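The walk-through above can be replayed as a small Python simulation. Everything in it is a sketch under assumptions: the robots.txt wording, the robot name and the two inbound anchor texts are invented, and the test for the noindex attribute is deliberately crude because that branch is never reached.

```python
from urllib.robotparser import RobotFileParser

TARGET = "http://www.example-three.com/noindex.htm"

# Assumed robots.txt for www.example-three.com.
ROBOTS_TXT = "User-agent: *\nDisallow: /noindex.htm\n"

# The page itself, carrying the robots meta tag from the example.
NOINDEX_HTM = '<html><head><META NAME="robots" CONTENT="noindex,nofollow"></head></html>'

# Hypothetical anchor texts found on other sites that link to the page.
INBOUND_LINK_TEXTS = ["internal price list", "price list (draft)"]


def crawl():
    """Return (index, page_was_read) for a spider that obeys both protocols."""
    index = {}
    page_was_read = False

    # Step 1: while spidering the rest of the Web, record links to the page.
    for text in INBOUND_LINK_TEXTS:
        index.setdefault(TARGET, []).append(text)

    # Step 2: visit the site and read robots.txt before anything else.
    rules = RobotFileParser()
    rules.parse(ROBOTS_TXT.splitlines())

    if rules.can_fetch("ObedientBot", TARGET):
        # Only a spider allowed to read the page could see and honour its tag.
        page_was_read = True
        if "noindex" in NOINDEX_HTM.lower():  # crude check, enough for a sketch
            index.pop(TARGET, None)
    # Otherwise the page is never read, the tag is never seen,
    # and the entry built in step 1 is left untouched.

    return index, page_was_read


final_index, page_was_read = crawl()
print(page_was_read)          # -> False
print(TARGET in final_index)  # -> True
```

The spider never reads noindex.htm and never breaks either protocol, yet the page is still in its index when the crawl ends.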
Future posts will cover the new features in robots.txt, the robots meta tag and Webmaster tools that address some of the above problems.