8th March 2009
a robots.txt equivalent to rel=canonical
In my last post I looked at the rel=canonical tag and finished by promising to look at some of the limitations of rel=canonical and consider some alternatives.
Many of the alternatives have existed for some time - the use of redirects and cookies, for example. However, the introduction of a rel=canonical tag was an opportunity for search engines to also introduce other, more efficient, standards. These are the alternatives I would like to consider - alternatives that don't exist yet, which the search engines could have introduced this time around and may introduce in future.
I see the rel=canonical tag as analogous to the meta robots tag, and therefore suffering from many of the same limitations:
- The rel=canonical tag is located in a HTML file, and that HTML therefore needs to be fetched and parsed in order for the tag to be seen and acted upon. Therefore, the tag does not save bandwidth or CPU for the Web site or search engine.
- The rel=canonical tag is located in a HTML file and gives instructions about that file. Therefore, it cannot be used to solve canonical issues for non-HTML files such as images, PDF files or Flash movies.
- The rel=canonical tag acts at a micro-level rather than a macro-level. Therefore it is difficult to review that a site-wide policy has been correctly implemented using rel=canonical; Every possible file has to be inspected. Also, code changes have to be made in order to write the rel=canonical tag. This may slow its implementation.
Where the above issues apply to rel=canonical, and similar issues apply to the meta robots tag, it struck me that an opportunity has been missed to also solve canonical issues through the robots.txt file. Any fix applied through robots.txt would not suffer from the above problems.
Extensions to robots.txt could be made in a number of ways. For example, a mod_rewrite-type syntax could be introduced. However, I'm not sure anything so advanced is needed. Most canonical issues arise from three things:
- the use of query parameters in dynamic URLs.
- www versus non-www versions of a site (and other subdomains).
- inconsistent use of default index page URLs.
Some simple robots.txt fields to control these issues would fix most problems without the pain and errors that a mod_rewrite implementation would create.
Google Analytics and Yahoo Site Explorer are two examples of tools that allow simple manipulation of URL query parameters. Yahoo's Dynamic URL Help lists some of the crawling, indexing and ranking benefits of this approach.
Yahoo Site Explorer allows you to remove a query parameter or set a query parameter to a default value within a URL. Using this, a URL such as
could be crawled and indexed as
The session id has been dropped and the referrer has been overwritten as yhoo_srch, meaning all traffic sent by Yahoo Search could be attributed to Yahoo Search rather than the affiliate. This functionality could be implemented in robots.txt using a new syntax something like the following:
meaning that the sid query parameter is to be dropped (as it is preceded by '-') and the refby query parameter is to be overwritten with a default value (as a default value is provided). The same effect could be achieved with a single line:
QueryParam: -sid, refby=yhoo_srch
One problem with both Google Analytics and Yahoo Site Explorer is that you must list the query parameters you wish to drop from URLs - not the ones you wish to keep. Because third parties can link to your site, you're not in control of the links they create and the query parameters they use. Therefore, canonical issues can only truly be solved by specifying the query parameters you wish to keep, rather than those you wish to drop. To solve this, wildcards could specify the default action to be applied to all non-listed query parameters. Therefore I propose the following syntax:
- retainParam[=value]: specfies a query parameter you definitely want to keep, and an optional default value you want it set to
- -dropParam: specifies a query parameter you definitely want to drop
- *: means keep all query parameters not specified (default)
- -*: means drop all query parameters not specified
Default domain and Index Pages
Two further, much simpler additions to robots.txt could clear up the majority of other canonical problems. These are Domain and IndexPage:
defaultDomain specfies the default domain for this robots.txt file. For example, if the search engine retrieves http://www.example.com/robots.txt and finds ...
...it would know to index all URLs under the non-www domain. This would allow multiple parked domains to share the same content and robots.txt file without needing redirects or causing canonical issues, which is currently a common problem.
The IndexPage field specifies a default index page for the domain, i.e. a page for which the following two URLs are considered equivalent:
In this post I've proposed three new fields to add to robots.txt to provide an alternative to the rel=canonical tag, just as the current robots.txt fields are themselves alternatives to the meta robots tag, with their own advantages and disadvantages. The chief advantages I see of canonicalising through robots.txt are:
- Acting through robots.txt means that a resource does not have to be fetched and parsed in order for the canonicalisation instructions to be followed. Therefore, bandwidth and CPU is saved for both the Web site and search engine.
- Acting through robots.txt means that canonical issues can be solved for non-HTML files such as images, PDF files or Flash movies.
- Acting through robots.txt means large scale changes can be made very quickly and easily without the need for any code changes. It's also much easier to review the changes that have been made.
The Domain, IndexPage and QueryParam fields would all be optional and independent of each other. It would be great if the search engines could introduce some or all of these ideas into robots.txt.