March 2009

a robots.txt equivalent to rel=canonical

In my last post I looked at the rel=canonical tag and finished by promising to look at some of the limitations of rel=canonical and consider some alternatives.

Many of the alternatives have existed for some time - the use of redirects and cookies, for example. However, the introduction of a rel=canonical tag was an opportunity for search engines to also introduce other, more efficient, standards. These are the alternatives I would like to consider - alternatives that don't exist yet, which the search engines could have introduced this time around and may introduce in future.

I see the rel=canonical tag as analogous to the meta robots tag, and therefore suffering from many of the same limitations:

  • The rel=canonical tag is located in an HTML file, and that file therefore needs to be fetched and parsed in order for the tag to be seen and acted upon. Therefore, the tag does not save bandwidth or CPU for the Web site or search engine.
  • The rel=canonical tag is located in an HTML file and gives instructions about that file. Therefore, it cannot be used to solve canonical issues for non-HTML files such as images, PDF files or Flash movies.
  • The rel=canonical tag acts at a micro-level rather than a macro-level. Therefore it is difficult to verify that a site-wide policy has been correctly implemented using rel=canonical; every possible file has to be inspected. Also, code changes have to be made in order to write the rel=canonical tag. This may slow its implementation.

Given that the above issues apply to rel=canonical, and similar issues apply to the meta robots tag, it struck me that an opportunity has been missed to solve canonical issues through the robots.txt file as well. Any fix applied through robots.txt would not suffer from the above problems.

Extensions to robots.txt could be made in a number of ways. For example, a mod_rewrite-type syntax could be introduced. However, I'm not sure anything so advanced is needed. Most canonical issues arise from three things:

  1. the use of query parameters in dynamic URLs.
  2. www versus non-www versions of a site (and other subdomains).
  3. inconsistent use of default index page URLs.

Some simple robots.txt fields to control these issues would fix most problems without the pain and errors that a mod_rewrite implementation would create.

Query Parameters

Google Analytics and Yahoo Site Explorer are two examples of tools that allow simple manipulation of URL query parameters. Yahoo's Dynamic URL Help lists some of the crawling, indexing and ranking benefits of this approach.

Yahoo Site Explorer allows you to remove a query parameter or set a query parameter to a default value within a URL. Using this, a URL such as

could be crawled and indexed as

The session id has been dropped and the referrer has been overwritten as yhoo_srch, meaning all traffic sent by Yahoo Search could be attributed to Yahoo Search rather than the affiliate. This functionality could be implemented in robots.txt using a new syntax something like the following:

User-Agent: Slurp
Disallow:
QueryParam: -sid
QueryParam: refby=yhoo_srch

meaning that the sid query parameter is to be dropped (as it is preceded by '-') and the refby query parameter is to be overwritten with a default value (as a default value is provided). The same effect could be achieved with a single line:

User-Agent: Slurp
Disallow:
QueryParam: -sid, refby=yhoo_srch

One problem with both Google Analytics and Yahoo Site Explorer is that you must list the query parameters you wish to drop from URLs - not the ones you wish to keep. Because third parties can link to your site, you're not in control of the links they create and the query parameters they use. Therefore, canonical issues can only truly be solved by specifying the query parameters you wish to keep, rather than those you wish to drop. To solve this, wildcards could specify the default action to be applied to all non-listed query parameters. Therefore I propose the following syntax:


QueryParam: retainParam[=defaultValue]
QueryParam: -dropParam
QueryParam: [-]*

where...

  • retainParam[=defaultValue]: specifies a query parameter you definitely want to keep, and an optional default value you want it set to
  • -dropParam: specifies a query parameter you definitely want to drop
  • *: means keep all query parameters not specified (default)
  • -*: means drop all query parameters not specified
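
To make the proposed semantics concrete, here is a minimal Python sketch of how a crawler might apply QueryParam rules to a URL. It is only an illustration of the syntax proposed above, not something any search engine implements. The function names and the sample URL are hypothetical, and I have assumed that a default value overwrites a parameter only when that parameter is already present in the URL.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def parse_rules(rule_lines):
    """Parse the values of hypothetical QueryParam fields.

    Returns (keep, drop, drop_others), where keep maps a parameter name
    to its default value (or None), drop is a set of names to remove,
    and drop_others says what to do with unlisted parameters.
    """
    keep, drop = {}, set()
    drop_others = False  # '*' (keep unlisted parameters) is the default
    for line in rule_lines:
        for token in line.split(","):  # allow several rules on one line
            token = token.strip()
            if not token:
                continue
            if token == "*":
                drop_others = False
            elif token == "-*":
                drop_others = True
            elif token.startswith("-"):
                drop.add(token[1:])
            else:
                name, sep, default = token.partition("=")
                keep[name] = default if sep else None
    return keep, drop, drop_others

def canonicalise(url, rule_lines):
    """Rewrite the query string of url according to the rules."""
    keep, drop, drop_others = parse_rules(rule_lines)
    parts = urlsplit(url)
    result = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name in drop:
            continue
        if name in keep:
            default = keep[name]
            # Assumption: a default overwrites whatever value was supplied.
            result.append((name, value if default is None else default))
        elif not drop_others:
            result.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(result)))

# Hypothetical URL, used only to exercise the rules.
url = "http://www.example.com/page?id=7&sid=abc123&refby=aff01"
print(canonicalise(url, ["-sid", "refby=yhoo_srch"]))
print(canonicalise(url, ["id", "-*"]))
```

The second call shows the "keep list" approach: only id is named, and -* drops everything else, including parameters that third parties might invent.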

Default domain and Index Pages

Two further, much simpler additions to robots.txt could clear up the majority of other canonical problems. These are Domain and IndexPage:


Domain: defaultDomain
IndexPage: defaultIndexPage

defaultDomain specifies the default domain for this robots.txt file. For example, if the search engine retrieves http://www.example.com/robots.txt and finds ...


Domain: http://example.com/

...it would know to index all URLs under the non-www domain. This would allow multiple parked domains to share the same content and robots.txt file without needing redirects or causing canonical issues, which is currently a common problem.

The IndexPage field specifies a default index page for the domain, i.e. a page for which the following two URLs are considered equivalent:

http://www.example.com/path/
http://www.example.com/path/defaultIndexPage
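
As a rough sketch of how a crawler might apply these two fields, the Python below rewrites a URL onto the default domain and collapses the index page onto its directory URL. The function name is hypothetical, and choosing the directory form as the canonical one is my assumption; the proposal only says the two URLs are equivalent.

```python
from urllib.parse import urlsplit, urlunsplit

def apply_domain_and_index(url, default_domain=None, index_page=None):
    """Apply hypothetical Domain and IndexPage fields to a URL."""
    parts = urlsplit(url)
    scheme, netloc, path = parts.scheme, parts.netloc, parts.path
    if default_domain:
        # Take the scheme and host from the Domain field.
        domain = urlsplit(default_domain)
        scheme, netloc = domain.scheme, domain.netloc
    if index_page and path.endswith("/" + index_page):
        # Assumption: the directory URL is treated as the canonical form.
        path = path[: -len(index_page)]
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))

print(apply_domain_and_index(
    "http://www.example.com/path/index.html",
    default_domain="http://example.com/",
    index_page="index.html",
))
```

Both fields are optional in the sketch, matching the idea that they are independent of each other.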

Conclusion

In this post I've proposed three new fields to add to robots.txt to provide an alternative to the rel=canonical tag, just as the current robots.txt fields are themselves alternatives to the meta robots tag, with their own advantages and disadvantages. The chief advantages I see of canonicalising through robots.txt are:

  • Acting through robots.txt means that a resource does not have to be fetched and parsed in order for the canonicalisation instructions to be followed. Therefore, bandwidth and CPU are saved for both the Web site and search engine.
  • Acting through robots.txt means that canonical issues can be solved for non-HTML files such as images, PDF files or Flash movies.
  • Acting through robots.txt means large scale changes can be made very quickly and easily without the need for any code changes. It's also much easier to review the changes that have been made.

The Domain, IndexPage and QueryParam fields would all be optional and independent of each other. It would be great if the search engines could introduce some or all of these ideas into robots.txt.

rel=canonical tag

So, Google, Yahoo, Microsoft and, more recently, Ask have announced the new "canonical" link type or, more colloquially, the rel=canonical tag. Much has already been written about this tag and its purpose: to help prevent duplicate content issues. Probably the best summary is this Matt Cutts video.

This tag is a welcome addition to the armoury in the fight against duplicate content issues. In addition to Matt's comments, I would make the following points:

Copyright Protection

Scrapers are forever copying content and publishing it on their own sites/splogs. Sometimes they are exceptionally lazy or stupid, even to the extent that they copy AdSense code onto their own sites. If they copy your rel=canonical tag onto their site, that would give a strong "hint" to the search engine that you were the original owner of the content:

<link rel="canonical" href="http://www.mysite.com/my/content/" />

Microsoft Platforms

Matt made reference to the Microsoft platform in his video, but I would emphasise the point. Microsoft's implementation of RFC 2396 is flawed. The path component of a URL is supposed to be case sensitive, but Microsoft makes it case insensitive. If there are n alphabetic characters in the path, then a Microsoft implementation gives 2^n possible variations of that path, where there should be only one. For example, if n=1 and the path is "/a/", Microsoft would allow "/a/" and "/A/"; if n=2 and the path is "/ab/", Microsoft would allow "/ab/", "/aB/", "/Ab/" and "/AB/"; and so on. 2^n variations give vast potential for duplicate content, and this is a big issue for sites built on the Microsoft platform. The rel=canonical tag makes it very easy to specify the correct, case-sensitive path on a Microsoft platform:

<link rel="canonical" href="http://www.mysite.com/my/case/sensitive/path/" />
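
The growth in case variants is easy to check with a few lines of Python. This is just an illustrative sketch (the function name is hypothetical) that enumerates every upper/lower-case spelling of a path that a case-insensitive server would treat as the same resource.

```python
from itertools import product

def case_variants(path):
    """Return every upper/lower-case spelling of the letters in path."""
    choices = [
        (ch.lower(), ch.upper()) if ch.isalpha() else (ch,)
        for ch in path
    ]
    return ["".join(combo) for combo in product(*choices)]

for path in ("/a/", "/ab/"):
    variants = case_variants(path)
    print(path, len(variants), variants)
```

Each extra letter doubles the count, so even a short path with ten letters already has 1024 spellings that could be linked to and indexed separately.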

Static Web Content

Static web content is content that is stored in the format in which it is delivered. Typically, static content is served under a static URL (a URL that does not contain a question mark). However, it is possible to link to static content and append query parameters, even though these query parameters will have no impact on the content that is served. One example of when this might happen is when a referrer parameter is passed to a JavaScript function within the static content:

<a href="http://www.mysite.com/?referrer=myAffiliate0001">Affiliate Link</a>

Thousands of links can be created to a single, static URL, each with a different referrer query parameter attached. For sites built on static content, trying to manage such links has been difficult in the past. Now, it's relatively easy. Each page of static content simply needs to contain a rel=canonical tag:

<link rel="canonical" href="http://www.mysite.com/my/static/url.html" />

Conclusions: rel=canonical

For the reasons stated above, I would recommend the use of a rel=canonical tag in all static content. In fact, I would recommend its use in all content, static or dynamic - with appropriate care of course. It's a powerful tag and using it wrongly could have dire consequences. In the next post I'll look at some of the limitations of the rel=canonical tag and consider some alternatives.