A robots.txt equivalent to rel=canonical

Alan | 8th March 2009

In my last post I looked at the rel=canonical tag and finished by promising to look at some of the limitations of rel=canonical and consider some alternatives.

Many of the alternatives have existed for some time - the use of redirects and cookies, for example. However, the introduction of a rel=canonical tag was an opportunity for search engines to also introduce other, more efficient, standards. These are the alternatives I would like to consider - alternatives that don't exist yet, which the search engines could have introduced this time around and may introduce in future.

I see the rel=canonical tag as analogous to the meta robots tag, and therefore suffering from many of the same limitations:

  • The rel=canonical tag is located in an HTML file, and that HTML therefore needs to be fetched and parsed before the tag can be seen and acted upon. The tag therefore saves no bandwidth or CPU for the Web site or the search engine.
  • The rel=canonical tag is located in an HTML file and gives instructions about that file. It therefore cannot be used to solve canonical issues for non-HTML files such as images, PDF files or Flash movies.
  • The rel=canonical tag acts at a micro-level rather than a macro-level, so it is difficult to verify that a site-wide policy has been implemented correctly: every possible file has to be inspected. Code changes are also needed to write the rel=canonical tag, which may slow its implementation.

Since these issues apply to rel=canonical, and similar issues apply to the meta robots tag, it struck me that an opportunity has been missed to also solve canonical issues through the robots.txt file. A fix applied through robots.txt would suffer from none of the above problems.

Extensions to robots.txt could be made in a number of ways. For example, a mod_rewrite-type syntax could be introduced. However, I'm not sure anything so advanced is needed. Most canonical issues arise from three things:

  1. the use of query parameters in dynamic URLs.
  2. www versus non-www versions of a site (and other subdomains).
  3. inconsistent use of default index page URLs.

Some simple robots.txt fields to control these issues would fix most problems without the pain and errors that a mod_rewrite implementation would create.
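
To illustrate, on a hypothetical example.com site all three issues could combine so that the following URLs all return exactly the same page:

http://example.com/products/
http://www.example.com/products/
http://www.example.com/products/index.html
http://www.example.com/products/?sessionid=ABC123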

Query Parameters

Google Analytics and Yahoo Site Explorer are two examples of tools that allow simple manipulation of URL query parameters. Yahoo's Dynamic URL Help lists some of the crawling, indexing and ranking benefits of this approach.

Yahoo Site Explorer allows you to remove a query parameter or set a query parameter to a default value within a URL. Using this, a URL such as

http://www.example.com/page?sid=ABC123&refby=affiliate42

could be crawled and indexed as

http://www.example.com/page?refby=yhoo_srch

The session id has been dropped and the referrer has been overwritten with yhoo_srch, meaning all traffic sent by Yahoo Search could be attributed to Yahoo Search rather than the affiliate. This functionality could be implemented in robots.txt using a new syntax something like the following:

User-Agent: Slurp
Disallow:
QueryParam: -sid
QueryParam: refby=yhoo_srch

meaning that the sid query parameter is to be dropped (as it is preceded by '-') and the refby query parameter is to be overwritten with a default value (as a default value is provided). The same effect could be achieved with a single line:

User-Agent: Slurp
Disallow:
QueryParam: -sid, refby=yhoo_srch
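
As a rough sketch of how a crawler might interpret these directives, the following Python snippet parses QueryParam lines (in either the multi-line or the comma-separated single-line form) into rules. To be clear, QueryParam is only the field proposed in this post, not an existing robots.txt standard, and the parsing helper is purely hypothetical.

# Minimal sketch (not an existing standard): split the proposed
# QueryParam lines into drop / keep / default-value rules.
def parse_query_params(robots_lines):
    drop, keep, defaults = set(), set(), {}
    for line in robots_lines:
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "queryparam":
            continue
        for rule in value.split(","):          # single-line form uses commas
            rule = rule.strip()
            if not rule:
                continue
            if rule.startswith("-"):
                drop.add(rule[1:])             # "-sid" -> drop sid
            elif "=" in rule:
                name, _, default = rule.partition("=")
                defaults[name] = default       # "refby=yhoo_srch" -> overwrite
            else:
                keep.add(rule)                 # bare name -> keep as crawled
    return drop, keep, defaults

# The two-line and single-line forms above produce the same rules:
print(parse_query_params(["QueryParam: -sid", "QueryParam: refby=yhoo_srch"]))
print(parse_query_params(["QueryParam: -sid, refby=yhoo_srch"]))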

One problem with both Google Analytics and Yahoo Site Explorer is that you must list the query parameters you wish to drop from URLs - not the ones you wish to keep. Because third parties can link to your site, you're not in control of the links they create and the query parameters they use. Therefore, canonical issues can only truly be solved by specifying the query parameters you wish to keep, rather than those you wish to drop. To solve this, wildcards could specify the default action to be applied to all non-listed query parameters. Therefore I propose the following syntax:


QueryParam: retainParam[=defaultValue]
QueryParam: -dropParam
QueryParam: [-]*

where...

  • retainParam[=defaultValue]: specifies a query parameter you definitely want to keep, and an optional default value to set it to
  • -dropParam: specifies a query parameter you definitely want to drop
  • *: means keep all query parameters not specified (default)
  • -*: means drop all query parameters not specified
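
To make the intended behaviour concrete, here is a minimal Python sketch of how a crawler might apply such a rule set when canonicalising a URL. Again, this is only an illustration of the proposal above; the helper name and rule representation are hypothetical.

from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Sketch only: apply the proposed QueryParam semantics to a URL.
# keep maps parameter name -> default value (or None to keep the crawled value);
# drop is the set of parameters to remove;
# keep_unlisted reflects "*" (True, the default) versus "-*" (False).
def canonicalise_query(url, keep, drop, keep_unlisted=True):
    parts = urlsplit(url)
    out = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name in drop:
            continue                           # "-param": always dropped
        if name in keep:
            default = keep[name]
            out.append((name, default if default is not None else value))
        elif keep_unlisted:                    # "*": unlisted parameters kept
            out.append((name, value))
        # "-*": unlisted parameters silently dropped
    return urlunsplit(parts._replace(query=urlencode(out)))

# Example with a hypothetical affiliate-style URL:
print(canonicalise_query(
    "http://www.example.com/page?sid=ABC123&refby=affiliate42",
    keep={"refby": "yhoo_srch"}, drop={"sid"}))
# -> http://www.example.com/page?refby=yhoo_srch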

Default Domain and Index Pages

Two further, much simpler additions to robots.txt could clear up the majority of other canonical problems. These are Domain and IndexPage:


Domain: defaultDomain
IndexPage: defaultIndexPage

defaultDomain specifies the default domain for this robots.txt file. For example, if the search engine retrieves http://www.example.com/robots.txt and finds ...


Domain: http://example.com/

...it would know to index all URLs under the non-www domain. This would allow multiple parked domains to share the same content and robots.txt file without needing redirects or causing canonical issues, which is currently a common problem.

The IndexPage field specifies a default index page for the domain, i.e. a page for which the following two URLs are considered equivalent:

http://www.example.com/path/
http://www.example.com/path/defaultIndexPage
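
As with QueryParam, a short sketch may help show the intent. The Python helper below is purely hypothetical and assumes the Domain and IndexPage fields proposed above; it rewrites a discovered URL onto the default domain and strips the default index page from its path.

from urllib.parse import urlsplit, urlunsplit

# Sketch only: canonicalise the host to the Domain value and drop the
# IndexPage value from the end of the path.
def canonicalise_host_and_index(url, default_domain, default_index_page):
    parts = urlsplit(url)
    canonical_host = urlsplit(default_domain).netloc or parts.netloc
    path = parts.path
    if path.endswith("/" + default_index_page):
        path = path[: -len(default_index_page)]   # keep the trailing slash
    return urlunsplit((parts.scheme, canonical_host, path,
                       parts.query, parts.fragment))

# Example with "Domain: http://example.com/" and "IndexPage: index.html":
print(canonicalise_host_and_index(
    "http://www.example.com/path/index.html",
    "http://example.com/", "index.html"))
# -> http://example.com/path/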

Conclusion

In this post I've proposed three new fields to add to robots.txt to provide an alternative to the rel=canonical tag, just as the current robots.txt fields are themselves alternatives to the meta robots tag, with their own advantages and disadvantages. The chief advantages I see of canonicalising through robots.txt are:

  • Acting through robots.txt means that a resource does not have to be fetched and parsed in order for the canonicalisation instructions to be followed. Therefore, bandwidth and CPU are saved for both the Web site and the search engine.
  • Acting through robots.txt means that canonical issues can be solved for non-HTML files such as images, PDF files or Flash movies.
  • Acting through robots.txt means large scale changes can be made very quickly and easily without the need for any code changes. It's also much easier to review the changes that have been made.

The Domain, IndexPage and QueryParam fields would all be optional and independent of each other. It would be great if the search engines could introduce some or all of these ideas into robots.txt.
