Technical Architecture

alan's picture

Matt Cutts Interviewed on Site Architecture

Matt Cutts has given a very useful interview with Eric Enge, which rounds up a lot of information architecture and technical architecture issues. There's nothing really new here, but it's good to get all this info into one place and to see it confirmed by Matt. Topics covered:

  • crawl budget/indexation cap - the use of PageRank and host load to control crawl depth and frequency
  • the effect of duplicate content on PageRank
  • session IDs and affiliate IDs in links/URLs
  • faceted navigation - good to see Matt confirming that the advice I gave at SES London, and will be giving next week at SMX Munich, is all correct.
  • different ideas for use of the rel=canonical tag
  • 301 redirects and how they differ from 302 redirects
  • Google Webmaster Tools (WMT) ignore parameters
  • PageRank sculpting and its effectiveness in the modern world
  • JavaScript, IFRAME and PDF handling
  • paid links and nofollow

Overall, the article strongly reinforces the fact that a successful site architecture is essential to SEO success.

Working With Multi-Regional Web Sites

Google's John Mueller has published a good article on working with multi-regional web sites. He confirms that country-code Top Level Domains (ccTLDs) are the best way to host multi-regional content. He also clears up some of the myths surrounding duplicate content on multi-regional domains, which is most welcome.

John doesn't mention that the same thinking applies even if you are targeting a single country. A ccTLD is the best way to indicate the location of your target market to search engines - and to that market itself, of course.

A URL gives you at least five places to target a country: domain (ccTLD), subdomain (de.domain.com), directory (www.domain.com/de/), path parameters (www.domain.com/;domain=de) and query parameters (www.domain.com/?domain=de). However, there are lots more axes for the content to be split along:

  • Category - Web, Enterprise, Social, Real Time
  • Context - Intranet, Library, Personal
  • Topic - Health, Travel, Jobs, etc.
  • Vertical - Finance, Education, Government, etc.
  • Platform - Desktop, Mobile, Television, Kiosk
  • Format - Text, Image, Audio, Video, Map

(Note: the above is slightly modified from a table provided by Search Patterns, an excellent read)

Given this number of ways of organising content, and the fact that the location and language of your target audience are major considerations (worthy of a major axis), in all but the most trivial cases a ccTLD is the obvious choice for geo-targeting. It's good to see official written confirmation of this from Google.

rel=canonical tag

So, Google, Yahoo, Microsoft and, more recently, Ask have announced the new "canonical" link type or, more colloquially, the rel=canonical tag. Much has already been written about this tag and its purpose: to help prevent duplicate content issues. Probably the best summary is this Matt Cutts video.

This tag is a welcome addition to the armoury in the fight against duplicate content issues. In addition to Matt's comments, I would make the following points:

Copyright Protection

Scrapers are forever copying content and publishing it on their own sites/splogs. Sometimes they are exceptionally lazy or stupid, even to the extent that they copy Adsense code onto their own sites. If they copy your rel=canonical tag onto their site, that would give a strong "hint" to the search engine that you were the original owner of the content:

<link rel="canonical" href="http://www.mysite.com/my/content/" />

Microsoft Platforms

Matt made reference to the Microsoft platform in his video, but I would emphasise the point. Microsoft's implementation of RFC 2396 is flawed. The path component of a URL is supposed to be case sensitive, but Microsoft makes it case insensitive. If there are n alphabetic characters in the path, then a Microsoft implementation gives 2^n possible variations of that path, where there should be only one. For example, if n=1 and the path is "/a/", Microsoft would allow "/a/" and "/A/"; if n=2 and the path is "/ab/", Microsoft would allow "/ab/", "/aB/", "/Ab/" and "/AB/"; and so on. 2^n variations give vast potential for duplicate content, and this is a big issue with sites built on the Microsoft platform. The rel=canonical tag makes it very easy to specify the correct, case-sensitive path on a Microsoft platform:

<link rel="canonical" href="http://www.mysite.com/my/case/sensitive/path/" />
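
To get a feel for how quickly those case variations multiply, here is a minimal Python sketch. It is my own illustration, not anything from Matt's video; the function name case_variants is made up, and it assumes plain ASCII paths.

```python
from itertools import product


def case_variants(path):
    """Return every upper/lower-case spelling of an ASCII path.

    Each alphabetic character contributes two choices (lower, upper);
    every other character contributes exactly one.
    """
    choices = [
        (ch.lower(), ch.upper()) if ch.isalpha() else (ch,)
        for ch in path
    ]
    return {"".join(combo) for combo in product(*choices)}


if __name__ == "__main__":
    print(sorted(case_variants("/ab/")))
    # The count doubles with every extra letter in the path.
    print(len(case_variants("/path/")))
```

On a case-insensitive server, every string in that set is a URL that serves the same page, which is why a single canonical spelling is worth declaring.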

Static Web Content

Static web content is content that is stored in the format in which it is delivered. Typically, static content is served under a static URL (a URL that does not contain a question mark). However, it is possible to link to static content and append query parameters, even though these query parameters will have no impact on the content that is served. One example of when this might happen is when a referrer parameter is passed to a JavaScript function within the static content:

<a href="http://www.mysite.com/?referrer=myAffiliate0001">Affiliate Link</a>

Thousands of links can be created to a single, static URL, each with a different referrer query parameter attached. For sites built on static content, trying to manage such links has been difficult in the past. Now, it's relatively easy. Each page of static content simply needs to contain a rel=canonical tag:

<link rel="canonical" href="http://www.mysite.com/my/static/url.html" />
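
If you want to check which canonical URL a given parameterised link ought to collapse to, something like the following Python sketch would do. The function name canonical_link_tag is hypothetical, and it assumes that, for static content, everything after the question mark can be ignored. That assumption would be wrong for dynamic pages.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_link_tag(requested_url):
    """Build a rel=canonical tag for a static page.

    Assumption: the page is static, so the query string and fragment
    have no effect on the content and can simply be dropped.
    """
    parts = urlsplit(requested_url)
    clean = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return '<link rel="canonical" href="{}" />'.format(escape(clean, quote=True))


if __name__ == "__main__":
    print(canonical_link_tag("http://www.mysite.com/?referrer=myAffiliate0001"))
```

However many different referrer values are appended, each link maps to the same tag, so the tag can be written once into the static file.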

Conclusions: rel=canonical

For the reasons stated above, I would recommend the use of a rel=canonical tag in all static content. In fact, I would recommend its use in all content, static or dynamic - with appropriate care of course. It's a powerful tag and using it wrongly could have dire consequences. In the next post I'll look at some of the limitations of the rel=canonical tag and consider some alternatives.

URL Canonicalisation and Normalisation

I’ve been meaning to write about the new rel=canonical tag, which was proposed by Google, Yahoo and Microsoft on February 12. I managed to squeeze some thoughts on it into my presentation and workshop at SES London, and I’ll be speaking more about it at SES New York next month, but before I blogged about it I really wanted to write more about URL Canonicalisation and Normalisation in general.

Canonicalisation or Canonicalization? Normalisation or Normalization?

I’m British, so I say Canonicalisation and Normalisation. Your mileage may vary.

What is URL Canonicalisation?

We’re talking about search engines here, so let’s try a definition that applies generally, but leans towards search:

URL Canonicalisation
involves taking a set of different URLs that all serve or lead to the same or similar content, and applying rules to select one URL from that set under which that content should be indexed or presented.

I’ve hyperlinked the terms I think are important to more detail below, but before we go into them let’s try defining URL Normalisation.

URL Normalisation
involves taking a single URL and applying a normalisation algorithm to produce a standard form for that URL.

Others define normalisation and canonicalisation as all part of the same thing, but I like to think of them as separate processes. To my way of thinking:

  • you can normalise a single URL but you can only canonicalise a set of URLs
  • an un-normalised URL will serve the same content as a normalised URL, because it’s the same URL
  • all indexed URLs are normalised; not all are canonicalised
  • normalisation occurs before canonicalisation

Now let’s go back and look at those hyperlinked terms in more detail.

Set of different URLs

This is the key to canonicalisation and why it’s needed: the same content is being presented at a number of different URLs. By different URLs, I mean those URLs are really different to each other – they could potentially show different content but (in this case) they don’t. Here is an example set of URLs:

  • http://www.example.com/
  • http://example.com/
  • http://www.example.com/index.html
  • http://example.com/default.asp
  • http://www.example.com/?referrer=affiliateName
  • http://www.example.com/?sessionid=123456

All serve or lead to the same or similar content

If each of the above URLs served the same, or essentially the same, content, it’s likely that they would be canonicalised to fewer URLs – possibly only one. If they each served completely different content, then it’s much less likely that this canonicalisation would take place. By “or lead to”, I mean that the URL may redirect (e.g. with a HTTP 301 or HTTP 302 redirect) to another URL.

Canonicalisation Rules

The rules for canonicalisation vary from engine to engine and time to time. Here are a few examples of when canonicalisation will take place …

  • If www and non-www versions of the URL exist, then canonicalise
  • If the same base URL is seen with different numbers of query parameters, then canonicalise
  • If the filename component of the URL matches a known set of index pages (e.g. index.*, default.*, etc.) then canonicalise
  • If the home page (“/”) redirects to another page, then canonicalise

… and here are some examples of how canonicalisation will take place:

  • Choose the URL with the highest PageRank (or similar link-based or other off-page criteria)
  • Obey rel=canonical webmaster hint
  • Choose the simplest URL (e.g. the shortest URL, or the one with fewest query parameters)
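
As an illustration of the "simplest URL" rule only, here is a small Python sketch. It is a toy of my own, not how any engine actually works; real engines will weigh link-based signals and webmaster hints as well. The function name pick_simplest and the tie-breaking order are assumptions.

```python
from urllib.parse import parse_qsl, urlsplit


def pick_simplest(urls):
    """Pick one URL from a set that is already known to serve the same content.

    Preference order (an assumption for this sketch):
    fewest query parameters, then shortest URL, then alphabetical order.
    """
    def complexity(url):
        query = urlsplit(url).query
        n_params = len(parse_qsl(query, keep_blank_values=True))
        return (n_params, len(url), url)

    return min(urls, key=complexity)


if __name__ == "__main__":
    candidates = [
        "http://www.example.com/",
        "http://example.com/",
        "http://www.example.com/index.html",
        "http://example.com/default.asp",
        "http://www.example.com/?referrer=affiliateName",
        "http://www.example.com/?sessionid=123456",
    ]
    print(pick_simplest(candidates))  # http://example.com/
```

Note that this rule alone picks the non-www home page, which may not be the version you want indexed. That is exactly why a hint from the webmaster is useful.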

Indexed or presented

Sometimes only one URL from a set will be indexed, which means that it will always be the candidate URL to be presented in a set of search results. At other times multiple URLs may be indexed, even though they are known to be part of the same canonical set. One of these URLs will be selected to appear in a given set of search results. The URL that is selected may vary (for example, by query or by searcher location) – but only one will ever appear on a given search results page.

Single URL

Normalisation operates on a single URL rather than on a set of URLs. That single URL may need to be supplemented with other data in order for normalisation to take place. For example, un-normalised URLs may be relative or absolute. A normalised URL will always be a fully-qualified absolute URL so, along with a relative URL, the URL of the containing page (or its base tag) will need to be known in order for normalisation to take place.

Normalisation algorithm to produce a standard form

Like canonicalisation rules, the normalisation algorithm may vary from engine to engine and time to time. However, it’s much less likely to vary. Here are examples of the kind of thing that is done during normalisation:

  1. convert a relative URL to an absolute URL
  2. convert the scheme and the host name components of the URL to lower case
  3. remove the port component if it matches the default port
  4. escape characters that should be represented as octets (or a +)
  5. unescape octets that are better represented as plain characters
  6. convert all escape sequences to upper case

Here are some examples of each operation:

  1. In http://www.silverdisc.co.uk/ , a link to “/contact.html” would be normalised to http://www.silverdisc.co.uk/contact.html
  2. HTTP://WWW.SILVERDISC.CO.UK/contact.html would be normalised to http://www.silverdisc.co.uk/contact.html
  3. http://www.silverdisc.co.uk:80/contact.html would be normalised to http://www.silverdisc.co.uk/contact.html, because 80 is the default port for HTTP connections.
  4. http://www.silverdisc.co.uk/contact.html?name=Alan Perkins would be normalised to http://www.silverdisc.co.uk/contact.html?name=Alan+Perkins or http://www.silverdisc.co.uk/contact.html?name=Alan%20Perkins, because a space is not a valid character in a URL.
  5. http://www.silverdisc.co.uk/cont%61ct.html would be normalised to http://www.silverdisc.co.uk/contact.html, because %61 is better represented as the character “a” in a URL.
  6. A %2a in a URL would be converted to %2A for consistency
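
The six operations above can be sketched in Python using only the standard library. This is a simplified illustration, not the algorithm any search engine uses. The names normalise, DEFAULT_PORTS and UNRESERVED are my own; the sketch escapes spaces as %20 rather than +, treats only spaces as needing escaping, and drops any user:password part of the URL.

```python
import re
from urllib.parse import urljoin, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}
# Characters that never need escaping (the "unreserved" set).
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def _fix_escapes(text):
    # Step 4: escape characters that are not valid in a URL (spaces only here).
    text = text.replace(" ", "%20")

    def repl(match):
        ch = chr(int(match.group(1), 16))
        # Step 5: unescape octets better shown as plain characters.
        # Step 6: otherwise keep the escape, with upper-case hex digits.
        return ch if ch in UNRESERVED else "%" + match.group(1).upper()

    return re.sub(r"%([0-9A-Fa-f]{2})", repl, text)


def normalise(href, base=None):
    # Step 1: resolve a relative URL against the containing page.
    url = urljoin(base, href) if base else href
    parts = urlsplit(url)
    # Step 2: lower-case the scheme and host name.
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Step 3: drop the port if it is the default for the scheme.
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = "{}:{}".format(host, port)
    return urlunsplit(
        (scheme, netloc, _fix_escapes(parts.path),
         _fix_escapes(parts.query), parts.fragment)
    )


if __name__ == "__main__":
    base = "http://www.silverdisc.co.uk/"
    print(normalise("/contact.html", base))
    print(normalise("HTTP://WWW.SILVERDISC.CO.UK:80/cont%61ct.html"))
```

Both calls print http://www.silverdisc.co.uk/contact.html, which is the point of normalisation: different spellings of the same URL end up as one string before any canonicalisation is attempted.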

Summary

That completes this introduction to URL canonicalisation and normalisation. In the next post, I’ll look at rel=canonical.

Dixons Technical Architecture

Reading the Sunday Times this weekend, I came across an interesting article on Full HD TVs. The Sharp LC-37XD1E looked good value, so I checked Dixons. They didn't stock it, but they do stock the 42" model, the SHARP LC42XD1E Flat Panel TV.

Enough about TVs. See that great link I just gave Dixons? A deep link direct to a product page, labelled with the product text. That link should help Dixons to rank well for the SHARP LC42XD1E Flat Panel TV. Unfortunately for Dixons, it won't help as much as it could. Just look at the URL of the link:

http://www.dixons.co.uk/martprd/store/dix_page.jsp?
BV_SessionID=@@@@2111405657.1178042616@@@@&
BV_EngineID=ccckaddkkmmglhhcflgceggdhhmdgml.0&
page=Product&fm=null&sm=null&tm=null&sku=317042&
category_oid=-28723

That's a bad URL. It's encoded with Session IDs, Engine IDs and null parameters. That's not the sort of link a search engine would like to crawl, and even if a search engine did manage to crawl and index the content at that URL, it's unlikely that a searcher visiting the URL several weeks later, as a result of searching for a SHARP LC42XD1E Flat Panel TV, would see any content. The session ID would be long expired. This URL is produced by BroadVision, an e-commerce application used by Dixons.
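
To show what a cleaner link could look like, here is a rough Python sketch that strips the session noise from that URL. The function name strip_session_noise is hypothetical, and I am assuming that the BV_SessionID and BV_EngineID parameters and the "null" values are not needed to identify the product. I have not verified that BroadVision would actually serve the shortened URL.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameters identify the visit, not the content.
SESSION_PARAMS = {"BV_SessionID", "BV_EngineID"}


def strip_session_noise(url):
    """Remove session parameters and parameters whose value is 'null'."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in SESSION_PARAMS and value != "null"
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), "")
    )


if __name__ == "__main__":
    dixons_url = (
        "http://www.dixons.co.uk/martprd/store/dix_page.jsp?"
        "BV_SessionID=@@@@2111405657.1178042616@@@@&"
        "BV_EngineID=ccckaddkkmmglhhcflgceggdhhmdgml.0&"
        "page=Product&fm=null&sm=null&tm=null&sku=317042&"
        "category_oid=-28723"
    )
    print(strip_session_noise(dixons_url))
```

What is left is just the page type, the SKU and the category, which is all a search engine or a returning searcher should need.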

Moral: expensive, high-end Web applications don't necessarily produce marketable, search-friendly sites.