Search Engines

lorna's picture

Online Marketing Solutions to Suit Your Needs

When it comes to online marketing there’s a lot to think about! Do you want more visitors arriving at your site, do you want more “free” traffic, are you struggling to manage the negative reviews your brand receives online, are people just not buying from your website?

Don’t despair, no matter what your needs are, there is an online marketing solution for you. Take a look at the table below to see what will best meet your goals:


Here at SilverDisc we’re experts in online marketing and we offer all of the above services, give us a call if you could do with some sound advice!

alan's picture

Working With Multi-Regional Web Sites

Google's John Mueller has published a good article on working with multi-regional web sites. He confirms that country-code Top Level Domains (ccTLDs) are the best way to host multi-regional content. He also clears up some of the myths surrounding duplicate content on multi-regional domains, which is most welcome.

John doesn't mention that the same thinking applies even if you are targeting a single country. A ccTLD is the best way to indicate the location of your target market to search engines - and to that market itself, of course.

A URL gives you at least five places to target a country: domain (ccTLD), subdomain (de.domain.com), directory(www.domain.com/de/), path parameters (www.domain.com/;domain=de) and query parameters(www.domain.com/?domain=de). However, there are lots more axes for the content to be split along:

  • Category - Web, Enterprise, Social, Real Time
  • Context - Intranet, Library, Personal
  • Topic - Health, Travel, Jobs, etc.
  • Vertical - Finance, Education, Government, etc.
  • Platform - Desktop, Mobile, Television, Kiosk
  • Format - Text, Image, Audio, Video, Map

(Note: the above is slightly modified from a table provided by Search Patterns, an excellent read)

Given this number of ways of organising content, and the fact that the location and language of your target audience are major considerations (worthy of a major axis), in all but the most trivial cases a ccTLD is the obvious choice for geo-targeting. It's good to see official written confirmation of this from Google.

alan's picture

a robots.txt equivalent to rel=canonical

In my last post I looked at the rel=canonical tag and finished by promising to look at some of the limitations of rel=canonical and consider some alternatives.

Many of the alternatives have existed for some time - the use of redirects and cookies, for example. However, the introduction of a rel=canonical tag was an opportunity for search engines to also introduce other, more efficient, standards. These are the alternatives I would like to consider - alternatives that don't exist yet, which the search engines could have introduced this time around and may introduce in future.

I see the rel=canonical tag as analogous to the meta robots tag, and therefore suffering from many of the same limitations:

  • The rel=canonical tag is located in a HTML file, and that HTML therefore needs to be fetched and parsed in order for the tag to be seen and acted upon. Therefore, the tag does not save bandwidth or CPU for the Web site or search engine.
  • The rel=canonical tag is located in a HTML file and gives instructions about that file. Therefore, it cannot be used to solve canonical issues for non-HTML files such as images, PDF files or Flash movies.
  • The rel=canonical tag acts at a micro-level rather than a macro-level. Therefore it is difficult to review that a site-wide policy has been correctly implemented using rel=canonical; Every possible file has to be inspected. Also, code changes have to be made in order to write the rel=canonical tag. This may slow its implementation.

Where the above issues apply to rel=canonical, and similar issues apply to the meta robots tag, it struck me that an opportunity has been missed to also solve canonical issues through the robots.txt file. Any fix applied through robots.txt would not suffer from the above problems.

Extensions to robots.txt could be made in a number of ways. For example, a mod_rewrite-type syntax could be introduced. However, I'm not sure anything so advanced is needed. Most canonical issues arise from three things:

  1. the use of query parameters in dynamic URLs.
  2. www versus non-www versions of a site (and other subdomains).
  3. inconsistent use of default index page URLs.

Some simple robots.txt fields to control these issues would fix most problems without the pain and errors that a mod_rewrite implementation would create.

Query Parameters

Google Analytics and Yahoo Site Explorer are two examples of tools that allow simple manipulation of URL query parameters. Yahoo's Dynamic URL Help lists some of the crawling, indexing and ranking benefits of this approach.

Yahoo Site Explorer allows you to remove a query parameter or set a query parameter to a default value within a URL. Using this, a URL such as

could be crawled and indexed as

The session id has been dropped and the referrer has been overwritten as yhoo_srch, meaning all traffic sent by Yahoo Search could be attributed to Yahoo Search rather than the affiliate. This functionality could be implemented in robots.txt using a new syntax something like the following:

User-Agent: Slurp
Disallow:
QueryParam: -sid
QueryParam: refby=yhoo_srch

meaning that the sid query parameter is to be dropped (as it is preceded by '-') and the refby query parameter is to be overwritten with a default value (as a default value is provided). The same effect could be achieved with a single line:

User-Agent: Slurp
Disallow:
QueryParam: -sid, refby=yhoo_srch

One problem with both Google Analytics and Yahoo Site Explorer is that you must list the query parameters you wish to drop from URLs - not the ones you wish to keep. Because third parties can link to your site, you're not in control of the links they create and the query parameters they use. Therefore, canonical issues can only truly be solved by specifying the query parameters you wish to keep, rather than those you wish to drop. To solve this, wildcards could specify the default action to be applied to all non-listed query parameters. Therefore I propose the following syntax:


QueryParam: retainParam[=defaultValue]
QueryParam: -dropParam
QueryParam: [-]*

where...

  • retainParam[=value]: specfies a query parameter you definitely want to keep, and an optional default value you want it set to
  • -dropParam: specifies a query parameter you definitely want to drop
  • *: means keep all query parameters not specified (default)
  • -*: means drop all query parameters not specified

Default domain and Index Pages

Two further, much simpler additions to robots.txt could clear up the majority of other canonical problems. These are Domain and IndexPage:


Domain: defaultDomain
IndexPage: defaultIndexPage

defaultDomain specfies the default domain for this robots.txt file. For example, if the search engine retrieves http://www.example.com/robots.txt and finds ...


Domain: http://example.com/

...it would know to index all URLs under the non-www domain. This would allow multiple parked domains to share the same content and robots.txt file without needing redirects or causing canonical issues, which is currently a common problem.

The IndexPage field specifies a default index page for the domain, i.e. a page for which the following two URLs are considered equivalent:

http://www.example.com/path/
http://www.example.com/path/defaultIndexPage

Conclusion

In this post I've proposed three new fields to add to robots.txt to provide an alternative to the rel=canonical tag, just as the current robots.txt fields are themselves alternatives to the meta robots tag, with their own advantages and disadvantages. The chief advantages I see of canonicalising through robots.txt are:

  • Acting through robots.txt means that a resource does not have to be fetched and parsed in order for the canonicalisation instructions to be followed. Therefore, bandwidth and CPU is saved for both the Web site and search engine.
  • Acting through robots.txt means that canonical issues can be solved for non-HTML files such as images, PDF files or Flash movies.
  • Acting through robots.txt means large scale changes can be made very quickly and easily without the need for any code changes. It's also much easier to review the changes that have been made.

The Domain, IndexPage and QueryParam fields would all be optional and independent of each other. It would be great if the search engines could introduce some or all of these ideas into robots.txt.

alan's picture

URL Canonicalisation and Normalisation

I’ve been meaning to write about the new rel=canonical tag, which was proposed by Google, Yahoo and Microsoft on February 12. I managed to squeeze some thoughts on it into my presentation and workshop at SES London, and I’ll be speaking more about it at SES New Yorknext month, but before I blogged about it I really wanted to write more about URL Canonicalisation and Normalisation in general.

Canonicalisation or Canonicalization? Normalisation or Normalization?

I’m British, so I say Canonicalisation and Normalisation. Your mileage may vary.

What is URL Canonicalisation?

We’re talking about search engines here, so let’s try a definition that applies generally, but leans towards search:

URL Canonicalisation
involves taking a set of different URLs that all serve or lead to the same or similar content, and applying rules to select one URL from that set under which that content should be indexed or presented.

I’ve hyperlinked the terms I think are important to more detail below, but before we go into them let’s try defining URL Normalisation.

URL Normalisation
involves taking a single URL and applying a normalisation algorithm to produce a standard form for that URL.

Others define normalisation and canonicalisation as all part of the same thing, but I like to think of them as separate processes. To my way of thinking:

  • you can normalise a single URL but you can only canonicalise a set of URLs
  • an un-normalised URL will serve the same content as a normalised URL, because it’s the same URL
  • all indexed URLs are normalised; not all are canonicalised
  • normalisation occurs before canonicalisation

Now let’s go back and look at those hyperlinked terms in more detail.

Set of different URLs

This is the key to canonicalisation and why it’s needed: the same content is being presented at a number of different URLs. By different URLs, I mean those URLs are really different to each other – they could potentially show different content but (in this case) they don’t. Here is an example set of URLs:

  • http://www.example.com/
  • http://example.com/
  • http://www.example.com/index.html
  • http://example.com/default.asp
  • http://www.example.com/?referrer=affiliateName
  • http://www.example.com/?sessionid=123456

All serve or lead to the same or similar content

If each of the above URLs served the same, or essentially the same, content, it’s likely that they would be canonicalised to fewer URLs – possibly only one. If they each served completely different content, then it’s much less likely that this canonicalisation would take place. By “or lead to”, I mean that the URL may redirect (e.g. with a HTTP 301 or HTTP 302 redirect) to another URL.

Canonicalisation Rules

The rules for canonicalisation vary from engine to engine and time to time. Here are a few examples of when canonicalisation will take place …

  • If www and non-www versions of the URL exist, then canonicalise
  • If the same base URL is seen with different numbers of query parameters, then canonicalise
  • If the filename component of the URL matches a known set of index pages (e.g. index.*, default.*, etc.) then canonicalise
  • If the home page (“/”) redirects to another page, then canonicalise

… and here are some examples of how canonicalisation will take place:

  • Choose the URL with the highest Pagerank (or similar link-based or other off-page criteria)
  • Obey rel=nofollow webmaster hint
  • Choose the simplest URL (e.g. the shortest URL, or the one with fewest query parameters)

Indexed or presented

Sometimes only one URL from a set will be indexed, which means that it will always be the candidate URL to be presented in a set of search results. At other times multiple URLs may be indexed, even though they are known to be part of the same canonical set. One of these URLs will be selected to appear in a given set of search results. The URL that is selected may vary (for example, by query or by searcher location) – but only one will ever appear on a given search results page.

Single URL

Normalisation operates on a single URL rather than on a set of URLs. That single URL may need be supplemented with other data in order for normalisation to take place. For example, un-normalised URLs may be relative or absolute. A normalised URL will always be a fully-qualified absolute URL so, along with a relative URL, the containing URL or tag will need to be known in order for normalisation to take place.

Normalisation algorithm to produce a standard form

Like canonicalisation rules, the normalisation algorithm may vary from engine to engine and time to time. However, it’s much less likely to vary. Here is an example of the kind of things that are done during normalisation:

  1. convert a relative URL to an absolute URL
  2. convert the scheme and the host name components of the URL to lower case
  3. remove the port component if it matches the default port
  4. escape characters that should be represented as octets (or a +)
  5. unescape octets that are better represented as plain characters
  6. convert all escape sequences to upper case

Here are some examples of each operation:

  1. In http://www.silverdisc.co.uk/ , a link to “/contact.html” would be normalised to http://www.silverdisc.co.uk/contact.html
  2. HTTP://WWW.SILVERDISC.CO.UK/contact.html would be normalised to http://www.silverdisc.co.uk/contact.html
  3. http://www.silverdisc.co.uk:80/contact.html would be normalised to http://www.silverdisc.co.uk/contact.html, because 80 is the default port for HTTP connections.
  4. http://www.silverdisc.co.uk/contact.html?name=Alan Perkins would be normalised to http://www.silverdisc.co.uk/contact.html?name=Alan+Perkins or http://www.silverdisc.co.uk/contact.html?name=Alan%20Perkins, because a space is not a valid character in a URL.
  5. http://www.silverdisc.co.uk/cont%61ct.html would be normalised to http://www.silverdisc.co.uk/contact.html, because %61 is better represented as the character “a” in a URL.
  6. A %2a in a URL would be converted to %2A for consistency

Summary

That completes this introduction to URL canonicalisation and normalisation. In the next post, I’ll look at rel=nofollow.

alan's picture

How much PageRank does a page that is not in the index have?

If a page has a NOINDEX tag on it, how much PageRank does it have? The intuitive answer would be "None". How can a page that is not indexed have PageRank? Wouldn't it be treated like a dangling link and disregarded during PageRank calculations?

Apparently not. Matt Cutts states on seomoz:

Does a link from a page with meta robots="noindex, follow" carry less weight? no weight?
For Google, I believe such links would carry the same weight as normal links on regular pages.

Hmmm. Does he mean that the unindexed page actually has a PageRank? Or does he mean that the zero Pagerank that the unindexed page has would be divided out among the links on the page, giving nothing to each? I wonder ...

One thing's for sure ... if "NOINDEX, FOLLOW" works as implied, it's a great way to inject spammy content and links.

alan's picture

QDF - WTF...?

Search engine freshness is an issue that's always been of interest to me ... certainly since the very early days of my search marketing adventures, in the mid-1990s. It was of such interest that it lies at the core of one of my patents, "Process for maintaining ongoing registration for pages on a given search engine".

I would define search engine freshness as follows:

freshness
An indication of how recently a search engine has crawled and indexed a page, a part of its index or its index as a whole

I think most search engine engineers would define it in a similar way. For example Matt Cutts, in his blog post "Measuring Freshness" from September 2005, stated

The authors tracked Google, Yahoo, and MSN over 42 days using 38 German webpages that were updated daily and that included a datestamp somewhere on the page. They measured freshness by looking at each search engine’s cached page to see how up-to-date the page was. If you measure success by having a version of a page within 0 or 1 days, Google succeeded a little under 83% of the time, MSN succeeded 48% of the time, and Yahoo succeeded about 42% of the time.

So I'm surprised to find the following quote in the recent, much-publicised New York Times article "Google Keeps Tweaking Its Search Engine":

So [Mr. Singhal] monitors complaints on his white board, prioritizing them if they keep coming back. For much of the second half of last year, one of the recurring items was “freshness.”

Freshness, which describes how many recently created or changed pages are included in a search result, is at the center of a constant debate in search: Is it better to provide new information or to display pages that have stood the test of time and are more likely to be of higher quality? Until now, Google has preferred pages old enough to attract others to link to them.

But last year, Mr. Singhal started to worry that Google’s balance was off. When the company introduced its new stock quotation service, a search for “Google Finance” couldn’t find it. After monitoring similar problems, he assembled a team of three engineers to figure out what to do about them.

Earlier this spring, he brought his squad’s findings to Mr. Manber’s weekly gathering of top search-quality engineers who review major projects. At the meeting, a dozen people sat around a large table, another dozen sprawled on red couches, and two more beamed in from New York via video conference, their images projected on a large screen. Most were men, and many were tapping away on laptops. One of the New Yorkers munched on cake.

Mr. Singhal introduced the freshness problem, explaining that simply changing formulas to display more new pages results in lower-quality searches much of the time. He then unveiled his team’s solution: a mathematical model that tries to determine when users want new information and when they don’t. (And yes, like all Google initiatives, it had a name: QDF, for “query deserves freshness.”)

"Query deserves freshness"? I thought that all queries deserved freshness! A stale index results in poor quality SERPs that could lead to lots of pages that no longer exist or have been changed since they were indexed. A fresh index solves this problem to a great extent.

This appears to be a new, different use of the word freshness in relation to search engines, referring not to the last indexed date of a page, but to the last modified date of a page - something very different!

Matt Cutts referred to this NYT article in "Five things you didn’t know about Google’s search", without mentioning this variation in the use of the word Freshness, which I find odd - especially as elsewhere in the article, it was claimed that Matt and Mr. Singhal share an office (with two other people). Hmmm ... do they talk, do you think? :D

I think I'd have called it "Query Deserves Recency".

Syndicate content