SEO

Lorna Casswell's picture

A Recipe for the World’s Biggest PPC Ad: Do Too Many Ad Extensions Spoil the Broth?

Google’s launch of the new enhanced ad sitelinks got us talking and led us to ask Google – “How many ad extensions can appear at once in an ad?”

If you use multiple types of ad extensions you’ll be pleased to hear that, in theory, they can all show at once! In the PPC manager’s dream an ad could look like this:

“Fantastic”, you say? It certainly could be (but spare a thought for the SEO troopers who have probably spent years fighting for that top spot in the natural results). An ad of this size means you can probably wave “Bye bye” to competitors as your CTRs and conversions soar.

Ad Extensions in Reality

Sadly, for the most part you’ll struggle to get more than two of your ad extensions to appear at once. It’s ultimately Google that decides which ad extensions are shown. If Google deems one ad more relevant than another, that ad is likely to show more extensions.

If you’ve built up a good review base and seller ratings appear in your ads then these are likely to show pretty much all of the time along with other ad extensions including location extensions, product extensions or sitelinks.

If Google thinks your ads are really, really relevant to a search query you might be rewarded with an ad that displays sitelinks, seller ratings and product extensions – well done you!

Use Ad Extensions Wisely

Ad extensions can be great for CTRs; ads see an average uplift of 30%. An ad extension can take someone to the most relevant part of your site, whether that is a particular product page or your contact details, which can lead to increased conversion rates.

It’s easy to see the attraction of enabling as many ad extensions as possible but before you go too crazy consider their use cautiously. Not all ad extensions are suitable for every business! If they’re not used carefully your potential customers could end up in the wrong place.

More generally, it’s possible that too many different types of ad extensions in Google’s results could cause confusion. A searcher may wonder why ExampleA.com has product images in its ad whereas ExampleB.com has text links to different parts of its website. Searchers may struggle to decide where and what to click on.

Ad Extension: Quick Tips

  • Ensure the right ad extension is used in the right place. You don’t want a location extension enabled if you do not sell the product or service from your physical location; opt for sitelinks instead
  • Where possible highlight offers within sitelinks as this could lead to better CTRs
  • Be descriptive within sitelinks and make best use of the character allowance you have
  • If you have a physical location then make sure your Google Places listing is fully optimised! Spelling mistakes, inaccurate information and poor pictures can let you down
  • When adding a phone number as part of a call or location extension, where possible use a number with a local prefix. We all know that 0800 numbers are not free from a mobile phone
  • For good product extensions make sure your Google Merchant Centre feed is well optimised and that your product info is as up to date as possible. It’s no good showing a fantastic price that is out of date once a person clicks through to your website
  • Keep an eye on your competitors and ensure that they are not ahead of your game. If they’re running a fantastic offer in a sitelink, can you match it or beat it? If not, come up with an alternative offer
  • Ensure your Google+ Page looks good if you’re opted into social extensions. Remember, you’ll need to verify your Google+ Page on your site if you want social extensions to appear in your ads.

If all this seems too complicated, you could always get SilverDisc to do it for you!


alan's picture

SEO Is Not Spam, Says Google's Matt Cutts

Google's Head of Webspam, Matt Cutts, has posted a video to the Google Webmaster YouTube Channel explaining what he's been saying in private and on conference platforms for years - that SEO per se is not spam; ethical SEO as practiced and long advocated by me (so much so that I worked with Matt when he was putting together the original Google Webmaster Guidelines a decade ago this month) is certainly not spam; but that some forms of SEO, in particular black hat SEO, are spam.  The video is below and, for those of you without video playing capabilities, a transcript prepared by Lynda follows:

 

Transcript of "Does Google Consider SEO to be Spam?" By Matt Cutts

I wanted to take a minute and talk a little bit about search engine optimization and spam, and answer the question “Does Google consider SEO to be spam?”

And the answer is “No. We don’t consider SEO to be spam.”  Now a few really tech savvy people might get angry at that. So let me explain in a little more detail.  SEO stands for Search Engine Optimization.  And essentially it just means trying to make sure that your pages are well represented within search engines.  And there’s plenty...an enormous amount ...of white hat, great quality stuff that you can do as a search engine optimizer.

You can do things like making sure that your pages are crawlable. So you want them to be accessible. You want people to be able to find them just by clicking on links. And in the same way search engines can find them just by clicking on links.

You want to make sure that people use the right keywords. If you’re using industry jargon or lingo that not everybody else uses, then a good SEO can help you find out, oh, these are keywords that you should have been thinking about.

You can think about usability, and trying to make sure that the design of the site is good. That’s good for users and for search engines.

You can think about how to make your site faster. Not only does Google use site speed in our rankings as one of the many factors that we use in our search rankings. But if you can make your site run faster, that can also make it a much better experience.

So there are an enormous number of things that SEOs do, everything from helping out with the initial site architecture and deciding what your site should look like, and the url structure, and the templates, and all that sort of stuff, making sure that your site is crawlable, all the way down to helping optimize for your return on investment. So trying to figure out what are the ways that you are going to get the best bang for the buck, doing AB testing, trying to find out, OK, what is the copy that converts, all those kinds of things. There is nothing at all wrong with all of those white hat methods.

Now, are there some SEOs who go further than we would like? Sure. And are there some SEOs who actually try to employ black hat techniques, people that hack sites or that keyword stuff and just repeat things or that do sneaky things with redirects? Yeah, absolutely. But our goal is to make sure that we return the best possible search results we can. And a very wonderful way that search engine optimizers can help is by cooperating and trying to help search engines find pages better. So SEO is not spam. SEO can be enormously useful. SEO can also be abused and it can be overdone.

But it’s important to realise that we believe, in an ideal world, people wouldn’t have to worry about these issues. But search engines are not as smart as people yet. We’re working on it. We’re trying to figure out what people mean. We’re trying to figure out synonyms, and vocabulary, and stemming so that you don’t have to know exactly the right word to search for what you wanted to find. But until we get to that day, search engine optimization can be a valid way to help people find what they are looking for via search engines.

We provide webmaster guidelines on google.com/webmasters. There’s a free webmaster forum. There are free webmaster tools. There’s a ton of HTML documentation. So if you search for SEO starter guide, we’ve written a beginner guide where people can learn more about search engine optimization.

But just to be very clear, there are many, many valid ways that people can make the world better with SEO. It’s not the case that...sometimes you’ll hear SEOs are criminals. SEOs are snake oil salesmen. If you find a good person, someone that you can trust, someone that will tell you exactly what they’re doing, the sort of person where you get good references, or you’ve seen their work and it’s very helpful, and they’ll explain exactly what they’re doing, they can absolutely help your website. So I just wanted to dispel that misconception.

Some people think Google thinks all SEO is spam and that’s definitely not the case. There are a lot of great SEOs out there. And I hope you find a good one to help with your website.

alan's picture

Analytics Under Attack: Google's Evil, Unethical Move To Remove Referrer Data

Google has announced that it is to cease providing referrer information in some instances.  In the official blog post, Google's Evelyn Kao writes:

When you search from https://www.google.com, websites you visit from our organic search listings will still know that you came from Google, but won't receive information about each individual query.

Initially this change affects people logged in to Google accounts and using Google.com which, Google claims, is a very small percentage of searchers (although still a large number of people).  But it's likely this will change as, according to Google's own blog entry:

As we continue to add more support for SSL across our products and services, we hope to see similar action from other websites. 

To give an example of what Google have actually done, I have searched today for "car insurance" both logged in to my Google account and searching on https://www.google.com, and not logged in to my Google account and searching on http://www.google.com/.  In each case I have clicked through to the same landing page.  Here are the referrers of that landing page in both cases:

  1. Referrer When Not Logged In, Clicking On A Natural Link: http://www.google.com/#hl=en&sugexp=kjrmc&cp=5&gs_id=l&xhr=t&q=car+insurance&qe=Y2FyIGk&qesig=Eeu3hebYxgo0in9YDLhtAA&pkc=AFgZ2tkKH3Xw88yrwvHzg5MkB-5vAi8dBrAzxf3se4-a7_BaiiecMyYZt0D_3TtcaX8K2jJgbEC3Yw7qMsDB65pNgSjYWjDjlA&pf=p&sclient=psy-ab&source=hp&pbx=1&oq=car+i&aq=0p&aqi=p-p1g3&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&fp=8e7fa2636e8b849&biw=1680&bih=947
  2. Referrer When Logged In, Clicking On A Natural Link: http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&sqi=2&ved=0CHoQFjAA&url=http%3A%2F%2Fwww.moneysupermarket.com%2Fcar-insurance%2F&ei=RoaeTuS6IY_D8QOP8fixCQ&usg=AFQjCNF-UvvfJsjMbuyeGwVVnyzkQmInRA&sig2=NaTkFkfi3cK7R1_V5TsTcg

The key difference is the q parameter.  When not logged in, the referrer contains q=car+insurance, so my query "car insurance" is available for Analytics to pick up and use to provide the site owner with information about what I was looking for.  When logged in, the referrer contains an empty q= parameter: my query "car insurance" has been stripped, so the site owner is completely clueless about why Google sent me to that page on their website.  Note, then, that when logged in, that referrer is a lie - the page I was visiting before was not the one in the referrer at all.  For example, I was actually on https://www.google.com/, not http://www.google.com/.
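
As a rough illustration of what an analytics package has to do with these referrers, here is a minimal Python sketch that pulls the search phrase out of a referrer string. The function name is made up for this example, and the test URLs are shortened versions of the two referrers above; it assumes the phrase lives in a parameter named q, either after "?" or after "#", as it does in those two examples.

```python
from urllib.parse import urlparse, parse_qs


def search_query_from_referrer(referrer):
    """Return the search phrase carried in a referrer URL, or None.

    Hypothetical helper: it only knows about a parameter called "q".
    """
    parts = urlparse(referrer)
    # The not-logged-in referrer carries its parameters after '#',
    # the logged-in one after '?', so look in both places.
    for component in (parts.query, parts.fragment):
        params = parse_qs(component, keep_blank_values=True)
        query = params.get("q", [""])[0]
        if query:
            return query
    return None


# Shortened versions of the two referrers shown above.
not_logged_in = "http://www.google.com/#hl=en&q=car+insurance&oq=car+i"
logged_in = "http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1"

print(search_query_from_referrer(not_logged_in))
print(search_query_from_referrer(logged_in))
```

With the empty q= that Google now sends for logged-in searchers, there is simply nothing for a function like this to return, whichever analytics package sits behind it.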

This small change has some very large consequences for site owners.  For example, no matter what analytics package you use, any reports that show keywords will become less useful (and, at the extreme, useless).  Check out this short interview from Google Analytics Evangelist Avinash Kaushik, following his keynote at 2010's Search Engine Strategies (which I attended with interest):

If Google removes keywords from referrer data then all of the great keyword ideas, keyword techniques and keyword attribution models that Avinash shares are no longer possible.  Evangelise that, Avinash!

Joking aside, a lot of the great work SilverDisc and others do in making sites better for users will be made more difficult and less effective by this move.

Google's move upsets the ethical balance that exists between searchers, search engines and site owners.  This is the very principle that ethical SEO is based upon - the three stakeholders to be considered are

  • Site owners who produce great content designed to meet their visitors' needs.
  • Search engines who are allowed to crawl and index that content as long as it provides benefit to the site owner.
  • Searchers who get to find the information they need in order to satisfy their enquiry.

From my original ethical SEO paper, the most ethical technique 

  • produces the most good and does the least harm
  • respects the rights and dignity of all stakeholders and treats all stakeholders fairly
  • promotes the common good
  • helps all participate more fully in the goods we share as a community and a society
  • enables the deepening or development of those virtues or character traits that we value as individuals, professions and members of a society

How does Google removing referrer information produce an unethical result?  Let's break it down:

  • produces the most good and does the least harm?
    • site owners can no longer optimise their sites to better match the searcher needs, so they will struggle to produce the best possible websites
  • respects the rights and dignity of all stakeholders and treats all stakeholders fairly?
    • site owners, rather than being treated with dignity, are treated as being "not trustworthy" and are denied a piece of information that the other two stakeholders (Google and the searcher) both have - the search query that resulted in that searcher visiting their site.
  • promotes the common good?
    • the common good is Google working with site owners to produce a better Web, which to be fair does happen a lot in other ways.  This move, however, does not promote the common good - Google gains and the site owner loses.
  • helps all participate more fully in the goods we share as a community and a society?
    • clearly this move prevents full participation of site owners in something they have had available to them since the earliest days of the Web and something upon which the Web was built - referrer data, sent in the Referer request header, was defined in the HTTP/1.0 specification and has been there ever since
  • enables the deepening or development of those virtues or character traits that we value as individuals, professions and members of a society?
    • again, this move alienates site owners and does not engender a spirit of cooperation and teamwork among site owners and Google, whose entire service is built on the content that site owners freely provide

So this move is unethical.  But is it evil?  (Note I deliberately use the word "evil", of course, since Google's corporate mantra is "Don't be evil").

What's really evil about Google's announcement is the patronising spin they've put on it.  Google's headline, even on its Analytics blog which is aimed at site owners rather than searchers, is not "We're removing site owners' ability to pull keywords from the referrer";  it is "Making search more secure: Accessing search query data in Google Analytics".  This fails to treat site owners with the respect they deserve.  The whole piece is positioned as making search more secure, for example when using insecure Wifi hotspots, yet at least a couple of things don't stack up if this is the objective:

  • If the user is visiting a secure Web site then Google still strips the referrer (thanks Danny Sullivan at Search Engine Land for this info), even though this is not necessary and, given they don't do this on their Encrypted Search, Google clearly knows it's not necessary
  • Searchers' referrers still contain keywords if searchers click on an ad, rather than a natural result.

That last point really shows where Google's mind is at.  To juxtapose a couple of points from their blog post:

we recognize the growing importance of protecting the personalized search results we deliver. As a result, we’re enhancing our default search experience for signed-in users ...  [but] ... if you choose to click on an ad appearing on our search results page, your browser will continue to send the relevant query over the network to enable advertisers to measure the effectiveness of their campaigns and to improve the ads and offers they present to you

So advertisers who pay Google money get treated one way, site owners who pay Google by providing the content the whole Google service is built on get treated a different way, and searchers' privacy is not really protected.  Nice.  To complete the example I gave earlier, the third link below is the referrer I received on the same website as result 2, but this time clicking on a paid ad rather than a natural result:

  1. Referrer When Not Logged In, Clicking On A Natural Link: http://www.google.com/#hl=en&sugexp=kjrmc&cp=5&gs_id=l&xhr=t&q=car+insurance&qe=Y2FyIGk&qesig=Eeu3hebYxgo0in9YDLhtAA&pkc=AFgZ2tkKH3Xw88yrwvHzg5MkB-5vAi8dBrAzxf3se4-a7_BaiiecMyYZt0D_3TtcaX8K2jJgbEC3Yw7qMsDB65pNgSjYWjDjlA&pf=p&sclient=psy-ab&source=hp&pbx=1&oq=car+i&aq=0p&aqi=p-p1g3&aql=f&gs_sm=&gs_upl=&bav=on.2,or.r_gc.r_pw.r_cp.,cf.osb&fp=8e7fa2636e8b849&biw=1680&bih=947
  2. Referrer When Logged In, Clicking On A Natural Link: http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&sqi=2&ved=0CHoQFjAA&url=http%3A%2F%2Fwww.moneysupermarket.com%2Fcar-insurance%2F&ei=RoaeTuS6IY_D8QOP8fixCQ&usg=AFQjCNF-UvvfJsjMbuyeGwVVnyzkQmInRA&sig2=NaTkFkfi3cK7R1_V5TsTcg
  3. Referrer When Logged In, Clicking On A Paid Link: http://www.google.com/url?http://www.google.com/aclk?sa=L&ai=ClrXYRoaeTt7BI8We8APK543yBYrGqWP-obrkI4TN7AQQBigIUOvGl8f4_____wFgu76ug9AKyAEBqQIwM2eSHF26PqoEGk_QTdQtj7np_5xRJavQhhGPHLhFRZtF9pdvugUTCOT1h6Om9KsCFY9hfAodjzg-lsoFAA&num=9&ei=RoaeTuS6IY_D8QOP8fixCQ&sig=AOD64_10PBtIEOuf9waR5LMaPUiMrDinMA&sqi=2&ved=0CEIQ0Qw&adurl=http://pixel.everesttech.net/1816/cq%3Fev_sid%3D3%26ev_ln%3Dcar%2520insurance%26ev_crx%3D9512830502%26ev_mt%3De%26ev_n%3Dg%26ev_ltx%3D%26ev_pl%3D%26url%3Dhttp%253A//www.moneysupermarket.com/car-insurance/insurance/%253FSource%253DGOO-003881E4%2526keywords%253Dcar%252Binsurance%252B%252BExact%2526p%253D0&rct=j&q=car+insurance 

What can site owners do about this?  Individually, not a lot.  Promoting and using other search engines, such as Microsoft Bing, would be a start.  This strikes me as a great opportunity for Microsoft to build and foster better relationships with site owners, for example by promising never to remove referrer data from its search results.

If they were able to operate as a collective, site owners could do Google serious damage.  In my 2007 post  "Bringing Down Google With Two Simple Lines of Code" I showed how this could be done.

The permission can be taken away with two simple lines of code placed in a site's robots.txt file:

User-agent: Googlebot
Disallow: /

Sure, every site owner in the world would need to publish this file to their sites. But if they did such a thing, the Google search engine could no longer crawl or index any of the Web's content. It would be defunct.

So, fellow site owners, Google's future is in our hands. If you want to go "on strike" and stop Google profiting from the fruits of your labours, simply publish the code. Be warned that your site will eventually be removed from Google's index if you do so. As a unilateral step, this may do you more harm than good. But if we all do it en masse, then beware Google!
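
For anyone who wants to see the effect of those two lines without touching a live site, the following Python sketch feeds them to the standard library's robots.txt parser. The helper name and the example.com URL are placeholders for illustration; real crawlers may interpret robots.txt slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The two lines from the post, exactly as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""


def can_crawl(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


page = "http://www.example.com/any/page"
print("Googlebot allowed:", can_crawl("Googlebot", page))
print("bingbot allowed:  ", can_crawl("bingbot", page))
```

Only the named crawler is shut out; any robot not listed in the file is still free to crawl.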

That post was written four years ago.  Now, with social media so prevalent that it can lead to regime change in countries, maybe it can lead to regime change among search engines too.  Microsoft, are you listening?

alan's picture

Rel=Prev, Rel=Next and View-All Pages: New Google Guidance

Google has this week launched new advice on how to mark up a series of related pages in order to allow it to better understand the relationship between those pages.  This could offer you the benefit of consolidating the pages into a single page for ranking calculations - which could be very helpful to say the least.  Examples of pages that may gain from using this markup, which involves using rel=prev and rel=next tags in a page's head section, are

  • an article or forum thread spread across multiple pages, perhaps to derive greater advertising revenues or keep the text short and easy to consume
  • a product category consisting of so many products that they can't fit on one page.  An example would be a top level category such as "Family Cars", before many filters had been applied to create smaller sub-categories that could easily fit on a page (e.g. "red 1.8L diesel automatic Volkswagen family cars near Kettering")

For more details on how to implement these tags see Google Webmaster Central: Pagination with rel=“next” and rel=“prev”.  The article is well-written and gives very clear implementation advice.  It includes a reference to a related Google post, Google Webmaster Central: View-all in search results, which describes how a rel=canonical tag can be used to specify a "View-all page", which is simply a single-page version of the content that may be presented elsewhere as a series of pages.  Google makes the claim in this article that "searchers much prefer the View-all, single-page version of content".

But do searchers much prefer View-all pages?  I'm sure they do if the View-all page is relatively short.  Using a couple of Google's own examples of where rel=prev and rel=next may be useful, however:

  • a forum thread spread across multiple pages.  I moderate forums and some threads can easily spread to 1000 or more responses.  It's unlikely a member would want all of these on a single page for viewing
  • a product category consisting of many products.  Again, a top level category could easily consist of over 1000 products.

It's interesting to note that a typical Google search yields millions of results and Google will display up to 1000 of them, by default across 100 pages at 10 results to a page.   Google isn't implementing a View-all page there!

I think the example that Google really has in mind when they state that searchers "prefer the View-all version of content" is the article that might spread over three pages or so: reducing that to one page for indexing.  This seems a fine idea.

But what to do about the long forum threads and product categories?  Should we create View-all pages for those?  I think not.  Such pages could be too big and unwieldy, and could take too long to load, which (especially given that load time is now a ranking factor) could work against the SEO rather than for it.

Another option would be to create a View-all page containing less information, e.g. a cut down version of each post in the forum or each product in the category.  This might be a good solution.  Bear in mind, however, that Google is looking to rank this View-all page in preference to a paginated page, so

  1.  don't cut out content that contains long-tail keywords for ranking and
  2. make sure if this page is going to rank well that it's a good landing page that can help the searcher achieve what you want them to achieve on your site

Another option is to deploy this strategy:

  • if your default posts or products "per-page" count is a small number (such as 10 products/page), consider changing it to a bigger number now (such as 50).  This will reduce the number of pages in your page sequences dramatically.  It will also increase the size of each page but technology has moved on - the 10 number became the standard when the Web was a lot slower than it is now and 50 seems a more appropriate number to me.  It's a good number of products to compare in one go, for example.
  • once you have shorter series of larger pages, use the rel=prev and rel=next tags as described by Google. 
  • If it's a product sequence, add a rel=canonical tag to each page in the series to make the URL of the first page in the series the canonical URL.  It's OK to do this for a product sequence, as Google's rel=canonical documentation stated that "the sort order of a table of products" was an acceptable use of a rel=canonical tag.  Since it's unlikely you would want to change the sort order of a set of article pages or forum posts, it wouldn't be as good to use a canonical tag on those series versus a product sequence.

For example, let's suppose you currently have a category of Family Cars that consists of 238 cars with 10 cars per page giving a series of 24 pages with the following URLs: 

  • /cars/family/1
  • /cars/family/2
  • ...
  • /cars/family/23
  • /cars/family/24

Here's what you could do:

  • Increase the default number of cars per page from 10 to 50.  Now only 5 pages are needed to cover the series: /cars/family/1 ... /cars/family/5
  • Add a rel=next tag to /cars/family/1, a rel=prev tag to /cars/family/5, and both a rel=prev and a rel=next tag to the intervening three pages, as described by Google
  • Add rel=canonical tags to all five pages, citing /cars/family/1 as the canonical URL.
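
The three steps above can be sketched in a few lines of Python. This is only an illustration of the strategy described in this post: the function name is invented, and the relative hrefs are a simplification (a real site would more likely emit absolute URLs).

```python
import math


def pagination_links(base_path, total_items, per_page):
    """Return {page_url: [link tags]} for a paginated series.

    Hypothetical helper: every page cites page 1 as canonical and
    gets rel=prev / rel=next tags pointing at its neighbours.
    """
    page_count = math.ceil(total_items / per_page)
    canonical = f"{base_path}/1"
    tags = {}
    for page in range(1, page_count + 1):
        links = [f'<link rel="canonical" href="{canonical}" />']
        if page > 1:  # the first page has no previous page
            links.append(f'<link rel="prev" href="{base_path}/{page - 1}" />')
        if page < page_count:  # the last page has no next page
            links.append(f'<link rel="next" href="{base_path}/{page + 1}" />')
        tags[f"{base_path}/{page}"] = links
    return tags


# 238 family cars at 50 per page, as in the example above.
for url, links in pagination_links("/cars/family", 238, 50).items():
    print(url)
    for link in links:
        print("   ", link)
```

Calling the same function with per_page=10 gives the original 24-page series, which makes it easy to compare the two layouts.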

alan's picture

Matt Cutts Interviewed on Site Architecture

Matt Cutts has given a very useful interview to Eric Enge, which rounds up a lot of information architecture and technical architecture issues. There's nothing really new here, but it's good to get all this info into one place and to see it confirmed by Matt. Topics covered:

  • crawl budget/indexation cap - the use of PageRank and host load to control crawl depth and frequency
  • the effect of duplicate content on PageRank
  • session IDs and affiliate IDs in links/URLs
  • faceted navigation - good to see Matt confirming that the advice I gave at SES London, and will be giving next week at SMX Munich, is all correct.
  • different ideas for use of the rel=canonical tag
  • 301 redirects and how they differ from 302 redirects
  • Google Webmaster Tools (WMT) ignore parameters
  • PageRank sculpting and its effectiveness in the modern world
  • JavaScript, IFRAME and PDF handling
  • Paid links and nofollow

Overall, the article strongly reinforces the fact that a successful site architecture is essential to SEO success.

alan's picture

Calling for link spam reports

I see that Matt Cutts of Google is calling for link spam reports.

I'm still very troubled by this paid links issue after all these years!

I agree it's Google's right to penalise or promote any page/site in its natural listings, which represent Google's subjective opinion of relevancy.

However, the idea that all paid links are bad/"evil" is wrong in so many ways:

  • Paid links pre-date Google.
  • There is no machine-readable standard for labelling a paid link. I'll repeat that - there is no machine-readable standard for labelling a paid link.
  • Labelling paid links fails the "Does this make sense in the absence of search engines?" ethical test. The answer may well be "Yes". (Where the answer is "No", I agree paid links are spam).
  • Labelling paid links fails the "Would I do this if search engines did not exist?" test. In fact, you have to know that Google exists, and that they mind about paid links, in order to label those paid links in the non-standard way that Google asks you to label them. This is perhaps my biggest beef with Google's approach to paid links - they actually violate one of Google's published Webmaster principles.
  • What does "paid" mean anyway? An actual exchange of cash? If you look at the top results for any hugely commercial field, say "car insurance", it's hard to believe that there is no commercial influence in the results! When all that a company does is commercial, then every link (positive or negative) to that company's site is commercial in nature.

I understand that a market in paid links arose because of Google's algorithm.

However, the irony is that in responding to that market by asking all publishers to label paid links in a non-standard way, Google violated its own principles. It started to ask publishers to adapt what they published to suit Google (because Google existed), and called them spammers if they didn't. That's the wrong way around. It's the spammers that do stuff purely because Google exists!

alan's picture

rel=canonical tag

So, Google, Yahoo, Microsoft and, more recently, Ask have announced the new "canonical" link type or, more colloquially, the rel=canonical tag. Much has already been written about this tag and its purpose: to help prevent duplicate content issues. Probably the best summary is this Matt Cutts video:

This tag is a welcome addition to the armoury in the fight against duplicate content issues. In addition to Matt's comments, I would make the following points:

Copyright Protection

Scrapers are forever copying content and publishing it on their own sites/splogs. Sometimes they are exceptionally lazy or stupid, even to the extent that they copy Adsense code onto their own sites. If they copy your rel=canonical tag onto their site, that would give a strong "hint" to the search engine that you were the original owner of the content:

<link rel="canonical" href="http://www.mysite.com/my/content/" />

Microsoft Platforms

Matt made reference to the Microsoft platform in his video, but I would emphasise the point. Microsoft's implementation of RFC 2396 is flawed. The path component of a URL is supposed to be case sensitive, but Microsoft makes it case insensitive. If there are n alphabetic characters in the path, then a Microsoft implementation gives 2^n possible variations of that path, where there should be only one. For example, if n=1 and the path is "/a/", Microsoft would allow "/a/" and "/A/"; if n=2 and the path is "/ab/", Microsoft would allow "/ab/", "/aB/", "/Ab/" and "/AB/"; and so on. 2^n variations give vast potential for duplicate content and it is a big issue with sites built on the Microsoft platform. The rel=canonical tag makes it very easy to specify the correct, case-sensitive path on a Microsoft platform:

<link rel="canonical" href="http://www.mysite.com/my/case/sensitive/path/" />
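
To get a feel for how quickly the 2^n figure grows, here is a small Python sketch that enumerates the case variations of a path. The function name is made up for this example and it assumes plain ASCII letters in the path.

```python
from itertools import product


def case_variants(path):
    """List every upper/lower-case spelling of path (hypothetical helper).

    Assumes ASCII letters; non-alphabetic characters are left unchanged.
    """
    choices = [
        (ch.lower(), ch.upper()) if ch.isalpha() else (ch,)
        for ch in path
    ]
    return ["".join(combo) for combo in product(*choices)]


print(case_variants("/a/"))
print(case_variants("/ab/"))

# For a longer path, count the letters rather than listing every variant.
path = "/my/case/sensitive/path/"
letters = sum(ch.isalpha() for ch in path)
print(letters, "letters give", 2 ** letters, "possible URLs for one page")
```

Even a modest path like the one in the canonical tag above has far more case variations than anyone would ever want indexed separately.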

Static Web Content

Static web content is content that is stored in the format in which it is delivered. Typically, static content is served under a static URL (a URL that does not contain a question mark). However, it is possible to link to static content and append query parameters, even though these query parameters will have no impact on the content that is served. One example of when this might happen is when a referrer parameter is passed to a JavaScript function within the static content:

<a href="http://www.mysite.com/?referrer=myAffiliate0001">Affiliate Link</a>

Thousands of links can be created to a single, static URL, each with a different referrer query parameter attached. For sites built on static content, trying to manage such links has been difficult in the past. Now, it's relatively easy. Each page of static content simply needs to contain a rel=canonical tag:

<link rel="canonical" href="http://www.mysite.com/my/static/url.html" />
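
If the tag is generated rather than typed by hand, the canonical URL for static content is simply the requested URL with its query string removed. A minimal Python sketch, with hypothetical function names and assuming that no query parameter ever changes the content served:

```python
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Drop the query string and fragment from url (hypothetical helper)."""
    scheme, netloc, path, _query, _fragment = urlsplit(url)
    return urlunsplit((scheme, netloc, path or "/", "", ""))


def canonical_tag(url):
    """Build the rel=canonical tag for a static page.

    Assumes the URL contains no characters that need HTML escaping.
    """
    return f'<link rel="canonical" href="{canonical_url(url)}" />'


print(canonical_tag("http://www.mysite.com/?referrer=myAffiliate0001"))
print(canonical_tag("http://www.mysite.com/my/static/url.html?referrer=myAffiliate0002"))
```

This shortcut is only safe for genuinely static content; on a dynamic site, some query parameters select the content and must not be stripped.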

Conclusions: rel=canonical

For the reasons stated above, I would recommend the use of a rel=canonical tag in all static content. In fact, I would recommend its use in all content, static or dynamic - with appropriate care of course. It's a powerful tag and using it wrongly could have dire consequences. In the next post I'll look at some of the limitations of the rel=canonical tag and consider some alternatives.

alan's picture

Robogenic - the one word expression for “search engine friendly”

Matt Cutts has stirred up a little hornets' nest with his "What should NOINDEX do?" post. Matt reckons the topic will be colossally boring to some people - but not to me. For some reason I find Robots standards fascinating. Yep, I know I'm weird.

The crux of Matt's issue is ...

The question is whether Google should completely drop a NOINDEX’ed page from our search results vs. show a reference to the page, or something in between?

The obvious response is to completely drop the NOINDEX'ed page. NOINDEX is made up of the two words NO and INDEX; so it means do not index, right?

Maybe not. It's important to be precise here. What exactly does NOINDEX mean?

Often when talking about indexing issues, it's useful to separate in your mind the indexing of a URL from the indexing of the content at that URL. This concept is particularly important in the contexts of URL canonicalization, duplicate content and ... robots standards. I'll restrict this discussion to the NOINDEX part of the robots standards, but an equally interesting discussion exists around robots.txt too.

Once we separate URL and content, the question "What exactly does NOINDEX mean?" can be answered in several ways:

1) Index the URL but not the content
2) Don't index the URL or the content
3) (Somehow, not sure how!) index the content but not the URL

One thing is for sure ... it does not mean index both the content and the URL. :D

In my opinion NOINDEX should definitely mean "Don't index the content". Definitely. No question.

The question of whether it should mean "Don't index the URL" is an interesting one. There are arguments both ways. In my experience, however, there are many, many different examples of when it should mean "Don't index the URL". In these instances, if the URL was indexed, it would result in something bad happening either for searchers, or the site owner, or both. Therefore, generally, I think it should mean "Don't index the URL".

However, there is one specific case where I think it would be acceptable to index the URL, and which would give benefit to both searchers and site owners (very often). That specific case is when the URL is the home page of the site.

Taking the three "problem" URLs cited by Matt in his post:

If high-profile sites like

- http://www.police.go.kr/main/index.do (the National Police Agency of Korea)
- http://www.nmc.go.kr/ (the National Medical Center of Korea)
- http://www.yonsei.ac.kr/ (Yonsei University)

aren’t showing up in Google because of the NOINDEX meta tag, that’s bad for users

These three URLs are all actually home pages. The second and third URLs are obviously so. The first URL is the result of a couple of 302 redirects:

This makes http://www.police.go.kr/main/index.do the home page of the site. The way Google works (correctly IMO) is that a redirect from "/" to a deeper page on a site would normally result in the content of that deeper URL being indexed under "/".

So, I think a reasonable middle ground, that satisfies the best interests of searchers, site owners and search engine, would be the following:

  1. Do not index the content.
  2. Do not link to the URL in the search results, unless the URL is a “home page” (/, or redirected to by /).
  3. If it is a home page with a NOINDEX tag, it’s OK to link to it in the SERPs, but do not index the content; do not provide a snippet; and do not provide a cached copy. Treat it like a “partially indexed page”.
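
The three rules above can be written down as a small decision function. This is only a sketch of my proposal in Python; the `Listing` type and its field names are invented for illustration and say nothing about how Google actually implements NOINDEX.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    show_url: bool          # may the URL appear in the search results?
    index_content: bool     # may the page content be indexed?
    show_snippet: bool      # may a snippet be displayed?
    show_cached_copy: bool  # may a cached copy be offered?


def listing_for(noindex: bool, is_home_page: bool) -> Listing:
    """Apply the proposed middle ground to one URL."""
    if not noindex:
        # No NOINDEX tag: the page is treated normally.
        return Listing(True, True, True, True)
    # Rule 1: never index the content of a NOINDEX page.
    # Rules 2 and 3: only a home page keeps a bare link in the results.
    return Listing(
        show_url=is_home_page,
        index_content=False,
        show_snippet=False,
        show_cached_copy=False,
    )


if __name__ == "__main__":
    print(listing_for(noindex=True, is_home_page=True))
    print(listing_for(noindex=True, is_home_page=False))
```

Here `is_home_page` is assumed to be true for "/" and for any URL that "/" redirects to.
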
alan's picture

The robots.txt file and the robots meta tag

I first wrote and published this article in 2000/2001, but the original no longer exists on the Web. It's still accurate: although search engines (notably Google) have taken steps to correct some of the problems described below, they can and do still arise.

There are two common protocols for the prevention of indexing of Web resources:

  1. The robots.txt protocol
  2. The robots meta tag protocol

This article describes:

  • The theory and practice of these two protocols
  • Anomalies and inadequacies in the protocols

The robots.txt protocol

A search engine spider is a Web robot and, as such, may choose to obey the robots.txt protocol. The robots.txt protocol was invented in 1994 and has remained the de facto standard for controlling robots’ access to a Web site. Most search engines claim to support it, but no robot, including a search engine spider, has to support it.

The protocol is described in the document "A Standard for Robot Exclusion". That is the page that most search engines that support the robots.txt protocol will refer you to if you require more details. However, if you read that page, you will see that it contains no reference to search engines at all. The introduction to the page says:

In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

So, the purpose of the robots.txt protocol is to provide a mechanism for WWW servers to indicate to robots which parts of their server should not be accessed, i.e. to prevent robots from reading parts of their server. How does this purpose relate to preventing a search engine from indexing a particular resource? Unfortunately, the general answer to this question is "It doesn’t".

The Disallow line in a robots.txt file means "disallow reading", but that does NOT mean "disallow indexing". In other words a disallowed resource may be listed in a search engine’s index, even if the search engine obeys the protocol. The most obvious demonstration of this is Google. Google can add files to its index without reading them, merely by considering links to those files. In theory, Google can build an index of an entire Web site without ever visiting that site or ever retrieving its robots.txt file. In so doing it is not breaking the robots.txt protocol because it is not reading any disallowed resources, it is simply reading other web sites' links to those resources.

The Disallow line in a robots.txt file means "Disallow reading"; it does not mean "Disallow indexing". A resource does not necessarily need to be read in order to be indexed.
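
Python's standard library happens to make the same distinction: its robots.txt parser can only answer the question "may this robot fetch this URL?". The sketch below uses a made-up robots.txt, a made-up site and a made-up robot name ("ExampleBot") to show that nothing in the protocol speaks about indexing.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an illustrative site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_read(url: str, user_agent: str = "ExampleBot") -> bool:
    """The only question the protocol covers: may this robot fetch the URL?"""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(may_read("http://www.example.com/private/report.html"))
    print(may_read("http://www.example.com/public.html"))
```

There is no corresponding "may index" query, because the protocol has no such concept; whether a disallowed URL ends up listed is entirely the search engine's decision.
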

Let’s return to the question of how the robots.txt file can be used to prevent a search engine from listing a particular resource in its index. In practice, most search engines have placed their own interpretation on the robots.txt file, which allows it to be used to prevent them from adding resources to their index. Most search engines interpret a resource being disallowed by the robots.txt file as meaning they should not add it to their index, and if it is already in their index (placed there by previous spidering activity) they remove it. This last point is important, and an example will illustrate it.

A particular resource may have been published to a particular Web site on 1st January 2000. That resource may have been indexed by a search engine on 1st February 2000. On 1st March 2000, the site owner may have modified the site’s robots.txt file to disallow the resource from being read by the search engine spider. On 1st April 2000, the search engine spider may re-visit the Web site and note the new entry in the robots.txt file. The search engine spider may now simply choose not to read the resource but to leave the copy of the resource in its index unchanged, and this would not be breaking the robots.txt protocol. But most search engine spiders will both:

  1. not read the resource and
  2. remove the resource from their index.

In this example, note that throughout March the resource was in the search engine’s index even though it was disallowed by the robots.txt file.

In practice, most search engines interpret a Disallow line as meaning "Do not index this resource and, if you already have an index of this resource, remove it". It may take some time from the point a resource is Disallowed to the point that resource is removed from a particular search engine’s index. If you want to ensure a particular resource is never indexed, ensure it is prevented from being indexed by a Disallow line in the robots.txt file before publishing the resource for the first time.

Now let’s consider how the robots.txt protocol can be used to prevent binary resources, such as images (e.g. GIF files), from being added to a search engine’s index. Let’s suppose a particular Web site put all its images in a directory called /images, and had the following robots.txt file:


User-agent: *
Disallow: /images/

You might think that this would prevent the site’s images being indexed by image search engines. But think again about what we have learned about the robots.txt file. It prevents Web robots, including search engine spiders, from reading a resource. But search engines do not need to read an image before adding it to their index. Many spiders just read the ALT text of the IMG tags that refer to the image, rather than reading the image itself. Since the spiders are not reading the image, they are not in breach of the robots.txt protocol if they index the image. This scenario is analogous to Google building an index of a resource without reading that resource: an image search engine can build an index of an image without reading an image.
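
As an illustration of how an image can be "indexed" without being read, the following Python sketch collects image URLs and ALT text from a page's HTML only. The class name, page URL and markup are hypothetical; a real image search engine will do far more than this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgAltCollector(HTMLParser):
    """Build a toy image index from IMG tags, never fetching the images."""

    def __init__(self, page_url: str) -> None:
        super().__init__()
        self.page_url = page_url
        self.image_index: dict[str, list[str]] = {}  # image URL -> ALT texts

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attributes = dict(attrs)
        src = attributes.get("src")
        if not src:
            return
        image_url = urljoin(self.page_url, src)
        self.image_index.setdefault(image_url, []).append(attributes.get("alt") or "")


if __name__ == "__main__":
    collector = ImgAltCollector("http://www.example.com/index.html")
    collector.feed('<p><IMG SRC="/images/example.gif" ALT="Example product photo"></p>')
    print(collector.image_index)
```

The spider has read only the HTML page, which robots.txt allows, yet it now holds an entry for a file under the disallowed /images/ directory.
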

Once again, in practice most image search engines interpret a Disallow line referring to an image as meaning "Do not index this image and, if you already have an index of this image, remove it". It may take some time from the point an image is Disallowed to the point that image is removed from a particular image search engine’s index.

Finally, a question that exposes the worst flaw of the robots.txt protocol: a webmaster wishes to make all pages of a Web site, EXCEPT the home page (i.e. "/"), accessible to robots; how can she do this using the robots.txt protocol? The answer - "She can't".
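
The reason she can't is that a Disallow value is matched as a path prefix, and every path on a site begins with "/". The short Python sketch below models only that prefix rule from the original protocol (it deliberately ignores any later extensions), and the function name is my own.

```python
def is_disallowed(path: str, disallow_prefixes: list[str]) -> bool:
    """Original protocol rule: a path is disallowed if it starts with any Disallow value."""
    # An empty Disallow value disallows nothing, so it is skipped.
    return any(
        prefix != "" and path.startswith(prefix)
        for prefix in disallow_prefixes
    )


if __name__ == "__main__":
    rules = ["/"]  # the only way to name the home page
    for path in ["/", "/about.html", "/products/widget.html"]:
        print(path, is_disallowed(path, rules))
```

Every path is reported as disallowed: blocking "/" blocks the whole site, so the home page cannot be singled out.
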

The robots meta tag protocol

The robots meta tag protocol was invented after the robots.txt protocol. It was originally designed to allow HTML developers that did not have permission to write the robots.txt file to the root of a server to have control over the indexing of Web pages. Unlike the robots.txt protocol, the robots meta tag protocol:

  1. specifically states whether a resource may or may not be indexed
  2. can help, but cannot prevent, a particular resource from being read
  3. does not allow large-scale (wildcard) prevention of indexing
  4. cannot be used to prevent anything except HTML files from being indexed, since the meta tag can only be placed in HTML files (if following the strict definition of the protocol)

Note in particular point 2: the robots meta tag protocol cannot prevent a particular resource from being read because a resource must be read in order to obtain the tag it contains. You may think that if every document that linked to a particular resource contained a robots meta tag NOFOLLOW attribute, that resource could never be read – but what if a new document is added anywhere on the Web, and that document links to the resource? Or what if somebody submits the resource directly to the Add URL page of a search engine? In both these cases, a search engine will read the resource before discovering the robots meta tag. So the problems the robots.txt protocol was designed to fix - e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting) – are not addressed by the robots meta tag protocol. In other words, there is no "NOREAD" attribute!

So, we’ve said what the robots meta tag is not, but what is it? The robots meta tag is included in a HTML file and defines separately whether the file may be indexed (using the INDEX attribute) or spidered (using the FOLLOW attribute). However, the robots meta tag enjoys less support than the robots.txt file. It is unclear how much of the standard search engines support. Would every search engine, for example, correctly interpret a "noindex, follow" set of attributes?
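
For what it's worth, reading the tag is not hard. Here is a minimal Python sketch of how a spider might interpret a robots meta tag, treating INDEX and FOLLOW as two independent switches. It assumes that anything not explicitly forbidden is permitted, and it ignores any other values; the names are illustrative only.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the comma-separated values of <meta name="robots" content="...">."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        self.directives.update(
            part.strip().lower() for part in content.split(",") if part.strip()
        )


def robots_directives(html_text: str) -> tuple[bool, bool]:
    """Return (may_index, may_follow) for a page, defaulting to permitted."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


if __name__ == "__main__":
    print(robots_directives('<META NAME="robots" CONTENT="noindex, follow">'))
```

A spider that handled "noindex, follow" along these lines would leave the page out of its index but still spider the links on it. Note that the page has to be fetched before any of this can happen.
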

Since the robots meta tag can only be used within a HTML file, and the NOINDEX attribute only refers to the file that contains it, it cannot be used to prevent binary resources (such as images) from being indexed. Some search engines have invented extensions to the protocol to overcome this problem, but the extensions are not part of the protocol. For example, AltaVista has invented its own robots meta tag attribute (NOIMAGEINDEX) to prevent images from being indexed.

The behaviour of these extension tags is not well defined. An example will illustrate the main problem:

  1. a particular Web site, let’s call it www.example-one.com, consists of 10 pages
  2. each of the 10 pages includes an image at www.example-one.com/images/example.gif
  3. nine of the ten pages contain a robots meta tag like this: <META NAME="robots" CONTENT="index,follow,noimageindex">
  4. however, www.example-one.com’s home page contains the following robots meta tag: <META NAME="robots" CONTENT="index,follow">

The "noimageindex" attribute is only understood by AltaVista’s image spider. So, when AltaVista’s image spider reads the site, will it add example.gif to AltaVista’s image index? The answer to this question is undefined – nine out of ten pages say it’s not OK to index the image, but one out of ten pages says (implicitly) that it is OK. So the image spider might, or might not, index the image. It all depends on the order the spider reads the pages, the number of pages read by the spider (it might only read the home page), and a multitude of other factors.

To make matters worse, now suppose that there is another Web site called www.example-two.com, every page of which also includes www.example-one.com/images/example.gif. None of the pages on www.example-two.com include a robots meta tag. Would an image spider add example.gif to its index now? Again, the answer to this question is undefined.
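
To show why "undefined" matters, the Python sketch below applies two invented policies to the ten pages of www.example-one.com. Neither policy is documented AltaVista behaviour; they are simply two plausible ways a spider could resolve the conflict, and they disagree.

```python
# Robots meta tag values for the ten pages of the example site.
INDEXABLE = frozenset({"index", "follow"})                # the home page
NO_IMAGE = frozenset({"index", "follow", "noimageindex"})  # the other nine


def first_page_decides(pages_in_read_order) -> bool:
    """Hypothetical policy A: the first page the spider reads settles it."""
    return "noimageindex" not in pages_in_read_order[0]


def any_page_can_veto(pages_in_read_order) -> bool:
    """Hypothetical policy B: one noimageindex anywhere blocks the image."""
    return all("noimageindex" not in page for page in pages_in_read_order)


if __name__ == "__main__":
    home_first = [INDEXABLE] + [NO_IMAGE] * 9
    home_last = [NO_IMAGE] * 9 + [INDEXABLE]
    for order in (home_first, home_last):
        print(first_page_decides(order), any_page_can_veto(order))
```

Under policy A the result changes with the reading order; under policy B it does not. Since the extension specifies neither, a site owner cannot know which outcome to expect.
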

Now a question to test the theory so far ... A site owner attempts to exclude a page from being indexed by search engines by both adding a Disallow line to the site's robots.txt file and placing a robots meta tag with the noindex attribute in the page itself, before publishing the resource for the first time. Is there any way that a search engine that obeys the robots.txt protocol and the robots meta tag meticulously can have a reference to the resource in its index?

Let's work this through.

  1. Suppose the resource is called noindex.htm and it contains the following robots meta tag: <META NAME="robots" CONTENT="noindex,nofollow">
  2. The URL http://www.example-three.com/robots.txt is then created as follows:
    User-agent: *
    Disallow: /noindex.htm
  3. noindex.htm is then published to www.example-three.com/noindex.htm for the first time.

Surely noindex.htm can’t possibly be indexed by a search engine that obeys the robots.txt protocol and the robots meta tag protocol? Can it? It can. In fact, only a search engine that completely obeys both standards can index it. Here’s how.

Our very obedient search engine works a little like Google. So, while its spider is spidering the Web, it finds references to noindex.htm. Each time it finds a reference, the spider creates a better picture of noindex.htm in its index, without ever reading noindex.htm. Sooner or later, the spider visits www.example-three.com. The first thing it does is read robots.txt to find pages it is not allowed to read. The only page it is not allowed to read is noindex.htm, so it doesn’t read that page. It doesn’t remove the page from its index, because, strictly speaking, that is not what the robots.txt protocol means. Because the spider cannot read noindex.htm, it cannot find the robots tag on that page preventing it from indexing that page. Therefore, the page remains in the search engine’s index.
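
The walk-through above can be simulated in a few lines of Python. This is a toy model under stated assumptions: the two linking pages and their anchor text are invented, "ObedientBot" is a made-up robot name, and the "index" is just a dictionary of anchor text keyed by URL.

```python
from urllib.robotparser import RobotFileParser

TARGET = "http://www.example-three.com/noindex.htm"

ROBOTS_TXT = """\
User-agent: *
Disallow: /noindex.htm
"""

# Hypothetical third-party pages that link to the target: (page URL, anchor text).
INBOUND_LINKS = [
    ("http://www.example-one.com/links.htm", "annual report"),
    ("http://www.example-two.com/news.htm", "report download"),
]


def crawl() -> tuple[dict[str, list[str]], list[str]]:
    """Return (index, pages_read) for a spider that obeys both protocols."""
    index: dict[str, list[str]] = {}
    pages_read: list[str] = []

    # Step 1: reading other sites' pages builds a picture of the target
    # from link text alone.
    for source_url, anchor_text in INBOUND_LINKS:
        pages_read.append(source_url)
        index.setdefault(TARGET, []).append(anchor_text)

    # Step 2: on reaching the target site, consult robots.txt first.
    robots = RobotFileParser()
    robots.parse(ROBOTS_TXT.splitlines())
    if robots.can_fetch("ObedientBot", TARGET):
        # Only by reading the page could the spider see its noindex tag
        # and drop the entry.
        pages_read.append(TARGET)
        index.pop(TARGET, None)
    # Otherwise the page is never read, so the tag is never seen.

    return index, pages_read


if __name__ == "__main__":
    index, pages_read = crawl()
    print(TARGET in index, TARGET in pages_read)
```

The target stays in the index although the spider never read it: the Disallow line is exactly what stops the spider from ever seeing the noindex tag.
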

Future posts will cover the new features in robots.txt, the robots meta tag and Webmaster Tools that address some of the above problems.

alan's picture

Skip Google Hell, Time For Google Heaven

What a dreadful, poorly researched article in Forbes magazine. There's so much wrong with it, I can't find a single good thing to say - so I'll say nothing more about it. :| In brief, the way to Google Heaven is:

  1. Create good quality, unique content
  2. Ensure the content can be crawled and indexed by Google
  3. Take steps to ensure that the content is seen once only, at the best URL for it
  4. Build good quality links to the content from your own sites and those of relevant third parties

It's a shame the sites featured in the Forbes article failed to follow this simple formula.
