The Classification of Search Engine Spam
Published by Alan Perkins on 30th September 2001
This document has been written to allow search engine marketers and other industry professionals to objectively evaluate actions to see whether those actions equate to spamming a search engine. It is hoped that quality search engines, ethical marketers and search industry professionals will agree that this document lays out standards which the industry should strive for.
With standards come definitions. Often in the search industry, the same terms are used by different people to mean different things. These different meanings can cause confusion and give spammers refuge. An objective of this paper, then, is to place absolute definitions on some important terminology.
The following terms are defined:
- Search engine
- Search Engine Spam
- Not Search Engine Spam
- Content spam
- Meta spam
- Link farm
- Link content spam
- Link meta spam
- Agent-Based Spam
- IP Cloaking
The first term we will define is "Search Engine". Generally, a search engine is any program that searches a database and produces a list of results. To work at such an abstract level within this document would limit us to a very theoretical generalist discussion. Therefore, for the purposes of this document, we will apply a more narrow definition of "search engine" as follows:
a system that uses automated techniques, such as robots (a.k.a. spiders) and indexers, to create indexes of the Web, allows those indexes to be searched according to certain search criteria, and delivers a set of results ordered by relevancy to those search criteria. Examples of such search engines are AltaVista, Fast, Google and Inktomi. (Fast and Inktomi deliver their results solely through partners such as Lycos and MSN.)
The next term that needs defining is relevancy. Because this document is attempting to classify spam, and spam and relevancy are intertwined, it is essential that we define relevancy in an objective way. That is not to say that relevancy is objective. Far from it. Relevancy is extremely subjective. Every search engine uses its own algorithm to calculate relevancy. Therefore, we define relevancy as follows:
The search engine's measure of how well a particular resource matches the input search criteria
Each search engine measures relevancy using its own algorithm. Therefore, given the same set of resources and the same input search criteria, each search engine will produce a different set of results. This is because the results are ordered by relevancy, and each search engine calculates relevancy differently.
It should be clear that the algorithms that calculate relevancy are the lifeblood of search engines. Those search engines that deliver the most relevant results to the markets they have chosen to focus upon should be the most successful search engines in those markets.
Search Engine Spam
So, what is search engine spam? We define it as follows:
Any attempt to deceive a search engine's relevancy algorithm
And what isn't spam?
Not Search Engine Spam
Anything that would still be done if search engines did not exist, or anything that a search engine has given written permission to do.
The remainder of this document assumes that the search engine has not given written permission. It elaborates upon the meaning of the previous two definitions and places them in a context that should be acceptable to all industry professionals.
In attempting to classify spam, we considered many different instances of spam and architectures for delivering spam. We gradually came to realise that there are only two types of search engine spam:
Content spam
Data within a part of a Web resource designed for humans (e.g. the <body> of a HTML document) where that data is designed only for search engines to see
Meta spam
Data within a Web resource that describes that resource or another Web resource inaccurately or (when the data should be readable by humans) incoherently
The fact that there are only two types of search engine spam derives from the fact that search engine algorithms use only two basic kinds of factor to calculate relevancy: on-the-page factors and off-the-page factors. An example of an on-the-page factor is keyword density, i.e. how early and often the keywords (words searched for) appear in the body copy of a page. An example of an off-the-page factor is link popularity, i.e. how many other pages on the Web link to a particular page. In fact, depending on the link popularity algorithm, link popularity can be spammed with either content spam or meta spam. This will be described in more detail later.
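As a rough illustration of an on-the-page factor, the following Python sketch measures how early and how often a keyword appears in body copy. It is a minimal sketch only: the function name `keyword_stats` and the tokenising rule are assumptions made for this example, and real search engines do not publish their relevancy algorithms.

```python
import re

def keyword_stats(body_text, keyword):
    """Return (density, first_position) for a keyword in body copy.

    density: fraction of all words that equal the keyword (how often).
    first_position: 0-based word index of the first occurrence, or None
    if the keyword never appears (how early).
    """
    # Assumed tokenising rule: lower-case runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", body_text.lower())
    keyword = keyword.lower()
    hits = [i for i, word in enumerate(words) if word == keyword]
    density = len(hits) / len(words) if words else 0.0
    first_position = hits[0] if hits else None
    return density, first_position

density, first_position = keyword_stats(
    "Spam is bad. Search engine spam hurts relevancy.", "spam"
)
print(density, first_position)
```

A measure like this is exactly what content spam tries to inflate, which is why the intent behind the text matters more than the number itself.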
First of all, we should consider why content spam is possible. It is possible because the same URL can deliver different content (or the same content displayed in different ways) to different visitors to that URL. Even the simplest versions of HTTP and HTML support this, and therefore offer the opportunity to deliver spam. For example, IMG support and ALT text within HTML means that image-enabled visitors to a URL will see different content to those visitors that, for various reasons, cannot view images. Whether the ability to deliver spam results in the delivery of spam is largely a matter of knowledge and ethics.
This document is not designed to provide exhaustive examples of spam. To do so would be counterproductive, as it could become a reference source for those that wish to spam. Suffice it to say that the following techniques are among those that may be subverted to deliver content spam: tiny text, invisible text, noframes text, noscript text, alt text, longdesc text.
It is extremely important to note that none of the above techniques were designed to deliver spam. Therefore, the use of the technique does not imply that spamming is taking place. So, how can we determine whether the use of the technique constitutes spam? It is relatively simple - apply this test:
Suppose search engines did not exist. Would the technique still be used in the same way?
If the answer to the above question is no, then clearly the content is designed only for search engines to see. Therefore it is spam. If you are a search engine marketer or search engine optimization (SEO) specialist, don't panic at this statement. Consider what it really means.
Take, as an example, ALT text. Why was the tag invented? Not to deliver spam, but to provide a readable version of the page to browsers without graphical capabilities. These include phones, PDAs and screen readers for the visually impaired. This last example is especially important as disability legislation in many countries (e.g. USA, UK, Australia) requires that content is accessible to all. Stuffing the ALT text of clear pixels with lists of keywords is a common SEO technique. Consider this sample piece of HTML, where clear.gif is a 1x1 transparent pixel and an attempt is being made to rank higher for the word "spam":
This turns a page into meaningless garbage when it is read out loud or displayed on a non-graphical browser.
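The effect on a non-graphical browser can be sketched in Python using the standard `html.parser` module. The two markup strings below are hypothetical illustrations written for this sketch, and the `TextOnlyView` class is an assumed, much simplified stand-in for what a text browser or screen reader presents.

```python
from html.parser import HTMLParser

class TextOnlyView(HTMLParser):
    """Collect visible text, replacing each image with its alt text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # A non-graphical browser shows the alt text in place of the image.
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def text_only(html):
    view = TextOnlyView()
    view.feed(html)
    view.close()
    return " ".join(view.parts)

# Hypothetical markup: a clear pixel whose alt text is stuffed with keywords.
stuffed = ('<p>Welcome <img src="clear.gif" width="1" height="1" '
           'alt="spam spam spam spam"> to our site</p>')
# Hypothetical markup: an image whose alt text describes the image.
honest = '<p>Welcome <img src="logo.gif" alt="Acme Ltd logo"> to our site</p>'

print(text_only(stuffed))
print(text_only(honest))
```

The first result reads as nonsense to a listener, while the second remains a coherent sentence.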
Tags that have been designed to improve access for the disabled, or less capable platforms, are often subverted to deliver spam. Yet it is possible - and professionally essential - to use these tags in the manner for which they were invented. Consider the impact of doing so. The site is usable by more visitors, from more platforms. If marketing is your goal, then you are reaching a wider market. This is an ethically sound policy. It improves access for all and improves your overall marketing capability. At the same time it does not deliver spam which spoils a search engine's ability to calculate relevancy or makes a page meaningless to visitors with lower capabilities.
Now consider meta spam. Meta data is data that describes a resource. Meta spam is data that mis-describes a resource or describes a resource incoherently in order to manipulate a search engine's relevancy calculations.
Think again about the humble ALT tag. Not only does it provide content for a HTML resource, it also provides a description of an image resource. In this description capacity, to mis-describe an image or to describe it incoherently (using, say, a stream of keywords instead of a descriptive sentence or phrase) is meta-spam. Perhaps the best examples of meta spam at present can be found in the <head> section of HTML pages. Remember, though, it’s only spam if it is done purely for search engine relevancy gain.
Meta spam is more abstract than content spam. Rather than discuss it in abstract terms, we will take some examples from HTML and XML/RDF in order to illustrate meta spam and where it differs from and crosses with content spam.
Generally, anything within the <head> section of an HTML document, or anything within the <body> section that describes another resource, can be subverted to deliver meta spam.
Examples of meta spam
The TITLE Tag
Location: <head> section of a HTML document
Example: <title>White Paper : The Classification of Search Engine Spam</title>
Search engines tend to place a lot of emphasis on the title tag in determining relevancy. Basically, if keywords occur in a page's title tag, the page is more likely to be seen as relevant to those keywords. The title of this document is "White Paper : The Classification of Search Engine Spam", which accurately describes (using terminology appropriate to the target audience) the contents of this document. If we had made the title of this document "Spammer's delight - click here to find out how to spam the search engines, SPAM, Spam, spam, ugly spam, obvious spam" then we would have a couple of problems. One, the title would mis-describe this page. Two, the title would be incoherent, yet it is designed for search engine users to see.
Caveat: the <title> tag has several functions beyond search engine listings. If an alternative use can justify using a particular title, then the title is not spam.
The META DESCRIPTION tag
Location: <head> section of a HTML document
Example: <meta name="Description" id="Description" content="The definitive guide to search engine spam." />
Everything said above about the title tag equally applies to the meta description tag. However, the caveat regarding alternative uses is not as strong. The title tag has many uses – the meta description tag is almost exclusively used by search engines.
The META KEYWORDS tag
Location: <head> section of a HTML document
Example: <meta name="Keywords" id="Keywords" content="spam classification search engine optimization optimisation ethical marketing marketer professional" />
Unlike the title and meta description tags, the meta keywords tag is not generally displayed to searchers. Therefore, it does not need to meet the "coherency" condition. In addition, the keywords tag was designed by search engines to assist search engines in determining relevancy. Therefore, it is our opinion that nothing in the keywords tag should be considered to be spam. Instead, the search engine should use the keywords tag either not at all or to guide keyword selection, but not to influence the relevancy calculations of those keywords.
Dublin Core Tags
Location: <head> section of a HTML document
Example: <meta name="DC.title" id="DC.title" content=" White Paper : The Classification of Search Engine Spam" />
The Dublin Core tags can be considered similarly to the meta tags already described.
XML/RDF Tags and Metadata
Location: XML/RDF files or streams or embedded in other Web resources
Example: <dc:title>White Paper : The Classification of Search Engine Spam</dc:title>
XML/RDF tags and metadata can be considered similarly to the meta tags already described. It is important to note that the use of XML/RDF will, in itself, not bring an end to search engine spam. It will simply provide an alternative spam channel. In fact, it could provide a greater opportunity for spam unless careful checks and balances, or contracts and conditions, are applied.
That concludes the discussion of types of search engine spam. The remainder of this document will consider issues such as links, redirection, agent delivery, IP delivery, cloaking, the role of the search engine and the role of the marketer.
With link popularity taking on a greater importance in the calculation of relevancy, the spammer’s attention has turned to how to manipulate this factor. Link popularity has two components: the authority component (number of links from other resources to this resource) and the hub component (number of links from this resource to other resources).
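The two components can be illustrated with a simple count over a list of links. This is a minimal sketch under the assumption that each component is a plain count of distinct linking pages. Real link popularity algorithms are more elaborate, and the function name `link_counts` is invented for this example.

```python
from collections import defaultdict

def link_counts(links):
    """links: iterable of (source, target) pairs.

    Returns (authority, hub), where authority[page] is the number of
    distinct pages linking to it and hub[page] is the number of distinct
    pages it links to.
    """
    inbound = defaultdict(set)
    outbound = defaultdict(set)
    for source, target in links:
        if source != target:  # a page linking to itself is ignored
            inbound[target].add(source)
            outbound[source].add(target)
    authority = {page: len(sources) for page, sources in inbound.items()}
    hub = {page: len(targets) for page, targets in outbound.items()}
    return authority, hub

# Hypothetical link data for four pages.
links = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d")]
authority, hub = link_counts(links)
print(authority, hub)
```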
Techniques such as link farms have been developed to subvert both the authority and hub components. What is a link farm?
A network of pages on one or more Web sites, heavily cross-linked with each other, with the sole intention of improving the search engine ranking of those pages and sites.
How can link farm pages be distinguished from other pages? The means of the determination is beyond the scope of this document. Suffice it to say that it can be done (hint: draw Web graphs of some small link farms and look at the patterns that emerge).
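One pattern that such a graph might reveal is unusually dense cross-linking within a group of pages. The sketch below is only a guess at one possible signal, not a method described in this document. The name `cross_link_density` is hypothetical, and density alone cannot establish the intention behind the links.

```python
def cross_link_density(pages, links):
    """Fraction of the possible directed links among `pages` that exist.

    pages: collection of page identifiers.
    links: iterable of (source, target) pairs.
    """
    pages = set(pages)
    n = len(pages)
    if n < 2:
        return 0.0
    internal = {
        (source, target)
        for source, target in links
        if source in pages and target in pages and source != target
    }
    return len(internal) / (n * (n - 1))

# Hypothetical data: every farm page links to every other farm page.
farm = ["f1", "f2", "f3"]
farm_links = [(s, t) for s in farm for t in farm if s != t]

# Hypothetical data: three ordinary pages linked in a simple chain.
ordinary = ["a", "b", "c"]
ordinary_links = [("a", "b"), ("b", "c")]

print(cross_link_density(farm, farm_links))
print(cross_link_density(ordinary, ordinary_links))
```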
Links can be used to deliver both types of search engine spam, i.e. both content spam and meta spam.
Link content spam
When a link exists on page A to page B only to affect the hub component of page A or the authority component of page B, that is an example of content spam on page A. Page B is not spamming at all. Page A should receive a spam penalty. Without further evidence, page B should not receive a penalty.
Link meta spam
When the anchor text or title text of a link either mis-describes the link target, or describes the link target using incoherent language, that is an example of link meta spam.
Here are some practical examples of link spam:
- an SEO house stuffs the noframes content of a client's framed home page with spam, including a link to the SEO's web site to attempt to influence the authority factor of their site. Result: The client receives a spam penalty for all the spam, including the link. The SEO's web site receives no penalty for the link, in the absence of any further evidence.
- a guerrilla web marketer places a competitor's Web site in a link farm, hoping it receives a spam penalty. Result: the competitor Web site receives no penalty since, because it is not an active participant in the link farm (it does not link to other sites in the farm), there is no evidence of spam abuse. It also receives no credit for the links it receives, because they come from a link farm.
Here is a general rule of thumb to determine whether link spam has taken place - if the link is not designed to be followed by humans, or the page it is on is not designed to be read by humans, then it is spam.
However, since redirection wasn't invented to facilitate spamming, the existence of a redirect should not of itself indicate spam. A search engine robot seeing a HTTP 300 series response or short META refresh should follow the redirect to the target, without indexing the source.
Here are some practical examples of redirection:
- A Webmaster restructures her Web site and inserts Redirect lines into the server configuration file to ensure visitors that follow links to the old pages automatically end up at the correct new pages. The search engine robot therefore receives a HTTP 300 series response when it requests a particular page. Result: the search engine robot should follow the redirect and treat the target page as any other page on the Web. This is not spam.
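The robot behaviour described above can be sketched as follows. This is an illustration under stated assumptions: `fetch` is a hypothetical callable standing in for a real HTTP client, the `site` dictionary simulates a server, and the URLs are placeholders.

```python
REDIRECT_CODES = {301, 302, 303, 307, 308}

def resolve(url, fetch, max_hops=5):
    """Follow HTTP 300 series redirects and return (final_url, body).

    fetch(url) is assumed to return a (status, location, body) tuple.
    """
    seen = set()
    for _ in range(max_hops + 1):
        if url in seen:
            raise ValueError("redirect loop at " + url)
        seen.add(url)
        status, location, body = fetch(url)
        if status in REDIRECT_CODES and location:
            url = location  # move on to the target; the source is not indexed
            continue
        return url, body
    raise ValueError("too many redirects")

# Simulated server: the old page redirects permanently to the new page.
site = {
    "http://example.com/old.html": (301, "http://example.com/new.html", ""),
    "http://example.com/new.html": (200, None, "New page content"),
}

index = {}
final_url, body = resolve("http://example.com/old.html", site.__getitem__)
index[final_url] = body  # only the target page enters the index
print(index)
```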
Agent-Based Delivery and Agent-Based Spam
Agent-Based Delivery was invented at almost the same time as the Web itself. It uses fields of the HTTP request header, in particular the User-Agent field, in order to deliver the content according to features such as the platform and language of the visitor. In other words, different content is delivered from the same URL according to the HTTP request.
Agent-Based Delivery can be subverted to deliver spam to search engines. However, Agent-Based Delivery also has a purpose that does not depend on the existence of search engines. Therefore, the use of Agent-Based Delivery does not necessarily indicate an intention to spam a search engine.
We will now briefly discuss the use of Agent-Based Delivery and some of the implications.
Let's suppose that a webmaster uses Agent-Based Delivery to deliver one version of a Web site to Mozilla browsers, and another version of the same web site to non-Mozilla browsers. A search engine, as a non-Mozilla browser, sees a different version of a Web site than a human visitor that uses a Mozilla browser. Is this spam? The answer is either Yes or No:
Yes: if the non-Mozilla version is designed predominantly for search engine robots to read.
No: if the non-Mozilla version is designed predominantly for humans to read.
In other words, if the non-Mozilla version of the site is designed for humans using a text to speech converter, a Lynx or Mosaic browser, a PDA or WAP phone, an interactive TV set or any other non-Mozilla browser, then the use of Agent-Based Delivery is not spam. It passes this basic test:
Suppose search engines did not exist. Would the technique still be used in the same way?
Note that just because Agent-Based Delivery does not imply spam, this does not prevent Content Spam or Meta Spam being placed on pages served by Agent-Based Delivery. This is analogous to spam being placed in noframes or noscript tags.
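The Mozilla example above can be sketched as a simple selection on the User-Agent field. The function name, the version labels and the agent string "ExampleBot/1.0" are assumptions made for illustration. The point is that the robot is not singled out: it receives whatever any other non-Mozilla visitor receives.

```python
def choose_version(headers):
    """Pick a page version from the HTTP request headers.

    Only the User-Agent field is consulted in this sketch.
    """
    user_agent = headers.get("User-Agent", "")
    if user_agent.startswith("Mozilla/"):
        return "graphical"
    return "text"

# A human using a text browser and a (hypothetical) robot get the same version.
lynx_version = choose_version({"User-Agent": "Lynx/2.8.4"})
robot_version = choose_version({"User-Agent": "ExampleBot/1.0"})
mozilla_version = choose_version({"User-Agent": "Mozilla/5.0 (X11)"})
print(lynx_version, robot_version, mozilla_version)
```

Whether the "text" version itself contains content spam or meta spam is a separate question that this selection logic cannot answer.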
We will now define and briefly discuss Agent-Based Spam.
The use of Agent-Based Delivery to identify search engine robots by user agent and deliver unique content to those robots.
This is always spam because the unique content is designed only for the search engine robot to see, not for humans. This cannot be justified if search engines did not exist, so it must have been done only to influence search engine relevancy. Therefore it constitutes spam. Every instance of unique content on the page will be content spam, meta spam or both.
Note: it seems reasonable to permit individual HTML <head> section tags such as title and meta description to be delivered to individual search engines. Reason: since meta data is not seen on-the-page by humans visiting the page, it cannot be content spam (and on its own site the search engine may publish the meta data as it wishes). As long as the meta data describes the page accurately and coherently, it is not meta-spam either. Therefore, it is not spam. To classify this activity as spam would result in webmasters having to conform to the lowest common denominator in delivering meta data, which does not encourage search engines to improve.
Rule of thumb: it is OK to target search engine robots by their agent name and deliver unique content in the <head> section of a HTML document, but not in the <body> section.
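A reviewer could apply this rule of thumb by comparing the body text served to a robot with the body text served to a human. The sketch below is a simplified illustration: it compares only text nodes inside the body and ignores attributes such as alt text, and the class name, function names and sample pages are all hypothetical.

```python
from html.parser import HTMLParser

class BodyText(HTMLParser):
    """Collect the words that appear as text inside the <body> element."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        if self.in_body:
            self.words.extend(data.split())

def body_text(html):
    parser = BodyText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.words)

def same_body(human_html, robot_html):
    """True if both versions present the same body text."""
    return body_text(human_html) == body_text(robot_html)

BODY = "<body><h1>Widgets</h1><p>We sell widgets.</p></body>"
human = "<html><head><title>Acme widgets</title></head>" + BODY + "</html>"
# Different title for the robot, identical body: within the rule of thumb.
robot_ok = ("<html><head><title>Acme widgets: prices and catalogue</title>"
            "</head>" + BODY + "</html>")
# Extra body text for the robot only: outside the rule of thumb.
robot_bad = ("<html><head><title>Acme widgets</title></head><body>"
             "<h1>Widgets</h1><p>We sell widgets.</p>"
             "<p>widgets widgets cheap widgets</p></body></html>")

print(same_body(human, robot_ok), same_body(human, robot_bad))
```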
IP Delivery and IP Cloaking
IP Delivery is the delivery of content according to the IP name or IP address of the requester. These attributes of the request can indicate the ISP and location of the visitor. The two most common reasons to use IP Delivery are to deliver secure content (e.g. within an intranet or across a Virtual Private Network) and to deliver content according to the likely location of the visitor. Both these activities are perfectly valid in the absence of search engines and therefore do not necessarily constitute search engine spam. For example, it would not in itself be spam to use IP Delivery to determine that a search engine was based in Germany, and deliver the same content to that search engine's robot as to other German visitors. The content could contain both content spam and meta spam, though.
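Location-based IP Delivery can be sketched with Python's standard `ipaddress` module. The network-to-region table below is entirely hypothetical (it uses address ranges reserved for documentation), as are the function names and content strings. Note that no list of robot addresses is consulted: a robot calling from a "German" address receives the same content as any other visitor in that range.

```python
import ipaddress

# Hypothetical mapping of networks to regions (documentation address ranges).
REGION_BY_NETWORK = {
    ipaddress.ip_network("192.0.2.0/24"): "de",
    ipaddress.ip_network("198.51.100.0/24"): "uk",
}
CONTENT_BY_REGION = {"de": "Willkommen", "uk": "Welcome"}
DEFAULT_REGION = "uk"

def region_for(ip):
    """Return the region whose network contains the address, or the default."""
    address = ipaddress.ip_address(ip)
    for network, region in REGION_BY_NETWORK.items():
        if address in network:
            return region
    return DEFAULT_REGION

def deliver(ip):
    """Deliver content according to the likely location of the requester."""
    return CONTENT_BY_REGION[region_for(ip)]

human_in_germany = deliver("192.0.2.10")
robot_in_germany = deliver("192.0.2.77")  # same range, so same content
visitor_elsewhere = deliver("203.0.113.5")
print(human_in_germany, robot_in_germany, visitor_elsewhere)
```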
We will now define and discuss IP Cloaking.
The identification of search engine robots by IP name or address and the delivery of unique content to those robots.
Using this definition, all uses of IP Cloaking are spam. This is because the unique content is designed only for the search engine robot to see, not for humans. This cannot be justified if search engines did not exist, so it must have been done only to influence search engine relevancy. Therefore it constitutes spam. Every instance of unique content will be content spam, meta spam or both.
IP Cloaking usually involves the building and maintenance, or rental or purchase from a third party, of a database of IP names and addresses used by search engine robots; the identification of search engine robots using this database; and the delivery of unique content to those robots. It therefore requires a lot of effort and/or expense. In return for this effort and expense, the only feature that IP Cloaking offers that other technologies do not offer is preventing humans reading the cloaked page. This very feature means that the content on cloaked pages is spam - designed purely to influence search engine relevancy calculations. There is no non-spam use of IP Cloaking that could not be fulfilled more simply, cheaply and reliably by alternative technologies.
IP Cloaking is excellent for hiding various illegal and immoral practices such as copyright infringement, trademark stealing and bait-and-switch. This is because IP Cloaking is designed to prohibit review of the methods used.
For these reasons, we do not consider IP Cloaking to be an acceptable technique for professionals to associate themselves with. IP Delivery (i.e. delivering content according to the visitor's IP, but not specifically targeting search engine robots) is acceptable. If IP Delivery is deployed, a search engine robot should receive the same content as a human typical of that search engine's users.
Clarification: If it is an IP-based technology that is not delivering what we define as Search Engine Spam, then it is not IP Cloaking but IP Delivery. In short, it's only cloaking if it is spam and it's only spam if it cannot be justified in the absence of search engines. If a search engine has given you written permission to deliver unique content to its robots, and supplied its robots' IP addresses to you for this purpose (e.g. to enable a secure transaction) then, using our definition, this is Not Search Engine Spam. Therefore the delivery of unique content to robots with those IP addresses would be classed as IP Delivery rather than IP Cloaking.
"It was a hard path and a dangerous path, a crooked way and a lonely and a long."
JRR Tolkien, The Hobbit
This document has attempted to set out guidelines and principles for classifying search engine spam. It has been written to allow search engine marketers and other industry professionals to objectively evaluate actions to see whether those actions equate to spamming a search engine. It is hoped that quality search engines, ethical marketers and search industry professionals will agree that this document lays out standards which the industry should strive for.
Within this document, we identified two types of search engine spam (content spam and meta spam) and discussed several examples of those types of spam. We would like to conclude this document with a few comments and guidelines to search engines and Web marketers.
To search engines
- It isn't spam if it's valid in the absence of search engines - especially if it makes a site more accessible - so don't penalise it.
- It isn't my spam if somebody else did it outside my control - so penalise them, if anyone, not me.
To Web marketers
- Use Web technologies for the purposes they were designed.
- Make your sites more marketable by making them more accessible.
- Don't cloak.
Alan Perkins : 30/09/2001