Palisade Magazine

 
June 2006

Dodging the spiders

by Shalini Gupta

A web spider is a software program that traverses pages on the World Wide Web in an automated manner and extracts information from them. Spiders are also known as web crawlers or web robots. Because they read the raw HTML of a page, they see both its visible content and its non-visible parts, such as comments and hidden fields, although they cannot access content that requires authentication or authorization. Spiders can search for data across websites far faster and more thoroughly than a human ever could.

Web spiders are used primarily by search engines and website owners. Search engine spiders, which identify themselves through their user agent strings, find pages and download them so that a copy of every visited page is available for faster searching. Website owners use spiders to automate maintenance tasks on a web site, such as checking links or validating HTML code. While these are legitimate uses, spiders can also be put to detrimental purposes: harvesting email addresses for generating spam, uncovering application details for exploitation (e.g. hidden URLs, test accounts), learning development techniques and possible code bypasses from “hidden” comments and notes left by application developers, and mounting social engineering attacks based on personal data such as names, telephone numbers and email addresses.

How spiders work

Spiders traverse the Web by recursively retrieving linked pages. In general, a spider starts with a list of URLs to visit supplied by its owner. As it visits these URLs, it identifies all the hyperlinks on each page and adds them to the list of URLs still to visit. When a spider visits a website it does the following:

  1. A well behaved spider first looks for the robots.txt file and the robots META tag to learn the "rules" set for browsing the website or the individual page.
  2. It then scans the text on the page, along with the contents of various HTML tags, depending on the purpose of the spider.
  3. Finally, it analyses the content and returns the desired information to its owner.

For instance, Google’s spider, known as Googlebot, works in two steps:

  • It fetches the web pages from the website and sends them to the indexer.
  • The indexer sorts through every word on the page and stores them in a database.
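
To make the crawl loop described above concrete, here is a minimal sketch in Python (standard library only) that fetches a page, extracts every HREF it finds and queues the linked pages for later visits. The seed URL and page limit are placeholders, not the workings of any particular spider.

import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the value of every HREF attribute found in <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    to_visit = [seed]                    # URLs the spider still has to fetch
    visited = set()
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
        except Exception:
            continue                     # unreachable or unreadable page
        parser = LinkExtractor()
        parser.feed(html)
        # Newly discovered links join the list of URLs to visit, as described above.
        to_visit.extend(urljoin(url, link) for link in parser.links)
    return visited

# crawl("http://www.abc.com/")           # placeholder seed URL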

Guiding spider activity

As mentioned earlier, well behaved spiders obey the robots.txt file and the robots META tag to learn what they should not do. In the absence of this file and tag, spiders follow every link and visit every directory of the website. Let's take a look at what these are and how they can be used to guide spiders crawling the website.

The Robots.txt File

The robots.txt file uses the Robots Exclusion Protocol, which was designed to restrain spiders' access to web sites. It is not an official standard backed by a standards body, nor is it owned by any commercial organization.

According to the protocol, the robots.txt file should exist in the root directory of the web server (e.g. http://www.abc.com/robots.txt). This file specifies an access policy for robots, indicating which parts of the site should not be visited by which robot. The protocol specifies that a spider should look for this file on the web server before crawling the website. Since none of this is mandatory, it is up to website administrators to create the file and up to the robots to obey it.

A robots.txt file would have entries like:

User-agent: EmailSiphon
Disallow: /
  
User-agent: XYZ
Disallow: /employee/details
 
User-agent: *
Disallow: /cgi-bin/
Disallow: /javascript/

Here, ‘User-agent’ names the spider a record applies to and ‘Disallow’ lists the URL paths that spider is not allowed to access. ‘Disallow: /’ bars the named spider from the entire site, while ‘User-agent: *’ applies a record to every spider not matched by name.
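
A well behaved spider can evaluate these rules before fetching a page. The sketch below feeds the sample file above into Python's standard urllib.robotparser; the user agent names are the ones from the sample plus an arbitrary ‘SomeBot’.

import urllib.robotparser

rules = """\
User-agent: EmailSiphon
Disallow: /

User-agent: XYZ
Disallow: /employee/details

User-agent: *
Disallow: /cgi-bin/
Disallow: /javascript/
""".splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("EmailSiphon", "/index.html"))   # False: banned from the whole site
print(rp.can_fetch("XYZ", "/employee/details"))     # False: path is disallowed for XYZ
print(rp.can_fetch("SomeBot", "/index.html"))       # True: no rule forbids it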

Robots META Tag

META tags are placed between the <head> and </head> tags. The robots META tag lets web developers tell a spider whether to index the current page and whether to follow the links on it. The parameters that can be used in the tag are index, follow, all, none, noindex and nofollow, separated by commas. The absence of the META tag means ‘all’, allowing spiders to index the current page and follow all the links on it. ‘index’ and ‘noindex’ tell the spider whether or not it may index the current page, while ‘follow’ and ‘nofollow’ tell it whether or not it may follow the links on the page. ‘none’ means that the spider should neither index the current page nor follow any links on it.

For example, <META name="robots" content="noindex, nofollow"> tells the spider neither to index the current page nor to follow its links. Though this META tag is useful in guiding spiders, not all spiders support it.
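
A spider that does honour the tag has to read it out of the page's <head>. A minimal sketch of how that check might look, using Python's standard html.parser:

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Reads the robots META tag; the defaults correspond to 'all' (index, follow)."""
    def __init__(self):
        super().__init__()
        self.index = True
        self.follow = True
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            directives = [d.strip().lower() for d in (attrs.get("content") or "").split(",")]
            if "noindex" in directives or "none" in directives:
                self.index = False
            if "nofollow" in directives or "none" in directives:
                self.follow = False

parser = RobotsMetaParser()
parser.feed('<head><META name="robots" content="noindex, nofollow"></head>')
print(parser.index, parser.follow)   # False False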

Detecting spiders

Web server logs can be a useful resource for detecting spiders. A web server log typically contains information such as:

  • User Agent of the visitor
  • IP address of the visitor
  • Files requested by the visitor

The log file records, in the User Agent field, the name of the program the visitor used to access the website. Popular web browsers such as Internet Explorer and Netscape identify themselves with the word ‘Mozilla’, whereas most spiders use their own names as the User Agent. Although the User Agent field is an easy way to identify a visitor, it can be spoofed. The log file also records the visitor's IP address, which helps identify the client machine, and it keeps track of every request the visitor made, which gives an idea of the purpose of the visit. Taken together, this information often helps in identifying a spider.
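
Putting these three pieces of information together, a simple log-watching script can flag likely spiders. The sketch below assumes the common Apache ‘combined’ log format; the log file name and the ‘mozilla’ heuristic are illustrative assumptions, not a reliable signature on their own.

import re
from collections import defaultdict

# Apache "combined" format: IP ident user [time] "METHOD PATH PROTO" status size "referer" "user-agent"
LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" \d+ \S+ "([^"]*)" "([^"]*)"')

def scan(logfile):
    requests = defaultdict(list)             # IP address -> list of requested paths
    for line in open(logfile):
        m = LINE.match(line)
        if not m:
            continue
        ip, method, path, referer, agent = m.groups()
        requests[ip].append(path)
        # Browsers normally announce themselves as Mozilla-compatible; anything
        # else deserves a closer look, though the field is easily spoofed.
        if "mozilla" not in agent.lower():
            print(f"possible spider: {ip} used agent '{agent}' for {path}")
    # A visitor that asks for robots.txt is almost certainly an automated client.
    for ip, paths in requests.items():
        if "/robots.txt" in paths:
            print(f"{ip} requested /robots.txt ({len(paths)} requests in total)")

# scan("access.log")                         # placeholder log file name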

Barricading the spiders

Blocking HEAD Requests

The HTTP HEAD request is used to check whether a link exists or whether its content has been modified, and a few spiders rely on it. Configuring the web server not to respond to HTTP HEAD requests can therefore work as a defense against such spiders.
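
Where the web server itself cannot be configured this way, the same effect can be approximated at the application layer. Below is a minimal sketch of a WSGI middleware (Python standard library) that refuses HEAD requests; the wrapped application is only a placeholder.

from wsgiref.simple_server import make_server

def app(environ, start_response):
    """Placeholder application serving a trivial HTML page."""
    start_response("200 OK", [("Content-Type", "text/html")])
    return [b"<html><body>Hello</body></html>"]

def block_head(application):
    """Rejects HTTP HEAD requests before they reach the application."""
    def wrapper(environ, start_response):
        if environ.get("REQUEST_METHOD", "").upper() == "HEAD":
            start_response("405 Method Not Allowed", [("Content-Type", "text/plain")])
            return [b"HEAD requests are not served."]
        return application(environ, start_response)
    return wrapper

# make_server("", 8000, block_head(app)).serve_forever()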

Drawbacks: Current spiders generally do not use HTTP HEAD requests, so this defense has limited reach.

Using the ‘Referer’ Field

Most HTTP requests carry a ‘Referer’ header, which indicates the URL of the page from which the request was made. Spiders typically do not send a proper ‘Referer’ value. A defense mechanism can therefore be built in which every request is validated against this header: the page is served only if the request carries an expected ‘Referer’ value, and if the value is incorrect or missing the user is redirected to the home page of the site.
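
A minimal sketch of such a check in Python; the expected host name and home page URL are assumptions based on the example site used throughout this article.

from urllib.parse import urlparse

EXPECTED_HOST = "www.abc.com"           # only our own pages may link to protected content
HOME_PAGE = "http://www.abc.com/"       # where unverified visitors are sent

def check_referer(headers):
    """Returns None when the request may be served as asked, or the URL the
    visitor should be redirected to when the Referer is missing or foreign."""
    referer = headers.get("Referer", "")
    if referer and urlparse(referer).netloc == EXPECTED_HOST:
        return None
    return HOME_PAGE

print(check_referer({}))                                              # redirected: no Referer at all
print(check_referer({"Referer": "http://www.abc.com/index.html"}))    # None: served normally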

Drawbacks: There are practical scenarios in which the ‘Referer’ header may be missing or different from the one expected. A few of them include:

  1. Privacy settings that configure some browsers not to send the ‘Referer’ header with requests.
  2. Requests redirected from external sources such as a search engine or an emailed link.
  3. Direct requests through links bookmarked in the browser.

Manipulating Extensions

When a client receives a response from the server, it looks at the file extension in the URL and at the ‘Content-Type’ header to decide how to interpret the data. When the server sends an HTML file under a .jpg file extension with a text/html ‘Content-Type’, the browser still renders the page correctly. Most spiders, however, are programmed to ignore URLs with extensions such as .gif and .jpg. This behaviour can therefore be used as a defense: serving pages under image-like extensions makes spiders ignore those URLs.
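
As a rough illustration, the sketch below uses Python's built-in http.server to serve ordinary HTML under a URL ending in .jpg; the path and page content are made up for the example.

from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping of disguised URLs to real HTML content.
PAGES = {"/page.jpg": "<html><body>Content hidden behind a .jpg extension</body></html>"}

class DisguisedHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGES.get(self.path)
        if body is None:
            self.send_error(404)
            return
        data = body.encode("utf-8")
        self.send_response(200)
        # The extension suggests an image, but the Content-Type tells the browser it is HTML.
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

# HTTPServer(("", 8000), DisguisedHandler).serve_forever()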

Drawbacks:

  1. Some browsers may not look for the Content-Type field in the response header.
  2. When a user attempts to save the web page, it will be stored as a .jpg file.

Using the Refresh Field

Spiders identify links or embedded URLs by the presence of “HREF=” attributes in the HTML content. If an alternative method is used to convey a URL in the server's response, spiders do not pick it up.

The ‘Refresh’ field is used to perform a controlled, automated client-side redirect. When a user requests invalid, nonexistent or recently moved content, the server sends the correct URL within the ‘Refresh’ field of the response. On receiving this response, the browser is automatically redirected to the correct location. The server can set the ‘Refresh’ field either in the HTTP header or through a META tag with the HTTP-EQUIV attribute.

For example, the response from the server would look like:

HTTP/1.0 200 OK
Server: Microsoft-IIS/5.0
Content-Type: text/html
Refresh: 0; URL=http://www.abc.com/page.html

Or alternatively,

<META HTTP-EQUIV="Refresh" CONTENT="0; URL=http://www.abc.com/page.html">

When a spider gets the 200 OK response, it searches only for HREF attributes to identify links. Since the new location is carried in the ‘Refresh’ field rather than in a link, the spider finds nothing to follow, while the browser is redirected to the correct page.

Watch the speed

Being software agents, spiders send requests to the web server at a speed no human user can match. The web server can therefore monitor the time between two consecutive requests from the same visitor to identify spiders.
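
A sketch of such a timing check, keeping the time of the last request per IP address; the half-second threshold is an assumption and would need tuning for a real site.

import time
from collections import defaultdict

MIN_INTERVAL = 0.5                     # assumed minimum human-like gap, in seconds
last_request = defaultdict(float)      # IP address -> timestamp of the previous request
fast_hits = defaultdict(int)           # IP address -> count of suspiciously quick requests

def record_request(ip, now=None):
    """Returns True when the gap since this visitor's previous request is
    shorter than a human could plausibly manage."""
    now = time.time() if now is None else now
    too_fast = last_request[ip] > 0 and (now - last_request[ip]) < MIN_INTERVAL
    last_request[ip] = now
    if too_fast:
        fast_hits[ip] += 1
    return too_fast

# Simulated burst of requests from one address, 0.1 seconds apart.
for i in range(5):
    print(record_request("10.0.0.1", now=1000.0 + i * 0.1))
# Prints False for the first request, then True for the four rapid follow-ups.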

To be continued…

In the next article we will discuss several other defenses and try to pick the best ones for defending against crawling spiders…
