More on dodging spiders
In the first part of this article series, we discussed the malicious use of spiders and some means of defending against them. In this article, we’ll explore other defenses, such as onetime links, special links, Turing tests and URL tokenization. We will also try to identify the most suitable defense against crawling spiders.
Use Onetime Links
Assign a unique value to links so that any out-of-order request can be identified. For example, a variable ‘tag’ with a random, unique numeric value is appended to all the links in the web page. Say the page ‘/Step1.asp?tag=50814911’ has the following links:
Clicking ‘link2’ fetches the page ‘/Step2.asp’ with a new random tag value assigned by the server, say ‘tag=40591809’. The links in this new page (/Step2.asp?tag=40591809) will be as follows:
On receiving a request, the application verifies the session ID and then checks the tag value. If the tag is correct, the links in the response page are assigned a new value, and the previous tag is invalidated so that it cannot be reused. This makes each link a ‘Onetime link’. Many spiders multithread their requests to speed up discovery and download. Since a tag is valid for only one request, parallel or out-of-order requests fail, which impedes the crawl.
Drawbacks:
- The application has to dynamically append the tag value to the links in every page.
- Each page should provide its own ‘Back’ and ‘Forward’ functionality, as the browser’s default ‘Back’ and ‘Forward’ buttons would not work.
- Links from the browser history and bookmarked links would not work.
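The tag check described above can be sketched on the server side roughly as follows. This is a minimal illustration in JavaScript (Node.js), not the implementation used by any particular application: the in-memory `currentTag` map and the function names `issueTag`, `tagLink` and `handleRequest` are assumptions made for the example, and a real application would keep the tag in its own session store.

```javascript
const crypto = require('crypto');

// Hypothetical in-memory store: session ID -> the single tag currently valid.
const currentTag = new Map();

// Issue a fresh 8-digit tag for the session. Storing it overwrites, and
// therefore invalidates, the previous tag.
function issueTag(sessionId) {
  let tag;
  do {
    tag = String(crypto.randomInt(10000000, 100000000));
  } while (tag === currentTag.get(sessionId)); // never reissue the same value
  currentTag.set(sessionId, tag);
  return tag;
}

// Append the tag to a link that is rendered into the response page.
function tagLink(path, tag) {
  const separator = path.includes('?') ? '&' : '?';
  return `${path}${separator}tag=${tag}`;
}

// Returns the new tag for the response page, or null when the request is
// out of order (unknown session, or a wrong or already used tag).
function handleRequest(sessionId, presentedTag) {
  if (!currentTag.has(sessionId) || currentTag.get(sessionId) !== presentedTag) {
    return null;
  }
  return issueTag(sessionId);
}
```

Because only one tag is valid per session at any time, two requests sent in parallel with the same tag cannot both succeed, which is what slows down a multithreaded spider.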
Use Special Links
Embed special links (such as commented or hidden links) within the HTML body of all pages, and have these links lead to a continually monitored page. For example, a commented link such as <!-- <A HREF="../clickme.html"></A> -->. These links are not visible to a human and thus will not be requested by legitimate users. Spiders follow links found in the source code of the page and would therefore request these special links.
When a special link is accessed, the web application is designed to respond accordingly (e.g. invalidating the session ID, blocking the IP address, triggering detailed log analysis, etc.). After the session ID is invalidated, a default page is served for all requests from that IP address.
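The reaction to a trap request might look like the sketch below. It is only an illustration under stated assumptions: the trap path matches the example link above, while the `sessions`, `blockedIps` and `auditLog` stores, the `/default.html` page and the shape of the `request` object are all hypothetical.

```javascript
// Path of the monitored page that the hidden link points to.
const TRAP_PATH = '/clickme.html';

const sessions = new Map();   // session ID -> session data (assumed store)
const blockedIps = new Set(); // IP addresses that have requested the trap
const auditLog = [];          // entries kept for later log analysis

// request is a plain object: { ip, path, sessionId }
function handleRequest(request) {
  if (request.path === TRAP_PATH) {
    // Only something reading the page source would ask for this link.
    sessions.delete(request.sessionId); // invalidate the session ID
    blockedIps.add(request.ip);         // remember the offending address
    auditLog.push({ ip: request.ip, sessionId: request.sessionId, time: Date.now() });
  }
  if (blockedIps.has(request.ip)) {
    // Every request from a flagged address gets the default page.
    return { page: '/default.html' };
  }
  return { page: request.path };
}
```

Blocking by IP address can also affect legitimate users who share that address, for example behind a proxy, so the reaction should be chosen with care.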
Include Turing Test
A human user can be differentiated from a spider by means of a Turing test, e.g. a CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart). Under this approach, the user must interpret on-screen or audio information and submit a response in order to proceed. Audio CAPTCHAs keep the test accessible to visually impaired users. We discussed CAPTCHAs in our December 2005 issue.
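The server-side bookkeeping behind such a test can be sketched as follows. This is a rough illustration only: the `pending` store, the character set and the function names are assumptions, and the hard part of a real CAPTCHA, rendering the text as a distorted image or an audio clip, is deliberately left out.

```javascript
const crypto = require('crypto');

// Characters that are hard to confuse with each other (assumed alphabet).
const ALPHABET = 'ABCDEFGHJKLMNPQRSTUVWXYZ23456789';

// Hypothetical store: session ID -> text the user is expected to type.
const pending = new Map();

// Create a challenge for the session. The returned text must reach the
// user only as an image or audio clip, never as plain text in the page.
function createChallenge(sessionId, length = 6) {
  let text = '';
  for (let i = 0; i < length; i++) {
    text += ALPHABET[crypto.randomInt(ALPHABET.length)];
  }
  pending.set(sessionId, text);
  return text;
}

// Check the submitted answer. The challenge is discarded after one attempt,
// so a program cannot keep guessing against the same challenge.
function verifyResponse(sessionId, answer) {
  const expected = pending.get(sessionId);
  pending.delete(sessionId);
  return expected !== undefined &&
    typeof answer === 'string' &&
    answer.trim().toUpperCase() === expected;
}
```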
Use URL Tokenization
URL tokenization means appending a token to the URL. When a request is submitted to the server, a script included in the web page appends a token (a unique value) to the request URL. The tokens are issued by the server and embedded in the HTML content of the page, and the server validates the token on each request. For example, the HTML content of a page contains a client-side script which appends the ‘token’ value to the link:
<SCRIPT>
var token = "93810ae733i853";
document.write("<A HREF='http://www.my.com/Page.asp?Token=" + token + "'>Click me</A>");
</SCRIPT>
Many spiders do not execute scripts. They identify links by the ‘HREF’ attribute in the page source and would therefore fail to append the token value while following the links.
This method can be made more complex by having the client-side script build dynamic tokens from parameters known on the client side.
Drawbacks:
- Legitimate users may not be able to access the links if client-side scripting is disabled in their browsers.
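On the server, issuing and validating such a token could be sketched as below. The article does not say how tokens are generated, so deriving the token from the session ID with an HMAC is purely an assumption of this example, as are the names `SECRET`, `issueToken`, `isValidToken` and `renderTokenLink`. The `url` and `label` arguments are assumed to be trusted values supplied by the application.

```javascript
const crypto = require('crypto');

// Server-side secret, generated at startup here purely for illustration.
const SECRET = crypto.randomBytes(32);

// Derive the token for a session, so it does not need to be stored.
function issueToken(sessionId) {
  return crypto.createHmac('sha256', SECRET).update(sessionId).digest('hex');
}

// Validate the token received in the request URL.
function isValidToken(sessionId, token) {
  if (typeof token !== 'string') return false;
  const expected = Buffer.from(issueToken(sessionId));
  const received = Buffer.from(token);
  // timingSafeEqual requires buffers of equal length.
  return expected.length === received.length &&
    crypto.timingSafeEqual(expected, received);
}

// Build the client-side script. The complete URL exists only after the
// script has run, so it never appears as a static HREF in the page source.
function renderTokenLink(sessionId, url, label) {
  const token = issueToken(sessionId);
  return `<SCRIPT>var token = ${JSON.stringify(token)}; ` +
    `document.write("<A HREF='${url}?Token=" + token + "'>${label}</A>");</SCRIPT>`;
}
```

A request whose ‘Token’ parameter is missing or wrong, as it would be for a spider that read the link without running the script, fails `isValidToken` and can be refused.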
Finding the best possible solution…
Before looking for a solution, let’s be clear that there is no problem if spiders crawl web pages that are available for public access. If some pages of the website contain sensitive information, the best way to protect them is to identify all such pages and bring them under an authentication or authorization process. Spiders without valid credentials cannot access pages that require login. In a different scenario, if your website has a few not-so-sensitive pages, or parts of pages, that you want only human users to access, then any one or a combination of the defenses can be implemented, according to your choice and after understanding the associated drawbacks.
If you consider a page sensitive because it contains an email ID, then instead of using an anti-spidering technique it is better to obscure the email ID through JavaScript or to hide it in an image. I have discussed these methods in the quiz of the March 2006 issue.
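One possible way of obscuring an address through JavaScript is sketched below. It is an assumption of this example and not necessarily the method from the March 2006 quiz: the hypothetical helper `obfuscatedMailto` produces a script that assembles the address in the browser, so the complete address never appears as one string in the page source.

```javascript
// Returns an HTML snippet whose script joins the two halves of the
// address at display time. JSON.stringify quotes the values safely
// for use inside the generated script.
function obfuscatedMailto(user, domain) {
  const u = JSON.stringify(user);
  const d = JSON.stringify(domain);
  return `<SCRIPT>var u = ${u}, d = ${d}; ` +
    `document.write("<A HREF='mailto:" + u + "@" + d + "'>" + u + "@" + d + "</A>");</SCRIPT>`;
}
```

For example, `obfuscatedMailto('info', 'example.com')` yields a snippet in which ‘info’ and ‘example.com’ appear only as separate strings. A harvester that executes scripts would still recover the address.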
It is always better to sanitize the source code of web pages so that it does not reveal application details (e.g. hidden comments, notes, test accounts, personal data left by developers) than to implement defenses against spiders out of worry that they will find sensitive information that can be used for exploitation.
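As a starting point for such a review, the HTML comments in a page can be listed with a few lines of code. The helper `findHtmlComments` below is a hypothetical sketch that relies on a simple pattern match rather than a full HTML parser, so it is meant as an aid for manual inspection and not as a complete sanitizer.

```javascript
// Returns the text of every HTML comment found in a page's source,
// so that a reviewer can check each one for leftover details.
function findHtmlComments(html) {
  const comments = [];
  const pattern = /<!--([\s\S]*?)-->/g; // non-greedy, spans multiple lines
  let match;
  while ((match = pattern.exec(html)) !== null) {
    comments.push(match[1].trim());
  }
  return comments;
}
```

Any deliberately planted ‘special’ comment link will show up in the list as well and should be left in place.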
Comparing the defenses, techniques that involve dynamically generated application content clearly provide better protection. Techniques such as URL tokenization and ‘Special’ links are more sophisticated, but they also require changes to the existing application.
As spiders become more advanced, more effective defensive methods will need to be adopted.