Thursday, May 27, 2010

Robots.txt "how to"

The robots.txt file lets you suggest which directories and files on your site web bots may or may not access. Do _NOT_ think of it as a bot firewall, but rather as a list of suggestions. In _NO_ way can you trust a bot to follow the robots.txt rules. Many web crawlers, spiders and bots will follow any link they see and grab what they want, such as email addresses. In fact, malicious web crawlers will seek out the robots.txt file just to see what you do not want them to find.
The primary reason to use a robots.txt file is to help friendly bots index your preferred pages and ignore the others. It is also a good idea to create a robots.txt file to avoid the errors bots will generate in your logs when they request the file and cannot find it. Keep the logs cleaner by at least making a basic file, or even putting a blank robots.txt in place.


Creating the robots.txt

You should create your robots.txt file with a standard text editor. The file should be saved as robots.txt and uploaded to the document root of your web site, so that it is reachable at a URL like http://www.calomel.org/robots.txt
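Since bots always look for the file at the root of the host, the location can be derived from any page URL. Here is a minimal sketch of that idea in Python; the function name robots_url and the page path /some/dir/page.html are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page path, used only as an example.
print(robots_url("http://www.calomel.org/some/dir/page.html"))
```

A robots.txt placed in a subdirectory would not be found this way, which is why the file belongs in the document root.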
The three directives you have access to are "User-agent", "Allow" and "Disallow". The "User-agent" directive names the bot that the following rules apply to. The "Allow" directive explicitly specifies what bots may see, but it is not supported by all bots. Since "Allow" is not fully supported, we will _not_ rely on it here beyond the catch-all "Allow: /" line. Lastly, the "Disallow" directive lists the directories or files you do not want searched.


Examples

Allowing all bots to go anywhere: This default robots.txt file allows all bots to index all pages of your site. It has the same effect as a blank file or having no robots.txt file at all. The "*" is a wildcard that refers to all bots or clients, and the line "Allow: /" says they can access any file or directory from the document root on down.
User-agent: *
Allow: /
Disallow all bots: Here you specify "/" to disallow the entire root tree of the web site.
User-agent: *
Disallow: /
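If you want to see how a well-behaved bot might read the allow-all, blank and disallow-all files, one option is Python's standard urllib.robotparser module. This is only a sketch of one parser's behavior, not a statement about any particular bot; the helper name allowed, the agent name ExampleBot and the path /index.html are placeholders.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_text, agent, path):
    """Parse robots.txt text and report whether agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, path)


ALLOW_ALL = "User-agent: *\nAllow: /\n"
BLANK = ""
DENY_ALL = "User-agent: *\nDisallow: /\n"

# ExampleBot and /index.html are hypothetical values.
for name, text in (("allow all", ALLOW_ALL), ("blank", BLANK), ("deny all", DENY_ALL)):
    print(name, allowed(text, "ExampleBot", "/index.html"))
```

With this parser, the allow-all file and the blank file both permit the fetch, and the disallow-all file refuses it.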
Disallowing specific bots: If you want to keep certain bots from indexing your pages, you can refer to each bot by name and disallow it. For example, many webmasters prefer not to allow Google's Image Search bot to index any of the images on their site.
User-agent: Googlebot-Image
Disallow: /
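A per-bot rule like this one only affects the named bot. As a rough check, again using Python's standard urllib.robotparser and assuming a bot that honors the file, the image bot is refused while a differently named bot is not. The helper name bot_may_fetch and the path /images/logo.png are invented for the example.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: Googlebot-Image
Disallow: /
"""


def bot_may_fetch(agent, path):
    """Check one agent name against the Googlebot-Image rule above."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(agent, path)


# /images/logo.png is a hypothetical path.
print("Googlebot-Image:", bot_may_fetch("Googlebot-Image", "/images/logo.png"))
print("Googlebot:", bot_may_fetch("Googlebot", "/images/logo.png"))
```

Because the file has no "User-agent: *" section, this parser treats every bot other than Googlebot-Image as unrestricted.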
Disallowing Pages and/or Directories: If you wish to keep bots away from a list of files or directories, you can list them out. This is _NOT_ a secure method of keeping bots out of these directories; it is just a way to help them save time indexing your site. Keep in mind that if you list any private directories in your robots.txt, you should password protect them and/or restrict them by IP. Dishonest bots specifically read the robots.txt file to see where you do not want them to go, so listing private directories alerts rogue bots (those that ignore the contents of your robots.txt file or specifically target disallowed pages and directories), as well as any competitors, to their existence. In these cases, you should disallow robots using meta tags as well. The cgi-bin is a common directory to disallow, since it normally contains scripts, many of which do not need indexing and can consume a lot of bandwidth when crawled.
Let's tell ALL bots _NOT_ to look at the directories /cgi-bin and /private_dir, and _NOT_ to look at the file /secret/large_page.html. Finally, we allow the bots to see any other directory or file not specifically disallowed.
User-agent: *
Disallow: /cgi-bin/
Disallow: /private_dir/
Disallow: /secret/large_page.html
Allow: /
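Before uploading a file like this, it can help to test it against a few paths. The sketch below feeds the rules above to Python's standard urllib.robotparser; the agent name ExampleBot and the test paths other than /secret/large_page.html are made up, and real bots may interpret the file differently.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private_dir/
Disallow: /secret/large_page.html
Allow: /
"""


def check_paths(paths, agent="ExampleBot"):
    """Return {path: True/False} for whether agent may fetch each path."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return {path: parser.can_fetch(agent, path) for path in paths}


# Hypothetical paths chosen to hit each rule once.
results = check_paths([
    "/index.html",
    "/cgi-bin/form.cgi",
    "/private_dir/notes.txt",
    "/secret/large_page.html",
    "/secret/other.html",
])
for path, ok in results.items():
    print(path, "allowed" if ok else "disallowed")
```

This particular parser uses the first rule that matches, in file order, so the "Allow: /" line is best kept last; other bots may resolve overlapping rules in a different way.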


In Conclusion

When you have decided what you want to allow and disallow in your robots.txt, save it and put it in your web server's document root. Also understand that not all bots support compression. Google, MSN, and Yahoo support compressed files, but other bots do not. Your mileage will vary.
