The primary reason to use a robots.txt file is to help friendly bots index your preferred pages and ignore the others. It is also a good idea to have a robots.txt file in place so you avoid all the errors bots will generate when they cannot access it. Keep the logs cleaner by at least making a basic file, or even putting a blank robots.txt in place.
Creating the robots.txt
You should create your robots.txt file with a standard text editor. The file should be saved as robots.txt and uploaded to the root directory of your site, so it will look something like http://www.calomel.org/robots.txt. The three directives you have access to are "User-agent", "Allow" and "Disallow". The User-agent is the name of the bot crawling your site. The "Allow" directive lets you explicitly specify what you want bots to see, but it is not fully supported by all bots, so we will largely _not_ be relying on it. Lastly, the Disallow directive lists the directories or files you do not want searched.
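As a rough sketch of the overall layout (the bot name and directory below are placeholders, not taken from this article): each record starts with a User-agent line naming the bot it applies to, followed by one or more Disallow or Allow lines, and separate records are divided by a blank line.

  # Record for one specific bot (placeholder name).
  User-agent: ExampleBot
  Disallow: /example_dir/

  # Record for every other bot; an empty Disallow value allows everything.
  User-agent: *
  Disallow: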
Examples
Allowing all bots to go anywhere: This default robots.txt file will allow all bots to index all pages of your site. This is the same as a blank file or not having any robots.txt file at all. The "*" is generic and refers to all bots or clients, and the line "Allow: /" says they can access any file or directory from the document root on down.

  User-agent: *
  Allow: /

Disallowing all bots: Here you can specify "/" and disallow the entire root tree of the web site.

  User-agent: *
  Disallow: /
Disallowing specific bots: If you want to keep certain bots from indexing your pages, you can refer to the bot by name and disallow it. For example, many webmasters prefer not to allow Google's Image Search bot to index any of the images on their site.

  User-agent: Googlebot-Image
  Disallow: /
Disallowing Pages and/or Directories: If you wish to keep bots from hitting a list of files or directories, you can list them out. This is _NOT_ a secure method of keeping bots out of these directories; it is just a way to help them save time when indexing your site. Keep in mind that if you have any private directories listed in your robots.txt, you should password protect and/or restrict them by IP address. Dishonest bots specifically read the robots.txt file to see where you do not want them to go, so listing private directories alerts rogue bots (those that ignore the contents of your robots.txt file or specifically target disallowed pages and directories) as well as any competitors to those directories. In these cases, you should disallow robots using meta tags as well. The cgi-bin is a common directory to disallow, since it normally contains scripts, many of which do not need indexing and can consume a lot of bandwidth when crawled.
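As a concrete illustration of the meta tag approach mentioned above (this tag is a generic example, not something from the original article), placing the following in the HTML head of a sensitive page asks compliant bots not to index the page or follow its links:

  <meta name="robots" content="noindex, nofollow">

Like robots.txt itself, this only restrains well-behaved bots; it is no substitute for password protection or IP restrictions.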
Let's tell ALL bots _NOT_ to look at the directories /cgi-bin and /private_dir and _NOT_ at the file /secret/large_page.html. Finally, we will allow the bots to see any other directory or file not specifically disallowed.
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /private_dir/
  Disallow: /secret/large_page.html
  Allow: /
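If you want to sanity-check a rule set like this before relying on it, a short script can parse it the same way a compliant crawler would. This is just a sketch using Python's standard-library urllib.robotparser; the script and the test paths are illustrations, not part of the original article.

  from urllib.robotparser import RobotFileParser

  # The exact rules from the example above, supplied to the parser as lines.
  rules = [
      "User-agent: *",
      "Disallow: /cgi-bin/",
      "Disallow: /private_dir/",
      "Disallow: /secret/large_page.html",
      "Allow: /",
  ]

  rp = RobotFileParser()
  rp.parse(rules)

  # Disallowed paths should come back False for a generic user agent.
  print(rp.can_fetch("*", "/cgi-bin/test.cgi"))        # False
  print(rp.can_fetch("*", "/secret/large_page.html"))  # False

  # Anything not listed should still be allowed.
  print(rp.can_fetch("*", "/index.html"))              # True

The parse() call reads the rules from a list instead of fetching them over the network; set_url() and read() could be used instead to test the live file on your server.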