What is robots.txt, and why is one needed
To answer the second question first: robots.txt gives you some control over which portions of your web site a web robot (spider) can audit or index. By using a robots.txt file you can, for example, keep a web robot from trying to index your entire dynamically generated online store.
Overview
robots.txt is a plain text file in the top directory of your document space (/docroot on CIRR servers). It contains one or more records. Each record names a robot by its "user agent" string (or uses * for all robots) and lists the directories that robot should not search. An example looks like this:
User-Agent: *
Disallow: /private/
Disallow: /images/
Disallow: /Old/
Disallow: /cgi-bin/
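If you would like to check how a standards-following robot would read the example above, the following is a minimal sketch using Python's standard urllib.robotparser module. The robot name "examplebot", the helper name allowed, and the three test paths are made up for illustration; they are not part of the CIRR setup.

```python
from urllib.robotparser import RobotFileParser

# The example rules from the overview, as the lines of a robots.txt file.
EXAMPLE_RULES = """\
User-Agent: *
Disallow: /private/
Disallow: /images/
Disallow: /Old/
Disallow: /cgi-bin/
""".splitlines()


def allowed(rules, robot_name, path):
    """Return True if a compliant robot called robot_name may fetch path."""
    parser = RobotFileParser()
    parser.parse(rules)  # parse the lines directly; no network access needed
    return parser.can_fetch(robot_name, path)


# "examplebot" and these paths are hypothetical, chosen only for the demo.
for path in ("/index.html", "/private/notes.html", "/cgi-bin/store.cgi"):
    verdict = "allowed" if allowed(EXAMPLE_RULES, "examplebot", path) else "refused"
    print(path, verdict)
```

This only models what a well-behaved robot would do with the file. It does not test your web server, and it cannot tell you whether a given robot actually honors the rules.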
Refusing Access to One Bad Robot
To refuse access by one poorly behaved robot, you could use the following text in the robots.txt file:
User-agent: poorlybehavedbot-1.0
Disallow: /
This disallows all access by the robot which identifies itself as poorlybehavedbot-1.0 in your web server's access logs. (This information appears in the "user agent" field. If you do not have that information in your access logs, or in a separate user agent log, we recommend that you complain to your web space provider. It is useful information!) The path / is understood to mean that robots should not access any of the files on the server.
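To see that such a record affects only the named robot, here is a small sketch, again using Python's standard urllib.robotparser module. The second robot name, "politebot", is hypothetical and stands in for any other robot.

```python
from urllib.robotparser import RobotFileParser

# A record that refuses everything to a single named robot.
BAD_ROBOT_RULES = [
    "User-agent: poorlybehavedbot-1.0",
    "Disallow: /",
]

parser = RobotFileParser()
parser.parse(BAD_ROBOT_RULES)

# The named robot is refused every path, because every path starts with "/".
bad_robot_allowed = parser.can_fetch("poorlybehavedbot-1.0", "/index.html")

# "politebot" is a made-up name; no record mentions it, so it is not restricted.
other_robot_allowed = parser.can_fetch("politebot", "/index.html")

print("poorlybehavedbot-1.0:", bad_robot_allowed)
print("politebot:", other_robot_allowed)
```

Keep in mind that robots.txt is a request, not an access control. A robot that is truly badly behaved may simply ignore it, in which case the user agent string from your logs is still useful for blocking it by other means.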
Refusing Access by all Robots
You can refuse access to all robots, rather than to individual robots, by using the special "user agent" string *. For example:
User-agent: *
Disallow: /cgi-bin
In this example, access to the /cgi-bin directory is refused to all "robots" that adhere to the standard.
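As a rough illustration of how the * record behaves, the sketch below checks the rule against two robot names and a few paths using Python's standard urllib.robotparser module. The robot names and paths are hypothetical. The module treats each Disallow value as a prefix of the requested path.

```python
from urllib.robotparser import RobotFileParser

# The record from the example above: one rule for every robot.
ALL_ROBOTS_RULES = [
    "User-agent: *",
    "Disallow: /cgi-bin",
]

parser = RobotFileParser()
parser.parse(ALL_ROBOTS_RULES)

# Hypothetical robot names and paths, used only to exercise the rule.
ROBOTS = ("examplebot", "anotherbot")
PATHS = ("/cgi-bin", "/cgi-bin/search.cgi", "/index.html")

# Map each (robot, path) pair to True (allowed) or False (refused).
checks = {
    (robot, path): parser.can_fetch(robot, path)
    for robot in ROBOTS
    for path in PATHS
}

for (robot, path), ok in checks.items():
    print(robot, path, "allowed" if ok else "refused")
```

Both robots get the same answers, since * matches any user agent string. To refuse the whole site to all robots rather than one directory, the same record would use Disallow: / instead.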
More details
A deeper tutorial on robots.txt can be found at Search Engine World. Another good resource appears to be on SearchTools.com. The Standard for Robot Exclusion can be found at robotstxt.org.
Copyright 2000, 2001 Central Iowa (Model) Railroad
$Id: robots.txt.html,v 1.1 2001/08/10 16:56:12 cirr Stable $