A crawler, robot (or "bot"), spider, or wanderer is a computer program that searches the Internet and records information about web pages. For example, a "spam bot" searches the pages of web sites and records the e-mail addresses linked in them.
Using robots.txt
Although you can't keep all robots away, you can give instructions to those that follow the "Robots Exclusion Protocol". This method requires that you create a plain text file called robots.txt and place it in your site's root directory. For example, if your web site URL is http://www.mycompanydomain.com/, then the file must be accessible at http://www.mycompanydomain.com/robots.txt for the restrictions to take effect. Note that the file must be in your root directory (i.e. ~/public_html/) and no other.
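As a rough illustration of the location rule above, the following Python sketch derives the robots.txt address a robot would request for any page on a site. The function name robots_txt_url and the sample page path are made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt always sits at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The page path below is hypothetical; only the host matters.
print(robots_txt_url("http://www.mycompanydomain.com/products/index.html"))
# prints http://www.mycompanydomain.com/robots.txt
```

Whatever page the robot starts from, the lookup goes to the root of the host, which is why a robots.txt placed in a subdirectory has no effect.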
The contents of robots.txt consist mainly of two commands: "User-agent" and "Disallow".
The "User-agent" command allows you to set restrictions on a robot with a particular name (or signature). You can set this to the asterisk (*) to specify that the restrictions apply to all robots that aren't identified elsewhere in the file.
The "Disallow" command allows you to deny access to certain directories in your web site.
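To show how the two commands fit together, here is a minimal Python sketch that writes out robots.txt text from a table of robot names and the directories each one is denied. The helper build_robots_txt, the robot name ExampleBot, and the /private/ path are all hypothetical and chosen only for illustration.

```python
def build_robots_txt(rules: dict[str, list[str]]) -> str:
    """Render {robot name: [disallowed paths]} as robots.txt text."""
    records = []
    for agent, paths in rules.items():
        # Each record starts with the robot it applies to ...
        lines = [f"User-agent: {agent}"]
        # ... followed by one Disallow line per restricted path.
        lines += [f"Disallow: {path}" for path in paths]
        records.append("\n".join(lines))
    # A blank line separates one robot's record from the next.
    return "\n\n".join(records) + "\n"


# "*" covers every robot not named elsewhere; ExampleBot is a made-up name.
print(build_robots_txt({"*": ["/private/"], "ExampleBot": ["/"]}), end="")
```

The output can be saved as robots.txt and uploaded to the site's root directory.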
Example of robots.txt
Say your URL is http://www.mysite.com, and you put a robots.txt file containing the following into your ~/public_html/ directory:
Figure 1: sample robots.txt file.
User-agent: *
Disallow: /neat_stuff/
Disallow: /my_pvt_stuff/

User-agent: WebCrawler
Disallow: /
In this example, all robots have free access to the web site except for the files contained in the /neat_stuff/ and /my_pvt_stuff/ directories, while the WebCrawler robot is denied all access to the site.
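One way to check how a well-behaved robot would read the Figure 1 file is Python's standard urllib.robotparser module. The sketch below feeds it the sample rules directly instead of downloading them; the robot name SomeOtherBot, the page paths, and the helper allowed are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser

# The rules from Figure 1, one command per line.
SAMPLE = """\
User-agent: *
Disallow: /neat_stuff/
Disallow: /my_pvt_stuff/

User-agent: WebCrawler
Disallow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())


def allowed(robot: str, path: str) -> bool:
    """Ask whether the named robot may fetch the given path on the site."""
    return parser.can_fetch(robot, "http://www.mysite.com" + path)


print(allowed("SomeOtherBot", "/index.html"))            # True
print(allowed("SomeOtherBot", "/neat_stuff/page.html"))  # False
print(allowed("WebCrawler", "/index.html"))              # False
```

This only shows what a compliant robot would conclude; a robot that ignores the protocol is not affected by these rules.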
NOTE: Since not all robot authors honor the "Robots Exclusion Protocol", it is not possible to stop all robots using this method. However, most search engine robots do follow this protocol. Please refer to the documentation on their sites for more information.
Using the "ROBOTS" META tag
Using the META tag method, you can specify restrictions in your web pages individually. Unfortunately, fewer robots recognize this method than the robots.txt method. The content of the META tag takes two parameters: INDEX or NOINDEX, and FOLLOW or NOFOLLOW. See the following examples:
Figure 2: sample of "ROBOTS" META tag.
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
In this example, the robot is instructed neither to index the current page nor to follow links in the page for indexing.
Figure 3: sample of "ROBOTS" META tag.
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
In this example, the robot is instructed not to index the current page, but is allowed to follow links in the page for indexing. The structure of the META tag should be clear by now.
NOTE: All META tags should be specified within the <HEAD></HEAD> block of your HTML document.
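To see how a robot might read these tags, here is a minimal Python sketch built on the standard html.parser module. It collects the parameters from any ROBOTS META tag in a page. The class RobotsMetaParser, the function robots_directives, and the sample page are invented for this illustration, and for brevity the sketch does not check that the tag actually sits inside the HEAD block.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the parameters of every ROBOTS META tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").upper() == "ROBOTS":
            content = attributes.get("content") or ""
            # Split "NOINDEX, FOLLOW" into individual upper-case parameters.
            self.directives.update(
                part.strip().upper() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set[str]:
    """Return the set of ROBOTS parameters found in the given HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


page = '<HTML><HEAD><META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW"></HEAD><BODY></BODY></HTML>'
found = robots_directives(page)
print("NOINDEX" in found)   # True: the page should not be indexed
print("NOFOLLOW" in found)  # False: links may still be followed
```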
Additional Notes
For best results, you should use the robots.txt method if possible. Simply FTP the file into your ~/public_html/ directory.
The META tag method can be used if you cannot use the robots.txt method.
Both methods can be used together.
If privacy of your web pages is essential, password-protect your pages using an .htaccess file (shell access is needed for this). Robots can't enter a password-protected page without the password.