CYBERNET HOME ABOUT CYBERNET CLIENT LOGIN CONTACT US

A crawler, robot (or "bot"), spider or wanderer is a computer program that searches the Internet and records information about web pages. For example, a "spam bot" will search the pages in web sites and record the e-mail addresses linked in home pages.

 

Using robots.txt
Although you can't keep all robots away, you can give instructions to those that follow the "Robots Exclusion Protocol". This method requires that you create a plain text file called robots.txt and place it in your site's root directory. For example, if your web site URL is http://www.mycompanydomain.com/, then the file must be accessible at http://www.mycompanydomain.com/robots.txt for its restrictions to take effect. Note that the file must be in your root directory (i.e. ~/public_html/) and no other; robots do not look for it in subdirectories.
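As a rough illustration of the location rule above, the following Python sketch derives the address a robot would request from any page address on the same site. The function name robots_txt_url and the /products/index.html path are hypothetical and used only for illustration.

```python
from urllib.parse import urljoin


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for a page on the same site."""
    # The leading slash makes urljoin discard the page's own path,
    # so the result always points at the site's root directory.
    return urljoin(page_url, "/robots.txt")


print(robots_txt_url("http://www.mycompanydomain.com/"))
print(robots_txt_url("http://www.mycompanydomain.com/products/index.html"))
# Both calls print http://www.mycompanydomain.com/robots.txt
```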

The contents of robots.txt consist mainly of two commands: "User-agent" and "Disallow".

The "User-agent" command allows you to set restrictions on a robot with a particular name (or signature). You can set this to the asterisk (*) to specify that restrictions apply to all robots that aren't identified elsewhere in the file.

The "Disallow" command allows you to deny access to certain directories in your web site.
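To show how the two commands fit together, here is a minimal Python sketch that renders a set of restrictions as robots.txt text: one "User-agent" line per robot, followed by its "Disallow" lines, with a blank line between robots. The helper build_robots_txt, the robot name ExampleBot and the /drafts/ directory are assumptions made for this illustration, not part of any standard tool.

```python
def build_robots_txt(rules):
    """Render {robot name: [disallowed paths]} as robots.txt text."""
    records = []
    for user_agent, paths in rules.items():
        lines = ["User-agent: " + user_agent]
        lines += ["Disallow: " + path for path in paths]
        records.append("\n".join(lines))
    # Each robot's block is separated from the next by a blank line.
    return "\n\n".join(records) + "\n"


# Hypothetical rules: keep every robot out of /drafts/,
# and keep ExampleBot out of the whole site.
print(build_robots_txt({"*": ["/drafts/"], "ExampleBot": ["/"]}))
```

The printed text can be saved as robots.txt and uploaded to the root directory.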

 

Example of robots.txt
Say your URL is http://www.mysite.com and you put a robots.txt file containing the following into your ~/public_html/ directory:

 

Figure 1: sample robots.txt file.
User-agent: *
Disallow: /neat_stuff/
Disallow: /my_pvt_stuff/

User-agent: WebCrawler
Disallow: /


In the example, the WebCrawler robot is denied all access to the site. Every other robot has free access to the web site except for files contained in the /neat_stuff/ and /my_pvt_stuff/ directories.
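One way to check how a compliant robot would read Figure 1 is to feed the same rules to the robots.txt parser in Python's standard library. This is only a sketch: the robot name SomeOtherBot and the page names are made up for the test, and real robots may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The rules from Figure 1.
ROBOTS_TXT = """\
User-agent: *
Disallow: /neat_stuff/
Disallow: /my_pvt_stuff/

User-agent: WebCrawler
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(robot_name, path):
    """Return True if the named robot may fetch the given path."""
    return parser.can_fetch(robot_name, "http://www.mysite.com" + path)


print(allowed("SomeOtherBot", "/index.html"))            # True
print(allowed("SomeOtherBot", "/neat_stuff/page.html"))  # False
print(allowed("WebCrawler", "/index.html"))              # False
```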

 

NOTE: Since not all robot authors acknowledge the "Robots Exclusion Protocol", it is not possible to stop all robots using this method. However, most search engine robots do follow this protocol. Please refer to documentation on their sites for more information on this.

 

Using the "ROBOTS" META tag
Using the META tag method, you can specify restrictions for each web page individually. Unfortunately, this method is less widely recognized by robots than the robots.txt method. The content of the META tag takes two parameters: INDEX or NOINDEX, and FOLLOW or NOFOLLOW. See the following examples:

 

Figure 2: sample of "ROBOTS" META tag.
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

In this example, the robot is instructed to neither index the current page, nor to follow links in the page for indexing.

 

Figure 3: sample of "ROBOTS" META tag.
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">

In this example, the robot is instructed not to index the current page, but it is allowed to follow links in the page for indexing. The structure of the META tag should be clear by now.

 

NOTE: All META tags should be specified within the <HEAD></HEAD> block of your HTML document.
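To confirm that a page carries the tag where a robot would expect it, a short Python sketch such as the following can read the "ROBOTS" META tag back out of a page. The class RobotsMetaParser and the function robots_directives are hypothetical names, and ignoring tags outside the <HEAD></HEAD> block is an assumption of this sketch rather than a description of how any particular robot behaves.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the parameters of ROBOTS META tags found inside <HEAD>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            if (attributes.get("name") or "").lower() == "robots":
                content = attributes.get("content") or ""
                for part in content.split(","):
                    if part.strip():
                        self.directives.add(part.strip().upper())

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def robots_directives(html):
    """Return the set of ROBOTS parameters declared in the page's head."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


PAGE = """<HTML>
<HEAD>
<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">
</HEAD>
<BODY>Page text</BODY>
</HTML>"""

print(sorted(robots_directives(PAGE)))  # ['FOLLOW', 'NOINDEX']
```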

 

Additional Notes
For best results, you should use the robots.txt method if possible. Simply FTP the file into your ~/public_html/ directory.

The META tag method can be used if you cannot use the robots.txt method.

Both methods can be used together.

If privacy of your web pages is essential, password-protect your pages using an .htaccess file (shell access is needed for this). Robots can't enter a password-protected page without the password.

 

 

 

Cybernet is a Toronto-based Canadian Company specializing in Web hosting, E-Mail, High Speed ADSL, Dial-up Access and has been serving the North American Market since 1997.
© 2004 Cybernet Communications Inc.

 
