Harnessing the Power of Robots.txt

Sometimes we may want search engines not to list certain parts of a site, or even to ban some search engines from the site altogether.

This is where a simple, little two-line text file called robots.txt comes in.

Once we have a website up and running, we need to ensure that visiting search engines can access all the pages we want them to look at.

Robots.txt lives in your website's main directory (on Linux systems this can be your /public_html/ directory) and looks something like the following:

User-agent: *
Disallow:

The first line names the bot that will be visiting your site. The second line controls whether it is allowed in, or which areas of the site it is not allowed to visit.
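
The second line can also block just one area of a site rather than all of it. As a minimal sketch, assuming a hypothetical /private/ directory and a placeholder www.example.com address (neither comes from this article), Python's standard urllib.robotparser module can show how such a rule would be read:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: every bot may crawl everything except /private/.
PARTIAL_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]


def can_crawl(lines, agent, url):
    """Return True if `agent` may fetch `url` under the given robots.txt lines."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser.can_fetch(agent, url)


# "anybot" is a made-up robot name used only for illustration.
print(can_crawl(PARTIAL_RULES, "anybot", "http://www.example.com/index.html"))
print(can_crawl(PARTIAL_RULES, "anybot", "http://www.example.com/private/notes.html"))
```

The first check should come back allowed and the second blocked, since only URLs under /private/ match the Disallow line.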

If you want to handle multiple bots, simply repeat the lines above for each one.

Here is an example:

User-agent: googlebot
Disallow:

User-agent: askjeeves
Disallow: /

This allows Google (user-agent name GoogleBot) to see and index every page, while at the same time banning Ask Jeeves from the site completely.
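
One way to sanity-check rules like these before uploading them is to feed them to a parser. The sketch below uses Python's standard urllib.robotparser module and assumes the googlebot record carries an empty Disallow line (meaning nothing is blocked); the www.example.com address and the "somebot" name are placeholders, not part of the example above.

```python
from urllib.robotparser import RobotFileParser

# The two-bot example, with an empty Disallow assumed for googlebot.
TWO_BOT_RULES = """\
User-agent: googlebot
Disallow:

User-agent: askjeeves
Disallow: /
"""


def is_allowed(rules_text, agent, url):
    """Parse robots.txt text and report whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(agent, url)


PAGE = "http://www.example.com/page.html"  # placeholder URL
for bot in ("googlebot", "askjeeves", "somebot"):
    print(bot, is_allowed(TWO_BOT_RULES, bot, PAGE))
```

Under this parser, a bot that is not named in the file (here "somebot") is treated as allowed, because no record applies to it.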

To find a fairly up-to-date list of individual robot names, visit http://www.robotstxt.org/wc/active/html/index.html

Even if you want to let every robot index every page of your site, it is still very advisable to put a robots.txt file on your site. It will stop your error logs filling up with entries from search engines trying to access a robots.txt file that doesn't exist.
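
If you just need an allow-everything file in place, it can be created by hand in any text editor. As a rough sketch of the same step in Python, the hypothetical helper below (write_allow_all is a name invented for this example) writes the two-line file into a given directory without overwriting an existing one; the demo uses a temporary directory rather than a real site root.

```python
import tempfile
from pathlib import Path

# Two-line file that lets every robot index every page.
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def write_allow_all(site_root):
    """Create robots.txt in site_root unless one already exists; return its path."""
    target = Path(site_root) / "robots.txt"
    if not target.exists():
        target.write_text(ALLOW_ALL, encoding="ascii")
    return target


# Demo against a throwaway directory standing in for the site's main directory.
with tempfile.TemporaryDirectory() as demo_root:
    created = write_allow_all(demo_root)
    print(created.read_text(encoding="ascii"))
```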

For more information on robots.txt, see the full set of resources at http://www.websitesecrets101.com/robotstxt-further-reading-resources.