Posted by Miki on October 29th, 2008 in SEO, Web Design

A common question I get asked is just what the heck is a “robots.txt” file and what good is it? Well, in some ways it can be the most important file you have, but it’s often overlooked and disregarded.

It’s important in that making a mistake in your robots.txt file could easily torpedo the success of your site. One single typo, and poof, gone. Let me explain.

The robots.txt file is a manually created text file that needs to be placed in the website’s root directory (the top level). Visiting search engine robots/crawlers (like Googlebot and Yahoo’s Slurp) read it to find out where they may and may not search on your website.

For probably about 95% of all webmasters, the file is a simple two-liner that reads:

User-agent: *
Disallow:

And that’s it. The code tells all robots to search and index everything they can find on your site. The “User-agent:” line names the robot the rules apply to; in this case it’s a wildcard “*”, signifying that the directive applies to all robots. (If we wanted to command just Google’s crawler, we could write “Googlebot”.)

The second line, “Disallow:”, tells the robot which files it is restricted from accessing. In this case, since it’s blank, the robots can access every file they can find.
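If you want to check how a parser reads the two-liner, here is a minimal sketch using Python’s standard urllib.robotparser module. The helper name is_allowed and the sample paths are my own illustrations, and real crawlers may interpret edge cases differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The two-line "allow everything" file from the post.
ALLOW_ALL = """\
User-agent: *
Disallow:
"""

def is_allowed(robots_txt, agent, path):
    """Return True if `agent` may fetch `path` under the given rules.

    is_allowed is a hypothetical helper name, not part of any library.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)

# A blank Disallow restricts nothing, so every robot gets through.
print(is_allowed(ALLOW_ALL, "Googlebot", "/index.html"))   # True
print(is_allowed(ALLOW_ALL, "Slurp", "/any/page.html"))    # True
```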

Now, let’s say you had some private or personal information on a web page that you didn’t want indexed and showing up on a search engine. In that case we could easily alert the visiting bots through the robots.txt file to stay away from that page. If the page was named “personal.html”, your robots.txt file would look like this:

User-agent: *
Disallow: /personal.html

What if you wanted to keep the bots out of an entire directory? Easy. Just disallow it. If that directory was named “finances” your line would be:

Disallow: /finances/

But be careful: when you disallow a directory, you also prevent the bots from visiting all the files inside it and any sub-directories it has.
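To see how far a directory rule reaches, here is a small sketch that again leans on Python’s standard urllib.robotparser. The file and sub-directory names under “finances” are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Rules from the post: keep every robot out of /finances/.
rules = [
    "User-agent: *",
    "Disallow: /finances/",
]

parser = RobotFileParser()
parser.parse(rules)

# Hypothetical paths, used only to show what the rule covers.
checks = [
    "/finances/budget.html",       # file directly inside the directory
    "/finances/2008/taxes.html",   # file in a sub-directory
    "/index.html",                 # unrelated page
]

# Rules match by path prefix, so everything under /finances/ is blocked.
results = {path: parser.can_fetch("*", path) for path in checks}
for path, allowed in results.items():
    print(path, "allowed" if allowed else "blocked")
```

The first two paths come back blocked and the last one allowed, because the rule is matched against the start of each path.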

You can also add multiple disallow rules. For example, if you had three files you wanted to prevent from being crawled, you would write each file on its own line, like below:

User-agent: *
Disallow: /file1.html
Disallow: /file2.html
Disallow: /file3.html
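If the list of files gets long, you might prefer to generate the file rather than type it by hand. This is just a sketch; build_robots_txt is a made-up helper name, and it assumes every path you pass in already starts with “/”.

```python
def build_robots_txt(paths, agent="*"):
    """Build robots.txt text with one Disallow line per path.

    build_robots_txt is a hypothetical helper, not a library function.
    """
    lines = [f"User-agent: {agent}"]
    for path in paths:
        # Refuse paths that would not be valid rules as written.
        if not path.startswith("/"):
            raise ValueError(f"path must start with '/': {path!r}")
        lines.append(f"Disallow: {path}")
    return "\n".join(lines) + "\n"

robots_txt = build_robots_txt(["/file1.html", "/file2.html", "/file3.html"])
print(robots_txt, end="")
```

Rejecting paths without a leading slash is a design choice for the sketch: it catches a whole class of typos before the file ever reaches your server.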

Make sure you have the correct path to the file. Say you left off the file name, leaving just “/”. That would prevent the robot from indexing anything on your site!
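A quick sanity check before uploading can catch that kind of slip. The sketch below assumes Python’s standard urllib.robotparser; blocks_whole_site is a hypothetical helper, and it only tests whether the site root is blocked, which under simple prefix rules means the whole site is.

```python
from urllib.robotparser import RobotFileParser

def blocks_whole_site(robots_txt, agent="*"):
    """Return True if `agent` is not even allowed to fetch "/".

    With prefix-based Disallow rules, a blocked root means every
    path on the site is blocked too.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, "/")

intended = "User-agent: *\nDisallow: /personal.html\n"
typo = "User-agent: *\nDisallow: /\n"   # file name accidentally left off

print(blocks_whole_site(intended))  # False
print(blocks_whole_site(typo))      # True
```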

If, for example, you didn’t want Yahoo’s search engine crawler, Slurp, indexing your images directory, your file would be:

User-agent: Slurp
Disallow: /images/
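Here is a rough way to confirm that a rule like this affects only the robot it names, again assuming Python’s standard urllib.robotparser. The image file name is invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules from the post: only Yahoo's Slurp is kept out of /images/.
rules = [
    "User-agent: Slurp",
    "Disallow: /images/",
]

parser = RobotFileParser()
parser.parse(rules)

# "/images/logo.png" is a hypothetical file used for the check.
slurp_images = parser.can_fetch("Slurp", "/images/logo.png")
google_images = parser.can_fetch("Googlebot", "/images/logo.png")
slurp_home = parser.can_fetch("Slurp", "/index.html")

print("Slurp, images:", slurp_images)        # False
print("Googlebot, images:", google_images)   # True
print("Slurp, home page:", slurp_home)       # True
```

Since the file has no “User-agent: *” section, robots other than Slurp find no rules that apply to them and are free to crawl everything.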

Also, if a page you don’t want indexed is already indexed, changing the robots.txt file isn’t going to help immediately. In that case you are best off contacting the search engine directly to have the page manually removed.

There are plenty of other intricate rules which I won’t get into here, but those are the basics. Many websites don’t even have such a file, which is fine for the most part: when the file is not found, robots will search the entire site by default. But it’s a good idea to have one just in case.

For more on robots.txt usage you can visit its web page by clicking here.


Copyright © 2004-2010 First Serve Media, LLC. All rights reserved.