What is robots.txt?
When a search engine crawler visits your site, it first looks for a special file called robots.txt. This file tells the search engine spider which Web pages of your site should be indexed and which should be ignored.
The robots.txt file is a plain text file (no HTML) that must be placed in the root directory of your site, for example:
http://www.yourwebsite.com/robots.txt
How do I create a robots.txt file?
Open a plain text editor (such as Notepad) to create the file. The content of a robots.txt file consists of so-called “records”.
A record contains the instructions for a specific search engine. Each record consists of two parts: a User-agent line and one or more Disallow lines. For example:
User-agent: googlebot
Disallow: /cgi-bin
This robots.txt file would allow the “googlebot”, which is the search engine spider of Google, to retrieve every page from your site except for files from the “cgi-bin” directory. All files in the “cgi-bin” directory will be ignored by googlebot.
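The effect of this record can be checked with a short script. The sketch below uses Python’s standard urllib.robotparser module. The page URLs and the “otherbot” name are made up for illustration, and real spiders may handle edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The record from the example above.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /cgi-bin
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
SITE = "http://www.yourwebsite.com"  # placeholder host used in this article

# Hypothetical page URLs and spider names, for illustration only.
print(parser.can_fetch("googlebot", SITE + "/index.html"))
print(parser.can_fetch("googlebot", SITE + "/cgi-bin/search.pl"))
print(parser.can_fetch("otherbot", SITE + "/cgi-bin/search.pl"))
```

Only the second check is refused. The third one is allowed because the record names googlebot only, so it says nothing about other spiders.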
The Disallow command matches by prefix, so it works like a trailing wildcard: every path that begins with the given value is excluded. If you enter
User-agent: googlebot
Disallow: /support
both “/support.html” and “/support/index.html” as well as all other files in the “support” directory would not be indexed by googlebot.
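The matching rule can be modeled in a few lines. This is a simplified sketch of the prefix matching only: it ignores URL encoding and any pattern extensions that individual spiders may support. The function name is_disallowed and the sample paths are made up for illustration.

```python
from typing import Iterable


def is_disallowed(path: str, disallow_values: Iterable[str]) -> bool:
    """Return True if `path` begins with any non-empty Disallow value."""
    return any(bool(value) and path.startswith(value) for value in disallow_values)


rules = ["/support"]

# Hypothetical paths, for illustration only.
for path in ["/support.html", "/support/index.html",
             "/supporters/list.html", "/products.html"]:
    print(path, "blocked" if is_disallowed(path, rules) else "allowed")
```

Note that “/supporters/list.html” is blocked as well, because it also begins with “/support”. Writing the value as “/support/” (with a trailing slash) limits the match to the directory itself.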
If you leave the Disallow line blank, you’re telling the search engine that all files may be indexed. In any case, you must enter a Disallow line for every User-agent record.
If you want to give all search engine spiders the same rights, use the following robots.txt content:
User-agent: *
Disallow: /cgi-bin
Where can I find user agent names?
You can find user agent names in your log files by checking for requests to robots.txt. Most often, all search engine spiders should be given the same rights. In that case, use “User-agent: *” as mentioned above.
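The log search can be automated. The sketch below assumes your server writes the common “combined” access log format, where the user agent is the last quoted field; other log formats need a different pattern. The sample log lines, including the “ExampleBot” name, are invented for illustration.

```python
import re
from collections import Counter

# Assumption: combined log format, i.e.
# ... "METHOD path protocol" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def robots_txt_agents(log_lines):
    """Count the user agents that requested /robots.txt."""
    counts = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        # Ignore any query string when comparing the requested path.
        if match and match.group("path").split("?")[0] == "/robots.txt":
            counts[match.group("agent")] += 1
    return counts


# Invented sample lines; in practice, pass an open log file instead.
SAMPLE_LOG = [
    '192.0.2.1 - - [01/Jan/2024:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0 (+http://example.com/bot)"',
    '192.0.2.2 - - [01/Jan/2024:00:00:05 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '192.0.2.1 - - [01/Jan/2024:06:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 68 "-" "ExampleBot/1.0 (+http://example.com/bot)"',
]

for agent, hits in robots_txt_agents(SAMPLE_LOG).most_common():
    print(hits, agent)
```

The name to put in a User-agent line is usually the short product token at the start of the string (here “ExampleBot”), not the full string.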
Don’ts
If you don’t format your robots.txt file properly, some or all files of your Web site might not get indexed by search engines. To avoid this, observe the following rules:
- Don’t use comments in the robots.txt file. Although comments are allowed in a robots.txt file, they might confuse some search engine spiders.
“Disallow: /support # Don’t index the support directory” might be misinterpreted as “Disallow: /support#Don’t index the support directory”.
- Don’t use white space at the beginning of a line. For example, don’t write
  User-agent: *
  Disallow: /support
but
User-agent: *
Disallow: /support
- Don’t change the order of the commands. If you want your robots.txt file to work, don’t mix up the lines. Don’t write
Disallow: /support
User-agent: *
but
User-agent: *
Disallow: /support
- Don’t use more than one directory in a Disallow line. Do not use the following:
User-agent: *
Disallow: /support /cgi-bin /images/
Search engine spiders cannot understand that format. The correct syntax for this is
User-agent: *
Disallow: /support
Disallow: /cgi-bin
Disallow: /images
- Be sure to use the right case. Paths in robots.txt are matched case-sensitively, and file names on many servers are case sensitive as well. If the name of your directory is “Support”, don’t write “support” in the robots.txt file.
- Don’t list all files. If you want a search engine spider to ignore all files in a specific directory, you don’t have to list each file. For example:
User-agent: *
Disallow: /support/orders.html
Disallow: /support/technical.html
Disallow: /support/helpdesk.html
Disallow: /support/index.html
You can replace this with
User-agent: *
Disallow: /support
- Don’t rely on an “Allow” command. The original robots.txt standard has no “Allow” command, so some spiders may not understand it, although major crawlers such as Googlebot do support it. The safest approach is to mention only the files and directories that you don’t want to be indexed. All other files will be indexed automatically if they are linked on your site.
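Several of the points in this list can be checked mechanically. The following is a rough sketch of such a check, not a complete validator: the function name lint_robots_txt, the messages, and the sample file are invented for illustration, and the sketch assumes that a blank line ends a record. It cannot check upper and lower case, because that depends on the names on your server.

```python
def lint_robots_txt(text):
    """Return (line_number, message) pairs for the pitfalls listed above."""
    problems = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            seen_user_agent = False  # assumption: a blank line ends the record
            continue
        if raw[0] in " \t":
            problems.append((number, "leading white space"))
        line = raw.strip()
        if "#" in line:
            problems.append((number, "comment"))
            line = line.split("#", 1)[0].strip()
            if not line:
                continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            seen_user_agent = True
        elif field == "disallow":
            if not seen_user_agent:
                problems.append((number, "Disallow before User-agent"))
            if len(value.split()) > 1:
                problems.append((number, "more than one path in a Disallow line"))
        elif field == "allow":
            problems.append((number, "Allow is not understood by every spider"))
    return problems


# Invented sample file that breaks several of the rules.
SAMPLE = """\
Disallow: /support
User-agent: *
 Disallow: /cgi-bin /images
Disallow: /private # keep out
"""

for number, message in lint_robots_txt(SAMPLE):
    print(f"line {number}: {message}")
```

Running the check before uploading a new robots.txt file is a cheap way to catch formatting slips that would otherwise only show up as missing pages in search results.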
Tips and tricks:
1. How to allow all search engine spiders to index all files
- Use the following content for your robots.txt file if you want to allow all search engine spiders to index all files of your Web site:
User-agent: *
Disallow:
2. How to prevent all spiders from indexing any file
- If you don’t want search engines to index any file of your Web site, use the following:
User-agent: *
Disallow: /
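Because these two files differ by a single character, it is worth confirming which one you have actually deployed. The sketch below feeds both variants to Python’s standard urllib.robotparser module. The helper name may_fetch, the spider name “anybot” and the page URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # tip 1: empty Disallow value
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # tip 2: Disallow the root


def may_fetch(robots_txt, agent, url):
    """Parse robots.txt content from a string and test one URL against it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


URL = "http://www.yourwebsite.com/any/page.html"  # hypothetical page

print("tip 1 allows the page:", may_fetch(ALLOW_ALL, "anybot", URL))
print("tip 2 allows the page:", may_fetch(BLOCK_ALL, "anybot", URL))
```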