We have all heard of crawlers and bots that crawl our sites to scrape content for various reasons, such as indexing for search engines, identifying content, or harvesting email addresses. All kinds of crawlers and bots crawl websites. Some are good and should be allowed access to our site, but we might want to restrict others. In this post we will see how we can do this.
- What is robots.txt?
- When to use it?
- Do all crawlers follow robots.txt?
- How to use it?
- Allow everyone to access the site
- Allow only one crawler to access the site
- Disallow everyone from site
- Disallow access to specific directories
- Disallow access to specific bots
- Disallow different bots from different directories
- Delay crawl rate
- Specify the location of sitemap
- Robots Tag
What is robots.txt?
The robots.txt file is a simple text file that bots and crawlers read to find out how they should crawl the site. The bots that crawl websites are automated, and the well-behaved ones check for the robots.txt file before accessing the rest of the website. In it we can specify which crawlers are allowed to crawl the site, which directories should not be crawled, the crawl rate, etc.
When to use it?
The robots.txt file is required only when you want some content on your site excluded from the search engines. If you don’t want to exclude anything (i.e. you want everything included) then you don’t need a robots.txt file.
If you don’t have a robots.txt file, the server might return a 404 or Permission Denied when a crawler tries to access the file. This might cause issues, although it is not a big problem. Hence, it is always better to have a robots.txt file, whether it is blank or contains rules that allow access to everyone.
User-Agent: *
Disallow:
I would choose to have a robots.txt file with the above code, which allows all bots to access everything, rather than having an empty robots.txt file or none at all.
Do all crawlers follow robots.txt?
Most reputable crawlers, like those of Google, Bing, etc., follow the robots.txt file. However, many crawlers and bots simply choose to ignore it. No crawler is required to follow the robots.txt file, so content that you don’t want everyone to access is better protected with passwords.
How to use it?
The robots.txt file is a very simple text file which needs to be in the root folder of your domain. If you do not have access to the root folder of the domain, then you cannot use a robots.txt file to block access. In this case you can use the robots meta tag. Also, pages disallowed in the robots.txt file may still be indexed if they are linked from other places. Using the robots meta tag on the page prevents it from getting indexed, provided crawlers are allowed to fetch the page and see the tag.
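Because crawlers look for the file only at the root of the host, the robots.txt address can be derived from any page address. A minimal Python sketch of that rule follows; the helper name robots_url and the example.com address are illustrative and not part of any library.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; the path is always /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.com/blog/2020/post.html"))
```

A file placed anywhere else, for example in a subdirectory, would not be found this way.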
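You can have different rules for different crawlers. A crawler follows the group whose User-Agent line matches it most specifically, and falls back to the rules for all crawlers (User-Agent: *) only when no specific group matches it. The rules for all crawlers are not combined with the rules for a specific crawler, so any general rule that should also apply to a specific crawler has to be repeated in its group. Standards-compliant crawlers do not depend on the order of the groups, although listing the rules for all crawlers first does no harm.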
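How a parser picks the group for a crawler can be checked with Python's standard-library urllib.robotparser. In this sketch the bot names and paths are hypothetical; it shows that a crawler with its own group is judged by that group alone, while a crawler without one falls back to the * group. Other parsers may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one group for everyone, one for Googlebot.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /private/

User-Agent: Googlebot
Disallow: /restricted/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot has its own group, so only /restricted/ is off limits for it.
googlebot_private = parser.can_fetch("Googlebot", "/private/page.html")
googlebot_restricted = parser.can_fetch("Googlebot", "/restricted/page.html")
# A crawler without its own group is judged by the * group.
other_private = parser.can_fetch("OtherBot", "/private/page.html")

print(googlebot_private, googlebot_restricted, other_private)
```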
Allow everyone to access the site
To allow all crawlers to access all the pages and directories, we can have a blank robots.txt or use the following code in the file.
User-Agent: *
Disallow:
Allow only one crawler to access the site
To allow only one crawler to access the site and disallow all other crawlers, use the following code:
User-Agent: Googlebot
Disallow:

User-Agent: *
Disallow: /
This will allow only “Googlebot” and disallow all other bots.
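One way to sanity-check rules like these before publishing them is to feed them to Python's standard-library urllib.robotparser. This is a sketch only: the helper name allowed and the bot name SomeOtherBot are made up for illustration, and real crawlers may parse edge cases differently.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: Googlebot
Disallow:

User-Agent: *
Disallow: /
"""


def allowed(agent, path):
    """Return True if the rules above let the given agent fetch the path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


print("Googlebot:", allowed("Googlebot", "/index.html"))
print("SomeOtherBot:", allowed("SomeOtherBot", "/index.html"))
```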
Disallow everyone from site
To disallow all crawlers from the site use the following code:
User-Agent: *
Disallow: /
Note: If you do this then no compliant crawler will crawl your site, and this may result in the site not getting indexed in the search engines. Use this only if you really don’t want your content to be indexed anywhere.
Disallow access to specific directories
To disallow access to specific directories for all bots, use the following code:
User-Agent: *
Disallow: /disallow_access/
Disallow: /restricted/
The above code will instruct all the crawlers to not crawl the “disallow_access” and “restricted” directories on your domain.
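Disallow rules are matched against the start of the URL path, so the trailing slash matters. The following sketch uses Python's standard-library urllib.robotparser to show this; the bot name AnyBot and the file names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: *
Disallow: /disallow_access/
Disallow: /restricted/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Collect the verdict for a few sample paths.
results = {
    path: parser.can_fetch("AnyBot", path)
    for path in (
        "/restricted/report.html",   # inside a disallowed directory
        "/restricted-notes.html",    # similar name, but not in the directory
        "/blog/post.html",           # unrelated path
    )
}

for path, ok in results.items():
    print(path, "allowed" if ok else "blocked")
```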
Disallow access to specific bots
You might want to keep specific bots from accessing parts of your site.
User-Agent: Googlebot
Disallow: /restricted/
The above code will instruct “Googlebot” not to crawl the “restricted” directory on your domain. If you have only this code in your robots.txt file, then only “Googlebot” would be instructed not to crawl the “restricted” directory. All other crawlers are allowed access to that directory.
Disallow different bots from different directories
To have different rules for different crawlers use the following:
User-Agent: *
Disallow:

User-Agent: Googlebot
Disallow: /restricted/

User-Agent: BadBot
Disallow: /disallow_access/
This would disallow “Googlebot” from the “restricted” directory and “BadBot” from the “disallow_access” directory.
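The combined effect of several groups is easier to see as a table. The sketch below runs the three groups from this section through Python's standard-library urllib.robotparser; the helper access_table, the bot name OtherBot and the file names are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: *
Disallow:

User-Agent: Googlebot
Disallow: /restricted/

User-Agent: BadBot
Disallow: /disallow_access/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def access_table(agents, paths):
    """Map (agent, path) to True when the parser would allow the fetch."""
    return {(a, p): parser.can_fetch(a, p) for a in agents for p in paths}


table = access_table(
    ["Googlebot", "BadBot", "OtherBot"],
    ["/restricted/a.html", "/disallow_access/a.html"],
)
for (agent, path), ok in table.items():
    print(f"{agent:10} {path:26} {'allowed' if ok else 'blocked'}")
```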
Delay crawl rate
You can slow down the rate at which a crawler crawls the site. How the value is applied depends on the particular crawler and its default crawl rate. It is best not to use this value for the common, well-behaved bots, as they automatically determine the best crawl rate for your site.
User-Agent: *
Crawl-delay: 1
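The value for Crawl-delay should be a positive integer. If no value is specified, the crawler uses its default crawl rate. Crawl-delay is not part of the original robots.txt standard, so its interpretation varies between crawlers: some treat it as a relative setting, where 1 means crawl slowly, 5 very slowly and 10 extremely slowly, while others read it as the number of seconds to wait between requests, and Googlebot ignores it altogether. This value does not affect how frequently a site is crawled, only how fast the crawler works through the site while it is crawling.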
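From the crawler's side, Python's standard-library urllib.robotparser exposes the value through crawl_delay(). The sketch below reads it and, as an assumption that does not hold for every crawler, treats it as a number of seconds to wait; the helper name pause_between_requests and the bot name are hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-Agent: *
Crawl-delay: 1
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def pause_between_requests(agent):
    """Sleep for the advertised delay, read here as seconds (an assumption)."""
    delay = parser.crawl_delay(agent)  # None when no Crawl-delay applies
    if delay:
        time.sleep(delay)
    return delay


print(parser.crawl_delay("AnyBot"))
```

A crawler would call pause_between_requests() before each request to the same site.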
Specify the location of sitemap
You can specify the location of your sitemap in the robots.txt file by adding a Sitemap line that gives the full URL of the sitemap.
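As a sketch of how a crawler could pick this up, Python's standard-library urllib.robotparser (version 3.8 or later) returns the listed sitemaps through site_maps(). The sitemap URL below is a made-up example.com address, not one taken from this post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with a Sitemap line.
ROBOTS_TXT = """\
User-Agent: *
Disallow:
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when the file lists no sitemap.
sitemaps = parser.site_maps() or []
print(sitemaps)
```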
Robots Tag
The robots meta tag can be used to tell robots not to index the content of a page. It can also be used to allow or disallow the crawler to follow the links on the page.
The syntax is:
<html>
<head>
<title>...</title>
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
</head>
To prevent the page from being indexed in the search engines while still allowing the crawler to follow the links present on the page, use the following tag on your page.
<meta name="robots" content="noindex, follow">
Note: Not all robots follow the tag; they can choose to ignore it.
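To illustrate how a cooperating robot might read the tag, here is a minimal sketch built on Python's standard-library html.parser. The class RobotsMetaParser and the function robots_directives are hypothetical names, and the sketch only collects the directives; deciding what to do with them is left to the crawler.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lower-cased; values keep their case.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for token in attributes.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
directives = robots_directives(page)
print("index this page:", "noindex" not in directives)
print("follow its links:", "nofollow" not in directives)
```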