This post was most recently updated on December 28th, 2016
What is Robots Exclusion Protocol?
The Robots Exclusion Protocol is a convention in which directives are written to restrict or channel web crawlers' access to parts of a website. Part of a website may be publicly visible while the rest is kept private from all or some web crawlers.
The standard was proposed by Martijn Koster.
The robots.txt file needs to be placed in the root directory of your site (e.g. https://example.com/robots.txt).
Directives
- User-agent: specifies the robot/web crawler that the rules which follow apply to. A “*” applies the rules to all robots. A list of user agents for specific robots can be found here.
- Allow/Disallow: specify the URL paths of a website that crawlers are permitted or forbidden to access, respectively.
- Crawl-delay: supported by several major crawlers; sets the number of seconds to wait between successive requests to the same server.
- Sitemap: specifies the URL of the site's sitemap.xml and is recognized by some crawlers.
- Host: supported by some crawlers; allows websites with multiple mirrors to specify their preferred domain. A combined example of these directives follows this list.
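Putting these directives together, a minimal robots.txt might look like the following sketch (the paths, crawl delay, sitemap URL, and domain are illustrative placeholders):

User-agent: *
Disallow: /private/
Allow: /private/public-report.html
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
Host: example.com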
The Robots Exclusion Protocol can also be applied using robots meta tags and the X-Robots-Tag HTTP response header.
A “noindex” meta tag:
<meta name="robots" content="noindex" />
A “noindex” HTTP response header:
X-Robots-Tag: noindex
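Unlike the meta tag, the X-Robots-Tag header has to be added by the web server or application that serves the page. As a minimal sketch, assuming Python's built-in http.server (the handler name, page body, and port are illustrative), a server could attach the header like this:

from http.server import BaseHTTPRequestHandler, HTTPServer

class NoIndexHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve a simple page and ask crawlers not to index it
        # by sending the X-Robots-Tag response header.
        body = b"<html><body>Not for search indexes</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("X-Robots-Tag", "noindex")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Any HTTP server or reverse proxy could set the same header instead.
    HTTPServer(("localhost", 8000), NoIndexHandler).serve_forever()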
The X-Robots-Tag only takes effect once the page is requested, and the meta tag only once the page has been fetched and parsed; robots.txt, on the other hand, takes effect before the page is requested at all. If a crawler is blocked from a URL by robots.txt, it never fetches the page, so any X-Robots-Tag header or meta tags on that page are never seen.
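To illustrate this ordering, a well-behaved crawler consults robots.txt before issuing any request for a page. A minimal sketch using Python's standard urllib.robotparser (the site and path are placeholders):

import urllib.robotparser

# Parse the site's robots.txt before crawling anything.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

url = "https://example.com/private/page.html"
if rp.can_fetch("*", url):
    # Only now would the crawler request the page and see any
    # X-Robots-Tag header or robots meta tag it contains.
    print("Allowed to fetch", url)
else:
    # Disallowed by robots.txt: the page is never requested,
    # so its meta tag and headers are never evaluated.
    print("Blocked by robots.txt:", url)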