Robots.txt is a text file used to instruct web robots (typically search engine robots) how to crawl pages on a website. The robots.txt file is part of the robots exclusion protocol (REP), a group of web standards that regulate how robots crawl the web, access and index content, and serve that content to users. The REP also includes directives like meta robots, as well as page-, subdirectory-, or site-wide instructions for how search engines should treat links (such as “follow” or “nofollow”).
In practice, robots.txt files indicate whether certain user agents (web-crawling software) can or cannot crawl parts of a website. These crawl instructions are specified by “disallowing” or “allowing” the behavior of certain (or all) user agents.
User-agent: [user-agent name]
Disallow: [URL string not to be crawled]
Together, these two lines are considered a complete robots.txt file — though one robots file can contain multiple lines of user agents and directives (i.e., disallows, allows, crawl-delays, etc.).
Within a robots.txt file, each set of user-agent directives appears as a discrete set, separated by a line break.
Example robots.txt:
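As a rough illustration of how discrete sets of directives behave, the sketch below parses a two-group file with Python's standard `urllib.robotparser` module. The file contents, the `/example-subfolder/` and `/private/` paths, and the `SomeOtherBot` name are assumptions made up for this example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: two groups separated by a blank line.
GROUPED_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /example-subfolder/

User-agent: *
Disallow: /private/
"""

group_parser = RobotFileParser()
group_parser.parse(GROUPED_ROBOTS_TXT.splitlines())

site = "https://www.example.com"
# Googlebot matches its own group, so only that group's rules apply to it.
googlebot_subfolder = group_parser.can_fetch("Googlebot", site + "/example-subfolder/page.html")
googlebot_private = group_parser.can_fetch("Googlebot", site + "/private/page.html")
# Any crawler without a group of its own falls back to the "*" group.
other_private = group_parser.can_fetch("SomeOtherBot", site + "/private/page.html")

print(googlebot_subfolder, googlebot_private, other_private)
```

In this parser, a crawler that has a group of its own follows only that group, which is why Googlebot is not bound by the `/private/` rule here.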
Here are a few examples of robots.txt in action for a www.example.com site:
Robots.txt file URL: www.example.com/robots.txt
Blocking all web crawlers from all content
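A minimal sketch of this case, assuming the usual two-line form (`User-agent: *` followed by `Disallow: /`) and checking it with Python's standard `urllib.robotparser`. The variable names are only for illustration.

```python
from urllib.robotparser import RobotFileParser

block_all = RobotFileParser()
block_all.parse([
    "User-agent: *",  # the group applies to every crawler
    "Disallow: /",    # "/" is a prefix of every path on the site
])

# Ask whether a crawler may fetch the homepage under these rules.
homepage_allowed = block_all.can_fetch("Googlebot", "https://www.example.com/")
print(homepage_allowed)
```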
Using this syntax in a robots.txt file would tell all web crawlers not to crawl any pages on www.example.com, including the homepage.
Allowing all web crawlers access to all content
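One common way to express this, sketched here as an assumption rather than the only valid form, is a `Disallow` line with an empty value. The check below uses Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

allow_all = RobotFileParser()
allow_all.parse([
    "User-agent: *",  # the group applies to every crawler
    "Disallow:",      # an empty value disallows nothing
])

homepage_ok = allow_all.can_fetch("Googlebot", "https://www.example.com/")
deep_page_ok = allow_all.can_fetch("AnyBot", "https://www.example.com/blog/post.html")
print(homepage_ok, deep_page_ok)
```

Because the empty `Disallow` value blocks nothing, every page on www.example.com, including the homepage, stays crawlable.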
Blocking a specific web crawler from a specific folder
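The sketch below shows what such a rule could look like. The crawler (Googlebot) and the folder name `/example-subfolder/` are chosen purely for illustration; the rule is checked with Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

folder_rule = RobotFileParser()
folder_rule.parse([
    "User-agent: Googlebot",          # only this crawler is addressed
    "Disallow: /example-subfolder/",  # hypothetical folder to keep it out of
])

base = "https://www.example.com"
# Blocked: the URL path starts with the disallowed folder.
googlebot_in_folder = folder_rule.can_fetch("Googlebot", base + "/example-subfolder/page.html")
# Allowed: the same crawler outside the folder.
googlebot_elsewhere = folder_rule.can_fetch("Googlebot", base + "/other-folder/page.html")
# Allowed: a different crawler has no group, so nothing restricts it.
bingbot_in_folder = folder_rule.can_fetch("Bingbot", base + "/example-subfolder/page.html")
print(googlebot_in_folder, googlebot_elsewhere, bingbot_in_folder)
```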
Blocking a specific web crawler from a specific web page
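A single page can be targeted by disallowing its full path. In this sketch the crawler (Bingbot) and the path `/example-subfolder/blocked-page.html` are hypothetical choices for the example, checked with Python's standard `urllib.robotparser`.

```python
from urllib.robotparser import RobotFileParser

page_rule = RobotFileParser()
page_rule.parse([
    "User-agent: Bingbot",                            # only this crawler is addressed
    "Disallow: /example-subfolder/blocked-page.html", # hypothetical page path
])

root = "https://www.example.com"
blocked_page = page_rule.can_fetch("Bingbot", root + "/example-subfolder/blocked-page.html")
sibling_page = page_rule.can_fetch("Bingbot", root + "/example-subfolder/other-page.html")
print(blocked_page, sibling_page)
```

Only the named page is off limits to that crawler; other pages in the same folder remain crawlable.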
Where does robots.txt go on a site?
Whenever they come to a site, search engines and other web-crawling robots (like Facebook’s crawler, Facebot) know to look for a robots.txt file. But, they’ll only look for that file in one specific place: the main directory (typically your root domain or homepage). If a user agent visits www.example.com/robots.txt and does not find a robots file there, it will assume the site does not have one and proceed with crawling everything on the page (and maybe even on the entire site). Even if the robots.txt page did exist at, say, example.com/index/robots.txt or www.example.com/homepage/robots.txt, it would not be discovered by user agents and thus the site would be treated as if it had no robots file at all.
In order to ensure your robots.txt file is found, always include it in your main directory or root domain.
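To make the "one specific place" rule concrete, here is a small sketch that derives the robots.txt location a crawler would check from any page URL on the same host. The helper name `robots_txt_url` is made up for this example.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url):
    """Return the root-level robots.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.example.com/homepage/robots.txt"))
print(robots_txt_url("https://www.example.com/index/page.html?ref=nav"))
```

Both calls resolve to the same root-level address, which is why a file stored under `/homepage/` or `/index/` is never consulted.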
Why do you need robots.txt?
Robots.txt files control crawler access to certain areas of your site. While this can be very dangerous if you accidentally disallow Googlebot from crawling your entire site (!!), there are some situations in which a robots.txt file can be very handy.
Some common use cases include:
- Preventing duplicate content from appearing in SERPs (note that meta robots is often a better choice for this)
- Keeping entire sections of a website private (for instance, your engineering team’s staging site)
- Keeping internal search results pages from showing up on a public SERP
- Specifying the location of sitemap(s)
- Preventing search engines from indexing certain files on your website (images, PDFs, etc.)
- Specifying a crawl delay in order to prevent your servers from being overloaded when crawlers load multiple pieces of content at once
If there are no areas on your site to which you want to control user-agent access, you may not need a robots.txt file at all.
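Several of the use cases listed above can sit in one file. The sketch below assumes a hypothetical robots.txt that hides internal search results, declares a sitemap, and requests a crawl delay, then reads those values back with Python's standard `urllib.robotparser` (`site_maps()` needs Python 3.8 or later). The paths, the delay value, and the sitemap URL are illustrative.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file combining three of the use cases.
USE_CASE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /search/

Sitemap: https://www.example.com/sitemap.xml
"""

use_cases = RobotFileParser()
use_cases.parse(USE_CASE_ROBOTS_TXT.splitlines())

# Internal search results are kept out of the crawl.
search_allowed = use_cases.can_fetch("AnyBot", "https://www.example.com/search/results")
# Requested delay, in seconds, for crawlers that honor the directive.
delay_seconds = use_cases.crawl_delay("AnyBot")
# Sitemap URLs declared in the file.
sitemaps = use_cases.site_maps()

print(search_allowed, delay_seconds, sitemaps)
```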
Checking if you have a robots.txt file
Not sure if you have a robots.txt file? Simply type in your root domain, then add /robots.txt to the end of the URL. For instance, Moz’s robots file is located at moz.com/robots.txt.
If no .txt page appears, you do not currently have a (live) robots.txt page.
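The same check can be scripted. This is a minimal sketch, assuming that an HTTP 200 response means the file is live and a 404 means it is missing; the function name `has_robots_txt` and its `opener` parameter are inventions for this example, with the parameter there only so the request can be swapped out during testing.

```python
import urllib.error
import urllib.request


def has_robots_txt(root_url, opener=urllib.request.urlopen):
    """Return True if root_url + /robots.txt answers with HTTP 200."""
    url = root_url.rstrip("/") + "/robots.txt"
    try:
        with opener(url, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # No file at the root: treat the site as having no robots.txt.
            return False
        # Other HTTP errors (for example 5xx) are inconclusive, so re-raise.
        raise
```

Calling `has_robots_txt("https://moz.com")` would perform the lookup described above for Moz's robots file.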
How to create a robots.txt file
If you found you didn’t have a robots.txt file or want to alter yours, creating one is a simple process. This article from Google walks through the robots.txt file creation process, and this tool allows you to test whether your file is set up correctly.
Looking for some practice creating robots files? This blog post walks through some interactive examples.