Robots.Txt – Functions, Syntax, Best Practices, & More
The initial version of robots.txt was created in 1994 by Martijn Koster after his website was overwhelmed by crawlers. Although it was never a formal internet standard at the time, search engines respected it and website owners used it to manage the load crawlers placed on their servers. Since then, it has evolved into one of the essential components of SEO practice. In this blog, we discuss what robots.txt is, its functions, its syntax, its advantages and disadvantages, and the best practices related to it.
Robots.txt Overview
Robots.txt is a text file containing instructions that tell search engine robots which pages they may and may not crawl. These instructions "allow" or "disallow" crawling for some or all bots. Simply put, it tells search engine crawlers which URLs on a website they can access. To enhance your knowledge of SEO tools and practices, consider taking an online digital marketing course.
Just like other files on the website, the robots.txt file is hosted on the server. To find it, type the full URL of the website with "/robots.txt" appended at the end. Remember, this file must always be placed at the root of the domain.
Example:
www.sample.com/robots.txt
One can create a robots.txt file by following the steps given below.
- Open a new .txt document in a text editor and name the document "robots.txt".
- Now add the directives in this file. There can be one or more groups of directives each consisting of multiple lines of instructions.
- Each group will have information such as the user agent to which the group applies, the directories or files that agent can or cannot access, and optionally a sitemap so search engines know which pages and files are important.
- Save the file to the computer.
- Upload this file to your site so it is available for search engines to crawl (a sample finished file is shown after these steps). How you upload it depends on your site's file structure and web hosting.
- The next step is to test your file. To check whether it is publicly accessible, open a private browser window and navigate to the robots.txt URL.
- If the content you added is visible, test the markup. You can do this with the robots.txt Tester in Google Search Console.
- The tester will identify syntax errors, which you can fix directly on the page and then retest.
- Note that changes made on the tester page are not saved. To apply them, copy the edited text and paste it into the robots.txt file on your website.
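For reference, a simple finished file might look like the sketch below before you upload it. The blocked directory, the allowed file, and the sitemap URL are placeholder values used only for illustration.
Example:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Sitemap: https://www.sample.com/sitemap.xml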
Functions of Robots.txt
Here are the functions of this file.
- Blocking Duplicate and Non-public Pages- robots.txt allows you to block duplicate and non-public pages such as staging sites, login pages, and internal search results pages. These pages are not meant to appear in search engine results pages (SERPs), and blocking them leaves room for the pages that matter to appear in search results.
- Crawl Budget Optimization- Crawl budget is the number of pages a search engine will crawl on a website in a given time frame. It depends on the size of the website, its health, and the number of backlinks pointing to it. If a website has more pages than its crawl budget covers, some pages remain unindexed and cannot rank in search results. Blocking unnecessary pages saves crawl budget for the pages that do need to be crawled.
- Hiding Resources- It also keeps crawlers away from resources such as PDFs and media files (images and videos), so the search engine can focus on more important content. A sample block is shown after this list.
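As an illustration, a block like the following (the paths are hypothetical) would keep all crawlers away from internal search results, a login page, and PDF files:
User-agent: *
Disallow: /search/
Disallow: /login/
Disallow: /*.pdf$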
Syntax of Robots.txt
The syntax consists of the following components:
- One or more blocks of directives. These are the rules for the search engine bots.
- Each block specifies a user agent, which identifies the search engine bot the rules apply to.
- Finally, there is an “allow” or “disallow” instruction.
Here is a sample block to look at.
User-agent: Googlebot
Disallow: /not-for-google
User-agent: DuckDuckBot
Disallow: /not-for-duckduckgo
Sitemap: https://www.website.com/sitemap.xml
Various directives used in the syntax are as follows:
i. User-Agent Directive
It is the first line of every block of directives and identifies the crawler the block applies to. Most search engines operate multiple crawlers, for regular indexing, images, videos, and so on, and each bot follows the most specific block of directives that matches it.
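As an illustration (the paths are hypothetical), Googlebot-Image would follow only the block addressed to it, while Google's other crawlers would fall back to the general Googlebot block:
# Googlebot-Image follows this block, its most specific match
User-agent: Googlebot-Image
Disallow: /photos/
# Other Google crawlers follow this block
User-agent: Googlebot
Disallow: /private/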
ii. Disallow Directive
It is used to specify which content the crawler cannot access. There can be multiple disallow directives, each telling the crawler not to access a specified part of the website. An empty disallow line means the crawler can access the entire website. Keep in mind that the allow and disallow directives themselves are not case-sensitive, but the values in them are.
Example:
User-agent: Googlebot
Disallow: /wp-admin/
iii. Allow Directive
This directive allows the search engine to crawl a subdirectory or a specific page, even inside a directory that is otherwise disallowed.
Example:
User-agent: Googlebot
Disallow: /blog
Allow: /blog/example-post
iv. Sitemap Directive
It tells search engines, including Google, Bing, and Yandex, where to find the XML sitemap. The sitemap lists the pages you want search engines to crawl and index. The directive is placed either at the top or the bottom of the robots.txt file.
Example:
Sitemap: https://www.website.com/sitemap.xml
v. Crawl-Delay Directive
This directive instructs crawlers to slow down their crawl rate so the server is not overloaded. Note that Google ignores this directive, although some other crawlers respect it.
Example:
User-agent: *
Crawl-delay: 20
vi. Noindex Directive
Robots.txt cannot reliably tell a search engine which URLs to keep out of its index and out of the results pages. A page that is disallowed in robots.txt can still appear in results if another site links to it; the bot simply will not know what information it contains. In 2019, Google officially announced that the noindex directive in robots.txt is not supported; to keep a page out of the index, use a noindex robots meta tag or an X-Robots-Tag HTTP header instead.
Advantages of Robots.txt file
Here are some advantages:
- Protects Against Server Overload: It helps protect your server from being overloaded or overwhelmed by crawler traffic.
- Avoiding Crawling of a Live, Unfinished Site: If a web page or website is live but still under development, you can use the robots.txt file to keep crawlers from crawling it.
- Manage Paid Link and Advertisement Traffic: Advertising and paid-link pages may have different crawling requirements, and robots.txt lets you set different rules for different search engine crawlers.
Disadvantages of Robots.txt file
Here are some disadvantages of robots.txt files.
- Rules Not Supported by All Search Engines: It is up to each crawler whether to obey the rules in robots.txt. Reputable search engine bots such as Googlebot follow them, but other crawlers may not.
- Different Interpretations by Different Crawlers: Even among well-behaved bots, the rules are not always interpreted in the same way, so syntax that works for one crawler may need to be adjusted for another.
- Disallowed Pages Can Still be Indexed Under Certain Circumstances: Robots.txt prevents crawling, not indexing. If a blocked URL is linked from elsewhere on the web, search engines can still discover and index it, and it may still appear in search results.
Best Practices for Robots.txt in SEO
Here are some practices you can follow when writing robots.txt files for SEO; an annotated example follows the list.
- Use new lines for each directive.
- To keep things simple and organized, use each user agent only once.
- Use wildcards (*) for applying a directive to all user agents and matching URL patterns.
- To indicate the end of the URL, use the “$” sign.
- To make the files easy to read, use “#” to add comments. The crawlers ignore everything that starts with “#”.
- Use separate robots.txt files for different subdomains since such files control the crawling behavior only on the subdomain where they are hosted.
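A minimal annotated sketch that combines several of these practices (the paths and the domain are placeholders):
# Block every crawler from URLs that contain a query string
User-agent: *
Disallow: /*?
# Block only Googlebot from PDF files; the $ marks the end of the URL
User-agent: Googlebot
Disallow: /*.pdf$
Sitemap: https://www.sample.com/sitemap.xml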
Conclusion
Robots.txt is an essential part of SEO that anyone who works with websites day to day should be familiar with. With this information about what robots.txt is, its functions, syntax, advantages and disadvantages, and best practices, you can improve your pages' visibility in search engine results pages and accomplish your digital marketing goals.
FAQs
What is the difference between robots.txt and an XML sitemap?
Robots.txt tells search engines which pages to crawl and which to ignore. An XML sitemap is a file, written in Extensible Markup Language, that lists the details of the URLs on a website.
Is robots.txt good for SEO?
Yes, robots.txt is good for SEO because it blocks irrelevant pages, leaving crawlers free to focus on the relevant pages and content on your website.
Is robots.txt the same as a sitemap?
No, robots.txt is not the same as a sitemap. The sitemap directive is only one optional component of robots.txt; it informs the search engine where to find the XML sitemap listing the pages to crawl. Robots.txt as a whole tells the search engine which pages to crawl and which to ignore.
Is robots.txt legally binding?
No. Robots.txt does not create any legal obligation between the site owner and the user; compliance is voluntary.