There has been a lot of debate about robots.txt: whether the robots.txt file matters for SEO at all, and how to make the best use of it.
For starters, robots.txt is a set of directions that tell web crawlers which parts of a website they can access. Almost all the major search engines, including Google, Yahoo, Bing, and Yandex, support the use of robots.txt. It helps determine which pages should be crawled, indexed, and shown in the search results.
As a website owner, your top priority must be to get your site indexed so that it shows up on Google. Oftentimes, people run into trouble when their website fails to get indexed. There can be many reasons for this, but in most cases the culprit is an issue with the robots.txt file. It is one of the most common problems auditors come across when reviewing the technical SEO of a site, and an error in the robots.txt file can lead to a massive drop in search rankings. Even though the problem is well known, we still see experienced SEO experts committing robots.txt errors.
That said, you must understand two very important things:
- What is robots.txt
- How to use robots.txt in all the leading Content Management Systems (CMS)
Once you get the hang of these two things, you will know how to create a robots.txt file that is best suited for SEO. This will also make life easier for web crawlers and subsequently help your web pages get indexed quickly.
What is Robots.txt?
Often referred to as the “robots exclusion standard” or “robots exclusion protocol”, robots.txt is a text file found in the root directory of your website. According to Moz, it is basically an instruction file for SEO web crawlers, telling them which parts of the website should be crawled and which should be ignored.
The History Of Robots.txt
Martijn Koster proposed the robots.txt file to help regulate how different search engine bots and web spiders access web content. Here’s how it developed over the years:
1994: Koster was working with web crawlers back in 1994 when things backfired: the spiders flooded his servers with requests. To protect websites from badly behaved crawlers, Koster created robots.txt, which helped guide search bots to the right pages and kept them out of certain areas of a site.
1997: This is when an internet draft was formulated that specified how web robots could be controlled with the help of a robots.txt file. From then on, robots.txt was used as a tool to keep spider robots away from, or direct them to, particular parts of a website.
2019: On July 1, 2019, Google formally announced that it had published the Robots Exclusion Protocol (REP) specification and was working to make it an official internet standard, almost 25 years after the robots.txt file was created.
The main purpose was to spell out previously unspecified scenarios for robots.txt so that the protocol could adapt to the modern web. The updated draft includes some important points. These are:
- A robots.txt file can be used with any Uniform Resource Identifier (URI)-based transfer protocol, not just HTTP; this mainly covers HTTP, the Constrained Application Protocol (CoAP), and the File Transfer Protocol (FTP).
- Crawlers are only required to parse the first 500 kibibytes of a robots.txt file, so web developers are advised to keep the file under that size to avoid putting extra strain on parsing.
- The content of robots.txt is usually cached for up to 24 hours, which gives website owners and developers enough flexibility when they update their robots.txt file.
- If the robots.txt file becomes inaccessible because of server issues, previously disallowed pages are still not crawled for a reasonably long period of time.
Over the years, a lot of effort has been made to extend the robots.txt exclusion mechanisms. The issue is that many web crawlers are unable to support these newer robots.txt extensions.
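For example, pattern matching inside paths, using the * wildcard and the $ end-of-URL anchor, is an extension that Google and Bing honor but that smaller crawlers may simply ignore. The file type blocked below is purely illustrative:
User-agent: *
# The * and $ pattern matching in this path is an extension, not part of the original standard
Disallow: /*.pdf$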
To learn more about the robots.txt file and how it works, let’s first shed some light on web crawlers themselves.
What Is Robots.txt Used For?
This file is used to manage crawler traffic on your website. It plays a big role in making your website accessible to search engines and visitors.
Here we have outlined the best ways for you to enhance your SEO strategy using robots.txt in WordPress and other CMSs (a combined example follows this list):
- Improve your website’s crawlability and indexability
- Keep duplicate content out of the search results
- Prevent web crawlers from finding pages that aren’t ready to be published
- Work on improving the overall user experience
- Pass link juice to the right pages
- Avoid overloading your site with requests from Google’s web crawlers and other search bots
- Use robots.txt disallow directives to prevent Google’s crawl robots from crawling private areas of your website
- Save your site from bad bots
- Make better use of your crawl budget
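As a minimal sketch of how several of these ideas come together, the file below blocks a hypothetical misbehaving crawler (named BadBot here purely for illustration) from the whole site, keeps every other crawler out of an assumed /private/ directory, and points the bots to the sitemap:
# BadBot is a placeholder name for a bot you want to block entirely
User-agent: BadBot
Disallow: /
# All other crawlers: stay out of the private area, crawl everything else
User-agent: *
Disallow: /private/
Sitemap: https://websitename.com/sitemap.xml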
What Are Web Crawlers?
Web crawlers, spider bots, site crawlers, and search bots are essentially different names for the same thing: an internet bot operated by a search engine. The bot crawls the web and examines web pages so that the information on them can be viewed or fetched by users at any time.
What’s the role of web crawlers in technical SEO? To answer this, you must first learn about the different types of site crawlers found on the web. Each robot serves a distinct purpose. These are:
- Search Engine Bots
- Commercial web spider
- Personal crawler bot
- Desktop site crawler
- Copyright crawling bots
- Cloud-based crawler robot
Customize Your Robots.txt File For a Specific Search Spider, Disallow Access To Particular Files, Or Control Robots.txt Crawl Delay
The default robots.txt file, such as the one WordPress typically generates, looks something like this:
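User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php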
These are some of the directives you need to know. We have also provided some robots.txt file examples to give you a clearer understanding of each one.
User Agent
This directive names the crawler that the rules which follow apply to. It is the first line of any rule group in a robots.txt file.
When this directive uses the * symbol, the rule group applies to all bots.
Every crawler has a unique name: Google’s web crawlers are called Googlebot, while Yahoo’s spider is called Slurp.
Example 1:
User-agent: *
Disallow: /wp-admin/
Since * was used in the example above, the rule applies to every user-agent: no crawler is allowed to access the /wp-admin/ directory.
Example 2:
User-agent: Googlebot
Disallow: /wp-admin/
Now, since Googlebot was named as the user-agent, only Google’s crawlers are blocked from the /wp-admin/ directory; all other search spiders can still access it.
Example 3:
User-agent: Googlebot
User-agent: Slurp
Disallow: /wp-admin/
The above example shows that only Google’s crawlers and Yahoo’s bots are blocked from the /wp-admin/ directory; all other user-agents still have access to it.
Allow Command
This command states which content can be accessed by the user-agent. The robots.txt allow command is supported by Bing and Google.
The allow directive has to be followed by the path that Google’s web crawlers and other bots are permitted to access. If the path isn’t specified, crawlers will ignore the allow directive.
Example 1:
User-agent: *
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-admin/
In the example above, the rules apply to every user-agent, so all search engine spiders are blocked from accessing the /wp-admin/ directory except for the page /wp-admin/admin-ajax.php.
Disallow Command
This command is used to indicate the URLs that Google’s crawl robots and other website crawlers must not access. Similar to the allow command, the disallow directive should also be followed by the path you don’t want web crawlers to reach.
Example 1
User-agent: *
Disallow: /wp-admin/
In the above example, the command prevents all user-agents from accessing the /wp-admin/ directory.
Sitemap
The sitemap directive helps direct Google’s bots and other web crawlers to the XML sitemap. It is supported by search engines like Google, Yahoo, Bing, and Ask.
Example 1
User-agent: *
Disallow: /wp-admin/
Sitemap: https://websitename.com/sitemap1.xml
Sitemap: https://websitename.com/sitemap2.xml
The above example shows the disallow command telling all search bots not to access /wp-admin/, while the two sitemap lines tell them that the website has two sitemaps. Once you know how to add a sitemap to robots.txt, you can list several XML sitemaps in the robots.txt file.
Crawl Delay
The robots.txt crawl-delay directive prevents web crawlers and other spiders from overtaxing a server. It enables admins to specify how long a crawler should wait between requests, with the value given in seconds.
Example:
User-agent: *
Disallow: /wp-admin/
Disallow: /calendar/
Disallow: /events/
User-agent: Bingbot
Disallow: /calendar/
Disallow: /events/
Crawl-delay: 10
Sitemap: https://websitename.com/sitemap.xml
In the above example, the crawl-delay directive instructs Bingbot to wait 10 seconds before requesting the next URL.
Also, there are some web spiders, like Google’s web crawler, that don’t support the robots.txt crawl-delay directive. Because of this, you should run your syntax through a robots.txt checker before you submit the robots.txt file to Google or any other search engine, to save yourself from parsing problems.
Where Will You Find Robots.txt in WordPress?
WordPress, the most popular and widely used CMS, powers more than 40% of all websites. This is why it’s important to know how to edit robots.txt in WordPress.
How will you find robots.txt in WordPress? These are the steps you need to follow to access the WordPress robots.txt file:
- Log in to your WordPress dashboard
- In the sidebar, find the “SEO” menu. This menu comes from the Yoast SEO plugin, which is a must if you want to edit robots.txt from the dashboard
- Click on “File Editor”
- Here you’ll be able to view the WordPress robots.txt file and edit it directly from the dashboard.
Wrap Up
It’s not easy to deal with robots.txt optimization and other technical SEO tasks, especially if you don’t have the right skill set, tools, and personnel to do the job. However, you don’t need to worry about your website’s issues if you can easily connect with professionals who can give you the right results, instantly.
Entrust all your SEO work, website maintenance tasks, and digital marketing needs to Digiown, and let us build your online presence.