The robots.txt file: all you need to know about implementation and how it helps with SEO optimization
Although it appears to be just a simple file on a website, robots.txt is one of the most important elements of a site, both for search engines and for visitors looking for specific information on it.
Beyond the aspects of functionality and SEO optimization, the robots.txt file serves another useful purpose: it helps a site administrator tell search engines which pages should be indexed and which should not.
Because the discussion around this file is broad, this article will provide an overview of how the file works, as well as a number of implementation tips to avoid the problems that can arise from a misconfigured file.
For beginners: what is robots.txt?
For a simple understanding of the concept, let’s take a short example: think of search engines as a vast library from which we need to extract only what interests us.
Because their indexes contain millions of web pages from around the world, Google, Yahoo, Bing and the other search engines send out “spiders” to find new pages or updates to various sites in order to add them to their index.
In other words, one of the first things these spiders look for is the robots.txt file, which specifies which of the pages may be read and which may not.
If this file does not exist on the site, search engines will index all the pages they find.
The Robots Exclusion Standard has existed since 1994 to tell the various search engines what rules they need to observe when inspecting a website.
Although robots.txt is often confused with “robots meta tags,” the difference between the two is that the first blocks search engines from visiting certain pages, while the second only controls how pages are indexed.
Once a robots.txt file is implemented on a site, it stops search engines from crawling the files, folders, or links with sensitive data that it lists, such as a Word or PDF file kept in a private folder.
Although the rules of the Robots Exclusion Standard are respected by Google, Yahoo and Bing, there are other, less reputable crawlers that ignore the stipulated rules and index whatever they want.
Robots.txt is a simple file placed on the server that tells Googlebot, for example, whether or not it may access a particular file.
It contains a small set of directives through which access to different sections of the site is allowed or denied, as can be seen in the image above.
Here are some examples of robots.txt files on large sites:
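As an illustration, large sites typically combine per-robot rules with a sitemap reference; the paths and the domain below are hypothetical, not taken from any real site:

```
User-agent: *
Disallow: /search
Disallow: /login

User-agent: Googlebot-Image
Disallow: /private-images/

Sitemap: https://www.example.com/sitemap.xml
```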
What a robots.txt file looks like
Robots.txt is a plain text file, as its extension shows, which means it can be edited as needed. Please note the following:
- use lowercase letters to name the file – “Robots.TXT” will not be recognized;
- it must be uploaded to the root directory of the website;
- if subdomains are used, a separate robots.txt file must be created for each of them.
Primarily, the Robots Exclusion Standard defines two main, standard directives for the robots.txt file:
- 1. User-agent: specifies which search engine robot the rules apply to;
- 2. Disallow: tells a search engine not to crawl or index a file, a page, or an entire folder.
The standard file looks like this:
User-agent: [the robot or robots that the rules apply to]
Disallow: [the URLs that must not be crawled]
If you want to exclude several pages or folders, simply repeat the “Disallow” directive.
It is recommended not to list the admin area and other private areas of the site in robots.txt – the file is publicly accessible, so doing so would reveal their location.
For example, if you want to exclude “wp-admin”, “admin”, “cgi-bin” and “contact.html” and “about.html”, the robots.txt file will look like:
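A sketch of such a file, assuming all the excluded paths above sit in the site root:

```
User-agent: *
Disallow: /wp-admin/
Disallow: /admin/
Disallow: /cgi-bin/
Disallow: /contact.html
Disallow: /about.html
```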
If you do not want the site to be indexed by any robot (which is obviously not recommended), the robots.txt file takes the following form:
The asterisk (*) can be used with User-agent to address all search engines at once. For example, you can add the following to the site’s robots.txt file to block search engines from indexing it entirely.
Some sites use this directive without the slash (/) to declare that the whole site may be indexed.
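The two variants can be sketched as follows – note that these are alternative files, not one file (a slash after Disallow blocks everything, while an empty Disallow allows everything):

```
# Variant 1: block all robots from the entire site
User-agent: *
Disallow: /

# Variant 2 (a separate file): allow all robots to index everything
User-agent: *
Disallow:
```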
To exclude a particular search engine, such as Bing, here is what robots.txt would look like:
To allow only one specific search engine:
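Sketches for both cases, using the spider names listed below (Bingbot for Bing, Googlebot for Google); again, these are two alternative files:

```
# Exclude only Bing, allow everyone else
User-agent: Bingbot
Disallow: /

User-agent: *
Disallow:

# Alternative file: allow only Google, exclude everyone else
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
```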
Here’s a list of some of the top spiders of major search engines:
- Bingbot – Bing
- Googlebot – Google
- Googlebot-Image – Google Images
- Googlebot-News – Google News
- Teoma – Ask
First of all, having a robots.txt file removes the 404 error that occurs when a robot requests the file and does not find it on the website, preventing “file not found” messages in the logs.
The difference between a site with an active robots.txt file and one without one can be seen in the image below:
On the other hand, implementing a robots.txt file also has advantages in the following cases:
- if there are pages or directories on the site that you do not want to appear in the SERPs;
- if you want duplicate pages to be ignored – ideal when the CMS generates multiple URLs for the same content;
- if you do not want the site’s internal search result pages to be indexed;
- to provide search engines with the location of the sitemap;
- if you use paid links or advertisements that require special instructions for robots;
- to help you follow Google’s guidelines for crawlers.
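For instance, the sitemap location mentioned in the list above is announced with the Sitemap directive; a minimal sketch, assuming a hypothetical domain and sitemap path:

```
User-agent: *
Disallow: /internal-search/

Sitemap: https://www.my-website.com/sitemap.xml
```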
Find out if the robots.txt file blocks important pages
One way to check that the robots file was written correctly and does not block important pages that should not have been excluded from indexing is to use Google’s free testing tool.
If you have access, you can use the robots file checker directly from Google Search Console.
Can Google index a page even if it is added to robots with the disallow parameter?
Yes. Google and other search engines may still index a file even if we try to block it by adding a disallow rule to robots.txt – for example, when other sites link to that URL, it can appear in results without its content being crawled.
To make sure a page is not indexed, we recommend using the noindex meta tag instead (the page must remain crawlable so that the tag can be read):
<meta name="robots" content="noindex, nofollow">
<meta name="robots" content="noindex, follow">
Robots.txt verification and testing
To check if a site contains the robots.txt file, just add “/robots.txt” at the end of the domain, as in the example: www.my-website.com/robots.txt.
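The same kind of check can also be scripted; a minimal sketch using Python’s standard urllib.robotparser module, where the rules, domain, and paths are hypothetical:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, as they might be served at www.my-website.com/robots.txt
rules = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /cgi-bin/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # against a live site, use set_url(...) followed by read()

# A blocked path and an allowed path
print(parser.can_fetch("Googlebot", "https://www.my-website.com/wp-admin/options.php"))  # False
print(parser.can_fetch("Googlebot", "https://www.my-website.com/blog/article.html"))     # True
```

Against a live site, you would replace `parse()` with `set_url("https://www.my-website.com/robots.txt")` and `read()`, which fetch the file over the network.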
But to check that robots.txt is correct, the following are used:
Google Webmaster Tools – after opening it, click on the “Blocked URLs” option, as in the image below:
This tool displays the contents of the robots.txt file from the last copy Google found saved on the site. If the file has been modified since then, the changes may not yet be reflected; fortunately, you can enter any directives you want into that window for testing.
You can also test this file against any URL. By default, the Googlebot crawler is used to test robots.txt, but you can choose from four other user-agents: Google-Mobile, Google-Image, Mediapartners-Google (AdSense) and Adsbot-Google (AdWords).
Conclusions about the robots.txt file
When a search engine spider accesses a site, it first looks for this special file called robots.txt. The file contains hints for search robots (for all of them or only for some) about indexing certain pages of the site.
The Robots Exclusion Standard is the tool behind robots.txt files, helping search engines understand what needs to be indexed and what does not. The robots.txt file is also very important for blocking duplicate or non-user-friendly links.