One of the things that excites me most about the development of the web is the increase in learning resources. When I went to college at 19, it was exciting to find magazines, get access to thousands of dollars' worth of textbooks, and download open-source software.

The things in our marketing world that computers are best designed to explain are:

Search queries – especially those that look more like programming constructs than natural-language queries, such as [site:distilled.net -inurl:www]
The on-site part of analytics setup – setting custom variables and events, adding virtual pageviews, modifying e-commerce tracking, and so on
Robots.txt syntax and rules
The HTML of links, meta page information, alt attributes, etc.
Skills like Excel formulas that many of us find to be an important part of our day-to-day work
I am slowly building Codecademy-style interactive learning environments for all of these things for our online training platform, DistilledU, but most of them are only available to paid members. I thought it would be a good way to start 2013 to pull one of these modules out from behind the paywall and let the SEOmoz community have at it. I chose robots.txt because our in-app feedback shows it is one of the modules people have learned the most from.

Also, despite years of experience, I learned a few things I didn't know as I wrote this module (especially about the precedence of different rules and the interaction of wildcards with explicit rules). I hope it proves useful to many of you – beginners and experts alike.

An interactive guide to Robots.txt

robots.txt

Robots.txt is a plain-text file found at the root of a domain (e.g. www.example.com/robots.txt). It is a widely acknowledged standard and allows webmasters to control all kinds of automated consumption of their site – not just by search engines.

In addition to reading about the protocol, robots.txt is one of the more accessible areas of SEO because you can view any site's robots.txt. Once you have completed this module, you may find value in looking over the robots.txt files of some large sites (for example Google's and Amazon's).

For each of the following sections, modify the text in the textareas; they turn green when you have found the correct answer.

Basic exclusion
The most common use case for robots.txt is to block robots from accessing specific pages. The simplest version applies the rule to all robots with a line reading User-agent: *. Subsequent lines contain specific exclusions that work cumulatively, so the code below blocks robots from accessing /secret.html.

Add another rule to block access to /secret2.html in addition to /secret.html.

User-agent: *
Disallow: /secret.html
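
With the extra rule added, the completed file would look something like this:

User-agent: *
Disallow: /secret.html
Disallow: /secret2.html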

Exclude directories
If you end an exclusion directive with a trailing slash ("/"), such as Disallow: /private/, then everything within that directory is blocked.

Modify the exclusion rule below to block the folder called secret instead of the page secret.html.

User-agent: *
Disallow: /secret.html
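
Once modified to target the folder, the file might read:

User-agent: *
Disallow: /secret/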

Allow specific paths
In addition to disallowing specific paths, the robots.txt syntax allows specific paths to be explicitly permitted. Note that allowing robot access is the default state, so if there are no rules in a file, all paths are allowed.

The primary use of the Allow: directive is to override a more general Disallow: directive. The precedence rule states that "the most specific rule based on the length of the [path] entry will trump the less specific (shorter) rule. The order of precedence for rules with wildcards is undefined."

We will demonstrate this by modifying the exclusion of the /secret/ folder with an Allow: rule permitting /secret/not-secret.html. Since this rule is longer, it will take precedence.

User-agent: *
Disallow: /secret/
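
With the additional Allow: rule in place, the file might look like this – the longer, more specific rule lets not-secret.html be crawled even though the rest of /secret/ stays blocked:

User-agent: *
Disallow: /secret/
Allow: /secret/not-secret.html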

Restrictions for specific user agents
All the directives we have worked with so far have applied equally to all robots. That is what the User-agent: * at the start of our commands specifies. By replacing the *, however, we can design rules that apply only to specific named robots.

Replace * with Googlebot in the example below to create a rule that applies only to Google’s robots.

User-agent: *
Disallow: /secret/
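
After swapping in Google's user-agent, the rule would read:

User-agent: Googlebot
Disallow: /secret/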

Add multiple blocks
It is possible to have multiple blocks of directives targeting different sets of robots. The robots.txt example below will allow Googlebot to access all files except those in the /secret/ directory and will block all other robots from the whole site. Note that because there is a set of directives aimed explicitly at Googlebot, Googlebot will entirely ignore the directives aimed at all robots. This means you cannot build up your exclusions from a common base; if you want to target named robots, each block must specify all of its own rules.

Add a second block of directives targeting all robots (User-agent: *) that blocks the whole site (Disallow: /). This will create a robots.txt file that blocks the entire site from all robots except Googlebot, which can crawl any page except those in the /secret/ folder.

User-agent: googlebot
Disallow: /secret/
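
With both blocks in place, the whole file might look like this:

User-agent: googlebot
Disallow: /secret/

User-agent: *
Disallow: /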

Use more specific user agents
There are occasions when you want to control specific crawlers, such as Google's Images crawler, differently from the main Googlebot. To enable this in robots.txt, these crawlers will choose to listen to the most specific user-agent string that applies to them. So, for example, if there is a block of instructions for googlebot and one for googlebot-images, the images crawler will obey the latter set of instructions. If there is no specific set of instructions for googlebot-images (or any of the other specialist Googlebots), it will obey the regular googlebot directives.

Note that a crawler will only ever obey one set of directives – there is no concept of directives being applied cumulatively across blocks.

Given the following robots.txt, googlebot-images will obey the googlebot directives (in other words, it will not crawl the /secret/ folder). Modify it so that the directives for googlebot (and googlebot-news etc.) remain the same, but googlebot-images has its own specific set of directives, meaning it will crawl neither the /secret/ folder nor the /copyright/ folder:

User-agent: googlebot
Disallow: /secret/
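
One way the completed file might look, with a dedicated block for the images crawler:

User-agent: googlebot
Disallow: /secret/

User-agent: googlebot-images
Disallow: /secret/
Disallow: /copyright/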

Basic wildcards
Trailing wildcards (specified with *) are ignored, so Disallow: /private* is the same as Disallow: /private. Wildcards are useful, however, for matching multiple kinds of pages at once. The star character (*) matches 0 or more instances of any valid character (including /, ?, etc.).

For example, Disallow: news*.html blocks:

news.html
news1.html
news1234.html
newsy.html
news1234.html?id=1
But does not block:

newshtml (note the lack of a ".")
News.html (matching is case-sensitive)
/directory/news.html
Modify the following pattern to block only pages ending in .html in the blog directory, instead of the whole blog directory:

User-agent: *
Disallow: /blog/
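
Using a wildcard, the modified rule could look like this (blocking any path that starts with /blog/ and contains .html):

User-agent: *
Disallow: /blog/*.html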

Block some parameters
A common use case for wildcards is to block certain parameters. For example, one way of handling faceted navigation is to block combinations of 4 or more facets. One way to do this is to have your system add a parameter such as ?crawl=no to all combinations of 4+ facets. This would mean, for example, that the URL for 3 facets might be /facet1/facet2/facet3/, but that when a fourth is added it becomes /facet1/facet2/facet3/facet4/?crawl=no.

The robots rule that blocks this should look for *crawl=no (not *?crawl=no, because a URL with a query string like ?sort=asc&crawl=no would not match it and would remain crawlable).

Add a rule to the robots.txt below to prevent any pages containing crawl=no from being crawled.

User-agent: *
Disallow: /secret/
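
With the parameter rule added, the file might read:

User-agent: *
Disallow: /secret/
Disallow: *crawl=no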

Match full filename
As we saw with folder exclusions (where a pattern like /private/ will match the paths of files in that folder, such as /private/privatefile.html), by default the patterns we specify in robots.txt match only a portion of the filename and allow anything to come afterwards, even without explicit wildcards.

There are times when we want a pattern to apply to the whole filename (with or without wildcards). For example, the robots.txt below looks like it prevents .jpg files from being crawled, but it would also prevent a file whose name merely contains .jpg – say an HTML page called photo.jpg.html – from being crawled, because that also matches the pattern.

If we want a pattern to match to the end of the filename, we should end it with a $ sign, which denotes "end of line". For example, modifying an exclusion from Disallow: /private.html to Disallow: /private.html$ would stop the pattern matching /private.html?sort=asc and therefore allow that page to be crawled.

Modify the pattern below to exclude only actual .jpg files (i.e. those ending in .jpg).

User-agent: *
Disallow: *.jpg
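
Anchored with a $, the rule would then only match paths that end in .jpg:

User-agent: *
Disallow: *.jpg$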

Add an XML Sitemap
The last line in many robots.txt files is an instruction specifying the location of the site’s XML sitemap. There are many good reasons to include a sitemap for your site and also to list it in your robots.txt file. You can read more about XML sitemaps here.

You specify the location of your sitemap using the Sitemap directive:
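
In its general form (the URL here is just a placeholder), the directive looks like this:

Sitemap: http://www.example.com/sitemap.xml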

Add a Sitemap directive to the following robots.txt for a sitemap called my-sitemap.xml, which can be found at http://www.distilled.net/my-sitemap.xml.

User-agent: *
Disallow: /private/
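
With the directive added, the file would look something like this:

User-agent: *
Disallow: /private/
Sitemap: http://www.distilled.net/my-sitemap.xml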

Add a Video Sitemap
In fact, you can add multiple XML sitemaps (each on its own line) using this syntax. Go ahead and modify the robots.txt below so that it also includes a video sitemap called my-video-sitemap.xml, which resides at /my-video-sitemap.xml.

User-agent: *
Disallow: /private/
Sitemap: /my-sitemap.xml
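
Adding the second entry (keeping the relative paths used in this exercise, though absolute URLs are generally recommended for the Sitemap directive), the file might read:

User-agent: *
Disallow: /private/
Sitemap: /my-sitemap.xml
Sitemap: /my-video-sitemap.xml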

The idea of this post is to teach the general principles of how robots.txt files are interpreted, rather than to explain the best ways of using them. To learn more about implementing technical SEO, you can contact our digital marketing training professionals at 99 Digital Academy for SEO and digital marketing services.
