robots.txt: the ultimate guide

May 17th, 2016 – 5 Comments

The robots.txt file is one of the primary ways of telling a search engine where it can and can’t go on your website. All major search engines support the basic functionality it offers. A few search engines also support some extra rules, which can be useful too. This guide covers all the uses of robots.txt for your website. While it looks deceptively simple, a mistake in your robots.txt can seriously harm your site, so make sure to read and understand this.

What is a robots.txt file?


A couple of developers sat down and realized that they were, in fact, not robots. They were (and are) humans. So they created the humans.txt standard as a way of highlighting which people work on a site, amongst other things.

A robots.txt file is a text file that follows a strict syntax. It’s going to be read by search engine spiders. These spiders are also called robots, hence the name. The syntax is strict simply because it has to be computer readable. There’s no reading between the lines here: something is either 1 or 0.

Also called the “Robots Exclusion Protocol”, the robots.txt file is the result of a consensus between early search engine spider developers. It’s not an official standard by any standards organization, but all major search engines do adhere to it.

What does the robots.txt file do?

Crawl directives

The robots.txt file is one of a few crawl directives. We have guides on all of them; find them here:

Crawl directives guides by Yoast »

Search engines index the web by spidering pages. They follow links to go from site A to site B to site C and so on. Before a search engine spiders any page on a domain it hasn’t encountered before, it will open that domain’s robots.txt file. The robots.txt file tells the search engine which URLs on that site it’s allowed to crawl.

A search engine will cache the robots.txt contents, but will usually refresh it multiple times a day. So changes will be reflected fairly quickly.
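As a rough illustration of that check-before-crawl step, here is a minimal sketch using Python's standard urllib.robotparser module. The rules, the ExampleBot name and the example.com URLs are made up for the example, and this is not how any particular search engine implements it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. A real crawler would download these
# from the site's /robots.txt before requesting any other URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the parsed rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(may_crawl(ROBOTS_TXT, "ExampleBot", "https://www.example.com/blog/"))      # True
print(may_crawl(ROBOTS_TXT, "ExampleBot", "https://www.example.com/private/x"))  # False
```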


Where should I put my robots.txt file?

The robots.txt file should always be at the root of your domain, so it is served from /robots.txt on that hostname. Do be aware: if your domain responds without www. too, make sure that version serves the same robots.txt file! The same is true for http and https. When a search engine wants to spider a URL over http, it will grab the robots.txt from your http site. When it wants to spider that same URL over https, it will grab the robots.txt from your https site too.
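To make the "one robots.txt per scheme and hostname" point concrete, here is a small sketch that derives the robots.txt location for any page URL. The function name robots_url and the example.com hostnames are assumptions for illustration only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url.

    The scheme and hostname are kept as-is; only the path is replaced,
    because every scheme/hostname combination has its own robots.txt.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


for url in (
    "http://www.example.com/blog/post?page=2",
    "https://www.example.com/blog/post?page=2",
    "https://example.com/blog/post?page=2",
):
    print(url, "->", robots_url(url))
```

Each of the three URLs maps to a different robots.txt location, which is why all of them should serve the same file.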

It’s also very important that your robots.txt file is really called robots.txt. The name is case sensitive. Don’t make any mistakes in it or it will just not work.

Pros and cons of using robots.txt

Pro: crawl budget

Each site has an “allowance” in how many pages a search engine spider will crawl on that site. SEOs call this the crawl budget. By blocking sections of your site from the search engine spider, you allow your crawl budget to be used for other sections. Especially on sites where a lot of SEO clean-up has to be done, it can be very beneficial to first quickly block the search engines from crawling a few sections.

blocking query parameters

One situation where crawl budget is especially important is when your site uses a lot of query string parameters to filter and sort. Let’s say you have 10 different query parameters, each with different values, that can be used in any combination. This leads to hundreds if not thousands of possible URLs. Blocking all query parameters from being crawled will help make sure the search engine only spiders your site’s main URLs and won’t go into the enormous trap that you’d otherwise create.

This line would block all URLs on your site with a query string on it:

Disallow: /*?*
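To get a feel for how quickly filter parameters multiply, the sketch below enumerates every URL that three optional parameters can produce for a single page. The /shoes/ path and the parameter names and values are hypothetical.

```python
from itertools import product
from urllib.parse import urlencode

# Hypothetical filter and sort parameters for one category page.
FILTERS = {
    "color": ["red", "blue", "green"],
    "size": ["s", "m", "l"],
    "sort": ["price", "name"],
}


def filtered_urls(base_path: str, filters: dict) -> list:
    """Build every URL variant; each parameter may also be left out."""
    options = [[None] + values for values in filters.values()]
    urls = []
    for combo in product(*options):
        query = {key: value for key, value in zip(filters, combo) if value is not None}
        urls.append(base_path + ("?" + urlencode(query) if query else ""))
    return urls


urls = filtered_urls("/shoes/", FILTERS)
print(len(urls))                        # 48 variants of one page
print(sum("?" in url for url in urls))  # 47 of them carry a query string
```

With only three parameters there are already 48 crawlable variants of one page, and all but the plain URL contain a query string.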

Con: not removing a page from search results

Using the robots.txt file you can tell a spider where it cannot go on your site. You cannot tell a search engine which URLs it cannot show in the search results. This means that not allowing a search engine to crawl a URL – called “blocking” it – does not mean that URL will not show up in the search results. If the search engine finds enough links to that URL, it will include it; it will just not know what’s on that page.

Screenshot of a result for a blocked URL in the Google search results

If you want to reliably block a page from showing up in the search results, you need to use a meta robots noindex tag. That means the search engine has to be able to crawl that page and find the noindex tag, so the page should not be blocked by robots.txt.
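The interaction between the two mechanisms can be sketched as a small check: a noindex tag only helps if the crawler is allowed to fetch the page that carries it. The helper names, the ExampleBot agent and the sample rules and HTML below are assumptions for illustration, built on Python's standard html.parser and urllib.robotparser modules.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class MetaRobotsFinder(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            if "noindex" in (attributes.get("content") or "").lower():
                self.noindex = True


def noindex_can_be_seen(robots_txt, url, html, user_agent="ExampleBot"):
    """True only if the page has a noindex tag AND may be crawled."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    finder = MetaRobotsFinder()
    finder.feed(html)
    return finder.noindex and rules.can_fetch(user_agent, url)


PAGE = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
BLOCKING = "User-agent: *\nDisallow: /thank-you/\n"
OPEN = "User-agent: *\nDisallow:\n"

print(noindex_can_be_seen(BLOCKING, "/thank-you/", PAGE))  # False: tag is never read
print(noindex_can_be_seen(OPEN, "/thank-you/", PAGE))      # True
```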

Because the search engine can’t crawl the page, it cannot distribute the link value for links to your blocked pages. If it could crawl, but not index the page, it could still spread the link value across the links it finds on the page. When a page is blocked with robots.txt, the link value is lost.

robots.txt syntax

WordPress robots.txt

We have a complete article on how to best set up your robots.txt for WordPress. Note that you can edit your site’s robots.txt file in the Yoast SEO Tools → File editor section.

A robots.txt file consists of one or more blocks of directives, each started by a user-agent line. The “user-agent” is the name of the specific spider it addresses. You can either have one block for all search engines, using a wildcard for the user-agent, or specific blocks for specific search engines. A search engine spider will always pick the most specific block that matches its name.

These blocks look like this (don’t be scared, we’ll explain below):

User-agent: *
Disallow: /

User-agent: Googlebot
Disallow:

User-agent: bingbot
Disallow: /not-for-bing/

Directives like Allow and Disallow are not case sensitive, so whether you write them lowercase or capitalize them is up to you. The values are case sensitive, however: /photo/ is not the same as /Photo/. We like to capitalize directives for the sake of readability in the file.
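The three blocks above can be tried out with Python's standard urllib.robotparser. This is only a sketch: it assumes the Googlebot block carries an empty Disallow line, writes that block's directive names in lowercase on purpose, and uses a made-up OtherBot crawler. Python's parser is one implementation among many and may differ from real search engines in edge cases.

```python
from urllib.robotparser import RobotFileParser

# Directive names in the Googlebot block are deliberately lowercase
# to show that their case does not matter; the paths stay case sensitive.
ROBOTS_TXT = """\
User-agent: *
Disallow: /

user-agent: Googlebot
disallow:

User-agent: bingbot
Disallow: /not-for-bing/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("Googlebot", "/Photo/"))            # True: nothing disallowed
print(parser.can_fetch("bingbot", "/not-for-bing/page"))   # False
print(parser.can_fetch("bingbot", "/Not-For-Bing/page"))   # True: different case
print(parser.can_fetch("OtherBot", "/"))                   # False: falls back to *
```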

User-agent directive

The first bit of every block of directives is the user-agent. A user-agent identifies a specific spider. The user-agent field is matched against that specific spider’s (usually longer) user-agent. For instance, the most common spider from Google has the following user-agent:

Mozilla/5.0 (compatible; Googlebot/2.1; 

A relatively simple User-agent: Googlebot line will do the trick if you want to tell this spider what to do.

Note that most search engines have multiple spiders. They will use specific spiders for their normal index, for their ad programs, for images, for videos, etc.

Search engines will always choose the most specific block of directives they can find. Say you have 3 sets of directives: one for *, one for Googlebot and one for Googlebot-News. If a bot comes by whose user-agent is Googlebot-Video, it would follow the Googlebot restrictions. A bot with the user-agent Googlebot-News would use the more specific Googlebot-News directives.
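The "most specific block wins" behaviour described above could be sketched like this. The function pick_group is hypothetical and uses a simple longest-prefix rule; real crawlers may match names differently.

```python
def pick_group(crawler_name, group_names):
    """Pick the robots.txt block a crawler would follow.

    Assumption for this sketch: a block matches when the crawler's name
    starts with the block's user-agent value, and the longest match wins.
    The wildcard block is only used when nothing else matches.
    """
    name = crawler_name.lower()
    matches = [
        group for group in group_names
        if group != "*" and name.startswith(group.lower())
    ]
    if matches:
        return max(matches, key=len)
    return "*" if "*" in group_names else None


GROUPS = ["*", "Googlebot", "Googlebot-News"]

print(pick_group("Googlebot-Video", GROUPS))  # Googlebot
print(pick_group("Googlebot-News", GROUPS))   # Googlebot-News
print(pick_group("bingbot", GROUPS))          # *
```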

The most common user agents for search engine spiders

Below is a list of the user-agents you can use in your robots.txt file to match the most commonly used search engines:

Search engine   Field            User-agent
Baidu           General          baiduspider
Baidu           Images           baiduspider-image
Baidu           Mobile           baiduspider-mobile
Baidu           News             baiduspider-news
Baidu           Video            baiduspider-video
Bing            General          bingbot
Bing            General          msnbot
Bing            Images & Video   msnbot-media
Bing            Ads              adidxbot
Google          General          Googlebot
Google          Images           Googlebot-Image
Google          Mobile           Googlebot-Mobile
Google          News             Googlebot-News
Google          Video            Googlebot-Video
Google          AdSense          Mediapartners-Google
Google          AdWords          AdsBot-Google
Yahoo!          General          slurp
Yandex          General          yandex

Disallow directive

The second line in any block of directives is the Disallow line. You can have one or more of these lines, specifying parts of the site the specified spider can’t access. An empty Disallow line means you’re not disallowing anything, so basically it means that spider can access all sections of your site.

User-agent: *
Disallow: /

The example above would block all search engines that “listen” to robots.txt from crawling your site.

User-agent: *
Disallow:

The example above would, with only one character less, allow all search engines to crawl your entire site.

User-agent: googlebot
Disallow: /Photo

The example above would block Google from crawling the /Photo directory on your site and everything in it. This means all the subdirectories of the /Photo directory would also not be spidered. It would not block Google from crawling the /photo directory, as these lines are case sensitive.

How to use wildcards / regular expressions

“Officially”, the robots.txt standard doesn’t support regular expressions or wildcards. However, all major search engines do understand them. This means you can have lines like this to block groups of files:

Disallow: /*.php
Disallow: /copyrighted-images/*.jpg

In the example above, * is expanded to whatever filename it matches. Note that the rest of the line is still case sensitive, so the second line above will not block a file called /copyrighted-images/example.JPG from being crawled.

Some search engines, like Google, allow for more complicated regular expressions. Be aware that not all search engines might understand this logic. The most useful feature this adds is the $, which indicates the end of a URL. In the following example you can see what this does:

Disallow: /*.php$

This means /index.php could not be crawled, but /index.php?p=1 could be. Of course, this is only useful in very specific circumstances and also pretty dangerous: it’s easy to unblock things you didn’t actually want to unblock.
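The wildcard and end-of-URL behaviour can be approximated by translating a rule into a regular expression. The sketch below is an assumption about how such matching works in general, not the code of any search engine; rule_matches is a made-up helper.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Check a robots.txt path pattern against a URL path.

    '*' matches any run of characters, a trailing '$' pins the match to
    the end of the URL, and everything else is literal and case sensitive.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, like a robots.txt prefix match.
    return re.match(regex, path) is not None


print(rule_matches("/*.php", "/index.php?p=1"))    # True
print(rule_matches("/*.php$", "/index.php"))       # True
print(rule_matches("/*.php$", "/index.php?p=1"))   # False
print(rule_matches("/copyrighted-images/*.jpg", "/copyrighted-images/example.JPG"))  # False
print(rule_matches("/*?*", "/shoes/?color=red"))   # True
```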

Non-standard robots.txt crawl directives

On top of the Disallow and User-agent directives there are a couple of other crawl directives you can use. These directives are not supported by all search engine crawlers so make sure you’re aware of their limitations.

Allow directive

While not in the original “specification”, there was talk of an allow directive very early on. Most search engines seem to understand it, and it allows for simple, and very readable directives like this:

Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

The only other way of achieving the same result without an allow directive would have been to specifically disallow every single file in the wp-admin folder.
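How an Allow line can carve an exception out of a Disallow line is easiest to see with a "longest matching rule wins" sketch. This precedence rule is an assumption here (it is how Google documents its own behaviour); other parsers, including Python's built-in urllib.robotparser, apply rules in file order instead. The helper is_allowed is hypothetical.

```python
def is_allowed(path, rules):
    """Decide whether path may be crawled under (directive, prefix) rules.

    The longest matching prefix wins; when an Allow and a Disallow rule
    are equally long, Allow wins. No matching rule means allowed.
    """
    best_length, allowed = -1, True
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_length or (len(prefix) == best_length and is_allow):
            best_length, allowed = len(prefix), is_allow
    return allowed


RULES = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed("/wp-admin/admin-ajax.php", RULES))  # True
print(is_allowed("/wp-admin/options.php", RULES))     # False
print(is_allowed("/blog/", RULES))                    # True
```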

noindex directive

The noindex directive is one of the lesser known directives, and Google actually supports it. We think this is a very dangerous thing. If you want to keep a page out of the search results, you usually have a good reason for that. Using a blocking method that only keeps the page out of Google means you leave those pages open for other search engines. It could be very useful in a Googlebot-specific block of your robots.txt though, if you’re working on improving your crawl budget. Note that noindex isn’t officially supported by Google, so while it works now, it might not at some point.

host directive

Supported by Yandex (and not by Google, even though some posts say it is), this directive lets you decide which version of your hostname you want the search engine to show, for instance with or without www.


Because only Yandex supports the host directive, we wouldn’t advise you to rely on it. Especially as it doesn’t allow you to define a scheme (http or https) either. A better solution that works for all search engines would be to 301 redirect the hostnames that you don’t want in the index to the version that you do want. In our case, we redirect to

crawl-delay directive

Supported by Yahoo!, Bing and Yandex, the crawl-delay directive can be very useful to slow down these three, sometimes fairly crawl-hungry, search engines. These search engines have slightly different ways of reading the directive, but the end result is basically the same.

A line like the one below would lead to Yahoo! and Bing waiting 10 seconds after a crawl action. Yandex would only access your site once in every 10 second timeframe. A semantic difference, but interesting to know. Here’s the example crawl-delay line:

crawl-delay: 10

Do take care when using the crawl-delay directive. By setting a crawl delay of 10 seconds you’re only allowing these search engines to crawl 8,640 pages a day. This might seem plenty for a small site, but on large sites it isn’t all that much. On the other hand, if you get little to no traffic from these search engines, it’s a good way to save some bandwidth.
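The 8,640 figure is simply the number of seconds in a day divided by the delay. A small sketch that reads the delay with Python's standard urllib.robotparser (its crawl_delay method needs Python 3.6 or later) and does the arithmetic; the bingbot block is a made-up example.

```python
from urllib.robotparser import RobotFileParser

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Hypothetical block that slows down a single crawler.
ROBOTS_TXT = """\
User-agent: bingbot
Crawl-delay: 10
"""


def max_fetches_per_day(robots_txt, user_agent):
    """Upper bound on daily fetches implied by crawl-delay, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    if not delay:
        return None
    return SECONDS_PER_DAY // delay


print(max_fetches_per_day(ROBOTS_TXT, "bingbot"))  # 8640
```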

sitemap directive for XML Sitemaps

Using the sitemap directive you can tell search engines – specifically Bing, Yandex and Google – the location of your XML sitemap. You can, of course, also submit your XML sitemaps to each search engine using their respective webmaster tools solutions. We, in fact, highly recommend that you do. Search engines’ webmaster tools programs will give you very valuable information about your site. If you don’t want to do that, adding a sitemap line to your robots.txt is a good quick option.
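As a sketch of what such a line looks like and how a crawler might pick it up, the snippet below parses a made-up robots.txt with Python's standard urllib.robotparser. The site_maps method needs Python 3.8 or later, and the sitemap URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: the Sitemap line takes an absolute URL and is not
# tied to any user-agent block.
ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin/

Sitemap: https://www.example.com/sitemap_index.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns None when no Sitemap line is present.
sitemaps = parser.site_maps() or []
print(sitemaps)  # ['https://www.example.com/sitemap_index.xml']
```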


Validate your robots.txt

There are various tools out there that can help you validate your robots.txt, but when it comes to validating crawl directives, we like to go to the source. Google has a robots.txt testing tool in its Google Search Console (under the Crawl menu) and we’d highly suggest using that:

robots.txt tester

Be sure to test your changes thoroughly before you put them live! You wouldn’t be the first to accidentally robots.txt-block your entire site into search engine oblivion.
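Alongside a testing tool, a few scripted expectations can catch the worst mistakes before a file goes live. The sketch below uses Python's standard urllib.robotparser, which does not understand wildcards and may differ from real search engines, so treat it as a rough safety net only. The helper name, the paths and the ExampleBot agent are assumptions.

```python
from urllib.robotparser import RobotFileParser


def unexpected_results(robots_txt, user_agent, expectations):
    """Return the paths whose crawlability differs from what you expect.

    expectations maps a path to True (should be crawlable) or False.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path for path, should_allow in expectations.items()
        if parser.can_fetch(user_agent, path) != should_allow
    ]


# A draft that accidentally blocks the whole site.
DRAFT = """\
User-agent: *
Disallow: /
"""

EXPECTED = {"/": True, "/blog/": True, "/wp-admin/": False}

problems = unexpected_results(DRAFT, "ExampleBot", EXPECTED)
print(problems)  # ['/', '/blog/']
```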


5 Responses to robots.txt: the ultimate guide

  1. Alessia Martalò
    By Alessia Martalò on 27 May, 2016

    Very useful guide. I think in this case “less is more”. Better avoiding to put *everything* in robots!

  2. Keith
    By Keith on 21 May, 2016

    Whew, that was quite a read.

    I don’t use a robots.txt file normally, however, I’m thinking it may be a good idea to use one and include /wp-includes/ and /wp-content/ into them, right? So they don’t use up the Crawl Budget? Should I also include /wp-admin/ or does that not get crawled by default?

    Sorry if the questions sound stupid, I just want to make certain that my websites are getting crawled correctly, and not wasting time crawling thousands of default WordPress files that have no importance to my site.

    If you don’t have time to reply – no problem. Thanks for the in-depth article anyway. I’ll be attempting to add those to my own robots.txt file in the future either way!

  3. Nidhi Shah
    By Nidhi Shah on 21 May, 2016

    Hi Joost,

    Very useful guide on robots.txt file. I had little confusion with it but after reading the whole post, it has cleared my doubts on it. Thanks for sharing this useful information with us.

  4. Rajendra
    By Rajendra on 19 May, 2016

    Can you please share the basic robots.txt format for wordpress blog

  5. azukadm
    By azukadm on 19 May, 2016

    Just what I was looking for…. but how do I know the perfect setting for it
