WordPress robots.txt Example

Robots.txt is a way to tell a search engine which pages it’s allowed to spider, to “see”, and which pages it cannot “see”. That makes robots.txt different from meta name="robots" tags, which tell search engines, on each individual page, whether they can include that page in their index or not. The difference is subtle, but important, and it is the reason the suggested robots.txt in the Codex is wrong. Let me explain:

Google sometimes lists URLs that it’s not allowed to spider because they’re blocked by robots.txt, simply because a lot of links point to those URLs. A good example of this is a search for [RTL Nieuws] (disclosure: RTL is a client of mine). rtlnieuws.nl 301 redirects to the news section of rtl.nl. But… rtlnieuws.nl/robots.txt exists… And has the following content:

User-agent: *
Disallow: /

Because Google isn’t allowed to crawl rtlnieuws.nl, it never sees the 301 redirect. As a result, the links towards rtlnieuws.nl don’t count toward the news section on rtl.nl, and Google displays rtlnieuws.nl in the search results. This is unwanted behavior that we’re trying to fix, but for now it’s a good example of what I wanted to explain: by blocking /wp-admin/ and /trackback/ in your robots.txt, you’re not preventing them from showing up in the search results.

Unfortunately, recently the /wp-admin/ block was added to WordPress core, because of this Trac ticket. In the discussion on that ticket, I’ve proposed another solution in this patch. This solution involves sending an X-Robots-Tag header, which is the HTTP header equivalent of a meta name="robots" tag. This would in fact remove all wp-admin directories from Google search results.
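
As a rough illustration of the header approach (not the actual patch from the ticket), the idea can be sketched in Python rather than in WordPress’s own PHP. The helper name `robots_headers` and the header value `noindex` are assumptions made for this sketch; the post only says that an X-Robots-Tag header is sent. Keep in mind that a crawler can only see this header if robots.txt lets it request the URL in the first place.

```python
def robots_headers(path):
    """Return extra HTTP response headers for a request path.

    Hypothetical helper: anything under /wp-admin/ gets an X-Robots-Tag
    header asking search engines not to index it.
    """
    if path == "/wp-admin" or path.startswith("/wp-admin/"):
        # Assumed value; "noindex" is a standard robots directive.
        return {"X-Robots-Tag": "noindex"}
    return {}


if __name__ == "__main__":
    # Sample paths are made up for illustration.
    for path in ("/wp-admin/options.php", "/2012/02/some-post/"):
        print(path, robots_headers(path))
```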

WordPress Robots.txt blocking Search results and Feeds

There are two other sections which are blocked in the suggested robots.txt, /*?, which blocks everything with a question mark and as such all search results, and */feed/, which blocks all feeds. The first is not a good idea because if someone were to link to your search results, you wouldn’t benefit from those links.
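
To see what those two patterns actually catch, here is a minimal Python sketch of wildcard matching as Google documents it for robots.txt: `*` matches any sequence of characters, a trailing `$` anchors the end of the URL, and everything else is matched as a prefix. The function name `rule_matches` and the sample paths are made up for illustration, and other crawlers may interpret wildcards differently.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches a URL path.

    '*' matches any run of characters, a trailing '$' anchors the end,
    and otherwise the pattern only has to match the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with a "match anything" gap.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None


if __name__ == "__main__":
    for pattern, path in [
        ("/*?", "/?s=robots"),            # a search results URL
        ("/*?", "/2012/02/some-post/"),   # no question mark in the path
        ("*/feed/", "/feed/"),            # the main feed
        ("*/feed/", "/category/seo/feed/"),
    ]:
        print(pattern, path, rule_matches(pattern, path))
```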

A better solution would be to add a <meta name="robots" content="noindex, follow"> tag to those search results pages, as it would prevent the search results from ranking but would allow the link “juice” to flow through to the returned posts and pages. This is what my WordPress SEO plugin does as soon as you enable it. It also does this for wp-admin and login and registration pages.
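
The decision itself is simple enough to sketch. The Python below is only an illustration of the idea, not how the plugin is implemented: it assumes that WordPress search results are the URLs carrying an `s` query parameter, and the function name `robots_meta_for` and the example.com URLs are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

NOINDEX_FOLLOW = '<meta name="robots" content="noindex, follow">'


def robots_meta_for(url):
    """Return the robots meta tag to print for a URL, or '' for none.

    Assumption: a URL is a search results page when it has an `s`
    query parameter, even an empty one.
    """
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return NOINDEX_FOLLOW if "s" in query else ""


if __name__ == "__main__":
    print(robots_meta_for("http://example.com/?s=robots"))
    print(repr(robots_meta_for("http://example.com/about/")))
```

Because the page stays crawlable, a search engine can read the tag, drop the page from its index and still follow the links on it.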

I’m aware that that is different from Google’s guidelines on this topic at the moment, which state:

Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.

I’ve reached out to Google to get clarification on whether they would say my solution is acceptable as well, or perhaps even better :) .

Blocking /feed/ is a bad idea because an RSS feed is actually a valid sitemap for Google. Blocking it would prevent Google from using that to find new content on your site. So, my suggested robots.txt for WordPress is actually a lot smaller than the Codex one. I only have this:

User-Agent: *
Disallow: /wp-content/plugins/

I block the plugins directory because some plugin developers have the annoying habit of adding index.php files to their plugin directories that link back to their websites. For all other parts of WordPress, there are better solutions for blocking.
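
A quick way to sanity-check a small robots.txt like this is Python’s standard `urllib.robotparser` module. The sketch below assumes a hypothetical site at example.com and a hypothetical plugin path; it only shows that the plugins directory is blocked while feeds, search results and uploads stay crawlable. Note that this parser does plain prefix matching and does not understand wildcards such as /*?.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt suggested above.
SUGGESTED = """\
User-Agent: *
Disallow: /wp-content/plugins/
"""


def allowed(path, agent="Googlebot"):
    """Return True if the suggested robots.txt lets `agent` fetch `path`."""
    parser = RobotFileParser()
    parser.parse(SUGGESTED.splitlines())
    # example.com is a placeholder host; only the path matters here.
    return parser.can_fetch(agent, "http://example.com" + path)


if __name__ == "__main__":
    for path in (
        "/wp-content/plugins/some-plugin/index.php",  # hypothetical plugin
        "/feed/",
        "/?s=robots",
        "/wp-content/uploads/image.png",
    ):
        print(path, allowed(path))
```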

The other WordPress Robots.txt suggestions

The other sections of the suggested robots.txt are a bit old and no longer needed. The Digg mirror block is something for us old guys who remember when Digg used to send loads of traffic. Googlebot-Image and Mediapartners-Google still exist, but if you only have the above in your WordPress robots.txt file, you don’t need specific lines for them.

 

36 Responses

  1. By Chris Langille on 10 February, 2012

    Joost, can we use both meta tags and robots.txt at the same time?

    I am using a theme that already has the meta tag baked in, however I cannot figure out how to modify it, so I was trying to work around that by using robots.txt

    • By Joost de Valk on 10 February, 2012

      Hmm, well as you can see from my explanation above they serve different purposes. But seriously, edit them out of your theme.

      • By Chris Langille on 6 March, 2012

        I’m sorry Yoast, I’m a bit green with this still.

        Edit what out of the theme? The meta tags or the robots.txt?

        Currently I’m using your recommended robots.txt above.

        And I’m guessing the meta tags would be in my header.php file?

        Thanks!!!!

  2. By Jeremy Myers on 10 February, 2012

    I have been looking for clarification on this for a while. Thanks.

    What about the digg mirror and google image bot sections which the codex also recommends?

    # Google Image
    User-agent: Googlebot-Image
    Disallow:
    Allow: /*

    # Google AdSense
    User-agent: Mediapartners-Google*
    Disallow:
    Allow: /*

    # digg mirror
    User-agent: duggmirror
    Disallow: /

    • By Joost de Valk on 10 February, 2012

      I should have added my opinion on that to the article, will do that now.

    • By Joost de Valk on 10 February, 2012

      Done :)

      • By Jeremy Myers on 10 February, 2012

        Cool! Thanks. Very helpful (as always).

  3. By Michael Kovis on 10 February, 2012

    You have really brought up some good points Joost. So basically, instead of using the /*? in your robots.txt, it would be a lot more beneficial to make sure that you make good use of the canonical URL meta. I like it. I have never thought of it this way until reading this post.

    Thanks!

  4. By Eric Christopher on 10 February, 2012

    Thanks! It’s funny because this solves a problem I just discovered. When we build web sites for our clients, we do it on a sub-domain and set the WP privacy to not be indexed.

    However, if you type the exact URL into Google, it will show that it has been indexed. So I have been looking for a solution and here it is!

    When you’re doing local SEO, you don’t really want your competition knowing what you are doing until it is too late.

    Appreciate you sharing this… I’m very surprised that you don’t offer a membership subscription to the public. Any reason why?

  5. By Paul Salmon on 10 February, 2012

    I remember when I first started out online it seemed many people were creating huge robots.txt files to prevent specific files and folders from being spidered. It eventually became such a huge pain to determine what was being spidered that it wasn’t worth the effort.

    It seems your solution is much more elegant and less error prone.

  6. By Jose Tinto on 10 February, 2012

    My WP search result pages are getting indexed by Google even after I disallowed the parameters in robots.txt. I use Yoast’s WordPress SEO plugin too. Any suggestions?

  7. By JP on 11 February, 2012

    What about affiliate links? Your earlier advice suggests that they should be in robots.txt.

    http://yoast.com/affiliate-links-and-seo/

  8. By Sylvia on 11 February, 2012

    Very confused about this.

    A while ago I was advised to use the following, especially for security purposes.

    User-agent: *
    Disallow: /wp-admin
    Disallow: /wp-includes
    Disallow: /wp-content/plugins
    Disallow: /wp-content/cache
    Disallow: /wp-content/themes
    Disallow: /trackback
    Disallow: /feed
    Disallow: /comments
    Disallow: */trackback
    Disallow: */feed
    Disallow: */comments
    Disallow: /*?*
    Disallow: /*?

    Allow: /wp-content/uploads

    But you are suggesting to delete all that? Would that not increase my security risk?

    And what about comments? Would it be best not to index them for SEO purposes, as I do now, or to allow Google to index them?

    Would love to have your opinion on these.

    Sylvia

    • By Joost de Valk on 11 February, 2012

      Hi Sylvia,

      Sorry to ask but did you read the article?? I’m specifically explaining why you shouldn’t use that.

  9. By Sylvi on 11 February, 2012

    Hi Joost. Thanks for your speedy reply. I just read the article again and still don’t understand it. Sorry. The patch you mention is quite technical and I don’t want to change the core WordPress files. You do not mention security in your article nor comments. I also currently use Thesis with integrated SEO which I’m happy with, so will not be able to use your SEO plugin for this site. I would be able to change the functions file for Thesis.

    It would be really helpful if you could provide some direction on what best to do for a non-technical person like me, so that I can keep WordPress safe and have maximum exposure to search engines.

    I would also really appreciate your thoughts on whether to block or index comments with regards to SEO.

    Sylvia

    • By Joost de Valk on 11 February, 2012

      There is no core change in the post :) it’s just robots.txt, which is a separate file that isn’t shipped with WordPress. When you do *not* have a robots.txt file, WordPress serves a default robots.txt file, that’s what I talked about as well.

  10. By Olaf Pijl on 12 February, 2012

    Thanks again for sharing!

  11. By Phil Greeno on 12 February, 2012

    Hi Joost,
    Thanks for the great advice on robots.txt.

    Your SEO plugin and general advice has been a game changer in the way I approach wordpress websites for clients.

    I’ve just checked the robots.txt file of this particular blog. Besides blocking the plugins folder as advised above, is there any reason why you have also blocked something called /out/?

    Sorry to be nosey but thought you could also offer a further “case in point” of why & when to manually block things.

    Cheers

    Phil

  12. By Kathy Alice on 13 February, 2012

    Great post. Another reason why blocking the rss feed is a bad idea is that Bing is on record saying that they prefer to find content that way. Since for most (although not all) sites that I have looked at, the Bing indexation is not as complete as Google’s this further supports your point.

  13. By Dale Decker on 13 February, 2012

    Thanks so much for all the great information on your site.

    I notice you don’t include the sitemaps line. Would including it cause any problems, or do search engines just know where to look for it these days?

    Thanks so much!

  14. By Gagan on 15 February, 2012

    Guys, you know what, Google is not even paying attention to robots.txt…. On my blog, the first thing I checked was robots.txt, since my earlier attempts at blogging were really messed up in Google search: Google listed every plugin I was using, with every file in the folder. This time the robots.txt contained:
    User-agent: Mediapartners-Google
    Disallow:

    User-agent: *
    Disallow: /search
    Allow: /

    Sitemap: http://www.iamme.in/feeds/posts/default?orderby=updated
    Still, results from the search section are popping up. I don’t know why this happens, but Google’s spiders are going nuts, I guess……

  15. By KUASHA on 15 February, 2012

    I found this post through your SEO plugin’s recent posts widget :D. It’s really important, because some days ago Google Webmaster Tools showed that robots.txt was blocking some critical things.
    I’m using your WordPress SEO plugin; do I need to add any meta tags for wp-admin?
    Thanks again!

  16. By Gabe on 16 February, 2012

    “For all other parts of WordPress, there are better solutions for blocking.”

    Like what?

  17. By zimbrul on 16 February, 2012

    Does your amazing WordPress SEO plugin control all of this? I mean…can I use just WP SEO Plugin to output just the recommended code above?

  18. By Ben on 16 February, 2012

    Wouldn’t this be a simple fix to block googlebot from every URL with a question mark (like website.com/page/?utm_source=rss), yet still allow search results to be indexed?

    Disallow: /*?
    Allow: /*?s

    The disallow blocks all ? urls, and the allow command specifically allows just search results (which look like blog.com/?s=search-term).

    This is what I use. Any reason why this shouldn’t work?

  19. By Stijn on 16 February, 2012

    When seeing that yoast’s robots.txt is only three lines, I’m beginning to wonder if I’m doing something wrong with my longer one here. Feel free to comment.

    http://aardling.com/robots.txt

  20. By Olaf on 17 February, 2012

    Thanks Stijn, for sharing your robots.txt! It makes a very interesting discussion to this page, commenting on which crawlers should be disallowed.

  21. By Ben on 17 February, 2012

    Agreed. I’m very curious if blocking specific crawlers is worth it. I would think that any illegitimate crawlers would just ignore the robots.txt.

  22. By Mark Fleming on 20 February, 2012

    If I’m remembering this correctly, I had an issue last year where Webmaster Tools told me that I had hundreds of pages that were potentially duplicate content, all due to it indexing URLs that had a question mark in them, for example “…?replytocom…”. Google didn’t seem to like that at all. So I added a robots.txt line that disallowed indexing of URLs with question marks. And then I requested removal of all of these pages from the index.

    So, I’m wondering why I should simplify my robots.txt file to remove that and have Google index all these ?replytocom… URLs again. I’m really confused about all of this.

    Thanks.

  23. By Rilwis on 22 February, 2012

    Hi Yoast, this post is very informative. I just have a small question: you said in the post that your WP SEO plugin automatically adds a robots meta tag for the /wp-admin/ area if we choose its settings, but I can’t find any options related to the robots tag for /wp-admin/. Can you please point me to a guide?

  24. By Randall Collins on 25 February, 2012

    Signing up for newsletter.

  25. By Ron on 28 February, 2012

    Hi,
    I have a question about blocking specific content.
    I offer a free PDF report to people that register at my site. I send them a link to a page on my site that contains a link to my report.
    Using your SEO plugin, I set that page as NOINDEX, NOFOLLOW.
    But how do I block the robots from indexing the PDF file itself?
    Thanks

  26. By Vishmax on 3 March, 2012

    Thanks for sharing this. Can you provide a robots.txt in a downloadable format? I would be very thankful if you could.

  27. By Catherine on 4 March, 2012

    Hi,

    Question already asked above:

    “For all other parts of WordPress, there are better solutions for blocking.”

    Like what?

  28. By BJ on 9 March, 2012

    It seems there are a few questions about your plugin: Google still appears to be crawling these pages. I just noticed that we are getting a lot of errors within Webmaster Central for search results: http://screencast.com/t/KLPVGTzM.

    Can you confirm if this is already in place with the plugin, or is there a setting that I need to update for this to function properly?

    • By BJ on 9 March, 2012

      Ok, I was able to see that it was working as you mentioned in the article. But why would I still have these 100k pages not found within Webmaster Central? Wouldn’t these be connected?