Robots.txt tells a search engine which pages it’s allowed to spider, to “see”, and which pages it cannot “see”. In that respect robots.txt differs from <meta name="robots"> tags, which tell search engines, on the individual pages themselves, whether they may include those pages in their index. The difference is subtle, but important. Because of it, the suggested robots.txt in the codex is wrong. Let me explain:
Google sometimes lists URLs that it’s not allowed to spider because they are blocked by robots.txt. It does so when a lot of links point to such a URL. A good example of this is a search for [RTL Nieuws] (disclosure: RTL is a client of mine). rtlnieuws.nl 301 redirects to the news section of rtl.nl, but rtlnieuws.nl/robots.txt exists and has the following content:
User-agent: *
Disallow: /
Because of that, Google is not allowed to crawl rtlnieuws.nl and never sees the redirect. As a result, the links towards rtlnieuws.nl don’t count toward the news section on rtl.nl, and Google displays rtlnieuws.nl in the search results. This is unwanted behavior that we’re trying to fix, but for now it’s a good example of what I wanted to explain. In the same way, by blocking /wp-admin/ and /trackback/ in your robots.txt, you’re not preventing them from showing up in the search results.
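To make the crawling side of this concrete, here is a minimal sketch in Python using the standard library's urllib.robotparser. It feeds the rules quoted above to a parser and asks whether a crawler may fetch the homepage. The helper name may_crawl and the "Googlebot" user agent string are illustrative assumptions, and this only models a crawler that obeys robots.txt, not how Google decides what to list.

```python
from urllib.robotparser import RobotFileParser

# The rules served at rtlnieuws.nl/robots.txt, as quoted in the post.
RULES = """\
User-agent: *
Disallow: /
"""


def may_crawl(rules: str, user_agent: str, url: str) -> bool:
    """Return True if a crawler that obeys `rules` may fetch `url`.

    Hypothetical helper for illustration only.
    """
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Every path is disallowed, so the crawler never requests the page
    # and therefore never sees the 301 redirect behind it.
    print(may_crawl(RULES, "Googlebot", "http://rtlnieuws.nl/"))
```

The point of the sketch is that the rule only answers "may I fetch this?". It says nothing about whether the URL may appear in the index.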
Unfortunately, the /wp-admin/ block was recently added to WordPress core, because of this Trac ticket. In the discussion on that ticket, I’ve proposed another solution in this patch. This solution involves sending an X-Robots-Tag header, which is the HTTP header equivalent of a <meta name="robots"> tag. This would in fact remove all wp-admin directories from Google search results.
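The idea of the header can be sketched in a few lines of Python. This is not the patch itself, just an illustration under stated assumptions: the helper name robots_headers is hypothetical, the admin area is assumed to live under /wp-admin/ at the site root, and "noindex" is used as the header value because it is a standard X-Robots-Tag directive.

```python
def robots_headers(path: str) -> list[tuple[str, str]]:
    """Return extra HTTP response headers for a request path.

    Hypothetical helper: a real site would attach these headers
    to the response it sends for `path`.
    """
    # Assumption: the admin area lives under /wp-admin/ at the site root.
    if path == "/wp-admin" or path.startswith("/wp-admin/"):
        # "noindex" asks search engines to keep this URL out of their index.
        return [("X-Robots-Tag", "noindex")]
    return []


if __name__ == "__main__":
    print(robots_headers("/wp-admin/options.php"))
    print(robots_headers("/2010/01/hello-world/"))
```

A crawler only sees this header if it is allowed to request the URL, so the approach only works when /wp-admin/ is not also disallowed in robots.txt.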
WordPress Robots.txt blocking Search results and Feeds
Two other patterns are blocked in the suggested robots.txt: /*?, which blocks every URL with a question mark in it and as such all search results, and */feed/, which blocks all feeds. The first is not a good idea because if someone were to link to your search results, you wouldn’t benefit from those links.
A better solution would be to add a <meta name="robots" content="noindex, follow"> tag to those search results pages, as it would prevent the search results from ranking but would allow the link “juice” to flow through to the returned posts and pages. This is what my WordPress SEO plugin does as soon as you enable it. It also does this for wp-admin and login and registration pages.
I’m aware that this differs from Google’s guidelines on this topic at the moment, which state:
Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.
I’ve reached out to Google to get clarification on whether they would say my solution is acceptable as well, or perhaps even better :) .
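The noindex, follow approach for search results described above can be sketched roughly as follows. This is not how the plugin is implemented; it is a small Python illustration that assumes search result URLs are recognisable by the s query parameter, which is the WordPress default. The function name robots_meta_for is hypothetical.

```python
from urllib.parse import parse_qs, urlparse

NOINDEX_FOLLOW = '<meta name="robots" content="noindex, follow">'


def robots_meta_for(url: str) -> str:
    """Return a robots meta tag for search result URLs, else an empty string.

    Hypothetical helper for illustration only.
    """
    query = parse_qs(urlparse(url).query, keep_blank_values=True)
    # Assumption: "s" is the search parameter (the WordPress default).
    if "s" in query:
        return NOINDEX_FOLLOW
    return ""


if __name__ == "__main__":
    print(robots_meta_for("https://example.com/?s=robots"))
    print(repr(robots_meta_for("https://example.com/about/")))
```

Because the page stays crawlable, a search engine can read the tag, keep the page out of its index, and still follow the links to the posts and pages it lists.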
Blocking /feed/ is a bad idea because an RSS feed is actually a valid sitemap for Google. Blocking it would prevent Google from using that to find new content on your site. So, my suggested robots.txt for WordPress is actually a lot smaller than the Codex one. I only have this:
User-Agent: *
Disallow: /wp-content/plugins/
I block the plugins directory because some plugin developers have the annoying habit of adding index.php files to their plugin directories that link back to their websites. For all other parts of WordPress, there are better solutions for blocking.
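As a quick sanity check of what this smaller file does and does not block, here is a minimal Python sketch using the standard library's urllib.robotparser. The example.com URLs, the plugin path, and the helper name build_parser are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The suggested robots.txt from this post.
SUGGESTED_RULES = """\
User-Agent: *
Disallow: /wp-content/plugins/
"""


def build_parser(rules: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(SUGGESTED_RULES)
    # Illustrative URLs: a plugin file, the feed, and a search results page.
    for url in (
        "https://example.com/wp-content/plugins/some-plugin/index.php",
        "https://example.com/feed/",
        "https://example.com/?s=robots",
    ):
        print(url, parser.can_fetch("Googlebot", url))
```

Only the plugins directory is off limits. The feed and the search results stay crawlable, which is what the noindex, follow tag relies on.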
The other WordPress Robots.txt suggestions
The other sections of the suggested robots.txt are a bit old and no longer needed. The Digg mirror section is something for us old guys who remember when Digg used to send loads of traffic. Googlebot Image and Media Partner are still around, but if you only have the above in your WordPress robots.txt file, you don’t need specific lines for them.