Ever since the announcement on the Google Blog, and more recently Yahoo’s announcement that they’ve enhanced their support for it, I’ve been meaning to play with the X-Robots-Tag header. This HTTP header allows you to do what you’d normally do in a robots meta tag, but in an HTTP header instead, which has some pretty cool applications. I’ll show you a few cool things you can do with this, but first some theory. If you don’t feel like reading theory, skip ahead to the example uses of the X-Robots-Tag.
As Sebastian explained in an excellent post on SEOmoz, there are two different kinds of directives: crawler directives and indexer directives.
The robots.txt file only contains the so-called crawler directives, telling search engines, identified by their User-agent:, where they are not allowed to go by using Disallow:, where they can (and should) go by using Allow:, and where to find your XML sitemap by using Sitemap:.
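For illustration, a robots.txt file using all of these crawler directives could look like this (the paths and sitemap URL here are made up):

User-agent: *
Disallow: /private/
Allow: /private/overview.html
Sitemap: http://example.com/sitemap.xml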
As Sebastian pointed out and explains thoroughly in another brilliant post, pages that search engines aren’t allowed to spider can still show up in the search results when they have enough links pointing at them. This basically means that if you want to really hide something from search engines, and thus from people using search, robots.txt just isn’t good enough.
Indexer directives are directives that are, even with the birth of the X-Robots-Tag, set on a per-page or even per-element basis. Up until July 2007, there were two: the microformat rel=”nofollow”, which means that the link in question should not pass authority / PageRank, and the Meta Robots tag.
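For reference, a Meta Robots tag that keeps a page out of the index and stops link following goes in the <head> of the HTML page and looks like this:

<meta name="robots" content="noindex, nofollow" />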
With the Meta Robots tag, you can really prevent search engines from showing the pages you block in the search results. You can achieve the same with the relatively new X-Robots-Tag HTTP header. If you don’t know what an HTTP header is, I’d suggest reading the Wikipedia page on it, but in short: look at it as the envelope around your content. This HTTP header is better than the meta robots tag for a couple of reasons, one of which is that you can send those headers for other kinds of documents too. So, let’s get into some examples.
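To make that concrete, here’s a simplified sketch of a response carrying the header (the exact set of headers will vary per server):

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex, noarchive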
If you want to prevent search engines from showing files you’ve generated with PHP, add the following at the top of the file, before any other output:
header("X-Robots-Tag: noindex", true);
This would not prevent search engines from following the links on those pages; if you want to do that as well, use the following:
header("X-Robots-Tag: noindex, nofollow", true);
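Remember that PHP’s header() only works when it’s called before any output is sent to the browser, so in context a (minimal, made-up) page would look like this:

<?php
// The indexer directives have to go out before any HTML is echoed.
header("X-Robots-Tag: noindex, nofollow", true);
?>
<html>
<head><title>Hidden from search</title></head>
<body>The robots directives for this page travel in the HTTP header.</body>
</html>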
But doing it in PHP is probably not the easiest use for this kind of thing. I myself greatly prefer setting headers in Apache when possible. If, for instance, you wanted to prevent search engines from caching or showing a preview of all .doc files on your domain, you would only have to do the following:
<FilesMatch "\.doc$">
Header set X-Robots-Tag "index, noarchive, nosnippet"
</FilesMatch>
Or, if you want to do this for both .doc and .pdf files:
<FilesMatch "\.(doc|pdf)$">
Header set X-Robots-Tag "index, noarchive, nosnippet"
</FilesMatch>
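One caveat: FilesMatch patterns are case-sensitive, so a file named report.PDF would slip through the examples above. Assuming your Apache uses PCRE for these patterns (Apache 2.x does), you can add the (?i) flag to match regardless of case:

<FilesMatch "(?i)\.(doc|pdf)$">
Header set X-Robots-Tag "index, noarchive, nosnippet"
</FilesMatch>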
Or take another case: your robots.txt file itself is showing up in the search results. Adding this to your Apache config or your .htaccess file would solve that:
<FilesMatch "^robots\.txt$">
Header set X-Robots-Tag "noindex"
</FilesMatch>
I had a slightly uncomfortable feeling when writing this down, so I e-mailed Matt Cutts asking the following: “<snip> would that mean that you will still fetch it for robots.txt purposes, but won’t show it in the index?”. I’m waiting for him to answer, and will add his response here once I have it.
I’ve quickly created a bookmarklet which shows all the headers for a page (works in Mozilla-based browsers only, I think), and a Greasemonkey script which pops up a notice when a page is using an X-Robots-Tag header.
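If you’d rather check from PHP, a quick one-off sketch (the URL is just an example) using PHP’s get_headers() prints every response header, including any X-Robots-Tag:

<?php
// Print all response headers for a URL, X-Robots-Tag included.
print_r(get_headers('http://example.com/robots.txt'));
?>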
As you can see, if you combine the examples above with the stuff you can learn from, for instance, AskApache’s .htaccess tutorial, the X-Robots-Tag HTTP header becomes a very powerful tool. Use it wisely and with caution, as you won’t be the first to block your entire site by accident, but it’s a great addition to your toolset if you know how to use it.