Playing with the X-Robots-Tag HTTP header
Ever since the announcement on the Google Blog, and more recently Yahoo's announcement that they've enhanced their support for it, I've been meaning to play with the X-Robots-Tag header. This HTTP header lets you do what you'd normally do in a robots meta tag, but in an HTTP header, which has some pretty cool applications. I'll show you a few cool things you can do with this, but first some theory. If you don't feel like that, skip to the example uses of the X-Robots-Tag.
As Sebastian explained in an excellent post on SEOmoz, there are two different kinds of directives: crawler directives and indexer directives.
Crawler directives
The robots.txt file only contains the so-called crawler directives. These tell search engines, identified by their User-agent:, where they are not allowed to go by using Disallow:, where they can (and should) go by using Allow:, and where to find your sitemap by using Sitemap:.
As Sebastian pointed out and explains thoroughly in another brilliant post, pages that search engines aren't allowed to spider can still show up in the search results when they have enough links pointing at them. This basically means that if you really want to hide something from the search engines, and thus from people using search, robots.txt just isn't good enough.
Indexer directives
Indexer directives are directives that are, even with the birth of the X-Robots-Tag, set on a per-page or even per-element basis. Up until July 2007, there were two: the microformat rel="nofollow", which means that the link should not pass authority / PageRank, and the Meta Robots tag.
With the Meta Robots tag, you can really prevent search engines from showing the pages you block in the search results. You can achieve the same with the relatively new X-Robots-Tag HTTP header. If you don't know what an HTTP header is, I'd suggest reading the Wikipedia page on it, but in short: look at it as the envelope around your content. This HTTP header is better than the meta robots tag for a couple of reasons. One of them is that you can send those headers for other documents too, such as files that aren't HTML and so have nowhere to put a meta tag. So, let's get into some examples.
Example uses of the X-Robots-Tag
If you want to prevent search engines from showing files you've generated with PHP, add the following in the header file:
header('X-Robots-Tag: noindex', true);
This would not prevent search engines from following the links on those pages. If you want to prevent that too, do the following:
header('X-Robots-Tag: noindex, nofollow', true);
But doing it in PHP is probably not the easiest use for this kind of thing. I myself greatly prefer setting headers in Apache, when possible. Consider, for instance, preventing search engines from caching / showing a preview for all .doc files on your domain. You would only have to do the following:
<FilesMatch "\.doc$">
  Header set X-Robots-Tag "index, noarchive, nosnippet"
</FilesMatch>
Or, if you want to do this for both .doc and .pdf files:
<FilesMatch "\.(doc|pdf)$">
  Header set X-Robots-Tag "index, noarchive, nosnippet"
</FilesMatch>
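The Header directive used here is provided by Apache's mod_headers module. As a minimal sketch, assuming the same .doc and .pdf rule as above and a setup where you are not sure the module is loaded, the block can be wrapped in an IfModule check so the rest of the .htaccess keeps working without it:

```apache
# Only apply the rule when mod_headers is available; without the guard,
# a missing module makes Apache reject the Header directive.
<IfModule mod_headers.c>
  <FilesMatch "\.(doc|pdf)$">
    # Allow indexing, but ask for no cached copy and no snippet.
    Header set X-Robots-Tag "index, noarchive, nosnippet"
  </FilesMatch>
</IfModule>
```

The trade-off is that the guard fails silently: if the module is missing, no header is sent at all, so it is worth checking the response headers afterwards.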
Or take another case: your robots.txt file itself is showing up in the search results. Adding this to your Apache config or your .htaccess file would solve that:
<FilesMatch "robots\.txt">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
I had a slightly uncomfortable feeling when writing this down, so I e-mailed Matt Cutts asking the following: "<snip> would that mean that you will still fetch it for robots.txt purposes, but won't show it in the index?". I'm waiting for him to answer, and will add his response here once I have it.
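The same pattern should carry over to other files you'd rather keep out of the results, such as XML data files. A minimal sketch, assuming every .xml file on the domain should be treated alike (the pattern is an illustrative assumption, not something tested here):

```apache
# Assumption: no .xml file on this domain should appear in search results.
# The header only asks engines not to index the files; it does not block
# them from being fetched the way a Disallow: line in robots.txt would.
<FilesMatch "\.xml$">
  Header set X-Robots-Tag "noindex"
</FilesMatch>
```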
Tools
I've quickly created a bookmarklet which shows all the headers for a page (works in Moz browsers only, I think), and a Greasemonkey script which pops up when a page is using an X-Robots-Tag header.
Conclusion
As you can see, if you combine the examples above with the stuff you can learn from for instance AskApache's .htaccess tutorial, the X-Robots-Tag HTTP header becomes a very powerful tool. Use it wisely and with caution, as you won't be the first to block your entire site by accident, but it's a great addition to your toolset if you know how to use it.
Louis Jan 21st, 2008 at 00:35
Quite an interesting article Joost.
I like the idea of centralising the management of crawlers in the .htaccess.
Jeff_ Jan 21st, 2008 at 00:58
Hmmm.. I see that Google and Yahoo support X-Robots; any sign of Live, Ask, et al joining the club? As discussed, for the engines that support them, this is an excellent method for controlling indexing of site resources such as text, multimedia, and PDF files. Usefulness of X-Robots will continue to grow along with increasing support from the search engines. Cheers on the timely post!
Joost de Valk Jan 21st, 2008 at 08:35
@Louis: yeah, me too :)
@Jeff_: thx, and indeed, more support would make it even more useful.
Sebastian Jan 21st, 2008 at 10:21
Great post, Joost! I'd like to add that MSN and Yahoo might show URL references on SERPs even for pages with a "noindex" REP tag.
Joost de Valk Jan 21st, 2008 at 10:23
@Sebastian: they do? Darn that's annoying :) Then the only way of really making sure they won't show it is cloaking a 404 to them :)
JohnMu Jan 21st, 2008 at 10:35
Joost, I checked about blocking the robots.txt a while back: you can even list it in your robots.txt if you wanted to, it will still get read and processed. Applying an x-robots-tag to it should also be fine.
Note that you can't apply all of that to your sitemap file(s): blocking them in your robots.txt will prevent the search engines from being able to read the files, though using the x-robots-tag should be fine.
Sebastian Jan 21st, 2008 at 11:03
Joost, instead of faking noindex'ed URLs a 404/410 HTTP response better cloak in a way that's capable to funnel link love to other pages with a 301. That's the definitive way of deindexing anyway. ;)
John, AFAIK we can apply a "noindex" REP tag to sitemap files, as well as other indexer directives. Only disallow'ing them in robots.txt prevents them from processing. That makes sense by the way, because MSN lists sitemap files that are referenced in robots.txt or submitted directly on the SERPs, even when there's not a single link pointing to them.
Sint Jan 21st, 2008 at 12:57
If you could block the robots.txt and sitemaps in this way without keeping them from being used, this could be very helpful.
I would then also consider blocking all my .xml-files, which don't only include sitemaps, but also data files which are processed within the site using Ajax or generated to aggregate content to clients, also examples of stuff you wouldn't want to appear in search engine results.
Joost de Valk Jan 21st, 2008 at 13:11
@JohnMu: thx for dropping by, and confirming that this works as expected.
@Sebastian: hehe, so true :)
@Sint: not a bad idea in your case I guess :) it's quite easy to do as you can see.
Hendrik Jan 21st, 2008 at 13:47
Nice post, Joost. I'll test it as soon as possible. Do you think the robots.txt will disappear quickly?
Joost de Valk Jan 21st, 2008 at 15:37
@Hendrik: that would depend on how often Google and other se's spider your site and your robots.txt.
Rob Jan 21st, 2008 at 17:19
I like the idea of using x-http-headers, but it doesn't replace robots.txt until it is accepted globally.
Just as with the ability to specify certain directives for certain engines in robots.txt.
Correct me if I'm wrong but strictly speaking there is no allow: directive for robots.txt, only disallow:
SEO Blog Jan 23rd, 2008 at 12:56
Didn't know about this tag, but it seems very useful. I also downloaded your Greasemonkey script to be sure that the pages where I place my links have no x-robots nofollow header ;). And it's also very useful for detecting in time that you've blocked your entire site :).
Thx!
Baby Chloe Jan 23rd, 2008 at 19:06
Hi Joost, what are your thoughts on getting people to do link exchanges with a 'links' page, and then sending an x-robots noindex, nofollow header for that page via PHP?
Would google consider my site to have a huge influx of one way links ?
Chloe.
Joost de Valk Jan 23rd, 2008 at 19:45
Chloe: that sounds like a black idea :-)
Sebastian Jan 23rd, 2008 at 20:52
Chloe, when you add a "nofollow" to the "noindex" X-Robots-Tag you're fine with Google. But then, why trade links at all?
Sint Jan 24th, 2008 at 11:08
Because the people you exchange these links with won't know you are nofollowing them?
quasigoal Feb 12th, 2008 at 12:02
Joost, interesting blog, useful! I've subscribed the feed. :0)
Greetings from Italy.
John S. Britsios Mar 21st, 2008 at 16:37
Google have indexed anchor links i.e http://www.example.com/example/example.html#post100
Is there a X-Robots workaround in the htaccess for that, since I cannot get rid of them in their Webmaster Tools URL removal tool?
R. Richard Hobbs Mar 25th, 2008 at 18:19
howbout the notion of using noindex,follow (or maybe noindex just by itself so (hopefully) you dont get penalised for dupe content, but allow the Googlebot (or other spider) make the most of all the linky love? Would also seem to take care of a problem of say, finding your feed your highest ranked page? (i.e. instead of disallow: for /feed/ use noindex,follow???)
Thanks for any thoughts or ideas.
Brett Jul 1st, 2008 at 19:30
Hi,
You asked if MSNBot supported X-Robots tags. They do. Read the blog post
5ubliminal Jul 11th, 2008 at 01:06
I'd take a look at the page linked here in my comment. You've got all you need there to put your link exchange partners out of business.
Crawler Detection + X-Robots-Tag = Undetectable Link
LoveHate !If crawler detection is in place no greasy script will do anything. And when someone is smart enough to implement this, he goes all the way. So ... no greasy countermeasures!