Preventing your site from being indexed, the right way

December 17th, 2009 – 36 Comments

We said it in 2009, and we’ll say it again: it keeps amazing us that there are still people using just a robots.txt file to prevent indexing of their site in Google or Bing, and whose sites therefore show up in the search engines anyway. You know why it keeps amazing us? Because robots.txt doesn’t actually keep your site out of the search results, even though it does prevent indexing of your site. For more on robots.txt, please read robots.txt: the ultimate guide.

There is a difference between being indexed and being listed in Google

Before we explain things any further, we need to go over some terms here first:

  • Indexed / Indexing
    The process of downloading a site or a page’s content to the server of the search engine, thereby adding it to its “index”.
  • Ranking / Listing / Showing
    Showing a site in the search result pages (aka SERPs).

So, while the most common process goes from Indexing to Listing, a site doesn’t have to be indexed to be listed. If a link points to a page, domain or wherever, Google follows that link. If the robots.txt on that domain prevents indexing of that page by a search engine, the search engine will still show the URL in the results if it can gather from other variables that it might be worth looking at. In the old days, that could have been DMOZ or the Yahoo directory, but I can imagine Google using, for instance, your My Business details these days, or the old data from these projects. After all, there are more sites that summarize your website.
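To make the robots.txt side of this concrete, here is a minimal sketch in Python, using the standard library’s urllib.robotparser. The robots.txt content, the example.com URL and the may_fetch helper are made up for illustration. It only shows that a well-behaved crawler is not allowed to download the page; it says nothing about whether the URL ends up listed, which is exactly the gap described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks every crawler from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets this user agent download the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/private/page.html"  # made-up URL
    # The crawler may not download the page, so it never sees its content.
    print(may_fetch(ROBOTS_TXT, "Googlebot", url))
```

Because the page is never downloaded, anything written inside that page stays invisible to the crawler as well.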

Now if the explanation above doesn’t make sense, have a look at this 2009 Matt Cutts video explanation:

If you have reasons to prevent indexing of your website, adding that request to the specific page you want to block, like Matt is talking about, is still the right way to go. But you’ll need to let Google see that meta robots tag. So, if you want to effectively hide pages from the search engines, you need them to index those pages, even though that might seem contradictory. There are two ways of doing that.

Prevent indexing of your page by adding a meta robots tag

The first option to prevent indexing of your page is by using robots meta tags. We’ve got an ultimate guide on robots meta tags that’s more extensive, but it basically comes down to adding this tag to your page:

<meta name="robots" content="noindex,nofollow">

The issue with a tag like that is that you have to add it to each and every page.
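Since the tag has to be on every page, it’s easy to miss one. As a rough sketch, assuming you already have a page’s HTML as a string, the Python below collects the directives from any meta robots tag using only the standard library. The names RobotsMetaParser and robots_meta_directives are invented for this example.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # The content attribute is a comma-separated list of directives.
        for directive in (attributes.get("content") or "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def robots_meta_directives(html: str) -> set:
    """Return the set of meta robots directives found in an HTML document."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,nofollow"></head></html>'
    print("noindex" in robots_meta_directives(page))  # True
```

Running a check like this over each page of a site would show which pages are still missing the tag.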

Or by adding an X-Robots-Tag HTTP header

To make the process of adding the meta robots tag to every single page of your site a bit easier, the search engines came up with the X-Robots-Tag HTTP header. This allows you to specify an HTTP header called X-Robots-Tag and set the value as you would the meta robots tag’s value. The cool thing about this is that you can do it for an entire site. If your site is running on Apache, and mod_headers is enabled (it usually is), you could add the following single line to your .htaccess file:

Header set X-Robots-Tag "noindex, nofollow"

And it’d have the effect that that entire site can be indexed, but will never be shown in the search results.
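To verify that the header is really being sent, you can inspect the response headers of any page. The Python below is a small sketch that works on a plain dictionary of headers, however you obtained them. The function names are made up for this example, and it deliberately ignores the variant of the header that is prefixed with a specific crawler’s name.

```python
def x_robots_directives(headers: dict) -> set:
    """Return the directives found in an X-Robots-Tag response header.

    Header names are matched case-insensitively. A value prefixed with a
    crawler name (such as "googlebot: noindex") is not handled specially.
    """
    directives = set()
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        for directive in value.split(","):
            directive = directive.strip().lower()
            if directive:
                directives.add(directive)
    return directives


def is_noindexed(headers: dict) -> bool:
    """Return True if the headers ask search engines not to list the page."""
    return "noindex" in x_robots_directives(headers)


if __name__ == "__main__":
    # Headers as the .htaccess line above would produce them.
    print(is_noindexed({"X-Robots-Tag": "noindex, nofollow"}))  # True
    print(is_noindexed({"Content-Type": "text/html"}))          # False
```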

So, get rid of that robots.txt file with Disallow: / in it. Use the X-Robots-Tag or that meta robots tag instead!

36 Responses to Preventing your site from being indexed, the right way

  1. culvers
    By culvers on 16 January, 2010

    Thanks for this, i was going wrong with this info for a while. My site now gets crawled regularly with no issues. Good work!

  2. Tigre de Fogo
    By Tigre de Fogo on 14 January, 2010

    Great tips, Joost, very useful explanation. Thanks.

  3. Constantin Nemut
    By Constantin Nemut on 6 January, 2010

    First you have to check robots.txt; if you have one, check what you have in it.
    The Disallow command prevents a site or a certain page from being indexed.

    Next you have to check the meta tag and what you have in its
    content attribute.

    Google first looks for robots.txt, then for the meta tag; if you put in a correct meta robots tag but have the wrong robots.txt commands, that will be a mistake.

  4. Gabi
    By Gabi on 5 January, 2010

    I have a question, as I’m not sure that I understood. Crawling is different from indexing. So it does not matter if a page is indexed or not, it could appear in search engines if other sites link to it.
    Could doing link building on a blocked page make that page appear in search engines?

  5. carmen
    By carmen on 27 December, 2009

    I’ve seen it many times that Google indexes pages that are blocked in the robots.txt. This solution works perfectly well.

  6. Credit
    By Credit on 23 December, 2009

    Ok-first of all I’m so glad I found your site. Excellent information all around!

  7. Sheila Kyle
    By Sheila Kyle on 22 December, 2009

    Thanks, I’m a newbie and not had a clear understanding of robot.txt. Now I can fix the problem

    Thank Sheila Kyle

  8. Tony @ Free Ipod
    By Tony @ Free Ipod on 21 December, 2009

    I have been using a robots.txt file to stop pages being indexed as that’s the way I was told to do it when I first set up my websites, and is probably a common story for most people. I didn’t realise that it’s not the way to do it, and might explain the errors I see in Webmaster Tools.

    I knew that robots.txt was for stopping folders from being crawled, but I think I’ll now start using the noindex meta tag.

    Thanks for the tips.

  9. WPexplorer
    By WPexplorer on 21 December, 2009

    This is a good tip. I do not quite understand why you would want to disallow an entire site though. As well as some people commented about not wanting their contact pages to be followed, however, I feel that contact pages usually contain some great keywords, that you want Google to check out. And it can be nice for customer interaction if they are too lazy to go to your site and look for the contact link, they can simply “google” the phrase “Contact Yoast”.

    ps:(you can delete this next part) but do you realize that when a guest is typing in the text area #comment, the text extends further to the right than the white background.

  10. SEO Doctor
    By SEO Doctor on 21 December, 2009

    Joost – how about knocking up a plugin for this so we don’t need to mess about in .htaccess :)

  11. Andy Beard
    By Andy Beard on 21 December, 2009

    The best directive pair is noindex follow, otherwise you can lose a lot of links

    One gotcha for people doing this using their security setings in WordPress is that you can still end up with RSS feed content that gets crawled, syndicated and indexed.

    • Miriam Schwab
      By Miriam Schwab on 21 December, 2009

      Andy, you’re saying that even if you add noindex, nofollow, the spiders can still index a site’s feed? If that’s the case, how can you tell the search engines not to index an RSS feed?

      • Andy Beard
        By Andy Beard on 21 December, 2009

        You need to only allow access to content if specific conditions are fulfilled

        Password either through WordPress or htaccess
        IP Address

        Blocking access to specific user agents or IPs is an uphill battle you never win, far better to block everything and allow specific.

  12. Peter
    By Peter on 21 December, 2009

    Ok, I understand this, but I wondered why you people don’t want certain pages indexed? Or not
    to appear in rankings. I would do it for a contact us page and t&cs but what else?

  13. Miriam Schwab
    By Miriam Schwab on 21 December, 2009

    No matter how much I read about robots.txt it seems that I never completely understood it. Thanks for this very clear and useful explanation of what the different “robot” options can and can’t do.

  14. Tony
    By Tony on 21 December, 2009

    Very useful to know, and as always great tips and easy to follow.

  15. Michael
    By Michael on 20 December, 2009

    That is definitely not the whole truth. Google has a second crawler which tests if sites are violating
    the rules, especially cloaking. This robot does not respect the robots.txt, and Google does not even want to talk about the robot which is searching for cloaking sites.

    One tip to make the Googlebot more careful is to address the bot directly, so in addition to a command in robots.txt like:

    User-agent: *
    Disallow: /forbiddenfolder

    you add

    User-agent: googlebot
    Disallow: /forbiddenfolder

    This solves a lot of problems because it addresses Googlebot directly.
    Perhaps Google does violate the robots.txt with their cloaking bot
    and Matt does not like questions which point in this direction.

    Another issue: did you ever visit a website with Google AdSense on it via an
    anonymous proxy? You will wonder, because no AdSense is shown on the site …
    well, this is cloaking! Google does it because they are afraid of click fraud, but
    is that really a reason to do something which you try to forbid all other websites from doing?

  16. Adi Jaffe
    By Adi Jaffe on 19 December, 2009

    You’d mentioned that for google to crawl pages on a site it needs to have links on indexed pages. Do internal links count or only external?

    • Joost de Valk
      By Joost de Valk on 20 December, 2009

      Internal links count too.

  17. Rob
    By Rob on 19 December, 2009

    Very nice succinct article. I’ve also used the robots.txt (will this always be supported?) file, but from now on will be using the header method.


  18. seo specialist
    By seo specialist on 19 December, 2009

    Thanks Joost, I keep learning something new whenever I visit your site to read an article. Earlier I was using the meta tag (nofollow) on the index page and the robots.txt option to avoid getting it listed on the SERP. This time I have learned the .htaccess method. You rock.

  19. magimmo
    By magimmo on 18 December, 2009

    Just to signal “a coquille” (en little error in french) at the end of your GREAT post ! (as usual I am an fan). You repeat the word that (“that that”).
    Good to read you.

  20. Nurul Azis
    By Nurul Azis on 18 December, 2009

    I am gonna need this knowledge about preventing certain page to be indexed… ups.m. not showing on SERPs later. I am bookmarking. Thanks.

  21. Ondre
    By Ondre on 18 December, 2009

    I’m using just the “I would like to block search engines, but allow normal visitors” setting in the Privacy section of the WP settings. How does that fit into this scheme of things?

    I guess non-technical users like me would like to know.

  22. Agent Deepak
    By Agent Deepak on 18 December, 2009

    Hmm! Nice tip! I had no idea the robots.txt file is not very effective. Thanks.

  23. Luci
    By Luci on 18 December, 2009

    So is using the meta noindex/nofollow better than a robots.txt file? I wasn’t sure, I thought it might be the other way around for some reason.
    Is there any advantage over using the Header set rather than the meta tag? Because to include it on every page, could you just use an include with all the common headers you need in?

  24. DG
    By DG on 17 December, 2009


    Please correct me if I’m wrong; you said above:

    Header set X-Robots-Tag "noindex, nofollow"

    And it’d have the effect that that entire site can be indexed, but will never be shown in the search results.

    Don’t you think the meta code should be:

    Header set X-Robots-Tag "index, nofollow"

    Please advise

    • Joost de Valk
      By Joost de Valk on 18 December, 2009

      No, my advice was right :) The terminology is weird perhaps: it can be crawled by the search engine, just not put into its index.

      • DG
        By DG on 18 December, 2009

        Thanks for explaining Joost.

  25. Jim Gaudet
    By Jim Gaudet on 17 December, 2009

    I too would like to know if I can do this to specific pages, or directories. Thanks,

    • Joost de Valk
      By Joost de Valk on 18 December, 2009

      You can, with a specific .htaccess in the subdirectory for instance. This depends on your Apache installation’s configuration though.

      • Jim Gaudet
        By Jim Gaudet on 18 December, 2009

        Thanks for the quick response, I did think I could use one in each dir, but wasn’t sure…

  26. Yves
    By Yves on 17 December, 2009

    Hi Joost, thanks for sharing this one.
    I’m looking for a variant on that: is it possible to let google crawl just a part of a page? (e.g.: not indexing the comments section).

    Thanks for any feedback!


  27. Todd
    By Todd on 17 December, 2009

    Thanks for the info. I’m using wp platform to build a few sites and stumbled across your site.
    I would like the opposite though, and have google crawl and index my sites. I suppose I could
    just add the code: Header set X-Robots-Tag “index, follow”

    Thanks again!

    • Joost de Valk
      By Joost de Valk on 18 December, 2009

      You don’t have to do anything to get Google to crawl your page, just get links pointing at it from pages that have already been indexed.
