Cloak your sitemap!

There’s a discussion on Webmasterworld as to whether XML sitemaps are a welcoming door for scrapers. Especially with the new autodiscovery feature for sitemaps around, it becomes very easy for a scraper to find any and all URLs on your site. So in my opinion, they are a bit too warm a welcome. Tedster said in that thread:

After all, the sitemap.xml file hands over a list of urls directly to any scraper that wants to make use of it. And excessively scraped sites can struggle in the SERPs.
Sounds like a very good reason for cloaking to me.

And in my eyes he’s right, it sounds like a very solid reason to cloak to me too. Now I’d been playing around with the simplecloak library Jaimie made (using the free IPLists.com data) for the Search Engine Optimization with PHP book, and I thought it would work quite well for this purpose. After all, cloaking is “bad”, but this content is only meant for the search engines (which, btw, is kind of wrong anyway), so why would they mind you cloaking it just for them?

So I downloaded the latest version of the script from Jaimie’s update page, installed it (which is frickin’ easy if you follow the book), and wrote the following sitemap.php:

<?php
require_once 'include/simple_cloak.inc.php';
if (SimpleCloak::isSpider() >= 3) {
    header("Content-Type: text/xml");
    include('/your/private/sitemap/file/goes/here.xml'); // path to your real sitemap
} else {
    header("HTTP/1.0 404 Not Found"); // everyone else gets a 404
    echo "<h1>You're not allowed to look at this file</h1>";
    echo "We're sorry, but the sitemap you requested was not meant for your consumption, as it was meant for the search engines only.";
    echo 'If you want, have a look at <a href="http://www.example.com/">the site</a> this sitemap is for.';
}

You can now use whatever tool you normally use to create a sitemap, which you include on the fifth line of this script. Just don’t ever tell anyone what you name that actual file, and no one who should not have your sitemap will get it. The last thing to do is to add the following line to your robots.txt:

Sitemap: http://www.example.com/sitemap.php

Now spiders will find the sitemap as they expect it, and scrapers will get a warning saying “This sitemap was not intended for you, go away.”
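If you'd rather not include() the XML file, here is a minimal sketch of the same idea that reads the sitemap as plain data, so nothing inside it is ever executed as PHP. The helper names sitemapResponse() and sendSitemap() are made up for this example and are not part of the simplecloak library. The is-spider flag is assumed to come from the same SimpleCloak::isSpider() >= 3 check used in the script above.

```php
<?php
// Hypothetical helper (not part of simplecloak): decide what to send back.
// $isSpider is assumed to be the result of SimpleCloak::isSpider() >= 3.
function sitemapResponse(bool $isSpider, string $sitemapPath): array
{
    // Serve the real sitemap only to spiders, and only if the file is readable.
    if ($isSpider && is_readable($sitemapPath)) {
        return [
            'status' => 200,
            'type'   => 'text/xml',
            // file_get_contents() returns the raw bytes; unlike include(),
            // it never runs PHP code that might be sitting in the file.
            'body'   => file_get_contents($sitemapPath),
        ];
    }

    // Everyone else (or a missing file) gets a plain 404.
    return [
        'status' => 404,
        'type'   => 'text/html',
        'body'   => '<h1>Not Found</h1>',
    ];
}

// Hypothetical helper: write the response out to the client.
function sendSitemap(array $response): void
{
    http_response_code($response['status']);
    header('Content-Type: ' . $response['type']);
    echo $response['body'];
}

// Possible use in sitemap.php, assuming the library is installed as in the post:
// require_once 'include/simple_cloak.inc.php';
// sendSitemap(sitemapResponse(
//     SimpleCloak::isSpider() >= 3,
//     '/your/private/sitemap/file/goes/here.xml'
// ));
```

Splitting the decision from the output is just a convenience so the logic can be checked without a web server. For a very large sitemap, calling readfile() directly would avoid loading the whole file into memory first.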

10 Responses

  1. By Sint on 8 May, 2007

    Maybe you should change the error message to make it a bit more user-friendly, since it’s not unusual that sitemaps suddenly show up in the SERPs…

    (e.g. http://www.seohandleiding.nl/sitemap-rankt-in-google.html)

  2. By Joost de Valk on 8 May, 2007

    I’ll do even better, I’ll make it throw in a 301 to the main domain :)

  3. By Joost de Valk on 8 May, 2007

    Thinking a bit more about it, that isn’t a really elegant solution either… I’ll just add some more text, and make it throw a 404, to prevent it from indexing in other weird search engines.

  4. By PocketSEO on 10 June, 2007

    You could avoid cloaking by giving the sitemaps unconventional filenames like /234890234.xml that would not be guessed by scrapers. Then either ping the search engines every time the site is updated, or use cron to send the pings. The sitemap protocol explains how to ping search engines. Example:

    [search engine url]/ping?sitemap=sitemap_url

    I avoid adding sitemaps on most sites though…

  5. By Joost de Valk on 10 June, 2007

    @PocketSEO: the problem is the SEs now want you to tell them where the sitemap is by putting it in your robots.txt… Once you do that, you’re open to scrapers.

  6. By PocketSEO on 10 June, 2007

    I suppose that would be a problem… I think you can alternatively just put all the sitemaps in a hidden sitemap index file and then ping the engines with the location of the index file(s)…

    That book looks interesting. Going to order it now…

  7. By Joost de Valk on 10 June, 2007

    It is! You’re going to have loads of fun with it ;)

  8. By Trophaeum on 12 January, 2008

    Use readfile(), not include(). Otherwise, if you happen to use shared hosting and store the .xml file in /tmp somewhere, I could have fun running PHP code on your URL with it ;)

Trackbacks

  1. [...] you are concerned take a look here for a solution: Cloak your sitemap! – SEO Blog – Joost de Valk Well worth reading Joost’s blog as well. One of the better technical SEO blogs on the block. [...]

  2. Seo Placement…

    Note: You must register on the site to view the article at. Ask me how I feel about their web site (not to mention the URLs)! You need us for some search engine optimization!…