There's a discussion on WebmasterWorld about whether XML sitemaps are an open door for scrapers. Especially with the new autodiscovery feature for sitemaps around, it becomes very easy for a scraper to find any and all URLs on your site. So in my opinion, they are a bit too warm a welcome. Tedster said in that thread:
After all, the sitemap.xml file hands over a list of urls directly to any scraper that wants to make use of it. And excessively scraped sites can struggle in the SERPs.
Sounds like a very good reason for cloaking to me.
And in my eyes he's right: it sounds like a very solid reason to cloak to me too. Now I'd been playing around with the simplecloak library Jaimie made (using the free IPLists.com data) for the Search Engine Optimization with PHP book, and I thought it would work quite well for this purpose. After all, cloaking is "bad", but this content is only meant for the search engines (which, btw, is kind of wrong anyway), so why would they mind you cloaking it just for them?
So I downloaded the latest version of the script from Jaimie's update page, installed it (which is frickin' easy if you follow the book), and wrote the following sitemap.php:
<?php
require_once 'include/simple_cloak.inc.php';
if (SimpleCloak::isSpider() >= 3) {
    header("Content-Type: text/xml");
    include('/your/private/sitemap/file/goes/here.xml');
} else {
    header("HTTP/1.0 404 Not Found");
    echo "<h1>You're not allowed to look at this file</h1>";
    echo "We're sorry, but the sitemap you requested was not meant for your consumption, as it was meant for the search engines only.";
    echo 'If you want, have a look at <a href="http://www.example.com/">the site</a> this sitemap is for.';
}

You can now use whatever tool you normally use to create a sitemap, which you include on the fifth line of this script. Just don't ever tell anyone what you named that actual file, and no one who shouldn't have your sitemap will get it. The last thing to do is to add the following line to your robots.txt:
Sitemap: http://www.example.com/sitemap.php
Now spiders will find the sitemap where they expect it, and scrapers will get a message saying "This sitemap was not intended for you, go away."

Maybe you should change the error message to make it a bit more user-friendly, since it's not unusual for sitemaps to suddenly show up in the SERPs...
(e.g. http://www.seohandleiding.nl/sitemap-rankt-in-google.html)
I'll do even better, I'll make it throw in a 301 to the main domain :)
Thinking a bit more about it, that isn't a really elegant solution either... I'll just add some more text and make it throw a 404, to keep it from being indexed by other weird search engines.
You could avoid cloaking by giving the sitemaps unconventional filenames like /234890234.xml that would not be guessed by scrapers. Then either ping the search engines every time the site is updated, or use cron to send the pings. The sitemap protocol explains how to ping search engines. Example:
[search engine url]/ping?sitemap=sitemap_url
I avoid adding sitemaps on most sites though...
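For anyone who wants to script the ping described in the comment above, here is a minimal sketch of building that URL in PHP. The function name build_ping_url and the host searchengine.example are made up for illustration; each engine documents its own ping endpoint, so treat the base URL as a placeholder. It assumes PHP 7 or later for the type hints.

```php
<?php
// Hypothetical helper: builds a ping URL of the form
// [search engine url]/ping?sitemap=sitemap_url
function build_ping_url(string $engineBaseUrl, string $sitemapUrl): string
{
    // The sitemap URL travels as a query parameter, so it has to be URL-encoded.
    return rtrim($engineBaseUrl, '/') . '/ping?sitemap=' . urlencode($sitemapUrl);
}

// Placeholder values; swap in the real ping endpoint and your hidden sitemap URL.
$pingUrl = build_ping_url('http://searchengine.example', 'http://www.example.com/234890234.xml');
echo $pingUrl, "\n";
```

A cron job could then request $pingUrl with whatever HTTP client you prefer after each site update.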
@PocketSEO: the problem is that the search engines now want you to tell them where the sitemap is by putting it in your robots.txt... Once you do that, you're open to scrapers.
I suppose that would be a problem... I think you can alternatively just put all the sitemaps in a hidden sitemap index file and then ping the engines with the location of the index file(s)...
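As a rough illustration of the hidden index file idea above, the sketch below generates a sitemap index in the format the sitemap protocol defines (a sitemapindex element holding one sitemap/loc entry per file). The function name build_sitemap_index and the example URLs are assumptions, not anything from the thread.

```php
<?php
// Hypothetical helper: wraps a list of sitemap URLs in a sitemap index document.
function build_sitemap_index(array $sitemapUrls): string
{
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";
    foreach ($sitemapUrls as $url) {
        // Escape &, <, > and quotes so the URL is valid inside an XML element.
        $loc  = htmlspecialchars($url, ENT_QUOTES | ENT_XML1, 'UTF-8');
        $xml .= '  <sitemap><loc>' . $loc . '</loc></sitemap>' . "\n";
    }
    $xml .= '</sitemapindex>' . "\n";
    return $xml;
}

// Placeholder URLs with hard-to-guess names, as suggested in the comments.
echo build_sitemap_index([
    'http://www.example.com/234890234.xml',
    'http://www.example.com/987120345.xml',
]);
```

You would save the output under an equally unguessable filename and ping the engines with that location instead of listing it in robots.txt.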
That book looks interesting. Going to order it now...
It is! You're going to have loads of fun with it ;)
Use readfile(), not include(). If you happen to be on shared hosting and store the .xml file somewhere like /tmp, I could have fun running PHP code on your URL with it ;)
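To make that last suggestion concrete, here is a minimal sketch of serving the file with readfile(). The helper name send_sitemap is hypothetical and is not part of the simplecloak library; the path is the same placeholder used in the post.

```php
<?php
// Hypothetical helper: streams the private sitemap file as raw bytes.
// readfile() copies the file to the output without parsing it, whereas
// include() would execute any PHP tags that ended up inside the file.
function send_sitemap(string $path): bool
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }
    // Skip the header if output has already started, to avoid a warning.
    if (!headers_sent()) {
        header('Content-Type: text/xml');
    }
    return readfile($path) !== false;
}

// Usage inside the spider branch of sitemap.php:
// send_sitemap('/your/private/sitemap/file/goes/here.xml');
```

If the file has been tampered with, the worst case is that the injected PHP shows up as text in the response instead of running on the server.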