Google Webmaster Tools Content Analysis shows Google breaks the rules.

Well, it might offer some cool data, but it also shows me that whatever Google does, they don’t adhere to the robots standard. Check this:

[Screenshot: Content Analysis of blocked pages]

Well, yes, these pages do have duplicate title tags. But they have something else as well:

<meta name="robots" content="noindex,follow,noodp" />

Perhaps, Google, instead of working on these nice webmaster tools, you should start honoring simple stuff like a meta name=”robots” first.
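To make explicit what that tag asks for, here is a minimal Python sketch that reads the directives out of a robots meta tag. The class and function names (`RobotsMetaParser`, `robots_directives`) are made up for illustration, and this is not how any search engine actually parses the tag. It only shows that "noindex" and "follow" are separate directives: the first concerns showing the page in results, the second concerns following its links.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here.
        if tag != "meta":
            return
        attr = {key.lower(): (value or "") for key, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives found in an HTML snippet."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<meta name="robots" content="noindex,follow,noodp" />'
directives = robots_directives(page)

# noindex: the page should not be shown in search results.
may_show_in_results = "noindex" not in directives
# No nofollow present: links on the page may still be followed,
# which means the page itself has to be fetched.
may_follow_links = "nofollow" not in directives
```

With the tag from this post, `may_show_in_results` comes out false and `may_follow_links` comes out true, which is exactly the combination I asked for.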


31 Responses

  1. By Patrick Altoft on 14 December, 2007

    Have these pages been indexed?

  2. By Arttu on 14 December, 2007

    I think it’s just a limitation with Webmaster Tools. In the same vein it shows nofollowed inlinks from other sites.

    “Noindex” in meta name=”robots” doesn’t tell the bots to not inspect the page. You could have a sitemap page “noindexed”, and it would still pass all the link juice.

  3. By Dave Davis on 14 December, 2007

    Agreed, noindex and “no crawl” are indexed. The adwords and quality score bots will still crawl too.

  4. By Joost de Valk on 14 December, 2007

    I know guys, but what good is that information to me if I’ve specifically blocked it from the SE?? I don’t want it to show up in the SERPs anyway…

  5. By Dave Davis on 14 December, 2007

    Good point, but are those pages in your sitemap.xml?

  6. By Joost de Valk on 14 December, 2007

    Of course not :)

  7. By Dave Davis on 14 December, 2007

    Hmmm, does the webmastertools bot have a unique user agent? Looks like we have another one to block.

  8. By Richard Hearne on 14 December, 2007

    tinfoil hat time…

    perhaps noindex doesn’t really mean noindex? perhaps noindex means index but don’t let anyone know about it, and the console is pulling data from ‘non-sanitised’ index data?

    don’t dare call me paranoid :)

  9. By Joost de Valk on 14 December, 2007

    Yeah well this proves that they keep the data doesn’t it? How would they know the title otherwise?

  10. By Joost de Valk on 14 December, 2007

    Well, why? :) Why should I invest time in making sure that a search page’s second page has a different title than its first page, if I don’t intend people to get there in ANY other way than by clicking through FROM that first page?

  11. By André Scholten on 14 December, 2007

    We all knew they keep the data, they have to crawl those pages to find links they have to follow. And noindex or not: does it matter? In the most ideal situation you want every page to have a unique title.

  12. By André Scholten on 15 December, 2007

    If someone creates a link to that second page, a visitor who clicks on that link immediately sees he’s on page 2. “Sense of place”.

    But I agree with you: nobody will invest time in those little details. You have to be an idealist to do that.

  13. By Sebastian on 15 December, 2007

    Noindex means “crawl but don’t display it on SERPs”. Since those URLs aren’t indexed Google obeys the REP tag. To follow the links Google needs to keep a copy, so why shouldn’t they provide that info in your GWC acct where nobody else can view it? It’s extracted from the crawl cache, not the visible cache. You could complain when you add a Disallow: /*?p= to your robots.txt’s Googlebot section ;)

  14. By Sebastian on 15 December, 2007

    Oops … Disallow: /*?s= that is

  15. By Joost de Valk on 15 December, 2007

    Sebastian: that’s not new to me either, believe me :) It’s just that I don’t want these pages to appear in the SERPs and have made that VERY clear to Google, and YET they still give me that stupid advice :)

  16. By Sebastian on 15 December, 2007

    From Google’s POV that’s usability advice, not crawling, indexing or ranking advice. I see where you’re coming from but I don’t think that’s really “breaking rules”. :)

  17. By Sint on 15 December, 2007

    I agree with some people here that Google Webmaster Tools can be used for more things than just SEO, so providing this information for pages that should be noindexed can still be useful. But it also raises questions about how Google interprets the noindex meta tag and what the status of this tag is in relation to the non-SERP functionality of search engines. Maybe it would be wise for Google to include some information on this issue in the documentation/help section of WM-Tools.

  18. By nam on 15 December, 2007

    Content Analysis is an excellent tool for detecting duplicate content. I’ve used my robots.txt file to restrict Google’s access to these pages, and it’s working well for me.
    For example:
    Disallow: /*&view=getnewpost$
    Disallow: /*&view=getlastpost$
    Disallow: /*&view=old$
    Disallow: /*&view=new$

  19. By Sergey Rusak on 16 December, 2007

    For my blog I block all /trackback/ and /feed/ pages through tobots.txt, but for some reason Google still indexes them every time a new page appears. The only good thing is that they appear in the index for only 1-3 days and then disappear. Weird.

  20. By Sint on 16 December, 2007

    @Sergey: I hope the reason for this is not that you called your file really tobots.txt ;-)
    Do you have a noindex-tag in these files? Or are they all non-HTML so this isn’t possible?

    Alternatively you could also block all traffic coming from Googlebot using .htaccess on these folders.

  21. By tingeltangeltill on 17 December, 2007

    If someone links to your “noindex” pages, Google will put them in its index, as Matt said in a Google video.

  22. By Joost de Valk on 17 December, 2007

    @tingeltangeltill: noindex without a nofollow implicitly means noindex, follow. Hence they need to spider those pages to know which links are on there; that’s logical. They won’t show them in their index, yet they DO give me some sort of advice on them. I’d rather have them not do that, but they don’t “put them in their index”, they spider them because they need to follow the links.

  23. By ed on 18 December, 2007

    Where can I find this duplicate title tag section in google analytics?

  24. By Joost de Valk on 18 December, 2007

    @ed: this is in Google Webmaster Tools, not analytics.

  25. By ed on 19 December, 2007

    Thanks, got confused by the term analysis, which is not analytics…

  26. By gissit on 7 January, 2008

    So you have pages that you want google to follow links from but do not want it to send visitors there? So it is a doorway page of some sort?

  27. By tom on 7 January, 2008

    Did you read the article, gissit? Or do you want a link to your site…

  28. By gissit on 7 January, 2008

    I read the article (I was searching for something else and this caught my eye). The behaviour of the Googlebot is exactly as I would expect: it was not prohibited from viewing the page in robots.txt, it has not indexed the page as directed by the meta tag and ‘maybe’ it followed the links as requested.
    As this is an SEO blog my comment was more related to the wisdom of having such pages in today’s link spam climate. Google has specifically made a point of saying on many occasions that pages should be for visitors and not for search engines. If I were an algorithm writer this would certainly raise some alarms with me.
    If the pages are not for visitors to see then what else are they for? If it is to lead a search engine to other pages then they are indeed doorway pages and may suffer the wrath of Google.
    As for posting just to get a link? I really do not need a link from a blog to help my SEO. If this page is written in accordance with current wisdom the link will either have a rel=”nofollow” or will be wrapped in JavaScript. If it is not then it would have little or no value anyway as this is not related to my site topic.

  29. By Joost de Valk on 7 January, 2008

    @gissit: First of all, what you call “current wisdom” does not apply here, I’ve removed nofollows. Second: the pages are just there for discovery, just like XML sitemaps are, there’s nothing wrong with that.

  30. By gissit on 7 January, 2008

    I can’t really argue with someone with a homepage PR like yours :) but experience has shown me that less is very often more in terms of raw number of pages vs. SERP positions.
    If I want to create discovery links I do it in an HTML sitemap, in a page formatted for users.
    Using this type of philosophy I have much better results than most competitors with far fewer inbound links and visible (toolbar) PR.

    I know that a lot of talk has gone into the whole nofollow thing but I have also seen the effect of using it as intended. A bit of link management work on a friend’s forum has increased Google traffic more than 20-fold in less than 9 months.

    Anyway, I think this may all be a bit off-topic for this thread but if you don’t mind I will bookmark your site and return when I have more time as it looks interesting here.
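
For readers who want to check robots.txt patterns like the ones posted in the comments (for example Disallow: /*&view=old$), here is a rough Python sketch of the wildcard matching those rules rely on, where * stands for any run of characters and a trailing $ anchors the end of the URL. The helper names (`rule_to_regex`, `is_disallowed`) and the forum URLs are made up for illustration. The sketch ignores Allow rules, rule precedence and per-user-agent sections, so treat it as a way to sanity-check a pattern, not as a model of what Googlebot really does.

```python
import re


def rule_to_regex(path_pattern):
    """Translate a robots.txt path with * and $ wildcards into a regex."""
    anchored = path_pattern.endswith("$")
    if anchored:
        path_pattern = path_pattern[:-1]
    # Escape each literal piece, then join the pieces with "any characters".
    body = ".*".join(re.escape(part) for part in path_pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(url_path, disallow_rules):
    """True if any Disallow pattern matches from the start of the path."""
    return any(rule_to_regex(rule).match(url_path) for rule in disallow_rules)


# Two of the rules from the comments above.
rules = ["/*&view=old$", "/*&view=new$"]

# Hypothetical forum URLs, not taken from any real site.
blocked = is_disallowed("/forum/index.php?showtopic=1&view=old", rules)
allowed = not is_disallowed("/forum/index.php?showtopic=1&view=older", rules)
```

Note that a Disallow rule stops the page from being fetched at all, which is a different request from the noindex meta tag in the post: a crawler that never fetches a page cannot read its meta tags or its title.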