Google Webmaster Tools Content Analysis shows Google breaks the rules.

Well, it might offer some cool data, but it also shows me that whatever Google does, they don't adhere to the robots standard. Check this:

Content Analysis of blocked pages

Well, yes, these pages do have duplicate title tags. But they have something else as well:

[code language="html"]

[/code]

Perhaps, Google, instead of working on these nice webmaster tools, you should start honoring simple stuff like a meta name="robots" first.
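
For anyone who wants to check their own pages, here is a minimal Python sketch of reading the directives out of a meta name="robots" tag. The markup in `SAMPLE_PAGE` is a hypothetical stand-in, not the tag from my pages, and `RobotsMetaParser` and `robots_directives` are names made up for this illustration. Note that the tag lives inside the HTML, so a bot can only see it after fetching the page.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags like <meta ... /> are also routed here by default.
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for part in attr_map.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html):
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# Hypothetical page used only for illustration.
SAMPLE_PAGE = """<html><head>
<title>Search results</title>
<meta name="robots" content="noindex,follow" />
</head><body></body></html>"""

directives = robots_directives(SAMPLE_PAGE)
# "none" is shorthand for "noindex,nofollow".
may_index = "noindex" not in directives and "none" not in directives
```

With the sample markup above, `may_index` comes out false, which is the signal a search engine is asked to respect when deciding what to show.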

30 Responses to “Google Webmaster Tools Content Analysis shows Google breaks the rules.”

  1. Have these pages been indexed?

  2. I think it's just a limitation with Webmaster Tools. In the same vein it shows nofollowed inlinks from other sites.

    "Noindex" in meta name="robots" doesn't tell the bots to not inspect the page. You could have a sitemap page "noindexed", and it would still pass all the link juice.

  3. Agreed, noindex and "no crawl" are indexed. The adwords and quality score bots will still crawl too.

  4. Good point, but are those pages in your sitemap.xml?

  5. Hmmm, does the webmastertools bot have a unique user agent? Looks like we have another one to block.

  6. tinfoil hat time...

    perhaps noindex doesn't really mean noindex? perhaps noindex means index but don't let anyone know about it, and the console is pulling data from 'non-sanitised' index data?

    don't dare call me paranoid :)

  7. We all knew they keep the data, they have to crawl those pages to find links they have to follow. And noindex or not: does it matter? In the most ideal situation you want every page to have a unique title.

  8. If someone creates a link to that second page, a visitor who clicks on that link immediately sees he's on page 2. "Sense of place".

    But I agree with you: nobody will invest time in those little details. You have to be an idealist to do that.

  9. Noindex means "crawl but don't display it on SERPs". Since those URLs aren't indexed Google obeys the REP tag. To follow the links Google needs to keep a copy, so why shouldn't they provide that info in your GWC acct where nobody else can view it? It's extracted from the crawl cache, not the visible cache. You could complain when you add a Disallow: /*?p= to your robots.txt's Googlebot section ;)

  10. Oops ... Disallow: /*?=s that is

  11. From Google's POV that's usability advice, not crawling, indexing or ranking advice. I see where you're coming from but I don't think that's really "breaking rules". :)

  12. I agree with some people here that Google Webmaster Tools can be used for more than just SEO, so providing this information for pages that should be noindexed can still be useful. But it also raises questions about how Google interprets the noindex meta tag and what the status of this tag is with regard to the non-SERP functions of search engines. It might be wise for Google to include some information on this issue in the documentation/help section of Webmaster Tools.

  13. Content Analysis is an excellent tool for detecting duplicate content. I've used my robots.txt file to restrict Google's access to these pages, and it's working well for me:
    For example:
    Disallow: /*&view=getnewpost$
    Disallow: /*&view=getlastpost$
    Disallow: /*&view=old$
    Disallow: /*&view=new$

  14. For my blog I block all /trackback/ and /feed/ pages through tobots.txt, but for some reason Google still indexes them every time a new page appears. The only good thing is that they appear in the index for just 1-3 days and then disappear. Weird.

  15. @Sergey: I hope the reason for this is not that you called your file really tobots.txt ;-)
    Do you have a noindex-tag in these files? Or are they all non-HTML so this isn't possible?

    Alternatively you could also block all traffic coming from Googlebot using .htaccess on these folders.

  16. If someone links to your "noindex" pages, Google will put them in its index, as Matt said in a Google video.

  17. Where can I find this duplicate title tag section in google analytics?

  18. Thanks, got confused by the term analysis, which is not analytics...

  19. So you have pages that you want google to follow links from but do not want it to send visitors there? So it is a doorway page of some sort?

  20. Did you read the article, gissit? Or do you want a link to your site...

  21. I read the article (I was searching for something else and this caught my eye). The behaviour of the Googlebot is exactly as I would expect: it was not prohibited from viewing the page in robots.txt, it has not indexed the page as directed by the meta tag, and 'maybe' it followed the links as requested.
    As this is an SEO blog, my comment was more related to the wisdom of having such pages in today's link spam climate. Google has specifically made a point of saying on many occasions that pages should be for visitors and not for search engines. If I were an algorithm writer this would certainly raise some alarms with me.
    If the pages are not for visitors to see then what else are they for? If it is to lead a search engine to other pages then they are indeed doorway pages and may suffer the wrath of Google.
    As for posting just to get a link? I really do not need a link from a blog to help my seo. If this page is written in accordance with current wisdom the link will either have a rel="nofollow" or will be wrapped in javascript. If it is not then it would have little or no value anyway as this is not related to my site topic.

  22. Joost
    I can't really argue with someone with a homepage PR like yours :) but experience has shown me that less is very often more in terms of raw number of pages vs. SERP positions.
    If I want to create discovery links I do it in an HTML sitemap, in a page formatted for users.
    Using this type of philosophy I have much better results than most competitors, with far fewer inbound links and less visible (toolbar) PR.

    I know that a lot of talk has gone into the whole nofollow thing, but I have also seen the effect of using it as intended. A bit of link management work on a friend's forum has increased Google traffic more than 20-fold in less than 9 months.

    Anyway, I think this may all be a bit off-topic for this thread but if you don't mind I will bookmark your site and return when I have more time as it looks interesting here.
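
As a footnote to the robots.txt rules quoted in comment 13: Google documents `*` as matching any run of characters and a trailing `$` as anchoring the end of the URL. A minimal Python sketch of that matching follows, assuming only those two special characters. `rule_to_regex` and `is_disallowed` are hypothetical helper names, and the test paths are made up. This is an approximation for illustration, not Googlebot's actual matcher.

```python
import re


def rule_to_regex(rule):
    """Translate a wildcard Disallow pattern into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape literal pieces, then let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path, rules):
    """Return True if any Disallow pattern matches the URL path."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


# Two of the rules quoted in comment 13.
RULES = ["/*&view=getnewpost$", "/*&view=old$"]

# Hypothetical forum URLs for illustration.
blocked = is_disallowed("/index.php?showtopic=1&view=old", RULES)
allowed = not is_disallowed("/index.php?showtopic=1&view=older", RULES)
```

The trailing `$` is what keeps `&view=older` out of the match. Without it the rule would behave as a prefix match after the wildcard.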

