Yahoo SiteExplorer web vs. API: answers from Yahoo!

In response to my post about the Yahoo API giving the “wrong” results, I got an email from a Yahoo! rep, and we’ve emailed back and forth a few times since. I showed him the difference between the numbers the API and the web interface give for css3.info (I’ve updated my domain-info tool to both scrape the web interface and get the numbers through the API) and said “they can’t both be accurate”. He explained the difference this way:

Nope, not going to claim they are accurate, merely an estimate taken from either the raw (for the scraped pages) or semi-analyzed (for the API data) for the server cluster you hit at the time of request.

If it were possible to return accurate numbers, I’m willing to bet they’d do that. Unfortunately, it’s usually not, due to stuff like scaling issues, crawl vs. report lag and other factors.

Followed by another nice quote at the end of that email:

Again, I’m not claiming that these number are the best possible (even as estimates, that’s why the engineers are trying to improve them), but they do serve as a guide. Likewise, I’d definitely make sure to grab numbers from Google, Ask and MSN since decision making off of one data point seldom makes for good decisions.

Now I think this point is great, were it not that the data these other three engines give is either stupidly off (in the case of Google and Ask) or non-existent at the moment (MSN).

He says something else too:

To be honest, the only person that can accurately measure real inbound link counts are the folks that control the access logs and can scan and report those. Anything outside of those numbers is never going to be as accurate.

Now this would be true if all scrapers gave me clickthroughs… Yet they don’t. So I think I can get a nice sample of links that truly have value from access logs, but it wouldn’t show me any DMOZ links, for instance. Another problem is, of course, that your competitor probably won’t give you access to his access logs… So we need interfaces like these. Given these answers, the two different numbers each have their own value, so for now I’m going to keep using the API and scraping the web interface.
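To make the access-log idea a bit more concrete, here is a minimal sketch of what scanning a log for inbound links could look like. It assumes the server writes the Apache "combined" log format (with the referrer as the second-to-last quoted field); the function name `inbound_referrers` and the sample log lines are made up for illustration. Note that it can only ever see links that somebody actually clicked, which is exactly the limitation described above.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed layout (Apache combined format):
# ... "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referer>[^"]*)" "[^"]*"\s*$')


def inbound_referrers(lines, own_host):
    """Count hits per external referring URL found in access-log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # line is not in the assumed format
        referer = match.group("referer")
        host = urlparse(referer).hostname
        if host is None:
            continue  # "-" or an empty referrer: no link to report
        if host == own_host or host.endswith("." + own_host):
            continue  # internal navigation, not an inbound link
        counts[referer] += 1
    return counts


# Hypothetical sample lines, not real log data.
sample = [
    '1.2.3.4 - - [11/Jun/2007:10:00:00 +0200] "GET / HTTP/1.1" 200 512 '
    '"http://example.org/links.html" "Mozilla/5.0"',
    '1.2.3.5 - - [11/Jun/2007:10:01:00 +0200] "GET / HTTP/1.1" 200 512 '
    '"http://example.org/links.html" "Mozilla/5.0"',
    '1.2.3.6 - - [11/Jun/2007:10:02:00 +0200] "GET /about HTTP/1.1" 200 300 '
    '"-" "Mozilla/5.0"',
    '1.2.3.7 - - [11/Jun/2007:10:03:00 +0200] "GET /about HTTP/1.1" 200 300 '
    '"http://www.css3.info/" "Mozilla/5.0"',
]

counts = inbound_referrers(sample, "css3.info")
for url, hits in counts.most_common():
    print(hits, url)
```

The number of distinct keys in the returned counter is the number of linking pages that sent at least one visitor; a link that never gets clicked will not show up at all.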

I must say, though, that it’s awesome to be able to mail with a rep from Yahoo! about this and discuss it so openly, and that they have no problem at all with me blogging about it.


5 Responses

  1. Sint on 11 June, 2007

    The statement of numbers being an estimation sounds clear, but why are they using two different methods of calculating the data? When you analyse data, it’s acceptable that it’s not 100% accurate, but it’s confusing when these estimations are calculated differently.
    I suppose one of them delivers the ‘best’ result, so why are they using the other methods if their data is less accurate?

  2. Joost de Valk on 11 June, 2007

    Well as the Yahoo rep states it, the API has somewhat more time to filter… It’s a matter of resources, nothing else…

  3. Hyper Dog Media on 12 June, 2007

    It’s surprising that site explorer isn’t using the API. Would “the folks that control the access logs” be Google Analytics? Yahoo’s “Full Analytics”? You’d think they could get a pretty good idea from these sources.

  4. Joost de Valk on 12 June, 2007

    Yeah, I was thinking along the same lines…

  5. RenegadeLatino on 13 July, 2007

    Good stuff