Yahoo SiteExplorer web vs. API: answers from Yahoo!

June 11th, 2007 – 5 Comments

In response to my post about the Yahoo API giving the “wrong” results, I got an email from a Yahoo! rep, and we’ve emailed back and forth a few times since. When I showed him the difference between the numbers given by the API and by the web interface (I’ve updated my domain-info tool to both scrape the web interface and get the numbers through the API), saying “they can’t both be accurate”, he explained the difference this way:

Nope, not going to claim they are accurate, merely an estimate taken from either the raw (for the scraped pages) or semi-analyzed (for the API data) for the server cluster you hit at the time of request.

If it were possible to return accurate numbers, I’m willing to bet they’d do that. Unfortunately, it’s usually not, due to stuff like scaling issues, crawl vs. report lag and other factors.

Followed by another nice quote at the end of that email:

Again, I’m not claiming that these number are the best possible (even as estimates, that’s why the engineers are trying to improve them), but they do serve as a guide. Likewise, I’d definitely make sure to grab numbers from Google, Ask and MSN since decision making off of one data point seldom makes for good decisions.

Now I think this point would be great, were it not that the data these other three engines give is either stupidly off (in the case of Google and Ask) or non-existent at the moment (MSN).

He says something else too:

To be honest, the only person that can accurately measure real inbound link counts are the folks that control the access logs and can scan and report those. Anything outside of those numbers is never going to be as accurate.

Now this would be true if all scrapers gave me clickthroughs… Yet they don’t. So I think I can get a nice sample of links which truly have value from access logs, but it wouldn’t show me any DMOZ links, for instance. Another problem, of course, is that your competitor probably won’t give you access to his access logs… So we need interfaces like these. Given these answers, the two different numbers each have their own value, so for now I’m going to keep both using the API and scraping the web interface.
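To make the access-log idea a bit more concrete, here is a minimal sketch of how you could pull clicked inbound links out of your own logs. It assumes the logs are in the common Apache "combined" format; the function name `inbound_link_clicks` and the `own_host` parameter are just names I made up for this illustration. Note that it only sees links somebody actually clicked, which is exactly the limitation mentioned above.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Assumed format: Apache combined log, which ends with
#   "request" status bytes "referrer" "user-agent"
LINE_RE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referrer>[^"]*)" "[^"]*"$')


def inbound_link_clicks(log_lines, own_host):
    """Count clickthroughs per external referring page (hypothetical helper)."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line.strip())
        if not match:
            continue  # not a combined-format line, skip it
        referrer = match.group("referrer")
        host = urlparse(referrer).hostname
        # Skip empty referrers ("-") and clicks coming from our own pages.
        if host is None or host == own_host or host.endswith("." + own_host):
            continue
        counts[referrer] += 1
    return counts
```

You would call it with something like `inbound_link_clicks(open("access.log"), "example.com")`, where both the file name and the host are placeholders. A link that never gets clicked never shows up in the result, so this gives a sample of links that send traffic, not a full link count.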

I must say, though, that it’s awesome to be able to email with a rep from Yahoo! about this and discuss it so openly, and that they have no problem at all with me blogging about it.

5 Responses to Yahoo SiteExplorer web vs. API: answers from Yahoo!

  1. Sint
    By Sint on 11 June, 2007

    The statement of numbers being an estimation sounds clear, but why are they using two different methods of calculating the data? When you analyse data, it’s acceptable that it’s not 100% accurate, but it’s confusing when these estimations are calculated differently.
    I suppose one of them delivers the ‘best’ result, so why are they using the other methods if their data is less accurate?

  2. Joost de Valk
    By Joost de Valk on 11 June, 2007

    Well as the Yahoo rep states it, the API has somewhat more time to filter… It’s a matter of resources, nothing else…

  3. Hyper Dog Media
    By Hyper Dog Media on 12 June, 2007

    It’s surprising that site explorer isn’t using the API. Would “the folks that control the access logs” be Google Analytics? Yahoo’s “Full Analytics”? You’d think they could get a pretty good idea from these sources.

  4. Joost de Valk
    By Joost de Valk on 12 June, 2007

    Yeah, I was thinking along the same lines…

  5. RenegadeLatino
    By RenegadeLatino on 13 July, 2007

    Good stuff
