Crawl budget optimization

July 5, 2016 – 3 Comments

What is a crawl budget? Crawl budget is the number of pages Google will crawl on your site on any given day. This number varies slightly from day to day, but overall it’s relatively stable. The number of pages Google crawls, your “budget”, is generally determined by the size of your site, the “health” of your site (how many errors Google encounters) and the number of links to your site.

Google doesn’t always spider every page on a site instantly. In fact, sometimes it can take weeks. This might get in the way of your SEO efforts. Your newly optimized landing page might not get indexed. At that point, it becomes time to optimize your crawl budget.

It might crawl 6 pages a day, it might crawl 5,000 pages, it might even crawl 4,000,000 pages every single day. This depends on many factors, which we’ll discuss in this article. Some of these factors are things you can influence.

How does a crawler work?

A crawler like Googlebot gets a list of URLs to crawl on a site. It goes through that list systematically. It grabs your robots.txt file every once in a while to make sure it's still allowed to crawl each URL, and then crawls the URLs one by one. Once a spider has crawled a URL and parsed its contents, it adds any new URLs it found on that page back onto the to-do list.

Several events can make Google decide a URL has to be crawled: it might have found new links pointing at the content, someone might have tweeted it, or it might have been updated in the XML sitemap, and so on. There's no way to list all the reasons why Google would crawl a URL, but when it determines it has to, it adds the URL to its to-do list.

When is crawl budget an issue?

Crawl budget is not a problem if Google crawls enough URLs on your site each day. But say your site has 250,000 pages and Google crawls 2,500 pages on it each day. It will crawl some pages (like the homepage) more often than others, so it could take up to 200 days before Google notices particular changes to your pages if you don't act. Crawl budget is now an issue. If it crawls 50,000 pages a day instead, there's no issue at all.

To quickly determine whether your site has a crawl budget issue, follow the steps below. This does assume your site has a relatively small number of URLs that Google crawls but doesn’t index (for instance because you added meta noindex).

  1. Determine how many pages you have on your site; the number of URLs in your XML sitemaps might be a good start.
  2. Go into Google Search Console.
  3. Go to Crawl -> Crawl stats and take note of the average pages crawled per day.
  4. Divide the number of pages by the “Average crawled per day” number.
  5. If you end up with a number higher than ~10 (so you have 10x more pages than what Google crawls each day), you should optimize your crawl budget. If you end up with a number lower than 3, go read something else. 
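The steps above boil down to a simple division. Here's a minimal sketch of that check; the page count and the "Average crawled per day" figure are example numbers you'd substitute with your own values from Search Console:

```python
def crawl_budget_ratio(total_pages, avg_crawled_per_day):
    """Return how many days' worth of crawling your site represents."""
    return total_pages / avg_crawled_per_day

# Example numbers from the article: 250,000 pages, 2,500 crawls/day.
ratio = crawl_budget_ratio(250_000, 2_500)
if ratio > 10:
    verdict = "optimize your crawl budget"
elif ratio < 3:
    verdict = "no crawl budget issue"
else:
    verdict = "borderline; keep an eye on it"
print(ratio, verdict)  # 100.0 optimize your crawl budget
```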


What URLs is Google crawling?

You really should know which URLs Google is crawling on your site. The only “real” way of knowing that is looking at your site’s server logs. For larger sites I personally prefer using Logstash + Kibana for that. For smaller sites, the guys at Screaming Frog have released quite a nice little tool, aptly called SEO Log File Analyser (note the S, they’re Brits).

Get your server logs and look at them

Depending on your type of hosting, you might not always be able to grab your log files. However, if you even so much as think you need to work on crawl budget optimization because your site is big, you should get them. If your host doesn’t allow you to get them, change hosts.

Fixing your site’s crawl budget is a lot like fixing a car. You can’t fix it by looking at the outside, you’ll have to open up that engine. Looking at logs is going to be scary at first. You’ll quickly find that there is a lot of noise in logs. You’ll find a lot of commonly occurring 404s that you think are nonsense. But you have to fix them. You have to get through the noise and make sure your site is not drowned in tons of old 404s.
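To give you an idea of what digging through those logs looks like: here's a hedged sketch that counts which URLs Googlebot requested, assuming the common "combined" log format. The sample lines are made up for illustration; in practice you'd feed in your real access log.

```python
import re
from collections import Counter

# Matches the request, status code and user agent in a combined-format log line.
LOG_LINE = re.compile(
    r'"(?:GET|POST|HEAD) (\S+) HTTP/[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"'
)

def googlebot_hits(lines):
    """Count how often Googlebot requested each URL."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if m and "Googlebot" in m.group(3):
            hits[m.group(1)] += 1
    return hits

# Made-up sample lines; replace with open("access.log") in practice.
sample = [
    '66.249.66.1 - - [05/Jul/2016:10:00:00 +0000] "GET /page/ HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [05/Jul/2016:10:00:05 +0000] "GET /old-page/ HTTP/1.1" 404 312 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [05/Jul/2016:10:00:09 +0000] "GET /page/ HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(googlebot_hits(sample).most_common())
```

Note that sophisticated scrapers fake Googlebot's user agent, so for serious work you'd also verify the requesting IPs; this sketch skips that.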

Increase your crawl budget

Let’s look at the things that actually improve how many pages Google can crawl on your site.

Website maintenance: reduce errors

Step one in getting more pages crawled is making sure that the pages that are crawled return one of two possible return codes: 200 (for “OK”) or 301 (for “Go here instead”). All other return codes are not OK. To figure this out, you have to look at your site’s server logs. Google Analytics and most other analytics packages will only track pages that served a 200. So you won’t find many of the errors on your site in there.

Once you've got your server logs, try to find common errors and fix them. The simplest way of doing that is grabbing all the URLs that didn't return 200 or 301 and ordering them by how often they were accessed. Fixing an error might mean that you have to fix code. Or you might have to redirect a URL elsewhere. If you know what caused the error, you can try to fix the source too.
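That "grab everything that isn't a 200 or 301 and sort by frequency" triage can be sketched like this, again assuming a common-log-format access log (the sample lines are made up for illustration):

```python
import re
from collections import Counter

# Matches the requested URL and the status code in a common-format log line.
LOG_LINE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[^"]*" (\d{3})')

def error_urls(lines, ok_codes=("200", "301")):
    """Return (url, status, count) tuples, most-requested errors first."""
    errors = Counter()
    for line in lines:
        m = LOG_LINE.search(line)
        if m and m.group(2) not in ok_codes:
            errors[(m.group(1), m.group(2))] += 1
    return [(url, code, n) for (url, code), n in errors.most_common()]

# Made-up sample; replace with your real log lines.
sample = [
    '"GET /gone/ HTTP/1.1" 404 0',
    '"GET /gone/ HTTP/1.1" 404 0',
    '"GET /moved/ HTTP/1.1" 302 0',
    '"GET /fine/ HTTP/1.1" 200 0',
]
print(error_urls(sample))  # [('/gone/', '404', 2), ('/moved/', '302', 1)]
```

The most frequently hit errors at the top of that list are the ones worth fixing first.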

Another good source to find errors is Google Search Console. Read this post by Michiel for more info on that. If you’re using Yoast SEO, connecting your site to Google Search Console through the plugin allows you to easily retrieve all those errors. If you’ve got Yoast SEO Premium, you can even redirect them away easily using the redirects manager.

Block parts of your site

If you have sections of your site that really don't need to be in Google, block them using robots.txt. Only do this if you know what you're doing, of course. One of the common problems we see on larger eCommerce sites is that they have a gazillion ways to filter products. Every filter might add new URLs for Google. In cases like these, you really want to make sure that you're letting Google spider only one or two of those filters and not all of them.
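For instance, a robots.txt along these lines would let Google crawl one filter while blocking the rest. The parameter names here are hypothetical; substitute the ones your own faceted navigation actually uses:

```
User-agent: *
# Hypothetical filter parameters; replace with your own.
Disallow: /*?color=
Disallow: /*?size=
# The one filter you do want crawled:
Allow: /*?category=
```

Google supports the `*` wildcard and the `Allow` directive, and picks the most specific matching rule, but other crawlers may interpret these differently, so test your rules before relying on them.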

Reduce redirect chains

When you 301 redirect a URL, something weird happens. Google will see the new URL and add it to the to-do list. It doesn't always follow the redirect immediately; it adds the target to its to-do list and just moves on. When you chain redirects, for instance when you redirect non-www to www and then http to https, every URL costs two extra crawls, so everything takes longer to crawl.
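The fix is to redirect straight to the final URL in a single hop. As a hedged sketch, in nginx that could look like the following (server names are examples, and the TLS details are omitted):

```
# Collapse "http, non-www" and "http, www" straight to "https, www"
# in one 301, instead of chaining non-www -> www -> https.
server {
    listen 80;
    server_name example.com www.example.com;
    return 301 https://www.example.com$request_uri;
}

# Catch "https, non-www" as well.
server {
    listen 443 ssl;
    server_name example.com;
    # ssl_certificate / ssl_certificate_key omitted for brevity
    return 301 https://www.example.com$request_uri;
}
```

Whatever server you use, the principle is the same: any request for an old URL should answer with exactly one 301 pointing at the canonical URL.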

Get more links

This is easy to say, but hard to do. Getting more links is not just a matter of being awesome, it's also a matter of making sure others know that you're awesome too. It's a matter of good PR and good engagement on social media. We've written extensively about link building; I'd suggest reading these 3 posts:

  1. Link building from a holistic SEO perspective
  2. Link building: what not to do?
  3. 6 steps to a successful link building strategy

When you have an acute indexation problem, you should definitely look at your crawl errors, at blocking parts of your site, and at fixing redirect chains first. Link building is a very slow method to increase your crawl budget. On the other hand: if you intend to build a large site, link building needs to be part of your process.

AMP and your crawl budget

Google is telling everyone to use Accelerated Mobile Pages, AMP for short. These are "lighter" versions of web pages, specifically aimed at mobile. The problem with AMP is that it adds a separate URL for every page you have: you'd get example.com/page/ and example.com/page/amp/. This means you need double the crawl budget for your site. If you have crawl budget issues already, don't start working on AMP just yet. We've written about it twice, but find that for sites that don't serve news, it's not worth it yet.

TL;DR: crawl budget optimization is hard

Crawl budget optimization is not for the faint of heart. If you’re doing your site’s maintenance well, or your site is relatively small, it’s probably not needed. If your site is medium sized and well maintained, it’s fairly easy to do based on the above tricks. If you find, after looking at some error logs, that you’re in over your head, it might be time to call in someone more experienced.

Read more: ‘Robots.txt: the ultimate guide’ »


3 Responses to Crawl budget optimization

  1. Jeremy Davis
    By Jeremy Davis on 19 July, 2016

    Hi Joost,
    Great post. It’s not easy finding good crawl budget posts.
    Have you found that the amount of pages being crawled according to server log files is different from the amount of pages that WMT says it crawled?

  2. Brian
    By Brian on 8 July, 2016

    Great post! Google Webmaster Tools allows website owners to submit links (something like 500/month) for automatic crawling. Since I don’t have more than 500 new (or updated) posts per month I’ve just submitted links as soon as they’ve been posted/updated. Is that one valid way to get past the “I sure hope Google crawled my page changes” feeling?

  3. Ivo van den Dijssel
    By Ivo van den Dijssel on 8 July, 2016

    Hi Yoast,

    Do you know if adding parameters to the url-parameter section in Google Console gives Google a clear signal about your site size? And thus, which url to (re)crawl?

    Without parameters my site is about 18.000 pages, with 125.000

    Ivo

