Last week I got a few messages from Google Webmaster Tools saying it couldn’t access the robots.txt file on a client’s site. It turned out the client didn’t handle scheduled downtime correctly, which caused problems with Google. While this article covers some rather basic technical SEO, the last bit might be interesting for more advanced users. The message from Google Webmaster Tools read like this:
Over the last 24 hours, Googlebot encountered 41 errors while attempting to access your robots.txt. To ensure that we didn’t crawl any pages listed in that file, we postponed our crawl. Your site’s overall robots.txt error rate is 7.0%
HTTP status codes and search engines
A search engine constantly verifies whether the content it links to still exists and hasn’t changed. It verifies two things:
- is the content still being served with the correct HTTP status code (HTTP 200);
- is it still the same content.
An HTTP 200 status code means: all is well, here is the content you asked for. It is the only correct status code for content. If content has moved, you can redirect it, either permanently, with an HTTP 301 header, or temporarily, with an HTTP 302 or 307 header.
If your server gives any other HTTP status header, it means the search engine can no longer find the content. If your server gives a 200 HTTP status code, but the page is in fact an error page that says something like “File not found” or has very little content, Google will classify it as a soft 404 in Google Webmaster Tools.
There is only one proper way of telling a search engine that you’re doing site maintenance, and that is a 503 HTTP status code.
How server downtime works for search engines
If, during a crawl, a search engine finds that some content no longer exists, i.e. the URL gives a 404 HTTP status, it will usually remove that content from the search results until it can come back and verify that it’s there again. If this happens often, it’ll take longer and longer for the content to come back in the search results.
What you should be doing is giving a 503 HTTP status code. This is the definition of the 503 status code from the RFC that defines these status codes:
The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.
So, you have to send a 503 status code in combination with a Retry-After header. Basically you’re saying: hang on, we’re doing some maintenance, please come back in X minutes. That sounds a lot better than what a 404 error says: “Not Found”. A 404 literally means that the server can’t find anything to return for the URL that was given.
How do I send a 503 header?
In PHP the code for a 503 would be like this:
$protocol = "HTTP/1.0";
if ( "HTTP/1.1" == $_SERVER["SERVER_PROTOCOL"] ) {
    $protocol = "HTTP/1.1";
}
header( "$protocol 503 Service Unavailable", true, 503 );
header( "Retry-After: 3600" );
The delay time, 3600 in the above example, is given in seconds, so 3600 corresponds to 60 minutes. You can also specify the exact time when the visitor should come back, by sending a GMT date instead of the number of seconds. This would result in something like this:
header( "Retry-After: Tue, 19 Mar 2013 12:00:00 GMT" );
Use that with caution, though: setting it to a wrong date might give unexpected results!
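One way to lower the risk of a wrong date is to let PHP build the date string from a timestamp, so the weekday always matches the date. The snippet below is a minimal sketch, not a definitive implementation: the helper name retry_after_date() is made up for this example, and it assumes PHP 7 or later because of the type declarations.

```php
<?php
// Hypothetical helper (not part of PHP or WordPress): format a Unix
// timestamp as an HTTP date suitable for a Retry-After header.
function retry_after_date( int $timestamp ): string {
    // gmdate() always formats in GMT/UTC, regardless of the server's
    // configured timezone, and derives the weekday from the date itself.
    return gmdate( 'D, d M Y H:i:s', $timestamp ) . ' GMT';
}

// Assumption for this example: maintenance ends on 19 March 2013, 12:00 GMT.
// gmmktime() takes hour, minute, second, month, day, year.
$maintenance_ends = gmmktime( 12, 0, 0, 3, 19, 2013 );

header( 'Retry-After: ' . retry_after_date( $maintenance_ends ) );
```

If you only know roughly how long the maintenance will take, the plain number of seconds is the simpler and safer choice.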
Our site is never down, we’re on WordPress
Nonsense. Every time you upgrade your core WordPress install, or when you’re updating plugins, WordPress will give a maintenance page. The default page sends out a proper 503 header. You can replace the default error page with a maintenance.php file in your wp-content folder, but if you do, you have to make sure that file sends out the proper 503 headers too. You can copy the code from the
If your database is down, WordPress actually sends an internal server error, using the dead_db() function. If you’re doing planned maintenance on your database, therefore, you’ll need to set up a custom database error message page, db-error.php in your wp-content folder, that sends a proper 503 header.
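To make that concrete, here is a minimal sketch of what such a db-error.php could look like. The helper maintenance_headers(), the one-hour delay and the HTML message are all assumptions made for this example; adjust them to your own situation. It assumes PHP 7 or later.

```php
<?php
// Sketch of wp-content/db-error.php, the file WordPress loads when it
// cannot reach the database.

// Hypothetical helper: build the header lines for a maintenance response.
function maintenance_headers( string $protocol, int $retry_after ): array {
    return array(
        "$protocol 503 Service Unavailable",
        "Retry-After: $retry_after",
        'Content-Type: text/html; charset=utf-8',
    );
}

// Mirror the protocol version of the request, falling back to HTTP/1.0.
$protocol = 'HTTP/1.0';
if ( isset( $_SERVER['SERVER_PROTOCOL'] ) && 'HTTP/1.1' === $_SERVER['SERVER_PROTOCOL'] ) {
    $protocol = 'HTTP/1.1';
}

// Assumption: the database work takes about an hour (3600 seconds).
foreach ( maintenance_headers( $protocol, 3600 ) as $line ) {
    header( $line );
}

// Headers must be sent before any output, so the message comes last.
echo '<h1>Down for maintenance</h1><p>Please check back in about an hour.</p>';
```

Keeping the header lines in one small function makes it easy to reuse the same response in a custom maintenance.php.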
So where did our client go wrong?
Funnily enough, our client had properly configured 503 headers on their server. There was an issue, though: they use a Varnish cache, and Varnish didn’t pass the 503 status code through correctly. It replaced it with a “general” HTTP 500 status, causing Google to send out that error email. I haven’t had a chance to test whether that is default Varnish behavior or something they broke, but it’s worth testing for your environment.
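A quick way to test this is to request the same URL once directly from the backend and once through the cache, and compare the status codes. The script below is a rough sketch under a few assumptions: both function names are invented for this example, the two URLs you pass in are your own, and PHP 7.1 or later with allow_url_fopen enabled is available.

```php
<?php
// Hypothetical helper: pull the numeric status code out of a status line
// such as "HTTP/1.1 503 Service Unavailable". Returns null if the line
// is not a status line.
function status_code_from_line( string $status_line ): ?int {
    if ( preg_match( '#^HTTP/\d(?:\.\d)?\s+(\d{3})\b#', $status_line, $matches ) ) {
        return (int) $matches[1];
    }
    return null;
}

// Hypothetical helper: request a URL and return the status code of the
// first response, or null if the request failed.
function fetch_status_code( string $url ): ?int {
    $headers = @get_headers( $url );
    if ( false === $headers || ! isset( $headers[0] ) ) {
        return null;
    }
    // Index 0 holds the status line of the first response.
    return status_code_from_line( $headers[0] );
}

// Usage from the command line (both URLs are placeholders you supply):
//   php check-status.php <backend-url> <cached-url>
$args = $_SERVER['argv'] ?? array();
if ( 'cli' === PHP_SAPI && isset( $args[1], $args[2] ) ) {
    $backend = fetch_status_code( $args[1] );
    $cached  = fetch_status_code( $args[2] );
    echo 'Backend: ' . var_export( $backend, true ) . PHP_EOL;
    echo 'Cache:   ' . var_export( $cached, true ) . PHP_EOL;
    echo ( $backend === $cached ) ? 'Status codes match.' : 'Status codes differ!';
    echo PHP_EOL;
}
```

Run it while the maintenance page is active. If the backend reports 503 and the cached URL reports something else, the cache is rewriting your status code.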
Pro tip: sending a 503 for your robots.txt
Per this post from Pierre Far of Google, if you send an HTTP 503 status code for your robots.txt, Google will halt all crawling on your domain until it’s allowed to crawl the robots.txt again. This is actually a very useful way of preventing load on your server when doing maintenance. It still requires you to send a 503 for every URL on your server, including all static ones, but after Google has re-fetched the robots.txt it’ll probably stop hammering your server(s) for a while.
Conclusion: know what HTTP headers you’re sending
While writing this article I was reminded of a tweet quoting Vanessa Fox during last week’s SMX West:
— Ruth Burr (@ruthburr) March 11, 2013
I couldn’t agree more and would add to that: at all times. Now go check those headers!