Last week I got a few messages from Google Webmaster Tools saying it couldn’t access the robots.txt file on a client’s site. It turned out the client didn’t handle scheduled downtime correctly, which caused problems with Google. While this article covers some rather basic technical SEO, the last bit might be interesting for more advanced users.
The message from Google Webmaster Tools read like this:
Over the last 24 hours, Googlebot encountered 41 errors while attempting to access your robots.txt. To ensure that we didn’t crawl any pages listed in that file, we postponed our crawl. Your site’s overall robots.txt error rate is 7.0%
HTTP status codes and search engines
A search engine constantly verifies whether the content it’s linking to still exists and hasn’t changed. It verifies two things:
- is the content still being served with the correct HTTP status code (HTTP 200);
- is it still the same content.
An HTTP 200 status code means: all is well, here is the content you asked for. It is the only correct status code for content. If content has moved, you can redirect it, either permanently, with an HTTP 301 header, or temporarily, with an HTTP 302 or 307 header.
If your server gives any other HTTP status header, it means the search engine can no longer find the content. If your server gives a 200 HTTP status code, but the page is in fact an error page that says something like “File not found” or has very little content, Google will classify it as a soft 404 in Google Webmaster Tools.
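To make the soft 404 point concrete, here is a minimal sketch of picking the status line for a page so that an error page never goes out with a 200. The function name statusLineForPage and its two parameters are made up for this example; how you decide whether the content was found depends on your own site.

```php
<?php
// Hypothetical helper, not part of PHP or any library: returns the status
// line for a page. An error page gets a real 404 instead of a 200, so it
// cannot be classified as a soft 404.
function statusLineForPage( $protocol, $contentFound ) {
    if ( $contentFound ) {
        return "$protocol 200 OK";
    }
    return "$protocol 404 Not Found";
}
```

You would pass the result to header() before printing the “File not found” page, for example header( statusLineForPage( "HTTP/1.1", false ) );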
There is only one proper way of telling the search engine that you’re doing site maintenance, and that is the 503 status code described below.
How server downtime works for search engines
If, during a crawl, a search engine finds that some content no longer exists, i.e. it gives a 404 HTTP status, it will usually remove that content from the search results until it can come back and verify that it’s there again. If this happens often, it’ll take longer and longer for the content to come back in the search results.
What you should be doing is giving a 503 HTTP status code. This is the definition of the 503 status code from RFC 2616, the RFC that defines these status codes:
The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.
So, you have to send a 503 status code in combination with a Retry-After header. Basically you’re saying: hang on, we’re doing some maintenance, please come back in X minutes. That sounds a lot better than what a 404 error says: “Not Found”. A 404 literally means that the server can’t find anything to return for the URL that was given.
How do I send a 503 header?
In PHP the code for a 503 would be like this:
    // Answer with the same protocol version the client used.
    $protocol = "HTTP/1.0";
    if ( "HTTP/1.1" == $_SERVER["SERVER_PROTOCOL"] )
        $protocol = "HTTP/1.1";
    // Send the 503 status line, then tell the client when to retry.
    header( "$protocol 503 Service Unavailable", true, 503 );
    header( "Retry-After: 3600" );
The delay time, 3600 in the above example, is given in seconds, so 3600 corresponds to 60 minutes. You can also specify the exact time when the visitor should come back, by sending a GMT date instead of the number of seconds. This would result in something like this:
    header( "Retry-After: Tue, 19 Mar 2013 12:00:00 GMT" );
Use that with caution though: setting it to a wrong date might give unexpected results!
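One way to avoid a wrong date is to let PHP build it rather than typing it by hand. The following is a minimal sketch, not code from any library: the function names retryAfterDate and maintenanceHeaders are made up for this example, and it assumes you already know which protocol string to answer with.

```php
<?php
// Hypothetical helper: formats a Unix timestamp as an HTTP date in GMT.
// gmdate() works out the weekday itself, so it always matches the date.
function retryAfterDate( $timestamp ) {
    return gmdate( 'D, d M Y H:i:s', $timestamp ) . ' GMT';
}

// Hypothetical helper: builds the two header lines for a maintenance
// response. $retryAfter is either a number of seconds or an HTTP date.
function maintenanceHeaders( $protocol, $retryAfter ) {
    return array(
        "$protocol 503 Service Unavailable",
        "Retry-After: $retryAfter",
    );
}
```

For instance, retryAfterDate( time() + 3600 ) gives a date one hour from now. Pass each line returned by maintenanceHeaders() to header() before sending any output.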