Google crawler document adds HTTP caching details
Google has updated its crawler help documentation to add a new HTTP Caching section that explains how Google’s crawlers handle cache control headers. Google also published a blog post asking us to let Google cache our pages.
Begging may be too much, but Gary Illyes wrote in the first line of the blog post: “Please allow us to cache.” He then explained that sites allow Google to cache their content less today than they did 10 years ago. Gary wrote: “The number of requests that can be returned from local caches has declined: 10 years ago about 0.026% of the total fetches were cacheable, which was already not that impressive; today that number is 0.017%.”
Google has added an HTTP caching section to the help document to explain how it handles cache control headers. Google’s crawling infrastructure supports heuristic HTTP caching as defined by the HTTP caching standard, specifically through the ETag response and If-None-Match request headers, and the Last-Modified response and If-Modified-Since request headers.
If both the ETag and Last-Modified response header fields are present in the HTTP response, Google’s crawlers use the ETag value, as required by the HTTP standard. Specifically for its crawlers, Google recommends using ETag rather than the Last-Modified header to indicate caching preferences, since ETag avoids date-formatting issues. Other HTTP caching directives are not supported, Google added.
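As a minimal sketch of the preference described above, a crawler-style client might build its revalidation headers like this, preferring ETag when both validators were cached. The function name and cache dictionary shape are hypothetical, not from Google’s documentation:

```python
def build_revalidation_headers(cached):
    """Build conditional-request headers from a previously cached response.

    Prefers the ETag validator (sent back via If-None-Match) over
    Last-Modified (sent back via If-Modified-Since), matching the
    preference Google describes. `cached` is a hypothetical dict with
    optional "etag" and "last_modified" keys.
    """
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    elif cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


# Usage: when both validators were cached, only the ETag is sent back.
print(build_revalidation_headers({
    "etag": '"abc123"',
    "last_modified": "Mon, 09 Dec 2024 00:00:00 GMT",
}))  # → {'If-None-Match': '"abc123"'}
```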
I should add that both Google and Bing have supported ETag since at least 2018.
From Google: “Please allow us to cache. Caching is a crucial piece of the Internet’s great puzzle. Caching allows pages to load at lightning speed on re-visits, saves computing resources and therefore natural resources, and generates huge savings.” https://t.co/vQRmBpJvQd
– Glenn Gabe (@glenngabe) December 9, 2024
4/ What impact does this have on page speed?
Google’s crawlers that support caching send the ETag value returned for a previous crawl of that URL in the If-None-Match header. If the ETag value sent by the crawler matches the current value generated by the server, your server should return an HTTP 304 (Not Modified) status code with no response body.
— Siddhesh SEO a/cc (@siddhesh_asawa) December 9, 2024
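On the server side, the conditional-GET handling described above can be sketched roughly as follows. This is an illustrative function, not Google’s or any framework’s API, and it skips details a real server must handle (weak validators, comma-separated ETag lists, and `If-None-Match: *`):

```python
def respond_to_conditional_get(request_headers, current_etag, body):
    """Answer a GET that may carry If-None-Match (sketch only).

    Returns a hypothetical (status, headers, body) triple: 304 with an
    empty body when the crawler's cached ETag still matches, otherwise
    200 with the full body and the current ETag.
    """
    if request_headers.get("If-None-Match") == current_etag:
        # Validator matches: nothing changed, so skip the body entirely.
        return 304, {"ETag": current_etag}, b""
    return 200, {"ETag": current_etag}, body


# Usage: a crawler revalidating with a still-current ETag gets a 304.
status, headers, payload = respond_to_conditional_get(
    {"If-None-Match": '"v1"'}, '"v1"', b"<html>...</html>")
print(status, len(payload))  # → 304 0
```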
Google added more details to that section, and also expanded the HTTP protocols section of the page:
Google’s crawlers and fetchers support HTTP/1.1 and HTTP/2. The crawlers use the protocol version that provides the best crawling performance and may switch protocols between crawling sessions based on previous crawling statistics. The default protocol version used by Google’s crawlers is HTTP/1.1. Crawling over HTTP/2 may save computing resources (e.g. CPU, RAM) for your website and Googlebot, but otherwise it provides no specific product benefit to the site (for example, no ranking boost in Google Search).

To opt out of crawling over HTTP/2, instruct the server hosting your website to respond with HTTP status code 421 when Google attempts to access it over HTTP/2. If that’s not possible, you can send a message to the crawling team (however, this solution is temporary). Google’s crawling infrastructure also supports crawling over FTP (as defined by RFC 959 and its updates) and FTPS (as defined by RFC 4217 and its updates), though crawling over these protocols is rare.
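The opt-out logic Google describes amounts to answering HTTP/2 connection attempts with status 421 (Misdirected Request) so the crawler falls back to HTTP/1.1. A minimal sketch of that decision, with a hypothetical function name and flag:

```python
def status_for_crawl_request(http_version, allow_h2=False):
    """Sketch of the HTTP/2 opt-out described in Google's document.

    `http_version` is the protocol the client connected with (e.g.
    "HTTP/2" or "HTTP/1.1"); `allow_h2` is a hypothetical site setting.
    Returning 421 (Misdirected Request) for HTTP/2 signals the crawler
    to retry the URL over HTTP/1.1.
    """
    if http_version == "HTTP/2" and not allow_h2:
        return 421  # Misdirected Request: decline HTTP/2 crawling
    return 200


# Usage: HTTP/2 attempts are declined, HTTP/1.1 is served normally.
print(status_for_crawl_request("HTTP/2"))    # → 421
print(status_for_crawl_request("HTTP/1.1"))  # → 200
```

In practice this would be a web-server or CDN configuration rule rather than application code, but the status-code behavior is the same.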
Forum discussion at X.