How to manage crawl budget for large websites

The internet is a constantly evolving virtual universe with over 1.1 billion websites.

Do you think Google can crawl any website in the world?

Even with all the resources, money, and data centers that Google has, the company can’t even crawl the entire web — and it doesn’t want to.

What is a crawl budget and is it important?

Crawl budget refers to the time and resources Googlebot spends crawling web pages on a domain.

It’s important to optimize your website so Google can find and index your content faster, which can help make your website more visible and more popular.

If you have a large website with millions of web pages, it’s especially important to manage your crawl budget so that Google can crawl your most important pages and better understand your content.

Google states the following:

If your site doesn’t have a large number of pages that change quickly, or if your pages appear to be crawled the same day they are published, keeping your sitemap up to date and checking your index coverage regularly is sufficient. Google also states that each page must be checked, consolidated, and scored to determine where it will be indexed after crawling.

The crawl budget is determined by two main elements: crawl capacity limit and crawl demand.

Crawl demand is how much Google wants to crawl your site. More popular pages (for example, a popular story from CNN) and pages that undergo significant changes are crawled more frequently.

Googlebot wants to crawl your site without overloading your servers. To prevent this, Googlebot calculates a crawl capacity limit. This is the maximum number of simultaneous parallel connections that Googlebot can use to crawl a website and the time delay between requests.

Taking into account crawl capacity and crawl demand, Google defines a website’s crawl budget as the set of URLs that Googlebot can and wants to crawl. If crawl demand is low, Googlebot will crawl your site less, even if the crawl capacity limit has not been reached.

Here are the top 12 crawl budget management tips for medium to large websites with 10,000 to millions of URLs.

1. Determine which pages are important and which should not be crawled

Determine which pages are important and which pages are not as important to crawl (and therefore are visited less frequently by Google).

Once you have determined this through analysis, you can see which pages on your site are worth crawling and which are not, and exclude the latter from crawling.

For example, Macys.com has over 2 million pages that are indexed.

Screenshot of a Google search for [site:macys.com], June 2023

It manages its crawl budget by blocking Googlebot from crawling certain URLs in its robots.txt file, which instructs Google not to crawl those pages of the website.

Otherwise, Googlebot may decide it’s not worth looking at the rest of your site or increasing your crawl budget. Make sure that faceted navigation URLs and session identifiers are blocked via robots.txt.
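
As a rough sketch, robots.txt rules that block faceted navigation parameters and session IDs might look like the following (the parameter names color, size, and sessionid are hypothetical; substitute whatever your platform actually appends to URLs):

    # Hypothetical example: keep crawlers out of faceted navigation and session URLs
    User-agent: *
    # Faceted navigation filters (e.g., color and size facets)
    Disallow: /*?*color=
    Disallow: /*?*size=
    # Session identifiers appended to URLs
    Disallow: /*?*sessionid=

Googlebot supports the * wildcard in robots.txt rules, which makes it possible to match these parameters wherever they appear in a URL.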

2. Manage duplicate content

While Google doesn’t impose a penalty for duplicate content, you want to provide Googlebot with original and unique information that meets the information needs of end users and is relevant and useful. Make sure you are using the robots.txt file.

Google has stated not to use noindex for this, as Googlebot will still request those pages but then drop them.

3. Block crawling of irrelevant URLs using Robots.txt and tell Google which pages can be crawled

For an enterprise-level website with millions of pages, Google recommends using robots.txt to block crawling of irrelevant URLs.

You also want to make sure that your important pages, directories containing your golden content, and money pages are allowed to be crawled by Googlebot and other search engines.

Screenshot of a robots.txt file by the author, June 2023
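
A minimal sketch of such a robots.txt file, assuming hypothetical directory names like /checkout/ and /internal-search/ for low-value URLs, could look like this:

    # Hypothetical example: block low-value sections, leave money pages crawlable
    User-agent: *
    Disallow: /checkout/
    Disallow: /internal-search/
    # Everything not disallowed (e.g., /products/ and /guides/) remains crawlable by default

    # Point crawlers at the XML sitemap
    Sitemap: https://www.example.com/sitemap.xml

Because anything not disallowed is crawlable by default, you usually only need explicit Allow rules to carve out exceptions inside a disallowed directory.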

4. Avoid long redirect chains

Keep the number of redirects as low as possible. Too many redirects or redirect loops can confuse Google and lower your crawl limit.

Google states that long redirect chains can negatively impact crawling.
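
To spot long chains, a small script along these lines (a sketch using the third-party requests library; the URLs are placeholders) can report how many hops a URL passes through before it resolves:

    # Sketch: report redirect chain length for a list of URLs.
    # Assumes the third-party "requests" library is installed (pip install requests).
    import requests

    urls_to_check = [
        "https://www.example.com/old-category/",  # placeholder URLs
        "https://www.example.com/spring-sale",
    ]

    for url in urls_to_check:
        response = requests.get(url, allow_redirects=True, timeout=10)
        hops = len(response.history)  # each entry is one redirect in the chain
        if hops > 1:
            chain = " -> ".join([r.url for r in response.history] + [response.url])
            print(f"{hops} redirects: {chain}")
        else:
            print(f"OK ({hops} redirect): {url} -> {response.url}")

Collapsing a chain such as A → B → C into a single A → C redirect saves Googlebot the extra requests on the intermediate hops.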

5. Use HTML

Using HTML increases the likelihood that a crawler from any search engine will visit your site.

While Googlebot has improved at crawling and indexing JavaScript, other search engine crawlers are not as sophisticated as Google and may have problems with languages other than HTML.

6. Make sure your web pages load quickly and provide a good user experience

Optimize your website for Core Web Vitals.

The faster your content loads – less than three seconds – the faster Google can serve information to end users. If they like it, Google will keep crawling your content because your site demonstrates a healthy crawl state, which may result in your crawl limit increasing.

7. Have useful content

According to Google, content is rated based on quality, regardless of age. Create and update your content as needed, but there’s no value in making pages appear artificially fresh by making minor changes and updating the page date.

If your content meets end-user needs and is helpful and relevant, it doesn’t matter if it’s old or new.

If users don’t find your content helpful and relevant, I encourage you to refresh and update it to keep it current, relevant, and useful, and to promote it through social media.

Also, link your pages directly from the home page; pages linked from the home page may be considered more important and crawled more often.

8. Watch out for crawl errors

If you deleted some pages on your site, make sure the URL for permanently removed pages returns a 404 or 410 status. A 404 status code is a strong signal not to recrawl that URL.

Blocked URLs, however, remain part of your crawl queue much longer and are crawled again once the block is removed.

  • Google also states that soft 404 pages will continue to be crawled and waste your crawl budget. To check for these, go to GSC and review your Index Coverage report for soft 404 errors.

If your site returns a lot of 5xx HTTP response status codes (server errors) or connection timeouts, crawling slows down. Google recommends paying attention to the Crawl Stats report in Search Console and keeping the number of server errors to a minimum.

By the way, Google doesn’t respect or follow the non-standard “crawl-delay” robots.txt rule.

Even if you use the nofollow attribute, the page can still be crawled and waste crawl budget if another page on your site, or any page on the web, doesn’t label the link as nofollow.
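
As a quick spot check, a sketch like the following (placeholder URLs, again using the requests library) can confirm that removed pages really return 404 or 410 and surface unexpected 200s or server errors:

    # Sketch: verify that removed URLs return 404/410 and surface server errors.
    # The URLs below are placeholders; replace them with pages you have removed.
    import requests

    removed_urls = [
        "https://www.example.com/discontinued-product",
        "https://www.example.com/old-landing-page",
    ]

    for url in removed_urls:
        status = requests.get(url, allow_redirects=False, timeout=10).status_code
        if status in (404, 410):
            print(f"OK: {url} returns {status}")
        elif status >= 500:
            print(f"Server error ({status}) on {url} - server errors slow crawling")
        else:
            print(f"Check {url}: returned {status} (possible soft 404 if the page has no real content)")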

9. Keep sitemaps up to date

XML sitemaps are important for helping Google find your content and can speed up crawling and indexing.

It’s extremely important to keep your sitemap URLs up to date, use the <lastmod> tag for updated content, and follow SEO best practices, including but not limited to the following (a minimal sitemap sketch follows this list).

  • Only provide URLs that you want search engines to index.
  • Only include URLs that return a 200 status code.
  • Make sure a single sitemap file is no larger than 50MB and contains no more than 50,000 URLs. If you use multiple sitemaps, create one sitemap index file that lists all of them.
  • Make sure your sitemap is UTF-8 encoded.
  • Include links to the localized versions of each URL. (See Google’s documentation.)
  • Keep your sitemap up to date, i.e., update it every time a new URL is added or an old URL is updated or deleted.
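
A minimal sketch of a sitemap entry that follows these guidelines (the example.com URLs and the date are placeholders):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
            xmlns:xhtml="http://www.w3.org/1999/xhtml">
      <url>
        <!-- An indexable URL that returns a 200 status code -->
        <loc>https://www.example.com/products/blue-widget</loc>
        <!-- Updated whenever the page content meaningfully changes -->
        <lastmod>2023-06-01</lastmod>
        <!-- Link to a localized version of the same URL -->
        <xhtml:link rel="alternate" hreflang="de"
                    href="https://www.example.com/de/products/blue-widget"/>
      </url>
    </urlset>

Large sites typically split this into multiple sitemap files referenced from a single sitemap index file.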

10. Create a good site structure

Good website structure is important for your SEO performance, indexing and user experience.

Website structure can affect search engine results page (SERP) results in a number of ways, including crawlability, click-through rate, and user experience.

With a clear and linear structure of your website, you can use your crawl budget efficiently, which will help Googlebot to find new or updated content.

Always remember the three-click rule, which means that any user should be able to get from one page of your website to another with a maximum of three clicks.

11. Internal Linking

The easier you make it for search engines to crawl and navigate your site, the easier it is for crawlers to identify your structure, context, and important content.

Internal links pointing to a webpage can let Google know that that page is important, help build a hierarchy of information for that site, and help propagate link equity throughout your site.

12. Always monitor crawl statistics

Always check and monitor GSC to see if there are any problems crawling your site and look for ways to make the crawling more efficient.

You can use the Crawl Statistics report to determine if Googlebot is having trouble crawling your site.

If your site is reporting availability errors or warnings in GSC, look for instances in the host availability charts where Googlebot requests crossed the red line, click into the chart to see which URLs failed, and try to correlate them with problems on your site.

Also, you can use the URL inspection tool to test some URLs on your website.

If the URL inspection tool returns host load warnings, it means Googlebot can’t crawl as many URLs from your site as it detected.

Summary

Crawl budget optimization is crucial for large websites due to their massive size and complexity.

With numerous pages and dynamic content, search engine crawlers face the challenge of crawling and indexing the website content efficiently and effectively.

By optimizing their crawl budget, site owners can prioritize the crawling and indexing of important and updated pages, ensuring search engines use their resources wisely and effectively.

This optimization process includes techniques such as improving website architecture, managing URL parameters, setting crawl priorities, and eliminating duplicate content, resulting in better search engine visibility, improved user experience, and increased organic traffic for large websites.
