
Crawl Budget on Large Sites: Indexing 91,000 Pages

Somewhere around page 47,000 of a crawl report, I genuinely considered a career change. The site, a large UK-based e-commerce catalogue with around 91,000 indexable URLs, had been sitting at roughly 34,000 pages indexed for six months. Not growing. The client was convinced something was "broken." I told them nothing was broken. I was half right.

That project changed how I think about crawl budget entirely. Not the theory; I'd read the Google documentation, I'd watched the Search Central videos, I knew what crawl budget was. But knowing it and actually managing it at scale are two wildly different things. What follows is everything I'd tell myself if I could go back to that Tuesday morning in March 2022 when I first pulled the crawl stats in Google Search Console and felt my stomach drop.

What Crawl Budget Actually Means (And What It Doesn't)

Here's the thing that trips people up constantly: crawl budget does not mean "the number of pages Google will ever index for you." It means roughly the number of URLs Googlebot will fetch within a given crawl window, which Google itself defines as a combination of crawl rate limit and crawl demand.

Crawl rate limit is how fast Googlebot can crawl without hammering your server. Crawl demand is how much Google wants to crawl, driven by how popular your URLs are and how often they change. Put those two levers together and you have a rough sense of how much crawling attention your site gets.

For most sites under 1,000 pages, this is irrelevant. Google will crawl everything. But once you're in the tens of thousands (and absolutely once you breach six figures), Googlebot starts making choices. It will prioritise. It will ignore. And if you haven't set it up to prioritise the right stuff, it will cheerfully spend its time crawling your session-ID parameter URLs and your filtered facet pages while your new product drops go unnoticed for weeks.

That's not a hypothetical. That's what happened on the 91,000-page project.

The Faceted Navigation Problem Nobody Warned Me About

Faceted navigation is the single biggest crawl budget killer I've encountered on large sites. Consistently. Every time.

The catalogue site had a faceted filter system (colour, size, material, brand) with no URL parameter handling configured anywhere. Each filter combination generated a unique URL. You could select "blue," "medium," "cotton," and "BrandX" and get /shop?colour=blue&size=medium&material=cotton&brand=brandx. Then someone flipped the order and got /shop?size=medium&colour=blue&brand=brandx&material=cotton. Different URL, identical content.
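
Before anyone touches canonicals, it helps to quantify the problem. Here's a minimal Python sketch of the kind of check I run on a crawl export: it treats the path plus an alphabetically sorted parameter set as a fingerprint, so /shop?colour=blue&size=medium and /shop?size=medium&colour=blue land in the same bucket. The file name and one-URL-per-line format are assumptions; adapt it to whatever your crawler actually exports.

```python
# Minimal sketch: fingerprint parameter URLs so the same filter set in a
# different order collapses to one key. "crawled_urls.txt" is a hypothetical
# one-URL-per-line export (e.g. copied out of a Screaming Frog crawl).
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl

def fingerprint(url: str) -> str:
    """Path plus alphabetically sorted query parameters."""
    parts = urlsplit(url)
    params = sorted(parse_qsl(parts.query))
    return parts.path + "?" + "&".join(f"{k}={v}" for k, v in params)

groups = defaultdict(list)
with open("crawled_urls.txt") as f:
    for line in f:
        url = line.strip()
        if url:
            groups[fingerprint(url)].append(url)

duplicate_sets = {k: v for k, v in groups.items() if len(v) > 1}
print(f"{len(duplicate_sets)} filter combinations reachable via multiple URLs")
```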

I ran a Screaming Frog crawl (version 18, which handles JavaScript rendering much better than older versions) and found over 200,000 URLs being generated by the filter system alone. Googlebot was visiting these. Constantly. While thousands of legitimate product pages sat unindexed.

The Fix That Actually Worked

We tackled this in two stages. First, I configured URL parameter handling in Google Search Console (a tool Google has since retired), flagging the filter parameters as "Doesn't change page content" to signal Googlebot to consolidate. Second, and more importantly, the dev team implemented a proper canonical strategy, pointing all filter combinations back to the base category page. We also added noindex to low-value filtered pages that couldn't practically be canonicalised.
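
For illustration, here's a rough sketch of the kind of head-tag decision the templates ended up expressing. The facet parameter names match the example above, but the "single filter gets a canonical, deeper combinations get noindex" threshold is my simplification, not the client's exact rule; in practice the split depended on which facets had genuine search demand.

```python
# Rough sketch of the head-tag decision for a filtered category URL. The
# single-filter-vs-multi-filter threshold is a simplification for illustration.
from urllib.parse import urlsplit, parse_qsl, urlunsplit

FILTER_PARAMS = {"colour", "size", "material", "brand"}

def head_tag(url: str) -> str:
    parts = urlsplit(url)
    filters = {k: v for k, v in parse_qsl(parts.query) if k in FILTER_PARAMS}
    base = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

    if not filters:
        return f'<link rel="canonical" href="{url}">'       # base category page
    if len(filters) == 1:
        return f'<link rel="canonical" href="{base}">'       # consolidate to category
    return '<meta name="robots" content="noindex, follow">'  # low-value combination

print(head_tag("https://example.com/shop?size=medium&colour=blue&brand=brandx"))
```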

Within about eight weeks, the indexed page count started climbing. Not explosively, but steadily. Which is actually what you want. A sudden spike in indexed pages can sometimes trigger a re-evaluation from Google rather than a clean win.

Crawl Stats in Search Console: The Data Most People Ignore

I've audited close to 80 sites in the last three years for crawl issues specifically. Maybe 15% of the people who handed those sites to me had ever looked at the Crawl Stats report in Search Console. That number should be much higher.

The Crawl Stats report shows you average crawl requests per day, average response time, and, crucially, what Googlebot is actually crawling, broken down by purpose (discovery vs. refresh). If your "refresh" crawls are dominating and discovery crawls are minimal, Google is spending its time re-checking pages it already knows about. Not finding new ones. That's a signal your internal linking is probably shallow or your XML sitemap is doing nothing useful.

On the 91,000-page project, we were sitting at around 2,400 crawl requests per day. For a site that size, that means Google would theoretically take about 38 days to crawl everything once, assuming every request hit a unique, useful page. They didn't. Roughly 40% of crawl requests were hitting redirect chains or parameter-inflated duplicates.
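
The arithmetic behind those figures, if you want to sanity-check your own site, is nothing more than this:

```python
# Back-of-envelope maths from the paragraph above: how long a full crawl cycle
# takes at the observed request rate, and what 40% wasted requests cost.
indexable_urls = 91_000
requests_per_day = 2_400
wasted_share = 0.40          # redirect chains and parameter duplicates

naive_days = indexable_urls / requests_per_day
effective_days = indexable_urls / (requests_per_day * (1 - wasted_share))

print(f"Full cycle if every request were useful: {naive_days:.0f} days")     # ~38
print(f"Full cycle at 40% waste:                 {effective_days:.0f} days")  # ~63
```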

Average Response Time Matters More Than You Think

One thing I underestimated early in my career: Googlebot is genuinely sensitive to server speed. Not in a ranking way (well, not directly), but in a crawl willingness way. Slow servers cause Googlebot to back off. Google will reduce its crawl rate to avoid stressing a struggling server.

The catalogue site had a Time to First Byte sitting around 1.8 seconds on category pages during peak traffic. After the client moved from shared hosting to a dedicated VPS with proper caching (WP Rocket for page caching, Redis for object caching), TTFB dropped to under 400ms. Crawl requests per day climbed noticeably over the following six weeks. Correlation, obviously, but I've seen this pattern too many times to dismiss it.
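
If you just want a quick spot-check before committing to proper monitoring, something like this is enough. It uses the requests library's response.elapsed, which covers the time from sending the request until the response headers are parsed, a workable proxy for TTFB. The URLs are placeholders.

```python
# Quick TTFB spot-check. requests' response.elapsed measures time from sending
# the request until the response headers are parsed, a workable TTFB proxy.
import requests

CATEGORY_URLS = [
    "https://example.com/shop/womens-coats",
    "https://example.com/shop/mens-shirts",
]

for url in CATEGORY_URLS:
    # stream=True avoids pulling the full body; .elapsed is header time either way
    resp = requests.get(url, stream=True, timeout=10)
    print(f"{resp.elapsed.total_seconds() * 1000:6.0f} ms  {resp.status_code}  {url}")
    resp.close()
```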

XML Sitemaps: Stop Treating Them Like a Formality

Most sitemaps I inherit are wrong. Not dramatically wrong โ€” just quietly, uselessly wrong.

Common issues I see (a quick validation sketch follows the list):

  • Pages in the sitemap that return 404s or 301 redirects
  • Noindexed pages included in the sitemap (this confuses Googlebot: you're simultaneously saying "crawl this" and "don't index this")
  • <lastmod> dates that are static or just wrong
  • Sitemaps with 70,000+ URLs in a single file (the limit is 50,000 per file, and large files slow down processing)
  • No sitemap index file, just one monolithic XML blob
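
Here's that validation sketch: pull the <loc> entries out of a sitemap and flag anything that isn't a clean 200 or that carries a noindex meta tag. The sitemap URL is a placeholder, the regex is deliberately crude (it won't catch X-Robots-Tag headers), and on a 90,000-URL file you'd sample or run the requests asynchronously rather than loop like this.

```python
# Flag sitemap entries that return a non-200 status or carry a noindex meta tag.
# Sitemap URL is a placeholder; the slice samples the first 500 URLs.
import re
import requests
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
noindex_re = re.compile(r'<meta[^>]+name=["\']robots["\'][^>]*noindex', re.I)

sitemap = requests.get("https://example.com/sitemap.xml", timeout=10)
locs = [el.text for el in ET.fromstring(sitemap.content).findall(".//sm:loc", NS)]

for url in locs[:500]:
    r = requests.get(url, timeout=10, allow_redirects=False)
    if r.status_code != 200:
        print(f"status {r.status_code}  {url}")
    elif noindex_re.search(r.text):
        print(f"noindex meta    {url}")
```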

On the large catalogue project, the sitemap had 91,000 URLs in a single file. It was also including every filtered URL that had ever been generated, over 40,000 of which were noindexed. Googlebot was processing this enormous file and then discovering most of the URLs shouldn't be crawled anyway. Wasted signal on both ends.

We rebuilt the sitemap architecture as a proper sitemap index pointing to segmented child sitemaps: one for core category pages, one for product pages (split into two files given volume), one for editorial content. Each file under 40,000 URLs. <lastmod> values dynamically generated from the actual last-modified date in the database. No noindexed pages, no redirects.
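
A condensed sketch of what that kind of build looks like, with made-up page data standing in for the database query that supplied the real URLs and lastmod values:

```python
# Condensed sketch of a segmented sitemap build. The product list is a stand-in
# for the database query that supplied real URLs and last-modified dates.
from datetime import date
from xml.sax.saxutils import escape

MAX_URLS_PER_FILE = 40_000   # kept comfortably below the 50,000-per-file limit

def sitemap_xml(pages):
    """pages: iterable of (loc, lastmod) for live, indexable URLs only."""
    rows = "\n".join(
        f"  <url><loc>{escape(loc)}</loc><lastmod>{lastmod:%Y-%m-%d}</lastmod></url>"
        for loc, lastmod in pages
    )
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f'{rows}\n</urlset>\n')

def sitemap_index(child_urls):
    rows = "\n".join(f"  <sitemap><loc>{escape(u)}</loc></sitemap>" for u in child_urls)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f'{rows}\n</sitemapindex>\n')

# Split the product pages into chunks and list every child file in the index.
products = [(f"https://example.com/product/{i}", date(2022, 3, 1)) for i in range(85_000)]
chunks = [products[i:i + MAX_URLS_PER_FILE] for i in range(0, len(products), MAX_URLS_PER_FILE)]
for n, chunk in enumerate(chunks, start=1):
    with open(f"sitemap-products-{n}.xml", "w") as f:
        f.write(sitemap_xml(chunk))
with open("sitemap-index.xml", "w") as f:
    f.write(sitemap_index(f"https://example.com/sitemap-products-{n}.xml"
                          for n in range(1, len(chunks) + 1)))
```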

The Bing Webmaster Tools data (yes, worth checking: Bing will sometimes show you crawl behaviour patterns that hint at structural issues Google is also experiencing) showed sitemap processing time drop by over 60%.

Internal Linking: The Lever You Actually Control

Here's something I genuinely didn't appreciate until Seahawk took on a large content site (roughly 65,000 articles) for a media client back in 2020. The site had crawl budget issues despite having a well-formed sitemap and clean URL structure. The problem was internal linking depth. Thousands of articles were effectively orphaned: no internal links pointing to them from any crawled page.

Googlebot doesn't only follow sitemaps. It follows links. If a page is only discoverable through a sitemap entry and has zero internal links, it gets deprioritised. That's not officially documented in crisp terms, but Google's own guidance on internal linking makes clear that crawlable links from important pages are how Googlebot prioritises discovery.

For that media client, we audited internal links using Ahrefs' Site Audit tool and identified around 12,000 articles with three or fewer internal links pointing to them. We built an automated "related articles" block into the CMS (WordPress, custom Gutenberg block) that pulled contextually similar content. Over the following quarter, indexed pages on that site climbed from 41,000 to over 58,000. Same domain authority. Same content production rate. Just better internal linking.

The numbered approach I now use on every large site audit:

  1. Run a full Screaming Frog crawl and export internal link data
  2. Identify every page with fewer than three inbound internal links (see the sketch after this list)
  3. Cross-reference against pages that are well-linked to find topical clusters
  4. Build contextual internal links from high-traffic pages downward into the thin-linked pages
  5. Validate in Search Console's URL Inspection tool that newly linked pages move from "Discovered - currently not indexed" to "Crawled"
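
The sketch for steps 1 and 2, assuming a Screaming Frog "All Inlinks" style CSV export. Column names vary between versions, so treat "Source" and "Destination" as placeholders, and remember that truly orphaned pages never appear in an inlinks export at all; you still need to diff against the full URL list to catch them.

```python
# Count inbound internal links per destination from an "All Inlinks" style CSV.
# Column names are assumptions; adjust to whatever your export actually uses.
# Pages with zero inlinks won't appear here at all - diff against the full URL
# list separately to catch true orphans.
import csv
from collections import Counter

inbound = Counter()
with open("all_inlinks.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        src, dst = row.get("Source", ""), row.get("Destination", "")
        if src and dst and src != dst:
            inbound[dst] += 1

thin = sorted(url for url, count in inbound.items() if count < 3)
print(f"{len(thin)} URLs with fewer than three inbound internal links")
```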

That "Discovered โ€” currently not indexed" status in Search Console is your canary. It means Google knows the page exists but hasn't prioritised fetching it. Improving internal links is usually the fastest way to resolve it.

Log File Analysis: Uncomfortable But Necessary

I'll be honest: log file analysis is something I avoided for years. It felt like unnecessary depth when crawl tools gave you most of what you needed. I was wrong.

Log files tell you what Googlebot actually did, not what you infer it did from your sitemap or crawl tool. On one project (a SaaS company with about 8,000 product documentation pages), log analysis revealed Googlebot was spending nearly 30% of its crawl time on /wp-admin/-adjacent URLs and admin-side assets that should have been blocked in robots.txt. Nobody had set that up properly. Meanwhile, documentation pages hadn't been crawled in four months.

Screaming Frog's Log File Analyser is the tool I use. It's not glamorous but it's reliable. Import your server logs, filter by Googlebot user agent, and sort by URL hit frequency. The patterns that emerge are almost always illuminating, and almost always include something crawling that shouldn't be.
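
If you'd rather skip the GUI, the core of the exercise is a few lines of Python against a common/combined-format access log. The log path is an assumption, and this doesn't verify that "Googlebot" user agents are genuine (reverse-DNS verification is a separate step).

```python
# Count Googlebot hits per URL path from a common/combined-format access log.
# Does not verify the user agent is genuinely Googlebot; spoofed bots will be
# counted too. Query strings are stripped so parameter noise groups together.
import re
from collections import Counter

request_re = re.compile(r'"[A-Z]+ (\S+) HTTP/[^"]*"')

hits = Counter()
with open("access.log", encoding="utf-8", errors="replace") as f:
    for line in f:
        if "Googlebot" not in line:
            continue
        match = request_re.search(line)
        if match:
            hits[match.group(1).split("?")[0]] += 1

for path, count in hits.most_common(20):
    print(f"{count:7d}  {path}")
```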

When to Worry and When to Leave It Alone

Not every large site needs aggressive crawl budget management. If you're at 10,000 pages and 9,800 are indexed, don't start pulling levers. You'll create problems where none exist.

Crawl budget management becomes genuinely worth your time when:

  • You have more than ~15,000 indexable pages
  • Your indexed count has plateaued despite new content being added
  • Crawl Stats shows average crawl requests well below what you'd expect for your page volume
  • You see thousands of URLs in "Discovered - currently not indexed" or "Crawled - currently not indexed" status

That second status, "Crawled - currently not indexed", is different and worth separating out. It means Google fetched the page and decided not to index it, usually due to thin content or near-duplicate issues. No amount of crawl budget optimisation fixes a quality problem.

---

FAQ

Does crawl budget affect small sites?

Rarely in a meaningful way. If your site has under 1,000 pages and loads quickly, Google will almost certainly crawl everything regardless. Crawl budget becomes a genuine concern at scale, typically above 10,000 to 15,000 pages, or on sites where a large portion of URLs are dynamically generated.

Will submitting a sitemap directly fix crawl budget issues?

No. A sitemap helps with discovery: it tells Google these URLs exist. But if your site has structural issues (faceted navigation spam, slow server response, shallow internal linking), a sitemap won't override those signals. Think of a sitemap as a suggestion, not a command.

How do I check if Googlebot is wasting crawl on junk URLs?

Start with the Crawl Stats report in Google Search Console and look at what URL types are getting the most requests. Then cross-reference with a Screaming Frog crawl to identify high-volume URL patterns that are duplicates, noindexed, or low-value. Log file analysis will give you the most precise picture if you have access to server logs.

Should I use `noindex` or `robots.txt disallow` to save crawl budget?

Different tools for different jobs. Disallow in robots.txt prevents Googlebot from fetching the page at all, which saves crawl budget but means Google can't read any signals on that page. Noindex allows Google to fetch the page but tells it not to include the page in search results. For crawl budget specifically, disallow is more effective on truly junk URLs (admin paths, internal search results). For filtered facet pages where you want Google to understand the content but not index it, noindex (or a canonical, where consolidation is practical) is usually the right call.

What's a realistic timeframe to see improvements after fixing crawl budget issues?

Honestly, it depends on your crawl rate. On the 91,000-page project, meaningful movement in indexed page counts took about six to eight weeks after the major fixes were deployed. Don't expect overnight changes; Googlebot needs to re-crawl, re-evaluate, and the indexing pipeline has its own latency on top of that.

---

The 91,000-page project ended well. Indexed pages climbed from 34,000 to just over 71,000 over five months. Not perfect (there were genuinely thin product pages that deserved not to be indexed), but the content that mattered got found. The client stopped asking if something was broken. And I stopped eyeing career changes around page 47,000 of crawl reports. Mostly.
