
Log File Analysis for Crawl Budget on 50K-Page Sites

Back in 2021 I inherited a client — an e-commerce retailer in Birmingham with around 52,000 indexed URLs — who couldn't figure out why roughly 18,000 of their product pages hadn't been crawled in over three months. Their dev team had been guessing. Adding XML sitemaps. Pinging Google Search Console. Nothing worked. Then I pulled their raw server logs and within about forty minutes the answer was completely obvious: Googlebot was burning through its daily crawl allowance on paginated filter URLs, session parameters, and a broken internal search facet that generated something like 4,000 unique but worthless URLs a week. Total waste. Complete nonsense.

That's what log file analysis is actually for — not vanity metrics, not boardroom slides, but finding out exactly what a crawler is doing on your site on any given Tuesday and cutting the fat ruthlessly.

Why Crawl Budget Actually Matters at Scale

Here's the thing most people get wrong. Crawl budget isn't a concern for a 200-page brochure site. Googlebot will sweep that in minutes. But once you're past, say, 20,000 URLs — and definitely when you're at 50,000 or beyond — Google's crawler makes explicit decisions about what to prioritise. Google's own documentation calls this "crawl budget" and breaks it down into two components: crawl rate limit (how fast Googlebot crawls without hammering your server) and crawl demand (how much Google actually wants to crawl based on popularity and freshness signals).

Both of those can be manipulated. But you can't manipulate what you can't measure. And you cannot measure it properly without the logs.

Analytics tools like Google Search Console give you a crawl stats report. It's fine as a starting point. But it's aggregated, delayed, and it doesn't tell you which specific URLs are eating the budget. Server logs do. They show you every single request Googlebot made, to which URL, at what time, and what HTTP status code it received back. That's the raw material.
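To make that concrete, here's a minimal sketch of pulling those fields out of a single log line. It assumes the standard Apache/Nginx combined log format; the sample line and regex are illustrative rather than production-grade parsing.

```python
import re

# A minimal sketch, assuming the Apache/Nginx "combined" log format.
# The sample line is invented for illustration.
LINE = (
    '66.249.66.1 - - [12/Mar/2024:06:25:14 +0000] '
    '"GET /category/shoes?colour=red HTTP/1.1" 200 5123 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

# One named group per field we care about: IP, timestamp, URL, status code, user agent.
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

match = COMBINED.match(LINE)
if match and "Googlebot" in match["agent"]:
    print(match["time"], match["status"], match["url"])
```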

Getting Hold of the Logs

Sounds obvious but this is where most people stall. Depending on your hosting setup, logs live in different places.

On a managed WordPress host like WP Engine or Kinsta, you can pull raw access logs from the dashboard or via SFTP — look in the /logs/ directory. On a VPS running Nginx, your access log is typically at /var/log/nginx/access.log. Apache puts it at /var/log/apache2/access.log. If you're on a CDN like Cloudflare, you'll need Cloudflare Logpush (enterprise tier) or you'll only see CDN-edge requests, not origin — important distinction.

That Birmingham client was on a Kinsta managed server. I pulled 30 days of logs, which came to about 4.2GB of compressed .gz files. That's a normal size for a busy 50K-page site.

Parsing Raw Logs Without Losing Your Mind

You have two real options here:

  1. Screaming Frog Log File Analyser — This is what I use 90% of the time. You import the log files directly, filter by Googlebot user agent, and it gives you a sortable breakdown of crawled URLs, crawl frequency, status codes, and response times. Honestly, for most agency work it's the right tool. Screaming Frog's log analyser handles files up to several GB without falling over, which matters.
  2. ELK Stack (Elasticsearch, Logstash, Kibana)— More setup, significantly more power. If you've got ongoing monitoring needs for a large client or an enterprise contract, this is worth the investment. Seahawk has a couple of clients where we pipe logs directly into a Kibana dashboard. Real-time, beautiful, and you can set alerts when Googlebot crawl frequency drops suddenly.

For a one-off audit, Screaming Frog Log File Analyser is fine. For anything ongoing, build the ELK stack or at least consider GoAccess — it's open source, runs in the terminal, and processes large log files faster than almost anything else I've tested.
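If you'd rather do a quick first pass in plain Python, or just want to shrink those multi-gigabyte .gz archives down to Googlebot traffic before loading them into any tool, a rough preprocessing sketch like the one below works. The logs/*.gz path and googlebot.log output name are placeholders for your own setup, and bear in mind that matching the user-agent string alone will also catch crawlers pretending to be Googlebot; a strict audit would verify the IPs with reverse DNS.

```python
import glob
import gzip

# Stream a month of gzipped access logs without unpacking them, keep only the
# lines whose user agent claims to be Googlebot, and write a much smaller
# working file for the rest of the analysis. Paths are placeholders.
with open("googlebot.log", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("logs/*.gz")):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as log:
            for line in log:
                if "Googlebot" in line:
                    out.write(line)
```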

What to Actually Look For

Once you've got the data loaded, most people stare at it and don't know what questions to ask. Here's what I actually look for in a log audit:

Crawl Frequency Distribution

Sort your URLs by crawl frequency — how many times Googlebot hit each URL in the 30-day window. You'll almost always find two distinct populations: a cluster of important URLs getting crawled frequently (good) and a long tail of junk URLs that are also getting crawled frequently (very bad). That junk tail is your problem.

On that Birmingham site, the top 500 crawled URLs included 340 filter/facet combinations. None of them were indexed. None of them had any search volume. Googlebot was visiting ?colour=red&size=M&sort=price_asc more often than it was visiting the actual category pages. Wild.
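If you want to put numbers on that junk tail yourself, here's a minimal sketch of the frequency count. It assumes the Googlebot-only googlebot.log file from the earlier preprocessing step and combined log format; treat it as illustrative rather than a finished audit script.

```python
import re
from collections import Counter

# Count how many times Googlebot requested each URL over the log window.
URL = re.compile(r'"(?:GET|POST|HEAD) (\S+) ')

hits = Counter()
with open("googlebot.log", encoding="utf-8") as log:
    for line in log:
        m = URL.search(line)
        if m:
            hits[m.group(1)] += 1

total = sum(hits.values()) or 1
print("Most crawled URLs in the window:")
for url, count in hits.most_common(20):
    print(f"{count:6d}  {url}")

# The junk tail: parameterised URLs soaking up requests.
junk_hits = sum(c for u, c in hits.items() if "?" in u)
print(f"\nParameterised URLs: {junk_hits} requests ({junk_hits / total:.0%} of all Googlebot activity)")
```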

Status Code Breakdown

Filter for everything that isn't a 200; there's a quick tally sketch after this list. Specifically:

  • 404s being crawled repeatedly — These are a crawl budget haemorrhage. Fix them with 301 redirects or patch the internal links that point to them.
  • 301 chains — A redirect that goes A → B → C is two wasted hops. Googlebot follows them but it costs budget and PageRank leaks at each jump.
  • 500 errors — If Googlebot is hitting pages that return 500s and then retrying them, you're wasting budget AND signalling server instability, which Google responds to by dialling the crawl rate down over time.
  • 304 Not Modified — Actually fine. Means Google is checking freshness and your caching headers are working correctly.
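Here's the promised tally: a rough sketch that buckets Googlebot's non-200 responses by status code and surfaces the error URLs being recrawled most often, again assuming combined log format and the filtered googlebot.log file.

```python
import re
from collections import Counter

# Tally non-200 responses served to Googlebot, and which URLs keep getting recrawled.
ENTRY = re.compile(r'"(?:GET|POST|HEAD) (\S+) [^"]*" (\d{3}) ')

by_status = Counter()
worst_offenders = Counter()  # (status, url) pairs Googlebot keeps hitting

with open("googlebot.log", encoding="utf-8") as log:
    for line in log:
        m = ENTRY.search(line)
        if not m:
            continue
        url, status = m.group(1), m.group(2)
        if status != "200":
            by_status[status] += 1
            worst_offenders[(status, url)] += 1

print("Non-200 responses by status:", dict(by_status))
print("\nMost-recrawled error URLs:")
for (status, url), count in worst_offenders.most_common(15):
    print(f"{count:5d}  {status}  {url}")
```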

Response Time Spikes

Google has said publicly that slow server response times cause Googlebot to crawl less aggressively. If your logs show average response times above 500ms for crawled URLs — particularly category or product pages — that's a signal to fix your server-side caching before anything else.
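Measuring this from logs only works if response time is actually logged. The default combined format doesn't include it, so this sketch assumes you've appended something like Nginx's $request_time (in seconds) as the final field; adjust the regex to your own log format.

```python
import re
from statistics import mean

# Assumes the request time in seconds is the last field on each line,
# e.g. an Nginx log_format ending in $request_time. Purely illustrative.
ENTRY = re.compile(r'"(?:GET|POST|HEAD) (\S+) [^"]*" \d{3} .* (\d+\.\d+)$')

timings = {}
with open("googlebot.log", encoding="utf-8") as log:
    for line in log:
        m = ENTRY.search(line.rstrip())
        if m:
            timings.setdefault(m.group(1), []).append(float(m.group(2)))

# Slowest URLs by average response time, which is what throttles Googlebot.
slowest = sorted(timings.items(), key=lambda kv: mean(kv[1]), reverse=True)
for url, times in slowest[:20]:
    print(f"{mean(times) * 1000:7.0f} ms  ({len(times)} crawls)  {url}")
```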

Identifying the Budget Killers

Let me give you a hit list of the things I see eating crawl budget on large sites, in rough order of how often I encounter them (there's a rough classification sketch after the list):

  1. Faceted navigation without noindex or disallow — Filters, colour pickers, size selectors, sort orders. These multiply your URL count geometrically. A product category with 10 filter options and 5 sort orders generates 50+ duplicate URL variants. Across a 50K-page site, that's potentially hundreds of thousands of URLs.
  2. Paginated archives crawled infinitely — /page/2, /page/3 ... /page/847. If the content on page 200 of your blog archive has zero organic search value, you need to either noindex it or disallow the pagination path in robots.txt.
  3. Session IDs in URLs — Old CMS platforms (and some legacy WooCommerce setups) append session tokens like ?sessionid=abc123def456 to URLs. Every session generates a unique URL. Googlebot crawls all of them. This is a catastrophic budget leak on older sites.
  4. Duplicate content via URL parameters — ?utm_source=email in internal links, tracking parameters leaking into crawlable URLs, ?ref=homepage appended by affiliate plugins. Fix in Google Search Console's URL parameter tool and canonicalise at the HTML level.
  5. Orphaned pages with no internal links but still in sitemap — Googlebot finds them via sitemap, crawls them, finds no internal signal, deprioritises them over time. But they still eat budget on discovery crawls.
  6. Soft 404 pages returning 200 status — Search pages with no results, empty category pages, user profile pages for deleted accounts. Google wastes time crawling these and sometimes indexes them.
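Here's that classification sketch: a rough pass over the Googlebot-only log that buckets crawled URLs into the leak categories above and sums the hits per bucket. The patterns are illustrative guesses, so swap in whatever your own faceting, session, and search parameters actually look like.

```python
import re
from collections import Counter

# Bucket crawled URLs into common crawl-budget leak categories and sum hits per bucket.
# These regexes are illustrative placeholders, not a universal taxonomy.
BUCKETS = {
    "faceted navigation": re.compile(r"[?&](colour|size|sort|filter)="),
    "session IDs":        re.compile(r"[?&](sessionid|sid|phpsessid)=", re.I),
    "tracking params":    re.compile(r"[?&](utm_[a-z]+|ref)="),
    "deep pagination":    re.compile(r"/page/\d{2,}"),
    "internal search":    re.compile(r"^/search|[?&][sq]="),
}
URL = re.compile(r'"(?:GET|POST|HEAD) (\S+) ')

waste = Counter()
total = 0
with open("googlebot.log", encoding="utf-8") as log:
    for line in log:
        m = URL.search(line)
        if not m:
            continue
        total += 1
        for bucket, pattern in BUCKETS.items():
            if pattern.search(m.group(1)):
                waste[bucket] += 1
                break

for bucket, count in waste.most_common():
    print(f"{bucket:20s} {count:7d} crawls ({count / max(total, 1):.0%} of Googlebot activity)")
```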

Fixing What You Find

Honestly, the analysis is the easier part. Implementation is where projects get political.

Here's my actual workflow when I've finished a log audit and need to present recommendations:

  • Robots.txt disallow for URL patterns that should never be crawled — session parameters, filter combinations, internal search result URLs. I use Disallow: /*?sessionid= style wildcard rules. Test every rule in Google Search Console's robots.txt tester before deploying; there's also a quick local sanity-check sketch after this list.
  • Noindex on paginated pages beyond page 2 or 3, depending on content freshness, but keep the links on those pages followable. Don't nofollow or disallow pagination entirely or you break Googlebot's ability to discover linked content.
  • Canonical tags on all parameterised URL variants pointing to the clean canonical URL. This is belt-and-braces alongside robots.txt (bearing in mind Google only sees the canonical on URLs it's still allowed to crawl).
  • Fix 404s at the source — Either update the internal links or implement 301 redirects. I use Screaming Frog's main crawler alongside the log data to find which pages are linking to dead URLs.
  • XML sitemap hygiene — Remove any URL from your sitemap that returns a non-200, is noindexed, or is a redirect. Your sitemap should be a curated list of pages you want indexed, nothing else.
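For the wildcard Disallow rules, here's the local sanity check mentioned above. Python's built-in robots.txt parser doesn't understand Google-style wildcards, so this is a small, simplified approximation of that matching (* matches any run of characters, $ anchors the end). It's enough to catch a rule that accidentally blocks your clean category URLs, not a substitute for testing in Search Console; the rules and sample URLs are placeholders.

```python
import re

# Convert a Google-style Disallow path into a regex: "*" matches any run of
# characters, "$" anchors the end, and the rule matches from the start of the path.
# A deliberate simplification of robots.txt matching, for sanity checks only.
def rule_to_regex(disallow_path: str) -> re.Pattern:
    anchored = disallow_path.endswith("$")
    body = disallow_path.rstrip("$")
    pattern = "".join(".*" if ch == "*" else re.escape(ch) for ch in body)
    return re.compile("^" + pattern + ("$" if anchored else ""))

RULES = ["/*?sessionid=", "/*?colour=", "/search/"]      # candidate Disallow rules
SAMPLES = [
    ("/product/red-shoe?sessionid=abc123", True),        # should be blocked
    ("/category/shoes?colour=red&size=M", True),          # should be blocked
    ("/category/shoes", False),                            # must stay crawlable
    ("/search/?q=trainers", True),                         # internal search blocked
]

compiled = [rule_to_regex(r) for r in RULES]
for url, should_block in SAMPLES:
    blocked = any(rx.match(url) for rx in compiled)
    status = "OK" if blocked == should_block else "CHECK RULE"
    print(f"{status:10s} blocked={blocked!s:5s} {url}")
```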

Seahawk had a fintech client last year — around 65,000 pages, mostly dynamic content — where just fixing the robots.txt to block internal search URL patterns cut Googlebot's crawl of junk URLs by 61% within six weeks. The freed-up budget shifted to product and category pages, and the average time for new content to get indexed dropped from 23 days to 6 days. That's the real-world impact.

Setting Up Ongoing Monitoring

One log audit is a snapshot. Good crawl budget management is ongoing. What does that actually look like in practice?

At minimum, I'd recommend pulling and parsing logs monthly for any site above 30,000 pages. Look at the crawl frequency trend for your top 100 revenue-driving URLs. If Googlebot's visit frequency to those pages is declining, something has changed — new crawl budget leaks, server performance issues, or a drop in PageRank signal.

If you want to get more sophisticated, set up GoAccess as a cron job to process daily log snapshots and email a summary report. Takes about two hours to configure and saves you from missing slow-burn crawl budget erosion between quarterly audits.
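If you'd rather keep the monitoring in plain Python, a month-over-month comparison like the sketch below does the job: it flags top URLs whose Googlebot visit count has fallen sharply since the previous run. The file names and the 50% threshold are placeholders to tune for your own site.

```python
import json
import re
from collections import Counter

# Compare this month's per-URL Googlebot hit counts against last month's snapshot
# and flag pages whose crawl frequency has dropped sharply. File names are placeholders.
URL = re.compile(r'"(?:GET|POST|HEAD) (\S+) ')

current = Counter()
with open("googlebot.log", encoding="utf-8") as log:
    for line in log:
        m = URL.search(line)
        if m:
            current[m.group(1)] += 1

with open("crawl_counts_last_month.json", encoding="utf-8") as f:
    previous = json.load(f)   # {url: hit_count} saved by the previous run

# Check the 100 most-crawled URLs from last month (swap in your own revenue-driving URL list).
for url, old_hits in sorted(previous.items(), key=lambda kv: kv[1], reverse=True)[:100]:
    new_hits = current.get(url, 0)
    if old_hits >= 10 and new_hits < old_hits * 0.5:
        print(f"ALERT  {url}: {old_hits} -> {new_hits} Googlebot visits")

# Save this month's counts for the next comparison.
with open("crawl_counts_last_month.json", "w", encoding="utf-8") as f:
    json.dump(current, f)
```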

FAQ

Does crawl budget matter if I'm already fully indexed?

Sort of. Full indexation today doesn't mean it stays that way. If you're publishing new content regularly — new products, new blog posts, new landing pages — crawl budget determines how quickly that fresh content gets found. A site with a leaky crawl budget can have new pages sitting uncrawled for weeks. That's a real competitive disadvantage if you're in a fast-moving niche.

Should I block Googlebot entirely from certain subfolders using robots.txt?

Yes, in specific cases. Admin areas, staging paths, internal search results, and parameter-heavy filter URLs are all reasonable candidates for Disallow rules. The one thing I'd caution against is blocking JavaScript or CSS files — Googlebot needs those to render your pages properly. A lot of older SEO advice says to block JS; ignore that advice.

How much log data should I analyse?

30 days is the sweet spot for most sites. Less than that and you won't see low-frequency crawl patterns. More than that and the file sizes get unwieldy unless you're running a proper ELK stack. For seasonal e-commerce sites, I'll sometimes look at 60 days spanning a peak period to understand crawl behaviour under traffic load.

What if my host doesn't provide raw access logs?

Push back on your hosting provider — most managed hosts have this available even if it's not surfaced prominently in the dashboard. If you truly can't get raw logs, Cloudflare's bot analytics can give you a partial picture for sites behind the Cloudflare proxy, though it's a pale substitute for real log data. Consider switching hosts if this is a recurring blocker on a large client account.

Is Google Search Console's crawl stats enough?

For a small site, arguably yes. For anything above 20K pages, no. GSC crawl stats are aggregated by day and don't surface URL-level data. You can see that Googlebot crawled 12,000 pages on a Tuesday but not which 12,000 pages. Log files give you that resolution. Both tools together — that's the complete picture.

---

Look, most SEOs skip log file analysis because it feels like DevOps territory. It's not glamorous. You're grepping through gigabytes of timestamps and user-agent strings. But on large sites, it's the difference between guessing where your crawl budget is going and actually knowing. And knowing, in my experience, is always worth the two hours it takes to pull the data.
