
Indexation Issues on Large WordPress Sites: A Diagnosis Guide

A client rang me on a Tuesday morning last spring — proper panic in his voice. He ran a property listings site, about 42,000 pages, and Google Search Console had just told him that only 5,800 of them were indexed. He'd lost roughly 86% of his indexable pages, seemingly overnight. No algorithm update. No manual action. No recent deployments he could remember. Just… gone.

I've seen this exact scenario more times than I can count, across Seahawk's 12,000+ WordPress builds. And the maddening thing is that indexation loss on large sites rarely has one cause. It's usually three or four small failures that compound quietly until something collapses.

Here's how I actually diagnose it.

---

Start With Google Search Console — But Don't Stop There

The first thing I always do is pull up the Pages report in Google Search Console. Not the old Coverage report — Google has since replaced it, and the Pages view breaks down indexed vs. non-indexed with proper reason codes. Take a screenshot on day one. You need a baseline.

The reason codes matter enormously. "Crawled — currently not indexed" is a completely different problem from "Excluded by 'noindex' tag". One is a quality signal issue; the other is a configuration disaster. I've seen developers treat both identically and waste weeks chasing the wrong thing.

The Reasons I See Most Often on Large Sites

  • Crawled — currently not indexed: Google visited the page but decided it wasn't worth indexing. Usually thin content, near-duplicates, or pages that don't earn backlinks or internal links.
  • Discovered — currently not indexed: Google found the URL (likely in your sitemap) but hasn't bothered to crawl it yet. This is a crawl budget problem, not a content problem.
  • Excluded by 'noindex' tag: Someone — possibly you, possibly a plugin — added a noindex directive. More on this below.
  • Duplicate, Google chose different canonical: Your canonical tags are pointing somewhere unexpected, or Google is overriding them.
  • Page with redirect: A page that should be indexable is redirecting somewhere, either correctly or incorrectly.

Don't just look at totals. Download the full list for each reason code as a CSV. On a 40,000-page site, you need to be able to sort and filter.
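If you want to triage those exports quickly, here's a minimal sketch in Python with pandas. The gsc-exports/ folder with one CSV per reason code, and the URL column name, are assumptions — check your own export layout before running:

    # Triage GSC "Pages" report exports: which URL sections fall under
    # which reason code. Folder layout and column name are assumptions.
    from pathlib import Path
    from urllib.parse import urlparse

    import pandas as pd

    frames = []
    for csv_path in Path("gsc-exports").glob("*.csv"):
        df = pd.read_csv(csv_path)
        df["reason"] = csv_path.stem  # e.g. "crawled-not-indexed"
        frames.append(df)

    urls = pd.concat(frames, ignore_index=True)
    # The first path segment is usually enough to spot a pattern
    # (/product/, /tag/, /page/, ...).
    urls["section"] = urls["URL"].map(
        lambda u: "/" + urlparse(u).path.strip("/").split("/")[0]
    )
    summary = urls.groupby(["reason", "section"]).size()
    print(summary.sort_values(ascending=False).head(20))

Ten minutes of this usually tells you whether the problem is one template, one section, or the whole site.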

---

Crawl Budget Is Real and It Will Kill Large Sites

Back in 2019, Seahawk was working on a large e-commerce client — about 28,000 product pages — and we could not figure out why Google was only crawling around 3,000 pages per day. The site was fast. The sitemap was clean. Everything looked fine on the surface.

Turned out the site was generating thousands of faceted navigation URLs — ?colour=red&size=large&sort=price — that were crawlable, not canonicalised properly, and eating through Googlebot's crawl allowance before it ever reached the real product pages.

Crawl budget is essentially the number of URLs Googlebot is willing to crawl on your site within a given timeframe. Google's own documentation on crawl budget is genuinely worth reading — they're honest about how it works. The short version: if you're wasting it on garbage URLs, the important pages don't get crawled.

How to Actually Audit Crawl Budget

  1. Pull your server logs. Not Google's crawl stats — actual server logs. Tools like Screaming Frog Log File Analyser let you filter purely for Googlebot hits.
  2. Look at what percentage of Googlebot's visits are landing on URLs you actually care about. If it's below 60%, you have a budget problem. (A quick way to compute this is sketched after this list.)
  3. Find the URL patterns eating the most crawls. Sort by frequency. The top offenders are almost always: faceted nav, deep pagination on archives, session ID parameters, and empty category/tag archive pages.
  4. Fix the source, not just the symptom. Disallow in robots.txt for parameters that should never be crawled. Canonical tags for everything else.
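
For step 2, a short script is often faster than a GUI tool. A rough sketch, assuming a standard combined-format access log at a hypothetical access.log path — matching on the user agent string is fine for triage, but verify Googlebot via reverse DNS before acting on the numbers, since the UA is trivially spoofed:

    # Crawl-budget triage from a combined-format access log.
    import re
    from collections import Counter

    REQUEST = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP')
    IMPORTANT = re.compile(r"^/(property|blog)/")  # your money-page patterns

    buckets: Counter = Counter()
    total = important = 0
    with open("access.log", encoding="utf-8", errors="replace") as log:
        for line in log:
            if "Googlebot" not in line:
                continue
            match = REQUEST.search(line)
            if not match:
                continue
            path = match.group("path")
            total += 1
            important += bool(IMPORTANT.match(path))
            # Bucket parameterised URLs together, everything else by
            # first path segment.
            if "?" in path:
                buckets["<parameterised>"] += 1
            else:
                buckets["/" + path.lstrip("/").split("/")[0]] += 1

    print(f"{total} Googlebot hits, {important / max(total, 1):.0%} on important URLs")
    for bucket, count in buckets.most_common(10):
        print(f"{count:8d}  {bucket}")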

On that e-commerce project, we blocked the faceted URLs via robots.txt and added rel="canonical" to all filtered views. Within six weeks, indexed pages went from 8,000 to 24,000. Same content. Just Googlebot finally reaching it.
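
For reference, the robots.txt side of that fix looked roughly like this — parameter names are illustrative, and Google supports * wildcards in Disallow rules:

    User-agent: *
    # Block crawling of filtered/faceted views (illustrative parameters)
    Disallow: /*?*colour=
    Disallow: /*?*size=
    Disallow: /*?*sort=

Worth remembering: robots.txt controls crawling, not indexing. The Disallow rules stopped the budget bleed; the canonical tags consolidated the filtered variants Google had already fetched.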

---

The noindex Disaster (It Happens More Than You Think)

I need to talk about this because I've caused it myself. Not my finest moment. During a staging-to-live migration for a news site back in 2021, we failed to uncheck "Discourage search engines from indexing this site" in WordPress Settings → Reading. The site went live with a site-wide noindex. It took eleven days before the client noticed organic traffic had fallen off a cliff.

WordPress buries that checkbox in a place nobody expects. And certain SEO plugins — Yoast, Rank Math, even AIOSEO — have their own noindex toggles at the post type level, the taxonomy level, and the individual page level. Any one of them can silently noindex huge swathes of your site.

How to Check for noindex at Scale

Run Screaming Frog on the full site and filter for pages returning a noindex directive. Export the list. Then cross-reference against your important URL groups — product pages, service pages, blog posts, whatever matters to the business.
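
If you don't have a Screaming Frog licence handy, the same check is a few lines of Python — a sketch assuming a hypothetical urls.txt list, checking both the meta tag and the X-Robots-Tag header, since either one can carry the directive:

    # Flag noindex across a URL list — both the meta tag and the
    # X-Robots-Tag header can carry it.
    import requests
    from bs4 import BeautifulSoup

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    for url in urls:
        resp = requests.get(url, timeout=10)
        found = []
        if "noindex" in resp.headers.get("X-Robots-Tag", "").lower():
            found.append("X-Robots-Tag header")
        soup = BeautifulSoup(resp.text, "html.parser")
        # name="googlebot" works as a Google-specific robots meta too
        for meta in soup.find_all("meta", attrs={"name": ["robots", "googlebot"]}):
            if "noindex" in (meta.get("content") or "").lower():
                found.append(f'meta name="{meta.get("name")}"')
        if found:
            print(f"NOINDEX ({', '.join(found)}): {url}")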

Also check your robots.txt at yourdomain.com/robots.txt. Look for overly broad Disallow: rules. I've seen rules like Disallow: /wp-content/ blocking CSS and JS that Google needs to render pages properly — which can cause rendering failures that look like indexation problems but are actually Googlebot seeing a broken page.
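
You can test your live robots.txt against specific URLs with Python's standard library — the domain and paths below are illustrative, so substitute real asset URLs from your own pages:

    # Test live robots.txt rules against specific URLs. Note the stdlib
    # parser does simple prefix matching — it won't model Google's
    # * wildcards — so use it for plain path rules like /wp-content/.
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()

    for url in (
        "https://example.com/wp-content/themes/yourtheme/style.css",
        "https://example.com/wp-includes/js/jquery/jquery.min.js",
        "https://example.com/sample-page/",
    ):
        print("OK " if rp.can_fetch("Googlebot", url) else "BLOCKED", url)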

---

Canonical Tags That Are Quietly Misfiring

Canonicals are the sneakiest indexation killer on large WordPress sites. Because they look correct in isolation and only reveal their damage at scale.

Here's a pattern I see constantly: a site with WooCommerce has products accessible via multiple URL paths — /product/red-shoes/, /product-category/footwear/red-shoes/, and sometimes /shop/red-shoes/. Each one has a canonical tag, but if those canonicals point to slightly different URLs (HTTP vs HTTPS, trailing slash vs no trailing slash, www vs non-www), Google treats them as signals pointing to different pages and refuses to consolidate.

The fix is boring but necessary:

  1. Audit every URL structure your WordPress install generates. Use Screaming Frog's site crawl → filter by "Canonical" → export.
  2. Check for mismatched protocols, trailing slashes, and subdomain variations.
  3. Make sure your canonical always matches your preferred URL exactly, character for character.

Rank Math and Yoast both generate canonical tags automatically, but neither plugin knows about your .htaccess redirects or your CDN's URL normalisation. You have to verify the rendered canonical, not just what the plugin thinks it's outputting. Fetch the page with a tool like httpstatus.io and inspect the actual response headers and HTML.
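
If you'd rather script that verification, here's a rough sketch — it assumes a hypothetical urls.csv with url and expected_canonical columns, and checks both the HTML tag and the Link response header, since canonicals can be set in either place:

    # Verify rendered canonicals character for character.
    import csv

    import requests
    from bs4 import BeautifulSoup

    with open("urls.csv", newline="") as f:
        for row in csv.DictReader(f):
            resp = requests.get(row["url"], timeout=10)
            tag = BeautifulSoup(resp.text, "html.parser").find("link", rel="canonical")
            rendered = tag.get("href") if tag else None
            # Exact comparison on purpose: protocol, www, and trailing
            # slash mismatches are precisely what we're hunting.
            if rendered != row["expected_canonical"]:
                print(f"MISMATCH {row['url']} -> {rendered!r}")
            if 'rel="canonical"' in resp.headers.get("Link", ""):
                print(f"NOTE {row['url']} also sets a canonical Link header")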

---

XML Sitemaps Are Often Wrong on Large Sites

Most WordPress SEO plugins generate sitemaps automatically. Most of them also include URLs you don't want in your sitemap — paginated pages (/page/2/, /page/3/), author archives, tag pages with two posts on them, attachment pages.

A sitemap should be a shortlist of your best, most canonical pages. Not a dump of every URL WordPress has ever generated.

Sitemap Hygiene Rules I Actually Follow

  • Exclude paginated archive pages. Always.
  • Exclude author archive pages unless it's a multi-author site where author pages have genuine content value.
  • Exclude tag archives unless tags are editorially managed and have meaningful content.
  • Set a post count threshold — I usually exclude any archive page with fewer than five posts.
  • Split large sitemaps into sitemap indexes. Keep individual sitemap files under 50MB uncompressed and under 50,000 URLs. Google has documented limits here.

On the property listings site from the beginning of this post, the sitemap had 41,000 URLs including every tag archive, every pagination page, and — this still hurts me to say — the WordPress login page. Clean it up first. Always.
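
A quick way to catch that kind of junk before submitting anything — a sketch that handles a single sitemap file (a sitemap index needs one extra loop over its <sitemap> entries), with junk patterns you'd adjust per site:

    # Flag junk URL patterns in an XML sitemap.
    import re
    import xml.etree.ElementTree as ET

    import requests

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    JUNK = re.compile(r"/page/\d+/|/tag/|/author/|wp-login|attachment")

    resp = requests.get("https://example.com/sitemap.xml", timeout=10)
    root = ET.fromstring(resp.content)
    urls = [loc.text for loc in root.iterfind(".//sm:loc", NS)]

    junk = [u for u in urls if u and JUNK.search(u)]
    print(f"{len(urls)} URLs in sitemap, {len(junk)} look like junk:")
    for u in junk[:20]:
        print(" ", u)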

---

Internal Linking Is an Indexation Issue

People don't think of internal linking as an indexation tool. They should.

If a page has no internal links pointing to it, Googlebot may never find it in the first place — even if it's in your sitemap. Sitemaps tell Google a URL exists. Internal links tell Google a URL matters. Those are different signals.

On large content sites, orphaned pages are rampant. A blog post published three years ago, linked from the post archives but never linked from any other post, will see its crawl frequency drop to roughly nothing over time.

I use Screaming Frog's "Orphan Pages" report (under Site Structure) to identify pages in the sitemap that have zero internal links pointing to them. Then I work back through the content to find logical places to add links. Not forced links — actually relevant ones. Takes time but the indexation impact is real.
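
The same orphan check works without Screaming Frog if you have any crawler's internal-link export. A sketch assuming a hypothetical internal_links.csv with a destination column, plus a sitemap_urls.txt list:

    # Orphan check: sitemap URLs with zero internal links pointing
    # at them. File layouts are assumptions — adjust to your export.
    import csv

    with open("sitemap_urls.txt") as f:
        sitemap_urls = {line.strip().rstrip("/") for line in f if line.strip()}

    linked = set()
    with open("internal_links.csv", newline="") as f:
        for row in csv.DictReader(f):
            linked.add(row["destination"].strip().rstrip("/"))

    orphans = sorted(sitemap_urls - linked)
    print(f"{len(orphans)} sitemap URLs have no internal links:")
    for url in orphans[:25]:
        print(" ", url)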

---

A Systematic Diagnosis Checklist

If I were handing this to a junior developer at Seahawk, here's the order I'd have them work through it:

  1. Pull Google Search Console → Pages report → download all non-indexed URLs with reason codes.
  2. Check robots.txt for accidental broad disallows.
  3. Verify the WordPress "Discourage search engines" checkbox is off.
  4. Run Screaming Frog and filter for noindex directives at the page level.
  5. Check canonical tags — rendered output, not plugin settings.
  6. Pull server logs and check Googlebot's crawl distribution across URL types.
  7. Audit the XML sitemap for junk URLs (pagination, empty archives, non-canonical variants).
  8. Run the Orphan Pages report and identify internally unlinked pages.
  9. Check for faceted navigation or parameter-based URLs generating duplicate crawlable paths.
  10. Verify page speed — pages that time out consistently get deprioritised by Googlebot.

Don't try to fix everything at once. Fix one category of issue, wait three to four weeks for Google to recrawl, measure, then move to the next. If you change everything simultaneously you'll never know what actually worked.

---

FAQ

Why are pages indexed one week and then dropped the next?

Google's index isn't static. It constantly re-evaluates pages based on quality signals, freshness, and crawl efficiency. A page that was indexed six months ago can be dropped if it hasn't earned any links, isn't being internally linked to, or if Google's quality assessment of your domain has shifted. This is especially common after a site migration or a significant content overhaul — Google re-crawls, re-evaluates, and sometimes decides previously-indexed pages don't meet the bar anymore.

Does site speed affect indexation?

Yes, more directly than most people realise. If pages are slow to respond — consistently over 2-3 seconds for the initial server response — Googlebot will deprioritise crawling them. At scale, this means slow pages simply don't get crawled frequently enough to stay indexed. Fix your Time to First Byte (TTFB) before worrying about anything else speed-related. A cheap caching plugin like WP Rocket makes a measurable difference. Core Web Vitals matter for rankings, but TTFB matters for crawling.
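
Measuring TTFB doesn't need heavy tooling. In Python's requests library, the elapsed attribute covers the span from sending the request to finishing parsing the response headers — with stream=True the body hasn't been downloaded yet, so it's a fair TTFB approximation for triage (URL is illustrative):

    # Rough TTFB check via requests' elapsed attribute.
    import statistics

    import requests

    url = "https://example.com/sample-page/"
    timings = []
    for _ in range(5):
        resp = requests.get(url, stream=True, timeout=10)
        timings.append(resp.elapsed.total_seconds())
        resp.close()

    print(f"median TTFB: {statistics.median(timings):.2f}s across {len(timings)} requests")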

Can too many pages in a sitemap hurt indexation?

Not directly — but a bloated sitemap with low-quality URLs dilutes the signal you're sending Google about what matters. If your sitemap contains 40,000 URLs and 30,000 of them are thin archive pages, Google learns to treat your sitemap as noise. Keep sitemaps tight and high-quality. Think of it as editorial curation, not a URL inventory.

Should I use Google's URL Inspection tool to manually request indexing?

For individual important pages — yes, absolutely. But don't try to manually request indexing for thousands of URLs. It doesn't scale and Google has said it doesn't give manually-requested URLs special treatment in the long run. Fix the underlying crawl and quality issues and let Google's natural crawling do the work. Use manual inspection to verify that specific pages can be indexed, not to force index everything.
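
That verification can be scripted for a handful of priority pages via the URL Inspection API. A sketch assuming a service account with access to the GSC property — the method and field names below are from the searchconsole v1 reference, but double-check them against the current docs before relying on this:

    # Spot-check indexability for priority URLs via the URL Inspection API.
    from google.oauth2 import service_account
    from googleapiclient.discovery import build

    creds = service_account.Credentials.from_service_account_file(
        "service-account.json",  # hypothetical credentials path
        scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
    )
    service = build("searchconsole", "v1", credentials=creds)

    result = service.urlInspection().index().inspect(body={
        "inspectionUrl": "https://example.com/important-page/",  # illustrative
        "siteUrl": "https://example.com/",  # must match the GSC property exactly
    }).execute()

    status = result["inspectionResult"]["indexStatusResult"]
    print(status.get("coverageState"), "|", status.get("robotsTxtState"))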

---

The honest truth is that indexation diagnosis isn't glamorous work. It's spreadsheets, log files, and a lot of waiting. But on a large site, even recovering 20% of your lost indexed pages can mean a meaningful jump in organic traffic — and on a 40,000-page property listings site, that's real money. Get the basics right before chasing anything exotic. It's almost never exotic.
