
How to Build a Redirect Map for a 20,000-URL Site Migration

Back in 2021, a large UK retailer handed Seahawk a migration brief for a site with just over 22,000 indexed URLs. The dev team had already been working on the new platform for four months. They had a launch date. They had a staging site. What they didn't have, and genuinely hadn't thought about, was a redirect map. Not a rough one. Not any one. The SEO lead's plan was to "handle it post-launch." I still think about that meeting sometimes.

We pushed the launch by three weeks. Rebuilt the redirect strategy from scratch. The site launched clean, held 94% of its organic traffic through the transition, and the client sent us a bottle of Scotch. The three weeks of delay saved them from what would almost certainly have been a six-month recovery crawl.

So. Here's how you actually build a redirect map for a site at this scale: the process, the tooling, the prioritisation logic, and the parts most migration guides quietly gloss over.

---

Start With a Complete URL Inventory

You can't map what you haven't counted. Before anything else, you need a full export of every live, indexed URL on the origin site. Not just the sitemap. Sitemaps lie: they're often out of date, they exclude paginated URLs, and they routinely omit product or archive pages that have accumulated links over years.

I run Screaming Frog SEO Spider in list mode against a combined source: the XML sitemap plus a Google Search Console export of all indexed URLs. Each source almost always surfaces URLs the other misses. For a 20,000-URL site, expect the real crawl count to come back anywhere between 18,000 and 35,000 once pagination, filters, and faceted nav are all counted.

Export the crawl to a spreadsheet. You want at minimum: URL, HTTP status, title tag, H1, inbound internal link count, and whether it appears in GSC with impressions. That last column matters more than people admit.
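If you'd rather combine the two sources in code than in a spreadsheet, here's a minimal sketch. It assumes you already have both exports as plain lists of URLs; the `normalise` and `merge_inventories` names and the example.com URLs are mine, purely for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise(url: str) -> str:
    """Lower-case scheme and host and drop the fragment.

    Path and query string are left untouched, because trailing slashes
    and parameters can be distinct URLs as far as redirects go.
    """
    parts = urlsplit(url.strip())
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def merge_inventories(sitemap_urls, gsc_urls):
    """Return (combined, only_in_sitemap, only_in_gsc), each sorted."""
    sitemap = {normalise(u) for u in sitemap_urls if u.strip()}
    gsc = {normalise(u) for u in gsc_urls if u.strip()}
    return sorted(sitemap | gsc), sorted(sitemap - gsc), sorted(gsc - sitemap)


# Hypothetical sample data standing in for the two exports.
sitemap_urls = ["https://example.com/shoes/", "https://Example.com/about/"]
gsc_urls = ["https://example.com/shoes/", "https://example.com/shoes/page/2/"]

combined, only_sitemap, only_gsc = merge_inventories(sitemap_urls, gsc_urls)
print(f"{len(combined)} unique URLs to feed into list mode")
```

The two "only in" lists are worth a quick look on their own: they show you exactly what each source was hiding.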

Don't forget the 404s that still get traffic

While you're in GSC, pull the Page indexing report (formerly called Coverage) and grab every URL Google has tried to crawl in the last six months, including existing 404s. Some of those broken pages still have external backlinks pointing at them. I've seen a 404 with 40 referring domains on a site that hadn't been maintained in two years. Those need a destination too.

---

Categorise Before You Map

A flat list of 20,000 URLs is unusable. The first thing I do after the crawl export is categorise every URL by type, because the mapping logic is completely different depending on what a URL is.

Here's the rough taxonomy I use:

  • Product pages: 1:1 map to new product URL where possible
  • Category / collection pages: map to equivalent new category, or nearest parent
  • Blog posts / articles: match by slug, title similarity, or topic cluster
  • Tag and archive pages: usually consolidate to category or homepage
  • Paginated URLs (e.g. /category/shoes/page/3): almost always → parent category
  • User-generated or account URLs: usually drop or redirect to login
  • Old campaign landing pages: evaluate link equity before deciding
  • Duplicate/canonical variants: redirect to the canonical, full stop

Doing this categorisation step in Google Sheets with a dropdown column takes a couple of hours. It saves days. Once everything is typed, you can process each category with a different rule set rather than making 20,000 individual decisions.
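For a first pass before the dropdown review, URL patterns can do most of the typing for you. This is a rough sketch only: the path prefixes in `RULES` are assumptions about a typical site structure, not something you can copy across, and anything that falls through to "unclassified" still needs a human.

```python
import re

# Ordered rules: the first match wins, so specific patterns come first.
# Every prefix here is a hypothetical example; adjust to the real site.
RULES = [
    ("paginated", re.compile(r"/page/\d+/?$")),
    ("account", re.compile(r"^/(account|login|cart)(/|$)")),
    ("tag_archive", re.compile(r"^/(tag|author)/")),
    ("product", re.compile(r"^/products/[^/]+/?$")),
    ("category", re.compile(r"^/category/")),
    ("blog", re.compile(r"^/blog/")),
]


def categorise(path: str) -> str:
    """Return the first matching category label for a URL path."""
    for label, pattern in RULES:
        if pattern.search(path):
            return label
    return "unclassified"


for path in ["/category/shoes/page/3", "/products/red-shoes/", "/old-promo-2019/"]:
    print(path, "->", categorise(path))
```

Rule order matters here: the paginated check sits above the category check so that /category/shoes/page/3 gets the pagination treatment rather than being mapped as a category.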

---

The Matching Phase: Automated First, Manual Second

Here's where most teams get it wrong. They try to manually match every URL. At 20,000 rows that's not thorough; it's a nervous breakdown waiting to happen.

My process is automated matching first, manual review second, only for the URLs that actually matter.

Automated matching with VLOOKUP and Python

For sites where the URL structure is similar between old and new (e.g. /products/red-shoes/ becoming /shop/red-shoes/), a simple VLOOKUP in Sheets on the slug portion sorts out 60–70% of the list in under ten minutes. Regex-based find/replace handles structural pattern changes.
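The same slug lookup can be sketched in Python if the sheet starts to creak. This is an illustration under simple assumptions: both inputs are lists of URL paths, the slug is the last path segment, and the function names are hypothetical.

```python
def slug(url_path: str) -> str:
    """Last non-empty path segment, e.g. '/products/red-shoes/' -> 'red-shoes'."""
    return url_path.rstrip("/").rsplit("/", 1)[-1]


def match_by_slug(old_paths, new_paths):
    """Return (matched, unmatched).

    matched maps old path -> new path when exactly one new URL shares the slug.
    Old paths with no candidate, or more than one, go to unmatched.
    """
    new_by_slug = {}
    for path in new_paths:
        new_by_slug.setdefault(slug(path), []).append(path)

    matched, unmatched = {}, []
    for old in old_paths:
        candidates = new_by_slug.get(slug(old), [])
        if len(candidates) == 1:
            matched[old] = candidates[0]
        else:
            unmatched.append(old)  # missing or ambiguous: needs another method
    return matched, unmatched


old_paths = ["/products/red-shoes/", "/products/blue-boots/", "/products/retired/"]
new_paths = ["/shop/red-shoes/", "/shop/blue-boots/", "/sale/blue-boots/"]
matched, unmatched = match_by_slug(old_paths, new_paths)
print(matched)
print(unmatched)
```

Unlike a plain VLOOKUP, which returns the first hit, this refuses to guess when a slug appears twice on the new site. That's a deliberate choice on my part: an ambiguous match is a manual decision, not an automated one.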

For messier migrations (platform changes, complete IA redesigns), I use a short Python script that does fuzzy string matching on page titles between the old crawl export and the new site's crawl. The thefuzz library (formerly FuzzyWuzzy) does this well. Anything above an 85% match score gets auto-assigned. Anything below goes into a manual review queue.
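To show the shape of that script without requiring an install, the sketch below swaps in the standard library's `difflib.SequenceMatcher` for thefuzz. The scores will not be identical to what `thefuzz.fuzz.ratio` gives you, so treat the 85 cut-off as something to re-tune if you change scorer. Function names and sample titles are hypothetical.

```python
from difflib import SequenceMatcher

THRESHOLD = 85  # auto-assign cut-off on a 0-100 scale


def score(a: str, b: str) -> float:
    """Case-insensitive similarity between two titles, 0-100."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_match(old_pages, new_pages, threshold=THRESHOLD):
    """old_pages / new_pages are {url: title} dicts.

    Returns (auto, review): auto maps old URL -> best new URL at or above
    the threshold; review lists (old_url, best_guess, score) for humans.
    """
    auto, review = {}, []
    for old_url, old_title in old_pages.items():
        best_url, best_score = None, 0.0
        for new_url, new_title in new_pages.items():
            s = score(old_title, new_title)
            if s > best_score:
                best_url, best_score = new_url, s
        if best_url is not None and best_score >= threshold:
            auto[old_url] = best_url
        else:
            review.append((old_url, best_url, round(best_score)))
    return auto, review


old_pages = {"/a": "Red Leather Shoes", "/b": "Returns Policy"}
new_pages = {"/x": "Red Leather Shoes", "/y": "Careers"}
auto, review = fuzzy_match(old_pages, new_pages)
print(auto)
print(review)
```

This compares every old title against every new title, which gets slow at 20,000 rows on each side. In practice I'd expect you to run it per category rather than across the whole list.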

The manual queue is usually 20–30% of the list. Not all of it needs senior attention.

Prioritising the manual queue

Not all 20,000 URLs deserve equal time. I score each URL by:

  1. GSC impressions in the last 90 days: if it's driving search traffic, it's high priority
  2. Number of referring domains (pulled from Ahrefs): link equity you can't afford to drop
  3. Internal link count from the crawl: signals structural importance
  4. Revenue attribution: if the client can provide GA4 ecommerce data, pages driving conversions jump to the top

Anything with impressions, backlinks, or revenue gets a human mapping decision. Everything else can follow a rule-based fallback (usually → parent category or homepage). Honestly, for a 20,000-URL site, maybe 800–1,200 URLs genuinely need individual attention. The rest are long-tail cruft.
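That triage rule is simple enough to express in a few lines. A minimal sketch follows; the field names are my own, and the internal-link threshold for the Medium tier is an illustrative assumption rather than a number from any real project.

```python
from dataclasses import dataclass


@dataclass
class UrlSignals:
    url: str
    impressions_90d: int = 0    # from the GSC export
    referring_domains: int = 0  # from the backlink tool export
    internal_links: int = 0     # from the crawl
    revenue: float = 0.0        # from analytics, if the client shares it


def priority_tier(s: UrlSignals, internal_link_threshold: int = 10) -> str:
    """Assign High / Medium / Low using the signals described above."""
    # Any search, backlink, or revenue signal earns a human mapping decision.
    if s.impressions_90d > 0 or s.referring_domains > 0 or s.revenue > 0:
        return "High"
    # No external signal, but heavily linked internally (threshold is a guess).
    if s.internal_links >= internal_link_threshold:
        return "Medium"
    return "Low"  # candidates for the rule-based fallback


print(priority_tier(UrlSignals("/products/red-shoes/", impressions_90d=120)))
print(priority_tier(UrlSignals("/tag/misc/")))
```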

---

Structuring the Redirect Map Document

The final map lives in a spreadsheet. Simple. No clever tooling needed at this stage: the file just needs to be unambiguous and importable.

The columns I use:

  1. Source URL (full, absolute URL of the old page)
  2. Destination URL (full, absolute URL of the new page)
  3. Redirect type (301 in almost every case; 302 only for genuinely temporary moves, which are rare)
  4. Match type (exact / pattern / regex)
  5. Category (from the taxonomy step)
  6. Priority tier (High / Medium / Low, based on the scoring above)
  7. Status (Pending / Confirmed / Implemented / Tested)
  8. Notes

That "Notes" column is underrated. It's where you put things like "client confirmed this product is discontinued, redirect to category" or "backlink from Forbes pointing here, map to closest equivalent not homepage." Future-you will thank current-you.

Keep the source URLs exactly as they appear, with or without trailing slash, with query strings if applicable. Inconsistency here causes partial matches and missed redirects that are a nightmare to diagnose post-launch.
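A quick sanity check on the finished file catches the worst of this before anyone imports it. The sketch below assumes each row has been read into a dict with `source` and `destination` keys (for instance via `csv.DictReader`); the key names and the three checks are my choice, not a standard.

```python
from collections import Counter
from urllib.parse import urlsplit


def validate_map(rows):
    """Return a list of (row_number, problem) tuples for a redirect map.

    rows is a list of dicts with 'source' and 'destination' keys.
    Row numbers start at 1 to line up with the spreadsheet data rows.
    """
    problems = []
    counts = Counter(r["source"] for r in rows)
    for i, r in enumerate(rows, start=1):
        src, dst = r["source"], r["destination"]
        if not urlsplit(src).scheme or not urlsplit(dst).scheme:
            problems.append((i, "URL is not absolute"))
        if src == dst:
            problems.append((i, "source redirects to itself"))
        if counts[src] > 1:
            problems.append((i, "duplicate source URL"))
    return problems


# Hypothetical rows illustrating each failure.
rows = [
    {"source": "https://old.example/a/", "destination": "https://new.example/a/"},
    {"source": "https://old.example/a/", "destination": "https://new.example/b/"},
    {"source": "/relative/", "destination": "https://new.example/c/"},
    {"source": "https://old.example/d/", "destination": "https://old.example/d/"},
]
for row_number, problem in validate_map(rows):
    print(row_number, problem)
```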

---

Pattern-Based vs. Exact Redirects

At this scale you absolutely need pattern-based redirects, not just exact-match ones. Writing 20,000 individual Redirect 301 lines in an .htaccess file works, technically, but it's fragile, slow to parse, and a maintenance disaster.

For Apache/WordPress setups, I use regex-based RewriteRules for structural patterns. For example, if every old URL under /old-blog/[post-slug]/ maps to /insights/[post-slug]/, that's one rule, not 4,000.
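Before committing a pattern to server config, I find it useful to prototype it somewhere harmless. Here's a sketch of that one-rule idea in Python, using the /old-blog/ to /insights/ example from above. Python's regex flavour is close to, but not identical to, what Apache and Nginx use, so this proves the logic rather than the final syntax.

```python
import re

# (compiled pattern, replacement) pairs, checked in order.
PATTERN_RULES = [
    # One rule covers every post: /old-blog/<slug>/ -> /insights/<slug>/
    (re.compile(r"^/old-blog/([^/]+)/?$"), r"/insights/\1/"),
]


def apply_rules(path: str):
    """Return the redirect target for a path, or None if no pattern applies."""
    for pattern, replacement in PATTERN_RULES:
        if pattern.match(path):
            return pattern.sub(replacement, path)
    return None


for path in ["/old-blog/hello-world/", "/old-blog/hello-world", "/about/"]:
    print(path, "->", apply_rules(path))
```

Note that the pattern only accepts a single path segment after /old-blog/, so nested URLs fall through to exact matching instead of being redirected somewhere wrong.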

On Nginx, the same principle applies with rewrite directives. On Cloudflare, Bulk Redirects handle large exact-match lists, while Redirect Rules or Workers handle pattern logic. Plan limits differ by tier and have changed over time, so check the current quotas before committing to an approach.

The map document should flag which redirects are pattern-eligible versus which need exact matching. Typically: blog posts, products, and category pages follow patterns. Old campaign pages, legacy subdomains, and weird historical URLs need exact matching.

Test patterns before they go live

I run the full pattern rule set against the URL list in a staging environment and log every redirect response with a tool like Redirect Checker (bulk) or a curl loop in bash. Every chain redirect (old → interim → new) is a problem: Google will follow chains up to a point, but every extra hop adds latency and is widely assumed to leak some link equity. Flatten them before launch.
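Chains can also be caught in the map itself, before anything touches a server. A minimal sketch, assuming the map has been loaded as a dict of source to destination; the `max_hops` default is an arbitrary safety limit of mine.

```python
def flatten(redirects, max_hops=10):
    """Collapse chains so every source points at its final destination.

    redirects is {source: destination}. Raises ValueError on a loop or
    on a chain longer than max_hops.
    """
    flat = {}
    for source in redirects:
        seen = {source}
        target = redirects[source]
        hops = 1
        # Keep following while the target is itself a redirect source.
        while target in redirects:
            if target in seen or hops >= max_hops:
                raise ValueError(f"redirect loop or overlong chain starting at {source}")
            seen.add(target)
            target = redirects[target]
            hops += 1
        flat[source] = target
    return flat


chained = {"/a": "/b", "/b": "/c", "/d": "/c"}
print(flatten(chained))
```

This only sees chains inside your own map. A destination that redirects again on the new platform (a trailing-slash rule, say) still needs the live check in staging.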

---

Handling the Long Tail: The Fallback Strategy

Here's the thing about a 20,000-URL site: several thousand of those URLs probably have zero traffic, zero backlinks, and zero reason for anyone to ever visit them again. Redirecting them all to the homepage creates a different problem: Google tends to treat irrelevant redirects as soft 404s, and it confuses users who followed a specific link.

My fallback hierarchy:

  • If the URL is a subcategory page with no traffic and no links → redirect to the parent category
  • If it's a tag or author archive → redirect to the blog index
  • If it's a truly orphaned page with no logical equivalent → let it 404, and serve a well-designed 404 page with navigation (still returning a 404 status)
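Written as a function, the hierarchy looks something like the sketch below. It assumes the URL has already been confirmed as having no traffic and no links, that categories come from the taxonomy step, and that the blog index lives at /blog/ (a placeholder path).

```python
from typing import Optional


def fallback_destination(path: str, category: str) -> Optional[str]:
    """Pick a fallback target for a zero-traffic, zero-backlink URL.

    Returns a destination path, or None to mean "let it 404".
    Category labels and the /blog/ index path are assumptions.
    """
    if category == "subcategory":
        # Strip the last segment: /category/shoes/trainers/ -> /category/shoes/
        parent = path.rstrip("/").rsplit("/", 1)[0]
        return parent + "/"
    if category in ("tag", "author"):
        return "/blog/"
    return None  # orphaned: serve the custom 404 page with a 404 status


print(fallback_destination("/category/shoes/trainers/", "subcategory"))
print(fallback_destination("/tag/summer/", "tag"))
print(fallback_destination("/old-promo-2019/", "campaign"))
```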

A good custom 404 page with contextual search and popular category links recovers more of these visits than a blanket homepage redirect. I built one for a Seahawk client last year. It had a 28% "recovered" rate (users navigating from the 404 to another page) versus about 9% before.

---

Post-Launch Validation

The redirect map doesn't end at launch. The first 72 hours are critical.

I set up a GSC property verification the day before launch, then monitor the Page indexing report (formerly Coverage) daily for the first two weeks. New 404s surfacing post-launch usually mean URLs that slipped through the inventory: rogue parameter variants, hreflang alternates, or old URLs in external email campaigns.

For each new 404 I find, I add a redirect and push it. Small fires. You want to catch them before Googlebot gives up on those URLs entirely.

Also, check your server logs. Not just GSC. Googlebot visits URLs that aren't linked anywhere based on its own historical crawl data. Log analysis (I use GoAccess for quick reads on smaller server setups) surfaces 404s that GSC sometimes takes a week or more to report.
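If you want the raw list rather than a dashboard, a few lines of Python will pull Googlebot 404s straight out of an access log. This sketch assumes the common "combined" log format used by default on many Apache and Nginx setups, and it trusts the user-agent string, which can be spoofed. The sample log lines are made up.

```python
import re
from collections import Counter

# Combined log format: ip - user [time] "METHOD path proto" status bytes "referer" "agent"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def googlebot_404s(lines):
    """Return [(path, hits), ...] for 404s requested by Googlebot, most hit first."""
    hits = Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m["status"] == "404" and "Googlebot" in m["agent"]:
            hits[m["path"]] += 1
    return hits.most_common()


BOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample = [
    f'66.249.66.1 - - [10/Mar/2024:10:00:00 +0000] "GET /old-blog/missing/ HTTP/1.1" 404 512 "-" "{BOT}"',
    f'66.249.66.1 - - [10/Mar/2024:10:05:00 +0000] "GET /old-blog/missing/ HTTP/1.1" 404 512 "-" "{BOT}"',
    f'66.249.66.1 - - [10/Mar/2024:10:06:00 +0000] "GET /insights/found/ HTTP/1.1" 200 2048 "-" "{BOT}"',
    '203.0.113.9 - - [10/Mar/2024:10:07:00 +0000] "GET /typo/ HTTP/1.1" 404 512 "-" "Mozilla/5.0"',
]
print(googlebot_404s(sample))
```

Sorting by hit count gives you a ready-made priority order for those small fires.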

---

FAQ

How long does building a redirect map for 20,000 URLs actually take?

Realistically, budget two to three weeks of part-time effort, maybe 40–60 hours total depending on how messy the old site's URL structure is. The automated matching phase is fast. The manual review of high-priority URLs and the validation phase eat the most time. Never let a client or PM tell you this can be done "over a weekend."

Should I redirect every single URL, or is it okay to let some 404?

It's fine to let genuinely dead, no-traffic, no-backlink URLs 404 naturally. Forcing a redirect to an irrelevant page creates a soft-404 signal that's arguably worse. Triage ruthlessly. Redirect what matters, and invest in a solid custom 404 experience for the rest.

What redirect type should I use, 301 or 302?

301 (permanent) for almost everything in a migration. A 302 tells Google the move is temporary, so Google may keep the old URL in the index. I've seen agencies use 302s "to be safe" and then watch the old domain keep ranking while the new one stagnates for months. Use 301.

Can I use a plugin to manage 20,000 redirects on WordPress?

Yes, but choose carefully. Redirection by John Godley handles large volumes well and stores rules in the database rather than .htaccess, which keeps the server config manageable, although every lookup still runs through WordPress. For anything above ~10,000 exact-match redirects, I'd still recommend migrating pattern-based rules to server config rather than relying entirely on a plugin.

What's the most common mistake teams make on large migrations?

Starting the redirect map too late. I see it constantly: the dev work is 90% done, launch is two weeks away, and someone asks "so what about redirects?" At that point you're scrambling and inevitably missing things. The redirect map should start being built the moment the new site's URL structure is confirmed. Parallel workstream, not an afterthought.

---

Three weeks of delay, one bottle of Scotch, 94% traffic retention. The maths on getting this right is pretty straightforward.

The redirect map isn't the glamorous part of a migration. Nobody puts it in the case study hero banner. But it's the difference between a migration and a recovery, and I know which one I'd rather be billing for.
