Sometime in late 2022 I convinced myself that building a web hosting directory would be straightforward. Aggregate data, generate pages, rank, monetise. Clean. I'd done programmatic SEO plays before — a localized real estate tool for a UK client, a SaaS comparison site that peaked at 40k monthly visits — so I figured HostList would be a six-week project. It took closer to seven months. And it nearly broke a few things along the way: my sleep schedule, one of my junior devs' confidence, and my hosting budget, courtesy of a £180/month Vercel bill I hadn't planned for.
This is the post-mortem. The real one, not the LinkedIn version.
---
The Brief I Wrote for Myself
HostList was supposed to be simple. A directory of web hosting providers — shared, VPS, dedicated, managed WordPress — with individual pages for each provider, comparison pages, category pages, and location-based pages (e.g. "best hosting in Germany"). Run the maths: ~400 providers × several page types × 20+ filter combinations. You get to 25,000 pages faster than you'd think.
I chose Next.js almost without thinking about it. We use it at Seahawk for most of our bigger React-based builds. The ecosystem is mature, `getStaticProps` and `getStaticPaths` make sense for SEO-heavy static generation, and I personally find the file-based routing easier to reason about than Remix or Gatsby at this scale.
The first real decision was the data layer. I ruled out a headless CMS pretty quickly — I didn't want to pay Contentful rates for 25,000 entries, and I didn't trust a CMS to handle bulk programmatic writes cleanly. We landed on a Postgres database on Supabase, with a lightweight Next.js API layer sitting in front of it. That part actually worked fine. It was almost everything else that got complicated.
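For a sense of how thin that API layer is, a representative route looked roughly like the sketch below. It uses `@supabase/supabase-js`; the table and column names are illustrative, not our real schema.

```typescript
// pages/api/providers/[slug].ts
// A minimal sketch of the thin API layer in front of Supabase.
// Table and column names are illustrative, not the real HostList schema.
import type { NextApiRequest, NextApiResponse } from 'next';
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  const slug = req.query.slug as string;

  const { data, error } = await supabase
    .from('providers')
    .select('name, slug, category, pricing')
    .eq('slug', slug)
    .single();

  if (error || !data) {
    return res.status(404).json({ error: 'Provider not found' });
  }

  // Provider data only changes on scrape runs, so let the CDN cache reads.
  res.setHeader('Cache-Control', 's-maxage=3600, stale-while-revalidate');
  return res.status(200).json(data);
}
```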
---
Static Generation at Scale: What Nobody Warns You About
Here's the thing about `getStaticPaths` with 25,000 routes. It works. Technically. But your build times will make you question your life choices.
Our first full build took 4 hours and 47 minutes. On Vercel. Which, if you're not careful with your plan limits, is the kind of thing that causes a billing notification at 2am. I stared at that Slack alert from my phone and genuinely considered just using WordPress.
The `fallback: 'blocking'` Trap
My initial instinct was to pre-render everything. Every page, every combination. Bad idea — and not for the reason most tutorials warn you about (which is usually just "it takes a while"). The real problem is cache invalidation. When a hosting provider updates their pricing (and they do, constantly), you need to rebuild affected pages. If everything is statically pre-rendered with no ISR, you're triggering full rebuilds for data changes that affect maybe 30 pages out of 25,000.
I switched to Incremental Static Regeneration with a `revalidate` of 86,400 seconds (24 hours) for most pages, and 3,600 seconds for pricing-heavy provider pages. This was the single biggest quality-of-life improvement in the entire project. Build times dropped to under 40 minutes because we were only pre-rendering the top ~2,000 pages by traffic priority and letting the rest generate on-demand with `fallback: 'blocking'`.
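Mechanically, this is the standard Pages Router pattern. A sketch for the comparison route, with two caveats: I've collapsed the `[slugA]-vs-[slugB]` pattern into a single `[pair]` param (which is what the router actually requires), and the fetch helpers are stand-ins for our real Supabase queries.

```tsx
// pages/compare/[pair].tsx
// A sketch of the ISR setup described above.
import type { GetStaticPaths, GetStaticProps } from 'next';

export const getStaticPaths: GetStaticPaths = async () => {
  // Pre-render only the comparison pages with real traffic; the long
  // tail is generated on first request and then cached.
  const pairs = await fetchHighPriorityPairs();
  return {
    paths: pairs.map((pair) => ({ params: { pair } })),
    fallback: 'blocking', // on-demand pages render server-side on first hit
  };
};

export const getStaticProps: GetStaticProps = async ({ params }) => {
  const comparison = await fetchComparison(params!.pair as string);
  if (!comparison) {
    return { notFound: true, revalidate: 86400 };
  }
  return {
    props: { comparison },
    revalidate: 86400, // most pages re-check daily; pricing-heavy ones use 3600
  };
};

export default function ComparePage({ comparison }: { comparison: { pair: string } }) {
  return <h1>{comparison.pair}</h1>;
}

// Placeholder helpers so the sketch stands alone.
async function fetchHighPriorityPairs(): Promise<string[]> {
  return ['siteground-vs-bluehost'];
}
async function fetchComparison(pair: string): Promise<{ pair: string } | null> {
  return { pair };
}
```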
Splitting the Route Tree
One thing I'd do differently, and I tell every dev at Seahawk who touches a big programmatic project now: split your route tree early. Don't have one monolithic `getStaticPaths` function trying to return 25,000 slugs. We broke ours into:
- `/providers/[slug]` — individual provider pages (~400)
- `/compare/[slugA]-vs-[slugB]` — head-to-head comparison pages (~8,000)
- `/category/[type]` — category landing pages (~40)
- `/location/[country]/[type]` — geo × category combinations (~16,000+)
- `/best/[use-case]` — curated list pages (~600)
Each route group has its own revalidation cadence, its own data-fetching logic, and critically, its own build priority. The location pages are almost entirely on-demand. The provider pages are always pre-rendered. Clean separation.
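The practical effect is that each route file's `getStaticPaths` becomes trivially simple and the build priority is explicit. Excerpts from two of the route files, sketched with a stand-in data helper:

```tsx
// fetchAllProviderSlugs is a stand-in for the real Supabase query.
import type { GetStaticPaths } from 'next';

// pages/providers/[slug].tsx: always pre-rendered (~400 pages)
export const getStaticPaths: GetStaticPaths = async () => ({
  paths: (await fetchAllProviderSlugs()).map((slug) => ({ params: { slug } })),
  fallback: 'blocking',
});

// pages/location/[country]/[type].tsx: almost entirely on-demand
export const getStaticPaths: GetStaticPaths = async () => ({
  paths: [], // build nothing up front; each page generates on first request
  fallback: 'blocking',
});

// Placeholder so the sketch stands alone.
async function fetchAllProviderSlugs(): Promise<string[]> {
  return ['siteground', 'bluehost'];
}
```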
---
The Data Pipeline Mess (And How We Fixed It)
Back in early 2023 I made the mistake of building the data collection side of HostList too loosely. We had a scraping script (written in Python, using BeautifulSoup and a rotating proxy pool from Webshare), a manual Google Sheet for corrections, and a Supabase table. Three sources of truth. None of them talking to each other properly.
A junior dev — good kid, just out of a bootcamp — spent three weeks maintaining a sync script between the Sheet and Supabase that broke every time a column name changed. I should have killed the Sheet on week one and built a proper internal admin UI. We eventually did, using Next.js API routes and a Retool dashboard bolted on the side, but we burned probably 60 engineering hours getting there.
The fix: one source of truth, always. The database is canonical. Everything writes to the database. The admin UI reads from and writes to the database. The scraper writes to the database. Sounds obvious. It always does, in hindsight.
Keeping Data Fresh at Scale
For a directory this size, data freshness is an SEO concern as much as a UX one. Google notices when pricing tables show £2.99/month for a plan that's been £5.99 for eight months. We set up:
- A weekly scrape job running on a Railway cron (cheap, reliable, doesn't require a dedicated server)
- A Supabase database webhook that fires when a `price_updated_at` column changes, hitting a Next.js revalidation endpoint
- Manual override flags in Retool for the ~30 providers whose sites actively block scrapers
That revalidation endpoint — `/api/revalidate?secret=TOKEN&path=/providers/siteground` — is a stock Next.js feature, but wiring it to a database webhook took a bit of plumbing. Worth every minute.
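For anyone wiring up the same thing, the endpoint itself is close to the version in the Next.js docs; the webhook side is just an HTTP request to this URL with the changed page's path:

```typescript
// pages/api/revalidate.ts
// The on-demand revalidation endpoint, close to the stock Next.js example.
// The Supabase webhook calls it with ?secret=...&path=/providers/<slug>.
import type { NextApiRequest, NextApiResponse } from 'next';

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  if (req.query.secret !== process.env.REVALIDATE_SECRET) {
    return res.status(401).json({ message: 'Invalid token' });
  }

  const path = req.query.path as string | undefined;
  if (!path || !path.startsWith('/')) {
    return res.status(400).json({ message: 'Missing or malformed path' });
  }

  try {
    await res.revalidate(path); // e.g. /providers/siteground
    return res.json({ revalidated: true, path });
  } catch {
    // If revalidation throws, the last good version of the page keeps serving.
    return res.status(500).json({ message: 'Error revalidating' });
  }
}
```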
---
SEO Architecture: What Actually Moved the Needle
I've built enough content sites to know that having 25,000 pages is not the same as having 25,000 pages that rank. The comparison pages were the trap. We generated every possible A-vs-B combination for our ~400 providers, which gave us roughly 79,800 theoretical pairings (400 × 399 ÷ 2). We built ~8,000 of them. And most of them were, frankly, thin.
Honest confession: I got greedy. The SEO logic was sound — "SiteGround vs Bluehost" gets real search volume, the long-tail of comparison queries is huge — but we didn't build enough unique content per page to justify the existence of every single one. Google started crawling the comparison section and clearly decided it wasn't worth its time. Google's own guidance on thin content is blunt about this, and I should have been blunter with myself earlier.
What We Did to Recover
We culled. Cut the comparison pages from ~8,000 down to ~1,200 — only pairs with demonstrable search volume (verified in Ahrefs, minimum 50 monthly searches globally). Then we enriched the remaining pages with:
- Dynamic "who it's best for" sections pulled from structured provider data
- Real uptime data (we integrated with a third-party uptime API)
- User review summaries seeded from Trustpilot data where available
The result was 1,200 pages that were actually useful instead of 8,000 pages that weren't. Organic traffic to the comparison section went up 340% over the following three months. Counterintuitive until it isn't.
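For the curious, the culling logic itself wasn't sophisticated. A sketch of the shape of it, assuming the Ahrefs keyword export has been loaded into a Map of query string to global monthly volume (names here are illustrative, not our actual script):

```typescript
// A sketch of the comparison-page culling filter.
const MIN_MONTHLY_SEARCHES = 50;

interface Provider {
  name: string;
  slug: string;
}

function viableComparisonPairs(
  providers: Provider[],
  searchVolumes: Map<string, number> // "siteground vs bluehost" -> volume
): Array<[Provider, Provider]> {
  const pairs: Array<[Provider, Provider]> = [];
  for (let i = 0; i < providers.length; i++) {
    for (let j = i + 1; j < providers.length; j++) {
      const a = providers[i];
      const b = providers[j];
      // Searchers phrase the query both ways round, so sum both orderings.
      const volume =
        (searchVolumes.get(`${a.name} vs ${b.name}`.toLowerCase()) ?? 0) +
        (searchVolumes.get(`${b.name} vs ${a.name}`.toLowerCase()) ?? 0);
      if (volume >= MIN_MONTHLY_SEARCHES) {
        pairs.push([a, b]);
      }
    }
  }
  return pairs;
}
```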
Internal Linking at This Scale
With 25,000 pages, internal linking can't be manual. We built a related-pages component that queries Supabase at build time (in `getStaticProps`) and returns the five most relevant adjacent pages based on category and location overlap. No editorial intervention needed. It's not perfect — occasionally a VPS hosting page links to something a bit sideways — but it's 90% right, and it meant every page had contextually relevant internal links from day one.
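The data side of that component is a single query. A sketch under assumed table and column names (the real schema is messier):

```typescript
// A sketch of the related-pages lookup that runs inside getStaticProps.
// The `pages` table and its columns are illustrative names.
import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY!
);

interface RelatedPage {
  title: string;
  path: string;
}

async function getRelatedPages(
  category: string,
  country?: string
): Promise<RelatedPage[]> {
  // Pages sharing both category and location count as most relevant;
  // when the current page has no location, category overlap alone decides.
  let query = supabase.from('pages').select('title, path').eq('category', category);
  if (country) {
    query = query.eq('country', country);
  }
  const { data } = await query.limit(5);
  return data ?? [];
}
```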
---
Performance: The Part That Humbles You
You'd think static generation would make performance easy. And at a conceptual level, it does — pre-rendered HTML, edge-cached on Vercel's CDN, no server-rendering overhead. But 25,000 pages means 25,000 opportunities to have made a bad decision about your component tree.
Our biggest performance problem was the provider comparison table. It was a heavy client-side React component — lots of state, lots of conditional rendering, used on both provider pages and comparison pages. On mobile, it was causing a Largest Contentful Paint of around 4.8 seconds. Bad. Really bad for a site where the primary traffic is people mid-decision on a purchase.
We rebuilt it as a server-rendered static table with a thin React hydration layer for the interactive filter bits. LCP dropped to 1.9 seconds. That's not magic — it's just doing the boring thing properly.
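The structural change, in sketch form: the rows render straight from props, so they arrive in the pre-rendered HTML, and the only stateful piece is the filter control. This is a simplification of the real component, not the component itself.

```tsx
// A sketch of the rebuilt table. The rows come from props at render time,
// so the static HTML is already complete; hydration only has to wire up
// the single piece of state behind the filter control.
import { useState } from 'react';

interface Plan {
  provider: string;
  type: 'shared' | 'vps' | 'dedicated';
  priceGBP: number;
}

export function ComparisonTable({ plans }: { plans: Plan[] }) {
  const [typeFilter, setTypeFilter] = useState<string>('all');
  const visible =
    typeFilter === 'all' ? plans : plans.filter((p) => p.type === typeFilter);

  return (
    <>
      <select value={typeFilter} onChange={(e) => setTypeFilter(e.target.value)}>
        <option value="all">All plan types</option>
        <option value="shared">Shared</option>
        <option value="vps">VPS</option>
        <option value="dedicated">Dedicated</option>
      </select>
      <table>
        <tbody>
          {visible.map((p) => (
            <tr key={`${p.provider}-${p.type}`}>
              <td>{p.provider}</td>
              <td>{p.type}</td>
              <td>£{p.priceGBP.toFixed(2)}/mo</td>
            </tr>
          ))}
        </tbody>
      </table>
    </>
  );
}
```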
The Image Problem
Every provider has a logo. 400 logos, plus screenshots, UI previews, feature icons. We made the mistake of hosting these on Vercel's built-in image optimisation for the first two months. The bandwidth costs were quietly horrifying. Moved everything to Cloudflare R2 with a custom domain, dropped our Vercel bill from £180/month to £40/month. If you're building anything image-heavy, look at Cloudflare R2 early — the free egress is genuinely useful at scale.
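If you go this route on the Pages Router, the cleanest escape hatch is a custom `next/image` loader, so image requests never touch Vercel's optimiser. A sketch with a made-up CDN domain; whether `w` and `q` do anything depends on what resizing, if any, you run in front of the bucket:

```tsx
// lib/r2ImageLoader.ts
// A custom next/image loader pointing at the R2 bucket's custom domain,
// so image traffic bypasses Vercel's billed optimiser. The domain and
// query params are illustrative.
import type { ImageLoaderProps } from 'next/image';

export function r2ImageLoader({ src, width, quality }: ImageLoaderProps): string {
  return `https://cdn.hostlist.example${src}?w=${width}&q=${quality ?? 75}`;
}

// Usage, per image rather than globally:
// <Image loader={r2ImageLoader} src="/logos/siteground.webp"
//        width={120} height={40} alt="SiteGround" />
```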
---
What the Build Pipeline Actually Looks Like Now
For anyone who wants the concrete picture:
- Data collection — Python scraper on a Railway cron job, writes to Supabase Postgres
- Admin layer — Retool dashboard for manual edits, corrections, and provider flags
- Next.js app — Pages Router (we started before App Router was stable enough to trust), deployed on Vercel
- ISR + on-demand revalidation — top ~2,000 pages pre-built, rest on-demand, most with 24h revalidation
- Images — Cloudflare R2, served via a custom subdomain with Cloudflare CDN in front
- Analytics — Plausible for privacy-friendly traffic data, Ahrefs for ranking tracking
- Uptime monitoring — BetterUptime watching the five most traffic-heavy page types
It's not glamorous. It's also largely boring to maintain, which is exactly what you want from infrastructure you're going to leave running for three years.
---
Honest Mistakes, Numbered
1. Started too broad. 25,000 pages was always the goal, but I should have launched with 500 high-quality pages and expanded. Instead I launched with everything and had a Google crawl budget problem for the first four months.
2. Didn't set up revalidation properly from day one. We wasted two months on full rebuilds that ISR would have made unnecessary.
3. Kept the Google Sheet. Single source of truth should have been non-negotiable from week one.
4. Underestimated comparison page quality. Volume is not a strategy.
5. Used Vercel image optimisation for too long. Moved to R2 six weeks later than we should have.
6. Didn't split the route tree early enough. Mixed fast and slow routes in the same `getStaticPaths` call and then wondered why builds were slow.
Every one of these is a decision that seemed reasonable at the time. That's the part tutorials don't capture — bad architectural decisions usually have good-sounding justifications when you make them.
---
FAQ
How long did the initial build take to go live?
Seven months from first commit to a version I was comfortable calling v1. The first rough public version was live at about month four, but it had serious thin-content issues and the comparison section was mostly useless. I'd say four months to "technically live" and another three to "actually good."
Would you use the App Router if you were starting today?
Probably yes, for new projects started in late 2023 onward. The App Router's server components would actually be well-suited to this kind of data-heavy page generation. But migrating an existing 25,000-page Pages Router app is not a project I'm taking on anytime soon. The Pages Router still works, and "works" is underrated.
How do you handle providers that go out of business or change their offering significantly?
We have a `status` flag in the database — `active`, `deprecated`, `redirected`. Deprecated providers get a slim archive page rather than a full removal, which preserves any backlinks. Redirected providers (e.g. when one host acquires another) get a 301 handled via the Next.js `redirects` config in `next.config.js`. We review the status flags monthly.
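The redirect half is plain `next.config.js` configuration. A sketch with hypothetical provider names; in practice you'd generate the array from the database at build time:

```js
// next.config.js: a sketch of the 301 handling for redirected providers.
// Provider names here are hypothetical.
module.exports = {
  async redirects() {
    return [
      {
        source: '/providers/acquired-host',
        destination: '/providers/acquiring-host',
        statusCode: 301, // explicit 301; Next's `permanent: true` emits a 308
      },
    ];
  },
};
```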
What would you use instead of Next.js if you were doing this again?
I genuinely don't know. Astro is interesting for mostly-static content sites, and I've been playing with it on a smaller project. But Next.js gave us the flexibility to have both static and dynamic sections in the same codebase, which mattered. For a purely static directory with no interactive features, Astro might be faster to build and cheaper to run. Ask me again in a year.
How do you stop scrapers from copying the whole directory?
Honestly? You can't, fully. We rate-limit the API routes, use Cloudflare's bot management on the frontend, and rotate some of the structured data so that scraped copies go stale quickly. But if someone wants to clone a public-facing directory, they're going to find a way. The moat is data freshness and UX quality, not technical obfuscation.
---
Closing Thought
HostList is not a runaway success. It makes money — affiliate commissions, a few direct advertising deals — and it ranks reasonably well for maybe 600 of the terms I originally targeted. That's fine. It was a learning project that also happens to generate revenue, which is the best kind.
If you're thinking about building a large-scale programmatic SEO site on Next.js, my honest advice is this: do it. It's genuinely a good stack for the job. But build less than you think you need, build it better than you think you have time for, and sort your data architecture out before you write a single page template.
The tech is the easy part. It always is.
