
Programmatic SEO that survives the Helpful Content Update — built by the operator behind HostList.io.

About 28,000 pages live since 2024 on Next.js plus Supabase. The same playbook applied to your structured data — quality gates, schema strategy, internal linking at scale, sitemap streaming past 50,000 URLs.

HostList.io: ~28,000 pages live · Next.js + Supabase + Vercel · Helpful Content Update survivor · Quality gates on every page

WHAT I LEARNED BUILDING HOSTLIST WITH 28,000 PROGRAMMATIC PAGES

I started HostList in early 2024 as a side project. The idea was straightforward enough: catalogue every web hosting company on the internet, give each one a real page with a real review, and let people compare hosts the way they actually want to compare them. Two and a half years later there are about twenty-eight thousand pages on the site, every one of them generated programmatically from a structured data source, and I have personally watched the site go through every helpful-content wave and core update Google has thrown at it.

The thing nobody tells you when you start a programmatic site is that the work is mostly editorial, not technical. The Next.js side comes together in a couple of weeks. The Supabase schema, the ingestion pipeline, the streaming sitemap, the schema.org emitter — all of that is solved engineering. What takes the rest of the year is figuring out which of your twenty-eight thousand rows actually deserve to be in the index, and what you have to add to the template before any of those rows reads as a real page rather than a database printout with SEO ambitions.

I have come to think of programmatic SEO as the discipline of subtraction. The default move is to ship every row. The right move is to ship only the rows that earned a spot, and then to wrap them in enough editorial context that the page exists for a reason beyond filling a sitemap. Get those two things right and Google leaves you alone through core updates. Get either one wrong and you lose most of your indexed pages within two quarters.

What follows is the playbook I run on HostList every day, applied to client work in the same shape. It is not a marketing pitch. It is the actual checklist.

WHEN PROGRAMMATIC SEO IS THE RIGHT SHAPE

Most ideas pitched to me as programmatic should not be programmatic. The way I sort it on the call is whether the dataset is genuinely interesting and whether the searches are genuinely fragmented across the long tail. Both have to be true. If the dataset is just SEO bait and the long-tail searches you imagine are not really happening, programmatic is the wrong shape, and pushing ahead anyway will cost you most of whatever gets indexed within six months.

A handful of patterns work in 2026, and they are pretty narrow. Comparison sites work because the searcher already knows the names involved and just wants a tiebreaker; Notion versus Linear, Stripe versus Adyen, Cloudways versus Kinsta. Location pages work because local intent is fundamentally fragmented and almost nobody writes it by hand at scale. Industry directories work when the entity-times-filter combination produces queries with real volume; HostList itself is built around exactly that shape, which is why I know the failure modes from running them. Glossary pages work when the term is technical enough that the existing answers on the web are bad. Calculator pages work when the calculation itself plus a methodology page underneath is the actual value to the searcher.

Everything else I get pitched is the bad version. The "we want a million pages of generic content with our brand on them" version, usually packaged as a growth experiment that is meant to ten-x organic traffic in a quarter. Google has been particularly aggressive on this since the Helpful Content Update in late 2022, and the de-indexing waves have only sped up since. I have watched five different teams try the lazy programmatic play in the last two years; all five lost the bulk of their indexed pages inside two quarters. I now turn the work down rather than ship it, which is uncomfortable on the sales call but kinder to everyone in the long run.

HOW THE QUALITY GATES ACTUALLY WORK

Three gates run at build time before any page lands in the sitemap. They are automated, not manual, because at thirty thousand URLs a manual review is not actually a review and pretending otherwise just delays the de-indexing.

Gate one is unique data. Take a page about Cloudways managed WordPress hosting on HostList. It needs at least three things specific to Cloudways. A price band. A feature list. A region. A parent company. A use case. Anything that is not also true of Kinsta or WP Engine. If the page only has a name, a logo, and a generic description, it fails the gate. Held back from the sitemap. Noindexed in the source. The data layer fills in eventually as the team enriches the row, then the page earns its way back into the index. On HostList right now, roughly fifteen percent of the database stays out of the sitemap for exactly this reason.

Gate two is editorial value-add. The template has to do something the data alone cannot. Comparison. Scoring. Recommendation. Aggregation. Pros and cons. A template that just renders the database row in nice typography is not enough, even if the typography is good. This is the gate teams fail most often in practice. They build clever ingestion, miss the editorial wrapper, ship two thousand pages that all look identical underneath the keyword, and then wonder why Google de-indexes them six months later. The wrapper is what signals to Google that the page exists for a reason beyond filling a sitemap.

Gate three is real query intent. Every URL has to map to a query that someone is plausibly searching, with enough volume to be worth indexing. Pages targeting queries under fifty monthly searches are usually noindexed even if they pass the first two gates, because they pollute the sitemap and dilute crawl budget for the strong pages on the same domain. The threshold flexes by industry; we calibrate it per project after looking at Search Console data on adjacent sites in the same vertical.
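Concretely, the gates amount to a build-time function over each row. A minimal sketch in TypeScript, with the caveat that the row shape, field names, and thresholds here are illustrative rather than the production schema:

```typescript
// Illustrative row shape and thresholds; the production schema has more
// columns and the thresholds are calibrated per project, not hard-coded.
type HostRow = {
  slug: string;
  name: string;
  priceBand?: string;
  features?: string[];
  region?: string;
  parentCompany?: string;
  useCase?: string;
  editorialBlocks: string[];   // comparison, scoring, pros-and-cons blocks on the row
  monthlySearchVolume: number; // keyword volume for the page's target query
};

type GateResult = { eligible: boolean; reasons: string[] };

const MIN_UNIQUE_FIELDS = 3;     // gate one
const MIN_EDITORIAL_BLOCKS = 1;  // gate two
const MIN_MONTHLY_SEARCHES = 50; // gate three

export function runQualityGates(row: HostRow): GateResult {
  const reasons: string[] = [];

  // Gate one: unique data. Count the fields that are specific to this entity.
  const uniqueFields = [
    row.priceBand,
    row.features?.length ? row.features : undefined,
    row.region,
    row.parentCompany,
    row.useCase,
  ].filter(Boolean).length;
  if (uniqueFields < MIN_UNIQUE_FIELDS) {
    reasons.push(`only ${uniqueFields} unique data points, need ${MIN_UNIQUE_FIELDS}`);
  }

  // Gate two: editorial value-add beyond the raw row.
  if (row.editorialBlocks.length < MIN_EDITORIAL_BLOCKS) {
    reasons.push("no editorial wrapper (comparison, scoring, recommendation)");
  }

  // Gate three: real query intent with enough volume to be worth indexing.
  if (row.monthlySearchVolume < MIN_MONTHLY_SEARCHES) {
    reasons.push(`query volume ${row.monthlySearchVolume} below threshold`);
  }

  return { eligible: reasons.length === 0, reasons };
}
```

Rows that fail are noindexed and held out of the sitemap, and the reasons list feeds the enrichment queue so a row can earn its way back in later.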

WHAT I CUT FROM HOSTLIST AND WHAT I KEPT

The first thing I cut from the index was the thin tail. About fifteen percent of the database stays out of the sitemap because the unique-data threshold was not met. A row with just a name, a logo, and a one-line generic description is not a page Google should know about; the cost of crawling it is higher than the value of having it indexed. Category pages with under five strong listings also stay out, because a thin category reads as low-effort even when the schema is technically correct. Filter combinations with under three results get noindex automatically through a build-time check.

What I kept and grew was comparison. Head-to-head pages between named hosts ended up being the highest-converting page type on the site, generating about thirty percent of all conversions despite being under five percent of the URL count. I added comparison as a separate template and scaled it deliberately. Category pages with strong unique data also outperformed the generic versions by a wide margin. Not just "best WordPress hosting" but "best WordPress hosting for WooCommerce stores under ten thousand products". Specific. Query-shaped. Useful. The narrower the qualifier, the better the page tended to perform, which runs against most of the SEO advice you read online.

The pages I kept hand-written were the centre of gravity. About two hundred of the twenty-eight thousand are entirely human-written editorial. The methodology page. The scoring rubric. The "how to choose a hosting provider" guide. A handful of strong category landings. They do not scale programmatically and they were never meant to, but they carry disproportionate weight in the topical authority graph and every leaf page links back to them. The twenty-seven thousand eight hundred programmatic pages orbit around the two hundred. That is the structure that survives a core update.

WHAT GOES INTO A PROGRAMMATIC BUILD WE SHIP

The data layer sits on Postgres, either through Supabase or self-hosted depending on what the team is already running. Every facet column is properly indexed, because at scale full-table scans on a filter query become the bottleneck long before page rendering does. Each content type gets a dedicated entities table with quality-gate columns alongside the actual content — uniqueness score, completeness percentage, last-verified timestamp. A sitemap-eligibility view filters out rows below the threshold automatically, so the sitemap and the underlying data stay in sync without manual curation getting involved.
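From the application side it can look something like this: a build-time read against a hypothetical sitemap_eligible_listings view through supabase-js, where the view name, columns, and environment variable names are assumptions for illustration:

```typescript
import { createClient } from "@supabase/supabase-js";

// Build-time read against a hypothetical sitemap_eligible_listings view.
// Environment variable names are illustrative.
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export async function fetchEligibleListings(offset: number, limit: number) {
  // The view already applies the quality-gate columns (uniqueness score,
  // completeness, last-verified), so no eligibility logic lives here.
  const { data, error } = await supabase
    .from("sitemap_eligible_listings")
    .select("slug, updated_at")
    .order("slug", { ascending: true })
    .range(offset, offset + limit - 1);

  if (error) throw error;
  return data;
}
```

Because the eligibility rules live in the view, the sitemap, the templates, and the admin dashboard all get the same answer to "should this row be public" without duplicating the logic.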

Templates come in four shapes. A detail template per entity type, with explicit slots for unique data plus the editorial wrapping. A comparison template for head-to-head between named entities, FAQPage schema attached, never AggregateRating unless first-party reviews actually exist. A category and filter template using CollectionPage with ItemList of qualifying entities, paginated with proper canonical handling so that filter combinations do not create infinite duplicate URLs. And editorial templates using Article schema, hand-written, lower volume, higher topical weight, treated as the spine of the link graph rather than the leaves.
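For a flavour of what the comparison template emits, here is a minimal FAQPage sketch with an illustrative prop shape; the thing to notice is what is missing, namely any AggregateRating without first-party reviews:

```typescript
// Sketch of the JSON-LD a comparison page emits. The prop shape is
// illustrative; the question/answer pairs come from the comparison data.
type ComparisonFaq = { question: string; answer: string };

export function comparisonFaqJsonLd(faqs: ComparisonFaq[]): string {
  // Deliberately no AggregateRating: that only goes on pages that hold
  // first-party reviews.
  return JSON.stringify({
    "@context": "https://schema.org",
    "@type": "FAQPage",
    mainEntity: faqs.map((f) => ({
      "@type": "Question",
      name: f.question,
      acceptedAnswer: { "@type": "Answer", text: f.answer },
    })),
  });
}
```

The output lands in a script tag with type application/ld+json in the page head.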

SEO scaffolding is the part most teams underestimate at scale. The sitemap streams in chunks per template, because a single sitemap.xml maxes out at fifty thousand URLs and most programmatic projects pass that within the first year. Internal linking is generated from the data itself — every leaf links to its category, its location, its named competitors, and similar entities by feature overlap. A build-time SEO linter samples a slice of pages on every deploy and fails the build on any H1 count anomaly, meta description out of range, JSON-LD validity error, or hreflang cluster integrity issue. After launch, AI Overview citation tracking via Otterly or Profound runs weekly to spot when a generative search engine starts citing or stops citing a page on the domain.
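The linter itself does not need to be clever. A sketch of the kind of checks it runs on each sampled page, using naive string matching rather than a proper HTML parser; it only needs to be good enough to fail a build, and the meta description length range is an assumption to tune per project:

```typescript
// One lint pass over a sampled page's rendered HTML. Naive string checks,
// not a full parser; tighten or swap for real tooling as needed.
export function lintPage(url: string, html: string): string[] {
  const problems: string[] = [];

  // Exactly one H1 per page.
  const h1Count = (html.match(/<h1[\s>]/gi) ?? []).length;
  if (h1Count !== 1) problems.push(`${url}: ${h1Count} H1 tags`);

  // Meta description present and inside a sane length range (assumed 70-165).
  const desc =
    html.match(/<meta name="description" content="([^"]*)"/i)?.[1] ?? "";
  if (desc.length < 70 || desc.length > 165) {
    problems.push(`${url}: meta description length ${desc.length}`);
  }

  // Every JSON-LD block must at least parse.
  const blocks =
    html.match(/<script type="application\/ld\+json">([\s\S]*?)<\/script>/gi) ?? [];
  for (const block of blocks) {
    const body = block.replace(/<\/?script[^>]*>/gi, "");
    try {
      JSON.parse(body);
    } catch {
      problems.push(`${url}: invalid JSON-LD`);
    }
  }

  return problems;
}
```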

HOW MUCH PROGRAMMATIC SEO COSTS

Honest ranges, taken from real recent engagements rather than aspirational pricing on a sales deck. A small programmatic build under one thousand entities runs eighteen to thirty thousand US dollars over six to nine weeks. Mid-sized work between one and ten thousand entities, with a structured data import, runs thirty to sixty thousand over eight to fourteen weeks. Larger projects between ten and a hundred thousand entities, with a custom ingestion pipeline against an external API or scraping source, run fifty to ninety thousand over twelve to eighteen weeks. Care plans for ongoing operation, content refresh, and quality-gate maintenance run five hundred to three thousand a month after launch.

Each range includes the data scaffolding, the templates, the SEO linter, and a basic admin dashboard for editorial overrides. They do not include data acquisition itself. Manual editorial, scraping infrastructure, third-party API costs, and original brand and design work are all separate line items. Paid traffic acquisition is also out of scope; programmatic SEO is an organic play and we do not bundle paid media into the engagement. Most projects sit comfortably in the lower half of each band; the upper half exists for genuinely complex builds where the data ingestion or the editorial layer is unusually heavy.

FREQUENTLY ASKED QUESTIONS

What is programmatic SEO?

Programmatic SEO is the practice of generating thousands of pages from a structured data source plus a template, designed to capture long-tail search demand at a scale single-author content cannot match. Each page targets a specific intent — "best CRM for solo founders", "Italian restaurants in Manchester", "shared hosting for WooCommerce" — and earns its place in the index through unique data plus an editorial layer that adds context.

How is programmatic SEO different from regular content SEO?

Regular SEO content is human-written, long-form, and aimed at a small set of high-value keywords. Programmatic SEO is template-plus-data, aimed at the long tail, and lives or dies on the quality of the underlying data plus how the template adds value on top of it. Both can coexist on the same site — most successful programmatic platforms have a strong human-written editorial layer at the hub level and programmatic pages at the leaf level.

When should I use programmatic SEO?

Three signals. You have a structured data source — a database, an API, a clean spreadsheet — that contains thousands of unique entities. The search demand for those entities is real but fragmented across many long-tail queries. And you have an editorial angle — a scoring rubric, a comparison framework, a recommendation system — that the data alone cannot provide. If any of those three is missing, programmatic is the wrong shape.

What is the HostList playbook?

HostList.io is the programmatic SEO directory I built solo to catalogue the entire web hosting industry — about 28,000 hosting company pages live since 2024 on Next.js plus Supabase plus Vercel. The playbook from running it: every page needs three unique data points beyond the entity name, every category page needs at least five strong listings to deserve indexing, internal linking matters more than backlinks at this scale, and pages that fail the quality gate are held back from the sitemap until they earn their way in. We bring this playbook to client programmatic builds.

How do I avoid thin-content penalties?

Three rules. Every page has at least three unique data points specific to that URL — never just a name and a templated description. The template adds context — comparison, scoring, recommendation, aggregation — that the underlying data does not provide on its own. Pages below the quality threshold are blocked from the sitemap and noindexed until the data layer catches up. We hold roughly 15% of HostList's database back from the index for exactly this reason; the indexed pages are the ones with enough unique signal to deserve a spot.

How do I handle a sitemap with 50,000+ URLs?

Stream it. A single sitemap.xml caps at 50,000 URLs and 50 MB. Past that you generate a sitemap index file pointing at multiple chunked sitemaps, each chunked by content type or by ID range. We generate the index at build time and stream each chunk on demand from Postgres so memory usage stays flat regardless of URL count. HostList.io has been past 25,000 URLs since launch; the same pipeline scales to hundreds of thousands without changes.
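In a Next.js App Router project this maps onto the generateSitemaps convention. A sketch with an assumed chunk size, assumed query helpers over the eligibility view, and a placeholder domain:

```typescript
// app/hosts/sitemap.ts — chunked sitemaps via the App Router's
// generateSitemaps convention. CHUNK_SIZE, the query helpers, and the
// domain are assumptions.
import type { MetadataRoute } from "next";
import { countListings, fetchEligibleListings } from "@/lib/listings";

const CHUNK_SIZE = 10_000; // stays well under the 50,000-URL / 50 MB cap

export async function generateSitemaps() {
  const total = await countListings();
  return Array.from({ length: Math.ceil(total / CHUNK_SIZE) }, (_, id) => ({ id }));
}

export default async function sitemap({
  id,
}: {
  id: number;
}): Promise<MetadataRoute.Sitemap> {
  const rows = await fetchEligibleListings(id * CHUNK_SIZE, CHUNK_SIZE);
  return rows.map((row) => ({
    url: `https://example.com/hosts/${row.slug}`,
    lastModified: row.updated_at,
  }));
}
```

Depending on the Next.js version you may still want robots.txt or a hand-built sitemap index pointing at the generated chunk URLs.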

What schema goes on programmatic pages?

Per page type, never invented. Listing pages — Organization, Product, Service, Place, or LocalBusiness depending on what the entity actually is. Comparison pages — FAQPage plus a careful Article emit, never AggregateRating unless you have first-party reviews. Category and tag pages — CollectionPage plus ItemList of the listings on that page. Home and methodology pages — Organization for the directory site itself, Article for editorial. Every page also gets BreadcrumbList. Build-time JSON-LD validation is non-negotiable because schema fails silently in production.
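As a companion to the comparison-template sketch earlier, here is roughly what a category page can emit: CollectionPage wrapping an ItemList of the qualifying listings, plus the BreadcrumbList every page carries. Field names are illustrative:

```typescript
// Category-page JSON-LD sketch: CollectionPage + ItemList, plus the
// BreadcrumbList that goes on every page type.
type ListingRef = { name: string; url: string };
type Crumb = { name: string; url: string };

export function categoryJsonLd(
  title: string,
  listings: ListingRef[],
  crumbs: Crumb[]
): string {
  return JSON.stringify([
    {
      "@context": "https://schema.org",
      "@type": "CollectionPage",
      name: title,
      mainEntity: {
        "@type": "ItemList",
        itemListElement: listings.map((l, i) => ({
          "@type": "ListItem",
          position: i + 1,
          name: l.name,
          url: l.url,
        })),
      },
    },
    {
      "@context": "https://schema.org",
      "@type": "BreadcrumbList",
      itemListElement: crumbs.map((c, i) => ({
        "@type": "ListItem",
        position: i + 1,
        name: c.name,
        item: c.url,
      })),
    },
  ]);
}
```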

How do I build internal linking at scale?

Programmatic. The link graph is generated from the data — each listing links to its category, its location, its named competitors, similar listings by feature overlap, and a small set of curated editorial pages. We model the link graph as a separate query on the listings table and inject the relevant link block into every leaf page at build time. The result is that every leaf has 8-15 contextual outbound internal links plus inbound links from at least three category and comparison pages. Crawl budget then follows the link graph rather than raw page depth.
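A sketch of the link-block builder for a single leaf page. The listing shape, the overlap scoring, and the URL patterns are assumptions, and the real version runs as a query against the listings table rather than in application memory:

```typescript
// Build-time internal link generation for one leaf page. Illustrative
// listing shape; labels would use display names rather than slugs in practice.
type Listing = {
  slug: string;
  category: string;
  location?: string;
  competitors: string[]; // slugs of named competitors
  features: string[];
};

type LinkBlock = { href: string; label: string };

export function buildLinkBlock(self: Listing, all: Listing[]): LinkBlock[] {
  const links: LinkBlock[] = [
    { href: `/category/${self.category}`, label: `All ${self.category} hosts` },
  ];
  if (self.location) {
    links.push({ href: `/location/${self.location}`, label: `Hosts in ${self.location}` });
  }
  for (const slug of self.competitors) {
    links.push({ href: `/compare/${self.slug}-vs-${slug}`, label: `Compare with ${slug}` });
  }

  // Similar listings by feature overlap, best matches first.
  const similar = all
    .filter((l) => l.slug !== self.slug)
    .map((l) => ({
      l,
      overlap: l.features.filter((f) => self.features.includes(f)).length,
    }))
    .filter((x) => x.overlap > 0)
    .sort((a, b) => b.overlap - a.overlap)
    .slice(0, 5);

  for (const { l } of similar) {
    links.push({ href: `/hosts/${l.slug}`, label: l.slug });
  }

  // Cap the block so every leaf lands in the 8-15 link range.
  return links.slice(0, 15);
}
```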

What about search and filter on a programmatic site?

Postgres full-text search up to about 10,000 listings; Algolia or Meilisearch past that. Server-render every filter combination as a URL with a canonical, but noindex thin or duplicate filter combinations to prevent index bloat. The Helpful Content Update has been particularly aggressive on filter-driven thin pages, so we run a build-time check that automatically noindexes any filter combination with under three results.
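In Next.js that check can live in generateMetadata for the filter route, so the canonical and the robots directive come from the same place. A sketch with an assumed result-count helper and a placeholder domain:

```typescript
// app/hosts/[category]/[filter]/page.tsx (metadata only) — sketch.
// countFilterResults is an assumed helper against the listings table.
import type { Metadata } from "next";
import { countFilterResults } from "@/lib/listings";

const MIN_RESULTS_TO_INDEX = 3;

export async function generateMetadata({
  params,
}: {
  params: { category: string; filter: string };
}): Promise<Metadata> {
  const count = await countFilterResults(params.category, params.filter);
  const canonical = `https://example.com/hosts/${params.category}/${params.filter}`;

  return {
    alternates: { canonical },
    // Thin filter combinations stay crawlable but out of the index.
    robots: count < MIN_RESULTS_TO_INDEX ? { index: false, follow: true } : undefined,
  };
}
```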

Will AI search and Google Helpful Content kill programmatic SEO?

They will kill bad programmatic SEO. The thin-content version of programmatic — name plus templated description, no unique data, no editorial layer — was already a bad idea pre-AI-Overviews and is dying faster now. The good version — unique data, original editorial, real value per page — gets cited by AI Overviews and Perplexity precisely because the per-page passages are extractable answers to specific long-tail queries. We build for the second version.

How long does a programmatic SEO build take and what does it cost?

Implementation runs 8-16 weeks typically. Pricing runs 25,000-90,000 USD depending on volume, search and filter complexity, and the data acquisition story. If you bring 5,000 well-structured rows ready to import, the build is faster. If you bring an Excel file or an API that needs rate-limit-respecting ingestion, the data work is half the engagement. Care plans for ongoing operation run 500-3,000 USD per month.

WHAT THE FIRST 48 HOURS LOOK LIKE

Book a 30-minute call. Bring your data source — even a rough description — your industry, and roughly how many entities you think you have. By the end of the call you will have a read on whether programmatic SEO is the right shape for your idea, what the data quality gates would look like for your specific dataset, and a price range. If your idea works better as something else I will tell you that too.