THE DATA MODEL
How I design the schema for a directory site at six-figure scale. Entities, attributes, slugs, indexes, and the design choices that compound.
The schema decisions that compound
A directory site at 28K rows (HostList.io) or 137K rows (Not Another Sunday) succeeds or fails on schema decisions made in the first week. The decisions that matter most: how entities are normalised, how attributes are stored, how slugs are generated, what is indexed.
Every one of these is reversible at small scale and prohibitively expensive to fix at six-figure scale. Get them right early.
Entities, attributes, junctions
One entities table with a master row per company, place, or product. Stable UUID primary key, one column per intrinsic attribute (name, founded_year, primary_market). Avoid jamming attributes into JSON columns at this layer; they should be queryable as first-class columns.
One attributes table per orthogonal axis. categories (each row is a category), regions (each row is a region), price_tiers (each row is a tier). These are the dimensions across which entities get grouped.
Many-to-many relationships via junction tables: entity_categories, entity_regions, entity_features. Never store comma-separated category lists in the entities table; the junction tables are what make filtering, listing, and counting fast.
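The three layers above can be sketched as a minimal schema in an in-memory SQLite session. Table names come from the text; column names beyond name, founded_year, and primary_market are illustrative assumptions:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Master table: one row per entity, stable UUID key,
-- intrinsic attributes as first-class columns (no JSON blobs).
CREATE TABLE entities (
    id             TEXT PRIMARY KEY,
    name           TEXT NOT NULL,
    founded_year   INTEGER,
    primary_market TEXT
);
-- One attribute table per orthogonal axis.
CREATE TABLE categories (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- Junction table: many-to-many between entities and categories.
CREATE TABLE entity_categories (
    entity_id   TEXT    NOT NULL REFERENCES entities(id),
    category_id INTEGER NOT NULL REFERENCES categories(id),
    PRIMARY KEY (entity_id, category_id)
);
""")

eid = str(uuid.uuid4())
conn.execute("INSERT INTO entities VALUES (?, ?, ?, ?)",
             (eid, "Acme Hosting", 2015, "US"))
conn.execute("INSERT INTO categories (name) VALUES ('VPS Hosting')")
conn.execute("INSERT INTO entity_categories VALUES (?, 1)", (eid,))

# Counting entities per category is a cheap indexed lookup,
# not comma-separated string parsing.
count = conn.execute(
    "SELECT COUNT(*) FROM entity_categories WHERE category_id = 1"
).fetchone()[0]
print(count)  # 1
```

The junction rows are what make "how many entities in this category" a single count query, which matters once listing pages show counts.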
Slugs are forever
Once a directory page is indexed by Google, the slug should not change. Generate slugs deterministically from the entity name and resolve collisions with a numeric suffix: nginx-hosting, then nginx-hosting-2.
Maintain a separate slug_history table mapping any historical slug to the current canonical slug. When an entity name needs to change (correction, rebrand), the new canonical slug gets generated, the old slug enters the history table, and a permanent 301 redirect is set up automatically.
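A sketch of the slug lifecycle in Python, with slug_history kept as a dict for illustration (in practice it is the table described above); function names are my own:

```python
import re

def slugify(name: str) -> str:
    """Deterministic base slug from an entity name."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")

def unique_slug(name: str, existing: set[str]) -> str:
    """Resolve collisions with a numeric suffix: foo, foo-2, foo-3..."""
    base = slugify(name)
    if base not in existing:
        return base
    n = 2
    while f"{base}-{n}" in existing:
        n += 1
    return f"{base}-{n}"

# Maps any historical slug -> current canonical slug, so every old
# URL can be answered with a single permanent 301.
slug_history: dict[str, str] = {}

def rename_entity(old_slug: str, new_name: str, existing: set[str]) -> str:
    """Rebrand/correction: mint a new canonical slug, retire the old one."""
    new_slug = unique_slug(new_name, existing)
    existing.discard(old_slug)
    existing.add(new_slug)
    slug_history[old_slug] = new_slug
    # Re-point any slug that previously redirected to old_slug,
    # so redirects stay one hop and never chain.
    for past, target in slug_history.items():
        if target == old_slug:
            slug_history[past] = new_slug
    return new_slug

existing = {"nginx-hosting"}
print(unique_slug("Nginx Hosting", existing))  # nginx-hosting-2
```

The re-pointing loop is the detail that prevents redirect chains: history entries always target the current canonical slug, never an intermediate one.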
The cost of getting slug stability wrong on a 28K-row site is catastrophic: every redirect chain you create is a small ranking erosion, and there are no good after-the-fact fixes.
Indexability at the row level
Add an indexable boolean column to the entities table, default true, with logic that flips it to false when the entity does not meet quality thresholds (insufficient unique content, missing required attributes, no incoming internal links).
The sitemap query filters on indexable = true. The page renderer adds noindex meta when indexable = false. Both derive from the same column, so they can never drift out of sync.
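A sketch of the flip logic with the three quality checks named above; the specific numeric cutoffs are assumptions, not from the text:

```python
REQUIRED_ATTRS = ("name", "founded_year", "primary_market")
MIN_DESCRIPTION_CHARS = 300  # threshold is an assumption for illustration

def compute_indexable(entity: dict) -> bool:
    """True only when the entity clears every quality threshold."""
    enough_content = len(entity.get("description") or "") >= MIN_DESCRIPTION_CHARS
    has_required = all(entity.get(k) is not None for k in REQUIRED_ATTRS)
    has_inlinks = (entity.get("incoming_internal_links") or 0) > 0
    return enough_content and has_required and has_inlinks

# Both consumers read the stored result of this one function:
#   sitemap:  SELECT slug FROM entities WHERE indexable = true
#   renderer: emit <meta name="robots" content="noindex"> when not indexable
```

Running this on every write (rather than at render time) is what makes the sitemap and the meta tag a single source of truth.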
On HostList.io, roughly 18K of 28K entities are indexable at any given time. The other 10K are crawlable but not indexable, which keeps the indexed surface clean and the rest of the indexed pages stronger.
Indexes that matter
Index on slug for the page-render lookup. Index on (indexable, last_updated_at) for the sitemap query. Index on (category_id, indexable) on the junction table for category-listing pages. Index on (region_id, indexable) for regional pages.
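The four indexes as DDL, verified in an in-memory SQLite session. Index names are illustrative; note that the composite indexes on the junction tables imply a denormalised indexable flag there, which is an assumption:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (
    id              TEXT PRIMARY KEY,
    slug            TEXT NOT NULL,
    indexable       INTEGER NOT NULL DEFAULT 1,
    last_updated_at TEXT
);
-- Junction tables carry a denormalised indexable flag so listing
-- queries can filter without joining back to entities.
CREATE TABLE entity_categories (
    entity_id TEXT NOT NULL, category_id INTEGER NOT NULL,
    indexable INTEGER NOT NULL
);
CREATE TABLE entity_regions (
    entity_id TEXT NOT NULL, region_id INTEGER NOT NULL,
    indexable INTEGER NOT NULL
);

CREATE UNIQUE INDEX idx_entities_slug ON entities (slug);
CREATE INDEX idx_entities_sitemap ON entities (indexable, last_updated_at);
CREATE INDEX idx_ec_category_listing ON entity_categories (category_id, indexable);
CREATE INDEX idx_er_region_listing ON entity_regions (region_id, indexable);
""")

# Confirm all four indexes exist.
names = sorted(r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' AND name LIKE 'idx_%'"))
print(names)
```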
At six-figure-row scale, an unindexed page-render query takes seconds. With proper indexes, it takes milliseconds. The build-time impact compounds: 28K page renders at 100ms each is 47 minutes; at 10ms each it is 5 minutes.
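The build-time arithmetic above, worked through:

```python
renders = 28_000
for per_render_ms in (100, 10):
    minutes = renders * per_render_ms / 1000 / 60
    print(f"{per_render_ms} ms/render -> {minutes:.0f} min full build")
# 100 ms/render -> 47 min full build
# 10 ms/render -> 5 min full build
```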
Run EXPLAIN ANALYZE on every page-render query at scale. The default query planner gets it wrong on directory-shaped joins more often than you would expect.
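EXPLAIN ANALYZE is PostgreSQL's tool, but the habit can be rehearsed in a throwaway SQLite session, where EXPLAIN QUERY PLAN plays a similar role. A sketch (table and index names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (id TEXT PRIMARY KEY, slug TEXT NOT NULL);
CREATE UNIQUE INDEX idx_entities_slug ON entities (slug);
""")

# The page-render lookup: the plan should report a search on
# idx_entities_slug, not a full table scan.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM entities WHERE slug = ?",
    ("nginx-hosting",),
).fetchall()
for row in plan:
    print(row[-1])  # detail string, e.g. SEARCH ... USING INDEX ...
```

If the detail line says SCAN rather than SEARCH, the index is not being used and the query needs attention before it runs 28K times per build.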
WHEN YOU ARE READY TO TALK
If you are mid-build on something this guide touches and want a second pair of eyes, the fastest path is a 30-minute call.
BOOK YOUR 30-MIN CALL