Schema Markup at Scale: JSON-LD for 91,000 Pages

Back in 2021, a travel client handed Seahawk a migration brief that made my stomach drop a little. Ninety-one thousand destination and hotel pages. Each one needed valid, specific, tested schema markup, not the lazy one-size-fits-all WebPage type that most plugins slap on before calling it a day. The client had already tried two "automatic schema" WordPress plugins. Both had produced technically valid JSON-LD that was also, in every meaningful sense, useless: generic names, no nested entities, prices missing, review aggregates pointing at the wrong thing. Google's Rich Results Test was politely confused.

That project taught me more about schema at scale than the previous eight years combined. So here's what I actually know.

---

Why "Just Install a Plugin" Breaks at Scale

Look, I'm not here to dunk on Yoast or Rank Math. For a 40-page brochure site they're genuinely fine. But somewhere around the 500-page mark, plugin-generated schema starts to buckle under its own assumptions.

The core problem is that plugins are built around page templates, not data models. They read the post title, maybe a custom field or two, and construct a schema blob. When your site has 91,000 pages across six content types (hotels, destinations, tours, reviews, FAQs, and author profiles), a single plugin configuration cannot express that variety without enormous manual override work. And if you're doing manual overrides at that scale, you've already lost.

Here's the thing: schema markup is fundamentally a data transformation problem. You have structured data in a database; you need it expressed as JSON-LD in a <script> tag. That's it. The moment you frame it that way, the right architecture becomes much clearer.
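To make the "data transformation" framing concrete, here's a minimal sketch. It's in Python so it runs on its own (the project itself was PHP), and the column names and example.com URL are invented for illustration.

```python
import json


def hotel_row_to_jsonld(row: dict) -> dict:
    """Map one database row (hypothetical column names) to a schema.org Hotel."""
    return {
        "@context": "https://schema.org",
        "@type": "Hotel",
        "name": row["title"],
        "url": row["permalink"],
        "address": {
            "@type": "PostalAddress",
            "addressLocality": row["city"],
            "addressCountry": row["country_code"],
        },
    }


def render_script_tag(data: dict) -> str:
    """Serialise the data and wrap it in a JSON-LD script tag."""
    payload = json.dumps(data, ensure_ascii=False)
    # Escape "</" so a stray "</script>" inside a value cannot close the tag early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'


row = {
    "title": "Hôtel du Lac",
    "permalink": "https://example.com/hotels/hotel-du-lac/",
    "city": "Annecy",
    "country_code": "FR",
}
tag = render_script_tag(hotel_row_to_jsonld(row))
print(tag)
```

The point is the split: one function that only knows about data, one that only knows about serialisation.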

The Three Failure Modes I Keep Seeing

Static schema blobs hardcoded in templates. Fine until the product name changes, then you've got 12,000 pages lying to Google.
Plugin configs that can't handle conditional logic, like only showing aggregateRating when there are actually reviews, or a different @type per post category.
Batch-generated files uploaded once and never updated. I've audited sites where the schema was eighteen months stale. The prices were wrong. The event dates had passed.

---

How JSON-LD Actually Works at Scale

Before getting into tooling: a quick grounding. JSON-LD (JSON for Linked Data) is Google's preferred schema format precisely because it lives in a <script> block, separate from your HTML. That means you can generate it server-side, inject it cleanly, and update it without touching markup. That separation is everything when you're dealing with tens of thousands of pages.

The Schema.org vocabulary is vast. Most people use about 1% of it. At scale you need to go deeper: Hotel, TouristDestination, LocalBusiness, Review, AggregateRating, nested Offer objects, BreadcrumbList. Each type has required and recommended properties, and Google's interpretation of "recommended" is basically "required if you want the rich result."

The fundamental rule I work to: one primary @type per page, with nested types as needed. Don't stack five @type values hoping one sticks. Pick the most specific type that fits, then nest supporting types inside it.
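As an illustration of that rule, here's a sketch of a single Hotel entity with its supporting types nested inside it, written as a Python dict for readability. Every value is a placeholder, and using makesOffer to carry the nested Offer is my assumption about how you'd attach a price to a Hotel.

```python
import json

# One primary @type (Hotel); everything else hangs off it as a nested entity.
hotel = {
    "@context": "https://schema.org",
    "@type": "Hotel",
    "name": "Example Harbour Hotel",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Lisbon",
        "addressCountry": "PT",
    },
    "aggregateRating": {
        "@type": "AggregateRating",
        "ratingValue": 4.6,
        "reviewCount": 212,
    },
    "makesOffer": {
        "@type": "Offer",
        "price": "140.00",
        "priceCurrency": "EUR",
    },
}


def nested_types(node: dict) -> list:
    """Collect the @type of every entity nested below the top level."""
    found = []
    for value in node.values():
        children = value if isinstance(value, list) else [value]
        for child in children:
            if isinstance(child, dict):
                if "@type" in child:
                    found.append(child["@type"])
                found.extend(nested_types(child))
    return found


print(json.dumps(hotel, indent=2))
print(nested_types(hotel))
```

The top-level @type stays a single string; the helper is only there to show which supporting types ended up nested.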

---

The Architecture We Actually Used

For the travel client, we ended up with a three-layer system. Not elegant in a whiteboard-diagram way, but it worked.

Layer 1: Template-Level Schema Classes (PHP)

Each content type got its own PHP class responsible for building its schema array: HotelSchemaBuilder, DestinationSchemaBuilder, TourSchemaBuilder. You get the idea. Each class pulled from ACF Pro custom fields, WooCommerce data where applicable, and a few computed values (like calculating aggregateRating from a CPT-based review system).

The output of each class was a plain PHP array. No JSON yet. Just data.

This matters because it means you can unit test the data logic separately from the serialisation. I wish I'd done that from day one on this project. I didn't. That cost us about two days of debugging in staging when ratingValue was returning a string instead of a float and Google's validator was silently ignoring the whole aggregateRating block.
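Here's roughly what a builder in that style looks like, sketched in Python rather than the project's PHP so it can run standalone. HotelRecord and its fields are hypothetical stand-ins for the post and its custom fields; the class name is borrowed from above, but this is an illustration, not the original code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HotelRecord:
    """Hypothetical stand-in for a hotel post plus its custom fields."""
    name: str
    url: str
    review_scores: list = field(default_factory=list)  # may arrive as strings


class HotelSchemaBuilder:
    def __init__(self, record: HotelRecord) -> None:
        self.record = record

    def build(self) -> dict:
        """Return plain data only; serialisation happens elsewhere."""
        data = {"@type": "Hotel", "name": self.record.name, "url": self.record.url}
        rating = self._aggregate_rating()
        if rating is not None:  # only emit the block when reviews exist
            data["aggregateRating"] = rating
        return data

    def _aggregate_rating(self) -> Optional[dict]:
        # Cast every score to float so ratingValue is never a string.
        scores = [float(score) for score in self.record.review_scores]
        if not scores:
            return None
        return {
            "@type": "AggregateRating",
            "ratingValue": round(sum(scores) / len(scores), 1),
            "reviewCount": len(scores),
        }


reviewed = HotelSchemaBuilder(
    HotelRecord("Example Inn", "https://example.com/hotels/example-inn/", ["4", "5", 4.5])
).build()
unreviewed = HotelSchemaBuilder(
    HotelRecord("New Inn", "https://example.com/hotels/new-inn/")
).build()
```

Because build() returns a plain structure, a test can assert on the type of ratingValue directly, which is exactly the check that would have caught the string-versus-float bug.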

Layer 2: A Central Schema Manager

A single SchemaManager class, hooked into wp_head, was responsible for:

Determining which builder class to invoke based on the current template/post type
Merging in sitewide entities (the Organization graph, WebSite with SearchAction, BreadcrumbList)
Encoding the final array as JSON with JSON_PRETTY_PRINT | JSON_UNESCAPED_SLASHES | JSON_UNESCAPED_UNICODE
Wrapping it in a <script type="application/ld+json"> tag and echoing it
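The four responsibilities above can be sketched like this. It's Python, not the PHP we shipped, and the register() method, the dict-shaped post, and the use of an @graph array to merge the sitewide entities are all assumptions made for the example.

```python
import json


class SchemaManager:
    """Pick a builder per post type, merge sitewide entities, serialise, wrap."""

    def __init__(self, sitewide: list) -> None:
        self._sitewide = sitewide  # e.g. Organization and WebSite nodes
        self._builders = {}        # post type -> callable(post) -> dict

    def register(self, post_type: str, builder) -> None:
        self._builders[post_type] = builder

    def graph_for(self, post: dict) -> dict:
        nodes = list(self._sitewide)
        builder = self._builders.get(post["post_type"])
        if builder is not None:
            nodes.append(builder(post))
        return {"@context": "https://schema.org", "@graph": nodes}

    def render(self, post: dict) -> str:
        # indent and ensure_ascii=False roughly mirror JSON_PRETTY_PRINT and
        # JSON_UNESCAPED_UNICODE; Python's json module never escapes slashes.
        payload = json.dumps(self.graph_for(post), indent=2, ensure_ascii=False)
        payload = payload.replace("</", "<\\/")  # keep values from closing the tag
        return f'<script type="application/ld+json">\n{payload}\n</script>'


manager = SchemaManager(sitewide=[{"@type": "Organization", "name": "Example Travel"}])
manager.register("hotel", lambda post: {"@type": "Hotel", "name": post["title"]})
tag = manager.render({"post_type": "hotel", "title": "Example Inn"})
print(tag)
```

Post types with no registered builder still get the sitewide entities, which is usually what you want for plain pages.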

The breadcrumb logic was the trickiest part. Destinations had a three-tier hierarchy: Region → Country → City. Getting the BreadcrumbList to reflect that dynamically, without hardcoding anything, meant traversing post ancestors at render time. Slow, if you're not careful. We cached the breadcrumb arrays per post ID in a transient with a 24-hour TTL. That brought the overhead down to negligible.

Layer 3: Validation and Monitoring

Generating schema is step one. Knowing when it breaks is step two, and most teams skip it entirely.

We set up a Google Search Console property and watched the Rich Results report weekly. But that's reactive: GSC tells you about errors after Google has crawled the page. For proactive checks, we ran SchemaApp on a crawl of the top 2,000 pages monthly. It surfaces property-level errors that the GSC report obscures.

Also: Google's Rich Results Test has an API. We wrote a small script that would hit the API with a random sample of 50 URLs nightly and log any validation failures. Cheap insurance.
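The sampling half of that script is easy to sketch. In the Python below, validate is a placeholder callable that takes a URL and returns a list of error strings; the real request and response handling are deliberately left out, since I'm not reproducing the endpoint details here.

```python
import random


def nightly_sample(urls: list, validate, sample_size: int = 50, seed=None) -> list:
    """Validate a random sample of URLs and return (url, errors) for failures.

    `validate` is a hypothetical callable: url -> list of error strings.
    """
    rng = random.Random(seed)
    chosen = rng.sample(urls, min(sample_size, len(urls)))
    failures = []
    for url in chosen:
        errors = validate(url)
        if errors:
            failures.append((url, errors))
    return failures


# Usage with a fake validator, just to show the shape of the result.
all_urls = [f"https://example.com/hotels/{n}/" for n in range(200)]


def fake_validate(url: str) -> list:
    return ["missing name"] if url.endswith("7/") else []


for url, errors in nightly_sample(all_urls, fake_validate, seed=1):
    print(url, errors)
```

Run from cron, the returned list is what you'd write to a log or push to whatever alerting you already have.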

---

Handling Dynamic Data Without Killing Performance

Here's where most scale implementations fall over. Schema that references live data — pricing, availability, review counts — has to stay fresh. But regenerating JSON-LD on every single page load for 91,000 pages isn't free.

My approach, and I've refined this across maybe a dozen large sites since:

Cache aggressively, invalidate smartly.

For hotel pages, the schema blob was stored as post meta — a serialised JSON-LD string — and regenerated only when:

The post itself was updated
A new review was submitted for that post
The price custom field changed (we hooked into the ACF save_post action for this)

Everything else served the cached string. Dead fast. And because the invalidation hooks were specific, the schema stayed accurate.

One thing I got wrong initially: I cached the full <script> tag, including the opening and closing elements. Then we needed to change the @context URL for one content type. Had to bust every cache entry. Now I cache only the JSON string and wrap it at render time. Five minutes of extra code, saved an hour of head-scratching.
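Here's the shape of that cache as a small Python sketch (again, the real thing was PHP storing the string in post meta). The SchemaCache class, its in-memory dict, and the builds counter are illustrative assumptions.

```python
import json


class SchemaCache:
    """Cache only the JSON string per post; wrap it in a tag at render time."""

    def __init__(self, build) -> None:
        self._build = build  # callable: post_id -> dict
        self._store = {}     # stand-in for post meta
        self.builds = 0      # counts regenerations, for demonstration

    def json_for(self, post_id: int) -> str:
        if post_id not in self._store:
            self._store[post_id] = json.dumps(self._build(post_id), ensure_ascii=False)
            self.builds += 1
        return self._store[post_id]

    def invalidate(self, post_id: int) -> None:
        # Call this from the specific events that change the data:
        # post updated, new review submitted, price field changed.
        self._store.pop(post_id, None)

    def render(self, post_id: int) -> str:
        payload = self.json_for(post_id).replace("</", "<\\/")
        return f'<script type="application/ld+json">{payload}</script>'


cache = SchemaCache(
    lambda post_id: {"@context": "https://schema.org", "@type": "Hotel", "name": f"Hotel {post_id}"}
)
cache.render(7)
cache.render(7)      # served from the cached string
cache.invalidate(7)  # e.g. a new review came in
cache.render(7)      # rebuilt once, then cached again
print(cache.builds)
```

Keeping the wrapper out of the cached value means the tag markup can change without touching any stored entry.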

What About Real-Time Prices?

For tour pricing that changed multiple times a day, we took a different approach. The base schema was cached, but the Offer block was generated fresh at request time and merged in before serialisation. Yes, it added a small overhead per request. But it was one database query per page load, not twelve. Acceptable trade-off.
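A sketch of that merge step in Python. The TouristTrip type, the fetch_offer function, and the price values are assumptions for illustration; the idea is only that the cached base never contains the Offer.

```python
import json


def with_fresh_offer(cached_json: str, fetch_offer) -> str:
    """Merge a freshly fetched Offer into the cached base schema."""
    data = json.loads(cached_json)
    data["offers"] = fetch_offer()  # the one live lookup per request
    return json.dumps(data, ensure_ascii=False)


# Cached base: everything except pricing.
cached = json.dumps(
    {"@context": "https://schema.org", "@type": "TouristTrip", "name": "Harbour Kayak Tour"}
)


def fetch_offer() -> dict:
    # Hypothetical single price query; the values here are placeholders.
    return {
        "@type": "Offer",
        "price": "129.00",
        "priceCurrency": "EUR",
        "availability": "https://schema.org/InStock",
    }


fresh = with_fresh_offer(cached, fetch_offer)
print(fresh)
```

The cached string is never mutated, so a failed price lookup can fall back to serving the base schema without an Offer.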

---

Scaling to Multiple Sites: The Seahawk Angle

Seahawk has built over 12,000 sites, and schema implementation comes up on a significant chunk of them. The travel client was an extreme case. But the same architectural principles apply whether you're doing 91,000 pages or 4,000.

What I've settled on as a reusable pattern is a small internal WordPress plugin, which we call seahawk-schema-core, that provides the manager/builder scaffolding without any content-type-specific logic. Client projects extend it with their own builder classes. No plugin dependencies for the core schema logic. No risk of a third-party plugin update blowing up a site's entire rich results presence.

That last point is more real than people admit. I've seen Rank Math updates silently break custom schema overrides. Not because Rank Math is bad — it isn't — but because when you're customising output at the level a large site requires, you're operating outside what the plugin was designed to handle. Own the code, own the risk profile.

---

Testing at This Scale: A Practical Checklist

You cannot manually test 91,000 URLs. So you test intelligently.

Sample by template type. Pick 10 URLs per content type. Test those. If the builder is correct for one hotel page, it's correct for all 3,000 hotel pages (unless there's bad data; more on that below).
Test edge cases specifically. Pages with no reviews. Pages with incomplete custom fields. Pages with special characters in titles (&, ", accented characters). JSON serialisation handles most of these, but not all of them.
Run a full structured data crawl with Screaming Frog. The Screaming Frog SEO Spider has a structured data extraction mode that'll pull and validate JSON-LD from every URL it crawls. Export the errors, group by template type, fix at the source.
Monitor GSC's Enhancements tab. Set a threshold alert: if valid items drop by more than 5% week-over-week, something broke. Act within 48 hours.
Spot-check after every deployment. Even if the schema code didn't change. Database migrations, plugin updates, theme changes: any of them can introduce upstream data issues that corrupt schema output.
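The 5% threshold is simple arithmetic, but it's worth writing down so nobody argues about it at 9am on a Monday. A small Python sketch, assuming you already have last week's and this week's valid-item counts from somewhere (how you export them is up to you):

```python
def valid_items_dropped(previous: int, current: int, threshold: float = 0.05) -> bool:
    """True when valid items fell by more than `threshold` week-over-week."""
    if previous <= 0:
        return False  # nothing to compare against yet
    drop = (previous - current) / previous
    return drop > threshold


# 3,000 valid hotel items last week, 2,840 this week: a 5.3% drop.
print(valid_items_dropped(3000, 2840))
# 3,000 down to 2,900 is a 3.3% drop, under the threshold.
print(valid_items_dropped(3000, 2900))
```

Run it per template type rather than sitewide, so a broken tour builder isn't hidden by 3,000 healthy hotel pages.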

Bad Data Is the Silent Killer

The travel site had a content team of twelve people across three countries. Some destination pages had malformed HTML in the description field, pasted from Word, presumably. When that field fed into the schema description property, the JSON was technically valid but the description included HTML entities and stray <span> tags. Google ignored the property. We added a sanitisation step in every builder class that strips tags and decodes HTML entities before the value hits the schema array. Solved it permanently.
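A minimal version of that sanitisation step, sketched in Python with only the standard library. The regex is a deliberately blunt tag stripper and the function name is made up; in WordPress you'd reach for the platform's own helpers instead.

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")


def clean_text(raw: str) -> str:
    """Strip tags, decode HTML entities, and collapse leftover whitespace."""
    no_tags = _TAG_RE.sub(" ", raw)  # swap tags for spaces so words don't fuse
    decoded = html.unescape(no_tags)  # &amp; -> &, &nbsp; -> non-breaking space
    return " ".join(decoded.split())  # split() also treats the nbsp as whitespace


pasted = 'A&nbsp;quiet <span style="font-family:Calibri">bay</span> &amp; beach'
print(clean_text(pasted))
```

Tags are stripped before entities are decoded, so text that was escaped on purpose (like a literal &lt;b&gt;) survives as text rather than being removed.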

---

The Entity Graph: Don't Ignore It

One thing that separates mediocre schema work from genuinely good technical SEO is the entity graph: specifically, the sitewide Organization and WebSite entities that should appear on every page and link everything together.

Most sites have these, poorly. Name, URL, maybe a logo. The full Organization type supports sameAs links to your Wikidata entry, social profiles, and other authoritative sources. That cross-linking is how Google builds confidence that your Organization entity in its Knowledge Graph is the same entity appearing in your page schema.

For the travel client, we built out the Organization block with:

sameAs pointing to their Crunchbase profile, LinkedIn page, and a Wikipedia stub they had
contactPoint with structured phone and department info
foundingDate and numberOfEmployees (rough range; this is public info anyway)
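For a sense of the shape, here's an Organization block with those three groups of properties, written as a Python dict. Every value is a placeholder (the client's real details aren't reproduced here), and expressing the employee range as a QuantitativeValue with minValue and maxValue is my choice for the example.

```python
import json

organization = {
    "@type": "Organization",
    "@id": "https://example.com/#organization",
    "name": "Example Travel",
    "url": "https://example.com/",
    # Cross-links to profiles that describe the same entity.
    "sameAs": [
        "https://www.crunchbase.com/organization/example-travel",
        "https://www.linkedin.com/company/example-travel",
        "https://en.wikipedia.org/wiki/Example_Travel",
    ],
    "contactPoint": [
        {
            "@type": "ContactPoint",
            "telephone": "+1-555-0100",
            "contactType": "reservations",
        }
    ],
    "foundingDate": "2009-03-01",
    # A rough range rather than an exact headcount.
    "numberOfEmployees": {
        "@type": "QuantitativeValue",
        "minValue": 50,
        "maxValue": 100,
    },
}

print(json.dumps(organization, indent=2))
```

Giving the node a stable @id lets page-level entities point back at it instead of repeating the whole block.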

Did it move rankings overnight? No. Schema almost never does in isolation. But it's infrastructure. You build it once, properly, and it compounds over time.

---

FAQ

How long does it take to implement schema at this scale?

For the 91,000-page travel site, the full implementation — architecture, builder classes, caching layer, testing, GSC monitoring setup — took about six weeks with two developers. That sounds like a lot. But half of that time was auditing the existing data quality, not writing schema code. If your data is clean, you can move faster.

Should I use a plugin or build custom for large sites?

For anything under a few hundred pages, a plugin is genuinely fine. Rank Math's schema module is solid and the custom schema block gives you reasonable flexibility. Above a few thousand pages with multiple distinct content types, I'd go custom every time. The control is worth the build cost.

What's the single most common schema mistake at scale?

Missing aggregateRating when reviews exist, or including it when they don't. Google is strict about this. If your schema claims an aggregateRating of 4.7 from 843 reviews and a user lands on the page and sees no reviews, that's a manual action waiting to happen. Conditional logic in your builder classes is non-negotiable.

Does schema directly improve rankings?

Directly? Probably not much for most query types. What it does is unlock rich results — star ratings, FAQ dropdowns, review snippets, breadcrumbs in the SERP — and those features improve click-through rates measurably. The travel client saw a 22% CTR increase on hotel pages within four months of full implementation. That feeds into engagement signals, which do affect rankings. So: indirectly, yes. Substantially.

What tools do you actually use day-to-day for schema work?

Screaming Frog for crawl-level auditing. Google's Rich Results Test for spot-checks. Schema Markup Validator at validator.schema.org for property-level validation. And honestly, the Schema.org documentation itself: I have the Hotel type page and a handful of others bookmarked and I refer to them constantly. No fancy subscription tool needed.

---

Schema at scale is one of those problems that looks like a plugin problem until you're inside it and realise it's actually a software architecture problem dressed in SEO clothing. Get the data model right. Cache intelligently. Validate relentlessly. The markup itself is almost the easy part.