A prospect called us in February with a question we get a lot now: "Can you build me 8,000 pages?" He ran an HVAC parts distributor and had read a case study about a competitor ranking for "[part number] compatibility with [model number]" across tens of thousands of keyword permutations. He wanted the same thing. He also wanted to know whether Google would penalize him into oblivion the day he launched.
The honest answer was: it depends entirely on what's underneath the pages.
The line Google actually draws
The March 2024 core update, which folded the helpful content system into core ranking and arrived alongside a new spam policy on scaled content abuse, and the half-dozen tightening passes that have followed it, do not penalize "scaled content." They penalize scaled content that has no reason to exist. The distinction matters. Zillow has millions of pages. Yelp has millions of pages. Both rank fine. Both have real, page-specific data on every one of those URLs that nobody else has assembled in that combination.
The pages that get nuked are the ones where AI generated 8,000 variations of "Looking for the best [service] in [city]? At [company], we offer..." with city and service swapped in. There is no data underneath. There is no reason that page exists except as bait. Google's classifier has gotten very good at finding that pattern, and it's been getting better in roughly six-month increments.
What "real data underneath" means
We tell clients to ask one question: if I removed the prose from this page, would the remaining structured data still be useful? If the answer is yes, programmatic scale is defensible. If the answer is no, you're building a spam farm with extra steps.
Useful data underneath looks like proprietary numbers: pricing comparisons you've collected, response times you've measured, inventory counts that change, specifications you've normalized across manufacturers. It looks like calculators that actually compute. It looks like comparisons of two real entities with real attributes pulled from a database, not invented by a model.
The HVAC distributor had real data. He had a parts catalog of about 42,000 SKUs, compatibility tables that took him a decade to build, and live stock numbers. Pages built on top of that catalog — "Is the Carrier 38AUZA14A0A6 compatible with the Bryant 213BNA042?" with a yes/no answer, a parts list, and stock status — are pages that should exist. Nobody else has assembled them. The AI-written prose around the data is decoration. The data is the reason Google sends traffic.
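To make "the data is the reason" concrete, here is a minimal sketch of a compatibility page's payload built with no prose at all. The table names, column names, and the placeholder SKUs (PART-A, MODEL-X) are assumptions for illustration, not the distributor's actual schema or catalog.

```python
import sqlite3


def build_page_data(conn, part_sku, model_sku):
    """Return the structured payload for one compatibility page.

    Returns None when the catalog has no row for the pair, in which
    case no page should be generated at all.
    """
    row = conn.execute(
        "SELECT compatible FROM compatibility "
        "WHERE part_sku = ? AND model_sku = ?",
        (part_sku, model_sku),
    ).fetchone()
    if row is None:
        return None
    stock = conn.execute(
        "SELECT on_hand FROM stock WHERE sku = ?", (part_sku,)
    ).fetchone()
    return {
        "part": part_sku,
        "model": model_sku,
        "compatible": bool(row[0]),
        "in_stock": stock is not None and stock[0] > 0,
    }


# Hypothetical in-memory catalog with placeholder rows.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE compatibility (part_sku TEXT, model_sku TEXT, compatible INTEGER);
    CREATE TABLE stock (sku TEXT, on_hand INTEGER);
    INSERT INTO compatibility VALUES ('PART-A', 'MODEL-X', 1);
    INSERT INTO stock VALUES ('PART-A', 12);
    """
)

print(build_page_data(conn, "PART-A", "MODEL-X"))
print(build_page_data(conn, "PART-A", "MODEL-Y"))  # no catalog row, so no page
```

If the dictionary this returns would still answer a buyer's question on its own, the page passes the remove-the-prose test. If the lookup comes back empty, there is nothing to publish.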
The 8,000-page problem
We ended up agreeing to build the prospect 8,000 pages, but not in the way he expected. The first tranche was 800. We picked the 800 part/model combinations with the highest measurable search demand, generated pages from the catalog with thin AI-written intros, and shipped. Indexing was slow but clean. After ninety days, 670 were indexed, about 240 were earning impressions, and 90 were earning clicks. That works out to roughly 84% indexed, 30% earning impressions, and 11% earning clicks, which is a typical hit rate at this stage.
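The tranche selection and the ninety-day readout can be sketched in a few lines. This is an illustration, not our production tooling: the `Candidate` fields and the `monthly_searches` estimate are hypothetical names, and the only real numbers are the funnel counts quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    part: str
    model: str
    monthly_searches: int  # hypothetical demand estimate from a keyword tool


def pick_tranche(candidates, size):
    """Return the `size` candidates with the highest estimated demand."""
    ranked = sorted(candidates, key=lambda c: c.monthly_searches, reverse=True)
    return ranked[:size]


def funnel_rates(published, indexed, impressions, clicks):
    """Express each funnel stage as a share of published pages."""
    return {
        "indexed": indexed / published,
        "impressions": impressions / published,
        "clicks": clicks / published,
    }


# Counts from the first tranche after ninety days.
rates = funnel_rates(published=800, indexed=670, impressions=240, clicks=90)
for stage, share in rates.items():
    print(f"{stage}: {share:.0%}")
```

Measuring every stage against pages published, not against the previous stage, keeps the numbers comparable from one tranche to the next.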
The next tranche of 1,500 went up only after we'd seen which page templates Google liked. Two of the four formats we tested got crawled deeply; the other two got partial indexing. We dropped the underperforming templates and used the patterns from the winners. By month six we had about 3,400 pages indexed and 6,000 published, and the traffic curve was finally inflecting.
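The template comparison comes down to an index rate per template and a cutoff. The sketch below assumes a log of (template, indexed?) pairs exported from whatever index-coverage report you use. The template names, the toy log, and the 0.6 threshold are made up for illustration and are not the figures from this project.

```python
def index_rates(crawl_log):
    """Compute the share of indexed pages per template.

    crawl_log: iterable of (template_name, is_indexed) pairs.
    """
    totals, indexed = {}, {}
    for template, is_indexed in crawl_log:
        totals[template] = totals.get(template, 0) + 1
        indexed[template] = indexed.get(template, 0) + (1 if is_indexed else 0)
    return {t: indexed[t] / totals[t] for t in totals}


def templates_to_keep(rates, threshold):
    """Return the templates whose index rate meets the threshold, sorted by name."""
    return sorted(t for t, rate in rates.items() if rate >= threshold)


# Toy data: template_a has 2 of 3 pages indexed, template_b has 1 of 3.
crawl_log = [
    ("template_a", True), ("template_a", True), ("template_a", False),
    ("template_b", True), ("template_b", False), ("template_b", False),
]
rates = index_rates(crawl_log)
print(templates_to_keep(rates, threshold=0.6))  # ['template_a']
```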
The fight you can't win is publishing 8,000 pages on day one. Google won't crawl them all, and even if it does, it will use the early sample to decide whether the rest is worth its time. Lose that judgment and the whole site gets a soft demotion that's hard to recover from. Win it and you get a slow but compounding flow that keeps growing as the catalog grows.
Where AI fits in the workflow
We use AI to write the prose around the data. We don't use it to invent the data. The intro paragraph on a comparison page that says "The Carrier unit consumes 4% less energy at peak load than the Bryant equivalent, based on AHRI-certified ratings" is fine if those numbers came from a database we control. If a model hallucinated them, you'll find out when a competitor's lawyer or a thorough buyer does.
Our internal rule: any factual claim on a programmatic page must trace to a row in a database. If the template can't enforce that, the template doesn't ship. AI generates the connective tissue and the variation in phrasing so the pages don't read like a mail merge. The skeleton is always data.
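One way a template could enforce that rule is to refuse to render unless every placeholder is filled by a value that carries its database provenance. The sketch below is an assumption about how such a check might look, not a description of our actual system. `Fact`, `UntraceableClaim`, and `render` are hypothetical names, and the SKUs and row IDs are placeholders.

```python
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    value: str
    table: str   # table the value was read from
    row_id: int  # primary key of the source row


class UntraceableClaim(ValueError):
    """Raised when a template field has no database-backed fact behind it."""


def render(template, facts):
    """Fill the template, failing if any field lacks a traceable Fact."""
    # string.Formatter().parse yields (literal, field_name, spec, conversion).
    fields = {name for _, name, _, _ in string.Formatter().parse(template) if name}
    for name in fields:
        if not isinstance(facts.get(name), Fact):
            raise UntraceableClaim(f"field {name!r} does not trace to a database row")
    return template.format(**{name: facts[name].value for name in fields})


template = "The {part} is listed as {status} for the {model}."
facts = {
    "part": Fact("PART-A", "catalog", 101),
    "model": Fact("MODEL-X", "catalog", 202),
    "status": Fact("compatible", "compatibility", 303),
}
print(render(template, facts))
```

In this arrangement the AI-written connective text lives in the template string, and anything factual has to arrive as a `Fact`. A model-invented number passed in as a bare string fails the check before the page is ever published.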
What we tell people not to do
Don't build "Best [service] in [city]" pages with no city-specific information. Don't build "[product] vs [product]" pages where the comparison reduces to "both are good options." Don't build location pages that have a stock photo and a map embed and nothing else. These were viable in 2019. They are now liabilities that get a site downranked across all its pages, not just the bad ones.
The clients who win at programmatic SEO in 2026 are the ones who treat the pages as a structured-data publishing problem and not a content-volume problem. The prose is the wrapping. The data is the gift. If the gift box is empty, no amount of AI-generated ribbon makes Google interested.