We replaced our design QA with AI vision models for a week. Here's what broke.

The pitch was internal. One of our designers had spent a Friday afternoon going through a forty-screen Figma file checking buttons against a brand spec, and by the end of it she was cross-eyed and had still missed a heading that used the wrong weight. On Monday morning we ran an experiment: for one week, every design QA pass would be done by Claude with vision first, and a human would only verify what the model flagged.

Five working days. About sixty screens across three client projects. Here's what we learned.

What the model nailed

Accessibility was the surprise. We had been running color-contrast checks with a browser plugin and calling it done, which meant we caught failures on text but routinely missed contrast issues on icon-only buttons, on disabled states, and on overlays where the background was a photo. Claude flagged all of them. On one e-commerce mockup it caught a "Filter" icon button sitting at 2.9:1 against a pale gray header, just under the 3:1 minimum WCAG sets for non-text interface components. That screen had been signed off by three humans, including ours.
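If you want to sanity-check a ratio like that 2.9:1 yourself, here is a minimal Python sketch of the WCAG 2.x contrast formula. The function names are ours and the threshold default is the 3:1 non-text minimum; this is an illustration, not the tooling we ran during the experiment.

```python
def hex_to_rgb(hex_color: str) -> tuple[int, int, int]:
    """Parse '#rrggbb' (or 'rrggbb') into 0-255 integer channels."""
    h = hex_color.lstrip("#")
    return int(h[0:2], 16), int(h[2:4], 16), int(h[4:6], 16)


def _linearize(channel_8bit: int) -> float:
    """Convert one sRGB channel to linear light, per the WCAG 2.x definition."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    r, g, b = (_linearize(v) for v in hex_to_rgb(hex_color))
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio, from 1.0 (identical) up to 21.0 (black on white)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)


def passes_non_text(foreground: str, background: str, minimum: float = 3.0) -> bool:
    """True if the pair meets the minimum ratio for icons and UI components."""
    return contrast_ratio(foreground, background) >= minimum
```

The formula only works on flat colors. For an icon over a photo, you would have to sample the actual pixels behind it, which is exactly where our plugin-based checks fell down.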

It was also very good at structural consistency. Heading hierarchy that skipped levels. Body text that switched line-height between two screens that were supposed to be identical. Spacing tokens that were a couple of pixels off from the spec sheet. The kind of thing a tired human reads past because the page looks fine.

Touch targets. We had a mobile prototype where two CTAs at the bottom of a card had become 36 pixels tall instead of the 44 we standardize on. The model caught that on the third screen and then caught the same mistake propagated to nine other screens, which would have taken our designer an hour of cross-referencing.
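Once element sizes are available as data, a check like this is easy to make deterministic. The sketch below assumes you already have each tappable element's dimensions from a design export; the `Element` shape and the screen names are hypothetical, and checking width as well as height is our assumption (the bug above was height only).

```python
from dataclasses import dataclass

MIN_TOUCH_TARGET_PX = 44  # the minimum we standardize on


@dataclass(frozen=True)
class Element:
    """Hypothetical record for one tappable element in a design export."""
    screen: str
    name: str
    width_px: float
    height_px: float


def undersized_targets(elements, minimum=MIN_TOUCH_TARGET_PX):
    """Return every element whose width or height is below the minimum."""
    return [e for e in elements if e.width_px < minimum or e.height_px < minimum]


if __name__ == "__main__":
    # Made-up example data, not the client prototype.
    sample = [
        Element("card-detail", "primary CTA", 160, 36),
        Element("card-detail", "secondary CTA", 160, 36),
        Element("home", "menu button", 48, 48),
    ]
    for e in undersized_targets(sample):
        print(f"{e.screen}: {e.name} is {e.width_px}x{e.height_px}px")
```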

Where it whiffed

Brand color drift. Our spec sheet has a primary orange at #ff7a1a. One of the mockups used #ff8a3a in three places. The human eye, primed by the spec, catches that instantly. Claude said the color "appears consistent with the brand palette" on every screen. We tested this twice with explicit prompting — "verify the orange hex is exactly #ff7a1a" — and the model still said yes. Vision models do not pull exact hex values out of rasterized images reliably. They estimate.
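Exact color matching is a job for a plain comparison, not for vision. A minimal sketch, assuming the hex values come from the design file's own fill data rather than from pixels (how you extract them is up to your tooling, and the function names here are ours):

```python
def hex_to_rgb(hex_color: str) -> tuple[int, int, int]:
    """Parse '#rrggbb' (or 'rrggbb') into 0-255 integer channels."""
    h = hex_color.lstrip("#")
    return int(h[0:2], 16), int(h[2:4], 16), int(h[4:6], 16)


def matches_spec(spec_hex: str, actual_hex: str) -> bool:
    """Exact match only; letter case and a leading '#' are ignored."""
    return hex_to_rgb(spec_hex) == hex_to_rgb(actual_hex)


def channel_drift(spec_hex: str, actual_hex: str) -> tuple[int, int, int]:
    """Absolute per-channel difference (R, G, B) between spec and actual."""
    spec, actual = hex_to_rgb(spec_hex), hex_to_rgb(actual_hex)
    return tuple(abs(a - s) for s, a in zip(spec, actual))


if __name__ == "__main__":
    print(matches_spec("#ff7a1a", "#ff8a3a"))   # False
    print(channel_drift("#ff7a1a", "#ff8a3a"))  # (0, 16, 32)
```

For the two oranges above, red is identical, green is off by 16, and blue is off by 32. That is small enough for a model estimating from pixels to wave through, and trivial for a string comparison to reject.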

Font weight was a similar miss. A 500-weight heading that should have been 600 looked correct to the model on three different screens. To a designer who has stared at the brand font for a year, the difference is obvious. To a transformer interpreting pixel patterns, 500 and 600 are close enough.

Copy tone. The model could read the words. It could not tell us that "Submit" should have been "Get my quote" because that's what the client's brand voice doc said three months ago. That doc didn't make it into the context window, and even when we tried adding it, the model's tone judgments were vague.

Where we landed

We didn't replace design QA. We split it. The model now runs first on every Figma export and produces a structured report: accessibility issues, spacing/sizing inconsistencies, layout problems, touch target failures, alignment drift. A human then does a focused pass on brand fidelity — exact colors, exact weights, copy, voice, the small judgment calls about whether a particular shadow feels right on this page.
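One way to picture the split is as a routing table from check category to reviewer. This is only a sketch: the category keys are our shorthand for the list above, and sending anything unrecognized to the human is an assumption we chose because it is the safer default.

```python
from enum import Enum


class Reviewer(Enum):
    MODEL = "model"
    HUMAN = "human"


# Category keys are illustrative shorthand, not a real report schema.
ROUTING = {
    "accessibility": Reviewer.MODEL,
    "spacing_sizing": Reviewer.MODEL,
    "layout": Reviewer.MODEL,
    "touch_target": Reviewer.MODEL,
    "alignment": Reviewer.MODEL,
    "brand_color": Reviewer.HUMAN,
    "font_weight": Reviewer.HUMAN,
    "copy": Reviewer.HUMAN,
    "voice": Reviewer.HUMAN,
    "taste": Reviewer.HUMAN,
}


def reviewer_for(category: str) -> Reviewer:
    """Look up who owns a check; unknown categories fall back to the human."""
    return ROUTING.get(category, Reviewer.HUMAN)
```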

The math is reasonable. A senior designer's QA pass used to take 90 to 120 minutes for a mid-sized project. With the model doing the structural pass first, the human pass now takes 30 to 45 minutes. The model pass takes about four minutes and costs us roughly $0.40 in API calls. We get more issues caught, not fewer, and the designer's attention goes to the parts she's actually best at.
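To put bounds on the time saved, here is the arithmetic as a small Python sketch. It uses only the figures above and assumes the ranges can be paired worst-case against best-case, since we did not track which old pass lengths map to which new ones.

```python
BEFORE_MIN = (90, 120)       # old human-only QA pass, minutes
HUMAN_AFTER_MIN = (30, 45)   # focused human pass after the model, minutes
MODEL_PASS_MIN = 4           # approximate model pass, minutes


def saved_minutes_range(before, human_after, model_pass):
    """Return (smallest, largest) minutes saved per project."""
    after_low = human_after[0] + model_pass
    after_high = human_after[1] + model_pass
    smallest = before[0] - after_high  # shortest old pass vs longest new one
    largest = before[1] - after_low    # longest old pass vs shortest new one
    return smallest, largest


if __name__ == "__main__":
    low, high = saved_minutes_range(BEFORE_MIN, HUMAN_AFTER_MIN, MODEL_PASS_MIN)
    print(f"Saved per project: {low} to {high} minutes")  # 41 to 86
```

So even in the least flattering pairing, the split saves about 41 minutes of senior designer time per project for roughly $0.40 in API calls.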

What we won't do

We won't ship without the human pass. We won't trust a vision model on color. We won't ask it to make taste calls. "Does this hero feel premium" is not a question it can answer in a way that maps to what our clients mean by premium.

And we won't pretend the experiment turned the model into a designer. It turned it into a very fast, very tireless QA technician with weirdly specific blind spots. That's a useful thing to have. It's just not the same thing as a designer, and any agency telling its clients otherwise is going to ship some bad work very quickly.

Curious how AI fits into your build process? Let's compare notes.