We ran a side-by-side last month. A client's lead-scoring task — classify an inbound contact form into one of five buckets and write a one-sentence summary — was being handled by GPT-4o. We pointed the same workflow at a Llama 4 70B running on a rented H100 and let both grind through three weeks of real submissions. The Llama output was indistinguishable on classification and slightly worse on summary quality. The cost difference was real money.

This is the question we're getting most often this spring. Should we self-host?

The answer used to be "almost never." It's now "sometimes, and you need to do the math."

What changed

Two things happened roughly simultaneously. Open-weight models caught up enough that for narrow, well-defined tasks they're competitive with frontier APIs. Llama 4, Qwen 2.5, and Mistral Large are not GPT-5 or Claude Opus 4.7 at the high end of reasoning, but they are absolutely fine at the bulk of what most businesses actually ask of AI — classification, summarization, extraction, basic generation against a prompt template.

And inference hardware got cheaper and more available. A year ago you booked an H100 weeks in advance. Now you can spin one up on Lambda or RunPod in twenty minutes. A self-hosted 70B-class model at decent quantization runs comfortably on a single H100 for under $2 an hour.

The breakeven math, roughly

Take a workload of short-context calls, say 800 input tokens and 200 output. On a frontier API at current rates, that's somewhere around half a cent to a cent per call depending on which model you use. Call it $7 per 1,000 calls on a midrange API tier.

An H100 running a 70B model at 4-bit quantization will handle on the order of 50–80 calls per minute on that kind of payload, depending on how aggressively you batch. At $2/hour, you're amortizing roughly $0.0007 per call on hardware alone. Add overhead, model serving stack, monitoring, and you're maybe $0.0015–0.0025 per call all-in.

Where do those lines cross? Somewhere around 50,000 calls per month at a flat workload. Below that, you're paying for an H100 to sit idle most of the time and the API is cheaper. Above that, self-hosting starts paying for itself and the curve keeps bending in your favor. At 500,000 calls a month, the difference is significant — call it $3,000 a month in our spreadsheet for typical workloads, and that number moves both ways depending on token sizes.

These are rough numbers. The point is the breakeven exists, and it's at a scale a lot of SMBs are actually at.

What we tell clients to self-host

Steady workloads. Lead scoring, support ticket categorization, content moderation, document tagging, embedding generation. Things that run every day at predictable rates. The H100 is busy. The cost per call falls. You also own the latency and the model, which means no surprise rate-limit emails on a Tuesday afternoon.

Privacy-sensitive workloads. We have a healthcare-adjacent client whose data we couldn't put through a hosted API even if we wanted to. Self-hosted Qwen running on their own infrastructure was the only option that passed legal review. The fact that the model is "merely competent" is fine because the alternative was nothing.

High-volume embedding work. If you're embedding a million documents for RAG, doing it on a self-hosted model is dramatically cheaper than hitting an embeddings API. We've migrated three clients this way in the last six months.

What we tell them not to

Bursty workloads. If your AI usage is "occasionally a sales rep asks Claude to write a proposal," there is zero argument for spinning up infrastructure. Use the API. Pay the cent.

Workloads that need the smartest model available. Frontier reasoning is still meaningfully better at frontier APIs. If you're doing complex agentic work, multi-step planning, or anything where the model has to reason carefully across many turns, the gap between Llama 4 and Claude Opus is still real and you'll feel it in production. Pay for the smart one.

Anything where keeping the infrastructure running is going to consume more developer time than the savings justify. A team that doesn't already know how to run a GPU-backed inference stack is going to spend three weeks getting the first model deployed and then forget how it works six months later when something breaks. The total cost of ownership on self-hosted AI is not just the H100 rental. It's also the person whose Tuesday afternoon disappears because vLLM updated and the model server crashed.

The middle path most people actually want

Hybrid. Use a self-hosted open model for the high-volume routine work — classifications, summaries, embeddings, the boring stuff that runs constantly. Keep a frontier API key for the times you need actual reasoning or the freshest model. Route between them with a thin layer that picks based on the task.

Most of our clients who have moved AI workloads in 2026 have ended up in this configuration. It is not the cleanest architecture. It is the one that costs the least and breaks the least.

The clean architectures get written about. The messy ones make money.