A plumbing supply distributor called us in February. They'd been quoted $14,000 by another firm to fine-tune an LLM on their product catalog so their reps could ask questions like "what fits a 3/4 inch CPVC slip joint." It's a real problem. They have 23,000 SKUs and a sales team that turns over twice a year. The quote wasn't crazy. It was just wrong.
We built them a RAG pipeline in four days for about $60 in setup costs and well under a cent per query to run.
That gap (more than two orders of magnitude on price, one calendar week versus six on time) is roughly what we've seen on every SMB project where someone proposed fine-tuning. The fine-tuning sales pitch has a hold on the imagination that the engineering doesn't justify for the work most of our clients are trying to do.
What each one actually does
Fine-tuning changes the model's weights. You take a base model, feed it thousands of curated examples of the input/output pairs you want, and the model internalizes patterns. The output is a new model, slightly different from the one you started with, that does your specific thing better than the generic one did.
RAG — retrieval-augmented generation — doesn't touch the model. You chunk your documents, embed them into a vector database, and at query time you fetch the most relevant chunks and stuff them into the prompt as context. The base model answers using that context. Update your docs, the answers update. No retraining.
One is surgery. The other is handing the model a cheat sheet.
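To make the cheat-sheet idea concrete, here is a minimal sketch of the retrieve-then-prompt loop. It is not the pipeline we shipped: the `embed` function is a toy word-count stand-in for a real embedding model, the three catalog lines are made up for illustration, and `retrieve` and `build_prompt` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9/]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(query: str, context: list[str]) -> str:
    """Stuff the retrieved chunks into the prompt as context."""
    joined = "\n---\n".join(context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )


# Made-up catalog lines, for illustration only.
chunks = [
    "SKU 1001: 3/4 inch CPVC slip coupling, fits 3/4 inch CPVC pipe.",
    "SKU 2002: 1/2 inch copper elbow, sweat connection.",
    "Returns policy: unopened items can be returned within 30 days.",
]

query = "what fits a 3/4 inch CPVC slip joint"
top = retrieve(query, chunks, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

The string in `prompt` is what would go to the unmodified base model. Edit a line in `chunks` and the next answer changes with it, which is the whole point.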
The cost math nobody puts in the deck
For a small business, the real comparison is not "which has better accuracy on a leaderboard." It's "what does this cost me to set up, run, and not break."
Setup: a working RAG implementation using something like Pinecone or pgvector plus an OpenAI or Anthropic API key runs $40-80 in our experience, including a day of chunking strategy work. Fine-tuning a model with enough examples to actually shift behavior — call it 1,500 to 3,000 quality pairs — costs $3,000 to $8,000 in data prep alone, before a single GPU-hour of training. Most SMB clients don't have 3,000 labeled examples lying around. Creating them is the expensive part.
Maintenance is where the difference gets brutal. Your catalog changed last week? Update the documents in the index, done. The fine-tuned alternative is retraining, re-evaluating, and redeploying, a process that on most teams becomes "we'll do it next quarter" and then never happens.
Drift is the hidden cost. A fine-tuned model encodes a snapshot of your business at the moment of training. Six months later, when products and policies have moved, the model still confidently quotes the old prices. RAG just reads whatever you tell it to read, today.
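The setup and maintenance numbers can be turned into a back-of-the-envelope comparison. In the sketch below, only the $14,000 quote and the $60 setup figure come from the project. The query volume, the flat one cent per query for both options, the retraining cost, and the retraining frequency are assumed round numbers, and RAG index updates are treated as free.

```python
def total_cost(setup: float, per_query: float, queries: int,
               refresh_cost: float = 0.0, refreshes: int = 0) -> float:
    """Setup plus usage plus any retraining passes over the period."""
    return setup + per_query * queries + refresh_cost * refreshes


# Figures from the plumbing distributor project.
FINE_TUNE_QUOTE = 14_000
RAG_SETUP = 60

# Assumptions for illustration only; swap in your own numbers.
QUERIES_PER_YEAR = 20_000
PER_QUERY = 0.01          # assumed flat rate, same for both options
RETRAIN_COST = 3_000      # assumed cost of one retraining pass
RETRAINS_PER_YEAR = 2     # assumed catalog refreshes that force a retrain

quote_ratio = FINE_TUNE_QUOTE / RAG_SETUP

rag_year = total_cost(RAG_SETUP, PER_QUERY, QUERIES_PER_YEAR)
ft_year = total_cost(FINE_TUNE_QUOTE, PER_QUERY, QUERIES_PER_YEAR,
                     RETRAIN_COST, RETRAINS_PER_YEAR)

print(f"setup ratio: {quote_ratio:.0f}x")
print(f"first-year RAG: ${rag_year:,.0f}")
print(f"first-year fine-tune: ${ft_year:,.0f}")
```

Under these assumptions the per-query spend barely registers. Setup and retraining dominate, which is why the maintenance question matters more than the model pricing page.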
What RAG looks like in practice
For the plumbing distributor: their product catalog as CSV, descriptions and spec sheets as PDFs, FAQ entries as Markdown. We chunked everything at roughly 500 tokens with semantic overlap, embedded with text-embedding-3-small, stored in pgvector on the database they already had. Queries hit the top eight chunks and go to Claude Haiku as the generation model. Total cost per query: well under a penny. Response time: under two seconds.
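The chunking step is the part people tend to hand-wave, so here is a rough sketch of fixed-size chunks with overlap. It makes assumptions: it works on a pre-tokenized list rather than a real tokenizer, it uses a plain sliding window rather than the semantic boundaries we used, and the 50-token overlap is an illustrative value. The SQL string shows the shape of a top-eight lookup using pgvector's cosine-distance operator; the `chunks` table and its `content` and `embedding` columns are hypothetical names.

```python
def chunk_tokens(tokens: list[str], size: int = 500,
                 overlap: int = 50) -> list[list[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


# Hypothetical table and column names; <=> is pgvector's cosine distance.
TOP_K_SQL = """
SELECT content
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT 8
"""

# Placeholder tokens standing in for a 1,200-token spec sheet.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

The overlap exists so that a spec split across a chunk boundary still appears whole in at least one chunk.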
Accuracy on a 200-question test set we built with their head of sales: 91%. The remaining 9% were almost all questions that required combining information across categories in ways no amount of retrieval would have solved cleanly — those got routed to a human and we accepted that.
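For scale, 91% of 200 questions is 182 right and 18 wrong. A minimal scoring sketch follows, assuming each answer has already been graded pass or fail by a person. The `graded` list is a placeholder that reproduces those counts, not the actual test data, and `score` is a hypothetical helper name.

```python
def score(graded: list[bool]) -> tuple[int, int, float]:
    """Return (correct, missed, accuracy) for a list of pass/fail grades."""
    if not graded:
        raise ValueError("need at least one graded answer")
    correct = sum(graded)
    return correct, len(graded) - correct, correct / len(graded)


# Placeholder grades matching the reported totals, not the real test set.
graded = [True] * 182 + [False] * 18

correct, missed, accuracy = score(graded)
print(f"{correct} correct, {missed} routed to a human, {accuracy:.0%} accuracy")
```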
Where fine-tuning still earns its place
It still does. Three cases where we'd reach for fine-tuning over RAG:
Style and voice that are hard to specify. If you need a model to write in a particular tone (a specific legal voice, a specific brand of casual), RAG can put examples in the prompt, but fine-tuning bakes it in more cleanly. We did this once for a wealth-management firm whose compliance language had to land in a particular register every single time.
Latency-critical inference with no retrieval step. If you're embedding the LLM in a workflow where the extra 200ms of vector search is a problem (rare in SMB work, but it happens), fine-tuning removes the lookup.
Closed domains where the knowledge truly doesn't change. A taxonomy that's been stable for thirty years and will be stable for thirty more. We've seen one of these in our portfolio. Most "stable" knowledge bases turn out to change quarterly.
The honest assessment
For maybe 95% of the AI work crossing a small agency's desk — customer support assistants, internal Q&A bots, content generation against a knowledge base, lead qualification with company-specific rules — RAG is the right call and it isn't close. The reason fine-tuning gets proposed so often isn't technical. It's that "we trained a custom AI on your data" sounds more impressive in a sales meeting than "we set up document retrieval."
The plumbing distributor doesn't care which technique we used. Their reps now answer customer questions in 40 seconds instead of digging through a binder. That's the only metric the work was for.