The policy got written the afternoon we caught Claude Code halfway through dropping a column on a client's production database. It wasn't malicious. It wasn't even wrong, exactly — the migration it was generating would have run cleanly. The problem was that nobody had told it the column was being read by a nightly export script that lived in a different repo. The tool didn't know, couldn't have known, and we were ten seconds away from a Monday morning we'd still be apologizing for in July.
We turned the assistant off, reverted the branch, and that evening drafted what is now a two-page internal document called AI Tool Usage. This is what's in it.
The tools we use
Cursor is our default editor for everything except quick scripts. GitHub Copilot still runs in VS Code for the two developers who prefer it. Claude Code we run for larger refactors and any task that needs to read across more than four or five files. We pay for all three. They cost less per seat per month than a single bad merge.
None of these tools has read access to client databases. None of them has credentials. The MCP servers we've wired up are read-only for filesystem and search, and our git hooks reject any push whose changes contain the string DROP TABLE or ALTER COLUMN unless it carries a human-signed commit.
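A minimal sketch of the kind of check such a hook could run, assuming the hook is handed a unified diff of what is about to be pushed. The function name, the case-insensitive match, and the choice to look only at added lines are illustrative assumptions; the signed-commit exception is left out.

```python
import re

# The two strings named in the policy. Matching case-insensitively and
# tolerating extra whitespace is an assumption, not part of the policy text.
BLOCKED = re.compile(r"\b(DROP\s+TABLE|ALTER\s+COLUMN)\b", re.IGNORECASE)


def find_blocked_lines(diff_text: str) -> list[str]:
    """Return the added lines of a unified diff that contain a blocked statement."""
    hits = []
    for line in diff_text.splitlines():
        # Added lines start with "+"; "+++ " is the file header, not content.
        if line.startswith("+") and not line.startswith("+++ "):
            if BLOCKED.search(line):
                hits.append(line[1:].strip())
    return hits
```

A pre-push hook would feed this the diff for the outgoing commits and exit non-zero when the list is not empty. Removed lines are ignored on purpose: deleting an old DROP TABLE from a file is not the thing the rule is guarding against.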
What we let them do
PRs under 200 lines, with passing tests, where the code change is local — meaning it touches functions whose call sites the assistant has actually read. A new component. A bug fix in a controller. A typed wrapper around a fetch call. Renaming a variable across a module. CSS work, almost without restriction; the worst an AI-written stylesheet can do is look ugly, and we'll catch that in review.
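The 200-line limit is the one part of this rule a script can check. A rough sketch, assuming the input is the output of `git diff --numstat` (tab-separated added count, deleted count, path) and that "lines" means added plus deleted; both the counting rule and the function names are assumptions for illustration.

```python
MAX_CHANGED_LINES = 200  # the policy says "under 200 lines"


def changed_line_count(numstat: str) -> int:
    """Sum added and deleted lines from `git diff --numstat` output."""
    total = 0
    for row in numstat.splitlines():
        parts = row.split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        added, deleted = parts[0], parts[1]
        if added == "-" or deleted == "-":
            continue  # numstat reports "-" for binary files
        total += int(added) + int(deleted)
    return total


def within_size_limit(numstat: str, limit: int = MAX_CHANGED_LINES) -> bool:
    """True when the change is strictly under the limit."""
    return changed_line_count(numstat) < limit
```

The locality half of the rule, whether the assistant actually read the call sites, still has to be judged by a person in review.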
Refactors with tests, where the assistant is asked to keep behavior identical and the test suite is the contract. We've shipped probably 40 of these in the last six months. They're the highest-value AI work we do. A human writes the test, an assistant rewrites the implementation, the test stays green, and a developer reads the diff before merging. The diff is the part that matters.
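To make "the test suite is the contract" concrete, here is a toy sketch. Both functions and the test cases are invented for illustration: a human-written contract, a legacy implementation, and the kind of rewrite an assistant might propose.

```python
import re

# Human-written contract: inputs paired with the output that must not change.
CASES = [
    ("Hello, World!", "hello-world"),
    ("  Q3 -- report  ", "q3-report"),
    ("", ""),
]


def slugify_legacy(title: str) -> str:
    """Original implementation, character by character."""
    allowed = "abcdefghijklmnopqrstuvwxyz0123456789"
    out = ""
    for ch in title.lower():
        if ch in allowed:
            out += ch
        elif out and out[-1] != "-":
            out += "-"  # collapse any run of other characters into one dash
    return out.rstrip("-")


def slugify_refactored(title: str) -> str:
    """Proposed rewrite: same behavior, expressed as one substitution."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def check_contract(fn) -> None:
    """Fail loudly if an implementation breaks any case in the contract."""
    for given, expected in CASES:
        assert fn(given) == expected, (given, fn(given), expected)
```

Running `check_contract` against both versions shows the rewrite holds to the contract. It only proves agreement on the cases a human thought to write down, which is why the diff still gets read.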
Documentation. Boilerplate. The seventh time we have to write the same Stripe webhook handler shape. Anything where the cost of a small mistake is small.
What we don't
Schema changes are off the table. Period. Migrations are written by humans, reviewed by humans, and run by humans. The same goes for anything that touches payments: Stripe integrations, refund logic, anything that moves money. We don't let AI write the tests for those, either; it's too easy to end up with a test that agrees with the broken code.
Authentication and authorization code. Session handling. Password reset flows. Anything where a subtle bug means a user sees another user's data. We've audited the assistants on this kind of code more than once. They're competent. They are not competent enough for the blast radius.
And no AI-generated commits to main without a human author on the commit. Cursor's "agent mode" can run for ten minutes and produce something that looks plausible. We've seen it invent a function that doesn't exist, call it confidently, and have the test pass because the test was also invented. Plausibility is the failure mode.
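The off-limits areas above can be flagged mechanically if they map to paths. A sketch under that assumption: the glob patterns below are hypothetical and would need to match how a given repo is actually laid out.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns for each protected area. These are guesses at a
# typical layout, not the patterns from any real repository.
PROTECTED = {
    "schema": ["*migrations/*", "*.sql"],
    "payments": ["*payments/*", "*stripe*"],
    "auth": ["*auth/*", "*session*", "*password*"],
}


def protected_hits(paths: list[str]) -> dict[str, list[str]]:
    """Group changed file paths by the protected area they fall into."""
    hits: dict[str, list[str]] = {}
    for path in paths:
        for area, patterns in PROTECTED.items():
            if any(fnmatchcase(path, pattern) for pattern in patterns):
                hits.setdefault(area, []).append(path)
    return hits
```

A non-empty result on an assistant-started branch is a signal to stop and hand the work to a person. Path matching is a coarse net: payment logic that lives in an unexpected file will slip through, so this backs up the rule rather than replacing it.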
The audit step
Every PR that started in an assistant gets a label: ai-assisted. The reviewer's job on those is slightly different. We're not just reading for "does this work" — we're reading for "does this do something the author didn't ask for." Phantom error handling. A dependency added to package.json that nobody requested. A loop that's subtly O(n²) because the assistant pattern-matched on a similar-looking example.
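Of those three, the unrequested dependency is the easiest to surface automatically. A small sketch, assuming the reviewer has the contents of package.json from the base branch and from the PR as strings; the function name is made up for illustration.

```python
import json

# The two package.json sections compared here.
SECTIONS = ("dependencies", "devDependencies")


def added_dependencies(old_manifest: str, new_manifest: str) -> dict[str, list[str]]:
    """List package names present in the new package.json but not the old one."""
    old = json.loads(old_manifest)
    new = json.loads(new_manifest)
    added: dict[str, list[str]] = {}
    for section in SECTIONS:
        before = set(old.get(section, {}))
        after = set(new.get(section, {}))
        fresh = sorted(after - before)
        if fresh:
            added[section] = fresh
    return added
```

Anything this returns on an ai-assisted PR is a question for the author: did you ask for this package, or did the assistant decide it was needed? Phantom error handling and accidental quadratic loops have no equivalent shortcut and still take a careful read.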
The audit takes maybe 10% longer than a normal review. The compensating speedup on the write side is far more than 10%, so the math works. But the audit is non-negotiable. The day we skip it is the day we ship the bug we're trying to avoid.
The honest part
We like these tools. We're faster with them. Our junior developers are getting better faster than they would have without them, because Cursor is a tireless pair programmer that doesn't get annoyed at the third "why does this not work" of the morning.
But we have also watched a senior developer accept four suggestions in a row from Copilot without reading them, and ship a regression that took ninety minutes to track down. The tools are good enough to be dangerous. The discipline is what makes them worth having.