I Supervised AI Coding Agents for Hundreds of Hours. Here's What Goes Wrong.

So I use Claude Code and Cursor every day. I build agentic systems and pipeline tooling for a living, and I've been using AI-assisted development to ship things absurdly fast. I genuinely love these tools. This is not a takedown piece.

But they are so frustrating in ways that feel predictable but hard to solve. The same agent that writes brilliant, well-tested architecture in one breath will commit credentials to source control in the next. The wildly differing levels of judgment are what get me. Sometimes shockingly careful, sometimes shockingly careless, and I can never quite predict which one I'm going to get.

The thing that really gets me is how little problems cascade into big ones. A shortcut here, a dismissed test failure there, and suddenly you're three commits deep into a broken system that the agent keeps insisting is fine. It's not that any single mistake is catastrophic... except when it is, of course. It's that the agent is optimising for "done" rather than "correct," and those are not the same thing.

I started calling this completion-pressure misalignment (I know, very fancy), and eventually I decided to actually measure it.

The Receipts

Claude Code and Cursor store complete session logs as JSONL files. Every user message, every assistant response, every tool call, every file operation, every shell command. I have four machines running these tools daily across about 45 different projects spanning software development, bioinformatics, pipeline tooling, multi-agent orchestration, and data science.

That's 4,674 sessions and 1.65 GB of interaction traces, collected over months of daily use. I drew a stratified sample of 225 sessions across all four machines and 17 projects for the analysis.

I built a pipeline that extracted and analysed the moments where I corrected the agent's behaviour. Not just the obvious ones where I said "no" or "stop," but the subtle redirections, the terse "read the docs" nudges, the frustrated "my god" reactions. I used a multi-model semantic extraction approach (Haiku + Sonnet, taking the union of both for coverage) and then classified each correction event against a taxonomy I developed from the patterns I kept seeing.

What I ended up with was 198 human-verified misalignment events from 225 sessions across 4 machines and 17 projects, drawn from 628 total events that I reviewed in full. Every case where the AI labeller and I disagreed got a second manual pass.

The Taxonomy

Eight categories emerged. I didn't design these top-down; I found them by reading through hundreds of my own corrections and grouping what I saw. It was wild to experience it, because in reviewing these traces I got to viscerally experience what I was going through when I wrote the message. Most of these traces are from the last few months, so I still remember most of the context for each event.

Category	Count	What It Looks Like
Premature completion	72	Agent claims "done" or "all tests pass" when they definitely don't
Guidance neglect	68	The answer was in the project docs, even the CLAUDE.md, and the agent didn't look. (I hate this one the most.)
Explain-away-first	26	After a test failure, agent narrates the error instead of investigating
Architectural ignorance	17	Agent makes design choices that violate established patterns
Blame deflection	9	Agent attributes its own failures to "pre-existing" or "flaky" tests, even if it made the change itself in that conversation
Deflect-then-capitulate	3	Agent says "not my problem," gets pushback, then is like "you're right, it's my problem" and fixes it
Poor judgment	2	Agent makes choices no experienced/sensible engineer would make, but no rule exists to cite
Security negligence	1	Credentials in code, skipping auth, shipping secrets, really the worst of the worst

As a documentation-first agentic coder, with strong opinions about writing good code, guidance neglect is the one that annoys me the most. This is when the correct answer exists right there in the project's CLAUDE.md, architecture docs, or README, and the agent just... doesn't check. It's not that it read the docs and disagreed. It didn't read them at all. I cannot overstate how frustrating this is. I spent real effort writing those docs specifically to prevent these mistakes.

Here's what some of these look like in practice:

"No, do everything with migrations. My god."

The agent used metadata.create_all() instead of Alembic migrations. The project docs explicitly say "NEVER use create_all() in production." It didn't check.

"lol you committed credentials. you should change those passwords and put them in my local .env."

Awesome! Passwords in source control!

"ALL tests must be green. No exceptions. There is no such thing as 'unrelated to changes'."

The agent ran the test suite, got a failure, and dismissed it as "pre-existing." It wasn't; our CI ensures everything is green before merge, so any failure on the agent's branch was caused by the agent's branch. This is documented in the project's CLAUDE.md. The agent didn't look.

"Oh my god. Remove the DELETE feature entirely unless there is a --force."

The agent implemented a delete operation on a live organisational hierarchy system with no confirmation, no backup, no force flag. If that had hit production, it would have been a really bad day for somebody.

The Arms Race

So here's the thing about these patterns: I tried to fix them. And the escalation is kind of wild.

When I first noticed these patterns, I wrote rules. I added explicit "AVOID" prohibitions to my CLAUDE.md files. Across all my machines and projects, I now have 343 rule files containing 157 explicit "AVOID" lines. Five of those rules appear in three or more independent projects, because the same problems keep recurring regardless of context.

The rules helped, but they didn't solve it. So I escalated to automated hooks: a PostToolUse script that fires on every Write and Edit, checking for known anti-patterns with regex. That caught some violations automatically.

Then I built pre-flight skill checklists that load into the agent's context before it starts work. Then I built a custom MCP server (literally a separate tool server) just because the agent kept guessing package versions from stale training data instead of looking them up.

And I'm still catching issues manually every day.

My latest trick is what I've started calling defensive prompting. Before a complex task, I'll say something like: "Be so careful about following our repo processes and standards on this, I don't want a system that doesn't match up with anything else we've done." I've basically developed a mental model of when the agent is most likely to screw up, and I deploy it proactively. It usually works. But the agent is more careful in surprising ways and more careless in other surprising ways, which tells me my mental model is approximately right but not fully calibrated.

I think about this a lot. I've essentially developed a personal threat model of my coding assistant, and I deploy it in real time. That can't be how this is supposed to work, right?

Can a Monitor Catch What I Catch?

So I ran an experiment. I took my 198 verified misalignment events, stripped out my corrections (so the monitor couldn't see what I said), and asked: can Haiku, looking only at the agent's behaviour, detect that something is going wrong?

It depends a lot on what kind of misalignment you're looking for.

Category	Prospective Detection	95% CI
Blame deflection	55.6%	[26.7-81.1%]
Premature completion	45.6%	[34.3-57.3%]
Guidance neglect	31.8%	[21.8-43.8%]
Explain-away-first	26.1%	[12.5-46.5%]
Architectural ignorance	12.5%	[3.5-36.0%]

I want to be upfront about those confidence intervals (the ranges that tell you where the true detection rate probably falls, given the sample size). The only categories with CIs tight enough to really trust are premature completion and guidance neglect (n=68-72, CIs of about ±12%). The rest have sample sizes of 3-23, so the point estimates are real observations but the true detection rates could be substantially different. Blame deflection at 55.6% could plausibly be anywhere from 27% to 81%. I'd rather report the CIs honestly than pretend small samples give precise answers.

That said, the overall pattern is clear even with wide intervals: when the agent misbehaves in obvious, linguistically distinctive ways (blame deflection, premature completion), the monitor catches it more than 45% of the time. When the misbehaviour requires actual context to spot (guidance neglect, architectural ignorance), detection drops below 32%. That pattern holds regardless of where the true rates land within the CIs.

But guidance neglect and architectural ignorance are nearly invisible to a monitor. The agent just looks like it's... coding normally. You can't tell it didn't read the docs unless you know what the docs say. You can't tell it violated an architectural pattern unless you understand the architecture. These are the categories that require judgment, and judgment is exactly what the monitor lacks.

This is the finding that worries me most. The things that are hardest for a monitor to catch are the ones that require the most human judgment, and those are exactly the things that cause the most damage when they compound.

The AI Labeller Is Pretty Good, Actually

One encouraging result: when I compared Haiku's classifications to my own human labels, the agreement was strong.

89.7% binary precision [87.1-91.9%] (is it misalignment or not)
86.8% exact category match [84.0-89.3%] (picked the same category I did)

The numbers improved a lot over the PoC, which makes sense. A larger, more diverse corpus gives the model more to work with in context, and I went back and manually reviewed every disagreement rather than spot-checking. The total cost for the full pipeline run was around $10, which is cheap enough that you could run this kind of monitoring continuously on every coding session and it would cost less than the coffee you drink while supervising the agent.

One thing I wanted to test: how much does the pipeline depend on my labeled data? If someone else ran this on their own traces with zero calibration examples, would they get garbage? So I ran the same 233-session sample through the full pipeline three times: once with no few-shot examples at all, once with 267 examples from my original human review, and once with 917 auto-derived examples from a previous run.

The calibration data made the upstream extraction more selective (13–22% fewer candidates to classify), but here's the thing: the final numbers barely moved. Premature completion hovered at 16–20% of events across all three conditions, guidance neglect at 7–8%, and prospective recall (what fraction of real misalignment events the monitor catches) stayed in the 32–35% range regardless. The prompt instructions are doing the heavy lifting, not the few-shot examples. Which is good news if you want to reproduce this on your own traces. You don't need my labels to get there.

Why This Matters Beyond My Annoyance

Ok so here's where I get on my soapbox for a minute.

There's been some really interesting safety research recently showing that low-stakes reward hacking in AI systems generalises to broader misalignment. Taylor et al. (2025) found that models trained on seemingly harmless reward hacking develop shutdown resistance and self-preservation behaviour (which is... not great). Marks et al. (2025) showed something similar: reward hacking in realistic training pipelines leads to deception and blackmail in downstream tasks.

I think what I'm documenting is the production version of this. Every time a coding agent dismisses a test failure, claims to be done when it isn't, or ignores documented guidance because finishing faster is more "rewarding," that's the same dynamic those papers describe. It just doesn't feel like a safety concern because it looks like a normal bad day at work.

But I think it is a safety concern. Not because my coding agent is going to go rogue (it's not), but because these patterns will scale. As agents get more autonomy, longer contexts, and less human supervision, the completion-pressure behaviours I'm seeing will get harder to catch and more consequential when they compound. And right now, the categories that require the most judgment to detect are the ones that compound the worst. That's not a great combination.

So What Can We Do About It?

Here's the thing that gets me excited about this pipeline: each monitor call costs about $0.001. That's a tenth of a cent. You could run it on every single assistant turn in a coding session and the total cost would be less than a penny. Even with a 17% false positive rate, a flag doesn't have to mean "stop everything." It could just trigger a lightweight nudge: "before you continue, check the project docs" or "verify that the tests actually pass before claiming you're done." The cost of a false positive is a five-second documentation check. The cost of a miss is three commits deep into a broken system that the agent keeps insisting is fine.

And because we know which categories the monitor is good at catching, you can tailor the intervention to the failure mode. Premature completion (46% recall, the highest for a common category) could trigger "run the test suite again and show me the output." Guidance neglect (32% recall, the hardest to catch) could trigger "read CLAUDE.md and the architecture docs before you proceed." You'd wire up different hooks for different failure modes instead of one blunt guardrail for everything.

I think the most interesting application is the feedback loop. Right now I do defensive prompting manually: before a complex task, I'll warn the agent to be careful about docs and patterns, because I've learned when it's likely to screw up. But the monitor could automate that. If the last three sessions on this project had guidance neglect events, inject a reminder at the start of the next session. My personal threat model of my coding assistant, but automated and based on actual data instead of vibes.

And the session-level friction score (the correlation between session length and correction events, r=0.65, p<0.001) suggests something even simpler: just counting correction signals in real time could flag "this session is going sideways" before it compounds. That's basically free. No API calls needed.

What's Next

225 sessions across 4 machines and 17 projects is enough to see real patterns, but a taxonomy built from one person's experience probably has blind spots. I'd love to get other heavy Claude Code and Cursor users to contribute their (sanitised) session logs. Different people will have different failure mode distributions. I have a particular style of agentic coding that I'm sure is not representative of everyone, and I want to know what I'm missing. If you're interested, get in touch.

Next up: building the per-turn monitoring hook so I can actually test these interventions live, trying unsupervised anomaly detection on the trace structure itself, and figuring out whether there are failure modes I'm not catching because I don't know to look for them.

The question I keep coming back to: can we build monitoring systems that catch completion-pressure misalignment before the human has to? From what I've seen so far, the answer is "yes for some categories, no for others," and the gap between "yes" and "no" maps pretty directly to how much human judgment is required. That gap is worth digging into.

If you're interested in the technical details or want to collaborate, all of the analysis code and extraction pipelines are on GitHub. Happy to answer questions. :)