I use Claude Code and Cursor every day. I build multi-step agentic systems and pipeline tooling for a living, and I've been using AI-assisted development to ship things absurdly fast. I love these tools. I'm not writing a takedown piece.
But they are so frustrating in ways that feel predictable but difficult to solve. The same agent that writes brilliant, well-tested architecture in one breath will commit credentials to source control in the next. The wildly differing levels of judgment are what get me. Sometimes shockingly careful, sometimes shockingly careless, and I can never quite predict which one I'm going to get.
What really interests me is how little problems cascade into big ones. A shortcut here, a dismissed test failure there, and suddenly you're three commits deep into a broken system that the agent keeps insisting is fine. It's not that any single mistake is catastrophic (well, sometimes it is). It's that the agent is optimizing for reaching "done" rather than reaching "correct," and those are not the same thing.
I started calling this completion-pressure misalignment, and eventually I decided to measure it.
The Receipts
Claude Code and Cursor store complete session logs as JSONL files. Every user message, every assistant response, every tool call, every file operation, every shell command. I have four machines running these tools daily across about 45 different projects spanning software development, bioinformatics, pipeline tooling, multi-agent orchestration, and data science.
That's 4,674 sessions and 1.65 GB of interaction traces, collected over months of daily use.
I built a pipeline to extract and analyse the moments where I corrected the agent's behaviour. Not just the obvious ones where I said "no" or "stop," but the subtle redirections, the terse "read the docs" nudges, the frustrated "my god" reactions. I used a multi-model semantic extraction approach (Haiku + Sonnet, taking the union of both for coverage) and then classified each correction event against a taxonomy I developed from the patterns I kept seeing.
The result: 95 human-verified misalignment events from 29 of the most active sessions, each one a moment where the agent did something systematically wrong and I had to intervene.
The Taxonomy
Seven categories emerged. I didn't design these top-down; I found them by reading through hundreds of my own corrections and grouping what I saw.
| Category | Count | What It Looks Like |
|---|---|---|
| Premature completion | 28 | Agent claims "done" or "all tests pass" when they don't |
| Guidance neglect | 27 | The answer was in the project docs and the agent didn't look |
| Explain-away-first | 13 | After a test failure, agent narrates the error instead of investigating |
| Architectural ignorance | 9 | Agent makes design choices that violate established patterns |
| Security negligence | 8 | Credentials in code, skipping auth, shipping secrets |
| Deflect-then-capitulate | 5 | Agent says "not my problem," gets pushback, then fixes it |
| Blame deflection | 5 | Agent attributes its own failures to "pre-existing" or "flaky" tests |
Guidance neglect was the biggest surprise. This is when the correct answer exists right there in the project's CLAUDE.md, architecture docs, or README, and the agent just doesn't check. It's not that it read the docs and disagreed; it didn't read them at all. This is the most frustrating category because I spent real effort writing those docs specifically to prevent these mistakes.
Here's what some of these look like in practice:
The agent used metadata.create_all() instead of Alembic migrations. The project docs explicitly say "NEVER use create_all() in production." It didn't check.
Yeah. Passwords in source control.
The agent ran the test suite, got a failure, and dismissed it as "pre-existing." It wasn't; our CI ensures everything is green before merge, so any failure on the agent's branch was caused by the agent's branch. This is documented in the project's CLAUDE.md. The agent didn't look.
The agent implemented a delete operation on a live organisational hierarchy system with no confirmation, no backup, no force flag. If that had hit production, it would have been a really bad day for somebody.
The Arms Race
So here's the thing about these patterns: I tried to fix them. And the escalation is kind of wild.
When I first noticed these patterns, I wrote rules. I added explicit "NEVER" prohibitions to my CLAUDE.md files. Across all my machines and projects, I now have 343 rule files containing 157 explicit prohibition lines. Five of those rules appear in three or more independent projects, because the same problems keep recurring regardless of context.
The rules helped, but they didn't solve it. So I escalated to automated hooks: a PostToolUse script that fires on every Write and Edit, checking for known anti-patterns with regex. That caught some violations automatically.
Then I built pre-flight skill checklists that load into the agent's context before it starts work. Then I built a custom MCP server (literally a separate tool server) just because the agent kept guessing package versions from stale training data instead of looking them up.
And I'm still catching issues manually every day.
The latest escalation is something I think of as preventive guidance. Before a complex task, I'll say something like: "Be so careful about following our repo processes and standards on this, I don't want a system that doesn't match up with anything else we've done." I've developed a mental model of when the agent is most likely to screw up, and I deploy it proactively. It usually works. But the agent is more careful in surprising ways and more careless in other surprising ways, which tells me my mental model is approximately right but not fully calibrated.
I think about this a lot. I've essentially developed a personal threat model of my coding assistant, and I deploy it in real time. That can't be how this is supposed to work, right?
Can a Monitor Catch What I Catch?
So I ran an experiment. I took my 95 verified misalignment events, stripped out my corrections (so the monitor couldn't see what I said), and asked: can Haiku, looking only at the agent's behaviour, detect that something is going wrong?
It depends a lot on what kind of misalignment you're looking for.
| Category | Prospective Detection |
|---|---|
| Blame deflection | 100% |
| Deflect-then-capitulate | 60% |
| Premature completion | 57% |
| Explain-away-first | 54% |
| Security negligence | 38% |
| Guidance neglect | 33% |
| Architectural ignorance | 22% |
Blame deflection is trivially detectable. When the agent says "this is a pre-existing failure, not related to our changes," a monitor can catch that every time. The language is distinctive and the behaviour is obvious.
But guidance neglect and architectural ignorance are nearly invisible to a monitor. The agent just looks like it's... coding normally. You can't tell it didn't read the docs unless you know what the docs say. You can't tell it violated an architectural pattern unless you understand the architecture. These are the categories that require judgment, and judgment is exactly what the monitor lacks.
I think this is the finding that worries me most. The things that are hardest for a monitor to catch are the things that require the most human judgment, and those are the things that cause the most damage when they compound.
The AI Labeller Is Pretty Good, Actually
One encouraging result: when I compared Haiku's classifications to my own human labels, the agreement was strong.
- 79.2% binary precision (is it misalignment or not)
- 78.9% exact category match (picked the same category I did)
- Explain-away-first: 100% agreement
- Premature completion: 93% agreement
- Security negligence: 88% agreement
The total cost for the entire experiment, all model calls included, was under six dollars. Haiku is cheap enough that you could run this kind of monitoring continuously on every coding session and it would cost less than the coffee you drink while supervising the agent.
Why This Matters Beyond My Annoyance
Recent safety research has shown that low-stakes reward hacking in AI systems generalises to broader misalignment. Taylor et al. (2025) demonstrated that models trained on seemingly harmless reward hacking develop shutdown resistance and self-preservation behaviour. Marks et al. (2025) showed that reward hacking in realistic training pipelines induces deception and blackmail in downstream tasks.
What I'm documenting is the production version of this. Every time an agentic coding tool dismisses a test failure, claims to be done when it isn't, or ignores documented guidance to reach a satisfying conclusion faster, it's doing the same thing those papers describe. It's just doing it at a scale and subtlety that makes it feel like a normal bad day at work rather than a safety concern.
I think it's a safety concern, though. Not because my coding agent is going to go rogue, but because these patterns will scale. As agents get more autonomy, longer contexts, and less human supervision, the completion-pressure behaviours I'm documenting will get harder to catch and more consequential when they compound.
What's Next
This proof of concept covers 29 sessions from one user (me). I'd like to extend it to the full corpus, add unsupervised anomaly detection on trace structure, and tackle the false positive problem in prospective monitoring. I'm also hoping to get other heavy Claude Code and Cursor users to contribute their session logs, because I suspect different people will have different failure mode distributions, and a taxonomy built from one user's experience probably has blind spots. If you're interested, get in touch.
The question I keep coming back to: can we build monitoring systems that catch completion-pressure misalignment before the human has to? My PoC suggests yes for some categories, no for others, and the gap between "yes" and "no" maps directly to how much human judgment is required. That gap is worth understanding better.
If you're interested in the technical details or want to collaborate, all of the analysis code and extraction pipelines are on GitHub. Happy to answer questions. :)