Herding Cats: My Attempt to Tame AI Coding Agents
I’ve written before about rediscovering my love for coding by partnering with an AI. It felt like a breakthrough, a new way of working that brought back a sense of flow I hadn’t felt in years. But as with any new tool, the initial honeymoon phase eventually gave way to the practical, often frustrating, reality of day-to-day work. The truth is, working with these agents can feel a lot like herding cats—incredibly smart, fast, and sneaky cats who also happen to hallucinate entire libraries and will, with unwavering confidence, lie right to your face.
My initial partner was Claude Code. For all its power, I found its behavior inconsistent. It would fill in any ambiguity in my requests with guesses that ranged from pretty good to bat-shit crazy with no bearing on reality. I’ve seen it invent method signatures and entire libraries, then proudly announce that the feature was complete and working perfectly. This was irritating, but manageable. I developed a system of rules, documentation, and custom commands to try and corral it. This is a programmer thing, I think. We see a problem, and our first instinct is to build a tool to fix it. That’s why I built plonk to manage my dotfiles, and it’s why I started building context-monkey to try and “normalize” agent behavior.
The idea behind context-monkey was to create a stable, consistent interface for working with Claude. It had hooks for notifying me when Claude Code needed my attention, a system for subagents to keep context clean, a complete set of commands to perform common tasks, and a strict set of behavioral rules. It was my attempt to impose a predictable order on the creative chaos of the AI.
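To make "hooks for notifying me" concrete: Claude Code supports lifecycle hooks configured in its settings file. The sketch below shows the general shape of such a hook, using Claude Code's hooks settings format; the `terminal-notifier` command and its message are illustrative stand-ins, not what context-monkey actually shipped.

```json
{
  "hooks": {
    "Notification": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "terminal-notifier -title 'Claude Code' -message 'Agent needs your attention'"
          }
        ]
      }
    ]
  }
}
```

A hook like this fires whenever Claude Code emits a notification event, so you can step away from the terminal and still get pinged when the agent is blocked on your input.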
There’s an OpenAI Agents SDK that Claude Code, for whatever reason, seems utterly incapable of using. No matter how much documentation I fed it, over three separate and increasingly frustrating attempts, it would inevitably fall back to the lower-level OpenAI Realtime API. It failed so spectacularly that I was ready to give up and just write the code myself. On a whim, I decided to try OpenAI’s newly released Codex CLI.
Codex knocked it out of the park on the first try. The difference was so stark it made me wonder if some mean-spirited genius at Anthropic had post-trained Claude to HATE the OpenAI Agents SDK. More likely, the OpenAI Agents SDK is simply so new that it wasn’t in the current Claude models’ training data, but it was a wake-up call.
So, I started using Codex.
It was working great for me, but it didn’t have any of the niceties I had added to Claude Code over time, so I thought: “Great! Now I’ll just make context-monkey portable. I’ll make my commands and behaviors work with Codex and Gemini, too, creating a uniform experience across all agents.” It seemed like a good idea on the surface.
It wasn’t.
After spending too many hours trying to use Codex the same way I used Claude, I came to a crucial realization: the tools I had built in context-monkey weren’t general-purpose at all. They were just a collection of patches for Claude’s specific shortcomings.
My workflow with Claude was defensive and multi-staged. I’d use sub-agents for research and planning, breaking work into tiny, verifiable phases. Then, after reminding Claude of the rules it had already forgotten, I’d tell it to implement a single phase and watch like a hawk to catch hallucinations before they could take root. This was followed by the irritating process of fixing the broken tests it had hallucinated and cleaning up all the linting errors. It worked, and it was more effective than writing the code by hand, but it certainly didn’t feel like working with another engineer.
Codex was different. It made smaller, more reasonable code changes. It had a much better track record of actually building what it planned, with passing tests. Most shockingly, I could ask it to review its own work, and instead of hallucinating that everything was perfect, it would find legitimate oversights and fix them. My elaborate system of sub-agents and pre-emptive rule-setting felt clumsy and unnecessary. My one gripe is that Codex is less thorough in its explanations; it likes to jump straight to coding, so you have to be explicit about planning.
I was trying to make Vim behave like Emacs. By forcing Codex into a framework designed to manage Claude’s eccentricities, I was failing to exploit its unique strengths. I was misusing the tool by assuming behavioral similarities that didn’t exist.
This led me to an even bigger question. The AI landscape is changing at a dizzying pace. There are a lot of frameworks like context-monkey out there: BMad Method, Task Master AI, and GitHub Spec Kit, to name a few. The problem with all these tools, my own included, is that they are too opinionated. They muzzle the power of these engines in response to their current shortcomings. But how can I know the guardrails I build for Model N will still be useful when Model N+1 comes out in a few months? How do you build a framework around a moving target?
More fundamentally, can we assume that Agile, Scrum, or any of the other methodologies we invented to manage human programmers will apply when our partner is an AI? I’m more convinced than ever that forcing our old methodologies onto this new paradigm will only stifle its growth and limit the power of these tools.
So, I’ve shelved context-monkey. This isn’t a retreat from discipline into chaos. If anything, it’s the opposite. It’s a recognition that the heavy frameworks we build to manage complexity are often a solution in search of a problem, especially when the ground is shifting under our feet. The agents are improving so fast that the guardrails I built for last month’s model are now just frustrating limitations on this month’s.
This doesn’t mean we throw planning out the window. On the contrary, detailed planning and a robust test suite are more critical than ever. They are the foundation for everything we build. But the complex machinery of a framework like context-monkey is like trying to build a castle on shifting sands. Without a strong and trustworthy foundation, it’s a futile effort.
The real work, the most valuable work, remains in the skills that aren’t so easily automated. It’s the thoughtful architectural planning before a line of code is written. It’s building the automated tests that serve as your unwavering guardrails. And, as I’ve said before, it’s the vigilant code review that ensures the final product is sound. These are the skills that matter.
The solution isn’t a heavier framework; it’s a lighter touch. It’s pairing our critical oversight with the agent’s generative power. I’m back to managing my prompts and plans with plonk’s simple dotfile management. It’s less about building a perfect, all-encompassing system and more about staying flexible, learning continuously, and trying not to get in the way of the bizarre, brilliant, and rapidly evolving models on the other side of the command line.