Teaching Claude to Smell Code

Life is like a sewer: what you get out of it depends on what you put into it.

– Tom Lehrer

I built an AI code reviewer that my coworkers say is genuinely useful. It both surprised me and warmed my heart to hear the tool so lauded, since… I just wrote down exactly how I review code, and then set Claude up to follow that process!

It took a month of prompt engineering across three architectural rewrites, and lots of watching it in action like a hawk, categorizing and labeling failures.

I started soon after Anthropic launched their anthropics/claude-code-action in May 2025 because I'd wanted something like this for months but the other two major players - Copilot Reviews & Gemini Code Assist - were in closed beta programs at the time and we couldn’t get in! CodeRabbit.ai and Cursor BugBot (the latter also in beta) were both available for purchase… but from what I’d seen at the time, their reviews weren’t at the level I wanted.

In hindsight, Claude + Claude Code absolutely turned out to be the right choice once we matured our review process and became power-users!

Unburied Lede

This is the sort of review feedback developers get... and it’s almost always insightful, correct, and relevant!

How I Review Code

Before I could teach Claude to review code, I had to figure out what I was actually doing. I'd been reviewing PRs for years, but I'd never written the process down.

So, here’s what I do, generally:

I contextualize before I even look at code. I read the PR description and the linked JIRA ticket if there is one, and then I checkpoint: do I understand what this person is trying to accomplish? For simple changes - "move imports to top of file" - the answer is usually yes and I'm ready to look at code. But for business logic changes - "stop routing to Partner X when criteria B is met and use Partner Y instead" - I need to understand why that needs to happen, because I'll be judging not just the code change but the business decision embedded in it.

How deep I go depends on the change. Sometimes understanding the why means understanding what the whole application does and what it's for, so I can judge whether this is the right abstraction layer or even the right codebase to be making the change in. I read exactly as much context as I need to be confident that I’m qualified to judge, and then I stop reading and start reviewing.

Once I'm ready, I look at the changes on three axes:

  1. Correctness - Does the code actually do what they're trying to do?
  2. Design - Is this the right way to write code that does this?
  3. Placement - Is this the right place in the codebase to achieve this outcome?

I weight my feedback against what I can pick up about urgency. A hard partner deadline in seven days? I focus on correctness and hold my tongue on architecture. A greenfield build with no pressure? I have room to voice structural concerns even if they mean real rework.

Then there's the smell test. I'm in the diff, but I look around the diff - the surrounding code in the file, the call chain that leads through it. If the developer is adding lines to a function and that function itself is bad, their good changes might still fail at the outcome level because the infrastructure around them can't support it.

This is good hygiene to do anyway, but the reason to do it now is: the developer making the change has the context and has already committed the overhead to fix adjacent problems right now. If something stinks near where they're working, this is the moment to say something, because they are the person most equipped to deal with it. That extends beyond the single file, too: if you're editing one link in a ten-file call chain and other links smell wrong, that matters to the success of the change.

Finally, I filter everything through what the author actually needs to hear. Not every observation I make needs to become a comment on their PR. This blends, among other things,

  • Maturity of the codebase - brand-new? Doesn’t need to be bulletproof. Serving 1000 TPS in production? It better be!
  • Capability of the engineer - both in raw skill, and their organizational placement. If someone’s voluntarily innersourcing a helpful fix, take it and don’t ask them to rearchitect.
  • Appetite for scope - how big was the original ticket? Is it part of an epic, part of a hairy yak, or the single focus of the engineer? That’ll affect what kind of feedback you can actually get the author to act on.

That's the process. The key idea is calibrating the depth of investigation to the complexity of the change, and then calibrating the feedback to the stakes.

Iteration One: Just Write It All Down

I wrote this entire process out as explicit instructions - every step, every judgment, every output format - and handed it to Claude Sonnet 3.7 as a single prompt file.

With just headings, the prompt was this:

# Claude GitHub Assistant Guidelines

# !!! IMPORTANT - READ THIS ENTIRE PROMPT FIRST !!!

## CRITICAL: Execution Sequence

## Cursor Rules Integration

## Communication Style

## General Principles for Interaction

## Standard Pull Request (PR) Review Workflow

### EXECUTION ORDER

### Review Step 1: Review Header (MANDATORY)

### Review Step 2: Default Review Format

### Review Step 3: Construct Review Body Content

### Review Step 4: Review Footer (MANDATORY)

### Review Step 5: Template Validation (MANDATORY)

### Review Step 6: Posting the Comment

# Pull Request Review: Prompt

<required_critical_issue_template>
...
</required_critical_issue_template>

<required_specific_diff_template>
...
</required_specific_diff_template>

## Pull Request Review: Process

### Phase One: Specific Diff Review

#### Specific Diff Review: Formatting Guidelines

### Phase Two: Critical Issue Review

### Critical Issue Review: Formatting Guidelines

### Phase Three: Summary

### Phase Four: Report Processing

### Phase Five: Reporting

Including all content, the prompt was quite large:

$ wc prompt.md
     440    3187   22144 prompt.md

(that’s 440 lines, 3187 “words,” and ~22,000 characters)

Two problems surfaced immediately.

Narration

Claude wanted to show its work.

Claude really wanted to show its work. I'd get internal deliberation, mini-summaries, little dialogues with itself about each judgment call, all landing in the review output. The PR author doesn't need to watch Claude reason through whether a change is correct - they need the conclusion. But telling Claude "keep your reasoning internal" didn't reliably work, because the judgment framework I'd given it naturally encouraged structured thinking, and Claude consistently treated "reason about X" as "write about reasoning about X."

Output Format

Developers want to see a consistent, structured, and clean review.

Recall the Unburied Lede - that was the output format I wanted to give people.

The right design would have been a structured tool call - an MCP tool that accepts data and emits formatted markdown, the model never directly writing GitHub comments at all. But I didn’t want that scope creep in my first draft, so the LLM was responsible for producing identically-formatted markdown from text instructions every single time. In theory, they're pretty good at that. On Sonnet 3.7, with this much instruction crammed into a single prompt, it just couldn't manage it. Sometimes it would suppress the narration but miss parts of the output format. Sometimes the format would be perfect but preceded by a novella of internal monologue. Each run was a different shape of wrong.

And then, you have to wonder… is a perfectly-formatted, all-green review actually all green? Knowing that other parts of the review would sometimes silently drop, how could you trust a review that found no problems?

Splitting Concerns, Redirecting Behavior

The first structural change was splitting the prompt: reviewer philosophy in one section (how to judge code), mechanical procedure in another (how to format and post the review). Marginal improvement, nothing transformative.

What helped more was Sonnet 4, which released in the middle of this first push. The model was measurably better at following complex instructions and the Narrator Problem became more manageable. But adherence still wasn't bulletproof, and the actual breakthrough came from a shift in approach: instead of trying to suppress Claude's unwanted behaviors, I started redirecting them.

Claude was determined to write summaries. Fine! I added a dedicated summary section to the review format with specific guidance on what a summary should contain, and Claude locked onto it. The scattered mini-summaries everywhere else in the output just stopped, because now there was a proper home for that impulse. The desire path had been paved.

The same principle applied to the whole analysis pipeline. Instead of one pass producing the final review, I had Claude write its analysis to files on disk, in sequence:

Diff analysis         →  file 1
Neighborhood analysis →  file 2
Combine & format      →  file 3  (the actual review)

Each file transition became a format enforcement opportunity. At generation, at recombination, and at final insertion into the review template, the guidance to strip extraneous content and conform to the format gets reapplied. Three chances for "just the assessment, not the reasoning" to take hold. Empirically, three was enough where one had not been.
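A minimal sketch of that staged pipeline, assuming a hypothetical run_agent helper in place of a real Claude Code invocation (in practice, Claude writes these files itself):

```python
from pathlib import Path

# Hypothetical stand-in for a Claude invocation. Here it simulates the
# format guardrail each stage re-applies: keep assessments, drop narration.
def run_agent(instructions: str, source: str) -> str:
    kept = [line for line in source.splitlines() if line.startswith("ASSESSMENT:")]
    return "\n".join(kept)

def review_pipeline(diff_text: str, workdir: Path) -> str:
    stage1 = workdir / "diff-analysis.md"          # file 1: diff analysis
    stage2 = workdir / "neighborhood-analysis.md"  # file 2: surrounding code
    stage3 = workdir / "review.md"                 # file 3: the actual review

    stage1.write_text(run_agent("analyze the diff", diff_text))
    stage2.write_text(run_agent("analyze surrounding code", stage1.read_text()))
    stage3.write_text(run_agent("combine and format", stage2.read_text()))
    # Three transitions = three chances for the format guidance to take hold.
    return stage3.read_text()
```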

Hitting the Ceiling

The intermediate-file approach stabilized output quality on Sonnet 4, but it also turned out to be roughly the model's limit. I wanted to go deeper - more complex analysis, better contextual reasoning, more thorough coverage - but anything I added caused something else to degrade. Not because I was removing instructions; everything was still in the prompt. But the model had a finite execution budget, and every new responsibility meant an existing one got less reliable. New task lands cleanly, existing task drops randomly, no pattern to which one.

There was a specific quality failure that made the status quo untenable:

Duplication

The review format has a natural severity hierarchy. Critical Issues come first, then “Other Issues Worth Considering,” then a diff-by-diff assessment of everything else (which at that point should just be "here's what this change does and why it's fine"), then a summary. The analysis has to happen first, severity judgment second, and deduplication third, because a diff flagged as a critical issue shouldn't also appear verbatim in the diff assessment.

Three flavors of duplication kept cropping up:

  1. A diff would get discussed in the diff assessment and in Critical Issues but never removed from the diff assessment.
  2. A single issue would show up in both “Other Issues” and “Critical Issues,” but described with different impact, because the severity boundary wasn't cleanly resolved.
  3. Cross-cutting problems - say, five call sites using a deprecated function with a known bug - would appear as five separate issues instead of one consolidated entry, the same point hammered five times.

The deduplication guidance was right there in the prompt… but executing it required reasoning that Sonnet 4 couldn't reliably do in this context. Because the intermediate files were written to be readable by humans, the reasoning context for why something was flagged had been stripped out by the time deduplication ran. The model would have had to go re-inspect the codebase to confirm whether two entries were actually the same problem, and it wasn't doing that consistently. Technically, every issue had to be cross-referenced against every other issue, and every grouping of issues cross-referenced against the rest to see if they needed to combine… A lot to keep straight!

The reviews were too long, too repetitive, and not how a human would write them. Every attempt to fix duplication or add deeper analysis caused some other dimension to slip. I wasn't removing straw from the camel's breaking back; I was just rearranging it.

Sub-Agents: Dividing the Work

One thing that Claude Code had (and still has) over all other major LLM harnesses is its subagent capability. You can easily spin off full Claudes with custom instructions and context that will do a task and report back, and the combination of the Claude Code harness + Claude itself is really good at wrangling these shards of itself.

So then, the idea was to identify the major distinct reasoning phases of the review and give each one to a sub-agent that does only that phase, using intermediate files to pass results forward. Then, stitch the output together at the end - possibly with a sub-agent that's really good at stitching. Obvious in hindsight, but the hard part wasn't the idea, it was deciding where to draw the lines.

I couldn't just fork every step out to its own agent. More agents means more cost and more time, and reviews were already averaging about six minutes each - the outer edge of what developers would wait for. Anecdatally, I saw people stop waiting even at the six-minute mark, so extending that was a nonstarter.

But there was an opportunity in the other direction: each agent would have far less context to contend with, far less to hold in working memory, and if the boundaries were right, reviews might actually get faster. The constraint wasn't "how do we parallelize" but "where are the ideological boundaries between reasoning tasks, such that splitting there gives each agent a focused job without making the whole thing take longer?"

I started by slicing at every logical task boundary. Too many agents, too slow. I iterated down to two sub-agents plus the orchestrator handling a lightweight third task:

Phase 1: Analyze      Orchestrator invokes Dijkstra (Reviewer)
Phase 2: Consolidate  Dijkstra's intermediate files → Armstrong (Consolidator)
Phase 3: Assembly     Armstrong's final files → Orchestrator (Assembly)

Edsger Dijkstra is the reviewer. He reads diffs, inspects surrounding code, judges correctness and design and placement, assesses criticality, and writes everything to intermediate files. All the deep reasoning lives here. Dijkstra may need to read any file in the repository to make a judgment call.

Joe Armstrong is the consolidator. He reads Dijkstra's output, deduplicates, categorizes, compresses, collapses cross-cutting issues into single entries, and applies max-severity across related problems. Armstrong does look at the codebase, but only at locations Dijkstra already flagged. He starts reading at the flag, fans out just enough to judge relatedness and severity, and stops. He doesn't synthesize new reasoning about the code changes. He refines what's already been found.
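Armstrong's consolidation step can be sketched roughly like this - grouping entries that share a root cause, applying the max-severity rule, and collapsing locations. The field names and severity labels are illustrative, not the system's actual schema:

```python
from collections import defaultdict

SEVERITY_RANK = {"note": 0, "other": 1, "critical": 2}

def consolidate(issues):
    """Collapse issues that share a root cause into one entry, keeping the
    highest severity seen across the group (the max-severity rule).

    `issues` is a list of dicts with 'root_cause', 'severity', 'location' -
    a simplified stand-in for the reviewer's intermediate-file entries."""
    groups = defaultdict(list)
    for issue in issues:
        groups[issue["root_cause"]].append(issue)

    consolidated = []
    for cause, members in groups.items():
        worst = max(members, key=lambda i: SEVERITY_RANK[i["severity"]])
        consolidated.append({
            "root_cause": cause,
            "severity": worst["severity"],
            # Five call sites, one entry - not the same point hammered five times.
            "locations": sorted(i["location"] for i in members),
        })
    return consolidated
```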

Sound familiar? That's the same "read exactly as much as you need to be qualified to judge" principle from the human process up top. Dijkstra does the full-scope investigation. Armstrong does the calibrated-depth refinement.

The Orchestrator kicks off both agents and then slots Armstrong's output into the review template. Pure text assembly. It never looks at the codebase, and it doesn't need to.

Context Management

Breaking up the agents meant breaking up the guidance too. The monolithic instruction document from iteration one contained advice for every role, but now each role should only see what it needs:

claude-review/
├── resources/
│   ├── instructions/
│   │   ├── universal.md          # Communication style (all agents)
│   │   └── review-procedure.md   # Review workflow (orchestrator only)
│   └── prompts/
│       └── pr-prompt-basic-review/
│           ├── orchestrator.md      # Kicks off agents, assembles review
│           ├── chat.md              # Chat mode orchestrator
│           └── agents/
│               ├── edsger-dijkstra-daemon.md    # Deep code analysis
│               ├── joe-armstrong-daemon.md      # Issue consolidation
│               └── dennis-ritchie-daemon.md     # Code authorship (chat)

"How to be a good reviewer" goes only to Dijkstra. "How to communicate and format output" goes to all three. Each agent gets its own task-specific instructions on top of the shared baseline. A build script assembles the right prompt for each agent at runtime - and this context management layer, modest as it sounds, is really what makes tools (like Claude Code, Cursor, and the bespoke workflow built here) useful beyond just sending a curl to the Anthropic API. The value is in assembling the right context for the right task at the right moment.

Results

The difference was immediate. Output format adherence became essentially perfect - the orchestrator's text assembly job is trivial and well within Sonnet's capabilities for a very long time to come. Deduplication works now because Armstrong has a dedicated pass with that as his only responsibility. Review quality became consistent across runs.

And the big thing: I could add complexity again without breaking existing quality. Before sub-agents, every new feature degraded an existing one because the single prompt was overloaded - one camel, too much straw. Now I had multiple camels each carrying far less, and new capabilities could go into new agents or new sections of existing agents without the whole thing tipping over.

The first thing I added was copy-pasteable "Prompt for Your Agent" blocks on each issue - a code fence containing a ready-made prompt that a developer can grab and drop into Claude Code or Cursor to fix the flagged problem.

GitHub renders code blocks with a copy button, so it's a single click from "I see this issue" to "I have a prompt to fix it." This feature only works because there is plenty of attention available in each agent now - adding it didn't cause anything else to regress. Before the sub-agent split, it would have.
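As an illustration, such a block could be rendered from a structured issue record - the field names here are hypothetical, not the system's actual schema:

```python
FENCE = "`" * 3  # build the backtick fence without clashing with markdown

def agent_prompt_block(issue):
    """Render a copy-pasteable 'Prompt for Your Agent' code fence for one
    flagged issue. Field names are hypothetical, for illustration."""
    prompt = (
        f"In {issue['file']} near line {issue['line']}: {issue['summary']}. "
        f"{issue['detail']} Please fix this and keep the change minimal."
    )
    return f"{FENCE}text\n{prompt}\n{FENCE}"
```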

What the Architecture Unlocked

Chat

The sub-agent architecture opened a door I hadn't been aiming at. I'd wanted to let developers interact with Claude in PR comments beyond just requesting re-reviews, but the monolithic prompt had made that structurally impossible. There was no decision point where I could route between "do a review" and "answer a question." The prompt assumed it was always doing a review, always receiving pull_request webhook data.

Breaking up the monolith created that decision point. And separately, the Claude Code Action v1 upstream improved their input interface so that webhook payloads from PR events and comment events could be normalized into the same shape before Claude Code was actually invoked. Both of those changes together - our structural split and their input adjustment - meant I could route to different orchestrators based on trigger context:

Trigger
├── PR Ready for Review, or /claude review
│      → Review Orchestrator (Dijkstra → Armstrong)
└── Issue Comment
       → Chat Orchestrator → Classify:
            • Question       → Answer Inline
            • Review Request → Redirect to /claude review
            • Small Change   → Dennis Ritchie (Coder Agent)
            • Large Change   → Redirect to JIRA / Local

Questions get answered by the chat orchestrator directly.

Change requests get a scope check: if the change is small enough for a PR comment workflow, the orchestrator delegates to Dennis Ritchie, a coding agent that receives authorship instructions rather than review instructions, reusing the shared universal.md guidance and adding its own task-specific prompt.

Too large, and it redirects to proper tooling (JIRA, your IDE, etc.) because PR comments aren't a tight enough iteration loop for big work. You can't effectively look at a diff before approving it, you can't pivot mid-task, and if the coding agent hits a wall and needs to ask a question, the developer has to come back later and read the comment and respond and wait again.
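The routing just described can be sketched as a small dispatch function - the trigger and classification names here are illustrative, not the actual implementation:

```python
def route(trigger, classification=None):
    """Hypothetical router mirroring the trigger flow described above."""
    if trigger in ("pr_ready_for_review", "/claude review"):
        return "review_orchestrator"  # Dijkstra -> Armstrong
    # Issue comments go to the chat orchestrator, which classifies intent.
    return {
        "question": "answer_inline",
        "review_request": "redirect_to_claude_review",
        "small_change": "dennis_ritchie",            # coder agent
        "large_change": "redirect_to_jira_or_local", # too big for PR comments
    }[classification]
```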

Setting up Dennis Ritchie was painless because the pattern was already in place: different prompt, same assembly pipeline.

Mixed Models

Sub-agents also let us use different models for different tasks, and this turned out to be the right call.

I kept the orchestrator and Armstrong on Sonnet - coordination and text processing don't need frontier reasoning. But I bumped Dijkstra to Opus 4.6 shortly after it released, because I'd started seeing a specific failure: code authored by Opus 4.5 & 4.6 in developer environments was doing intentional, sophisticated things that Sonnet couldn't understand and was flagging as incorrect. The reviewer wasn't as smart as the code author, and the reviews were losing credibility as a result. The sub-agent architecture made the Opus upgrade economically feasible - it only runs where it actually matters (deep code reasoning), not on coordination or consolidation or formatting.

Anecdotes From the Build

Fix In Cursor? Fix in Claude!

Before the “prompt for your agent” boxes delivered via subagents (which, being copied-and-pasted, can be used with ANY agentic coding tool), I thought I might like a “fix in cursor” button like Cursor’s BugBot offers, and tried to add that to the original mega-prompt.

Claude… didn’t take to that. It completely ignored the instructions and the output template, and fabricated out of whole cloth a link to fix in Claude Web! The tension was between our guidance to add "Fix in Cursor" and the upstream claude-code-action's include_fix_links setting. Opus was good enough to do what I intended, but Sonnet would flip-flop on a whim. Turning that setting off was the real fix, even though Opus' later appearance in the sub-agent architecture also fixed it.

I Can’t Do That So I’ll Just Pretend

When I first set up the sub-agents, I configured them incorrectly and Claude didn't actually have permission to invoke sub-agents. But Claude saw the failed tool call, found the agent definition files on disk, recognized what they were supposed to be, and just did the work itself while pretending to be the agents. The execution trace actually contained output very similar to: "I don't have permission to invoke a sub-agent, but I see the agent files, so I will just pretend, let me do that..."

I caught it only because I was looking at the trace structure and noticed there were no actual sub-agent invocations. The output looked fine. I could have gone a long time without noticing. If you work with agentic systems: check the traces, not just the output.

What I Learned

Three things came out of this project that I think are more broadly applicable than the specific system I built.

The quality of an AI reviewer is directly downstream of how well you can articulate your own process.

I couldn't say "review this code" and get useful results. I had to write down what I actually do when I review - the depth calibration, the three axes, the smell test, the urgency weighting - and collapse all the wishy-washiness of my human review process into an internally-consistent, defensible instruction set. Once the process was explicit, the engineering to make a model follow it was mostly mechanical. If you can't write down how you do something, you can't teach it to a model. If you can write it down, you're most of the way there.

Pave the desire paths.

Claude was going to summarize no matter what I told it. Fighting that impulse was a losing battle across multiple model versions and prompt structures. Giving the behavior a designated home in the output - a summary section with its own guidance - solved it overnight. The same principle scaled to the whole pipeline: instead of demanding perfect output in one pass, intermediate files gave Claude permission to be imperfect at each step, and format enforcement at each transition point cleaned things up. Three gentle checkpoints were enough where one rigid pass had never been.

Your reviewer has to be at least as smart as your code authors.

When Opus-authored code started appearing in PRs and Sonnet couldn't understand the intent behind it, the reviews went from helpful to actively misleading. This isn't a universal truth, mind you - a less-capable reviewer can still catch straightforward issues. But if you're building the primary AI reviewer that developers are expected to pay attention to, it needs to be credible on the hardest code in your codebase, not just the average code. The sub-agent architecture made running Opus economically feasible by isolating the expensive reasoning to the one agent that actually needs it.

Cost

I set up tracking of our review costs by ingesting Claude Code's output into Snowflake alongside GitHub Actions results. Now I can easily monitor cost with charts, graphs, and reports!

The most-interesting numbers, I think, are the time and cost per review. In 2026 so far:

Metric      Mean      Median    Total
Duration    4m 29s    4m 10s    192h
Cost        $0.98     $0.88     $2,529
Count                           2,572

Under a dollar per review, and you get an answer in under five minutes. Not bad at all!


Postscript: Implementation Details

This post has mostly spoken to the process of setting up a solid review process, handwaving implementation details. That’s what really matters - getting the shape of the process right. However, some mechanical details might be interesting… here are my top picks!

User-Provided N Reviews in Parallel

By default, the one review talked about here is conducted on Pull Requests. However, the engine that kicks it all off knows to look in a special directory in the repositories it reviews for other reviews to conduct. This allows individual repositories to add their own use-case-specific PR reviews into the mix alongside the default review.

The GitHub Action that kicks off the main review is also an “N Reviews in Parallel” system - it only ships one review by default, but we’re now looking at adding security review into the mix, and we are flush with options: beyond just using Anthropic’s action, we could

  1. feed the prompt to a new subagent in the main review
  2. set up a whole second “security” review to occur independently on PRs (with a different output format!)

AGENTS.md but for Reviews

Some projects are weird, or have nonstandard change processes. Our infrastructure-as-code repositories usually go to the development environment first, then to production… but

  1. some infrastructure has no “dev” equivalent, and always goes right to production,
  2. some infrastructure has no “production” equivalent, and only ever lives in dev,
  3. some infrastructure should always go to dev first, then to production, and
  4. hotfixes to production incidents may go directly to production, even though the affected infra has a “dev” component!

In a vacuum, this confuses even Opus 4.6!

The review infrastructure knows to look for specific “extra context” files just for how to review changes. This isn’t the README, it’s not AGENTS.md, it’s specifically “for an AI agent reviewing a pull request in this repository.” There’s a global file location for the repository, and individual reviews can also provide their own. It’s also fine to omit any of these files.
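A sketch of that lookup, under assumed file locations (the actual paths are specific to the author's setup, and every file is optional):

```python
from pathlib import Path

def gather_review_context(repo: Path, review_name: str) -> str:
    """Collect optional reviewer-only context files, if present.

    One repo-global file, plus a per-review file; missing files are
    simply skipped. Paths here are illustrative."""
    candidates = [
        repo / ".claude-review" / "context.md",                # repo-global
        repo / ".claude-review" / review_name / "context.md",  # per-review
    ]
    return "\n\n".join(p.read_text() for p in candidates if p.is_file())
```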

Most of the time Claude didn’t need any new context in the repo at all - Claude’s smart enough to make sense of existing documentation and patterns - but when that did fail, this little mechanism gave us exactly what we needed to keep the review feedback correct & relevant!

When Anthropic launched their own first-party Code Review, I got confirmation that this was a solid approach; Anthropic's review adds support for a REVIEW.md alongside CLAUDE.md to direct code reviews:

https://code.claude.com/docs/en/code-review#customize-reviews

Atlassian MCP

We use Jira and Confluence, so I slipped a configuration for the Atlassian MCP server into Claude’s config before it begins any reviews - this allows it to look up context from the Jira ticket or tech brief wiki page, etc. It also allows Claude to flag changes that don’t seem related to the Jira.

Did you ever mistype a Jira ticket? Or worse, decide it was too much work to make a new ticket for the new thing you discovered needed doing, so you just picked an existing ticket that was pretty close, technically satisfying the requirement that changes must be associated with a ticket?

I know I have…

But no more! Claude will catch those, too! The end result is that PRs with a Jira ticket referenced and a Claude review are definitely actually associated with that ticket. This makes auditing and archaeological investigation easier!

Tool Permissions

The PR review runs on an ephemeral GitHub Actions VM, so we don’t have to worry much about Claude “getting into trouble” with unrestricted tool calls the way you might on your dev machine or a long-running server. So, we’re pretty permissive with what Claude’s allowed to do while reviewing.

However, MCP servers often expose many operations (there is a “delete Jira ticket” in the Atlassian MCP), so I did include a rich tool-configuration pipeline for the Claude reviewer, allowing us to composably filter and enable/deny tools.
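One way such composable filtering might look, assuming glob-style allow/deny patterns - the pattern syntax and tool names here are illustrative, not the actual configuration format:

```python
from fnmatch import fnmatch

def filter_tools(available, allow_patterns, deny_patterns):
    """Composable allow/deny filtering over an MCP server's tool list.

    A tool survives if it matches at least one allow pattern and no deny
    pattern, so a broad allow like 'jira_*' can still exclude destructive
    operations such as ticket deletion."""
    def allowed(tool):
        if not any(fnmatch(tool, p) for p in allow_patterns):
            return False
        return not any(fnmatch(tool, p) for p in deny_patterns)
    return [t for t in available if allowed(t)]
```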

The Reviewer, Chat, and Dennis Ritchie (quick fix) workflows have different toolsets, as well - because you usually don’t need to modify files to do a review, you might need to poke the code a bit to answer a deep question, and if you’re asked to make a change, well, not only do you need to modify the code but you need to be able to commit and push that change, too!

GitHub App

I set up our own custom GitHub App to authenticate Claude to our repositories - not the "claude" app and not the default github-actions[bot]. This enables us to:

  1. customize its display icon
  2. customize its OAuth Scopes
  3. avoid being triggered alongside Anthropic’s default implementation with @Claude
  4. authenticate successfully even when calling Claude through a GitHub Actions Re-Usable Workflow in a different repository than the one with the PR (which has been an open issue for months now).

Read more