
I run /review before every PR push. It’s part of the workflow now.

But I never looked at what’s behind it. What instructions is the model actually following? What’s it told to flag, ignore, prioritize? I just trusted it works.

So I pulled the full prompts from both Claude Code and Codex CLI. The difference is bigger than I expected.

Claude Code’s hand

Claude Code’s /review is transparent. The prompt gets injected into the conversation when you invoke it. You can find it in Piebald-AI/claude-code-system-prompts, which tracks every release.

Here’s the entire thing:

You are an expert code reviewer. Follow these steps:

1. If no PR number is provided in the args, run `gh pr list` to show open PRs
2. If a PR number is provided, run `gh pr view <number>` to get PR details
3. Run `gh pr diff <number>` to get the diff
4. Analyze the changes and provide a thorough code review that includes:
   - Overview of what the PR does
   - Analysis of code quality and style
   - Specific suggestions for improvements
   - Any potential issues or risks

Keep your review concise but thorough. Focus on:
- Code correctness
- Following project conventions
- Performance implications
- Test coverage
- Security considerations

Format your review with clear sections and bullet points.

That’s it. ~200 tokens. No bug criteria. No priority system. No strict output format. No guidance on what not to flag.

“Read the diff and tell me what you think.”

Codex’s hand

Codex is open source too. The review prompt lives in codex-rs/core/review_prompt.md, compiled into the Rust binary via include_str!.

Considerably more opinionated:

You are acting as a reviewer for a proposed code change
made by another engineer.

Your job is to find bugs introduced by the patch.
Do not give a general PR summary unless it helps explain
the findings. Do not optimize for volume.
If there are no issues the author would definitely want
to fix, say so plainly.

Then 6 gating criteria. An issue only gets flagged if all are true:

  1. Meaningfully affects correctness, reliability, performance, security, or maintainability
  2. Discrete and actionable
  3. Introduced by this change, not pre-existing
  4. Doesn’t depend on unstated assumptions
  5. You can point to exact code and explain the concrete failure mode
  6. The author would likely fix it if told

And an explicit “do not flag” list: style nits (unless they obscure meaning), speculative risks without a concrete path, broad architectural opinions, intentional behavior changes.

Priority levels P0 through P3. A strict output template. And behavioral guardrails: “Prefer no finding over a weak finding. Do not praise the author. Do not pad the review with generic comments.”
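
Taken together, the gates behave like a single conjunction over per-finding checks. A hypothetical sketch (the field names are invented for illustration; the real prompt states these gates in natural language, not code):

```python
# Hypothetical sketch of Codex's six gating criteria as one predicate.
# Field names are invented; the actual prompt is prose, not code.
from dataclasses import dataclass, astuple

@dataclass
class Finding:
    affects_quality: bool       # 1. correctness/reliability/perf/security/maintainability
    actionable: bool            # 2. discrete and actionable
    introduced_by_change: bool  # 3. introduced by this patch, not pre-existing
    assumption_free: bool       # 4. doesn't depend on unstated assumptions
    concrete_failure: bool      # 5. exact code + concrete failure mode
    author_would_fix: bool      # 6. the author would likely fix it if told

def should_flag(finding: Finding) -> bool:
    # A finding is reported only if every gate passes; one miss drops it.
    return all(astuple(finding))
```

One failed gate is enough to suppress a finding, which is exactly how the prompt keeps noise down.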

The gap

| | Claude /review | Codex /review |
|---|---|---|
| Prompt length | ~200 tokens | ~600 tokens |
| Bug criteria | None | 6 explicit gates |
| “Do not flag” rules | None | 4 exclusions |
| Priority levels | None | P0 through P3 |
| Output format | “Sections and bullet points” | Strict template with verdict |
| Pre-existing bugs | Not addressed | Explicitly excluded |
| False positive control | None | “Prefer no finding over a weak finding” |

Claude trusts the model to figure out what a good review looks like. Codex tells the model exactly what counts and what doesn’t.

In practice

I’ve been running both on real PRs. Codex consistently flags better findings.

The priority system is the killer feature. P0 and P1 genuinely need fixing. P2 and P3? Mostly not worth the churn. That triage is built into the output. Scan, fix what matters, move on.

Claude’s /review gives you a wall of bullet points, all at the same level. You do the triage yourself, which kinda defeats the point.

One caveat: if you search for “Claude Code Review benchmark,” you’ll find articles claiming less than 1% false positive rates. That’s not the built-in /review. That’s a separate /code-review plugin with a completely different architecture.

Claude’s other reviewer (the one nobody uses)

Claude Code ships an official /code-review plugin that’s far more sophisticated than /review. It’s an 8-step pipeline:

  1. Eligibility check (skip drafts, closed PRs, automated PRs)
  2. Gather CLAUDE.md files from affected directories
  3. Summarize the PR
  4. Launch 5 parallel Sonnet agents, each reviewing from a different angle: CLAUDE.md compliance, obvious bugs, git blame history, previous PR comments, and code comment compliance
  5. Score every finding 0-100 with Haiku agents using a detailed rubric
  6. Filter to 80+. If nothing qualifies, it stops and posts nothing.
  7. Re-check PR is still open
  8. Post a gh pr comment with only the surviving findings

That’s the reviewer the benchmarks are testing. Not the built-in /review.
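
The scoring-and-filter steps (5 and 6) can be sketched as a simple threshold pass. A hypothetical sketch, with illustrative names:

```python
# Hypothetical sketch of the plugin's filter step: every finding gets a
# 0-100 confidence score, only 80+ survives, and an empty result means
# no comment gets posted at all. Names are illustrative.
THRESHOLD = 80

def filter_findings(scored_findings):
    """scored_findings: list of (finding_text, score) pairs."""
    kept = [text for text, score in scored_findings if score >= THRESHOLD]
    # Returning None models "post nothing" rather than an empty review.
    return kept or None
```

The key design choice is the empty case: a review with zero surviving findings produces no output, instead of a padded comment.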

| | /review (built-in) | /code-review (plugin) | Codex /review |
|---|---|---|---|
| Architecture | Single pass | 8-step, 9-15 model calls | Single pass |
| Prompt size | ~200 tokens | ~1500 tokens + 5 agent prompts | ~600 tokens |
| Bug criteria | None | Implicit via 5 specialized angles | 6 explicit gates |
| False positive control | None | 10 examples + 0-100 scoring, 80+ threshold | 6 “do not flag” rules |
| Priority system | None | Confidence score | P0-P3 |
| Output format | “Sections and bullet points” | Strict template with GitHub permalinks | Strict verdict template |
| Posts to GitHub | No | Yes (only if 80+ findings) | No |
| No findings | Generic review anyway | Posts nothing | “patch is correct” + residual risks |
| Cost per review | 1 call | 9-15+ calls | 1 call |

The plugin is powerful but expensive. Codex achieves surprisingly close quality in a single call through better prompt engineering. Claude’s plugin throws compute at the problem.

The trade-off nobody talks about

Codex compiles its review prompt into the Rust binary. You can’t change it without rebuilding from source. There’s a review_model config to swap the model, but the instructions are locked.

Claude’s prompt is a replaceable markdown file. You could paste Codex’s entire prompt into .claude/commands/review.md and get Codex-quality instructions running on Claude’s model. Or write something better tuned for your stack.
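
Swapping in stricter instructions is a one-file change. A minimal sketch, assuming the project-level `.claude/commands/` layout described above (the prompt text here is a stand-in; you would paste the full instructions you want /review to follow):

```python
# Hypothetical sketch: install a custom /review prompt for Claude Code
# by writing a project-level slash command file. The prompt body below
# is a placeholder, not the real Codex prompt.
from pathlib import Path

command_file = Path(".claude/commands/review.md")
command_file.parent.mkdir(parents=True, exist_ok=True)
command_file.write_text(
    "You are acting as a reviewer for a proposed code change.\n"
    "Only flag issues that are concrete, introduced by this change, "
    "and worth the author's time to fix.\n"
)
```

From then on, invoking /review in that project uses your instructions instead of the stock prompt.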

Codex chose polish. Claude chose extensibility.

I’m not sure which is right. But I know which one I can fix myself.

The real lesson

~400 tokens of difference separates “generic feedback” from “structured, prioritized, low-noise findings.” The models are the same tier of capability. The instructions make the difference.

This applies to every AI tool you use, not just code review. There’s always a prompt behind it. The quality of that prompt determines whether you get useful output or plausible-sounding noise.

Now you’ve seen what yours actually says.

Maybe it’s time to customize your review instructions.



@samwize

¯\_(ツ)_/¯
