I run /review before every PR push. It’s part of the workflow now.
But I never looked at what’s behind it. What instructions is the model actually following? What’s it told to flag, ignore, prioritize? I just trusted it works.
So I pulled the full prompts from both Claude Code and Codex CLI. The difference is bigger than I expected.
## Claude Code’s hand
Claude Code’s /review is transparent. The prompt gets injected into the conversation when you invoke it. You can find it in `Piebald-AI/claude-code-system-prompts`, which tracks every release.
Here’s the entire thing:
```
You are an expert code reviewer. Follow these steps:

1. If no PR number is provided in the args, run `gh pr list` to show open PRs
2. If a PR number is provided, run `gh pr view <number>` to get PR details
3. Run `gh pr diff <number>` to get the diff
4. Analyze the changes and provide a thorough code review that includes:
   - Overview of what the PR does
   - Analysis of code quality and style
   - Specific suggestions for improvements
   - Any potential issues or risks

Keep your review concise but thorough. Focus on:
- Code correctness
- Following project conventions
- Performance implications
- Test coverage
- Security considerations

Format your review with clear sections and bullet points.
```
That’s it. ~200 tokens. No bug criteria. No priority system. No strict output format. No guidance on what not to flag.
“Read the diff and tell me what you think.”
## Codex’s hand
Codex is open source too. The review prompt lives in `codex-rs/core/review_prompt.md`, compiled into the Rust binary via `include_str!`.
Considerably more opinionated:
```
You are acting as a reviewer for a proposed code change made by another engineer.

Your job is to find bugs introduced by the patch. Do not give a general PR summary unless it helps explain the findings. Do not optimize for volume. If there are no issues the author would definitely want to fix, say so plainly.
```
Then 6 gating criteria. An issue only gets flagged if all are true:
- Meaningfully affects correctness, reliability, performance, security, or maintainability
- Discrete and actionable
- Introduced by this change, not pre-existing
- Doesn’t depend on unstated assumptions
- You can point to exact code and explain the concrete failure mode
- The author would likely fix it if told
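The six gates work as an all-must-pass filter: a candidate finding survives only if every criterion holds. A minimal sketch of that logic (the `Finding` fields and `should_flag` helper are my illustrative labels, not Codex’s actual data model):

```python
from dataclasses import dataclass, astuple

@dataclass
class Finding:
    # One boolean per gate; field names are hypothetical labels for the six criteria.
    meaningful: bool        # affects correctness, reliability, perf, security, or maintainability
    actionable: bool        # discrete and actionable
    introduced_here: bool   # introduced by this change, not pre-existing
    assumption_free: bool   # doesn't depend on unstated assumptions
    concrete: bool          # exact code plus a concrete failure mode
    author_would_fix: bool  # the author would likely fix it if told

def should_flag(f: Finding) -> bool:
    # Flag only when every gate passes; a single failure suppresses the finding.
    return all(astuple(f))
```

One weak gate is enough to drop a finding, which is exactly how the prompt suppresses noise.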
And an explicit “do not flag” list: style nits (unless they obscure meaning), speculative risks without a concrete path, broad architectural opinions, intentional behavior changes.
Priority levels P0 through P3. A strict output template. And behavioral guardrails: “Prefer no finding over a weak finding. Do not praise the author. Do not pad the review with generic comments.”
## The gap
| | Claude /review | Codex /review |
|---|---|---|
| Prompt length | ~200 tokens | ~600 tokens |
| Bug criteria | None | 6 explicit gates |
| “Do not flag” rules | None | 4 exclusions |
| Priority levels | None | P0 through P3 |
| Output format | “Sections and bullet points” | Strict template with verdict |
| Pre-existing bugs | Not addressed | Explicitly excluded |
| False positive control | None | “Prefer no finding over a weak finding” |
Claude trusts the model to figure out what a good review looks like. Codex tells the model exactly what counts and what doesn’t.
## In practice
I’ve been running both on real PRs. Codex consistently flags better findings.
The priority system is the killer feature. P0 and P1 genuinely need fixing. P2 and P3? Mostly not worth the churn. That triage is built into the output. Scan, fix what matters, move on.
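That scan-fix-move-on triage can be mechanical because the priority is in the output. A sketch of the workflow (the findings here are invented examples, and the tuple representation is mine):

```python
# Hypothetical findings as (priority, description) pairs; P0 is most severe.
findings = [
    ("P2", "variable shadowing in helper"),
    ("P0", "null deref on empty input"),
    ("P3", "naming nit"),
    ("P1", "race on shared counter"),
]

# Fix P0/P1 now; defer or drop P2/P3. Tuple sort puts P0 before P1.
must_fix = sorted(f for f in findings if f[0] in ("P0", "P1"))
deferred = [f for f in findings if f[0] not in ("P0", "P1")]
```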
Claude’s /review gives you a wall of bullet points at the same level. You do the triage yourself. Which kinda defeats the point.
One caveat: if you search for “Claude Code Review benchmark,” you’ll find articles claiming less than 1% false positive rates. That’s not the built-in /review. That’s a separate /code-review plugin with a completely different architecture.
## Claude’s other reviewer (the one nobody uses)
Claude Code ships an official /code-review plugin that’s far more sophisticated than /review. It’s an 8-step pipeline:
1. Eligibility check (skip drafts, closed PRs, automated PRs)
2. Gather `CLAUDE.md` files from affected directories
3. Summarize the PR
4. Launch 5 parallel Sonnet agents, each reviewing from a different angle: CLAUDE.md compliance, obvious bugs, git blame history, previous PR comments, and code comment compliance
5. Score every finding 0-100 with Haiku agents using a detailed rubric
6. Filter to 80+. If nothing qualifies, it stops and posts nothing.
7. Re-check the PR is still open
8. Post a `gh pr comment` with only the surviving findings
If nothing scores 80+, it posts nothing. That’s the reviewer the benchmarks are testing. Not /review.
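The filter stage reduces to a one-liner: keep findings scoring 80 or above, and if nothing survives, say nothing. A sketch (the threshold is from the plugin’s description; the candidate findings and scores are invented):

```python
THRESHOLD = 80

def surviving(scored):
    """scored: list of (finding, score) pairs from the rubric-scoring pass."""
    kept = [finding for finding, score in scored if score >= THRESHOLD]
    return kept  # an empty list means the plugin posts no comment at all

# Hypothetical scored findings:
candidates = [
    ("off-by-one in pagination", 91),
    ("style nit", 42),
    ("maybe-slow query", 78),
]
```

The design choice worth noting: silence is a valid output. Most review tools always produce something; this one is allowed to say nothing.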
| | /review (built-in) | /code-review (plugin) | Codex /review |
|---|---|---|---|
| Architecture | Single pass | 8-step, 9-15 model calls | Single pass |
| Prompt size | ~200 tokens | ~1500 tokens + 5 agent prompts | ~600 tokens |
| Bug criteria | None | Implicit via 5 specialized angles | 6 explicit gates |
| False positive control | None | 10 examples + 0-100 scoring, 80+ threshold | 6 “do not flag” rules |
| Priority system | None | Confidence score | P0-P3 |
| Output format | “Sections and bullet points” | Strict template with GitHub permalinks | Strict verdict template |
| Posts to GitHub | No | Yes (only if 80+ findings) | No |
| No findings | Generic review anyway | Posts nothing | “patch is correct” + residual risks |
| Cost per review | 1 call | 9-15+ calls | 1 call |
The plugin is powerful but expensive. Codex achieves surprisingly close quality in a single call through better prompt engineering. Claude’s plugin throws compute at the problem.
## The trade-off nobody talks about
Codex compiles its review prompt into the Rust binary. You can’t change it without rebuilding from source. There’s a `review_model` config to swap the model, but the instructions are locked.
Claude’s prompt is a replaceable markdown file. You could paste Codex’s entire prompt into `.claude/commands/review.md` and get Codex-quality instructions running on Claude’s model. Or write something better tuned for your stack.
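Swapping the instructions is just writing a file. A sketch using Python’s `pathlib` (the prompt text below is a placeholder, not the real Codex prompt):

```python
from pathlib import Path

# Custom slash commands live under .claude/commands in the project root.
cmd_dir = Path(".claude/commands")
cmd_dir.mkdir(parents=True, exist_ok=True)

# Placeholder instructions; paste whatever review prompt you prefer here.
(cmd_dir / "review.md").write_text(
    "You are acting as a reviewer for a proposed code change.\n"
    "Only flag issues the author would definitely want to fix.\n"
)
```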
Codex chose polish. Claude chose extensibility.
I’m not sure which is right. But I know which one I can fix myself.
## The real lesson
~400 tokens of difference separates “generic feedback” from “structured, prioritized, low-noise findings.” The models are the same tier of capability. The instructions make the difference.
This applies to every AI tool you use, not just code review. There’s always a prompt behind it. The quality of that prompt determines whether you get useful output or plausible-sounding noise.
Now you’ve seen what yours actually says.
Maybe it’s time to customize your review instructions.