If you have ever benchmarked LLMs, you have probably noticed something unsettling: different evaluation methods give wildly different results. A model that ranks first on one benchmark ranks fourth on another. A response that one evaluator calls excellent, another calls mediocre.
We decided to investigate this systematically — specifically in the context of our trading system, where we evaluate model responses to quantitative finance prompts. The results challenge some basic assumptions about how we measure AI performance.
The Experiment
We designed a controlled evaluation study with the following parameters:
- 8 models evaluated: Claude Opus, GPT-4o, Gemini 2.5 Pro, and five other models in the frontier range
- 15 domain-specific prompts: All focused on quantitative finance tasks — risk analysis, trade reasoning, market commentary, strategy evaluation, backtesting critique, and portfolio management
- 3 LLM judges: Claude, GPT, and Gemini each evaluated all responses from all 8 models
- 7 metrics per response: Clarity, accuracy, risk management quality, actionability, completeness, reasoning depth, and domain expertise
- Total: 7,560 individual metric judgments (8 models × 15 prompts × 3 judges × ~7 metrics, with some structural variation)
Each judge received identical instructions, identical evaluation rubrics, and identical model responses with model identities masked (the judge did not know which model produced each response). The only variable was which LLM was doing the judging.
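The blinded protocol can be sketched in a few lines. Everything here is an illustrative stand-in, not our actual harness: `judge_response` is a placeholder for a real LLM call, and the judge/metric names are trimmed from the full set for brevity. The key property being shown is that judges only ever see an anonymous response ID, never the model name.

```python
from itertools import product

JUDGES = ["judge_X", "judge_Y", "judge_Z"]   # stand-ins for the 3 LLM judges
METRICS = ["clarity", "accuracy"]            # trimmed from the 7 for brevity

def judge_response(judge, rubric, metric, response_text):
    """Placeholder for an actual LLM judging call; returns a 0-100 score."""
    return 50.0  # illustrative constant

def run_blinded_evaluation(responses, rubric):
    """responses maps (model, prompt) -> response text. Each judge sees only
    an anonymous ID and the text, never which model produced it."""
    # Mask model identities behind stable anonymous IDs.
    anon = {key: f"response_{i}" for i, key in enumerate(sorted(responses))}
    scores = {}
    for key, judge, metric in product(sorted(responses), JUDGES, METRICS):
        # Only the masked ID, rubric, metric, and text reach the judge.
        scores[(anon[key], judge, metric)] = judge_response(
            judge, rubric, metric, responses[key]
        )
    return scores, anon
```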
The Clarity Paradox
The most striking finding involved the "clarity" metric — arguably the most commonly used dimension in LLM evaluation. When Gemini evaluated Claude Opus's responses to our finance prompts, it rated clarity at 90.5%. When GPT evaluated the exact same Claude Opus responses using the exact same rubric, it rated clarity at 45.7%.
That is a 44.8 percentage point gap on identical responses, using an identical rubric, measuring the same metric.
This is not a minor calibration difference. This is a fundamental disagreement about what "clarity" means in practice, despite both judges apparently agreeing on the definition when asked.
After analyzing the scoring patterns, we identified the underlying cause: the two judges were applying different implicit standards for what constitutes clarity in financial writing.
Gemini's implicit standard: Clarity means well-structured, organized responses. If a response uses headers, bullet points, and a logical flow from premise to conclusion, it scores high on clarity. Gemini valued structural organization as a proxy for clarity.
GPT's implicit standard: Clarity means directness and concision. If a response gets immediately to the point without preamble, uses precise technical language without over-explanation, and avoids hedging, it scores high on clarity. GPT valued brevity and directness.
Claude Opus's responses tended to be well-organized but comprehensive — high on Gemini's structure criterion, lower on GPT's brevity criterion. Neither judge was wrong. Both were evaluating "clarity" legitimately. But the 45-point gap shows they were measuring different things.
Why Some Metrics Are Reliable and Others Are Not
We computed Krippendorff's alpha — a statistical measure of inter-rater reliability in which 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement — for each of our seven metrics:
| Metric | Krippendorff's Alpha | Reliability Level |
|---|---|---|
| Risk Management Quality | 0.698 | Substantial agreement |
| Domain Expertise | 0.621 | Substantial agreement |
| Accuracy | 0.587 | Moderate agreement |
| Completeness | 0.544 | Moderate agreement |
| Reasoning Depth | 0.478 | Moderate agreement |
| Actionability | 0.401 | Fair agreement |
| Clarity | 0.309 | Minimal agreement |
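For reference, here is a minimal pure-Python implementation of Krippendorff's alpha for interval-scaled scores, using the standard pairwise observed-vs-expected disagreement form. It is a sketch for intuition; production analysis should use a vetted statistics library.

```python
def krippendorff_alpha_interval(units):
    """Krippendorff's alpha for interval data.
    units: list of lists; each inner list holds the scores the judges gave one
    item (e.g. one response on one metric); missing ratings simply omitted.
    Distance for interval data is the squared difference."""
    # Only units with at least two ratings contribute pairable values.
    pairable = [u for u in units if len(u) >= 2]
    n = sum(len(u) for u in pairable)
    if n <= 1:
        return None  # alpha undefined: nothing to compare
    # Observed disagreement: within-unit squared differences over ordered
    # pairs, each unit normalized by (m_u - 1), averaged over all values.
    d_o = sum(
        sum((a - b) ** 2 for a in u for b in u) / (len(u) - 1)
        for u in pairable
    ) / n
    # Expected disagreement: squared differences over all ordered pairs of
    # values pooled across units.
    pooled = [v for u in pairable for v in u]
    d_e = sum((a - b) ** 2 for a in pooled for b in pooled) / (n * (n - 1))
    if d_e == 0:
        return 1.0  # all ratings identical: perfect agreement
    return 1.0 - d_o / d_e
```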
The pattern is revealing. Metrics grounded in verifiable properties achieve reasonable inter-judge agreement. Metrics that are inherently subjective produce unreliable evaluations.
Risk Management quality (alpha = 0.698) had the highest agreement because it is partly falsifiable: either the response correctly identifies the primary risk factors or it does not. A response that recommends a 10x leveraged position without discussing stop-losses will score low on risk management quality by any reasonable judge. A response that discusses position sizing, stop-loss placement, correlation risk, and drawdown scenarios will score high. The judges converge because the evaluation criteria can be anchored to concrete, observable features.
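The anchoring idea can be made concrete with a checklist-style check. The criteria and keyword lists below are invented for illustration, not our production rubric; the point is that each criterion is observable in the text, so any judge applying it lands on the same answer.

```python
# Illustrative only: concrete, observable criteria a judge can be anchored to.
RISK_CRITERIA = {
    "position_sizing":  ["position size", "position sizing", "allocation"],
    "stop_loss":        ["stop-loss", "stop loss"],
    "correlation_risk": ["correlation", "correlated"],
    "drawdown":         ["drawdown"],
}

def anchored_risk_score(response_text):
    """Fraction of risk criteria the response observably addresses (0.0-1.0)."""
    text = response_text.lower()
    hits = sum(
        any(keyword in text for keyword in keywords)
        for keywords in RISK_CRITERIA.values()
    )
    return hits / len(RISK_CRITERIA)
```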
Clarity (alpha = 0.309) had the lowest agreement because "clear" genuinely means different things to different readers. There is no objective fact of the matter about whether a 400-word structured explanation is clearer than a 100-word direct answer. The judges diverged not because they applied the rubric incorrectly, but because the rubric itself permits multiple valid interpretations.
Does Self-Favorability Exist?
A common concern with LLM-as-judge evaluation is that models might rate their own outputs more favorably — a form of narcissistic bias. We tested this directly.
The self-favorability effect exists but is small. Models rated their own outputs approximately 3–5 percentage points higher than other judges rated them, on average. However, this effect was not statistically significant (p = 0.858 using a paired comparison test on per-response scores).
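The p = 0.858 above came from a paired comparison on our per-response scores. One stdlib-only way to run such a paired test is a sign-flip permutation test on the per-response differences; the sketch below shows the mechanics with invented inputs and is not necessarily the exact test we used.

```python
import random

def paired_permutation_test(self_scores, other_scores, n_perm=10_000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences.
    Returns a p-value for the null that self-judged and other-judged scores
    for the same responses come from the same distribution."""
    rng = random.Random(seed)
    diffs = [s - o for s, o in zip(self_scores, other_scores)]
    observed = abs(sum(diffs)) / len(diffs)  # observed mean paired difference
    hits = 0
    for _ in range(n_perm):
        # Under the null, each pairing's sign is arbitrary: flip at random.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) / len(diffs) >= observed:
            hits += 1
    return hits / n_perm
```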
The critical observation is that the between-judge variance on subjective metrics (44.8 percentage points for clarity) dwarfed the self-favorability effect (3–5 percentage points). In other words, the bigger problem is not that judges are biased toward themselves — it is that judges fundamentally disagree about what good looks like, regardless of whose output is being evaluated.
Self-favorability is real but small. Judge disagreement is real and large.
Frontier Models Are Closer Than Benchmarks Suggest
One of the most practically important findings: when we restricted analysis to only the metrics with Krippendorff's alpha above 0.5 — the metrics where all three judges actually agreed — the performance gap between Claude Opus, GPT-4o, and Gemini 2.5 Pro narrowed substantially.
On reliable metrics (risk management, domain expertise, accuracy, completeness), all three frontier models performed within approximately 8–12 percentage points of each other, with no model consistently dominating.
On the unreliable metrics (clarity, actionability), the apparent gaps were much larger — but these gaps reflected judge preferences more than model capability. Claude Opus appeared to significantly outperform on clarity when Gemini was judging, and appeared significantly worse when GPT was judging. The "performance difference" was an artifact of the evaluation, not the models.
Published benchmarks that aggregate across reliable and unreliable metrics together produce rankings that are partially real (the reliable metrics reflect genuine capability differences) and partially illusory (the unreliable metrics add noise that can flip rankings depending on which judge was used).
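A toy numerical example makes the flipping mechanism explicit. All numbers here are invented for illustration: the two judges roughly agree on a reliable metric but disagree sharply on clarity, and a naive unweighted average inherits the judge's preference.

```python
# Invented scores for two models, one reliable metric and one unreliable one.
accuracy = {"model_A": 80.0, "model_B": 76.0}   # judges roughly agree here
clarity_by_judge = {
    "judge_X": {"model_A": 90.0, "model_B": 60.0},  # judge_X rewards structure
    "judge_Y": {"model_A": 46.0, "model_B": 85.0},  # judge_Y rewards brevity
}

def aggregate(judge):
    """Naive benchmark score: unweighted mean of both metrics under one judge."""
    return {
        model: (accuracy[model] + clarity_by_judge[judge][model]) / 2
        for model in accuracy
    }

# Under judge_X: A = 85.0, B = 68.0 -> model_A ranks first.
# Under judge_Y: A = 63.0, B = 80.5 -> model_B ranks first.
```

Swapping the judge reverses the ranking even though the underlying responses, and the reliable accuracy scores, never changed.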
The Anchor Validation
To verify that our three-judge findings were not artifacts of the specific judges we chose, we brought in a fourth independent judge: Mistral. Mistral had not participated in the original evaluation.
Mistral's per-metric agreement with the original judges closely tracked the alpha values from our original study: high agreement on risk management and domain expertise, lower agreement on clarity and actionability. This confirmed that the reliability differences are properties of the metrics themselves, not of our specific choice of three judges.
The finding generalizes: clarity is genuinely harder for any judge to evaluate consistently, regardless of which judge you use.
Practical Implications for Our Trading System
This research changed how we evaluate our trading agents. Specifically:
We use multi-judge evaluation for every metric that matters. When evaluating whether Hugo's trade reasoning is sound, we use at least two judges and report both scores. If they disagree significantly, we investigate rather than averaging.
We weight metrics by their reliability. A finding that "three judges agree the response lacks risk awareness" is treated as robust. A finding that "one judge rated the response low on clarity" is treated as a weak signal.
We prefer objective metrics where possible. "Did the response correctly identify the primary risk factor?" is a better evaluation criterion than "Was the response clear?" because it has a more definite answer.
We use risk management quality as our primary trading agent evaluation metric precisely because it has the highest inter-judge agreement. For our purposes, a trading agent that correctly identifies risks is more valuable than one that explains itself in a stylistically preferred manner.
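A sketch of how the weighting and disagreement handling could be wired together. The alpha values come from the table above; the 15-point disagreement threshold and the flag-then-exclude policy are assumptions for illustration, not our exact production logic.

```python
ALPHAS = {  # inter-judge reliability per metric, from our study
    "risk_management": 0.698, "domain_expertise": 0.621, "accuracy": 0.587,
    "completeness": 0.544, "reasoning_depth": 0.478,
    "actionability": 0.401, "clarity": 0.309,
}
DISAGREEMENT_THRESHOLD = 15.0  # percentage points; an assumed cutoff

def reliability_weighted_score(per_judge_scores):
    """per_judge_scores: {metric: [0-100 scores, one per judge]}.
    Returns (alpha-weighted mean score, metrics flagged for manual review)."""
    flagged = []
    num = den = 0.0
    for metric, scores in per_judge_scores.items():
        if max(scores) - min(scores) > DISAGREEMENT_THRESHOLD:
            flagged.append(metric)   # investigate rather than average
            continue
        weight = ALPHAS[metric]      # reliable metrics count for more
        num += weight * sum(scores) / len(scores)
        den += weight
    return (num / den if den else None), flagged
```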
Takeaways
- LLM-as-judge evaluation produces reliable results for some metrics (risk management: alpha = 0.698) and unreliable results for others (clarity: alpha = 0.309)
- The same model responses can receive scores ranging from 45.7% to 90.5% depending solely on which LLM is the judge
- Self-favorability exists but is small (3–5 pp) and not statistically significant (p = 0.858)
- The clarity gap of 44.8 percentage points reflects fundamentally different conceptions of what "clear writing" means, not measurement error
- Frontier models are closer in capability than most benchmarks suggest — the gaps shrink when you isolate reliable metrics
- Always use multiple judges and report inter-judge agreement alongside scores — a single judge's evaluation is as much about the judge as the model
- Benchmarks that aggregate reliable and unreliable metrics produce rankings that can be reversed by simply changing the judge