Evaluation pipelines built on language models have a structural vulnerability that has little to do with any individual model's quality. It concerns what happens when you use one statistical system to judge the outputs of a statistically similar system, and why the results can look credible while being systematically wrong.

LLM-as-judge became standard practice for good reasons. Human evaluation is expensive, slow, and inconsistent at scale. A language model judge can process thousands of outputs in minutes, produce structured assessments, and deliver results that correlate reasonably well with human preference on many tasks. For rapid iteration and A/B testing, it offers genuine advantages over the alternatives.

The problem is not that LLM-as-judge does not work. The problem is the specific class of errors it makes, and the difficulty of detecting them from inside the evaluation pipeline.

The Shared Distribution Problem

Modern language models are trained on overlapping datasets processed through similar pipelines. They share distributional characteristics at a level that goes deeper than surface style preferences. When you use a language model to evaluate another language model's outputs, the judge is not producing an independent assessment. It is filtering the evaluated model's outputs through a lens shaped by partially shared training data.

The practical consequence is that a judge model tends to rate outputs more highly when they match its own distributional preferences: its characteristic sentence structures, its implicit assumptions about what a good answer looks like, its weighting of different kinds of evidence. An output the judge rates as excellent may earn that rating precisely because it resembles what the judge itself would have produced, not because it is better by any independent measure.
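
One rough way to probe for this is to check whether the judge's scores track similarity to the judge's own answers. The sketch below is illustrative only: judge_score, judge_own_answer, and similarity are hypothetical hooks standing in for whatever your pipeline provides (a rating call, a generation call, and an embedding-based similarity, say), not any real API.

```python
from statistics import correlation  # Python 3.10+
from typing import Callable, Sequence

def self_similarity_bias(
    prompts: Sequence[str],
    outputs: Sequence[str],
    judge_score: Callable[[str, str], float],   # (prompt, output) -> rating
    judge_own_answer: Callable[[str], str],     # judge's own answer to prompt
    similarity: Callable[[str, str], float],    # e.g. embedding cosine
) -> float:
    """Pearson correlation between the judge's scores and each output's
    similarity to the judge's own answer for the same prompt. A strongly
    positive value is consistent with the judge rewarding outputs that
    resemble what it would have produced itself."""
    scores = [judge_score(p, o) for p, o in zip(prompts, outputs)]
    sims = [similarity(o, judge_own_answer(p)) for p, o in zip(prompts, outputs)]
    return correlation(scores, sims)
```

A strong positive correlation here is not proof of bias, since resemblance to a capable judge can also track genuine quality; it is a signal that the two are confounded and need separating.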

This is distinct from the well-documented individual biases of LLM judges, such as preferences for longer responses or certain tonal registers. Those biases can be partially mitigated through calibration and adversarial examples. The shared distribution problem is structural. It cannot be calibrated away, because it is not a deviation from correct judgement. It is a feature of the evaluation architecture itself.

What This Means for Evaluation Pipelines

An evaluation pipeline that uses LLM-as-judge to measure improvement may be measuring something different from improvement. If Model B scores higher than Model A under a judge model, one explanation is that Model B produces genuinely better outputs. Another explanation is that Model B has become more similar to the judge model, and the evaluation pipeline cannot distinguish between the two.
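
A cheap diagnostic for that ambiguity is to score both models under judges from substantially different lineages and check whether the ranking survives. A minimal sketch, assuming each judge is simply a callable that scores a prompt/output pair (the judges themselves being whatever distinct-lineage models you have access to):

```python
from typing import Callable, Mapping, Sequence

Judge = Callable[[str, str], float]  # (prompt, output) -> score

def ranking_is_stable(
    prompts: Sequence[str],
    outputs_a: Sequence[str],
    outputs_b: Sequence[str],
    judges: Mapping[str, Judge],
) -> bool:
    """True if every judge agrees on which model scores higher on average.

    A ranking that flips between judges from different training lineages
    is consistent with scores reflecting similarity to the judge rather
    than output quality."""
    b_wins = []
    for judge in judges.values():
        mean_a = sum(judge(p, o) for p, o in zip(prompts, outputs_a)) / len(prompts)
        mean_b = sum(judge(p, o) for p, o in zip(prompts, outputs_b)) / len(prompts)
        b_wins.append(mean_b > mean_a)
    return all(b_wins) or not any(b_wins)
```

Agreement across lineages does not prove the improvement is real, since lineages still overlap, but disagreement is strong evidence that the single-judge score was measuring similarity.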

For capability evaluations, this is a nuisance. For safety evaluations, it is a serious problem. If your judge model and your evaluated model share gaps in their training, the judge will not reliably detect outputs that fall into those gaps. The evaluation gives you confidence that the safety controls are working. That confidence may not be warranted.

This is not a theoretical concern. The cases where LLM-as-judge is most systematically misleading are also the cases where the outputs are most likely to cause harm: situations that are underrepresented in training data, edge cases that require reasoning not well supported by common patterns, and scenarios where the correct answer is counterintuitive or technically complex.

What Better Evaluation Looks Like

The goal is not to abandon LLM-as-judge. It is to use it with an accurate understanding of what it can and cannot tell you, and to supplement it with evaluation approaches that address its structural limitations.

Cross-lineage evaluation, using judge models from substantially different training backgrounds, reduces the shared distribution problem without eliminating it. Adversarial evaluation, designed specifically to surface cases where the LLM judge and human evaluators disagree, provides a calibration signal that continued use of the same judge cannot. Behavioural testing that does not rely on LLM judgement at all, checking whether outputs produce the expected effects in controlled environments rather than whether they read well to another model, supplies the ground truth that judge scores cannot.
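
For tasks whose outputs have checkable effects, the behavioural option is the most direct. As one concrete case, if the evaluated model produces Python functions, you can measure pass rates against known test cases instead of asking another model for an opinion. In this sketch, run_sandboxed is a placeholder for isolated execution infrastructure, which you need before running any model-generated code:

```python
from typing import Callable, Sequence

def behavioural_pass_rate(
    generated_sources: Sequence[str],              # model-generated code
    test_cases: Sequence[tuple[tuple, object]],    # (args, expected) pairs
    run_sandboxed: Callable[[str, tuple], object], # isolated execution hook
) -> float:
    """Fraction of generated programs that produce the expected result on
    every test case. No judge model is involved: the output is graded by
    what it does, not by how it reads."""
    passes = 0
    for src in generated_sources:
        if all(run_sandboxed(src, args) == expected
               for args, expected in test_cases):
            passes += 1
    return passes / len(generated_sources)
```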

More fundamentally, evaluation pipelines need to be evaluated. The judge model should be audited periodically for systematic disagreement with human assessors on specific categories of output. If the audit shows the judge has blind spots, the governance implication is clear: any safety or quality assessments that relied on the judge during that period need to be reviewed.
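
The audit itself is straightforward to instrument once judge and human verdicts are stored side by side. A minimal sketch of the per-category disagreement check described above (the record fields and category names are illustrative):

```python
from collections import defaultdict
from typing import Iterable, NamedTuple

class AuditRecord(NamedTuple):
    category: str        # e.g. "safety-refusal", "technical-edge-case"
    judge_verdict: bool  # judge accepted the output
    human_verdict: bool  # human assessor's call on the same output

def disagreement_by_category(records: Iterable[AuditRecord]) -> dict[str, float]:
    """Fraction of outputs per category where the judge and the human
    assessor disagree."""
    totals: dict[str, int] = defaultdict(int)
    splits: dict[str, int] = defaultdict(int)
    for r in records:
        totals[r.category] += 1
        if r.judge_verdict != r.human_verdict:
            splits[r.category] += 1
    return {c: splits[c] / totals[c] for c in totals}
```

Categories whose disagreement rate sits well above the overall baseline are the candidate blind spots, and they also date-bound which past assessments need review.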

Most current AI evaluation infrastructure is not built to do this. Building it requires treating evaluation as a first-class governance problem, not an engineering convenience.