This changes how you assess AI vendors claiming 'automated evaluation' capabilities.
When a vendor claims 95% evaluation accuracy, ask: 'What happens when your judge model isn't smarter than the production model?' This paper proves there's a mathematical ceiling they're not telling you about.
When you use an LLM to judge another LLM's outputs, you can never cut your need for human-labeled ground truth by more than half. That ceiling is mathematically proven.
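To get an intuition for where a factor-of-two ceiling comes from, here is a toy Monte Carlo sketch, not the paper's construction: it compares plain human labeling against a standard debiased judge estimator (difference / prediction-powered style), where the judge's average verdict on a large pool is corrected by its disagreement with humans on a small labeled subset. Every number in it (model accuracy 0.70, judge agreement 0.85, 200 human labels, a 20,000-example judge-only pool) is an illustrative assumption, and `trial`, `p_model`, `judge_acc` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

def trial(p_model=0.70, judge_acc=0.85, n_human=200, n_pool=20_000):
    """One trial: estimate the model's true accuracy (p_model) two ways."""
    # Ground-truth correctness of the evaluated model on each example.
    truth_human = rng.random(n_human) < p_model   # small human-labeled set
    truth_pool = rng.random(n_pool) < p_model     # large judge-only pool

    # Judge verdicts agree with ground truth with probability judge_acc.
    judge_human = np.where(rng.random(n_human) < judge_acc, truth_human, ~truth_human)
    judge_pool = np.where(rng.random(n_pool) < judge_acc, truth_pool, ~truth_pool)

    # (a) Human-only: average the n_human ground-truth labels.
    human_only = truth_human.mean()

    # (b) Debiased judge estimate: judge mean on the big pool, plus the
    #     human-vs-judge gap measured on the small labeled set.
    #     Unbiased regardless of how good or bad the judge is.
    correction = (truth_human.astype(float) - judge_human.astype(float)).mean()
    debiased = judge_pool.mean() + correction
    return human_only, debiased

estimates = np.array([trial() for _ in range(5_000)])
var_human, var_debiased = estimates.var(axis=0)

print(f"human-only variance    : {var_human:.2e}")
print(f"debiased-judge variance: {var_debiased:.2e}")
print(f"effective label savings: {var_human / var_debiased:.2f}x")
```

In this toy setup the debiased estimator is worth roughly 1.4x the human labels even though the judge (0.85 agreement) is clearly more accurate than the model it scores (0.70). Drop `judge_acc` to 0.70, so the judge is no better than the evaluated model, and the savings fall below 1x. That weaker-judge regime is exactly the one the ceiling describes.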
Most vendors selling 'automated evaluation' don't mention this ceiling. Their accuracy metrics come from benchmarks where the judge is stronger than the evaluated model, a setup that disappears as the models being evaluated improve.