“An 'LLM as a judge' is most effective when scoped to evaluate a single, narrow failure mode with a binary pass/fail output.”
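A minimal sketch of what such a narrowly scoped judge might look like, assuming the model is reachable through a plain text-in, text-out callable. The names `BinaryJudge`, `call_model`, and `stub_model` are hypothetical and introduced only for illustration; `stub_model` is a keyword check standing in for a real LLM call, not any particular provider's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BinaryJudge:
    """A judge scoped to one failure mode that returns only pass or fail."""

    failure_mode: str                  # the single failure mode being checked
    criterion: str                     # what counts as passing
    call_model: Callable[[str], str]   # hypothetical hook: prompt in, raw text out

    def build_prompt(self, output: str) -> str:
        # The prompt names one failure mode and forbids any other commentary.
        return (
            f"You are checking for exactly one failure mode: {self.failure_mode}.\n"
            f"Criterion: {self.criterion}\n"
            "Ignore every other quality issue.\n"
            "Answer with a single word: PASS or FAIL.\n\n"
            f"Output to evaluate:\n{output}"
        )

    def evaluate(self, output: str) -> bool:
        # Return True for PASS and False for FAIL; reject anything else
        # instead of guessing what the model meant.
        verdict = self.call_model(self.build_prompt(output)).strip().upper()
        if verdict not in {"PASS", "FAIL"}:
            raise ValueError(f"Expected PASS or FAIL, got: {verdict!r}")
        return verdict == "PASS"


def stub_model(prompt: str) -> str:
    # Stand-in for a real LLM call (assumption for this sketch): it fails
    # any evaluated output that contains an "@" character.
    evaluated = prompt.split("Output to evaluate:\n", 1)[1]
    return "FAIL" if "@" in evaluated else "PASS"


pii_judge = BinaryJudge(
    failure_mode="leaked email address",
    criterion="The output must not contain any email address.",
    call_model=stub_model,
)

print(pii_judge.evaluate("Contact support via the help page."))
print(pii_judge.evaluate("You can reach Jane at jane@example.com."))
```

Under this design, each additional failure mode would get its own `BinaryJudge` instance instead of being folded into a single multi-criteria prompt, and a boolean verdict can be compared directly against human pass/fail labels.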