“A trend in agent evaluation is to build an eval loop directly into the agent's workflow, allowing the agent to grade its own outputs.”
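One way to picture such a built-in eval loop is the minimal sketch below. It is an illustration under stated assumptions, not a reference implementation: `run_with_self_eval`, `Attempt`, `toy_generate`, `toy_grade`, the `threshold` of 0.8, and the `max_attempts` cap are all hypothetical names and values chosen for the example. In a real agent, the `generate` and `grade` callables would presumably wrap model calls; here they are deterministic stand-ins so the loop can be run as-is.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One generated output together with the score the agent gave it."""
    output: str
    score: float


def run_with_self_eval(
    task: str,
    generate: Callable[[str, List[Attempt]], str],
    grade: Callable[[str, str], float],
    threshold: float = 0.8,   # assumed pass mark, not from the source
    max_attempts: int = 3,    # assumed retry cap, not from the source
) -> Tuple[Attempt, List[Attempt]]:
    """Generate, self-grade, and retry until the score clears the threshold."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    history: List[Attempt] = []
    for _ in range(max_attempts):
        # The generator sees earlier attempts so it can revise its output.
        output = generate(task, history)
        # The same agent grades its own output (the "eval loop").
        score = grade(task, output)
        history.append(Attempt(output, score))
        if score >= threshold:
            break

    # Return the best-scoring attempt, plus the full history for inspection.
    best = max(history, key=lambda attempt: attempt.score)
    return best, history


# Hypothetical stand-ins for model calls, kept deterministic for the demo.
def toy_generate(task: str, history: List[Attempt]) -> str:
    # Each retry appends one more "!" to mimic a revised answer.
    return f"{task} done" + "!" * len(history)


def toy_grade(task: str, output: str) -> float:
    # Scores rise by 0.25 per "!" and are capped at 1.0.
    return min(1.0, 0.5 + 0.25 * output.count("!"))


if __name__ == "__main__":
    best, history = run_with_self_eval("summarize", toy_generate, toy_grade)
    print(best.output, best.score, len(history))
```

The attempt cap keeps the loop bounded when the grader never reaches the threshold, and returning the full history makes the agent's self-assigned scores available for later review.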