“Anthropic uses an internal evaluation benchmark called ‘Vending Bench’ where an AI model is tasked with running a virtual vending machine business.”