Current model leaderboards like SWE-bench are insufficient for evaluating real-world performance ..., Sonic AI
“Current model leaderboards like SWE-bench are insufficient for evaluating real-world performance due to test set leakage and reliance on subjective human preferences.”