“The SWE-bench, a popular benchmark for coding models, is limited in scope as it primarily evaluates performance on bug fixing and unit tests in Python, not the broader set of tasks involved in software engineering.”
Sonic AI