“On the multilingual version of the SWE-bench benchmark, AI agent performance drops significantly to the 20–30% success range.”