Extension of SWE-bench to multiple programming languages beyond Python, testing real-world bug fixing across TypeScript, Java, Go, Rust and more.
Why it matters: Most real codebases are polyglot. This benchmark tests whether coding models can handle the diversity of languages seen in production software engineering.
Top Model: Composer 2 — 73.7%
Average Score: 73.7% (across 1 model)
Models Tested: 1
Metric: resolved rate (range: 0%–100%)
Human Baseline: —
All models with a reported SWE-bench ML score, ranked by resolved rate from highest to lowest.
SWE-bench ML is a standardized evaluation of software-engineering ability: each task pairs a real GitHub issue with its repository, and a model's patch counts as resolved only if the project's test suite passes afterward. Because every model is run against the same task set, the resulting resolved rates are directly comparable, helping developers choose the right model for their needs.
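The resolved-rate metric described above can be sketched in a few lines. This is an illustrative computation only, assuming a per-task pass/fail record; the `TaskResult` class, `resolved_rate` function, and instance IDs below are hypothetical and not part of the official SWE-bench harness.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    instance_id: str
    tests_passed: bool  # did the repository's test suite pass after applying the model's patch?

def resolved_rate(results: list[TaskResult]) -> float:
    """Percentage of task instances whose patch made the tests pass."""
    if not results:
        return 0.0
    resolved = sum(r.tests_passed for r in results)
    return 100.0 * resolved / len(results)

# Hypothetical run over three tasks, two of which were resolved.
results = [
    TaskResult("django__django-1234", True),
    TaskResult("gin-gonic__gin-567", False),
    TaskResult("serde-rs__serde-89", True),
]
print(f"{resolved_rate(results):.1f}%")  # prints "66.7%"
```

A score like Composer 2's 73.7% means roughly three out of four task instances were resolved under this all-or-nothing criterion; partial fixes earn no credit.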
Composer 2 currently holds the top score on the SWE-bench ML benchmark. See our full rankings table above for the complete leaderboard; at present it contains 1 model.
We update benchmark data from multiple sources including HuggingFace Open LLM Leaderboard and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.
No. While SWE-bench ML is an important indicator, real-world performance depends on many factors including pricing, latency, context window, and specific task requirements. We recommend using our composite score which weighs multiple benchmarks and practical factors.