Tests AI models on complex terminal-based tasks including shell commands, debugging, system administration, and multi-step CLI workflows.
Why it matters: Measures agentic capability in terminal environments — critical for AI coding assistants that execute commands and manage development workflows.
Top Model: Composer 2 (61.7%)
Average Score: 61.7% (across 1 model)
Models Tested: 1 (metric: pass rate)
Human Baseline: — (range: 0%–100%)
All models with a reported Terminal-Bench score, ranked by highest pass rate.
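For context, pass rate here is simply the fraction of benchmark tasks a model completes successfully, expressed as a percentage. A minimal sketch of that calculation, not the official Terminal-Bench harness, with purely illustrative task counts:

    def pass_rate(results: list[bool]) -> float:
        # Fraction of tasks passed, as a percentage in the 0-100 range.
        if not results:
            return 0.0
        return 100.0 * sum(results) / len(results)

    # Hypothetical run: 37 of 60 tasks pass -> 61.7% (rounded to one decimal).
    print(round(pass_rate([True] * 37 + [False] * 23), 1))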
Terminal-Bench is a standardized evaluation that measures how well AI models handle tasks in a terminal environment, from individual shell commands to multi-step CLI workflows. It provides comparable scores across different models, helping developers choose the right model for their needs.
Composer 2 currently holds the top score on the Terminal-Bench benchmark. See our full rankings table above for the complete leaderboard, which currently includes 1 model.
We update benchmark data from multiple sources, including the HuggingFace Open LLM Leaderboard and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.
No. While Terminal-Bench is an important indicator, real-world performance depends on many factors including pricing, latency, context window, and specific task requirements. We recommend using our composite score which weighs multiple benchmarks and practical factors.