Last updated: just now

Math benchmarkSaturated

Grade School Math 8K

8,500 multi-step grade-school math word problems. Tests basic arithmetic reasoning — adding, subtracting, multiplying with multi-step chains.

Why it matters: Useful baseline for math ability, but now saturated — top models exceed 95%. More useful for evaluating smaller or open-source models.

Top Model

96.8%

Llama 3.1 405B

Average Score

93.9%

Across 14 models

Models Tested

Metric: accuracy

Human Baseline

—

Range: 0%–100%

Model Rankings

All models with a reported GSM8K score, ranked by highest accuracy.

RankModelScorePerformance

Llama 3.1 405BMeta

96.8%

97%

96.8%

DeepSeek V3DeepSeek

96.7%

97%

96.7%

Claude 3.5 SonnetAnthropic

96.4%

96%

96.4%

GPT-4oOpenAI

95.8%

96%

95.8%

Qwen 2.5 72BAlibaba

95.8%

96%

95.8%

Llama 3.1 70BMeta

95.1%

95%

95.1%

Claude 3 OpusAnthropic

95%

Grok 2xAI

94.6%

95%

94.6%

GPT-4 TurboOpenAI

94.2%

94%

94.2%

#10

Gemini 2.0 FlashGoogle

93.8%

94%

93.8%

#11

GPT-4o miniOpenAI

93.2%

93%

93.2%

#12

Mistral Large 2Mistral AI

93.1%

93%

93.1%

#13

Mixtral 8x22BMistral AI

88.4%

88%

88.4%

#14

Command R+Cohere

86.1%

86%

86.1%

About GSM8K

Full Name: Grade School Math 8K
Category: Math
Metric: accuracy (%)
Score Range: 0%–100%
Human Baseline: Not established
Status: Saturated

Frequently Asked Questions

GSM8K is a standardized evaluation that measures AI model performance on specific tasks. It provides comparable scores across different models, helping developers choose the right model for their needs.

Llama 3.1 405B currently holds the top score on the GSM8K benchmark. See our full rankings table above for the complete leaderboard with 14 models.

We update benchmark data from multiple sources including HuggingFace Open LLM Leaderboard and LMArena. Scores are refreshed regularly as new evaluations are published and new models are released.

No. While GSM8K is an important indicator, real-world performance depends on many factors including pricing, latency, context window, and specific task requirements. We recommend using our composite score which weighs multiple benchmarks and practical factors.

All Benchmarks|Math Benchmarks|Compare Models