Best AI Models for Multilingual Tasks

AI models ranked by multilingual performance using a weighted combination of MMLU and Arena Elo scores. Find the best LLM for translation and non-English tasks.

#1 Model: Gemini 2.5 Pro Preview 05-06 (score: 91)

Average Score: 70.5 (across all ranked models)

Models Ranked: 58 (with benchmark data)

Weights: MMLU (60%), Arena Elo (40%)

#	Model	Provider	Score	MMLU	Arena Elo
1	Gemini 2.5 Pro Preview 05-06	Google	91	--	91
2	Gemini 2.5 Pro	Google	90.7	90.8	90.7
3	Claude Opus 4.5	Anthropic	90.2	91.4	88.3
4	o3	OpenAI	88	--	88
5	Qwen3.5 397B A17B	Alibaba	87.8	--	87.8
6	Grok 3	xAI	87.2	--	87.2
7	Llama 4 Maverick	Meta	86	--	86
8	R1	DeepSeek	85.7	90.8	78.2
9	Qwen3 235B A22B Thinking 2507	Alibaba	85.5	--	85.5
10	Gemini 2.5 Flash	Google	85.2	--	85.2
11	o1	OpenAI	85.1	91.8	75
12	Claude 3.7 Sonnet	Anthropic	84.4	90.2	75.7
13	o3 Mini	OpenAI	83.5	86.9	78.5
14	DeepSeek V3	DeepSeek	81	88.5	69.7
15	GPT-4o (2024-11-20)	OpenAI	80.2	--	80.2
16	GPT-4.1	OpenAI	80	--	80
17	Qwen3 30B A3B Instruct 2507	Alibaba	80	--	80
18	DeepSeek V3 0324	DeepSeek	79.3	--	79.3
19	GPT-4o	OpenAI	79	88.7	64.3
20	Mistral Medium 3	Mistral AI	78.3	--	78.3
21	Claude 3.5 Sonnet	Anthropic	78	88.7	61.8
22	Claude Opus 4.6	Anthropic	77.7	--	77.7
23	Qwen3 235B A22B	Alibaba	77.5	--	77.5
24	Grok 3 Mini Beta	xAI	76.7	--	76.7
25	Gemini 2.0 Flash	Google	76	76.4	75.3
26	MiniMax M1	MiniMax	75.7	--	75.7
27	GPT-4 Turbo	OpenAI	75.6	86.5	59.2
28	Llama 3.3 70B Instruct	Meta	74.6	86.3	57.2
29	Mistral Large	Mistral AI	73.7	84	58.3
30	Qwen3 32B	Alibaba	73.5	--	73.5
31	Llama 3.3 Nemotron Super 49B V1.5	NVIDIA	73.2	--	73.2
32	o3 Mini High	OpenAI	73	--	73
33	Claude Sonnet 4.6	Anthropic	73	--	73
34	QwQ 32B	Alibaba	71.8	--	71.8
35	Gemini 2.0 Flash Lite	Google	71.7	--	71.7
36	Llama 3.1 70B Instruct	Meta	71.5	86	49.7
37	GPT-4o-mini	OpenAI	70.7	82	53.7
38	Qwen3 30B A3B	Alibaba	70	--	70
39	Claude 3.5 Haiku	Anthropic	67.1	80.9	46.3
40	GPT-4o (2024-05-13)	OpenAI	67	--	67
41	GPT-4o-mini (2024-07-18)	OpenAI	64.7	--	64.7
42	GPT-4o (2024-08-06)	OpenAI	64.2	--	64.2
43	Llama 3.1 Nemotron 70B Instruct	NVIDIA	64	--	64
44	Qwen2.5 72B Instruct	Alibaba	62	--	62
45	Mistral Large 2407	Mistral AI	61.5	--	61.5
46	Nova Pro 1.0	Amazon	60.2	--	60.2
47	Qwen2.5 Coder 32B Instruct	Alibaba	55.7	--	55.7
48	Command R+ (08-2024)	Cohere	55.5	--	55.5
49	Nova Lite 1.0	Amazon	55.5	--	55.5
50	Llama 3 70B Instruct	Meta	54	--	54
51	Phi 4	Microsoft	53.7	--	53.7
52	Nova Micro 1.0	Amazon	52.3	--	52.3
53	Claude 3 Haiku	Anthropic	50	--	50
54	Llama 3.1 8B Instruct	Meta	48.8	--	48.8
55	Llama 3 8B Instruct	Meta	45.2	--	45.2
56	Llama 3.2 3B Instruct	Meta	36.2	--	36.2
57	Llama 3.2 1B Instruct	Meta	27.8	--	27.8
58	Mistral 7B Instruct v0.1	Mistral AI	21.8	--	21.8

How scores are calculated

Each model's score is a weighted average of its available benchmark results. When a model is missing a benchmark (shown as "--" in the table), the weights are re-normalized across the benchmarks that are available, so a model with only an Arena Elo value receives that value as its score. All scores are on a 0-100 scale. Data is sourced from official model cards, published papers, and third-party evaluation platforms.
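
The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the site's actual implementation: the function name `weighted_score`, the dictionary keys `mmlu` and `arena_elo`, and the use of `None` for a missing benchmark are all hypothetical. The 60/40 weights are the ones listed on this page.

```python
from typing import Mapping, Optional

# Weights as listed on this page: MMLU 60%, Arena Elo 40%.
# The key names are hypothetical labels chosen for this sketch.
WEIGHTS = {"mmlu": 0.6, "arena_elo": 0.4}


def weighted_score(
    results: Mapping[str, Optional[float]],
    weights: Mapping[str, float] = WEIGHTS,
) -> Optional[float]:
    """Weighted average over the benchmarks that have a value.

    Benchmarks that are missing (None or absent) are dropped, and the
    remaining weights are re-normalized so that they sum to 1.
    """
    available = {
        name: value
        for name, value in results.items()
        if value is not None and name in weights
    }
    if not available:
        return None  # no benchmark data, so the model cannot be ranked
    total_weight = sum(weights[name] for name in available)
    weighted_sum = sum(weights[name] * value for name, value in available.items())
    return weighted_sum / total_weight


# Both benchmarks present (values from the Claude Opus 4.5 row).
both = weighted_score({"mmlu": 91.4, "arena_elo": 88.3})

# MMLU missing (values from the o3 row): Arena Elo carries the full weight.
arena_only = weighted_score({"mmlu": None, "arena_elo": 88.0})

print(round(both, 1), round(arena_only, 1))
```

With both benchmarks present the result is 0.6 × 91.4 + 0.4 × 88.3 = 90.16, which rounds to the 90.2 shown in the table. With MMLU missing, the single remaining weight normalizes to 1, so the score equals the Arena Elo value.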

Frequently Asked Questions

Based on our benchmark analysis, Gemini 2.5 Pro Preview 05-06 by Google is currently the #1 ranked model for multilingual tasks, with a weighted score of 91/100.

Models are ranked using a weighted average of MMLU (60%) and Arena Elo (40%) benchmark scores. All scores are normalized to a 0-100 scale.

We currently rank 58 models that have relevant benchmark data for multilingual tasks.