Overall
Capability: overall
Overview
The overall score summarizes average performance across all benchmark capabilities. Each capability contributes equally, regardless of how many queries it contains.
Evaluation Method
The overall score is the unweighted arithmetic mean of the per-capability scores. Because every capability carries the same weight, a capability with few queries influences the result as much as one with many.
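To make the equal-weighting rule concrete, here is a minimal sketch that contrasts the per-capability mean described above with a query-weighted mean. The capability names, scores, and query counts are hypothetical values chosen for illustration; they do not come from the benchmark.

```python
from statistics import mean


def overall_score(capability_scores: dict[str, float]) -> float:
    """Unweighted mean of per-capability scores (each capability counts once)."""
    if not capability_scores:
        raise ValueError("no capability scores to average")
    return mean(list(capability_scores.values()))


# Hypothetical capabilities, scores, and query counts (illustration only).
example_scores = {"math": 0.80, "coding": 0.60, "retrieval": 0.40}
example_query_counts = {"math": 1000, "coding": 100, "retrieval": 10}

# Equal weight per capability, as the overall score is defined here.
macro = overall_score(example_scores)

# Query-weighted alternative, shown only for contrast; not used by the benchmark.
micro = sum(
    example_scores[name] * example_query_counts[name] for name in example_scores
) / sum(example_query_counts.values())

print(f"per-capability mean: {macro:.3f}")
print(f"query-weighted mean: {micro:.3f}")
```

In this made-up case the query-weighted mean is pulled toward the largest capability, while the per-capability mean treats all three the same.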
Scoring
Each capability is scored separately using automated graders appropriate to that dimension (e.g., exact match, numeric match, code execution, LLM judges). The overall score is then the equally weighted average of these capability scores.
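As a rough illustration of the simpler grader types named above, the sketch below shows what an exact-match and a numeric-match grader might look like. The function names, the whitespace stripping, the relative tolerance, and the choice to average per-query grades into a capability score are all assumptions made for this example; the benchmark's actual graders may behave differently.

```python
import math


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 when the strings match after trimming whitespace (assumed normalization)."""
    return 1.0 if prediction.strip() == reference.strip() else 0.0


def numeric_match(prediction: str, reference: float, rel_tol: float = 1e-6) -> float:
    """Return 1.0 when the prediction parses as a number close to the reference.

    The tolerance is an assumed value, not one taken from the benchmark.
    """
    try:
        value = float(prediction)
    except ValueError:
        return 0.0
    return 1.0 if math.isclose(value, reference, rel_tol=rel_tol) else 0.0


def capability_score(grades: list[float]) -> float:
    """Average per-query grades into one capability score (assumed aggregation)."""
    if not grades:
        raise ValueError("no grades to average")
    return sum(grades) / len(grades)
```

A call such as `capability_score([exact_match(p, r) for p, r in pairs])`, with `pairs` a hypothetical list of prediction and reference strings, would yield one capability score ready to be averaged with the others.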
Model Scores
- google_gemini-3-pro-preview_reasoning_high: 0.726
- anthropic_claude-sonnet-4.5_reasoning_high: 0.683
- google_gemini-3-flash-preview_reasoning_high: 0.670
- openai_gpt-5.2_reasoning_high: 0.661
- google_gemini-3-pro-preview_reasoning_low: 0.652
- x-ai_grok-4.1-fast_reasoning_high: 0.649
- anthropic_claude-sonnet-4.5_reasoning_low: 0.645
- google_gemini-3-flash-preview_reasoning_low: 0.635
- x-ai_grok-4.1-fast_reasoning_low: 0.619
- openai_gpt-5.2_reasoning_low: 0.614
- openai_gpt-5-mini_reasoning_high: 0.607
- anthropic_claude-sonnet-4.5_reasoning_none: 0.600
- openai_gpt-5-nano_reasoning_high: 0.592
- openai_gpt-5-nano_reasoning_low: 0.560
- openai_gpt-5-mini_reasoning_low: 0.552
- deepseek_deepseek-v3.2-speciale: 0.550
- openai_gpt-5-nano: 0.543
- z-ai_glm-4.7: 0.526
- kwaipilot_kat-coder-pro_free: 0.498
- anthropic_claude-haiku-4.5: 0.491
- x-ai_grok-4.1-fast_reasoning_none: 0.466
- qwen_qwen3-235b-a22b-2507: 0.464
- deepseek_deepseek-v3.2-exp: 0.446
- mistralai_devstral-2512_free: 0.437
- google_gemini-2.0-flash-001: 0.434
- openai_gpt-4o: 0.430
- qwen_qwen3-32b: 0.427
- qwen_qwen3-8b: 0.398
- anthropic_claude-3.5-haiku: 0.374
- meta-llama_llama-3.3-70b-instruct: 0.354
- openai_gpt-4o-mini: 0.332
- meta-llama_llama-3-70b-instruct: 0.314
- mistralai_ministral-8b: 0.283
- meta-llama_llama-3-8b-instruct: 0.270
- google_gemma-2-9b-it: 0.262
- mistralai_mistral-7b-instruct-v0.1: 0.245
