Task Completion Benchmarks

Compare how language models perform across different categories of AI tasks. Performance scores are based on our comprehensive TaskBench evaluation suite and measure the percentage of tasks completed successfully.

Results

| Rank | Model | Context | SQL | Agents | Normalization | Average | Cost |
|---|---|---|---|---|---|---|---|
| 1 | grok-4 | 88.3% | 95.8% | 91.7% | 85.2% | 90.3% | $2.162 |
| 2 | claude-sonnet-4 | 96.7% | 90.0% | 89.6% | 83.3% | 89.9% | $2.773 |
| 3 | o3 | 93.3% | 93.3% | 91.7% | 79.6% | 89.5% | $1.741 |
| 4 | claude-opus-4.1 | 91.7% | 95.0% | 87.5% | 81.5% | 88.9% | $13.731 |
| 5 | glm-4.5 | 90.0% | 95.0% | 83.3% | 83.3% | 87.9% | $0.532 |
| 6 | gpt-5-mini | 96.7% | 95.0% | 83.3% | 75.9% | 87.7% | $0.263 |
| 7 | claude-opus-4 | 93.3% | 94.2% | 83.3% | 79.6% | 87.6% | $13.740 |
| 8 | gpt-5 | 88.3% | 95.0% | 87.5% | 79.6% | 87.6% | $2.061 |
| 9 | o1 | 91.7% | 96.7% | 75.0% | 85.2% | 87.1% | $16.504 |
| 10 | claude-3.5-sonnet | 90.0% | 91.7% | 85.4% | 79.6% | 86.7% | $2.649 |
| 11 | grok-3 | 86.7% | 91.7% | 81.3% | 87.0% | 86.7% | $2.252 |
| 12 | claude-3.7-sonnet | 86.7% | 94.2% | 83.3% | 81.5% | 86.4% | $2.780 |
| 13 | gemini-2.5-flash | 93.3% | 93.3% | 77.1% | 81.5% | 86.3% | $0.439 |
| 14 | o4-mini | 88.3% | 94.2% | 87.5% | 74.1% | 86.0% | $1.053 |
| 15 | gpt-oss-120b | 88.3% | 94.2% | 85.4% | 75.9% | 86.0% | $0.139 |
| 16 | gemini-2.5-pro | 93.3% | 91.7% | 75.0% | 81.5% | 85.4% | $1.927 |
| 17 | gpt-4.1 | 83.3% | 96.7% | 83.3% | 74.1% | 84.4% | $1.492 |
| 18 | mistral-medium-2508-eu | 86.7% | 92.5% | 81.3% | 75.9% | 84.1% | $0.393 |
| 19 | deepseek-r1 | 81.7% | 90.8% | 75.0% | 83.3% | 82.7% | $0.525 |
| 20 | glm-4.5-air | 78.3% | 92.5% | 77.1% | 81.5% | 82.3% | $0.230 |
| 21 | qwen-3-32b | N/A | 89.2% | 75.0% | 81.5% | 81.9% | $0.164 |
| 22 | moonshotai/kimi-k2-instruct | 81.7% | 94.2% | 66.7% | 83.3% | 81.5% | $0.673 |
| 23 | deepseek-v3.1 | 76.7% | 86.7% | 75.0% | 81.5% | 80.0% | $0.826 |
| 24 | mistral-large-eu | 78.3% | 94.2% | 75.0% | 72.2% | 79.9% | $1.793 |
| 25 | gpt-oss-20b | 76.7% | 94.2% | 72.9% | 74.1% | 79.5% | $0.095 |
| 26 | o1-mini | 78.3% | 89.2% | 68.8% | 81.5% | 79.4% | $1.197 |
| 27 | gpt-5-nano | 85.0% | 93.3% | 64.6% | 74.1% | 79.2% | $0.127 |
| 28 | gpt-4.1-mini | 70.0% | 95.0% | 79.2% | 72.2% | 79.1% | $0.295 |
| 29 | o3-mini | 80.0% | 92.5% | 62.5% | 79.6% | 78.7% | $1.078 |
| 30 | gemini-2.0-flash | 76.7% | 87.5% | 64.6% | 85.2% | 78.5% | $0.082 |
| 31 | qwen3-coder-480b-a35b-instruct | 78.3% | 89.2% | 66.7% | 77.8% | 78.0% | $0.364 |
| 32 | qwen-3-235b-a22b-instruct-2507 | 70.0% | 91.7% | 70.8% | 77.8% | 77.6% | $0.147 |
| 33 | gpt-4o | 70.0% | 92.5% | 66.7% | 79.6% | 77.2% | $1.836 |
| 34 | pixtral-large-latest-eu | 71.7% | 88.3% | 64.6% | 77.8% | 75.6% | $1.730 |
| 35 | gemini-2.5-flash-lite | 76.7% | 87.5% | 54.2% | 77.8% | 74.0% | $0.086 |
| 36 | claude-3.5-haiku | 78.3% | 88.3% | 54.2% | 70.4% | 72.8% | $0.689 |
| 37 | deepseek-v3 | 63.3% | 85.0% | 60.4% | 81.5% | 72.6% | $0.617 |
| 38 | magistral-medium-2506-eu | 48.3% | 84.2% | 72.9% | 74.1% | 69.9% | $0.560 |
| 39 | llama-4-maverick-17b-128e-instruct | 50.0% | 90.8% | 60.4% | 77.8% | 69.8% | $0.047 |
| 40 | gemini-2.0-flash-lite | 66.7% | 80.8% | 52.1% | 77.8% | 69.3% | $0.061 |
| 41 | gpt-4o-mini | 53.3% | 85.0% | 60.4% | 68.5% | 66.8% | $0.106 |
| 42 | gpt-4.1-nano | 56.7% | 86.7% | 54.2% | 55.6% | 63.3% | $0.072 |
| 43 | mistral-small-eu | 31.7% | 70.8% | 41.7% | 59.3% | 50.9% | $0.028 |
| 44 | mistral-tiny-eu | 11.7% | 65.0% | 16.7% | 27.8% | 30.3% | $0.075 |
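The Average column appears to be the unweighted mean of the four category scores, rounded to one decimal (for qwen-3-32b, whose Context score is N/A, the mean of the three available scores). A minimal sketch of that assumption:

```python
def average_score(category_scores):
    """Unweighted mean of per-category success rates, rounded to one decimal."""
    return round(sum(category_scores) / len(category_scores), 1)

# claude-sonnet-4's per-category scores from the table above
claude_sonnet_4 = [96.7, 90.0, 89.6, 83.3]
print(average_score(claude_sonnet_4))  # → 89.9, matching the table
```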

About TaskBench

TaskBench is our comprehensive evaluation suite that tests language models across various real-world AI tasks. Unlike academic benchmarks, TaskBench focuses on practical tasks that reflect real-world usage patterns. Each task sample represents one API call with structured input/output, mirroring actual user goals.

All models are evaluated using their default settings with consistent prompting strategies. Scores represent the fraction of tasks completed successfully (0.0 to 1.0), shown above as percentages. For more detailed methodology and examples, see our TaskBench introduction.
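One way to picture a task sample of the kind described above, a single API call with structured input and output, is as a small record pairing the input with a ground-truth answer. The field names below are illustrative, not TaskBench's actual schema:

```python
# Illustrative task sample: one API call with structured input/output.
# Field names are hypothetical, not TaskBench's actual schema.
sample = {
    "category": "sql",
    "input": {
        "schema": "CREATE TABLE orders (id INTEGER, total REAL, placed_at TEXT);",
        "question": "What is the total revenue from all orders?",
    },
    "expected_output": "SELECT SUM(total) FROM orders;",
}

def score(samples, outputs):
    """Fraction of samples where the model output matched ground truth."""
    passed = sum(out == s["expected_output"] for s, out in zip(samples, outputs))
    return passed / len(samples)
```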

Task Categories

Context

Context understanding and reasoning tasks test whether a model gives accurate answers grounded in the provided context. This capability is essential for knowledge-base support bots, policy lookup systems, and internal knowledge Q&A applications. Models are evaluated on their ability to provide answers that are properly grounded in the given context rather than hallucinated.

SQL

Natural language to SQL query generation evaluates text-to-query fidelity and schema reasoning. This task is particularly relevant for analytics chat assistants and simplified database interfaces where users need to query data using natural language. Models must understand both the intent behind the question and the structure of the underlying database schema.

Agents

AI agent reasoning and tool selection tests planning, tool choice, and self-diagnosis capabilities. These skills are crucial for autonomous ticket triage systems and complex agent workflows, where models must decide which tools to use, plan multi-step processes, and diagnose when things go wrong. These are among the most challenging tasks, as they require open-ended reasoning and decision-making.
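A toy illustration of the tool-selection part of this category; the tool names and pass criterion are invented for the sketch, and real agent tasks are far more open-ended:

```python
# Hypothetical tool registry for an agent-style task: the model must pick
# the right tool for a user request. Names are invented for illustration.
TOOLS = {
    "lookup_order": "Fetch an order's status by its ID.",
    "refund_order": "Issue a refund for a delivered order.",
    "escalate": "Hand the ticket to a human agent.",
}

def evaluate_tool_choice(expected_tool, chosen_tool):
    """A tool-selection step passes only if the model picked a real tool
    and that tool is the expected one."""
    return chosen_tool in TOOLS and chosen_tool == expected_tool

# A triage model that answers "refund_order" for a refund request passes:
print(evaluate_tool_choice("refund_order", "refund_order"))  # True
```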

Normalization

Data processing and normalization tasks evaluate a model's ability to produce consistent structured output from messy prose and inconsistently formatted sources. This capability is essential for catalogue and product pipelines, where data needs to be extracted from unstructured text and formatted consistently. Models must understand the desired output format and extract the relevant information accurately.
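As a sketch of what such a task might look like (the prose, field names, and patterns below are invented for illustration), pulling a consistent record out of messy product text:

```python
import re

def normalize_product(text):
    """Extract a structured record from messy product prose.

    The field set and regexes are illustrative, not TaskBench's actual spec.
    """
    price = re.search(r"\$(\d+(?:\.\d{2})?)", text)
    weight = re.search(r"(\d+(?:\.\d+)?)\s*(?:kg|kilograms?)", text, re.I)
    return {
        "price_usd": float(price.group(1)) if price else None,
        "weight_kg": float(weight.group(1)) if weight else None,
    }

messy = "Deluxe kettle, only $29.99! Ships at 1.2 kg incl. packaging."
print(normalize_product(messy))  # {'price_usd': 29.99, 'weight_kg': 1.2}
```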

Swedish Language Understanding

Multilingual comprehension and reasoning in Swedish, covering fact checking, summarization, inference, literary and legal analysis, and dialectal understanding. This task evaluates how well models follow Swedish instructions and produce accurate, well-formed answers in Swedish.

Evaluation Methodology

All model completions are measured against ground truth: we compare each model's output to known correct answers to ensure accurate and reliable performance assessment.

String-Level Evaluators

We use fast, deterministic methods for text-based outputs including exact match, ROUGE/BLEU metrics, regex heuristics, and static analysis. These evaluators are particularly effective for tasks with well-defined expected outputs where we can compare the model response against a known correct answer.
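A minimal sketch of two such deterministic checks: exact match after light normalization, and a unigram-overlap F1 in the spirit of ROUGE-1 (a simplification, not the full metric):

```python
def exact_match(prediction, reference):
    """Exact match after trimming whitespace and lowercasing."""
    return prediction.strip().lower() == reference.strip().lower()

def token_f1(prediction, reference):
    """Unigram-overlap F1, a simplified stand-in for ROUGE-1."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    overlap = sum(min(pred.count(t), ref.count(t)) for t in set(pred))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match(" Paris ", "paris"))               # True
print(round(token_f1("the cat sat", "the cat"), 2))  # 0.8
```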

Execution-Level Evaluators

For tasks where the output needs to be executed, we test functionality rather than just syntax. For example, with SQL generation tasks, we run the generated queries against mock databases and compare the results. This ensures that the model output not only looks correct but actually produces the intended results when executed.
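A sketch of this idea using Python's sqlite3 as the mock database (the schema and queries are invented for illustration): two queries count as equivalent if they return the same rows, even when their text differs.

```python
import sqlite3

def results_match(db, generated_sql, reference_sql):
    """Execute both queries against the mock database and compare result sets."""
    gen = db.execute(generated_sql).fetchall()
    ref = db.execute(reference_sql).fetchall()
    return sorted(gen) == sorted(ref)

# Mock database with a handful of rows (invented for this sketch).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, total REAL)")
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.5)])

# Differently written but functionally equivalent queries both pass.
print(results_match(db,
                    "SELECT SUM(total) FROM orders",
                    "SELECT 10.0 + 25.5"))  # True
```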

LLM-Based Evaluators

We use LLMs to judge other LLMs on semantic accuracy, factual correctness, style, and safety. This approach is particularly useful for open-ended or context-heavy tasks where there is no single correct answer. LLM-based evaluation allows us to assess the quality of responses that require nuanced understanding and reasoning.
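A sketch of how an LLM-judge request might be assembled; the rubric, scale, and the `call_judge_model` function are placeholders, not Opper's actual implementation:

```python
JUDGE_PROMPT = """\
You are grading a model's answer against a reference.

Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}

Score the candidate from 1 (wrong) to 5 (fully correct and well grounded),
considering semantic accuracy, factual correctness, style, and safety.
Reply with the score only."""

def build_judge_request(question, reference, candidate):
    """Fill the grading rubric into a single judge prompt."""
    return JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )

prompt = build_judge_request(
    "What is the capital of Sweden?", "Stockholm", "It is Stockholm."
)
# The prompt would then be sent to a separate judge model, e.g.:
# score = call_judge_model(prompt)  # hypothetical API call
```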
