Task Completion Benchmarks
Compare how language models perform across different categories of AI tasks. Performance scores come from our TaskBench evaluation suite and represent the share of tasks completed successfully, reported here as percentages.
Results
Rank | Model | Context | SQL | Agents | Normalization | Average |
---|---|---|---|---|---|---|
#1 | grok-4 | 90.0% | 96.7% | 93.8% | 90.7% | 92.8% |
#2 | claude-opus-4.1 | 98.3% | 93.3% | 85.4% | 83.3% | 90.1% |
#3 | claude-sonnet-4 | 98.3% | 93.3% | 81.3% | 85.2% | 89.5% |
#4 | o3 | 88.3% | 94.2% | 93.8% | 79.6% | 89.0% |
#5 | gemini-2.5-pro | 96.7% | 92.5% | 83.3% | 83.3% | 89.0% |
#6 | gpt-4.1 | 88.3% | 97.5% | 87.5% | 79.6% | 88.2% |
#7 | gpt-5 | 86.7% | 95.0% | 91.7% | 77.8% | 87.8% |
#8 | claude-opus-4 | 95.0% | 94.2% | 79.2% | 81.5% | 87.5% |
#9 | gemini-2.5-flash | 88.3% | 95.0% | 81.3% | 83.3% | 87.0% |
#10 | o1 | 91.7% | 96.7% | 79.2% | 79.6% | 86.8% |
#11 | gpt-5-mini | 93.3% | 96.7% | 85.4% | 70.4% | 86.4% |
#12 | grok-3 | 88.3% | 96.7% | 72.9% | 87.0% | 86.2% |
#13 | claude-3.5-sonnet | 88.3% | 90.8% | 89.6% | 75.9% | 86.2% |
#14 | claude-3.7-sonnet | 86.7% | 94.2% | 79.2% | 83.3% | 85.8% |
#15 | o3-mini | 85.0% | 95.8% | 72.9% | 85.2% | 84.7% |
#16 | o4-mini | 86.7% | 93.3% | 79.2% | 79.6% | 84.7% |
#17 | gpt-oss-120b | 81.7% | 91.7% | 81.3% | 81.5% | 84.0% |
#18 | moonshotai/kimi-k2-instruct | 85.0% | 92.5% | 72.9% | 81.5% | 83.0% |
#19 | deepseek-r1 | 78.3% | 89.2% | 83.3% | 79.6% | 82.6% |
#20 | gpt-4.1-mini | 71.7% | 97.5% | 83.3% | 77.8% | 82.6% |
#21 | gpt-oss-20b | 81.7% | 93.3% | 70.8% | 72.2% | 79.5% |
#22 | o1-mini | 76.7% | 92.5% | 66.7% | 81.5% | 79.3% |
#23 | gpt-5-nano | 81.7% | 93.3% | 72.9% | 66.7% | 78.6% |
#24 | qwen3-coder-480b-a35b-instruct | 71.7% | 90.0% | 72.9% | N/A | 78.2% |
#25 | gemini-2.0-flash | 76.7% | 87.5% | 58.3% | 85.2% | 76.9% |
#26 | gpt-4o | 63.3% | 92.5% | 64.6% | 83.3% | 75.9% |
#27 | claude-3.5-haiku | 76.7% | 86.7% | 62.5% | 68.5% | 73.6% |
#28 | deepseek-v3 | 63.3% | 89.2% | 64.6% | 72.2% | 72.3% |
#29 | gemini-2.0-flash-lite | 70.0% | 86.7% | 52.1% | 79.6% | 72.1% |
#30 | pixtral-large-latest-eu | 60.0% | 89.2% | 62.5% | 74.1% | 71.4% |
#31 | mistral-large-eu | 50.0% | 87.5% | 66.7% | 79.6% | 70.9% |
#32 | gemini-2.5-flash-lite | 75.0% | 87.5% | 43.8% | N/A | 68.8% |
#33 | gpt-4o-mini | 53.3% | 85.0% | 60.4% | 72.2% | 67.7% |
#34 | gpt-4.1-nano | 48.3% | 88.3% | 52.1% | 66.7% | 63.9% |
#35 | mistral-small-eu | 36.7% | 76.7% | 33.3% | 51.9% | 49.6% |
#36 | mistral-tiny-eu | 30.0% | 64.2% | 25.0% | 44.4% | 40.9% |
About TaskBench
TaskBench is our evaluation suite for testing language models on a range of real-world AI tasks. Unlike academic benchmarks, it focuses on practical tasks that reflect actual usage patterns. Each task sample represents one API call with structured input and output, mirroring real user goals.
All models are evaluated with their default settings and consistent prompting strategies. Scores represent the share of tasks completed successfully, reported as percentages. For detailed methodology and examples, see our TaskBench introduction.
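As a rough sketch of how the numbers in the table above appear to be derived (an inference from the table itself, not from TaskBench internals): each category score is the share of samples passed, and the Average column matches the unweighted mean of the available category scores, skipping N/A entries.

```python
# Sketch of how the table's numbers appear to be derived (inferred from the table,
# not from TaskBench documentation).

def category_score(passed: list[bool]) -> float:
    """Fraction of task samples completed successfully, as a percentage."""
    return 100.0 * sum(passed) / len(passed)

def average_score(category_scores: dict[str, float | None]) -> float:
    """Mean over categories that have a score, skipping N/A entries."""
    available = [s for s in category_scores.values() if s is not None]
    return sum(available) / len(available)

print(round(average_score(
    {"context": 90.0, "sql": 96.7, "agents": 93.8, "normalization": 90.7}), 1))  # 92.8
```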
Task Categories
Context
Context understanding and reasoning tasks test whether a model can answer accurately using only the provided context. This capability is essential for knowledge-base support bots, policy lookup systems, and internal knowledge Q&A applications. Models are scored on keeping their answers grounded in the given context rather than hallucinating information.
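For illustration, a hypothetical grounded-QA sample might look like the following; the field names and content are invented, not actual TaskBench data.

```python
# Hypothetical context-grounding sample (illustrative only, not actual TaskBench data).
context_sample = {
    "context": (
        "Refunds are available within 30 days of purchase. "
        "Digital goods are non-refundable once downloaded."
    ),
    "question": "Can I get a refund on a downloaded e-book I bought last week?",
    # The correct answer must come from the context above, not general knowledge.
    "expected_answer": "No. Digital goods are non-refundable once downloaded.",
}
```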
SQL
Natural language to SQL query generation evaluates text-to-query fidelity and schema reasoning. This task is particularly relevant for analytics chat assistants and simplified database interfaces where users need to query data using natural language. Models must understand both the intent behind the question and the structure of the underlying database schema.
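A minimal, hypothetical sketch of such a sample; the schema, question, and expected query are invented for illustration:

```python
# Hypothetical text-to-SQL sample (schema and query invented for illustration).
sql_sample = {
    "schema": """
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            total REAL,
            created_at TEXT
        );
    """,
    "question": "What is the total revenue from orders placed in 2024?",
    "expected_sql": (
        "SELECT SUM(total) FROM orders "
        "WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31';"
    ),
}
```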
Agents
AI agent reasoning and tool selection tests planning, tool choice, and self-diagnosis capabilities. These skills are crucial for autonomous ticket triage systems and complex agent workflows, where models must decide which tools to use, plan multi-step processes, and diagnose when things go wrong. These are among the most challenging tasks because they require open-ended reasoning and decision-making.
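A hypothetical tool-selection sample could look like this; the tool names, arguments, and scoring target are assumptions for illustration, not the actual TaskBench format.

```python
# Hypothetical agent tool-selection sample (tool names and arguments are invented).
agent_sample = {
    "available_tools": [
        {"name": "search_tickets", "args": ["query"]},
        {"name": "escalate_ticket", "args": ["ticket_id", "team"]},
        {"name": "close_ticket", "args": ["ticket_id", "resolution"]},
    ],
    "user_message": "Ticket #4521 is a duplicate of #4490, please close it.",
    # The model is scored on choosing the right tool with the right arguments.
    "expected_call": {
        "name": "close_ticket",
        "arguments": {"ticket_id": 4521, "resolution": "duplicate of #4490"},
    },
}
```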
Normalization
Data processing and normalization tasks evaluate a model's ability to produce clean, structured output from messy prose and inconsistently structured inputs. This capability is essential for catalogue and product pipelines, where data must be extracted from unstructured text and formatted consistently. Models must understand the desired output format and extract the relevant information accurately.
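A hypothetical normalization sample might pair messy prose with the structured record expected from it; the fields below are invented for illustration.

```python
# Hypothetical normalization sample: messy product prose in, structured record out.
normalization_sample = {
    "input_text": (
        "NEW!! Acme coffee maker, 1.5l capacity, colour: matte black, "
        "ships in 2-3 days, price 79,99 EUR incl. VAT"
    ),
    "expected_output": {
        "brand": "Acme",
        "product": "coffee maker",
        "capacity_liters": 1.5,
        "color": "matte black",
        "price": {"amount": 79.99, "currency": "EUR", "vat_included": True},
    },
}
```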
Evaluation Methodology
String-Level Evaluators
We use fast, deterministic methods for text-based outputs including exact match, ROUGE/BLEU metrics, regex heuristics, and static analysis. These evaluators are particularly effective for tasks with well-defined expected outputs where we can compare the model response against a known correct answer.
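As a rough sketch (not the actual TaskBench evaluators), deterministic checks such as exact match and regex heuristics can be expressed in a few lines:

```python
import re

# Minimal sketch of string-level checks (not the actual TaskBench evaluators).

def _normalize(text: str) -> str:
    return " ".join(text.lower().split())

def exact_match(prediction: str, reference: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return _normalize(prediction) == _normalize(reference)

def regex_check(prediction: str, pattern: str) -> bool:
    """Pass if the prediction contains a required pattern, e.g. a date or ID format."""
    return re.search(pattern, prediction) is not None

print(exact_match("  Paris ", "paris"))              # True
print(regex_check("Order #4521 closed", r"#\d{4}"))  # True
```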
Execution-Level Evaluators
For tasks where the output needs to be executed, we test functionality rather than just syntax. For example, with SQL generation tasks, we run the generated queries against mock databases and compare the results. This ensures that the model output not only looks correct but actually produces the intended results when executed.
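A minimal sketch of this idea for SQL, assuming an in-memory SQLite mock database; the schema, data, and queries are invented for illustration:

```python
import sqlite3

# Minimal sketch of execution-level SQL checking against an in-memory mock database.
# Schema, data, and queries are invented for illustration.

def results_match(generated_sql: str, reference_sql: str, setup_sql: str) -> bool:
    """Run both queries on the same mock database and compare their result sets."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        generated = conn.execute(generated_sql).fetchall()
        reference = conn.execute(reference_sql).fetchall()
        # Order-insensitive comparison: equivalent queries may order rows differently.
        return sorted(generated) == sorted(reference)
    except sqlite3.Error:
        return False  # queries that fail to execute count as unsuccessful
    finally:
        conn.close()

setup = """
CREATE TABLE orders (id INTEGER, total REAL);
INSERT INTO orders VALUES (1, 10.0), (2, 15.5);
"""
print(results_match("SELECT SUM(total) FROM orders",
                    "SELECT 10.0 + 15.5", setup))  # True
```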
LLM-Based Evaluators
We use LLMs to judge other LLMs on semantic accuracy, factual correctness, style, and safety. This approach is particularly useful for open-ended or context-heavy tasks where there is no single correct answer. LLM-based evaluation allows us to assess the quality of responses that require nuanced understanding and reasoning.
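A minimal sketch of an LLM-as-judge check, assuming a placeholder `call_llm` client and an invented rubric; the real TaskBench judge prompts and criteria may differ.

```python
from typing import Callable

# Minimal sketch of an LLM-as-judge check. `call_llm` is a placeholder for whatever
# client queries the judge model; the rubric and verdict format are assumptions.

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Is the candidate answer factually consistent with the reference? Reply PASS or FAIL."""

def llm_judge(question: str, reference: str, candidate: str,
              call_llm: Callable[[str], str]) -> bool:
    """Return True if the judge model deems the candidate answer acceptable."""
    verdict = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return verdict.strip().upper().startswith("PASS")
```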