Task Completion Benchmarks

Compare how different language models perform across categories of AI tasks. Performance scores are based on our comprehensive TaskBench evaluation suite and measure the fraction of tasks completed successfully (0.0 to 1.0, shown below as percentages).

Results

Rank | Model | Context | SQL | Agents | Normalization | Average
--- | --- | --- | --- | --- | --- | ---
#1 | grok-4 | 90.0% | 96.7% | 93.8% | 90.7% | 92.8%
#2 | claude-opus-4.1 | 98.3% | 93.3% | 85.4% | 83.3% | 90.1%
#3 | claude-sonnet-4 | 98.3% | 93.3% | 81.3% | 85.2% | 89.5%
#4 | o3 | 88.3% | 94.2% | 93.8% | 79.6% | 89.0%
#5 | gemini-2.5-pro | 96.7% | 92.5% | 83.3% | 83.3% | 89.0%
#6 | gpt-4.1 | 88.3% | 97.5% | 87.5% | 79.6% | 88.2%
#7 | gpt-5 | 86.7% | 95.0% | 91.7% | 77.8% | 87.8%
#8 | claude-opus-4 | 95.0% | 94.2% | 79.2% | 81.5% | 87.5%
#9 | gemini-2.5-flash | 88.3% | 95.0% | 81.3% | 83.3% | 87.0%
#10 | o1 | 91.7% | 96.7% | 79.2% | 79.6% | 86.8%
#11 | gpt-5-mini | 93.3% | 96.7% | 85.4% | 70.4% | 86.4%
#12 | grok-3 | 88.3% | 96.7% | 72.9% | 87.0% | 86.2%
#13 | claude-3.5-sonnet | 88.3% | 90.8% | 89.6% | 75.9% | 86.2%
#14 | claude-3.7-sonnet | 86.7% | 94.2% | 79.2% | 83.3% | 85.8%
#15 | o3-mini | 85.0% | 95.8% | 72.9% | 85.2% | 84.7%
#16 | o4-mini | 86.7% | 93.3% | 79.2% | 79.6% | 84.7%
#17 | gpt-oss-120b | 81.7% | 91.7% | 81.3% | 81.5% | 84.0%
#18 | moonshotai/kimi-k2-instruct | 85.0% | 92.5% | 72.9% | 81.5% | 83.0%
#19 | deepseek-r1 | 78.3% | 89.2% | 83.3% | 79.6% | 82.6%
#20 | gpt-4.1-mini | 71.7% | 97.5% | 83.3% | 77.8% | 82.6%
#21 | gpt-oss-20b | 81.7% | 93.3% | 70.8% | 72.2% | 79.5%
#22 | o1-mini | 76.7% | 92.5% | 66.7% | 81.5% | 79.3%
#23 | gpt-5-nano | 81.7% | 93.3% | 72.9% | 66.7% | 78.6%
#24 | qwen3-coder-480b-a35b-instruct | 71.7% | 90.0% | 72.9% | N/A | 78.2%
#25 | gemini-2.0-flash | 76.7% | 87.5% | 58.3% | 85.2% | 76.9%
#26 | gpt-4o | 63.3% | 92.5% | 64.6% | 83.3% | 75.9%
#27 | claude-3.5-haiku | 76.7% | 86.7% | 62.5% | 68.5% | 73.6%
#28 | deepseek-v3 | 63.3% | 89.2% | 64.6% | 72.2% | 72.3%
#29 | gemini-2.0-flash-lite | 70.0% | 86.7% | 52.1% | 79.6% | 72.1%
#30 | pixtral-large-latest-eu | 60.0% | 89.2% | 62.5% | 74.1% | 71.4%
#31 | mistral-large-eu | 50.0% | 87.5% | 66.7% | 79.6% | 70.9%
#32 | gemini-2.5-flash-lite | 75.0% | 87.5% | 43.8% | N/A | 68.8%
#33 | gpt-4o-mini | 53.3% | 85.0% | 60.4% | 72.2% | 67.7%
#34 | gpt-4.1-nano | 48.3% | 88.3% | 52.1% | 66.7% | 63.9%
#35 | mistral-small-eu | 36.7% | 76.7% | 33.3% | 51.9% | 49.6%
#36 | mistral-tiny-eu | 30.0% | 64.2% | 25.0% | 44.4% | 40.9%

About TaskBench

TaskBench is our comprehensive evaluation suite that tests language models across various real-world AI tasks. Unlike academic benchmarks, TaskBench focuses on practical tasks that reflect real-world usage patterns. Each task sample represents one API call with structured input/output, mirroring actual user goals.
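
The exact sample schema is not shown on this page, but a minimal sketch of what "one API call with structured input/output" could look like, with assumed field names, is:

from dataclasses import dataclass, field

@dataclass
class TaskSample:
    # Field names are assumptions for illustration; the real TaskBench schema may differ.
    category: str               # "Context", "SQL", "Agents", or "Normalization"
    prompt: str                 # structured input sent in a single API call
    expected_output: str        # reference output the evaluators score against
    metadata: dict = field(default_factory=dict)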

All models are evaluated using their default settings with consistent prompting strategies. Scores represent the fraction of tasks completed successfully (0.0 to 1.0, shown here as percentages). For more detailed methodology and examples, see our TaskBench introduction.

Task Categories

Context

Context understanding and reasoning tasks test whether a model can give accurate answers grounded in the provided context. This capability is essential for knowledge-base support bots, policy lookup systems, and internal knowledge Q&A applications. Models are evaluated on their ability to provide answers that are properly grounded in the given context rather than hallucinating information.
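
A hypothetical sample of this kind, with invented context and wording, might look like the following; the point is that the reference answer is fully supported by the supplied context:

# Invented example; actual TaskBench prompts are not published here.
sample = {
    "context": "Refunds are available within 30 days of purchase with a valid receipt.",
    "question": "How long after purchase can a customer still request a refund?",
    "reference_answer": "Within 30 days of purchase, provided they have a valid receipt.",
}
# An answer that adds unsupported details (e.g. "60 days for members") would be
# counted as a hallucination and marked incorrect.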

SQL

Natural language to SQL query generation evaluates text-to-query fidelity and schema reasoning. This task is particularly relevant for analytics chat assistants and simplified database interfaces where users need to query data using natural language. Models must understand both the intent behind the question and the structure of the underlying database schema.
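
A text-to-SQL sample could pair a schema with a question and a reference query, along the lines of the sketch below (schema and question are invented); correctness of generated queries is checked by execution, as described under Execution-Level Evaluators:

# Invented example of a text-to-SQL task sample.
sample = {
    "schema": "CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL, created_at TEXT);",
    "question": "What is the total revenue per customer?",
    "reference_sql": "SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id;",
}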

Agents

AI agent reasoning and tool-selection tasks test planning, tool choice, and self-diagnosis. These skills are crucial for autonomous ticket triage systems and complex agent workflows where models must decide which tools to use, plan multi-step processes, and diagnose when things go wrong. These are among the most challenging tasks because they require open-ended reasoning and decision-making.
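
One simple way such a task can be scored is to compare the tool call the model produces against an expected call; the tool name and matching rule below are assumptions for illustration:

# Sketch of a tool-selection check; tool names and the equality rule are invented.
expected_call = {"tool": "create_ticket", "args": {"priority": "high", "queue": "billing"}}

def tool_call_correct(model_call, expected):
    return (model_call.get("tool") == expected["tool"]
            and model_call.get("args") == expected["args"])

print(tool_call_correct({"tool": "create_ticket",
                         "args": {"priority": "high", "queue": "billing"}},
                        expected_call))  # True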

Normalization

Data processing and normalization tasks evaluate a model's ability to produce structured output from messy prose and inconsistently structured sources. This capability is essential for catalogue and product pipelines where data needs to be extracted from unstructured text and formatted consistently. Models must understand the desired output format and extract the relevant information accurately.
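
An illustrative normalization sample, with invented field names and data, turns free-form product prose into a fixed record shape; a structural check like the one below is one plausible first pass before comparing values:

# Invented example: messy prose in, a fixed record shape out.
messy = "Acme UltraWidget v2 - weighs about 1.2kg, ships in the EU only, price 49.90 EUR"
expected = {"name": "Acme UltraWidget v2", "weight_kg": 1.2, "regions": ["EU"], "price_eur": 49.90}

def has_required_fields(record):
    # Structural check only; real scoring would also compare the extracted values.
    return {"name", "weight_kg", "regions", "price_eur"} <= record.keys()

print(has_required_fields(expected))  # True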

Evaluation Methodology

String-Level Evaluators

We use fast, deterministic methods for text-based outputs, including exact match, ROUGE/BLEU metrics, regex heuristics, and static analysis. These evaluators are particularly effective for tasks with well-defined expected outputs, where we can compare the model response against a known correct answer.
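
A minimal sketch of a string-level check, assuming light normalisation before exact match and a regex fallback for near-misses (the normalisation rules and pass/fail design are assumptions, not the documented implementation):

import re

def normalise(text):
    # Collapse whitespace and case so trivial formatting differences don't break exact match.
    return re.sub(r"\s+", " ", text.strip().lower())

def string_level_pass(output, reference, pattern=None):
    if normalise(output) == normalise(reference):
        return True                                        # exact match
    if pattern is not None:
        return re.search(pattern, output) is not None      # regex heuristic
    return False

print(string_level_pass("  The answer is 42. ", "the answer is 42."))         # True
print(string_level_pass("roughly forty-two", "42", pattern=r"forty[- ]two"))  # True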

Execution-Level Evaluators

For tasks where the output needs to be executed, we test functionality rather than just syntax. For example, with SQL generation tasks, we run the generated queries against mock databases and compare the results. This ensures that the model output not only looks correct but actually produces the intended results when executed.
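
A sketch of that idea using an in-memory SQLite database; the schema, data, and order-insensitive comparison rule are invented for illustration:

import sqlite3

def results_match(generated_sql, reference_sql, setup_sql):
    # Run both queries against the same mock database and compare result sets.
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup_sql)
    try:
        got = conn.execute(generated_sql).fetchall()
        want = conn.execute(reference_sql).fetchall()
    except sqlite3.Error:
        return False                       # queries that fail to execute score zero
    finally:
        conn.close()
    return sorted(got) == sorted(want)     # order-insensitive comparison

setup = """
CREATE TABLE orders (id INTEGER, customer_id INTEGER, total REAL);
INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
"""
print(results_match(
    "SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id",
    "SELECT customer_id, SUM(total) AS revenue FROM orders GROUP BY customer_id",
    setup,
))  # True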

LLM-Based Evaluators

We use LLMs to judge other LLMs on semantic accuracy, factual correctness, style, and safety. This approach is particularly useful for open-ended or context-heavy tasks where there is no single correct answer. LLM-based evaluation allows us to assess the quality of responses that require nuanced understanding and reasoning.
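
A minimal sketch of the LLM-as-judge pattern; the rubric wording and the 1-5 scale are assumptions, and call_judge_model is a placeholder for whatever chat-completion client is actually used, not a real API:

import re

RUBRIC = """You are grading a model answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Score the candidate from 1 (wrong or unsafe) to 5 (accurate, grounded, well written).
Reply with only the number."""

def call_judge_model(prompt):
    # Placeholder: wire this to the judge model's chat-completion API.
    raise NotImplementedError

def judge(question, reference, candidate):
    reply = call_judge_model(RUBRIC.format(question=question, reference=reference, candidate=candidate))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(match.group())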