Task Completion Benchmarks
Compare how language models perform across different categories of AI tasks. Performance scores come from our TaskBench evaluation suite and represent the share of tasks completed successfully, reported here as percentages.
Results
Rank | Model | Context | SQL | Agents | Normalization | Average |
---|---|---|---|---|---|---|
#1 | grok-4 | 90.0% | 96.7% | 93.8% | 90.7% | 92.8% |
#2 | claude-opus-4.1 | 98.3% | 93.3% | 85.4% | 83.3% | 90.1% |
#3 | claude-sonnet-4 | 98.3% | 93.3% | 81.3% | 85.2% | 89.5% |
#4 | o3 | 88.3% | 94.2% | 93.8% | 79.6% | 89.0% |
#5 | gemini-2.5-pro | 96.7% | 92.5% | 83.3% | 83.3% | 89.0% |
#6 | gpt-4.1 | 88.3% | 97.5% | 87.5% | 79.6% | 88.2% |
#7 | gpt-5 | 86.7% | 95.0% | 91.7% | 77.8% | 87.8% |
#8 | claude-opus-4 | 95.0% | 94.2% | 79.2% | 81.5% | 87.5% |
#9 | gemini-2.5-flash | 88.3% | 95.0% | 81.3% | 83.3% | 87.0% |
#10 | o1 | 91.7% | 96.7% | 79.2% | 79.6% | 86.8% |
#11 | gpt-5-mini | 93.3% | 96.7% | 85.4% | 70.4% | 86.4% |
#12 | grok-3 | 88.3% | 96.7% | 72.9% | 87.0% | 86.2% |
#13 | claude-3.5-sonnet | 88.3% | 90.8% | 89.6% | 75.9% | 86.2% |
#14 | claude-3.7-sonnet | 86.7% | 94.2% | 79.2% | 83.3% | 85.8% |
#15 | o3-mini | 85.0% | 95.8% | 72.9% | 85.2% | 84.7% |
#16 | o4-mini | 86.7% | 93.3% | 79.2% | 79.6% | 84.7% |
#17 | gpt-oss-120b | 81.7% | 91.7% | 81.3% | 81.5% | 84.0% |
#18 | moonshotai/kimi-k2-instruct | 85.0% | 92.5% | 72.9% | 81.5% | 83.0% |
#19 | deepseek-r1 | 78.3% | 89.2% | 83.3% | 79.6% | 82.6% |
#20 | gpt-4.1-mini | 71.7% | 97.5% | 83.3% | 77.8% | 82.6% |
#21 | gpt-oss-20b | 81.7% | 93.3% | 70.8% | 72.2% | 79.5% |
#22 | o1-mini | 76.7% | 92.5% | 66.7% | 81.5% | 79.3% |
#23 | gpt-5-nano | 81.7% | 93.3% | 72.9% | 66.7% | 78.6% |
#24 | qwen3-coder-480b-a35b-instruct | 71.7% | 90.0% | 72.9% | N/A | 78.2% |
#25 | gemini-2.0-flash | 76.7% | 87.5% | 58.3% | 85.2% | 76.9% |
#26 | gpt-4o | 63.3% | 92.5% | 64.6% | 83.3% | 75.9% |
#27 | claude-3.5-haiku | 76.7% | 86.7% | 62.5% | 68.5% | 73.6% |
#28 | deepseek-v3 | 63.3% | 89.2% | 64.6% | 72.2% | 72.3% |
#29 | gemini-2.0-flash-lite | 70.0% | 86.7% | 52.1% | 79.6% | 72.1% |
#30 | pixtral-large-latest-eu | 60.0% | 89.2% | 62.5% | 74.1% | 71.4% |
#31 | mistral-large-eu | 50.0% | 87.5% | 66.7% | 79.6% | 70.9% |
#32 | gemini-2.5-flash-lite | 75.0% | 87.5% | 43.8% | N/A | 68.8% |
#33 | gpt-4o-mini | 53.3% | 85.0% | 60.4% | 72.2% | 67.7% |
#34 | gpt-4.1-nano | 48.3% | 88.3% | 52.1% | 66.7% | 63.9% |
#35 | mistral-small-eu | 36.7% | 76.7% | 33.3% | 51.9% | 49.6% |
#36 | mistral-tiny-eu | 30.0% | 64.2% | 25.0% | 44.4% | 40.9% |
About TaskBench
TaskBench is our evaluation suite for testing language models on a range of real-world AI tasks. Unlike academic benchmarks, it focuses on practical tasks that reflect actual usage patterns. Each task sample represents one API call with structured input and output, mirroring real user goals.
All models are evaluated with their default settings and consistent prompting strategies. Scores represent the share of tasks completed successfully, reported as percentages. For detailed methodology and examples, see our TaskBench introduction.
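As a rough sketch of how the numbers in the table above appear to be derived (an inference from the table itself, not from TaskBench internals): each category score is the share of samples passed, and the Average column matches the unweighted mean of the available category scores, skipping N/A entries.

```python
# Sketch of how the table's numbers appear to be derived (inferred from the table,
# not from TaskBench documentation).

def category_score(passed: list[bool]) -> float:
    """Fraction of task samples completed successfully, as a percentage."""
    return 100.0 * sum(passed) / len(passed)

def average_score(category_scores: dict[str, float | None]) -> float:
    """Mean over categories that have a score, skipping N/A entries."""
    available = [s for s in category_scores.values() if s is not None]
    return sum(available) / len(available)

print(round(average_score(
    {"context": 90.0, "sql": 96.7, "agents": 93.8, "normalization": 90.7}), 1))  # 92.8
```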
Task Categories
Context
Context understanding and reasoning tasks test whether a model can answer accurately using only the provided context. This capability is essential for knowledge-base support bots, policy lookup systems, and internal knowledge Q&A applications. Models are scored on keeping their answers grounded in the given context rather than hallucinating information.
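For illustration, a hypothetical grounded-QA sample might look like the following; the field names and content are invented, not actual TaskBench data.

```python
# Hypothetical context-grounding sample (illustrative only, not actual TaskBench data).
context_sample = {
    "context": (
        "Refunds are available within 30 days of purchase. "
        "Digital goods are non-refundable once downloaded."
    ),
    "question": "Can I get a refund on a downloaded e-book I bought last week?",
    # The correct answer must come from the context above, not general knowledge.
    "expected_answer": "No. Digital goods are non-refundable once downloaded.",
}
```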
SQL
Natural language to SQL query generation evaluates text-to-query fidelity and schema reasoning. This task is particularly relevant for analytics chat assistants and simplified database interfaces where users need to query data using natural language. Models must understand both the intent behind the question and the structure of the underlying database schema.
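A minimal, hypothetical sketch of such a sample; the schema, question, and expected query are invented for illustration:

```python
# Hypothetical text-to-SQL sample (schema and query invented for illustration).
sql_sample = {
    "schema": """
        CREATE TABLE orders (
            id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            total REAL,
            created_at TEXT
        );
    """,
    "question": "What is the total revenue from orders placed in 2024?",
    "expected_sql": (
        "SELECT SUM(total) FROM orders "
        "WHERE created_at BETWEEN '2024-01-01' AND '2024-12-31';"
    ),
}
```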
Agents
AI agent reasoning and tool selection tests planning, tool choice, and self-diagnosis capabilities. These skills are crucial for autonomous ticket triage systems and complex agent workflows, where models must decide which tools to use, plan multi-step processes, and diagnose when things go wrong. These are among the most challenging tasks because they require open-ended reasoning and decision-making.
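A hypothetical tool-selection sample could look like this; the tool names, arguments, and scoring target are assumptions for illustration, not the actual TaskBench format.

```python
# Hypothetical agent tool-selection sample (tool names and arguments are invented).
agent_sample = {
    "available_tools": [
        {"name": "search_tickets", "args": ["query"]},
        {"name": "escalate_ticket", "args": ["ticket_id", "team"]},
        {"name": "close_ticket", "args": ["ticket_id", "resolution"]},
    ],
    "user_message": "Ticket #4521 is a duplicate of #4490, please close it.",
    # The model is scored on choosing the right tool with the right arguments.
    "expected_call": {
        "name": "close_ticket",
        "arguments": {"ticket_id": 4521, "resolution": "duplicate of #4490"},
    },
}
```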
Normalization
Data processing and normalization tasks evaluate a model's ability to produce clean, structured output from messy prose and inconsistently structured inputs. This capability is essential for catalogue and product pipelines, where data must be extracted from unstructured text and formatted consistently. Models must understand the desired output format and extract the relevant information accurately.
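A hypothetical normalization sample might pair messy prose with the structured record expected from it; the fields below are invented for illustration.

```python
# Hypothetical normalization sample: messy product prose in, structured record out.
normalization_sample = {
    "input_text": (
        "NEW!! Acme coffee maker, 1.5l capacity, colour: matte black, "
        "ships in 2-3 days, price 79,99 EUR incl. VAT"
    ),
    "expected_output": {
        "brand": "Acme",
        "product": "coffee maker",
        "capacity_liters": 1.5,
        "color": "matte black",
        "price": {"amount": 79.99, "currency": "EUR", "vat_included": True},
    },
}
```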
Evaluation Methodology
String-Level Evaluators
We use fast, deterministic methods for text-based outputs including exact match, ROUGE/BLEU metrics, regex heuristics, and static analysis. These evaluators are particularly effective for tasks with well-defined expected outputs where we can compare the model response against a known correct answer.
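As a rough sketch (not the actual TaskBench evaluators), deterministic checks such as exact match and regex heuristics can be expressed in a few lines:

```python
import re

# Minimal sketch of string-level checks (not the actual TaskBench evaluators).

def _normalize(text: str) -> str:
    return " ".join(text.lower().split())

def exact_match(prediction: str, reference: str) -> bool:
    """Case- and whitespace-insensitive exact match."""
    return _normalize(prediction) == _normalize(reference)

def regex_check(prediction: str, pattern: str) -> bool:
    """Pass if the prediction contains a required pattern, e.g. a date or ID format."""
    return re.search(pattern, prediction) is not None

print(exact_match("  Paris ", "paris"))              # True
print(regex_check("Order #4521 closed", r"#\d{4}"))  # True
```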
Execution-Level Evaluators
For tasks where the output needs to be executed, we test functionality rather than just syntax. For example, with SQL generation tasks, we run the generated queries against mock databases and compare the results. This ensures that the model output not only looks correct but actually produces the intended results when executed.
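A minimal sketch of this idea for SQL, assuming an in-memory SQLite mock database; the schema, data, and queries are invented for illustration:

```python
import sqlite3

# Minimal sketch of execution-level SQL checking against an in-memory mock database.
# Schema, data, and queries are invented for illustration.

def results_match(generated_sql: str, reference_sql: str, setup_sql: str) -> bool:
    """Run both queries on the same mock database and compare their result sets."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(setup_sql)
        generated = conn.execute(generated_sql).fetchall()
        reference = conn.execute(reference_sql).fetchall()
        # Order-insensitive comparison: equivalent queries may order rows differently.
        return sorted(generated) == sorted(reference)
    except sqlite3.Error:
        return False  # queries that fail to execute count as unsuccessful
    finally:
        conn.close()

setup = """
CREATE TABLE orders (id INTEGER, total REAL);
INSERT INTO orders VALUES (1, 10.0), (2, 15.5);
"""
print(results_match("SELECT SUM(total) FROM orders",
                    "SELECT 10.0 + 15.5", setup))  # True
```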
LLM-Based Evaluators
We use LLMs to judge other LLMs on semantic accuracy, factual correctness, style, and safety. This approach is particularly useful for open-ended or context-heavy tasks where there is no single correct answer. LLM-based evaluation allows us to assess the quality of responses that require nuanced understanding and reasoning.
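A minimal sketch of an LLM-as-judge check, assuming a placeholder `call_llm` client and an invented rubric; the real TaskBench judge prompts and criteria may differ.

```python
from typing import Callable

# Minimal sketch of an LLM-as-judge check. `call_llm` is a placeholder for whatever
# client queries the judge model; the rubric and verdict format are assumptions.

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Is the candidate answer factually consistent with the reference? Reply PASS or FAIL."""

def llm_judge(question: str, reference: str, candidate: str,
              call_llm: Callable[[str], str]) -> bool:
    """Return True if the judge model deems the candidate answer acceptable."""
    verdict = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    return verdict.strip().upper().startswith("PASS")
```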