Agents
AI agent reasoning and tool selection tests planning, tool choice, and self-diagnosis capabilities. These skills are crucial for autonomous ticket triage systems and complex agent workflows where mode...
Benchmark results across diverse tasks, including reasoning, generation, and understanding, so you can choose the right model with confidence.
AI agent reasoning and tool selection tests planning, tool choice, and self-diagnosis capabilities. These skills are crucial for autonomous ticket triage systems and complex agent workflows where mode...
Context understanding and reasoning tasks test accurate answers grounded in provided context. This capability is essential for knowledge-base support bots, policy lookup systems, and internal knowledg...
Data processing and normalization tasks evaluate structured output from messy prose and different structures. This capability is essential for catalogue and product pipelines where data needs to be ex...
Natural language to SQL query generation evaluates text-to-query fidelity and schema reasoning. This task is particularly relevant for analytics chat assistants and simplified database interfaces wher...
Multilingual comprehension and reasoning in Swedish, covering fact checking, summarization, inference, literary and legal analysis, and dialectal understanding. This task evaluates how well models fol...