Context Reasoning

Context understanding and reasoning tasks test whether a model can answer accurately using only the provided context. This capability is essential for knowledge-base support bots, policy lookup systems, and internal knowledge Q&A applications. Models are scored on how well their answers are grounded in the given context rather than hallucinated.
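As a rough illustration of what "grounded in the given context" means, the sketch below scores an answer by the fraction of its sentences whose content words all appear in the context. This is a hypothetical lexical proxy written for this page, not the actual grading method used in these evaluations, which would typically rely on reference answers or an LLM judge.

```python
import re


def grounding_score(answer: str, context: str) -> float:
    """Fraction of answer sentences whose words all appear in the context.

    A crude lexical proxy for groundedness: a sentence containing words
    absent from the context is treated as potentially hallucinated.
    """
    context_words = set(re.findall(r"[a-z0-9]+", context.lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    if not sentences:
        return 0.0
    grounded = sum(
        1
        for sentence in sentences
        if (words := set(re.findall(r"[a-z0-9]+", sentence.lower())))
        and words <= context_words
    )
    return grounded / len(sentences)
```

For example, against the context "The refund policy allows returns within 30 days of purchase.", the answer "Returns within 30 days of purchase." scores 1.0, while "Refunds take 90 business days." scores 0.0 because it introduces words the context never mentions.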

anthropic/claude-opus-4
  Average score:    93
  Average duration: 27s
  Average tokens:   22049
  Average cost:     $0.00
Sample                     Score   Duration   Tokens
opper_context_sample_01      100        41s     3802
opper_context_sample_02      100        16s     3368
opper_context_sample_03      100        26s     3480
opper_context_sample_04       50        30s     3552
opper_context_sample_05        0        26s     3449
opper_context_sample_06      100        16s     3344
opper_context_sample_07      100        32s     3560
opper_context_sample_08      100        24s     3410
opper_context_sample_09      100        20s     3440
opper_context_sample_10      100        24s     3387
opper_context_sample_11      100        36s     4435
opper_context_sample_12      100        26s     4289
opper_context_sample_13      100        24s     4272
opper_context_sample_14      100        24s     4276
opper_context_sample_15      100        12s     4158
opper_context_sample_16      100        14s     4173
opper_context_sample_17      100        22s     4299
opper_context_sample_18      100        26s     8004
opper_context_sample_19      100        42s     8151
opper_context_sample_20      100        23s     8021
opper_context_sample_21      100        32s     8170
opper_context_sample_22      100        18s     7973
opper_context_sample_23       50        26s     8035
opper_context_sample_24      100        28s     8165
opper_context_sample_25      100        50s    90087
opper_context_sample_26      100        23s    89899
opper_context_sample_27      100        18s    89907
opper_context_sample_28      100        52s    90495
opper_context_sample_29      100        34s    90077
opper_context_sample_30      100        18s    89793