Context Reasoning

Context understanding and reasoning tasks test whether a model can answer questions accurately using only the provided context. This capability is essential for knowledge-base support bots, policy lookup systems, and internal knowledge Q&A applications. Models are scored on how well their answers stay grounded in the given context rather than hallucinating information.
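
To make the task concrete, below is a minimal sketch of what a context-grounded Q&A check could look like. It is not the harness behind the results that follow; the helper names and the 0/50/100 scoring rule are illustrative assumptions.

```python
# Illustrative sketch of a context-grounded Q&A check. This is NOT the actual
# evaluation harness used for these results; the helper names (ask_model,
# grade_answer) and the scoring rule are assumptions for illustration only.

def ask_model(context: str, question: str) -> str:
    # Placeholder for a real model call. In practice this would send the
    # context and question to the model under test and return its answer.
    return "Customers can request a refund within 30 days of purchase."

def grade_answer(answer: str, expected_facts: list[str]) -> int:
    # Assumed rubric: 100 if every expected fact appears in the answer,
    # 50 if only some do, 0 if none do.
    hits = sum(fact.lower() in answer.lower() for fact in expected_facts)
    if hits == len(expected_facts):
        return 100
    return 50 if hits > 0 else 0

sample = {
    "context": "Refund requests are accepted within 30 days of purchase.",
    "question": "How long do customers have to request a refund?",
    "expected_facts": ["30 days"],
}

answer = ask_model(sample["context"], sample["question"])
print(grade_answer(answer, sample["expected_facts"]))  # 100
```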

| Model     | Score | Average duration | Average tokens | Average cost |
|-----------|-------|------------------|----------------|--------------|
| openai/o3 | 93    | 33s              | 19285          | $0.00        |
Per-sample results for openai/o3:

| Sample                  | Score | Duration | Tokens |
|-------------------------|-------|----------|--------|
| opper_context_sample_01 | 100   | 28s      | 4371   |
| opper_context_sample_02 | 100   | 24s      | 3007   |
| opper_context_sample_03 | 100   | 24s      | 3140   |
| opper_context_sample_04 | 50    | 1m 48s   | 5148   |
| opper_context_sample_05 | 50    | 22s      | 3180   |
| opper_context_sample_06 | 100   | 23s      | 3000   |
| opper_context_sample_07 | 100   | 23s      | 3220   |
| opper_context_sample_08 | 100   | 24s      | 2919   |
| opper_context_sample_09 | 100   | 12s      | 3045   |
| opper_context_sample_10 | 100   | 23s      | 3073   |
| opper_context_sample_11 | 100   | 23s      | 3654   |
| opper_context_sample_12 | 100   | 24s      | 3431   |
| opper_context_sample_13 | 100   | 22s      | 3399   |
| opper_context_sample_14 | 50    | 22s      | 3391   |
| opper_context_sample_15 | 100   | 24s      | 3330   |
| opper_context_sample_16 | 100   | 22s      | 3393   |
| opper_context_sample_17 | 100   | 22s      | 3467   |
| opper_context_sample_18 | 100   | 22s      | 6215   |
| opper_context_sample_19 | 100   | 22s      | 6360   |
| opper_context_sample_20 | 100   | 22s      | 6479   |
| opper_context_sample_21 | 100   | 22s      | 6594   |
| opper_context_sample_22 | 100   | 9s       | 6140   |
| opper_context_sample_23 | 50    | 22s      | 6328   |
| opper_context_sample_24 | 100   | 24s      | 6600   |
| opper_context_sample_25 | 100   | 24s      | 79211  |
| opper_context_sample_26 | 100   | 24s      | 79299  |
| opper_context_sample_27 | 100   | 24s      | 79225  |
| opper_context_sample_28 | 100   | 22s      | 79343  |
| opper_context_sample_29 | 100   | 34s      | 79502  |
| opper_context_sample_30 | 100   | 4m 14s   | 79087  |
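
As a cross-check on the summary figures, the sketch below recomputes the headline score and the average token count from the per-sample rows above, assuming both are plain arithmetic means (the aggregation method is not stated here).

```python
# Roll up the per-sample rows above into the summary figures.
# Assumption: the headline score and "Average tokens" are arithmetic means.

scores = [100, 100, 100, 50, 50] + [100] * 8 + [50] + [100] * 8 + [50] + [100] * 7
tokens = [4371, 3007, 3140, 5148, 3180, 3000, 3220, 2919, 3045, 3073,
          3654, 3431, 3399, 3391, 3330, 3393, 3467, 6215, 6360, 6479,
          6594, 6140, 6328, 6600, 79211, 79299, 79225, 79343, 79502, 79087]

print(round(sum(scores) / len(scores)))  # 93, matching the headline score
print(round(sum(tokens) / len(tokens)))  # 19285, matching "Average tokens"
```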