Context Reasoning

Context understanding and reasoning tasks test a model's ability to produce accurate answers grounded in the provided context. This capability is essential for knowledge-base support bots, policy lookup systems, and internal knowledge Q&A applications. Models are scored on whether their answers are properly grounded in the given context rather than hallucinated.
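To make the evaluation idea concrete, below is a minimal Python sketch of how a context-grounded Q&A check might be structured. It is an illustration only, not the benchmark's actual harness: the `ask_model` helper is hypothetical, and the grading rule (100 / 50 / 0 based on expected facts appearing in the answer) is an assumed simplification.

```python
# Minimal sketch of a context-grounded Q&A check.
# Assumption: a hypothetical ask_model(prompt) helper returns the model's
# answer as a string. Scoring values mirror the 100/50 scale used below.

def build_prompt(context: str, question: str) -> str:
    """Instruct the model to answer only from the supplied context."""
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

def score_answer(answer: str, expected_facts: list[str]) -> int:
    """Toy grading: 100 if all expected facts appear, 50 if some, 0 if none."""
    hits = sum(1 for fact in expected_facts if fact.lower() in answer.lower())
    if hits == len(expected_facts):
        return 100
    return 50 if hits > 0 else 0

# Example usage (hypothetical sample):
# context = "The refund window for annual plans is 30 days."
# answer = ask_model(build_prompt(context, "How long is the refund window?"))
# print(score_answer(answer, ["30 days"]))
```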

Model: openai/o1
Score: 92
Average duration: 29s
Average tokens: 19637
Average cost: $0.00
Per-sample results:

Sample                     Score   Duration   Tokens
opper_context_sample_01    100     35s        4981
opper_context_sample_02    100     59s        3554
opper_context_sample_03    100     18s        3486
opper_context_sample_04    100     37s        5294
opper_context_sample_05    50      26s        3771
opper_context_sample_06    100     9s         3283
opper_context_sample_07    100     18s        3607
opper_context_sample_08    100     26s        3611
opper_context_sample_09    100     26s        3427
opper_context_sample_10    100     11s        3236
opper_context_sample_11    100     26s        3924
opper_context_sample_12    100     1m 32s     3754
opper_context_sample_13    100     16s        3539
opper_context_sample_14    50      12s        3610
opper_context_sample_15    100     18s        3391
opper_context_sample_16    50      12s        3647
opper_context_sample_17    100     18s        3866
opper_context_sample_18    100     1m 14s     6569
opper_context_sample_19    100     25s        6564
opper_context_sample_20    50      17s        6783
opper_context_sample_21    100     31s        8304
opper_context_sample_22    100     16s        6702
opper_context_sample_23    50      27s        7254
opper_context_sample_24    100     25s        6684
opper_context_sample_25    100     26s        79501
opper_context_sample_26    100     1m 12s     79462
opper_context_sample_27    100     25s        79428
opper_context_sample_28    100     42s        79220
opper_context_sample_29    100     22s        79474
opper_context_sample_30    100     16s        79195