Context Reasoning

Context understanding and reasoning tasks test whether a model can answer accurately from the context it is given. This capability is essential for knowledge-base support bots, policy lookup systems, and internal knowledge Q&A applications. Models are evaluated on how reliably their answers stay grounded in the provided context rather than hallucinating information.
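
As an illustration, a minimal harness for this kind of evaluation can be sketched with the OpenAI Python SDK: prompt the model with the context plus a question, then grade the answer against expected facts. The prompt wording, sample structure, and keyword-based grader below are illustrative assumptions, not the actual Opper evaluation code.

```python
# Minimal sketch of a context-grounded Q&A check.
# Assumptions: the OpenAI Python SDK is installed and OPENAI_API_KEY is set;
# the sample format and the keyword-based grader are illustrative only.
from openai import OpenAI

client = OpenAI()


def answer_from_context(context: str, question: str, model: str = "o4-mini") -> str:
    """Ask the model to answer strictly from the supplied context."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer using only the provided context. "
                    "If the context does not contain the answer, say so."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content


def score_sample(answer: str, expected_facts: list[str]) -> int:
    """Toy grader: percentage of expected facts mentioned in the answer
    (a stand-in for whatever grading rule the real benchmark uses)."""
    if not expected_facts:
        return 0
    hits = sum(fact.lower() in answer.lower() for fact in expected_facts)
    return round(100 * hits / len(expected_facts))


if __name__ == "__main__":
    context = "The warranty period for model X is 24 months from the date of purchase."
    answer = answer_from_context(context, "How long is the warranty for model X?")
    print(answer, score_sample(answer, ["24 months"]))
```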

openai/o4-mini
Score: 88
Average duration: 43s
Average tokens: 19445
Average cost: $0.00
Sample                     Score   Duration   Tokens
opper_context_sample_01    100     34s        3964
opper_context_sample_02    50      2m 3s      3214
opper_context_sample_03    100     35s        3096
opper_context_sample_04    50      35s        4924
opper_context_sample_05    0       11s        3270
opper_context_sample_06    100     35s        3024
opper_context_sample_07    50      33s        3306
opper_context_sample_08    100     1m 13s     3506
opper_context_sample_09    100     33s        3489
opper_context_sample_10    100     9s         3172
opper_context_sample_11    100     44s        3669
opper_context_sample_12    100     26s        3673
opper_context_sample_13    100     35s        3663
opper_context_sample_14    100     1m 17s     3687
opper_context_sample_15    100     1m 26s     3417
opper_context_sample_16    100     9s         3440
opper_context_sample_17    100     35s        4101
opper_context_sample_18    100     2m 46s     6251
opper_context_sample_19    100     33s        6394
opper_context_sample_20    100     9s         6326
opper_context_sample_21    100     16s        6982
opper_context_sample_22    100     35s        6217
opper_context_sample_23    0       33s        6602
opper_context_sample_24    100     35s        6886
opper_context_sample_25    100     1m 12s     79614
opper_context_sample_26    100     35s        79967
opper_context_sample_27    100     36s        79254
opper_context_sample_28    100     33s        79168
opper_context_sample_29    100     35s        79812
opper_context_sample_30    100     10s        79256
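
The headline score of 88 is consistent with the mean of the per-sample scores above: (25 × 100 + 3 × 50 + 2 × 0) / 30 ≈ 88.3. The reported average of 19445 tokens likewise matches the mean of the per-sample token counts.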