Context Reasoning

Context understanding and reasoning tasks test whether a model can answer questions accurately using only the context it is given. This capability is essential for knowledge-base support bots, policy lookup systems, and internal knowledge Q&A applications. Models are evaluated on whether their answers are grounded in that context rather than hallucinated.

Model: openai/o3-mini
Score: 80
Average duration: 35s
Average tokens: 19674
Average cost: $0.00

Sample                    Score  Duration  Tokens
opper_context_sample_01     100  1m 26s      4785
opper_context_sample_02       0  28s         3077
opper_context_sample_03     100  28s         3173
opper_context_sample_04      50  49s         7906
opper_context_sample_05       0  28s         3459
opper_context_sample_06     100  23s         3069
opper_context_sample_07      50  28s         3562
opper_context_sample_08     100  14s         3835
opper_context_sample_09     100  23s         3505
opper_context_sample_10     100  27s         3305
opper_context_sample_11     100  13s         3959
opper_context_sample_12     100  23s         3552
opper_context_sample_13     100  27s         3535
opper_context_sample_14      50  8s          3491
opper_context_sample_15     100  6s          3315
opper_context_sample_16      50  23s         3443
opper_context_sample_17     100  9s          3669
opper_context_sample_18     100  27s         6369
opper_context_sample_19     100  1m 30s      6121
opper_context_sample_20     100  25s         8472
opper_context_sample_21     100  27s         6991
opper_context_sample_22     100  2m 45s      6542
opper_context_sample_23       0  23s         7015
opper_context_sample_24     100  23s         6879
opper_context_sample_25     100  16s        79536
opper_context_sample_26     100  15s        79549
opper_context_sample_27     100  1m 25s     79647
opper_context_sample_28     100  27s        79389
opper_context_sample_29       0  57s        79859
opper_context_sample_30     100  27s        79216
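
The headline values can be cross-checked against the per-sample results above. The following is a minimal sketch, assuming the summary figures are unweighted means over the 30 samples; the page does not state how they are aggregated, although the numbers are consistent with that reading. The names `ROWS`, `parse_duration`, and `summarize` are hypothetical helpers introduced only for this illustration.

```python
import re
from statistics import mean

# (score, duration, tokens) for samples 01..30, copied from the results above.
ROWS = [
    (100, "1m 26s", 4785), (0, "28s", 3077), (100, "28s", 3173),
    (50, "49s", 7906), (0, "28s", 3459), (100, "23s", 3069),
    (50, "28s", 3562), (100, "14s", 3835), (100, "23s", 3505),
    (100, "27s", 3305), (100, "13s", 3959), (100, "23s", 3552),
    (100, "27s", 3535), (50, "8s", 3491), (100, "6s", 3315),
    (50, "23s", 3443), (100, "9s", 3669), (100, "27s", 6369),
    (100, "1m 30s", 6121), (100, "25s", 8472), (100, "27s", 6991),
    (100, "2m 45s", 6542), (0, "23s", 7015), (100, "23s", 6879),
    (100, "16s", 79536), (100, "15s", 79549), (100, "1m 25s", 79647),
    (100, "27s", 79389), (0, "57s", 79859), (100, "27s", 79216),
]


def parse_duration(text: str) -> int:
    """Convert a string such as '1m 26s' or '28s' to whole seconds."""
    match = re.fullmatch(r"(?:(\d+)m\s*)?(\d+)s", text.strip())
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds)


def summarize(rows):
    """Return unweighted means of score, duration (seconds), and tokens."""
    return {
        "score": mean(score for score, _, _ in rows),
        "duration_s": mean(parse_duration(d) for _, d, _ in rows),
        "tokens": mean(tokens for _, _, tokens in rows),
    }


summary = summarize(ROWS)
print(summary["score"], summary["duration_s"], round(summary["tokens"]))
```

Under that assumption the means come out to a score of 80, a duration of 35 seconds, and roughly 19674 tokens, matching the summary. Note that the token average is pulled up by samples 25 through 30, which each use about 79,000 tokens, while the other 24 samples use between roughly 3,000 and 8,500.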