Context Reasoning

Context understanding and reasoning tasks test whether a model can answer accurately from a provided context. This capability is essential for knowledge-base support bots, policy lookup systems, and internal knowledge Q&A applications. Models are evaluated on whether their answers are properly grounded in the given context rather than hallucinated.
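Concretely, an evaluation of this kind pairs each sample with a reference context and checks whether the model's answer is both correct and supported by that context. The function below is a minimal illustrative sketch of such a grounding check on the 0/50/100 scale used in the results; the rubric is an assumption for illustration, not the benchmark's actual grader.

```python
def grade_answer(answer: str, context: str, reference: str) -> int:
    """Assumed illustrative rubric on a 0/50/100 scale:
    100 if the answer contains the reference fact and that fact appears
    in the context; 50 if the answer at least overlaps the context;
    0 if the answer is unsupported (likely hallucinated)."""
    answer_l = answer.lower()
    context_l = context.lower()
    ref_l = reference.lower()
    if ref_l in answer_l and ref_l in context_l:
        return 100  # correct and grounded in the context
    if any(tok in context_l for tok in answer_l.split()):
        return 50   # partially overlaps the context
    return 0        # no support in the context

# Hypothetical sample: context, a grounded answer, and a hallucinated one.
ctx = "The refund window for annual plans is 30 days from purchase."
print(grade_answer("The refund window is 30 days.", ctx, "30 days"))  # grounded
print(grade_answer("Refunds take 90 days.", ctx, "30 days"))          # ungrounded
```

A production grader would typically use an LLM judge or entailment model rather than string matching, but the scoring shape (full, partial, or no grounding) is the same.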

Model: openai/gpt-4o
Score: 70
Average duration: 27s
Average tokens: 18971
Average cost: $0.00

Sample                     Score   Duration   Tokens
opper_context_sample_01        0        25s     2901
opper_context_sample_02      100        25s     2862
opper_context_sample_03      100         7s     2849
opper_context_sample_04       50        26s     3005
opper_context_sample_05        0        26s     2892
opper_context_sample_06      100        25s     2866
opper_context_sample_07        0        26s     2918
opper_context_sample_08        0        26s     2940
opper_context_sample_09      100        26s     2886
opper_context_sample_10      100        18s     2903
opper_context_sample_11      100        25s     3318
opper_context_sample_12      100        18s     3268
opper_context_sample_13      100        25s     3244
opper_context_sample_14      100        18s     3249
opper_context_sample_15      100        25s     3198
opper_context_sample_16      100        18s     3227
opper_context_sample_17      100        25s     3272
opper_context_sample_18      100        26s     6018
opper_context_sample_19      100        18s     6113
opper_context_sample_20       50        26s     6089
opper_context_sample_21       50        26s     6254
opper_context_sample_22       50        18s     6069
opper_context_sample_23        0        25s     6088
opper_context_sample_24      100     1m 47s     6163
opper_context_sample_25      100        26s    79094
opper_context_sample_26      100        18s    79137
opper_context_sample_27        0        25s    79082
opper_context_sample_28      100        26s    79069
opper_context_sample_29        0        53s    79110
opper_context_sample_30      100        24s    79047