Context Reasoning

Context understanding and reasoning tasks test whether a model can answer questions accurately using only the provided context. This capability is essential for knowledge-base support bots, policy lookup systems, and internal knowledge Q&A applications. Models are scored on how well their answers stay grounded in the supplied context rather than hallucinating information.
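The harness and scoring code behind the numbers below are not shown on this page. As a rough illustration only, a grounded-answer check for a single test case could look like the following sketch; every name in it (the Sample dataclass, naive_grounding_score, the example context) is hypothetical and not part of the actual benchmark.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    context: str    # passage the answer must be grounded in
    question: str
    reference: str  # expected answer, stated somewhere in the context


def naive_grounding_score(sample: Sample, model_answer: str) -> int:
    """Toy scorer: 100 for a fully grounded match of the reference,
    50 for a partial match, 0 otherwise."""
    answer = model_answer.lower()
    context = sample.context.lower()
    reference = sample.reference.lower()

    # The reference must itself appear in the context; otherwise the
    # sample is malformed and scores 0.
    if not all(term in context for term in reference.split()):
        return 0
    if reference in answer:
        return 100
    if any(term in answer for term in reference.split()):
        return 50
    return 0


if __name__ == "__main__":
    sample = Sample(
        context="Refunds are issued within 14 days of an approved return request.",
        question="How long do refunds take?",
        reference="14 days",
    )
    print(naive_grounding_score(sample, "Refunds are issued within 14 days."))  # -> 100
```

A real evaluator would typically use a stronger check than keyword overlap (for example, an LLM-as-judge comparing the answer against the context), but the input/output shape per test case is the same: context plus question in, a 0–100 grounding score out.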

Model: anthropic/claude-3.7-sonnet
Score: 87
Average duration: 13s
Average tokens: 22081
Average cost: $0.00
Per-sample results:

Test case                Score  Duration  Tokens
opper_context_sample_01      0       17s    3688
opper_context_sample_02    100        9s    3434
opper_context_sample_03    100       11s    3535
opper_context_sample_04     50       17s    3864
opper_context_sample_05    100       11s    3640
opper_context_sample_06    100       10s    3444
opper_context_sample_07    100        9s    3411
opper_context_sample_08    100        9s    3512
opper_context_sample_09    100       12s    3590
opper_context_sample_10    100        8s    3440
opper_context_sample_11    100       10s    4269
opper_context_sample_12    100        9s    4260
opper_context_sample_13    100       10s    4318
opper_context_sample_14    100        8s    4274
opper_context_sample_15    100        7s    4160
opper_context_sample_16    100        7s    4184
opper_context_sample_17    100       12s    4297
opper_context_sample_18    100       19s    8270
opper_context_sample_19    100       12s    8227
opper_context_sample_20     50       15s    8219
opper_context_sample_21    100       13s    8188
opper_context_sample_22    100        8s    7983
opper_context_sample_23    100       14s    8212
opper_context_sample_24    100       15s    8161
opper_context_sample_25    100       18s   89977
opper_context_sample_26    100       22s   90165
opper_context_sample_27    100       18s   89927
opper_context_sample_28    100       17s   89930
opper_context_sample_29      0       19s   90012
opper_context_sample_30      0       19s   89840
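The summary figures above are consistent with plain means over the 30 rows: the per-sample scores sum to 2600, and 2600 / 30 ≈ 86.7, which rounds to the headline 87; the displayed durations average to about 13s and the token counts to about 22081. A minimal sketch of that aggregation is shown below; the helper name and the example rows are illustrative, not the actual reporting code.

```python
def summarize(rows: list[tuple[int, int, int]]) -> dict[str, int]:
    """rows are (score, duration_seconds, tokens) per test case;
    returns rounded means for the summary line."""
    n = len(rows)
    return {
        "score": round(sum(r[0] for r in rows) / n),
        "avg_duration_s": round(sum(r[1] for r in rows) / n),
        "avg_tokens": round(sum(r[2] for r in rows) / n),
    }


# e.g. the first three rows of the table above:
print(summarize([(0, 17, 3688), (100, 9, 3434), (100, 11, 3535)]))
```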