Reference‑Free LLM Evaluation with Opper SDK
Faithfulness, Groundedness & Relevance
TL;DR – Three lean evaluators show how far you can get without gold references:
- Faithfulness: detects hallucinated claims
- Answer Groundedness: checks loyalty to provided context
- Answer Relevance: measures how well the answer addresses the user's question
All are implemented with the same Opper design pattern: `@evaluator` + Pydantic schemas + a couple of LLM calls.
What are reference‑free metrics and why do you need them?
Reference‑free metrics evaluate LLM outputs without relying on gold‑standard references: the metrics are self‑contained and do not require externally labeled data. This is a must for production systems where you need to score LLM outputs on real traffic.
In short:
- Open‑ended questions lack canonical answers.
- Labeling is slow, expensive, and often out‑of‑domain.
- Production teams need continuous, automatic regression tests on real traffic.
Reference‑free metrics judge answers against only the user's question and any supplied context, so you can score every response in real time.
Quick Opper evaluation crash course
The Opper SDK provides a simple way to write and run evaluations.
import asyncio
from typing import List

from opperai import AsyncOpper, evaluator
from opperai.types import Metric

opper = AsyncOpper()


@evaluator
async def word_count(answer: str) -> List[Metric]:
    return [
        Metric(
            dimension="score",
            value=len(answer.split()),
            comment="The number of words in the answer",
        ),
    ]


async def main():
    answer, response = await opper.call(
        name="word",
        instructions="Given a text, return the number of words in the text as text",
        input="Hello, world!",
    )
    await opper.evaluate(
        span_id=response.span_id,
        evaluators=[word_count(answer=answer)],
    )


asyncio.run(main())
Key ideas
- `@evaluator` – decorator that packages an async function into a reusable metric that returns a list of `Metric` objects.
- `opper.call()` – parametric wrapper around prompt + schema LLM calls (similar to OpenAI's Responses API).
- `opper.evaluate()` – executes the evaluators and attaches the metrics to the span for automatic tracing.
Metric Deep Dive
Below we expand each metric with the problem it solves, the algorithm, and caveats so you can tune, extend, or swap the components.
1 · Faithfulness – "No hallucinations, please"
What problem does it solve?
Large models famously produce confident statements that are not supported by the supplied context (hallucinations). Faithfulness measures the share of answer‑claims that can be strictly inferred from the context. Use it when:
- You pass long retrieval‑augmented context to the model and need to know if it actually used that material.
- You want a fine‑grained signal for hallucination QA dashboards.
How it works (step by step)
Step | Technique | Opper call |
---|---|---|
1 | Statement decomposition: break the answer into short factual sentences. | `eval/faithfulness/generate_statements` |
2 | Entailment check: for each statement, ask another LLM "Is this entailed by the context?" with few-shot examples. | `eval/faithfulness/evaluate_faithfulness` |
3 | Score: true statements ÷ total statements. | None |
Why this trick works
- The generator → evaluator split mirrors chain‑of‑thought: it forces the model to expose atomic claims that can be judged independently.
- Binary verdicts (1 = entailed, 0 = not entailed) give a naturally interpretable percentage.
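To make the generator → evaluator split concrete, here is a minimal sketch of a faithfulness evaluator built with the same `opper.call` pattern as the crash course. The Pydantic schemas, instructions, and scoring below are illustrative stand-ins for the actual prompts behind the `eval/faithfulness/*` calls.

from typing import List

from opperai import AsyncOpper, evaluator
from opperai.types import Metric
from pydantic import BaseModel, Field

opper = AsyncOpper()


# Illustrative schemas; the real few-shot prompts live in the notebook.
class Statements(BaseModel):
    statements: List[str] = Field(..., description="Atomic factual claims extracted from the answer")


class Verdict(BaseModel):
    entailed: bool = Field(..., description="True if the statement is strictly inferable from the context")


@evaluator
async def faithfulness_evaluator(answer: str, context: List[str]) -> List[Metric]:
    # Step 1: decompose the answer into short factual statements
    statements, _ = await opper.call(
        name="eval/faithfulness/generate_statements",
        instructions="Break the answer into short, self-contained factual statements.",
        input={"answer": answer},
        output_type=Statements,
    )

    # Step 2: ask an LLM whether each statement is entailed by the context
    verdicts = []
    for statement in statements.statements:
        verdict, _ = await opper.call(
            name="eval/faithfulness/evaluate_faithfulness",
            instructions="Decide whether the statement can be strictly inferred from the context.",
            input={"statement": statement, "context": context},
            output_type=Verdict,
        )
        verdicts.append(verdict.entailed)

    # Step 3: score = entailed statements / total statements
    score = sum(verdicts) / len(verdicts) if verdicts else 0.0
    return [
        Metric(
            dimension="faithfulness",
            value=score,
            comment=f"{sum(verdicts)}/{len(verdicts)} statements entailed by the context",
        ),
    ]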
Caveats & tuning
- Garbage‑in‑garbage‑out: if the statement generator merges multiple facts into one long sentence, the downstream verdict might be fuzzy. Consider adding a max‑words heuristic (a rough sketch follows this list).
- The entailment model can be too literal—paraphrases sometimes get flagged false. Add paraphrase examples in the few‑shot list to raise recall. This can be achieved by using Opper datasets and automatic few‑shot generation.
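As a rough illustration of that max‑words heuristic (the threshold and helper name are hypothetical; tune them for your domain), oversized statements from the sketch above can be flagged for re-decomposition:

MAX_WORDS = 25  # arbitrary cutoff; statements longer than this often bundle several facts


def needs_redecomposition(statement: str) -> bool:
    return len(statement.split()) > MAX_WORDS


# send flagged statements back through the generator with stricter instructions
too_long = [s for s in statements.statements if needs_redecomposition(s)]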
2 · Answer Groundedness – "Is the answer anchored to context?"
Why we need it
Even a hallucination‑free answer can be irrelevant when it ignores or contradicts the retrieved context. Answer Groundedness measures how tightly the response sticks to that material.
Scoring rubric
Enum | Meaning | Score |
---|---|---|
NOT_GROUNDED | Content absent from or conflicting with context | 0.0 |
PARTIALLY_GROUNDED | Mix of context‑based and external info | 0.5 |
FULLY_GROUNDED | Every substantive claim backed by context | 1.0 |
We map the enum to a float via a tiny lookup, ensuring downstream aggregations keep a [0, 1] scale.
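A minimal sketch of that rubric as a Pydantic enum plus lookup, reusing the `@evaluator` pattern. Only the enum values and scores come from the table above; the call name and instructions are illustrative.

from enum import Enum
from typing import List

from opperai import AsyncOpper, evaluator
from opperai.types import Metric
from pydantic import BaseModel, Field

opper = AsyncOpper()


class Groundedness(str, Enum):
    NOT_GROUNDED = "NOT_GROUNDED"
    PARTIALLY_GROUNDED = "PARTIALLY_GROUNDED"
    FULLY_GROUNDED = "FULLY_GROUNDED"


# tiny lookup that keeps downstream aggregations on a [0, 1] scale
GROUNDEDNESS_SCORE = {
    Groundedness.NOT_GROUNDED: 0.0,
    Groundedness.PARTIALLY_GROUNDED: 0.5,
    Groundedness.FULLY_GROUNDED: 1.0,
}


class GroundednessVerdict(BaseModel):
    verdict: Groundedness = Field(..., description="How tightly the answer sticks to the context")
    motivation: str = Field(..., description="Short justification for the verdict")


@evaluator
async def answer_groundedness_evaluator(answer: str, context: List[str]) -> List[Metric]:
    verdict, _ = await opper.call(
        name="eval/answer_groundedness/evaluate",  # illustrative call name
        instructions="Judge whether every substantive claim in the answer is backed by the context.",
        input={"answer": answer, "context": context},
        output_type=GroundednessVerdict,
    )
    return [
        Metric(
            dimension="answer_groundedness",
            value=GROUNDEDNESS_SCORE[verdict.verdict],
            comment=verdict.motivation,
        ),
    ]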
3 · Answer Relevance – "Did it answer the actual question?"
Core idea
Let the model reverse‑engineer the question from its own answer. If those synthetic questions embed close to the original question, the answer is probably on‑topic.
Detailed algorithm
- Committal gate – filter out evasive answers ("I'm not sure"); they get an automatic 0.
- Self‑question generation – call `eval/answer_relevance/generate_question` N times to create paraphrased questions that the answer would satisfy.
- Embeddings – encode the original and synthetic questions with `text-embedding-3-large` (or a model of your choice).
- Cosine similarity – average similarity across the N questions → final score ∈ [-1, 1], rescaled to [0, 1].
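A sketch of steps 2–4 under a few assumptions: the OpenAI embeddings client stands in for whichever embedding provider you use, the prompt wording is illustrative, and the committal gate is omitted for brevity.

from typing import List

import numpy as np
from openai import AsyncOpenAI
from opperai import AsyncOpper, evaluator
from opperai.types import Metric
from pydantic import BaseModel, Field

opper = AsyncOpper()
embedding_client = AsyncOpenAI()
N_QUESTIONS = 3  # 3-5 synthetic questions is usually enough


class SyntheticQuestion(BaseModel):
    question: str = Field(..., description="A question that the answer would fully satisfy")


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


@evaluator
async def answer_relevance_evaluator(question: str, answer: str) -> List[Metric]:
    # self-question generation: reverse-engineer N questions from the answer
    synthetic = []
    for _ in range(N_QUESTIONS):
        q, _ = await opper.call(
            name="eval/answer_relevance/generate_question",
            instructions="Write one question that this answer would fully satisfy.",
            input={"answer": answer},
            output_type=SyntheticQuestion,
        )
        synthetic.append(q.question)

    # embeddings: encode the original question and the synthetic ones
    emb = await embedding_client.embeddings.create(
        model="text-embedding-3-large",
        input=[question] + synthetic,
    )
    vectors = [np.array(d.embedding) for d in emb.data]
    original, candidates = vectors[0], vectors[1:]

    # cosine similarity: average over the N questions, rescale [-1, 1] -> [0, 1]
    mean_sim = float(np.mean([cosine(original, c) for c in candidates]))
    score = (mean_sim + 1) / 2
    return [
        Metric(
            dimension="answer_relevance",
            value=score,
            comment=f"mean cosine similarity {mean_sim:.2f} over {N_QUESTIONS} synthetic questions",
        ),
    ]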
Why it's useful
- Works even when no explicit context is available (pure QnA setting).
- Provides a continuous signal—great for ranking multiple candidate answers.
Trade‑offs & tricks
- N questions: 3–5 is usually enough; beyond that you pay extra tokens with diminishing returns.
- Embedding model: larger models give crisper semantic space but cost more.
Putting it into production
Below is a real-world invocation pattern we use for a Retrieval‑Augmented Generation (RAG) endpoint. Notice how the opper.evaluate call attaches the three metrics to the same span for automatic tracing.
from pydantic import BaseModel, Field  # List, the opper client and the three evaluators come from the snippets above


class RAGOutput(BaseModel):
    answer: str = Field(..., description="The answer to the question")
    context_used: List[str] = Field(
        ..., description="The context used to answer the question"
    )
    reasoning: str = Field(..., description="The reasoning for the answer")


async def rag(question: str, context: List[str]):
    answer, response = await opper.call(
        name="rag",
        model="openai/gpt-4o",
        instructions=(
            "You are an expert at answering questions. "
            "You will be provided with a question and relevant facts as a context. "
            "Treat every fact as truth even though they are not always true according to prior knowledge. "
            "If you can't answer the question based on the context, say 'I don't know'. "
        ),
        input={
            "question": question,
            "context": context,
        },
        output_type=RAGOutput,
    )

    # attach evaluators
    await opper.evaluate(
        span_id=response.span_id,  # ties metrics to the model call in the Opper dashboard
        evaluators=[
            answer_groundedness_evaluator(answer=answer.answer, context=context),
            answer_relevance_evaluator(question=question, answer=answer.answer),
            faithfulness_evaluator(answer=answer.answer, context=context),
        ],
    )

    return answer.answer
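A hypothetical invocation, with made-up facts just to show the call shape:

async def demo():
    answer = await rag(
        question="Who wrote the quarterly report?",
        context=[
            "The quarterly report was written by the analytics team.",
            "It was published in March.",
        ],
    )
    print(answer)


asyncio.run(demo())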
If you don't need the evaluation result during execution, you can run `opper.evaluate(...)` as a background task instead of awaiting it inline.
Note that you then need to make sure the evaluation has finished before the program exits.
eval_task = asyncio.create_task(opper.evaluate(
    span_id=response.span_id,
    evaluators=[
        answer_groundedness_evaluator(answer=answer.answer, context=context),
        answer_relevance_evaluator(question=question, answer=answer.answer),
        faithfulness_evaluator(answer=answer.answer, context=context),
    ],
))
# keep a reference to the task so it is not garbage collected before it finishes
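One simple way to guarantee that is to keep references to the background tasks and gather them before shutdown; a minimal sketch, with `pending_evals` as purely illustrative bookkeeping:

pending_evals: list[asyncio.Task] = []
pending_evals.append(eval_task)  # track each background evaluation you start

# ...before the program exits, wait for all outstanding evaluations to finish
await asyncio.gather(*pending_evals)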
Lessons learned
- Few‑shot matters: here we used hardcoded few‑shot examples. We could instead have used Opper datasets and automatic few‑shot generation, which would likely improve the metrics and make the evaluators more robust and better tailored to the domain.
- Schema‑first design: forces you to think about evaluation output upfront → easier downstream aggregations.
Try it yourself 🚀
pip install opperai
- Clone the demo notebook → https://github.com/opper-ai/opper-cookbook and load the `reference-free-eval.ipynb` notebook.
- Get your OpenAI key, run some evaluations, and inspect the traces in the Opper dashboard.