Reference‑Free LLM Evaluation with Opper SDK
Faithfulness, Groundedness & Relevance
TL;DR – Three lean evaluators show how far you can get without gold references:
- Faithfulness: detects hallucinated claims
- Answer Groundedness: checks loyalty to provided context
- Answer Relevance: measures how well the answer addresses the user's question
All are implemented with the same Opper design pattern: `@evaluator` + Pydantic schemas + a couple of LLM calls.
What are reference‑free metrics and why do you need them?
Reference‑free metrics evaluate LLM outputs without relying on gold‑standard references: the metrics are self‑contained and do not require externally labeled data. This is a must for production systems where you need to score LLM outputs on real traffic.
In short:
- Open‑ended questions lack canonical answers.
- Labeling is slow, expensive, and often out‑of‑domain.
- Production teams need continuous, automatic regression tests on real traffic.
Reference‑free metrics judge answers against only the user's question and any supplied context, so you can score every response in real time.
Quick Opper evaluation crash course
The Opper SDK provides a simple way to write and run evaluations.
import asyncio
from typing import List

from opperai import AsyncOpper, evaluator
from opperai.types import Metric

opper = AsyncOpper()


@evaluator
async def word_count(answer: str) -> List[Metric]:
    return [
        Metric(
            dimension="score",
            value=len(answer.split()),
            comment="The number of words in the answer",
        ),
    ]


async def main():
    answer, response = await opper.call(
        name="word",
        instructions="Given a text, return the number of words in the text as text",
        input="Hello, world!",
    )
    await opper.evaluate(
        span_id=response.span_id,
        evaluators=[word_count(answer=answer)],
    )


asyncio.run(main())
Key ideas
- `@evaluator` – decorator that packages an async function into a reusable metric that returns a list of `Metric` objects.
- `opper.call()` – parametric wrapper around prompt + schema LLM calls (similar to OpenAI's Responses API).
- `opper.evaluate()` – executes the evaluators and attaches the metrics to the span for automatic tracing.
Metric Deep Dive
Below we expand each metric with the problem it solves, the algorithm, and caveats so you can tune, extend, or swap the components.
1 · Faithfulness – "No hallucinations, please"
What problem does it solve?
Large models famously produce confident statements that are not supported by the supplied context (hallucinations). Faithfulness measures the share of answer‑claims that can be strictly inferred from the context. Use it when:
- You pass long retrieval‑augmented context to the model and need to know if it actually used that material.
- You want a fine‑grained signal for hallucination QA dashboards.
How it works (step by step)
Step | Technique | Opper call |
---|---|---|
1 | Statement decomposition: break the answer into short factual sentences. | `eval/faithfulness/generate_statements` |
2 | Entailment check: for each statement, ask another LLM "Is this entailed by the context?" with few-shot examples. | `eval/faithfulness/evaluate_faithfulness` |
3 | Score: true statements ÷ total statements. | None |
Why this trick works
- The generator → evaluator split mirrors chain‑of‑thought: it forces the model to expose atomic claims that can be judged independently.
- Binary verdicts (1 = entailed, 0 = not entailed) give a naturally interpretable percentage.
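To make the generator → evaluator split concrete, here is a minimal sketch of a faithfulness evaluator built with the same `opper.call` pattern as the crash course. The Pydantic schemas, instructions, and scoring below are illustrative stand-ins for the actual prompts behind the `eval/faithfulness/*` calls.

from typing import List

from opperai import AsyncOpper, evaluator
from opperai.types import Metric
from pydantic import BaseModel, Field

opper = AsyncOpper()


# Illustrative schemas; the real few-shot prompts live in the notebook.
class Statements(BaseModel):
    statements: List[str] = Field(..., description="Atomic factual claims extracted from the answer")


class Verdict(BaseModel):
    entailed: bool = Field(..., description="True if the statement is strictly inferable from the context")


@evaluator
async def faithfulness_evaluator(answer: str, context: List[str]) -> List[Metric]:
    # Step 1: decompose the answer into short factual statements
    statements, _ = await opper.call(
        name="eval/faithfulness/generate_statements",
        instructions="Break the answer into short, self-contained factual statements.",
        input={"answer": answer},
        output_type=Statements,
    )

    # Step 2: ask an LLM whether each statement is entailed by the context
    verdicts = []
    for statement in statements.statements:
        verdict, _ = await opper.call(
            name="eval/faithfulness/evaluate_faithfulness",
            instructions="Decide whether the statement can be strictly inferred from the context.",
            input={"statement": statement, "context": context},
            output_type=Verdict,
        )
        verdicts.append(verdict.entailed)

    # Step 3: score = entailed statements / total statements
    score = sum(verdicts) / len(verdicts) if verdicts else 0.0
    return [
        Metric(
            dimension="faithfulness",
            value=score,
            comment=f"{sum(verdicts)}/{len(verdicts)} statements entailed by the context",
        ),
    ]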
Caveats & tuning
- Garbage‑in‑garbage‑out: if the statement generator merges multiple facts into one long sentence, the downstream verdict might be fuzzy. Consider adding a max‑words heuristic (a rough sketch follows this list).
- The entailment model can be too literal—paraphrases sometimes get flagged false. Add paraphrase examples in the few‑shot list to raise recall. This can be achieved by using Opper datasets and automatic few‑shot generation.
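As a rough illustration of that max‑words heuristic (the threshold and helper name are hypothetical; tune them for your domain), oversized statements from the sketch above can be flagged for re-decomposition:

MAX_WORDS = 25  # arbitrary cutoff; statements longer than this often bundle several facts


def needs_redecomposition(statement: str) -> bool:
    return len(statement.split()) > MAX_WORDS


# send flagged statements back through the generator with stricter instructions
too_long = [s for s in statements.statements if needs_redecomposition(s)]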
2 · Answer Groundedness – "Is the answer anchored to context?"
Why we need it
Even a hallucination‑free answer can be irrelevant when it ignores or contradicts the retrieved context. Answer Groundedness measures how tightly the response sticks to that material.
Scoring rubric
Enum | Meaning | Score |
---|---|---|
NOT_GROUNDED | Content absent from or conflicting with context | 0.0 |
PARTIALLY_GROUNDED | Mix of context‑based and external info | 0.5 |
FULLY_GROUNDED | Every substantive claim backed by context | 1.0 |
We map the enum to a float via a tiny lookup, ensuring downstream aggregations keep a [0, 1] scale.
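A minimal sketch of that rubric as a Pydantic enum plus lookup, reusing the `@evaluator` pattern. Only the enum values and scores come from the table above; the call name and instructions are illustrative.

from enum import Enum
from typing import List

from opperai import AsyncOpper, evaluator
from opperai.types import Metric
from pydantic import BaseModel, Field

opper = AsyncOpper()


class Groundedness(str, Enum):
    NOT_GROUNDED = "NOT_GROUNDED"
    PARTIALLY_GROUNDED = "PARTIALLY_GROUNDED"
    FULLY_GROUNDED = "FULLY_GROUNDED"


# tiny lookup that keeps downstream aggregations on a [0, 1] scale
GROUNDEDNESS_SCORE = {
    Groundedness.NOT_GROUNDED: 0.0,
    Groundedness.PARTIALLY_GROUNDED: 0.5,
    Groundedness.FULLY_GROUNDED: 1.0,
}


class GroundednessVerdict(BaseModel):
    verdict: Groundedness = Field(..., description="How tightly the answer sticks to the context")
    motivation: str = Field(..., description="Short justification for the verdict")


@evaluator
async def answer_groundedness_evaluator(answer: str, context: List[str]) -> List[Metric]:
    verdict, _ = await opper.call(
        name="eval/answer_groundedness/evaluate",  # illustrative call name
        instructions="Judge whether every substantive claim in the answer is backed by the context.",
        input={"answer": answer, "context": context},
        output_type=GroundednessVerdict,
    )
    return [
        Metric(
            dimension="answer_groundedness",
            value=GROUNDEDNESS_SCORE[verdict.verdict],
            comment=verdict.motivation,
        ),
    ]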
3 · Answer Relevance – "Did it answer the actual question?"
Core idea
Let the model reverse‑engineer the question from its own answer. If those synthetic questions embed close to the original question, the answer is probably on‑topic.
Detailed algorithm
- Committal gate – filter out evasive answers ("I'm not sure"); they get an automatic 0.
- Self‑question generation – call `eval/answer_relevance/generate_question` N times to create paraphrased questions that the answer would satisfy.
- Embeddings – encode the original and synthetic questions with `text-embedding-3-large` (or a model of your choice).
- Cosine similarity – average similarity across the N questions → final score ∈ [-1, 1], rescaled to [0, 1].
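A sketch of steps 2–4 under a few assumptions: the OpenAI embeddings client stands in for whichever embedding provider you use, the prompt wording is illustrative, and the committal gate is omitted for brevity.

from typing import List

import numpy as np
from openai import AsyncOpenAI
from opperai import AsyncOpper, evaluator
from opperai.types import Metric
from pydantic import BaseModel, Field

opper = AsyncOpper()
embedding_client = AsyncOpenAI()
N_QUESTIONS = 3  # 3-5 synthetic questions is usually enough


class SyntheticQuestion(BaseModel):
    question: str = Field(..., description="A question that the answer would fully satisfy")


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


@evaluator
async def answer_relevance_evaluator(question: str, answer: str) -> List[Metric]:
    # self-question generation: reverse-engineer N questions from the answer
    synthetic = []
    for _ in range(N_QUESTIONS):
        q, _ = await opper.call(
            name="eval/answer_relevance/generate_question",
            instructions="Write one question that this answer would fully satisfy.",
            input={"answer": answer},
            output_type=SyntheticQuestion,
        )
        synthetic.append(q.question)

    # embeddings: encode the original question and the synthetic ones
    emb = await embedding_client.embeddings.create(
        model="text-embedding-3-large",
        input=[question] + synthetic,
    )
    vectors = [np.array(d.embedding) for d in emb.data]
    original, candidates = vectors[0], vectors[1:]

    # cosine similarity: average over the N questions, rescale [-1, 1] -> [0, 1]
    mean_sim = float(np.mean([cosine(original, c) for c in candidates]))
    score = (mean_sim + 1) / 2
    return [
        Metric(
            dimension="answer_relevance",
            value=score,
            comment=f"mean cosine similarity {mean_sim:.2f} over {N_QUESTIONS} synthetic questions",
        ),
    ]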
Why it's useful
- Works even when no explicit context is available (pure QnA setting).
- Provides a continuous signal—great for ranking multiple candidate answers.
Trade‑offs & tricks
- N questions: 3–5 is usually enough; beyond that you pay extra tokens with diminishing returns.
- Embedding model: larger models give crisper semantic space but cost more.
Putting it into production
Below is a real-world invocation pattern we use for a Retrieval‑Augmented Generation (RAG) endpoint. Notice how the opper.evaluate call attaches the three metrics to the same span for automatic tracing.
from pydantic import BaseModel, Field  # List, the opper client and the three evaluators come from the snippets above


class RAGOutput(BaseModel):
    answer: str = Field(..., description="The answer to the question")
    context_used: List[str] = Field(
        ..., description="The context used to answer the question"
    )
    reasoning: str = Field(..., description="The reasoning for the answer")


async def rag(question: str, context: List[str]):
    answer, response = await opper.call(
        name="rag",
        model="openai/gpt-4o",
        instructions=(
            "You are an expert at answering questions. "
            "You will be provided with a question and relevant facts as a context. "
            "Treat every fact as truth even though they are not always true according to prior knowledge. "
            "If you can't answer the question based on the context, say 'I don't know'. "
        ),
        input={
            "question": question,
            "context": context,
        },
        output_type=RAGOutput,
    )

    # attach evaluators
    await opper.evaluate(
        span_id=response.span_id,  # ties metrics to the model call in the Opper dashboard
        evaluators=[
            answer_groundedness_evaluator(answer=answer.answer, context=context),
            answer_relevance_evaluator(question=question, answer=answer.answer),
            faithfulness_evaluator(answer=answer.answer, context=context),
        ],
    )

    return answer.answer
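A hypothetical invocation, with made-up facts just to show the call shape:

async def demo():
    answer = await rag(
        question="Who wrote the quarterly report?",
        context=[
            "The quarterly report was written by the analytics team.",
            "It was published in March.",
        ],
    )
    print(answer)


asyncio.run(demo())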
If you don't need the evaluation result during execution, you can run `opper.evaluate(...)` as a background task instead of awaiting it inline.
Note that you then need to make sure the evaluation has finished before the program exits.
eval_task = asyncio.create_task(opper.evaluate(
    span_id=response.span_id,
    evaluators=[
        answer_groundedness_evaluator(answer=answer.answer, context=context),
        answer_relevance_evaluator(question=question, answer=answer.answer),
        faithfulness_evaluator(answer=answer.answer, context=context),
    ],
))
# keep a reference to the task so it is not garbage collected before it finishes
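One simple way to guarantee that is to keep references to the background tasks and gather them before shutdown; a minimal sketch, with `pending_evals` as purely illustrative bookkeeping:

pending_evals: list[asyncio.Task] = []
pending_evals.append(eval_task)  # track each background evaluation you start

# ...before the program exits, wait for all outstanding evaluations to finish
await asyncio.gather(*pending_evals)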
Lessons learned
- Few‑shot matters: here we used hardcoded few‑shot examples. We could instead have used Opper datasets and automatic few‑shot generation, which would likely improve the metrics and make the evaluators more robust and better tailored to the domain.
- Schema‑first design: forces you to think about evaluation output upfront → easier downstream aggregations.
Try it yourself 🚀
pip install opperai
- Clone the demo notebook → https://github.com/opper-ai/opper-cookbook and load the `reference-free-eval.ipynb` notebook.
- Get your OpenAI key, run some evaluations, and inspect the traces in the Opper dashboard.