By Mattias Lundell

Reference‑Free LLM Evaluation with Opper SDK

Faithfulness, Groundedness & Relevance

TL;DR – Three lean evaluators show how far you can get without gold references:

  • Faithfulness: detects hallucinated claims
  • Answer Groundedness: checks loyalty to provided context
  • Answer Relevance: measures how well the answer addresses the user's question

All are implemented with the same Opper design pattern—@evaluator + Pydantic schemas + a couple of LLM calls.

What are reference‑free metrics and why do you need them?

Reference‑free metrics evaluate LLM outputs without relying on gold‑standard references: they are self‑contained and need no external data. That makes them a must for production systems, where you have to score LLM outputs on real traffic.

In short:

Reference‑free metrics judge answers against only the user's question and any supplied context, so you can score every response in real time.

Quick Opper evaluation crash course

The Opper SDK provides a simple way to write and run evaluations.

import asyncio
from typing import List
from opperai import AsyncOpper, evaluator
from opperai.types import Metric

opper = AsyncOpper()

@evaluator
async def word_count(answer: str) -> List[Metric]:
    return [
        Metric(
            dimension="score",
            value=len(answer.split()),
            comment="The number of words in the answer",
        ),
    ]

async def main():
    answer, response = await opper.call(
        name="word",
        instructions="Given an text, return the number of words in the text as text",
        input="Hello, world!",
    )

    await opper.evaluate(
        span_id=response.span_id,
        evaluators=[word_count(answer=answer)],
    )

asyncio.run(main())

Key ideas

  • An evaluator is a plain async function decorated with @evaluator that returns a list of Metric objects.
  • Each Metric carries a dimension, a numeric value, and an optional comment.
  • opper.evaluate attaches those metrics to the span_id of the call they score, so they show up next to the trace in the Opper dashboard.

Metric Deep Dive

Below we expand each metric with the problem it solves, the algorithm, and caveats so you can tune, extend, or swap the components.

1 · Faithfulness – "No hallucinations, please"

What problem does it solve?

Large models famously produce confident statements that are not supported by the supplied context (hallucinations). Faithfulness measures the share of answer claims that can be strictly inferred from the context. Use it whenever the answer must not go beyond what the supplied context actually states, as in the RAG setup later in this post.

How it works (step by step)

  1. Statement decomposition – break the answer into short factual sentences (Opper call: eval/faithfulness/generate_statements).
  2. Entailment check – for each statement, ask another LLM "Is this entailed by the context?" with few-shot examples (Opper call: eval/faithfulness/evaluate_faithfulness).
  3. Score – supported statements ÷ total statements (no extra Opper call; see the code sketch after the diagram).
graph LR
    A[Answer] --> B[Generate Statements]
    B --> C[Statement 1]
    B --> D[Statement 2]
    B --> E[Statement N]
    C --> F[Evaluate against Context]
    D --> F
    E --> F
    F --> G[Calculate Faithfulness Score as percentage of statements supported by context]
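To make the steps concrete, here is a minimal sketch of a faithfulness evaluator built from the same opper.call + Pydantic pattern as the crash-course example. The output schemas and prompt wording are illustrative assumptions, not the exact prompts behind the eval/faithfulness/* calls.

from typing import List

from pydantic import BaseModel, Field
from opperai import AsyncOpper, evaluator
from opperai.types import Metric

opper = AsyncOpper()

class Statements(BaseModel):
    statements: List[str] = Field(..., description="Short, atomic factual claims from the answer")

class Verdict(BaseModel):
    supported: bool = Field(..., description="True if the statement is entailed by the context")

@evaluator
async def faithfulness_evaluator(answer: str, context: List[str]) -> List[Metric]:
    # Step 1: decompose the answer into short, self-contained statements
    statements, _ = await opper.call(
        name="eval/faithfulness/generate_statements",
        instructions="Break the answer into short, self-contained factual statements.",
        input={"answer": answer},
        output_type=Statements,
    )

    # Step 2: ask a second model whether each statement is entailed by the context
    supported = 0
    for statement in statements.statements:
        verdict, _ = await opper.call(
            name="eval/faithfulness/evaluate_faithfulness",
            instructions="Answer whether the statement can be strictly inferred from the context.",
            input={"statement": statement, "context": context},
            output_type=Verdict,
        )
        supported += verdict.supported

    # Step 3: faithfulness = supported statements / total statements
    total = len(statements.statements) or 1
    return [
        Metric(
            dimension="faithfulness",
            value=supported / total,
            comment=f"{supported}/{total} statements supported by the context",
        ),
    ]

The entailment checks run sequentially here to keep the sketch short; in practice you would fan them out with asyncio.gather to cut latency.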

Why this trick works

Judging one short, atomic statement against the context is a far easier task for an LLM than judging a whole answer at once, so the per‑statement entailment verdicts are more reliable, and the final ratio yields a graded score rather than a single pass/fail.

Caveats & tuning

2 · Answer Groundedness – "Is the answer anchored to context?"

Why we need it

Even a hallucination‑free answer can be irrelevant when it ignores or contradicts the retrieved context. Answer Groundedness measures how tightly the response sticks to that material.

Scoring rubric

  • NOT_GROUNDED – content absent from or conflicting with the context → 0.0
  • PARTIALLY_GROUNDED – mix of context‑based and external info → 0.5
  • FULLY_GROUNDED – every substantive claim backed by the context → 1.0

We map the enum to a float via a tiny lookup, ensuring downstream aggregations keep a [0‥1] scale.
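Here is a minimal sketch of that pattern: the enum and lookup mirror the rubric above, while the call name and prompt wording are illustrative assumptions.

from enum import Enum
from typing import List

from pydantic import BaseModel, Field
from opperai import AsyncOpper, evaluator
from opperai.types import Metric

opper = AsyncOpper()

class Groundedness(str, Enum):
    NOT_GROUNDED = "NOT_GROUNDED"
    PARTIALLY_GROUNDED = "PARTIALLY_GROUNDED"
    FULLY_GROUNDED = "FULLY_GROUNDED"

# tiny lookup that keeps downstream aggregations on a [0, 1] scale
GROUNDEDNESS_SCORES = {
    Groundedness.NOT_GROUNDED: 0.0,
    Groundedness.PARTIALLY_GROUNDED: 0.5,
    Groundedness.FULLY_GROUNDED: 1.0,
}

class GroundednessVerdict(BaseModel):
    groundedness: Groundedness = Field(..., description="How tightly the answer sticks to the context")
    reasoning: str = Field(..., description="Short justification for the verdict")

@evaluator
async def answer_groundedness_evaluator(answer: str, context: List[str]) -> List[Metric]:
    # single classification call against the rubric
    verdict, _ = await opper.call(
        name="eval/answer_groundedness/evaluate",
        instructions=(
            "Classify how well the answer is grounded in the provided context: "
            "FULLY_GROUNDED, PARTIALLY_GROUNDED or NOT_GROUNDED."
        ),
        input={"answer": answer, "context": context},
        output_type=GroundednessVerdict,
    )
    return [
        Metric(
            dimension="answer_groundedness",
            value=GROUNDEDNESS_SCORES[verdict.groundedness],
            comment=verdict.reasoning,
        ),
    ]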

3 · Answer Relevance – "Did it answer the actual question?"

Core idea

Let the model reverse‑engineer the question from its own answer. If those synthetic questions embed close to the original question, the answer is probably on‑topic.

Detailed algorithm

  1. Committal gate – filter out evasive answers ("I'm not sure"); they get an automatic 0.
  2. Self‑question generation – call eval/answer_relevance/generate_question N times to create paraphrased questions that the answer would satisfy.
  3. Embeddings – encode the original and synthetic questions with text-embedding-3-large (or a model of your choice).
  4. Cosine similarity – average the similarity across the N questions → final score ∈ [‑1, 1], rescaled to [0, 1]. A code sketch of the full pipeline follows the diagram below.
flowchart LR
    A[Answer] --> B{Is the answer committal?}
    B -->|Yes| C[Generate paraphrased questions that the answer would satisfy]
    B -->|No| D[Final score is 0]
    C --> E[Embed the original and paraphrased questions]
    E --> F[Calculate the cosine similarity between the original and paraphrased questions]
    F --> G[Final score is the average cosine similarity between the original and paraphrased questions normalized to a range between 0 and 1]
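A condensed sketch of the pipeline is below. For brevity the committal gate is folded into the question-generation schema, and the embedding step calls the OpenAI client directly; both are choices made for this sketch rather than requirements of the metric.

import math
from typing import List

from openai import AsyncOpenAI
from pydantic import BaseModel, Field
from opperai import AsyncOpper, evaluator
from opperai.types import Metric

opper = AsyncOpper()
openai_client = AsyncOpenAI()  # used only for the embedding step

class SyntheticQuestion(BaseModel):
    committal: bool = Field(..., description="False if the answer is evasive, e.g. 'I'm not sure'")
    question: str = Field(..., description="A question that the answer would satisfy")

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

@evaluator
async def answer_relevance_evaluator(question: str, answer: str, n: int = 3) -> List[Metric]:
    # Steps 1 & 2: committal gate + self-question generation
    synthetic: List[str] = []
    for _ in range(n):
        generated, _ = await opper.call(
            name="eval/answer_relevance/generate_question",
            instructions=(
                "Given an answer, decide whether it is committal and, if so, "
                "write a question that this answer would satisfy."
            ),
            input={"answer": answer},
            output_type=SyntheticQuestion,
        )
        if not generated.committal:
            return [Metric(dimension="answer_relevance", value=0.0, comment="Evasive answer")]
        synthetic.append(generated.question)

    # Step 3: embed the original question and the synthetic ones
    embeddings = await openai_client.embeddings.create(
        model="text-embedding-3-large",
        input=[question, *synthetic],
    )
    original, rest = embeddings.data[0].embedding, embeddings.data[1:]

    # Step 4: average cosine similarity, rescaled from [-1, 1] to [0, 1]
    avg = sum(cosine(original, item.embedding) for item in rest) / len(rest)
    return [
        Metric(
            dimension="answer_relevance",
            value=(avg + 1) / 2,
            comment=f"Average cosine similarity over {n} synthetic questions",
        ),
    ]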

Why it's useful

The metric needs nothing beyond the question and the answer itself, so it works even when no context was retrieved, and the committal gate means evasive non‑answers are penalized rather than rewarded.

Trade‑offs & tricks

Putting it into production

Below is a real-world invocation pattern we use for a Retrieval‑Augmented Generation (RAG) endpoint. Notice how the opper.evaluate call attaches the three metrics to the same span for automatic tracing.

from typing import List

from pydantic import BaseModel, Field

# opper (AsyncOpper) and the three evaluators are defined as in the snippets above

class RAGOutput(BaseModel):
    answer: str = Field(..., description="The answer to the question")
    context_used: List[str] = Field(
        ..., description="The context used to answer the question"
    )
    reasoning: str = Field(..., description="The reasoning for the answer")

async def rag(question: str, context: List[str]):
    answer, response = await opper.call(
        name="rag",
        model="openai/gpt-4o",
        instructions=(
            "You are an expert at answering questions. "
            "You will be provided with a question and relevant facts as a context. "
            "Treat every fact as truth even though they are not always true according to prior knowledge. "
            "If you can't answer the question based on the context, say 'I don't know'. "
        ),
        input={
            "question": question, 
            "context": context,
        },
        output_type=RAGOutput,
    )

    # attach evaluators
    await opper.evaluate(
        span_id=response.span_id,  # ties metrics to model call in Opper dashboard
        evaluators=[
            answer_groundedness_evaluator(answer=answer.answer, context=context),
            answer_relevance_evaluator(question=question, answer=answer.answer),
            faithfulness_evaluator(answer=answer.answer, context=context),
        ],
    )

    return answer.answer

If you don't need the evaluation result during execution, you can run await opper.evaluate(...) in a background task, as shown below. Just make sure the task has finished before the program exits.

asyncio.create_task(opper.evaluate(
    span_id=response.span_id, 
    evaluators=[
        answer_groundedness_evaluator(answer=answer.answer, context=context),
        answer_relevance_evaluator(question=question, answer=answer.answer),
        faithfulness_evaluator(answer=answer.answer, context=context),
    ],
))
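One way to satisfy that caveat, sketched under the assumption that you control the shutdown path, is to keep a reference to the task and await it before the event loop closes:

# keep a reference so the task is not garbage-collected and can be awaited later
eval_task = asyncio.create_task(opper.evaluate(
    span_id=response.span_id,
    evaluators=[...],  # the same three evaluators as above
))

# ... handle the rest of the request ...

# before the program exits
await eval_task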

Lessons learned

Try it yourself 🚀

  1. pip install opperai
  2. Clone the demo notebook → https://github.com/opper-ai/opper-cookbook and load the reference-free-eval.ipynb notebook.
  3. Get your OpenAI key, run some evaluations and inspect the traces in the Opper dashboard.