By Mattias Lundell

RAG metrics: answer correctness

In this blog post, we will implement and explore a metric for evaluating the quality of model-generated answers. Specifically, we will focus on the answer correctness metric, which measures how accurate these answers are. A notebook with the full code is available at opper-cookbook.

Answer Correctness

Answer correctness is a metric that measures the accuracy of the answer generated by a RAG pipeline. It is calculated by comparing the generated answer to the ground truth answer and counting the number of true positives, false positives, and false negatives among their statements.

Given these classifications, we can calculate the correctness score using the formula for the F1 score:

score = tp / (tp + 0.5 * (fp + fn))

A higher score indicates higher correctness. See F-score for more information.
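As a quick, made-up illustration: with 2 true positives, 1 false positive and 1 false negative, the formula gives 2 / (2 + 0.5 * 2) ≈ 0.67:

tp, fp, fn = 2, 1, 1  # made-up counts for illustration
score = tp / (tp + 0.5 * (fp + fn))
print(round(score, 2))  # 0.67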

Implementing Answer Correctness in Opper

To model this in Opper we need a couple of classes. We start by defining a Reason class that represents a statement and the reason why it was classified as such. We then define a CorrectnessClassifications class that represents the classifications of the answer, together with a method to calculate the correctness score.

from typing import List
from pydantic import BaseModel, Field
from opperai import fn


class Reason(BaseModel):
    statement: str = Field(..., description="The statement that was classified")
    reason: str = Field(
        ..., description="The reason why the statement was classified as such"
    )


class CorrectnessClassifications(BaseModel):
    true_positives: List[Reason] = Field(..., description="True positives - statements that are present in the answer and directly supported by one or more statements in the ground truth")
    false_positives: List[Reason] = Field(..., description="False positives - statements that are present in the answer but not directly supported by any statement in the ground truth")
    false_negatives: List[Reason] = Field(..., description="False negatives - statements found in the ground truth but not present in the answer")

    @property
    def score(self) -> float:
        """Given an answer and a ground truth, calculate the correctness (f1) score."""

        tp = len(self.true_positives)
        fp = len(self.false_positives)
        fn = len(self.false_negatives)

        score = tp / (tp + 0.5 * (fp + fn)) if tp > 0 else 0

        return score
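Before involving a model we can sanity-check the score property by constructing a classification by hand (the statements below are made up for illustration):

manual = CorrectnessClassifications(
    true_positives=[Reason(statement="programming", reason="matches the ground truth")],
    false_positives=[Reason(statement="cleaning windows", reason="not in the ground truth")],
    false_negatives=[Reason(statement="writing", reason="missing from the answer")],
)

print(manual.score)  # 1 / (1 + 0.5 * (1 + 1)) = 0.5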

Finally, we define a Correctness class that contains the classify method to classify the answer.


class Correctness(BaseModel):
    @fn(model="openai/gpt4-turbo")
    def classify(answer: str, ground_truth: str) -> CorrectnessClassifications:
        """Given an answer and a ground truth, analyze each statement and classify it as belonging 
        to one of the classifications.

        NOTE Each statement can only belong to one classification.
        """

    def calculate(self, answer: str, ground_truth: str) -> float:
        """Given an answer and a ground truth, calculate the correctness score."""

        classifications = self.classify(answer=answer, ground_truth=ground_truth)

        return classifications.score

We can now try out our correctness calculation. We'll use a couple of samples from the Paul Graham qna dataset that can be found on [Hugging Face](https://huggingface.co/datasets/LangChainDatasets/question-answering-paul-graham).

First we try providing the same input as both answer and ground_truth:

correctness_calculator = Correctness()

classification = correctness_calculator.classify(
    answer="The two main things the author worked on before college were writing and programming.",
    ground_truth="The two main things the author worked on before college were writing and programming.",
)

print(classification)
print(classification.score)

This should give us a score of 1.0 since the answer is exactly the same as the ground truth.

true_positives=[Reason(statement='The two main things the author worked on before college were writing and programming.', reason='This statement is exactly the same as the ground truth.')] false_positives=[] false_negatives=[]
1.0

We now modify the answer by changing "writing" to "cleaning windows", which should give us a lower score.

classification = correctness_calculator.classify(
    answer="The two main things the author worked on before college were cleaning windows and programming.",
    ground_truth="The two main things the author worked on before college were writing and programming.",
)

print(classification)
print(classification.score)

This gives one true positive (programming), one false positive (cleaning windows) and one false negative (writing), so the score is 1 / (1 + 0.5 * 2) = 0.5:

true_positives=[Reason(statement='The two main things the author worked on before college were cleaning windows and programming.', reason='The statement correctly identifies programming as one of the main things the author worked on before college, as mentioned in the ground truth.')] false_positives=[Reason(statement='The two main things the author worked on before college were cleaning windows and programming.', reason='The mention of cleaning windows is incorrect, as the ground truth states the author worked on writing, not cleaning windows.')] false_negatives=[Reason(statement='The two main things the author worked on before college were writing and programming.', reason='The ground truth includes writing as one of the main focuses, which was not correctly mentioned in the answer.')]
0.5

Naive RAG in Opper

To try this out at a slightly larger scale, we can apply the metric to the Paul Graham qna dataset. To have something to benchmark, we create an index in Opper, upload the source essay, and then create a function that uses entries from the index to answer questions. This forms our RAG pipeline.

First we load the essay text into an Opper index.

from opperai import Opper

opper = Opper()

index = opper.indexes.get(name="qna")
if not index:
    index = opper.indexes.create("qna")
    res = index.upload_file("what_i_worked_on.txt")

We can now query the index:

res = index.query("What were the two main things the author worked on before college?")

print(res)
[RetrievalResponse(content='What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of Fortran. You had to type programs on punch cards, then stack them in the card reader and press a button to load the program into memory and run it. The result would ordinarily be to print something on the spectacularly loud printer.\n\nI was puzzled by the 1401. I couldn\'t figure out what to do with it. And in retrospect there\'s not much I could have done with it. The only form of input to programs was data stored on punched cards, and I didn\'t have any data stored on punched cards. The only other option was to do things that didn\'t rely on any input, like calculate approximations of pi, but I didn\'t know enough math to do anything interesting of that type. So I\'m not surprised I can\'t remember any programs I wrote, because they can\'t have done much. My clearest memory is of the moment I learned it was possible for programs not to terminate, when one of mine didn\'t. On a machine without time-sharing, this was a social as well as a technical error, as the data center manager\'s expression made clear.', metadata={'source': '/tmp/tmpirtblwfw', 'file_name': 'what_i_worked_on.txt', '_opper_key': '4/2047/7a34d87d-0e05-4f8c-a397-2202fc0f9843', '_opper_index_file_id': 3004}), RetrievalResponse(content="I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI.\n\nAI was in the air in the mid 1980s, but there were two things especially that made me want to work on it: a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress, so I don't know how well it has aged, but when I read it I was drawn entirely into its world. It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDLU, it seemed like that time would be a few years at most. All you had to do was teach SHRDLU more words.\n\nThere weren't any classes in AI at Cornell then, not even graduate classes, so I started trying to teach myself. Which meant learning Lisp, since in those days Lisp was regarded as the language of AI. The commonly used programming languages then were pretty primitive, and programmers' ideas correspondingly so. The default language at Cornell was a Pascal-like language called PL/I, and the situation was similar elsewhere. Learning Lisp expanded my concept of a program so fast that it was years before I started to have a sense of where the new limits were. 
This was more like it; this was what I had expected college to do. It wasn't happening in a class, like it was supposed to, but that was ok. For the next couple years I was on a roll. I knew what I was going to do.\n\nFor my undergraduate thesis, I reverse-engineered SHRDLU. My God did I love working on that program. It was a pleasing bit of code, but what made it even more exciting was my belief — hard to imagine now, but not unique in 1985 — that it was already climbing the lower slopes of intelligence.", metadata={'source': '/tmp/tmpirtblwfw', 'file_name': 'what_i_worked_on.txt', '_opper_key': '4/2047/7a34d87d-0e05-4f8c-a397-2202fc0f9843', '_opper_index_file_id': 3004}), RetrievalResponse(content='I didn\'t see a way out of this situation. I didn\'t want to drop out of grad school, but how else was I going to get out? I remember when my friend Robert Morris got kicked out of Cornell for writing the internet worm of 1988, I was envious that he\'d found such a spectacular way to get out of grad school.\n\nThen one day in April 1990 a crack appeared in the wall. I ran into professor Cheatham and he asked if I was far enough along to graduate that June. I didn\'t have a word of my dissertation written, but in what must have been the quickest bit of thinking in my life, I decided to take a shot at writing one in the 5 weeks or so that remained before the deadline, reusing parts of On Lisp where I could, and I was able to respond, with no perceptible delay "Yes, I think so. I\'ll give you something to read in a few days."\n\nI picked applications of continuations as the topic. In retrospect I should have written about macros and embedded languages. There\'s a whole world there that\'s barely been explored. But all I wanted was to get out of grad school, and my rapidly written dissertation sufficed, just barely.\n\nMeanwhile I was applying to art schools. I applied to two: RISD in the US, and the Accademia di Belli Arti in Florence, which, because it was the oldest art school, I imagined would be good. RISD accepted me, and I never heard back from the Accademia, so off to Providence I went.\n\nI\'d applied for the BFA program at RISD, which meant in effect that I had to go to college again. This was not as strange as it sounds, because I was only 25, and art schools are full of people of different ages. RISD counted me as a transfer sophomore and said I had to do the foundation that summer. The foundation means the classes that everyone has to take in fundamental subjects like drawing, color, and design.', metadata={'source': '/tmp/tmpirtblwfw', 'file_name': 'what_i_worked_on.txt', '_opper_key': '4/2047/7a34d87d-0e05-4f8c-a397-2202fc0f9843', '_opper_index_file_id': 3004}), RetrievalResponse(content="With microcomputers, everything changed. Now you could have a computer sitting right in front of you, on a desk, that could respond to your keystrokes as it was running instead of just churning through a stack of punch cards and then stopping. [1]\n\nThe first of my friends to get a microcomputer built it himself. It was sold as a kit by Heathkit. I remember vividly how impressed and envious I felt watching him sitting in front of it, typing programs right into the computer.\n\nComputers were expensive in those days and it took me years of nagging before I convinced my father to buy one, a TRS-80, in about 1980. The gold standard then was the Apple II, but a TRS-80 was good enough. This was when I really started programming. 
I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book. There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but it was a lot better than a typewriter.\n\nThough I liked programming, I didn't plan to study it in college. In college I was going to study philosophy, which sounded much more powerful. It seemed, to my naive high school self, to be the study of the ultimate truths, compared to which the things studied in other fields would be mere domain knowledge. What I discovered when I got to college was that the other fields took up so much of the space of ideas that there wasn't much left for these supposed ultimate truths. All that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored.\n\nI couldn't have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI.", metadata={'source': '/tmp/tmpirtblwfw', 'file_name': 'what_i_worked_on.txt', '_opper_key': '4/2047/7a34d87d-0e05-4f8c-a397-2202fc0f9843', '_opper_index_file_id': 3004}), RetrievalResponse(content="So I looked around to see what I could salvage from the wreckage of my plans, and there was Lisp. I knew from experience that Lisp was interesting for its own sake and not just for its association with AI, even though that was the main reason people cared about it at the time. So I decided to focus on Lisp. In fact, I decided to write a book about Lisp hacking. It's scary to think how little I knew about Lisp hacking when I started writing that book. But there's nothing like writing a book about something to help you learn it. The book, On Lisp, wasn't published till 1993, but I wrote much of it in grad school.\n\nComputer Science is an uneasy alliance between two halves, theory and systems. The theory people prove things, and the systems people build things. I wanted to build things. I had plenty of respect for theory — indeed, a sneaking suspicion that it was the more admirable of the two halves — but building things seemed so much more exciting.\n\nThe problem with systems work, though, was that it didn't last. Any program you wrote today, no matter how good, would be obsolete in a couple decades at best. People might mention your software in footnotes, but no one would actually use it. And indeed, it would seem very feeble work. Only people with a sense of the history of the field would even realize that, in its time, it had been good.\n\nThere were some surplus Xerox Dandelions floating around the computer lab at one point. Anyone who wanted one to play around with could have one. I was briefly tempted, but they were so slow by present standards; what was the point? No one else wanted one either, so off they went. That was what happened to systems work.\n\nI wanted not just to build things, but to build things that would last.", metadata={'source': '/tmp/tmpirtblwfw', 'file_name': 'what_i_worked_on.txt', '_opper_key': '4/2047/7a34d87d-0e05-4f8c-a397-2202fc0f9843', '_opper_index_file_id': 3004}), RetrievalResponse(content='As well as HN, I wrote all of YC\'s internal software in Arc. 
But while I continued to work a good deal in Arc, I gradually stopped working on Arc, partly because I didn\'t have time to, and partly because it was a lot less attractive to mess around with the language now that we had all this infrastructure depending on it. So now my three projects were reduced to two: writing essays and working on YC.\n\nYC was different from other kinds of work I\'ve done. Instead of deciding for myself what to work on, the problems came to me. Every 6 months there was a new batch of startups, and their problems, whatever they were, became our problems. It was very engaging work, because their problems were quite varied, and the good founders were very effective. If you were trying to learn the most you could about startups in the shortest possible time, you couldn\'t have picked a better way to do it.\n\nThere were parts of the job I didn\'t like. Disputes between cofounders, figuring out when people were lying to us, fighting with people who maltreated the startups, and so on. But I worked hard even at the parts I didn\'t like. I was haunted by something Kevin Hale once said about companies: "No one works harder than the boss." He meant it both descriptively and prescriptively, and it was the second part that scared me. I wanted YC to be good, so if how hard I worked set the upper bound on how hard everyone else worked, I\'d better work very hard.\n\nOne day in 2010, when he was visiting California for interviews, Robert Morris did something astonishing: he offered me unsolicited advice. I can only remember him doing that once before. One day at Viaweb, when I was bent over double from a kidney stone, he suggested that it would be a good idea for him to take me to the hospital. That was what it took for Rtm to offer unsolicited advice. So I remember his exact words very clearly. "You know," he said, "you should make sure Y Combinator isn\'t the last cool thing you do."', metadata={'source': '/tmp/tmpirtblwfw', 'file_name': 'what_i_worked_on.txt', '_opper_key': '4/2047/7a34d87d-0e05-4f8c-a397-2202fc0f9843', '_opper_index_file_id': 3004}), RetrievalResponse(content="I wanted not just to build things, but to build things that would last.\n\nIn this dissatisfied state I went in 1988 to visit Rich Draves at CMU, where he was in grad school. One day I went to visit the Carnegie Institute, where I'd spent a lot of time as a kid. While looking at a painting there I realized something that might seem obvious, but was a big surprise to me. There, right on the wall, was something you could make that would last. Paintings didn't become obsolete. Some of the best ones were hundreds of years old.\n\nAnd moreover this was something you could make a living doing. Not as easily as you could by writing software, of course, but I thought if you were really industrious and lived really cheaply, it had to be possible to make enough to survive. And as an artist you could be truly independent. You wouldn't have a boss, or even need to get research funding.\n\nI had always liked looking at paintings. Could I make them? I had no idea. I'd never imagined it was even possible. I knew intellectually that people made art — that it didn't just appear spontaneously — but it was as if the people who made it were a different species. They either lived long ago or were mysterious geniuses doing strange things in profiles in Life magazine. The idea of actually being able to make art, to put that verb before that noun, seemed almost miraculous.\n\nThat fall I started taking art classes at Harvard. 
Grad students could take classes in any department, and my advisor, Tom Cheatham, was very easy going. If he even knew about the strange classes I was taking, he never said anything.\n\nSo now I was in a PhD program in computer science, yet planning to be an artist, yet also genuinely in love with Lisp hacking and working away at On Lisp. In other words, like many a grad student, I was working energetically on multiple projects that were not my thesis.", metadata={'source': '/tmp/tmpirtblwfw', 'file_name': 'what_i_worked_on.txt', '_opper_key': '4/2047/7a34d87d-0e05-4f8c-a397-2202fc0f9843', '_opper_index_file_id': 3004}), RetrievalResponse(content='By then there was a name for the kind of company Viaweb was, an "application service provider," or ASP. This name didn\'t last long before it was replaced by "software as a service," but it was current for long enough that I named this new company after it: it was going to be called Aspra.\n\nI started working on the application builder, Dan worked on network infrastructure, and the two undergrads worked on the first two services (images and phone calls). But about halfway through the summer I realized I really didn\'t want to run a company — especially not a big one, which it was looking like this would have to be. I\'d only started Viaweb because I needed the money. Now that I didn\'t need money anymore, why was I doing this? If this vision had to be realized as a company, then screw the vision. I\'d build a subset that could be done as an open source project.\n\nMuch to my surprise, the time I spent working on this stuff was not wasted after all. After we started Y Combinator, I would often encounter startups working on parts of this new architecture, and it was very useful to have spent so much time thinking about it and even trying to write some of it.\n\nThe subset I would build as an open source project was the new Lisp, whose parentheses I now wouldn\'t even have to hide. A lot of Lisp hackers dream of building a new Lisp, partly because one of the distinctive features of the language is that it has dialects, and partly, I think, because we have in our minds a Platonic form of Lisp that all existing dialects fall short of. I certainly did. So at the end of the summer Dan and I switched to working on this new dialect of Lisp, which I called Arc, in a house I bought in Cambridge.', metadata={'source': '/tmp/tmpirtblwfw', 'file_name': 'what_i_worked_on.txt', '_opper_key': '4/2047/7a34d87d-0e05-4f8c-a397-2202fc0f9843', '_opper_index_file_id': 3004}), RetrievalResponse(content="I started writing essays again, and wrote a bunch of new ones over the next few months. I even wrote a couple that weren't about startups. Then in March 2015 I started working on Lisp again.\n\nThe distinctive thing about Lisp is that its core is a language defined by writing an interpreter in itself. It wasn't originally intended as a programming language in the ordinary sense. It was meant to be a formal model of computation, an alternative to the Turing machine. If you want to write an interpreter for a language in itself, what's the minimum set of predefined operators you need? The Lisp that John McCarthy invented, or more accurately discovered, is an answer to that question. [19]\n\nMcCarthy didn't realize this Lisp could even be used to program computers till his grad student Steve Russell suggested it. Russell translated McCarthy's interpreter into IBM 704 machine language, and from that point Lisp started also to be a programming language in the ordinary sense. 
But its origins as a model of computation gave it a power and elegance that other languages couldn't match. It was this that attracted me in college, though I didn't understand why at the time.\n\nMcCarthy's 1960 Lisp did nothing more than interpret Lisp expressions. It was missing a lot of things you'd want in a programming language. So these had to be added, and when they were, they weren't defined using McCarthy's original axiomatic approach. That wouldn't have been feasible at the time. McCarthy tested his interpreter by hand-simulating the execution of programs. But it was already getting close to the limit of interpreters you could test that way — indeed, there was a bug in it that McCarthy had overlooked. To test a more complicated interpreter, you'd have had to run it, and computers then weren't powerful enough.", metadata={'source': '/tmp/tmpirtblwfw', 'file_name': 'what_i_worked_on.txt', '_opper_key': '4/2047/7a34d87d-0e05-4f8c-a397-2202fc0f9843', '_opper_index_file_id': 3004}), RetrievalResponse(content="The following spring, lightning struck. I was invited to give a talk at a Lisp conference, so I gave one about how we'd used Lisp at Viaweb. Afterward I put a postscript file of this talk online, on paulgraham.com, which I'd created years before using Viaweb but had never used for anything. In one day it got 30,000 page views. What on earth had happened? The referring urls showed that someone had posted it on Slashdot. [10]\n\nWow, I thought, there's an audience. If I write something and put it on the web, anyone can read it. That may seem obvious now, but it was surprising then. In the print era there was a narrow channel to readers, guarded by fierce monsters known as editors. The only way to get an audience for anything you wrote was to get it published as a book, or in a newspaper or magazine. Now anyone could publish anything.\n\nThis had been possible in principle since 1993, but not many people had realized it yet. I had been intimately involved with building the infrastructure of the web for most of that time, and a writer as well, and it had taken me 8 years to realize it. Even then it took me several years to understand the implications. It meant there would be a whole new generation of essays. [11]\n\nIn the print era, the channel for publishing essays had been vanishingly small. Except for a few officially anointed thinkers who went to the right parties in New York, the only people allowed to publish essays were specialists writing about their specialties. There were so many essays that had never been written, because there had been no way to publish them. Now they could be, and I was going to write them. [12]\n\nI've worked on several different things, but to the extent there was a turning point where I figured out what to work on, it was when I started publishing essays online. From then on I knew that whatever else I did, I'd always write essays too.", metadata={'source': '/tmp/tmpirtblwfw', 'file_name': 'what_i_worked_on.txt', '_opper_key': '4/2047/7a34d87d-0e05-4f8c-a397-2202fc0f9843', '_opper_index_file_id': 3004})]

The result of querying an index is a list of RetrievalResponse objects. We can create a function that, given a question and a list of RetrievalResponse objects, returns an answer to the question.

from opperai.types import RetrievalResponse


@fn(model="openai/gpt4-turbo")
def query_qna(query: str, context: List[RetrievalResponse]) -> str:
    """Given a query and a context, answer the question using the context."""


def answer_question(query: str) -> str:
    context = index.query(query)
    res = query_qna(query=query, context=context)

    return res

This looks good. Now we load the questions and answers from the qna dataset and try the pipeline on a couple of them.

import requests

url = "https://huggingface.co/datasets/LangChainDatasets/question-answering-paul-graham/raw/main/paul_graham_qa.json"

response = requests.get(url)
response.raise_for_status()

json_data = response.json()

for row in json_data[0:2]:
    q = row["question"]
    a = row["answer"]
    res = answer_question(q)
    print(f"Question: {q}")
    print(f"Answer: {a}")
    print(f"Opper Answer: {res}")
    print()
Question: What were the two main things the author worked on before college?
Answer: The two main things the author worked on before college were writing and programming.
Opper Answer: Before college, the author worked mainly on writing, specifically short stories, and programming, initially experimenting with an IBM 1401 for data processing.

Question: What made the author want to work on AI?
Answer: The novel 'The Moon is a Harsh Mistress' and a PBS documentary showing Terry Winograd using SHRDLU made the author want to work on AI.
Opper Answer: The author was motivated to work on AI by a combination of literary and practical influences. Specifically, the novel 'The Moon is a Harsh Mistress' by Heinlein, which featured an intelligent computer named Mike, and a PBS documentary showcasing Terry Winograd using the SHRDLU program. These influences, along with the author's experience at Cornell where there were no AI classes, led him to teach himself and dive into the world of AI.

Benchmark correctness

Now we stitch the RAG pipeline together with the answer correctness function. First we create a dataframe with the first ten questions and answers from the qna dataset and generate an answer for each question with our RAG pipeline.

import pandas as pd

df = pd.DataFrame(json_data[:10])
df["opper_answer"] = df["question"].apply(answer_question)

We then create a new column with the correctness score for each question.

df["correctness"] = df.apply(
    lambda row: correctness_calculator.calculate(row["opper_answer"], row["answer"]),
    axis=1,
)

for row in df[:5].itertuples():
    print(f"Question: {row.question}")
    print(f"Answer: {row.answer}")
    print(f"Opper Answer: {row.opper_answer}")
    print(f"Answer Correctness: {row.correctness}")
    print("\n")
Question: What were the two main things the author worked on before college?
Answer: The two main things the author worked on before college were writing and programming.
Opper Answer: Before college, the author mainly worked on writing and programming.
Answer Correctness: 0.6666666666666666


Question: What made the author want to work on AI?
Answer: The novel 'The Moon is a Harsh Mistress' and a PBS documentary showing Terry Winograd using SHRDLU made the author want to work on AI.
Opper Answer: The author was influenced by a novel by Heinlein titled 'The Moon is a Harsh Mistress' featuring an intelligent computer named Mike, and a PBS documentary that showed Terry Winograd using SHRDLU, which made him want to work on AI.
Answer Correctness: 1.0


Question: What did the author realize while looking at a painting at the Carnegie Institute?
Answer: The author realized that paintings were something that could be made to last and that making them was a way to be independent and make a living.
Opper Answer: The author realized that paintings were something you could make that would last. They didn't become obsolete, and some of the best ones were hundreds of years old.
Answer Correctness: 0.5


Question: What did the author write their dissertation on?
Answer: The author wrote their dissertation on applications of continuations.
Opper Answer: The author wrote their dissertation on the applications of continuations.
Answer Correctness: 1.0

Taking the mean of the correctness column gives us the average correctness score across all questions in the dataset.

print(df["correctness"].mean())
0.742051282051282
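To see which questions drag the average down, we can for example sort the dataframe by the correctness column:

print(df.sort_values("correctness")[["question", "correctness"]].head())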

Conclusion

What we have implemented forms the basis of a more comprehensive evaluation of RAG pipelines. It can be used to compare the correctness of different implementations, prompts, models, and so on. We can also add more metrics, such as faithfulness and relevance; a rough sketch of how a faithfulness metric could follow the same pattern is shown below.
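The sketch below is our own assumption of what such a metric could look like, reusing the Reason class and the @fn pattern from above; the fields and prompt are illustrative, not the ragas faithfulness prompt or an existing Opper API:

class FaithfulnessClassifications(BaseModel):
    supported: List[Reason] = Field(..., description="Statements in the answer that are supported by the retrieved context")
    unsupported: List[Reason] = Field(..., description="Statements in the answer that are not supported by the retrieved context")

    @property
    def score(self) -> float:
        """Fraction of answer statements that are supported by the context."""
        total = len(self.supported) + len(self.unsupported)
        return len(self.supported) / total if total > 0 else 0


# NOTE: illustrative sketch only - the prompt and fields are assumptions, not a published metric.
@fn(model="openai/gpt4-turbo")
def classify_faithfulness(answer: str, context: List[RetrievalResponse]) -> FaithfulnessClassifications:
    """Given an answer and the retrieved context, classify each statement in the answer
    as supported or unsupported by the context.
    """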

The correctness prompt is heavily inspired by the ragas answer correctness prompt.