Simple RAG with citations

By Göran Sandahl - 5/9/2024

RAG, short for Retrieval Augmented Generation, is a technique for enhancing Large Language Models (LLMs) by adding context from external information sources. It is particularly valuable when accuracy or up-to-date information matters, because it supplements the generative abilities of LLMs with specific, relevant information from documents or data sources.

In this post we will show how to build one such feature. Our goal is to answer a question using a document and to provide citations for the facts in the answer.

We will use Opper to index and retrieve information from a PDF and the leading European model Mistral-Large to generate a response. Reddit's S-1 document, filed with the SEC for its IPO, will serve as the source of truth for this example. It is 261 pages long.

Indexing the PDF

We could talk for hours about how PDFs are notoriously difficult to index because of the mixed content they include: tables, images, text and headings. But we will just rely on Opper's indexing API to manage this for us!

# we create an index
index = index.create(name="reddit-s1")
   
# we upload our pdf to the index
index.upload_file(
    file_path="./reddit-sec.pdf",
)

Simple retrieval

Let's pick a question that seems relevant for this kind of document, "What are the key financial and growth numbers for Reddit?", and query the index for the 3 (k) most semantically relevant parts:


question = "What are the key financial and growth numbers for Reddit?"

results = index.query(
    query=question,
    k=3
)
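To make "the k most semantically relevant parts" concrete, here is a minimal sketch of top-k retrieval by cosine similarity. It is not Opper's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and `top_k` and the sample chunks are hypothetical names and data used only for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


# Hypothetical chunks, not taken from the real document
chunks = [
    "user growth numbers for daily actives",
    "risk factors related to our business",
    "revenue growth and key financial numbers",
    "our mission is to bring community",
]
best = top_k("key financial and growth numbers", chunks, k=2)
print(best)
```

A real index would compare dense embedding vectors instead of word counts, but the ranking step has the same shape: score every chunk against the query and keep the k highest.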

Using the results from the query, we can gather the attributes needed to build the citations and the response. For each retrieved part of the PDF we need the content, the file name and the page number.


from pydantic import BaseModel

# Data type for one retrieved part of the PDF
class Source(BaseModel):
    file_name: str
    content: str
    page_number: int

# Process the results
processed_results = [
    Source(
        content=result.content,
        file_name=result.metadata.get("file_name"),
        page_number=result.metadata.get("page")
    ) for result in results
]
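Because `metadata.get(...)` returns `None` when a key is absent, a retrieved part without a page number could not be turned into a usable citation. The sketch below shows one way to skip such results. It assumes results expose `content` and a `metadata` dict as in the snippet above, uses a plain dataclass in place of the pydantic model so it runs without dependencies, and `to_source` is a hypothetical helper name.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional


@dataclass
class Source:
    file_name: str
    content: str
    page_number: int


def to_source(result) -> Optional[Source]:
    """Convert one result to a Source, or None if citation metadata is missing."""
    file_name = result.metadata.get("file_name")
    page = result.metadata.get("page")
    if file_name is None or page is None:
        return None
    return Source(file_name=file_name, content=result.content, page_number=int(page))


# Stand-in results (assumed shape); the second one lacks a page number
results = [
    SimpleNamespace(
        content="Global DAUq grew 27%.",
        metadata={"file_name": "reddit-sec.pdf", "page": 106},
    ),
    SimpleNamespace(
        content="Orphan chunk.",
        metadata={"file_name": "reddit-sec.pdf"},
    ),
]

processed = [s for s in map(to_source, results) if s is not None]
print(processed)
```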

Building a response with citations

We can now extract relevant citations from the retrieved information. We rely on Mistral-Large and Opper functions with structured input/output to do the work:


from typing import List

# One quoted passage together with where it was found
class Citation(BaseModel):
    source: str
    page_number: int
    citation: str

@fn(model="mistral/mistral-large-eu")
def extract_citations(question: str, sources: List[Source]) -> List[Citation]:
    """ Build a list of citations for the question from the sources"""

citations = extract_citations(question, processed_results)
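Since the citations are produced by a model, it can be worth checking that each quote really appears in the source it points to. The following is a minimal sketch of such a check, not part of Opper. It uses plain dataclasses that mirror the `Source` and `Citation` fields above, and `normalize` and `is_grounded` are hypothetical helper names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Source:
    file_name: str
    content: str
    page_number: int


@dataclass
class Citation:
    source: str
    page_number: int
    citation: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so minor formatting differences do not matter."""
    return " ".join(text.lower().split())


def is_grounded(citation: Citation, sources: List[Source]) -> bool:
    """True if the quote appears in a source with the same file name and page."""
    quote = normalize(citation.citation)
    return any(
        s.file_name == citation.source
        and s.page_number == citation.page_number
        and quote in normalize(s.content)
        for s in sources
    )


sources = [
    Source(
        file_name="reddit-sec.pdf",
        content="Global monthly average DAUq was 76.0 million in December 2023.",
        page_number=106,
    )
]
good = Citation("reddit-sec.pdf", 106, "global monthly average DAUq was 76.0 million")
wrong_page = Citation("reddit-sec.pdf", 14, "global monthly average DAUq was 76.0 million")

print(is_grounded(good, sources), is_grounded(wrong_page, sources))
```

An exact substring match is strict: a model that paraphrases the passage would fail the check, so a fuzzy comparison may be a better fit in practice.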

And then build the response using the citations:


class Response(BaseModel):
    answer: str 
    citations: List[Citation]

@fn(model="mistral/mistral-large-eu")
def produce_response(question: str, citations: List[Citation]) -> Response:
    """ Produce an answer to the question using the possible citations. Refer to any statements or facts from citations inline in the answer with [1], [2] etc """

response = produce_response(question, citations)
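The prompt asks the model to refer to citations inline with [1], [2] and so on, but nothing guarantees that those numbers match the returned list. A small check along these lines can flag markers that point past the end of the list and citations that are never referenced. It is a sketch under the assumption that markers are 1-based positions in `citations`, and `cited_numbers` and `check_markers` are hypothetical names.

```python
import re
from typing import List, Set, Tuple


def cited_numbers(answer: str) -> Set[int]:
    """Collect every inline marker of the form [n] in the answer."""
    return {int(n) for n in re.findall(r"\[(\d+)\]", answer)}


def check_markers(answer: str, citation_count: int) -> Tuple[List[int], List[int]]:
    """Return (dangling markers, unused citation numbers), both sorted."""
    used = cited_numbers(answer)
    valid = set(range(1, citation_count + 1))
    return sorted(used - valid), sorted(valid - used)


# Hypothetical answer text with two citations available
answer = "DAUq grew 27% [1]. International DAUq averaged 36.7 million [3]."
dangling, unused = check_markers(answer, citation_count=2)
print(dangling, unused)
```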

Delivering the response

Having both the answer and citations available in a structured form, we can construct a nice response.


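print(response.answer)
# Number the citations from 1 to line up with the [1], [2] markers in the answer.
# enumerate avoids reusing the name `index`, which already holds the Opper index.
for number, citation in enumerate(response.citations, start=1):
    print(f"[{number}]", f'"{citation.citation}"')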
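If the file name and page should come from the structured fields rather than from the quoted text, a formatter along these lines could be used. This is a sketch: `format_citations` is a hypothetical helper, and the dataclass mirrors the `Citation` fields above so the example runs on its own.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Citation:
    source: str
    page_number: int
    citation: str


def format_citations(citations: List[Citation]) -> str:
    """Render a numbered reference list with file name and page for each quote."""
    lines = []
    for number, c in enumerate(citations, start=1):
        lines.append(f'[{number}] "{c.citation}" ({c.source}, p. {c.page_number})')
    return "\n".join(lines)


refs = format_citations(
    [Citation("reddit-sec.pdf", 106, "Global monthly average DAUq was 76.0 million in December 2023.")]
)
print(refs)
```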

Our question

What are the key financial and growth numbers for Reddit?

The response

The key financial and growth numbers for Reddit include the market size for Reddit's user economy, which is currently estimated to be $1.3 trillion and is expected to grow at a CAGR of 12% to $2.1 trillion in 2027 [2]. In terms of user engagement, global Daily Active Users (DAUq) grew 27% compared to the prior year period, driven by 34% growth in DAUq in the United States and 21% growth in DAUq in the rest of the world. As of December 2023, global monthly average DAUq was 76.0 million [3][4]. Reddit also experienced approximately 35% growth in the number of videos watched for 10 seconds or more and an approximately 16% increase in daily active video viewers compared to December 2022 [5]. There was an increase of over 30% in 'Good Visits,' defined as a user consuming a post for more than 30 seconds, in the three months ended December 31, 2023, compared to the three months ended June 30, 2023 [6]. Additionally, approximately 50% of Redditors visited the platform from outside of the United States, with an average of 36.7 million international DAUq for the three months ended December 31, 2023, representing 21% growth year over year [7][8].

[1] "The market size for Reddit's user economy is estimated to be $1.3 trillion today and is expected to grow at a CAGR of 12% to $2.1 trillion in 2027. (reddit-sec.pdf, p. 19)"

[2] "In the three months ended December 31, 2023, global DAUq grew 27% compared to the prior year period, driven by 34% growth in DAUq in the United States and 21% growth in DAUq in the rest of world. (reddit-sec.pdf, p. 106)"

[3] "Global monthly average DAUq was 76.0 million in December 2023. (reddit-sec.pdf, p. 106)"

[4] "In December 2023, we experienced approximately 35% growth in the number of videos watched for 10 seconds or more and an approximately 16% increase in daily active video viewers compared to December 2022. (reddit-sec.pdf, p. 14)"

[5] "In the three months ended December 31, 2023 compared to the three months ended June 30, 2023, we observed an increase of over 30% in 'Good Visits,' defined as a user consuming a post for more than 30 seconds. (reddit-sec.pdf, p. 14)"

[6] "During the three months ended December 31, 2023, approximately 50% of Redditors visited the platform from outside of the United States. (reddit-sec.pdf, p. 14)"

[7] "We captured increased international momentum, with an average of 36.7 million international DAUq for the three months ended December 31, 2023, representing 21% growth year over year. (reddit-sec.pdf, p. 14)"

Takeaways

In this short example we showed how to use Mistral-Large to answer a question with citations. We used the Opper indexing API to store and query the PDF, and then used Mistral-Large with Opper's structured input/output to extract citations and form a response. It is entirely possible to use alternatives to frontier models for use cases like this!