By Göran Sandahl

Using Examples and Few Shot Retrieval to Shape LLM Responses

In Getting the most out of LLMs and Examples are all you need, we looked at how we can use few shot prompting to make LLMs more performant.

In this post we'll expand on this and look at how high quality examples can be used to guide the output of LLMs in situations where we want to control the style of responses.

Let's try with an instruction first

When optimizing the output of LLM responses the first instinct is often to write better instructions. The downside of this approach is that the prompts often get complicated: "be nice", "write in the style of...", "don't do ..." etc. Extensive instructions are also subject to a lot of interpretation by the model, which means that they often become model specific and fragile to further change.

But for the sake of it, let's try this approach first. Let's say we have a use case where a user asks a question and we use an LLM to construct an answer. Let's also say we want the answer to always start by repeating the question. Here is what a simple prompt and function could look like:

from opperai import fn

@fn(path="cookbook/few_shot", model="mistral/mistral-large-eu")
def respond(question: str) -> str:
    """ Create a brief response to the question. Repeat the question in the beginning of the response. """

respond("Why is the earth round?")

Running this outputs the following with mistral/mistral-large-eu:

You asked: Why is the earth round? The Earth appears round because it is an oblate spheroid, meaning it is mostly spherical but slightly flattened at the poles and slightly bulging at the equator. This shape is caused by the Earth's rotation.

This does indeed follow the instruction by repeating the question at the beginning of the response. But it is not very "conversational". Let's say our desired response would be something like:

You asked why the earth is round, and it is because of gravity in space and the Earth's rotation.

We could continue playing around with the prompt to get closer to this. But let's see how we can use examples to guide the output of our LLM call instead!

Building a simple test

Let's first build a simple test to verify that outputs meet our criteria. This will allow us to systematically validate the quality of responses.

We'll start by creating a simple function that takes a response and returns True if it matches a pattern that we define with a regex:

import re

def evaluate_response(response: str) -> bool:
    # The response should match "You asked <question>, <answer>"
    pattern = r"^You asked (.+?), (.+?)$"
    return bool(re.match(pattern, response.strip()))

evaluate_response(respond("Why is the earth round?"))

Running this on the output of the above function returns False, as the colon breaks the pattern. We now have a goal: make this test pass!
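As a quick sanity check, we can also verify that the conversational format we are aiming for does pass the test:

print(evaluate_response(
    "You asked why the earth is round, and it is because of gravity in space and the Earth's rotation."
))  # True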

Generating a dataset of examples

Since we know roughly what we want the output to look like, we can synthetically create additional examples.

Examples are essentially input and output pairs, so we can create a simple object to represent them and a function to generate them:

from pydantic import BaseModel

class Example(BaseModel):
    question: str
    response: str

@fn(path="cookbook/generate_examples")
def expand_examples(reference: Example) -> list[Example]:
    """ Create 10 more examples in the format of the reference example, but with different questions """

reference = Example(
    question="Why is the earth round?", 
    response="You asked why the earth is round, and the earth is round because of gravity in space.")

examples = expand_examples(reference)

We get a list of examples in the following style:

question='How do airplanes stay airborne?' response='You asked how airplanes stay airborne, and airplanes stay airborne due to the lift generated by their wings as they move through the air.'
question='Why does ice float on water?' response='You asked why does ice float on water, and ice floats on water because its solid form is less dense than its liquid form due to the way water molecules expand when they freeze.'
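Since these responses are themselves LLM-generated, it can be worth running them through the test we just built before relying on them. A minimal sketch reusing evaluate_response:

# Sanity check: flag any generated example that doesn't match our pattern
for example in examples:
    if not evaluate_response(example.response):
        print("Example does not match the expected format:", example.question)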

Populating the function dataset

These examples should now be great for "showing" the model how we want responses to look, and we want to populate the respond() function's dataset with them.

To do this, we use the Opper API. Note that we now have 10 examples: we will use 70% as few shot examples and hold out 30% for testing later.


import os
import requests

# We locate the dataset id for the respond() function in the Opper UI
dataset_id = "346ac7e6-9bcc-4736-8ab6-7fad8928fca8"

# Split the examples
training_set = examples[:7]
testing_set = examples[7:]

# Post examples to the endpoint
def post_example(dataset_id, input_text, output_text, comment_text):

    endpoint_url = "https://api.opper.ai/v1/datasets/" + dataset_id

    headers = {
        "x-opper-api-key": os.environ["OPPER_API_KEY"],
        "Content-Type": "application/json"
    }

    body = {
        "input": input_text,
        "output": output_text,
        "comment": comment_text
    }
    response = requests.post(endpoint_url, json=body, headers=headers)
    if response.status_code == 200:
        print("Successfully posted example:", input_text)
    else:
        print("Failed to post example:", input_text, "Status Code:", response.status_code)

# Loop through examples and post each one
for example in training_set:
    comment = "Generated example for model training"
    post_example(dataset_id, example.question, example.response, comment)

We can now see that the dataset for this function is populated with our examples:

Dataset used for few shot prompting.

Enabling few shot retrieval

To enable few shot retrieval, we update the function's few_shot parameter and choose to use 3 examples. This means that upon calling the function, the 3 most semantically similar examples will be added to the prompt automatically.

@fn(path="cookbook/few_shot", model="mistral/mistral-large-eu", few_shot=True, few_shot_count=3)
def respond(question: str) -> str:
    """ Create a brief response to the question. Repeat the question in the beginning of the response. """

Let's try it:

print(respond("Why is there more water than land on earth?"))

This outputs:

You asked why there is more water than land on Earth, and this is due to the fact that Earth's early history involved a lot of volcanic activity, which caused the continents to form and drift apart, while the oceans remained relatively stable. The continents make up only about 30% of the Earth's total surface area, with the rest being covered by oceans.

Success!

We can now take all the entries in our testing_set and put them through the test:

for test in testing_set:
    response = respond(test.question)

    print(test.question)
    print(response)
    print(evaluate_response(response))
    print("-----------------")

This outputs the following:

How does the internet work?
You asked how the internet works, and the internet functions as a global network of interconnected computers and servers, communicating with each other using standardized protocols to exchange data packets, enabling worldwide information sharing and communication.
True
-----------------
What is the theory of relativity?
You asked what the theory of relativity is, and the theory of relativity is a scientific theory proposed by Albert Einstein that describes the laws of physics in the presence of gravity and motion, including the concepts of space-time curvature and the equivalence of mass and energy.
True
-----------------
Why do we have seasons?
You asked why we have seasons, and seasons occur due to the Earth's tilt on its axis and its orbit around the sun, which cause changes in the amount and angle of sunlight that reaches different parts of the planet.
True
-----------------

This looks great!

We have now successfully provided the function with examples of how we would like the output to look. And we did this without having to modify the prompt in any way. If we want a different response style, we just modify the examples. If edge cases come up, we can add more examples as we go without having to modify any code, as sketched below.
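For example, reusing the post_example helper from above, a new edge case can be covered by posting one more example to the same dataset at runtime (the question and response below are just illustrative):

# Hypothetical edge case added to the dataset without any code changes
post_example(
    dataset_id,
    "What makes the sky blue?",
    "You asked what makes the sky blue, and the sky appears blue because air molecules scatter blue sunlight more than other colors.",
    "Added to cover a new edge case",
)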

Testing additional models

In the above example we used mistral/mistral-large-eu, which is a good and relatively powerful model. But how does this perform with smaller models? With the validation function and test set ready, we can easily run the same 3 test cases through a selection of the smaller models available on Opper.
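Here is a rough sketch of what that could look like, assuming the @fn decorator can simply be re-applied with a different model value; the model identifiers below are illustrative and may not match the exact names available on Opper:

# Illustrative model identifiers; check Opper's model list for the exact names
models = [
    "anthropic/claude-3-haiku",
    "aws/titan-text-express-v1-eu",
    "gcp/gemini-1.5-flash",
    "mistral/mistral-tiny-eu",
    "openai/gpt-3.5-turbo",
    "mistral/mistral-7b-instruct",
]

for model_name in models:
    @fn(path="cookbook/few_shot", model=model_name, few_shot=True, few_shot_count=3)
    def respond(question: str) -> str:
        """ Create a brief response to the question. Repeat the question in the beginning of the response. """

    passed = sum(evaluate_response(respond(test.question)) for test in testing_set)
    print(f"{model_name}: {passed}/{len(testing_set)} tests passed")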

Here are the results for Haiku from Anthropic, Titan from AWS, Gemini 1.5 Flash from Google, Mistral Tiny from Mistral, GPT-3.5 from OpenAI and Mistral-7B-Instruct from Mistral/Opper.

Testing the few shot model on all available models.

All models except aws/titan-text-express-v1-eu passed all tests. The likely explanation for the Titan model's shortcomings is that it is quite old and doesn't follow instructions or in-context examples well. But all the rest pass, which shows great promise for making this use case fast, high quality and cheap!

Conclusion

In this post we showed how we can use examples to guide the output of LLMs. We showed how to build a test that verifies outputs meet our criteria, how to generate a dataset of examples, and how to add them to the dataset of our response function. Finally, we enabled few shot retrieval and saw how well it did. Our test scenario passed 100% of the tests, and we even verified this with some of the cheapest and fastest models on the market.

Examples are super powerful for getting LLMs to do what you want! As a bonus, you don't clutter the code with extensive instructions, and you can manage this optimization step as a runtime dataset that is quick to update and change over time.

Thanks for reading! If you find this interesting, feel free to take Opper for a spin!