How we reduced agent costs by 98.6% using Opper context features

By Göran Sandahl

When building AI-powered products, you inevitably hit a tension between quality, speed, and cost. Large frontier models deliver great results but are slow and expensive. Small models are fast and cheap but often fall short on complex tasks. With Opper's context features, we resolved this trade-off entirely: a small open-weight model with no loss in quality, 98.6% lower cost, and 72% faster response times.

This post walks through the real data behind that result.

The problem: building agents interactively

Agento lets users create AI agents through conversation. When a user describes what they want, say, "I want an alert email when price for trip to Copenhagen goes below 1000 SEK via plane", an agent-builder agent takes over. It asks the right follow-up questions, gathers the necessary information, and assembles the agent.

This agent builder process is critical to the product experience. It needs to:

  • Retrieve the right information in as few conversational turns as possible
  • Be fast, moving the user forward without unnecessary latency
  • Run at low cost, since every agent build is a cost we absorb

The challenge is that these requirements pull in opposite directions. The models that are good enough to handle the nuance of agent building are expensive and slow. The models that are fast and cheap tend to miss important details or require too many rounds of clarification.

Baseline: without Opper context features

We tested the agent builder on two models without applying any of Opper's context features.

Model    | User rounds | LLM calls | Server time (s) | Passes acceptance test | Cost per build
---------|-------------|-----------|-----------------|------------------------|---------------
Opus-4.5 | 4.0         | 14.33     | 63.37           | Yes                    | $0.451
Oss-120b | 5.50        | 13.25     | 30.58           | No                     | $0.013

Only Opus-4.5 delivered a good enough experience, but at $0.451 per agent build and over a minute of server time. The small model was fast and cheap but failed the acceptance test. It couldn't reliably extract the right information from the user or produce a working agent configuration.

At $0.451 per build, scaling to thousands of users would mean significant infrastructure cost just for the agent creation flow.

With Opper context features

Opper's context features let you steer model behavior by providing structured context (examples, patterns, and domain knowledge) that guides the model toward the right output, rather than relying solely on the model's innate capabilities.

After applying context features to the same agent builder, the results changed dramatically:

Model    | User rounds | LLM calls | Server time (s) | Passes acceptance test | Cost per build
---------|-------------|-----------|-----------------|------------------------|---------------
Opus-4.5 | 2.67        | 9.0       | 30.93           | Yes                    | $0.329
Oss-120b | 2.0         | 8.0       | 17.43           | Yes                    | $0.010

The small model now passes the acceptance test. Not only that, it delivers the best experience overall: fewest rounds, fewest LLM calls, fastest server time, and lowest cost.

The numbers

The improvement from baseline Opus-4.5 to Opper-enhanced Oss-120b:

  • 98.6% cost reduction, from $0.451 to $0.010 per agent build
  • $436 saved per 1,000 agent builds
  • 72% faster, from 63.37s to 17.43s server time
  • 50% fewer conversation rounds, from 4.0 to 2.0 user turns
  • No loss in quality: both configurations pass the same acceptance test

Even comparing Opper-enhanced Opus-4.5 against its own baseline shows significant gains: fewer rounds, fewer calls, and faster execution. The context features improve every model, but the effect is most dramatic on small models, where the gap to close is largest.

How it works

Opper's context features work by enriching the model's context with relevant examples and structured patterns at inference time. Rather than relying on the model to figure out what a good agent configuration looks like from a generic system prompt alone, you provide it with concrete examples of successful interactions and outputs.

from opperai import Opper

opper = Opper()

# 1. Make the call
response = opper.call(
    name="agent_builder/process_user_input",
    instructions="You are an agent builder. Extract the user's intent and build an agent configuration.",
    input={"message": "I want an alert email when price for trip to Copenhagen goes below 1000 SEK via plane"},
    tools=[...]  # Agent builder tools (e.g., extract_entities, configure_trigger)
)

# 2. Determine if the agent run was successful (programmatic check, not user rating)
agent_run_success = check_agent_run_success()  # Your acceptance test logic

# 3. Submit feedback - positive feedback auto-saves to dataset
opper.spans.submit_feedback(
    span_id=response.span_id,
    score=1.0 if agent_run_success else 0.0,
    comment="Agent acceptance test passed"
)
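The check_agent_run_success() placeholder above stands in for whatever acceptance test fits your product. A hypothetical version for an agent builder might combine a structural check with a dry run; the helpers it calls are illustrative, not part of the Opper SDK:

def check_agent_run_success() -> bool:
    """Hypothetical acceptance test: does the built agent actually work?"""
    config = load_latest_agent_config()  # illustrative helper: fetch the configuration just built
    required_fields = ("trigger", "action", "schedule")

    # Structural check: every required field is present and non-empty
    if any(not config.get(field) for field in required_fields):
        return False

    # Behavioral check: a single sandboxed run of the agent completes without errors
    return dry_run_agent(config)  # illustrative helper: execute the agent once in a sandbox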

The key insight is that small models don't lack capability. They lack context. When you provide the right context through Opper, a 120B parameter open-weight model can match or exceed the performance of a frontier model on your specific task.

This is fundamentally different from fine-tuning. There's no training step, no model weights to manage, and no deployment complexity. You add examples and feedback directly through the Opper platform, and the improvement is immediate.
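What gets added is plain data. Conceptually, each saved example is an input/output pair from a successful interaction, along the lines of the sketch below (illustrative shape only, not Opper's storage schema):

# Illustrative only: the conceptual shape of a saved example, not Opper's actual schema
example = {
    "input": {
        "message": "I want an alert email when price for trip to Copenhagen goes below 1000 SEK via plane"
    },
    "output": {
        "intent": "price_alert",
        "follow_up_questions": [
            "Which departure airport should I watch?",
            "Which travel dates should the trip cover?",
        ],
    },
    "score": 1.0,  # the feedback score submitted for the span above
}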

When to use this approach

This pattern works well when:

  • You have a well-defined task with clear success criteria
  • You can provide examples of good outputs for similar inputs
  • Cost and latency matter because you're running inference at scale
  • You want to iterate quickly, since adding examples is faster than prompt engineering

The agent builder use case is a strong fit because the task is structured (extract intent, configure agent), the success criteria are clear (does the resulting agent work?), and the volume is high enough that per-call cost matters.
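To make "the task is structured" concrete, the builder's target output can be pinned down as a typed schema. A heavily simplified, hypothetical version (not Agento's real configuration format) might look like this:

from pydantic import BaseModel, Field

class AgentConfiguration(BaseModel):
    """Hypothetical, simplified configuration an agent builder might produce."""

    name: str = Field(description="Human-readable agent name")
    trigger: str = Field(description="Condition that fires the agent, e.g. a price threshold")
    action: str = Field(description="What the agent does when triggered, e.g. send an email")
    schedule: str = Field(description="How often the trigger condition is checked")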

Try it yourself

If you're running agents or structured LLM tasks and want to explore whether a smaller model could handle your workload, Opper makes it straightforward to test:

  1. Set up your call on platform.opper.ai with your current model
  2. Add examples of good inputs and outputs
  3. Switch to a smaller model and compare results (see the sketch below)
  4. Measure the difference in cost, latency, and quality
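A minimal sketch of steps 3 and 4, assuming the Python SDK's call accepts a model identifier; the model names below are placeholders, so substitute real ones from the Opper model catalog:

import time

from opperai import Opper

opper = Opper()  # assumes OPPER_API_KEY is set in the environment

# Placeholder identifiers; substitute models available in your Opper account
candidate_models = ["<current-frontier-model>", "<small-open-weight-model>"]

test_message = "I want an alert email when price for trip to Copenhagen goes below 1000 SEK via plane"

for model in candidate_models:
    started = time.time()
    response = opper.call(
        name="agent_builder/process_user_input",
        instructions="You are an agent builder. Extract the user's intent and build an agent configuration.",
        input={"message": test_message},
        model=model,  # the only change between runs is the model identifier
    )
    print(f"{model}: {time.time() - started:.1f}s (span {response.span_id})")
    # Token usage, cost, and the full output for each span are visible in the Opper dashboard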

The Opper dashboard gives you full visibility into each call, including token usage, latency, and outputs, so you can make the comparison with real data.

Get started at platform.opper.ai or check out the documentation to learn more about context features.