Opper

Lost in Context: Where LLMs Shine — and Fail — at Using What They’re Given

We present results from our benchmark for context tasks, with insights into what is easy and what is hard for each model.

Introducing Opper Taskbench - A Real‑World Benchmark for Task‑Oriented LLMs

We built TaskBench to measure real-world LLM performance on practical tasks like RAG, SQL generation, and agentic workflows. Here are our findings across accuracy, cost, and model size.

Reference‑Free LLM Evaluation with Opper SDK

Three reference‑free evaluators to demonstrate how to evaluate RAG systems at runtime without gold references.

Building a Simple GitHub PR Review Agent with ReAct

In this post, we will build an initial version of a simple but effective GitHub PR review agent using the ReAct pattern.

Indexing docs and websites using GitHub Actions

This guide explores how to automatically index your documentation in Opper Indexes using GitHub Actions, so that indexing runs as part of your CI/CD pipelines.

Introduction to Schema Based Prompting: Structured inputs for Predictable outputs

Schema-based prompting is a technique for instructing models through clear data structures instead of natural language prompts. This post introduces the concept and shows how to use it with Opper.

New OpenAI-compatible endpoint: Use Opper with OpenAI SDKs and frameworks

Opper now provides an OpenAI-compatible API endpoint that works seamlessly with the OpenAI SDKs and the surrounding ecosystem of popular AI tools and frameworks.

Introducing Opperator: A composable agent to automate tasks on the web

Opperator is a programmable, autonomous web agent, built with the Opper SDK, for automating tasks on the web.

Reason then respond with DeepSeek-R1 and Mistral Tiny

Using a reasoning model to generate detailed thought traces that improve the quality of AI responses, while keeping costs low.

Agentic customer service chatbot with tools, tracing and evals

We build a chatbot to assist users by utilising tools, while maintaining context and handling errors gracefully.

Using o1-preview and o1-mini with RAG and structured output

In this blog post we explore how OpenAI's reasoning models o1-mini and o1-preview perform in a RAG pipeline with structured output.

Takeaways from AI Engineer World Fair, San Francisco 2024

Three days at the AI Engineer World Fair in San Francisco, covering the what, how and why of LLMs and how to best use them.

Using Examples and Few Shot Retrieval to Shape LLM Responses

We build a pipeline to shape the output of LLM calls with synthetic examples and few-shot retrieval, and see how multiple non-frontier models perform.

Resilient Azure OpenAI using Azure API Management

Exploring how to set up APIM with an OpenAI-compatible API, and how to connect it to multiple OpenAI deployments. We will also cover how to set up failover and load-balancing.

RAG metrics: answer correctness

In this blog post we explore how to implement and use RAG metrics, using the answer correctness metric to evaluate the quality of answers generated by a model.

Introducing Delvin: State of the art bug fixing agent

Delvin is an agent that fixes issues from the SWE-Bench Lite dataset, achieving state-of-the-art accuracy (23%) with very simple code leveraging the Opper SDK.

Extracting recipes from images using gpt-4o in Opper

In this blog post we explore how to use multimodal models in Opper. We will use the newly released model gpt-4o to generate structured data from images.

Simple RAG with citations

In this blog post we show how to build a simple RAG feature with citations, using Opper with structured input/output and Mistral-Large.

Examples are all you need: getting the most out of LLMs part 2

Getting GPT-3.5 Turbo to 80% accuracy on the GSM8K benchmark by leveraging the teacher-student pattern, synthetic examples and few-shot retrieval.

Getting the best out of LLMs, part 1

Dramatically improving LLM accuracy with structured generation and chain of thought.

Introducing Opper

At Opper, our mission is to accelerate adoption of Generative AI by making it simpler to build production-grade reasoning applications, agents and features.