We gave Hugging Face's Reachy Mini robot a brain in one hour

By Jose Sabater

Yesterday the Opper team spent the day at MagasinX in Stockholm, the new hub for robotics and hardware. Amazing place, full of people building physical things.

When we arrived, this little guy was lying on our desk: a Reachy Mini, the open-source desktop robot from Hugging Face, with a moving head, wiggly antennas, a microphone, a camera and a speaker. None of us had seen one before, so naturally we got curious about how it worked. We like a good challenge, and a real robot sitting on the desk was too tempting to leave alone.

One hour later, it had a brain.

We had wired the Reachy Mini up to a realtime voice AI model through Opper, and the robot could suddenly hear us, talk back, look around, and jiggle a little when it got excited. Say hi to Reachy 👋

Reachy Mini, an open-source desktop robot from Hugging Face, on a desk at MagasinX in Stockholm

The full code is open source: github.com/opper-ai/reachy-voice-realtime.

The realisation: a realtime model is just a brain you can plug in

Here is the thing that clicked for us that morning. We didn't write a speech-to-text pipeline. We didn't stitch together transcription, then an LLM, then text-to-speech, then a motion controller. We didn't tune turn-taking or build a voice activity detector.

We just opened a WebSocket to a realtime model and handed it the robot's senses and limbs:

  • Ears: the robot's microphone, streamed straight into the model.
  • Eyes: the camera, sent as image input.
  • Voice: the model's audio, played out of the robot's speaker.
  • Body: a list of tools the model can call to move, like nod, wave, look_around, set_head, dance.

That's the whole trick. A realtime model is a brain with audio in, audio out, and the ability to call functions. Once you frame it that way, the robot is just one possible body. The exact same loop could drive a kiosk, a phone app, a car dashboard, a customer support line, or a game character. Realtime voice can add an AI brain to almost any interface, and the body is whatever tools you give it.
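To make the framing concrete, here is a minimal sketch of two of those channels as messages on the WebSocket. It is an illustration, not the repo's code. It assumes a wire format shaped like OpenAI's Realtime API events (`session.update` and `input_audio_buffer.append`), and the helper names `session_update` and `audio_chunk` are made up for this example. The messages the gateway actually expects may differ.

```python
import base64
import json


def session_update(instructions, tools):
    """Build the message that tells the model who it is and what it can call.

    Assumption: tools are declared as parameterless functions. Real tools
    such as set_head would need a proper JSON schema for their arguments.
    """
    return json.dumps({
        "type": "session.update",
        "session": {
            "instructions": instructions,
            "tools": [
                {
                    "type": "function",
                    "name": tool["name"],
                    "description": tool["description"],
                    "parameters": {"type": "object", "properties": {}},
                }
                for tool in tools
            ],
        },
    })


def audio_chunk(pcm_bytes):
    """Wrap a chunk of raw microphone audio as a base64 text message."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })
```

In this sketch the "ears" are a stream of `audio_chunk` messages and the "body" is whatever appears in the `tools` list. Swapping the robot for a kiosk would only change that list.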

That's why this feels like the future of robotics to us. It's the heart of what people now call physical AI, or embodied AI: a multimodal model that can perceive and act in the real world, wrapped in a body. The hard part used to be the brain. Now the brain is an API call, and you get to spend your hour on the fun part: what the body can actually do.

How it actually works

The agent runs a single realtime session and gives the model a handful of tools. Here is the shape of it, trimmed down:

# The model becomes Reachy: it hears via the mic, sees via the camera,
# speaks through the speaker, and calls tools to move while it talks.

TOOLS = [
    {"name": "nod",          "description": "Nod the head yes"},
    {"name": "shake_head",   "description": "Shake the head no"},
    {"name": "wave",         "description": "Wave the antennas hello"},
    {"name": "look_around",  "description": "Sweep the head left to right"},
    {"name": "dance",        "description": "Body sway + head bob + music"},
    {"name": "look",         "description": "Grab a fresh camera frame and look"},
    # ...19 tools in total
]

SYSTEM_PROMPT = """
You are Reachy, a small curious desktop robot.
Mirror the human: if they wave, wave back. If they nod, nod back.
Use your tools to move while you talk. Be playful.
"""

When you speak, the audio streams to the model. When the model decides to wave back, it emits a tool call, the agent runs the matching motion on the robot, and the model keeps talking while the motion plays. If it sees you wave, it waves back. If it hears a sound off to the side, it turns toward you. It all happens in one continuous conversation.
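The tool-call step can be sketched as a small dispatch table. This is an illustration under assumptions: `FakeRobot`, `handle_tool_call` and the shape of the `call` dictionary are hypothetical stand-ins so the example runs without hardware. They are not the repo's API.

```python
class FakeRobot:
    """Stand-in for the real robot: records motions instead of moving."""

    def __init__(self):
        self.log = []

    def nod(self):
        self.log.append("nod")

    def wave(self):
        self.log.append("wave")


def handle_tool_call(robot, call):
    """Run the motion named in a tool call and return a result for the model.

    Unknown tool names are reported back instead of raising, so a bad call
    from the model does not end the conversation.
    """
    handlers = {"nod": robot.nod, "wave": robot.wave}
    handler = handlers.get(call["name"])
    if handler is None:
        return {"ok": False, "error": f"unknown tool: {call['name']}"}
    handler()
    return {"ok": True}


robot = FakeRobot()
result = handle_tool_call(robot, {"name": "wave"})
```

Returning a small result object, rather than nothing, gives the model something to react to if a motion is not available.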

One API key, any model

We went with OpenAI's GPT Realtime 2 because it was the newest realtime model out there. We ran it through the Opper gateway rather than calling the provider directly, which removed the last bit of friction:

  • No client-side API key. Opper handles auth with an OAuth device flow and mints short-lived WebSocket tickets. You only need an Opper credential, and nothing sensitive ever sits on the device.
  • One credential, any model. OpenAI's realtime model is just one of the models behind the same gateway. With a single API key you can also reach Grok Voice and Gemini Live, so you can pick whatever fits the job and swap with a config change instead of a rewrite.
  • Tune for budget and native features. Different realtime models trade off cost, latency and capabilities. Gemini Live, for example, can take live video in, not just still frames, which opens up richer perception. Being able to A/B models behind one interface means you can match price to the task and lean on each model's native strengths.
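A "config change instead of a rewrite" could look something like the sketch below. Everything in it is hypothetical: the `VoiceConfig` fields, the `REACHY_PROFILE` environment variable and the placeholder model identifiers are invented for illustration and are not the names the repo or the gateway use.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceConfig:
    model: str        # placeholder identifier, not a real model name
    camera_mode: str  # "frames" for still images, "video" for a live feed


# Hypothetical profiles: one per model you want to compare.
PROFILES = {
    "default": VoiceConfig(model="placeholder/realtime-a", camera_mode="frames"),
    "live-video": VoiceConfig(model="placeholder/realtime-b", camera_mode="video"),
}


def load_config(env=None):
    """Pick a profile by name from the environment, defaulting to 'default'."""
    env = os.environ if env is None else env
    name = env.get("REACHY_PROFILE", "default")
    if name not in PROFILES:
        raise ValueError(f"unknown profile {name!r}; expected one of {sorted(PROFILES)}")
    return PROFILES[name]
```

The rest of the agent would read only `VoiceConfig`, so comparing two models means changing one environment variable rather than touching the session code.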

If you've used the OpenAI-compatible API from Opper before, this is the same idea extended to realtime voice.

And it doesn't have to run in the cloud

Today Reachy's brain lives in the cloud. But the same gateway can run on sovereign, EU-based infrastructure like evroc, which means the model can sit close to the robot instead of a continent away. Lower latency, data that never leaves your walls, and a path toward robots that keep their wits about them even when the internet doesn't. Eventually, close enough to run on the robot itself.

Try it yourself

The whole thing is open source and runs in two terminals. Plug a Reachy Mini in over USB, then:

# Terminal 1: daemon that owns the USB connection to the robot
reachy-mini-daemon --fastapi-port 1111

# Terminal 2: the voice agent (first run signs you in via OAuth, no API key)
python -m reachy_agent --opper-login

Open http://localhost:1080 and talk to Reachy. There's a little UI showing the live camera frame, the transcript, and every tool call as it happens.
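If the agent can't find the robot, a quick sanity check is whether anything is listening on the daemon's port. Here is a minimal sketch using only the standard library. It assumes the daemon runs on the same machine and accepts TCP connections on the port 1111 passed in Terminal 1. `port_is_open` is a hypothetical helper, not part of the repo.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Calling `port_is_open("127.0.0.1", 1111)` before starting the agent tells you whether the daemon in Terminal 1 is up. It says nothing about whether the robot itself is connected over USB.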

Full setup, the 19 tools, the voices, sound localisation, and a demo video are all in the repo:

👉 github.com/opper-ai/reachy-voice-realtime

And because every good demo deserves a proper sign-off, here is Reachy saying goodbye 👋

Add a brain to your own interface

The gap between "a robot we'd never seen" and "a working, talking prototype" was a single morning. The brain is the easy part now, and that changes what a small team can build in an hour.

Whether your "body" is a robot, an app, a kiosk or a phone line, the recipe is the same: point a realtime voice session at it and hand the model a few tools.

Sign up for Opper to get one API key for realtime voice across providers, then clone the Reachy voice agent to see a full working example. We'd love to see what you give a brain to next.