Nov 2023

Oda — a Norwegian voice assistant

A two-month project for NTNU's advanced parallel track: a Norwegian-speaking voice assistant called Oda, running on a Raspberry Pi Zero W. A custom RNN wake-word model trained on 30k Norwegian audio clips I filtered myself, Whisper for speech-to-text, an offline GPT for open questions, and a mesh of ESP32 sensor modules for the things a chatbot can't know.

What it does

You say "Oda," then ask something. If it's a command the system recognises ("what's the time", "what's the temperature in the room") it answers from cache or sensor data. If it's an open question ("what's the meaning of life") it routes to a local LLM. Everything runs offline by default — no cloud APIs in the answer path.
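The routing is simple enough to sketch. Here is a minimal version of the idea; the sensors and llm objects and the command table are stand-ins for illustration, not the project's actual code:

```python
# Minimal sketch of the command/LLM routing described above. The command
# table, sensors.read() and llm.generate() are illustrative placeholders.
import time

CACHED_COMMANDS = {
    "hva er klokka": lambda: time.strftime("%H:%M"),  # "what's the time"
}

def answer(utterance: str, sensors, llm) -> str:
    text = utterance.lower().strip()
    # 1. Known command: answer from cache.
    if text in CACHED_COMMANDS:
        return CACHED_COMMANDS[text]()
    # 2. Question about the room: answer from the ESP32 mesh.
    if "temperatur" in text:
        return f"{sensors.read('thermometer')} °C"
    # 3. Anything else: hand to the local LLM. No cloud call anywhere.
    return llm.generate(text)
```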

The assistant on the workbench mid-assembly: a Bumblebee helmet enclosure with wiring spilling out the back, breadboard and modules behind it.
Mid-build. Bumblebee enclosure, Pi Zero inside, ESP32 sensor modules behind.

Wake-word from scratch

The hardest part of the project was data, not modelling. There is very little public Norwegian speech data with reliable captions, so I built my own pipeline: the 27k clips from Språkbanken (the Norwegian language bank), plus ~750k clips scraped from 251 captioned Norwegian YouTube channels. Caption quality varied wildly — TV channels like NRK and TV2 sometimes had Norwegian text over English audio — so I re-ran every clip through Whisper-large and only kept the ones where Whisper agreed with the original caption. About 30% survived. That left me ~30k high-quality clips to train on.
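The agreement check can be as simple as a string-similarity threshold between the original caption and Whisper's transcript. A minimal sketch using the open-source whisper package; the normalisation and the 0.9 cut-off are my guesses, not the project's actual values:

```python
# Sketch of the caption-vs-Whisper agreement filter. Threshold and
# normalisation are illustrative assumptions.
import difflib
import whisper

model = whisper.load_model("large")

def normalise(text: str) -> str:
    return " ".join(text.lower().split())

def keep_clip(audio_path: str, caption: str, threshold: float = 0.9) -> bool:
    """True if Whisper's transcript agrees closely enough with the caption."""
    transcript = model.transcribe(audio_path, language="no")["text"]
    ratio = difflib.SequenceMatcher(
        None, normalise(caption), normalise(transcript)
    ).ratio()
    return ratio >= threshold
```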

To make the model smaller, I swapped the standard Norwegian alphabet for a phonetic one with two extra letters: ŋ for the "ng" sound and ʂ for the "sj/skj/kj/tj" cluster. Fewer, more evenly distributed labels — easier to learn. The wake-word model itself is a small GRU trained with CTC loss, chosen for efficiency on the Zero W.
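A minimal PyTorch sketch of both ideas, the cluster-collapsing relabel and the GRU-with-CTC model; the layer sizes and label count here are illustrative, not the trained model's:

```python
# Sketch of the phonetic relabelling and the wake-word model.
# Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

# Collapse multi-letter clusters into single phonetic labels.
PHONETIC = {"skj": "ʂ", "sj": "ʂ", "kj": "ʂ", "tj": "ʂ", "ng": "ŋ"}

def to_phonetic(text: str) -> str:
    # Replace longer clusters first so "skj" isn't eaten by "sj".
    for cluster, symbol in sorted(PHONETIC.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(cluster, symbol)
    return text

class WakeWordGRU(nn.Module):
    def __init__(self, n_mels: int = 40, hidden: int = 64, n_labels: int = 32):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_labels + 1)  # +1 for the CTC blank

    def forward(self, mel):                     # mel: (batch, time, n_mels)
        out, _ = self.gru(mel)
        return self.head(out).log_softmax(-1)   # CTC loss expects log-probs

ctc_loss = nn.CTCLoss(blank=0)
```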

The pipeline

Side-by-side diagram: an RNN encoder processing tokens one at a time, versus a transformer encoder seeing all tokens at once.
RNN for the wake-word (cheap, always on), transformer for transcription (heavy, fires once per command).
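In code, the split looks roughly like this; wake_score, record_command and transcribe stand in for the real components:

```python
# Sketch of the two-stage pipeline: the cheap GRU scores every audio frame,
# the heavy transcriber fires once per detected wake word. All function
# arguments are placeholders for the project's real components.
import collections

RING = collections.deque(maxlen=50)  # ~1 s of 20 ms feature frames

def listen_forever(mic_frames, wake_score, record_command, transcribe, handle):
    for frame in mic_frames:               # endless stream of mel frames
        RING.append(frame)
        if wake_score(list(RING)) > 0.8:   # cheap RNN, always on
            audio = record_command()       # capture until silence
            handle(transcribe(audio))      # heavy Whisper pass, once
            RING.clear()
```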

ESP32 sensor mesh

The assistant can answer questions about the physical world because cheap ESP32 modules with sensors are scattered around — a thermometer, a PIR motion sensor, a dust sensor. Each ESP32 boots into Bluetooth-server mode if it hasn't seen the Wi-Fi before; saying "oppstart" ("startup") makes the Pi scan for them and hand over the Wi-Fi credentials via Bluetooth (weakly encrypted so the SSID and password aren't broadcast in clear text). Once on the network they're addressable over local sockets — a single byte over the socket means "ping," "name," or "value."
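The request side of that protocol fits in a few lines. A sketch under assumptions: the opcode values, port, and reply framing are made up; only the one-byte-per-request design comes from the build:

```python
# Sketch of the one-byte sensor protocol. Opcodes, port and reply
# format are assumptions for illustration.
import socket

PING, NAME, VALUE = b"\x01", b"\x02", b"\x03"

def ask(host: str, opcode: bytes, port: int = 5000) -> bytes:
    with socket.create_connection((host, port), timeout=2) as sock:
        sock.sendall(opcode)     # the whole request is one byte
        return sock.recv(64)     # e.g. b"stue-termometer" or b"21.5"

# Usage: temperature = float(ask("192.168.1.42", VALUE))
```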

A Raspberry Pi Zero W single-board computer.
Pi Zero W. Brain.
INMP441 I²S MEMS microphone modules with header pins.
Two INMP441 mics, slightly angled, to catch the room.
An HW-104 PAM8403 class-D amplifier board.
PAM8403 class-D amp into a 3Ω speaker.

Demo

Asking the time, the room temperature, and the meaning of life, in Norwegian.

What it taught me

The model was never the bottleneck. Norwegian speech data was, and figuring out how to mine and filter it from YouTube — using a bigger model to grade the training data for a smaller one — was the most useful skill I picked up. The second-most-useful was learning where PyTorch and TensorFlow each pull their weight: PyTorch felt better for training and iterating; TF felt faster once the model was frozen.

Course report

The full Norwegian IELS1001 report — "Talestyrt smart assistent med ESP moduler" ("Voice-controlled smart assistant with ESP modules") — covers the full architecture, the training pipeline, the sensor protocol, and the budget.
