A vision transformer that knows its salmon
The ML half of a five-person NTNU project on monitoring invasive pink salmon (pukkellaks) in Norwegian fjord straits. Underwater cameras watch fish pass through a roughly 2×2×1 m observation chamber; a small CNN detects them locally on a Raspberry Pi; a DINOv2 vision transformer, fine-tuned with LoRA, classifies species and estimates length on the server. I shared hardware responsibility for the camera rig and built the server-side classifier.
Why pink salmon
Norwegian wild salmon (laks) is classified as near threatened on the national red list. The fastest-growing threat is pink salmon, an invasive Pacific species that spawns earlier and more aggressively, displaces wild salmon and sea trout from their best spawning grounds, and then dies en masse (the species has a fixed two-year life cycle), polluting the watercourse with rotting fish. In 2025 control efforts ran in 63 Norwegian rivers.
Existing fish-monitoring tech is mostly designed for inside salmon farms or for river-mouth traps. Nobody is watching what comes into the fjords before the fish move upstream. That is the gap our project tries to address.
The system, briefly
- Observation chamber. 2040×2040×1000 mm aluminium-profile frame with a white tarp back wall, mounted in-stream.
- Cameras. 1× 8MP IMX179 (detail capture; lice identification) and 3× 2MP IMX290 low-light cameras (0.001 lux). Each in a sealed acrylic-dome housing with the lens at the dome's geometric centre to minimise refraction.
- Sensors. Three ESP32 nodes at depths 0.5/1.0/1.5 m, each with a DS18B20 temperature sensor and a Keyestudio TDS probe (a cost-effective proxy for salinity). I²C bus to a Raspberry Pi 4B.
- Edge compute. Raspberry Pi 4B handles four parallel USB-3 video streams, runs the local detection model, and uplinks over 4G.
The two-stage ML pipeline
The interesting design choice was splitting detection from classification:
- Local CNN (YOLO) on the Pi. A small detector watches the video stream in real time, produces bounding boxes and a tracking ID per fish. Most of the time, all it sees is water. When it triggers, only the relevant frames get sent over 4G, which saves bandwidth and storage and keeps the pipeline alive when the cellular link drops.
- Server-side ViT (DINOv2). A much heavier transformer runs on the server, where compute is cheap and the model can afford to be careful. It classifies species (salmon / pink salmon / sea trout / sea char / "no fish") and estimates length, jointly, in a multi-task head.
The "no fish" class is what filters false positives (branches, waves, shadows) that the small local model lets through. Tracking IDs let the server pool multiple frames of the same fish during its passage and vote, which makes the final classification far more robust than any single frame.
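A minimal sketch of that voting step (the function and threshold-free majority rule here are illustrative, not the project's actual code): per-frame predictions are pooled by tracking ID, the majority class wins, and tracks whose majority is "no fish" are discarded as detector false positives.

```python
from collections import Counter, defaultdict

def vote_by_track(frame_predictions):
    """Pool per-frame class predictions by tracking ID and majority-vote.

    frame_predictions: iterable of (track_id, predicted_class) tuples.
    Returns {track_id: final_class}, dropping tracks whose majority
    class is "no fish" (treated as a false positive from the local model).
    """
    pools = defaultdict(list)
    for track_id, cls in frame_predictions:
        pools[track_id].append(cls)

    results = {}
    for track_id, classes in pools.items():
        winner, _ = Counter(classes).most_common(1)[0]
        if winner != "no fish":
            results[track_id] = winner
    return results

preds = [
    (7, "pink salmon"), (7, "pink salmon"), (7, "sea trout"),
    (9, "no fish"), (9, "no fish"), (9, "salmon"),
]
print(vote_by_track(preds))  # {7: 'pink salmon'} — track 9 dropped as "no fish"
```

The same pooling idea extends naturally to the length estimate (e.g. taking the median over a track's frames) and to confidence-weighted rather than unweighted votes.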
The LoRA part
DINOv2 is a strong general-purpose ViT, but it has not seen pink salmon in low light through an acrylic dome. Full fine-tuning would update hundreds of millions of parameters; that is expensive and overkill for our dataset size.
LoRA (Low-Rank Adaptation) assumes the changes you actually need to make during adaptation have low intrinsic rank. For each frozen weight matrix W ∈ ℝ^{d×k}, you inject a trainable low-rank update ΔW = B·A, where B ∈ ℝ^{d×r} and A ∈ ℝ^{r×k} with r ≪ min(d, k). The original weights stay put; only the adapters are trained.
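To make the rank argument concrete, a quick parameter count (the dimensions below are illustrative of a typical ViT projection matrix, not the exact DINOv2 configuration):

```python
def lora_param_counts(d, k, r):
    """Compare trainable parameters: full fine-tune of W vs. a rank-r adapter."""
    full = d * k          # updating all of W in R^{d x k}
    lora = r * (d + k)    # B in R^{d x r} plus A in R^{r x k}
    return full, lora

full, lora = lora_param_counts(d=1024, k=1024, r=8)
print(full, lora, full / lora)  # 1048576 16384 64.0
```

At rank 8 the adapter has 64× fewer trainable parameters than the matrix it adapts; summed over every adapted layer, that is the gap between "hours on one GPU" and "days".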
In practice this gave us a model that handled our domain well, trained in hours instead of days, and didn't need a massive amount of labelled data to land somewhere useful.
My slice of the project
Five-person team. From the report's work-allocation section, my share was:
- Hardware (shared): physical camera-system layout, component interconnection, power distribution.
- Species classifier: adapting and validating DINOv2 with LoRA for the species task, and integrating the classifier with the video pipeline.
Teammates owned sensors, detection & length estimation, frontend, and backend respectively. Everyone wrote.
Writeup
Full Norwegian report: system design, hardware bill of materials, ML method, validation, all of it.