Object Detection on the Raspberry Pi 5
Background
I recently configured some Raspberry Pi 5s to run USB cameras and capture video throughout the day. The video segments are stored on the Pi in an m3u8 playlist, which lets me embed them on my Grafana dashboard. I wanted to keep at least a day's worth of video, giving me plenty of time to catch up on any events in the backyard while I was away. There are also some neighborhood cats that move around at night, and I was hoping to catch them when they trigger the motion-sensitive lights.
Navigating 24+ hours of video on a tiny slider is quite difficult. Seeking to the exact moments when my lights are on (for 15 seconds!) is even harder. So after setting up just the camera video stream, I wanted to add some sort of object detection. Having already tuned ffmpeg, I knew I was pressed for resources, but I was optimistic that I would find something that could run on the RPi 5. I knew there was a new class of models designed for phones. Google's answer in this space is MediaPipe, though it covers a pretty wide range of solutions. They have a simple tutorial, so it is easy to get started.
I downloaded the recommended model (the int8 variant).
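Getting a first detection out of it follows the tutorial's Python API closely. A minimal sketch, assuming the model was saved as model.tflite and using an arbitrary score threshold:

```python
import cv2
import mediapipe as mp
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

# Create a detector from the downloaded .tflite model (path is an assumption).
options = vision.ObjectDetectorOptions(
    base_options=python.BaseOptions(model_asset_path="model.tflite"),
    score_threshold=0.5,  # arbitrary starting point
)
detector = vision.ObjectDetector.create_from_options(options)

# OpenCV frames are BGR; MediaPipe expects RGB.
frame = cv2.imread("frame.jpg")
rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb)

result = detector.detect(mp_image)
for detection in result.detections:
    category = detection.categories[0]
    print(category.category_name, category.score, detection.bounding_box)
```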
Implementation
I wanted to use the m3u8 stream as input instead of the camera device itself so that the detection service could potentially run on a better device. This also let me develop the service against my desktop's Docker instance instead of developing on the Raspberry Pi itself. It turns out that OpenCV accepts m3u8 streams as input to cv2.VideoCapture(...), so this was quite easy to get working.
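A minimal sketch of the input side (the playlist URL is a placeholder):

```python
import cv2

# cv2.VideoCapture accepts an HLS playlist URL directly (via its ffmpeg backend).
cap = cv2.VideoCapture("http://raspberrypi.local/stream/playlist.m3u8")

while True:
    ok, frame = cap.read()
    if not ok:
        break  # stream ended or a segment failed to fetch
    # ... run detection on `frame` here ...
```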
However, I soon found out that the detection call is quite slow, and I could see the segments being fetched fall progressively farther behind the "latest" segment in the playlist. Because my playlist keeps so many segments, fetching an old or deleted segment does not error (at least for the first several hours). I then added the following features to my image detector service (sketched after the list):
- a Prometheus histogram to measure the duration of a detection call
- a feedback loop to use the duration of the detection to limit the times a detect is called
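Roughly, the two pieces look like this (the metric name is my own, the throttle logic is simplified, and detect() stands in for the MediaPipe call sketched earlier):

```python
import time

import cv2
from prometheus_client import Histogram, start_http_server

# Hypothetical metric name; a Prometheus histogram exports _sum and _count.
DETECT_DURATION = Histogram(
    "detection_duration_seconds",
    "Time spent in a single detection call",
)

start_http_server(8000)  # expose /metrics for Prometheus to scrape

cap = cv2.VideoCapture("http://raspberrypi.local/stream/playlist.m3u8")
fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
last_duration = 0.0

while True:
    # Feedback loop: skip roughly as many frames as the last detection
    # took, so the reader stops drifting behind the live playlist.
    for _ in range(int(last_duration * fps)):
        cap.grab()  # advance the stream without decoding a frame

    ok, frame = cap.read()
    if not ok:
        break

    start = time.monotonic()
    detect(frame)  # the MediaPipe call from the earlier sketch
    last_duration = time.monotonic() - start
    DETECT_DURATION.observe(last_duration)
```

With that histogram in place, the per-call average in Grafana is just `rate(detection_duration_seconds_sum[5m]) / rate(detection_duration_seconds_count[5m])`.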
Dividing the histogram's sum by its count, I can see that the higher-res camera (1920x1080 = 2,073,600 pixels) takes about 260 ms per detection and the lower-res camera (1280x720 = 921,600 pixels) about 220 ms. Given that the timing is so similar for an image more than twice the size, I do wonder how much of the overhead comes from running Python instead of a C++ version of the service. I am currently using the following CPU config in docker compose:
```yaml
cpu_count: 1
cpus: "1"
cpuset: "1"
```
ffmpeg runs with a similar config, but with its own cpuset and 2 CPUs.
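That is, something along these lines, where the exact core assignments are my own assumption:

```yaml
cpu_count: 2
cpus: "2"
cpuset: "2,3"
```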
Quality of Detections
The final thing to note is the quality of the detections. This task is inherently ambiguous, as there are clearly many things in each picture. Should the door be annotated? The wooden pole? The deck? Overall, the detector worked quite well when I was testing it, and it is suitable for detecting people. With awkward camera placements or obstructed views, I suspect it would not perform as well.
Note: I also did not set up the detector to run in video mode or in live-stream mode. I don't know much about the internals of the model, or whether either mode accounts for the differences between frames.
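For reference, switching modes is just a configuration change in the MediaPipe Tasks API; a video-mode variant of the earlier sketch would look roughly like this:

```python
from mediapipe.tasks import python
from mediapipe.tasks.python import vision

options = vision.ObjectDetectorOptions(
    base_options=python.BaseOptions(model_asset_path="model.tflite"),
    running_mode=vision.RunningMode.VIDEO,
    score_threshold=0.5,
)
detector = vision.ObjectDetector.create_from_options(options)

# In VIDEO mode each frame is passed with a monotonically increasing
# timestamp in milliseconds (mp_image built as in the earlier sketch):
result = detector.detect_for_video(mp_image, timestamp_ms)
```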
Moreover, the detector failed to detect the obstructed movements of cats (false negatives).
Here are some recent examples of the “false negatives”:
Given that the detection results leave something to be desired, I have a few options:
- add detection filters to ignore “false” detections (I’ve also had “refrigerator” and “table” labels appear)
- increase the detection threshold (which might ignore obstructed / less confident readings of desirable objects)
- experiment with other models or other running modes
- implement a simpler “detection” function (like distributions of RGB values across various subsections of the image)
Given that the content of the frame is relatively similar from frame to frame, option 4 might be viable. If I had a noisier scene (like a busy street), I would probably be forced to tune the model or go with options 1, 2, or 3.
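To make option 4 concrete, here is a rough sketch of the idea (grid size, histogram bins, and thresholds are all arbitrary): split each frame into a grid, compute a color histogram per cell, and flag a frame when enough cells diverge from the previous frame.

```python
import cv2

def cell_histograms(frame, grid=4, bins=16):
    """Per-cell RGB histograms over a grid x grid split of the frame."""
    h, w = frame.shape[:2]
    hists = []
    for i in range(grid):
        for j in range(grid):
            cell = frame[i * h // grid:(i + 1) * h // grid,
                         j * w // grid:(j + 1) * w // grid]
            hist = cv2.calcHist([cell], [0, 1, 2], None,
                                [bins] * 3, [0, 256] * 3)
            hists.append(cv2.normalize(hist, hist).flatten())
    return hists

def changed(prev, curr, cell_threshold=0.5, min_cells=2):
    """Flag movement when enough cells' histograms diverge."""
    diffs = [cv2.compareHist(p, c, cv2.HISTCMP_BHATTACHARYYA)
             for p, c in zip(prev, curr)]
    return sum(d > cell_threshold for d in diffs) >= min_cells
```

This would be far cheaper than the model, at the cost of telling me only that something changed, not what.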
Wrapping Up
While the project isn't complete just yet, I'll publish this post to get the information out there. I expect it will take several attempts to get this working, so it seems better to keep this one short in case the next one ends up longer.