I live in an area of the United States where I have a lot of random wildlife wandering through my backyard. At the end of 2022, I put up security cameras so I could spy on those little cuties. I very quickly realized, however, that there's a whole lot of nothing that happens in between the times when the fluffy little guys come sauntering through. I started thinking about it and, lo and behold, someone out there on the internet has solved this problem for me. The MegaDetector project has a pretty nifty AI model for finding animals in images. They've also provided some Python code to demonstrate how to use it. And that, in a nutshell, is what this repo is: me leveraging what I know about running distributed computer programs with a nifty AI model that I can use to find animals in my backyard video cameras.
As of the summer of 2024, I'm still trying to figure out how to speed it up without spending thousands on GPU instances on AWS (or a souped-up desktop or something) and, honestly, I think I've reached the limits of what my laptop equipped with an Nvidia RTX 2070 can do. Processing around 90 minutes of video currently takes me about 3.25 hours on my laptop. However ... remember how I said I started with what I know about running distributed computer programs? Well, I put together a Dockerfile and docker-compose.yaml that one could use to run this across several GPU-equipped EC2 instances or something. Note that this probably requires a trust fund or some other means by which you can finance it without impacting your ability to pay rent. I processed exactly one approximately-90-minute video on a g5.xlarge instance plus my laptop and it took ... about 90 minutes. Which I guess means I could have just sat there and watched it myself. 🤷 You can use that information however you want to. I'll get into the nitty gritty of how to set this project up to run that way further down in the README, but first let's talk about what it takes to run it in the first place.
## Prerequisites

At the very least, you'll need an Nvidia GPU-equipped computer to run this on. I've never attempted to run it on anything other than Ubuntu 22.04 with version 12.5 of the Nvidia CUDA Toolkit, but I think it'll probably work with older versions as well. Getting that set up can be a little weird, but the most concise instructions I've found are here. I make no claims to know very much about that process, but if you get stuck, maybe we can put our heads together and figure it out (just open an issue).
One other thing that has been a bit of a struggle for me is having enough space on a fast enough disk. Firstly, extracting all of the frames from hours of video takes up a whole bunch of disk space. I tried to tune the JPEG compression the project uses so that it retains quality while keeping the files a little smaller. That said, a video of approximately 90 minutes has over 100,000 frames in it (and that's at a pretty low frame rate). In my experience, this can consume around 180GB of disk space.
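If you want to sanity-check that number, the back-of-the-envelope math looks roughly like this (the per-frame size is just what the ~180GB figure works out to, not a measured constant):

```python
# Rough disk usage estimate for a ~90 minute video at 20 fps.
minutes = 90
fps = 20
frames = minutes * 60 * fps              # 108,000 frames
avg_frame_mb = 1.7                       # approximate JPEG size implied by ~180GB total
print(f"{frames:,} frames, roughly {frames * avg_frame_mb / 1024:.0f}GB of JPEGs")
```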
The other thing that ends up requiring a lot of disk space is the Docker images. The image that the Dockerfile in this repo builds is over 10GB, so when you're messing around with it and building different versions, it can add up.
Besides space, the disk needs to be relatively fast. So, if you're thinking "I've got a big ol' USB backup drive I can use for this", just be prepared to wait, because having a fast disk ends up making a real difference. Further down, when I talk about how to distribute this across several systems, the setup seamlessly uses AWS S3 as the storage backend. This is fine, I guess (and really the only way to make it work in a distributed way) but, again, a fast local disk really makes it better.
To use the Docker setup, you'll also need to install the Nvidia Container Toolkit. FWIW, that's probably the vastly simpler way of using this (and, to be honest, probably the better way, since it'll be easier to scale and run across several systems, too). You also won't need to worry about getting the CUDA Toolkit set up on your host, since the container handles all of that for you and just leverages whatever GPU you have on the host.
So, to summarize for the people who are just skimming, the prerequisites are:
- Nvidia GPU
- Debian-ish Linux distro (tested on Ubuntu 22.04)
- Nvidia CUDA Toolkit (tested with 12.5 but older versions will probably work)
- A lot of space on a fast disk (optional but, like, definitely worth it)
- For Docker: Nvidia Container Toolkit
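Once you think the GPU side of things is set up, a quick way to confirm that Python can actually see the card (assuming you have PyTorch installed, which the MegaDetector model needs anyway) is something like this:

```python
import torch

# Should print True and the name of your GPU (e.g. an RTX 2070) if the
# driver and CUDA setup are working.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
```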
## Running it locally

The simplest way to run this is just on your local machine, using the `docker-compose-local.yaml` file with the `.env.local` file to populate the env vars. To run it in a more distributed way, see the section on running it in a more distributed way below. To get a basic running version up, here's the tl;dr:
- Build the docker image (see "Docker build process" below)
- Make a copy of the example local env file and make changes as needed
  (probably the only thing you'll want to think about changing is
  `STORAGE_MOUNT`, but it should just work as is):

  ```bash
  cp .env.local.example .env.local
  ```

- Run the docker compose file using the copy of the env file you just made:

  ```bash
  docker compose -f docker-compose-local.yaml --env-file .env.local up
  ```
That's basically it. You should get an `admin` container for free. By default, it's configured to make a user for you if you set `DJANGO_SUPERUSER_USERNAME`, `DJANGO_SUPERUSER_PASSWORD` and `DJANGO_SUPERUSER_EMAIL` in your `.env.local` file. Then you should be able to log in with those creds by navigating to `http://127.0.0.1:8000/admin` in your web browser. If you've used the Django admin before, this will look familiar. Once you're logged in, you can click through to Video and then Add Video in the upper right hand corner. From there, you should be prompted for a file to upload. Once the video is uploaded, it should start processing.
## Running it in a more distributed way

The example I've included for running this in a distributed way is running some workers on an AWS EC2 instance with a GPU and some workers on a local machine. If you'd like to run this entirely on AWS or another cloud provider, one thing you'll need to do is make the `admin` container slightly less dumb. Right now it's just using the Django development server and isn't behind a web server, isn't using SSL, etc., etc. I'd really, really recommend not running that as-is anywhere but your local machine. I've been deploying Django in production environments since 2009 and have done this in Docker a few times as well, so if you get stuck attempting to Google for it, open an issue and I'll give you some pointers.
At any rate, the tl;dr to get this running on AWS is:
- **Ask AWS to let you spin up GPU instances.** If you don't already have permission, this is something that you have to submit a support ticket to get turned on. They seem to get back to you pretty quickly but, it's a step nonetheless. If you're not familiar with how they do these things, you're asking for the ability to run a certain number of vCPUs of a given instance type. I was able to get 8 vCPUs "granted" to me for "G" type instances (which are the cheapest GPU instances as of early 2024).
- **Spin up your GPU instance and install things.** I've included instructions
  for getting things set up on an Ubuntu 22.04 machine above (see
  "Prerequisites") and it should work in more or less the same way if you're
  using Ubuntu 22.04 for your new instance. One nice thing that comes with the
  "G" type instances is a 250GB instance store, which I started using for my
  Docker setup so that I didn't have to pay for a massive EBS volume. If you
  want to do something similar, you can format and mount that device and then
  add an `/etc/docker/daemon.json` file that tells Docker to keep its images
  there. I'll let you go ahead and Google that so this tl;dr doesn't get too
  long.
- **Build the docker image.** You can either do this locally and push the image
  to AWS's Container Registry (aka ECR; that's what I was doing) or just build
  the image on your new instance. Either way, you can follow the "Docker build
  process" below. Before you push to ECR, you'll need to get the login creds
  and configure Docker to use them. Here's a nifty one-liner for that:

  ```bash
  aws ecr get-login-password --region us-east-2 | docker login --username AWS --password-stdin <your-ecr-hostname>
  ```
- **Make a DB instance.** You should be able to use the smallest instance type
  available to run PostgreSQL or, if you're fancy (and really like giving AWS
  money), you can use RDS or Aurora. You should be able to just spin up a Linux
  instance of your choosing and install PostgreSQL on it. You'll want to edit
  your `pg_hba.conf` file to allow your GPU instance(s) and your local machine
  to connect to it. Also make sure it uses the same security group as the GPU
  instance(s) you spun up; that'll make the next step easier.
- **Set up your Security Group.** You'll want to add a couple of rules: one
  that allows instances associated with that security group to connect to one
  another on port 5432, and one that allows your local machine to connect on
  port 5432. I suppose you could make a couple of security groups and make that
  a little cleaner but, ya know, let's just get to the good part, shall we?
- **Copy the example .env file.** Similar to running this thing locally, you'll
  need to make a copy of `.env.aws` and make changes to it as needed. At the
  very least you'll need to change:
  - `AWS_ACCESS_KEY_ID`
  - `AWS_SECRET_ACCESS_KEY`
  - `AWS_REGION`
  - `FRAMES_BUCKET`
  - `VIDEOS_BUCKET`
  - Probably most of the `DB_*` vars, based on how you set up your DB instance
  - `EXTRACT_WORKERS_PER_HOST`: configures the number of replicas that will be
    extracting images from the video chunks. On a g5.xlarge I was able to get
    away with 4. On my laptop with an RTX 2070, I used 2.
  - `DETECT_WORKERS_PER_HOST`: configures the number of replicas that will work
    on finding things in the frames of the videos. On a g5.xlarge I was able to
    use 4. On my laptop with an RTX 2070, I used 2.
  - `STORAGE_MOUNT`: you'll probably want to use the instance store for this.
    There's a part in the processing where chunks of the video files get
    written to disk, so you'll need some space. They get cleaned up as they are
    processed, but it's still something worth noting.
- **Run it!** You should be able to run the AWS version of the docker compose
  file along with the AWS version of the .env file on the GPU instance(s) as
  well as your local machine, like so:

  ```bash
  # Run the admin, detect and extract containers locally
  docker compose -f docker-compose-aws.yaml --env-file .env.aws up admin detect extract

  # ... and on your GPU instance ...
  # Run the detect, extract and chunk_video containers
  docker compose -f docker-compose-aws.yaml --env-file .env.aws up chunk_video detect extract
  ```
- **Process a file.** This is the same as on a local setup. Just navigate to
  `http://127.0.0.1:8000/admin` on your local machine, log in and then click
  through the UI to upload a new Video.
That probably glosses over some details but if you're comfortable with AWS and OK at Googling, you should be able to get things going. If not, open an issue and I'll see if I can help you out.
The main output of this is a DB table that records where in an image the detector found something and what kind of thing it is. Here's what that table looks like:
```
  id  | category | confidence | x_coord | y_coord | box_width | box_height | frame_id
------+----------+------------+---------+---------+-----------+------------+----------
  319 |        2 |      0.699 | 0.08125 |  0.3629 |      0.05 |     0.3407 |     1331
  366 |        2 |      0.793 | 0.07812 |  0.3638 |   0.05312 |     0.3407 |     1332
  314 |        2 |      0.793 | 0.07812 |  0.3638 |   0.05312 |     0.3407 |     1333
  348 |        2 |      0.736 | 0.07604 |  0.3638 |   0.05572 |     0.3388 |     1334
  ... etc ...
```
The `category` is what kind of thing is represented by the detection (1 = animal, 2 = person, 3 = vehicle). The `confidence` is how confident the model was that it is, in fact, the thing that it thinks it is. The coordinates are normalized (0 to 1, relative to the frame dimensions) and represent the upper left hand corner of where the detection was in the image, and you can use the `box_width` and `box_height` to figure out how big the box is.
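Turning a row into pixel coordinates is just a matter of multiplying by the frame dimensions. For example, using row 366 from the table above (the 1920x1080 frame size is an assumption about the camera; use whatever resolution yours records at):

```python
# Convert a normalized detection box into pixel coordinates.
# Values are from row id 366 in the table above; the frame size is assumed.
frame_w, frame_h = 1920, 1080

x, y, box_w, box_h = 0.07812, 0.3638, 0.05312, 0.3407

left = round(x * frame_w)                 # 150
top = round(y * frame_h)                  # 393
right = round((x + box_w) * frame_w)      # 252
bottom = round((y + box_h) * frame_h)     # 761

print(f"category 2 (person) box: ({left}, {top}) -> ({right}, {bottom})")
```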
There's a process that checks every so often to see if there are any videos that are done processing and, if so, it kicks off the process of taking all the images where something was detected and stitching them back together into a video. When that's done, you'll be able to check it out by clicking on the name of the video in the Django admin.
*List of videos uploaded for processing*
## Docker build process

This project relies upon `decord` to quickly extract frames from your video files. In order to enable GPU acceleration for that library, you need to install it from source. The Dockerfile included here will take care of that for you; however, you need to download the Nvidia Video Codec SDK and stick it into the decord folder before you build the docker image. Why can't the Dockerfile just take care of that for you? Because Nvidia wants your email address. Anyways, it's pretty simple:
- Recursively clone the decord repo:

  ```bash
  git clone --recursive https://github.com/dmlc/decord
  ```

- Go to the Nvidia Video Codec SDK download page and download the "Video Codec
  for application developers". It will involve registering with Nvidia (Boo!)
- Copy the zip file you end up with to the directory where you cloned the
  decord repo
- Unzip it
- Build the docker image for this project:

  ```bash
  docker build -t video-processor:latest .
  ```
The build will probably take around 5-10 minutes and use around 10GB of disk.
The short version is that this project uses Django + Celery to run a couple of different "chunks" of the processing pipeline in a more or less distributed way. Could it be more distributed? Probably. But I'm not really willing to give AWS that kind of money (yet). You'll see these stages reflected in the Celery tasks and in the names of the services in the `docker-compose` files. These stages look like this:
**Break the video into chunks.** I found that the `decord` library was great, but it choked if I tried to give it video files that were more than a few minutes long. Luckily, `ffmpeg` can use your GPU to speed up the process of taking one long video and turning it into a bunch of shorter videos, which is what is happening under the hood here. As one "chunk" of the video is completed, the next step in the pipeline (extracting frames from the video) gets kicked off for that chunk. This makes it possible to parallelize the extraction process in a way that lets decord keep the detection queue filled (and not run out of memory, which was what happened when I was attempting to grab batches of frames from one very long video).
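For the curious, this boils down to `ffmpeg`'s segment muxer with the GPU doing the decode and encode. Here's a minimal sketch of the idea; the chunk length, codec choice, and file names are placeholders rather than exactly what the repo's task does:

```python
# A sketch of GPU-accelerated chunking with ffmpeg's segment muxer.
# Chunk length, codec and output naming are assumptions, not necessarily
# what this project's chunk_video task uses.
import subprocess

def chunk_video(src: str, chunk_seconds: int = 120) -> None:
    subprocess.run(
        [
            "ffmpeg",
            "-hwaccel", "cuda",            # decode on the GPU (NVDEC)
            "-i", src,
            "-c:v", "h264_nvenc",          # re-encode on the GPU (NVENC)
            "-an",                         # the detector doesn't need audio
            "-f", "segment",
            "-segment_time", str(chunk_seconds),
            "-reset_timestamps", "1",
            "chunk_%04d.mp4",
        ],
        check=True,
    )

chunk_video("backyard.mp4")
```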
**Extract frames from the video chunks.** As I'm sure you've guessed, this iterates through the chunks of the video, saves out each frame as a separate image, and makes a row for each frame in the DB. This is really where the storage backend you're using comes into play since, as I mentioned above, an approximately 90 minute video with a framerate of 20 fps will use up about 180GB of disk. If you're using S3, you'll need to consider the data transfer costs ($$$ as well as time), too (unless you're doing all your processing in AWS).
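To give a feel for what this step is doing, here's a minimal sketch of reading a chunk with `decord` and dumping frames to JPEGs (file names, batch size and JPEG quality are placeholders; the actual task also records each frame in the DB):

```python
# A minimal sketch: read a chunk with decord and dump each frame to a JPEG.
# Using gpu(0) needs decord built with NVDEC support (that's what the Video
# Codec SDK step in the Docker build process is for); use cpu(0) otherwise.
from decord import VideoReader, gpu
from PIL import Image

vr = VideoReader("chunk_0000.mp4", ctx=gpu(0))
batch_size = 64
for start in range(0, len(vr), batch_size):
    indices = list(range(start, min(start + batch_size, len(vr))))
    batch = vr.get_batch(indices).asnumpy()      # (N, H, W, 3) uint8 RGB
    for offset, frame in enumerate(batch):
        Image.fromarray(frame).save(f"frame_{start + offset:06d}.jpg", quality=80)
```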
**Detect whether or not there are interesting things in the images.** This is the meat and potatoes of the process. It uses the MegaDetector v5a model to figure out whether there are likely to be animals, people, or vehicles in each frame of the video and, if it finds things, saves its findings to a DB table.
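MegaDetector v5a is a YOLOv5-family model, so if you want to poke at it outside of this project, a rough standalone sketch looks like this. This is not necessarily how the detect task here calls it, and it assumes you've downloaded the `md_v5a.0.0.pt` weights and have PyTorch installed:

```python
# Run MegaDetector v5a on a single frame via the YOLOv5 hub loader and print
# detections in the same normalized x/y/width/height layout as the DB table.
import torch
from PIL import Image

model = torch.hub.load("ultralytics/yolov5", "custom", path="md_v5a.0.0.pt")

frame_path = "frame_000123.jpg"          # placeholder path
width, height = Image.open(frame_path).size

results = model(frame_path)
for _, det in results.pandas().xyxy[0].iterrows():
    if det.confidence < 0.2:             # a commonly suggested threshold for MDv5
        continue
    x = det.xmin / width
    y = det.ymin / height
    box_w = (det.xmax - det.xmin) / width
    box_h = (det.ymax - det.ymin) / height
    print(det["name"], round(det.confidence, 3),
          round(x, 4), round(y, 4), round(box_w, 4), round(box_h, 4))
```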
**Check if videos are done processing and stitch the interesting images back into a video.** There's a periodic task that runs every 10 seconds to see if there are any videos that are done processing. If so, it kicks off another task which takes all the images where something was detected and stitches them back together into a video. If you take a look at the table showing the list of videos you've uploaded for processing, you should be able to see whether there's a stitched video available or not. You can check it out by clicking on the name of the video and looking at its detail page.
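Conceptually, the stitching step just writes the frames that had detections back out in order. Here's a minimal sketch with OpenCV; the paths, frame rate and codec are placeholders, and the actual task may well lean on `ffmpeg` instead:

```python
# Stitch the frames that had detections back into a single video file.
import glob
import cv2

frame_paths = sorted(glob.glob("detected_frames/*.jpg"))   # placeholder location
first = cv2.imread(frame_paths[0])
height, width = first.shape[:2]

writer = cv2.VideoWriter(
    "detections.mp4",
    cv2.VideoWriter_fourcc(*"mp4v"),
    20.0,                       # match the frame rate the frames were extracted at
    (width, height),
)
for path in frame_paths:
    writer.write(cv2.imread(path))
writer.release()
```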
The one step in this process that has always vexed me is the part where the video gets broken apart into individual images for each frame. If you look back through the git history for this repo, you'll notice that I was using OpenCV to extract the frames at first. The problem I had with that was that, for my videos in particular, OpenCV would end up not extracting all of the frames in an individual file. This seems to have something to do with the fact that the codec my little security camera system uses to encode the files doesn't produce very good metadata, so any open source tool just gets garbage in.
I did find that `ffmpeg` could extract all the frames if you gave it the correct incantations, but it was always limited since you can only run it in one process. Even the hardware acceleration that you can get with the Nvidia Toolkit doesn't really help much there (it seems to be more for encoding videos). I tried breaking the work of extracting frames up into smaller chunks and then telling a bunch of workers to have `ffmpeg` extract only a particular chunk, but the problem then became that, as the processing got farther and farther into the video, `ffmpeg` had to scan farther and farther into the file to reach its chunk, which was just a non-starter for very large videos.
I then found `decord`, which definitely does things a whole lot faster but has the same limitations as OpenCV (since, under the hood, it seems to be using quite a lot of the same primitives). I was able to get around the limitation that `decord` seems to have with very large videos (aka, it chokes on them) by breaking large videos into smaller chunks, but it still kinda sucks that I can't seem to extract all of the frames from my videos. So, that's where this project currently sits: a faster but imperfect solution. Which doesn't quite feel right to me. Hopefully I'll get some more time to work on this before I run out of room to store all my security camera videos.
All I want is to stare at cute little animals in my backyard!