This repository was created to explore the possibility of populating a NYC-DB instance via Kubernetes Jobs or Amazon Fargate.
There are a few potential advantages to this approach:
- For developers who are more conversant with containerization, it could be more convenient than learning how to deploy via a VPS or through tools like Ansible.
- Containerization allows for nice dev/prod parity.
- It potentially parallelizes the workload over multiple machines, which could increase the speed of populating the database.
- Kubernetes supports Cron Jobs, so even a single-node cluster could be used to keep a NYC-DB instance continuously updated, with a convenient UI provided via the Kubernetes Dashboard. This might be easier to manage than, e.g., custom cron jobs on a VPS. (Note, however, that at the time of this writing, no tooling is provided to help configure Kubernetes Cron Jobs.)
- Amazon Fargate supports Scheduled Tasks, so a NYC-DB instance can be continuously updated on AWS infrastructure as well, though at the time of this writing this approach has a number of limitations compared to the Kubernetes one. However, unlike Kubernetes, it doesn't require any kind of master server, so it could be less expensive to maintain. A script is also included in this repository to help configure such tasks.
You will need Docker.
First, you'll want to create an .env file by copying the example one:
cp .env.example .env # Or 'copy .env.example .env' on Windows
Take a look at the .env file and make any changes to it, if you like.
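For reference, a .env file contains simple KEY=value entries like the one below; the variable name and value shown here are purely hypothetical, and .env.example remains the authoritative, documented list:

# Hypothetical example entry; see .env.example for the real variables and their documentation.
DATABASE_URL=postgresql://nycdb:nycdb@db/nycdb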
The easiest way to develop the loader is via Docker Compose, as it sets up a development Postgres server for you.
Enter the development container by running:
docker-compose run app bash
You can now develop and run the load_dataset.py script. The /app directory is mounted to the root of the repository on your local filesystem, so any changes you make to any files will instantly be reflected in the container environment.
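For example, once inside the container, you might kick off a single dataset load like this; the exact invocation and dataset name are illustrative assumptions, and the script itself documents its real arguments:

python load_dataset.py hpd_violations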
You'll need a Kubernetes (k8s) cluster. The easiest way to obtain one on your local machine is by enabling Kubernetes on Docker Desktop.
You may also want to deploy the Kubernetes Dashboard UI, as it makes introspecting the state of the NYC-DB dataset loader jobs very easy.
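If you do want the Dashboard, it can typically be installed by applying the upstream manifest with kubectl; the version in the URL below is just an example and may be out of date, so check the Kubernetes Dashboard project for the current release:

kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.7.0/aio/deploy/recommended.yaml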
You'll want to build your container image by running:
docker build --target prod -t justfixnyc/nycdb-k8s-loader:latest .
To deploy the jobs to your Kubernetes cluster, first generate job files:
docker-compose run app python k8s_build_jobs.py
Note that the created jobs will use the environment variables defined in your .env file.
Then tell k8s to start your jobs:
kubectl create -f ./k8s-jobs
Now you can visit the "Cron Jobs" section of your Kubernetes Dashboard to see the state of the jobs.
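If you prefer the command line over the Dashboard, standard kubectl commands work too; the job name below is a placeholder for whatever names k8s_build_jobs.py generated, and whether you see Jobs or Cron Jobs depends on the generated manifests:

kubectl get jobs,cronjobs        # list the loader jobs and their status
kubectl logs job/<job-name>      # view the output of a particular job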
If you want to stop the jobs, or clean them up once they're finished, run:
kubectl delete -f ./k8s-jobs
It's also possible to deploy this container as a Task on Amazon Fargate, which supports scheduled tasks. Here are some guidelines:
- The "get started" wizard for Fargate has you set up a Service definition, but you won't need this particular feature, because you're not setting up a web server or anything. You should be able to safely delete the Service after you're done with the wizard.
- You can set your Task's container image to justfixnyc/nycdb-k8s-loader:latest and set the environment variables as per the documentation in the .env.example file.
- When running the Task, you'll want to set "Auto-assign public IP" to ENABLED: even though the container doesn't need to be accessed by anything, apparently your container needs a public IP in order to access the outside world (see aws/amazon-ecs-agent#1128 for more details). Note also that this setting is part of running a Task rather than defining a Task.
- Make sure your DB instance is accessible by the container in your VPC. There can be situations where your DB might be accessible from the internet, yet your container times out when attempting to access it. This could be because the security group for your RDS instance has inbound rules that only allow access from a certain IP range. (If you created an RDS instance via the AWS Console, its security group might be called rds-launch-wizard.)
- At the time of this writing, it's not possible to see task-level CPU/memory utilization in CloudWatch, which is unfortunate (see aws/amazon-ecs-agent#565).
- At the time of this writing, it's not possible to use the SSM parameter store as a secret store (see aws/amazon-ecs-agent#1209). This means you will probably need to specify your database URL as plaintext in your task definition.
To create scheduled tasks for loading each dataset on a regular basis, see aws_schedule_tasks.py.
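The script is assumed to follow the same conventions as the other tools in this repository, so running it with --help from the development container should describe its arguments; this invocation is an assumption rather than documented behavior:

docker-compose run app python aws_schedule_tasks.py --help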
The container can be configured through environment variables, so take a look at the .env.example file for documentation on all of them.
Every dataset is made up of URLs, and the web servers hosting them deliver ETag and/or Last-Modified metadata indicating when the files were last modified. The loader takes advantage of this information: if a dataset hasn't changed since the last time the loader retrieved it, it won't be retrieved or loaded again. This behavior can be overridden by deleting the last modification metadata for the dataset via the dbtool.py lastmod:reset command.
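As a rough illustration of the idea (not the loader's actual code), a conditional download with the requests library might look like the sketch below; the function name and the way the metadata is stored are hypothetical:

```python
from typing import Optional
import requests

def fetch_if_changed(url: str, etag: Optional[str], last_modified: Optional[str]):
    """Return the response if the URL has changed since we last saw it, else None."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    res = requests.get(url, headers=headers)
    if res.status_code == 304:
        # The server reports the file is unchanged, so skip downloading and loading it.
        return None
    res.raise_for_status()
    # New content; res.headers contains fresh ETag/Last-Modified values to store.
    return res
```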
If one or more of a dataset's URLs have been changed, the loader downloads them and creates a temporary Postgres schema for the dataset. It then loads the dataset into that schema, which could take a long time. Using the temporary schema ensures that users can still make queries to the public schema (if one exists) while the new version of the dataset is being loaded.
Once the dataset has been loaded into the temporary schema, the loader drops the dataset's tables from the public schema and moves the temporary schema's tables into the public schema.
The loader also tries to ensure that users have the same permissions to the new tables in the public schema that they had to the old tables. However, you should probably verify this manually.
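Conceptually, the swap boils down to something like the following sketch, written here with psycopg2 and hypothetical schema and table names; it is not the loader's actual implementation, and it omits the permission-copying step described above:

```python
import psycopg2

def promote_tables(conn, temp_schema, table_names):
    """Move freshly-loaded tables from a temporary schema into the public schema."""
    with conn:  # run the whole swap in one transaction
        with conn.cursor() as cur:
            for table in table_names:
                # Drop the old copy of the table (if any) from the public schema...
                cur.execute(f'DROP TABLE IF EXISTS public."{table}" CASCADE')
                # ...then move the new copy out of the temporary schema to replace it.
                cur.execute(f'ALTER TABLE "{temp_schema}"."{table}" SET SCHEMA public')
```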
The dbtool.py utility provides a variety of tools for querying the status of the database and making modifications to it.
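For example, the lastmod:reset command mentioned above can be run from the development container; whether it takes the dataset name as its argument is an assumption here, so check the script's own help output:

docker-compose run app python dbtool.py lastmod:reset <dataset>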
To run the test suite, run:
docker-compose run app pytest
At present, the revision of NYC-DB's Python library is pulled directly from GitHub via a commit hash. At the time of this writing, that commit hash is specified with the Dockerfile's NYCDB_REV argument.
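In other words, the Dockerfile contains a build argument along these lines; the value shown is a placeholder, not the real pinned hash:

ARG NYCDB_REV=<commit hash>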
To update the revision for anyone who is using the justfixnyc/nycdb-k8s-loader:latest image off Docker Hub, issue a PR that changes the default value of the aforementioned NYCDB_REV argument. Our continuous integration system will then ensure that everything still works, and once the PR is merged into main, Docker Hub will re-publish a new container image that uses the latest version of NYC-DB.
This repository also contains wowutil.py, a tool for creating and updating the NYCDB-derived tables and functions required by Who Owns What (WoW).
Currently, the tool creates WoW's tables and functions under a Postgres schema called wow. It's the responsibility of the database administrator to set the Postgres schema search path to wow, public for WoW's functions to work properly.
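For example, assuming a role named wow_user and a connection string in DATABASE_URL (both placeholders), the search path could be set with something like the command below; setting it per-database via ALTER DATABASE works as well:

psql $DATABASE_URL -c 'ALTER ROLE wow_user SET search_path TO wow, public'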
Unlike with the NYCDB datasets, at the time of this writing there are no tools to automate the scheduling of WoW data updates.
It is also your responsibility to ensure that the NYCDB dependencies of the WoW data are already in the database at the time that wowutil.py is used to generate the WoW tables and functions.
The specific version of WoW used by wowutil.py is specified by the WOW_REV argument in the Dockerfile.
As an alternative to running wowutil.py, you can also specify wow as the dataset for the load_dataset.py tool to load: this is essentially a shortcut for wowutil.py build and can be a convenient way to "slot in" the loading of WoW data to an already existing workflow for other NYCDB datasets.
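For example, from the development container (assuming, as above, that load_dataset.py takes the dataset name as its argument):

python load_dataset.py wow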
At the time of this update, there is a custom scheduled ECS rule created via the AWS console that updates the wow dataset.