I use JupyterLab with a variety of Python packages on Linux. Here's a simple recipe to set it up from scratch. Your mileage may vary on other operating systems.
- Make sure Anaconda is installed, then git clone the repository from GitHub and cd into the nyc-stew folder.
- In the nyc-stew folder you should see the environment.yml file. Use it to build the env with conda:
  conda env create -f environment.yml
- Once the env has been created, activate it:
  conda activate stew
- At this point you have a working JupyterLab with the necessary packages. Launch it with:
  jupyter lab
- You're ready to explore, understand, develop, ...
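If the lab does not launch, a quick check from Python can narrow down which step went wrong. The sketch below is not part of the repository: `check_setup` is a hypothetical helper. It assumes conda has set the `CONDA_DEFAULT_ENV` variable on activation and that a `jupyter` executable should be reachable on `PATH`.

```python
import os
import shutil


def check_setup(environ=None, expected_env="stew"):
    """Return a list of human-readable problems; an empty list means all good.

    `environ` defaults to the real process environment; passing a plain dict
    makes the function easy to try out without touching your shell.
    """
    environ = os.environ if environ is None else environ
    problems = []

    # conda exports the name of the active env in CONDA_DEFAULT_ENV.
    active = environ.get("CONDA_DEFAULT_ENV")
    if active != expected_env:
        problems.append(
            f"active conda env is {active!r}, expected {expected_env!r}"
        )

    # Look for the jupyter launcher on the PATH taken from `environ`.
    if shutil.which("jupyter", path=environ.get("PATH")) is None:
        problems.append("jupyter executable not found on PATH")

    return problems


for problem in check_setup():
    print("setup problem:", problem)
```

No output means both checks passed. A message about the env usually means the activate step was skipped in the current shell.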
I am lazy with imports. I set up a default start script in ~/.ipython/profile_default/startup so I don't have to think about specifics in a notebook.
I have included start.py in the notebooks folder. If you don't want to set up a default, just add a code cell (to each notebook) with %run start.py.
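For illustration, a startup script along these lines would do the job. This is a sketch, not the actual contents of start.py: the alias-to-package map is an assumption about a typical geospatial stack, and `load_defaults` is a hypothetical helper. Packages that are not installed are skipped instead of raising an error.

```python
import importlib

# Assumed alias -> module map; edit it to match the packages you actually use.
DEFAULT_IMPORTS = {
    "np": "numpy",
    "pd": "pandas",
    "gpd": "geopandas",
    "plt": "matplotlib.pyplot",
}


def load_defaults(namespace, imports=DEFAULT_IMPORTS):
    """Import each module and bind it to its alias in `namespace`.

    Returns the list of aliases that were loaded successfully.
    """
    loaded = []
    for alias, module_name in imports.items():
        try:
            namespace[alias] = importlib.import_module(module_name)
        except ImportError:
            # Package missing from this environment: skip it quietly.
            continue
        loaded.append(alias)
    return loaded


# In an IPython startup file, or under %run, these globals end up in the
# notebook's namespace.
loaded = load_defaults(globals())
print("start script loaded:", ", ".join(loaded) or "nothing")
```

Skipping missing packages keeps one script usable across environments, at the cost of a later NameError if a notebook relies on an alias that was not loaded.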
My-o-my. I've covered a lot of ground with the data. In general the data flow is:
1. Find data and save to the raw directory.
My raw directory looks like this:
data/raw
├── 311
├── admin-boundaries
├── DEM
├── DEP
├── NYC-2017-STEW-MAP-Public-Version2
├── NYCFutureHighTideWithSLR.gdb
├── NYC_STEWMAP_2017_Networks_Version2_Public.xlsx
├── NYCWRP_Shapefiles_2016
├── slr_metadata.pdf
└── weather
8 directories, 2 files
I have some organization. As of this writing (05/31/2022), the raw directory holds 31G. Way too much for GitHub.
2. Process the raw data and place it in data/processed.
My processed directory looks like:
data/processed/
├── 311
│ ├── dep-clean-geo.parq
│ ├── dep-full.parq
│ ├── dob-clean-geo.parq
│ ├── dob-full.parq
│ ├── dot-clean-geo.parq
│ ├── dot-full.parq
│ ├── dpr-clean-geo.parq
│ ├── dpr-full.parq
│ ├── dsny-clean-geo.parq
│ ├── dsny-full.parq
│ ├── hpd-clean-geo.parq
│ └── hpd-full.parq
├── admin-boundaries
│ ├── boroughs.parq
│ ├── brooklyn.parq
│ ├── CDTA.parq
│ ├── census-tracts-2020.parq
│ └── NTA.parq
├── brooklyn
│ ├── brooklyn-2021-311.parq
│ ├── brooklyn-311-elevation.parq
│ ├── brooklyn-boundary.parq
│ ├── brooklyn-catch-basins.parq
│ ├── brooklyn-census-tracts.parq
│ ├── brooklyn-community-districts-ta.parq
│ ├── brooklyn-dem.parq
│ ├── brooklyn-extreme-flood.parq
│ ├── brooklyn-moderate-flood.parq
│ ├── brooklyn-ms4-drainage.parq
│ ├── brooklyn-ms4-outfalls.parq
│ ├── brooklyn-neighborhoods-ta.parq
│ ├── brooklyn-rainfall-2021.parq
│ ├── brooklyn-slr-2050-08.parq
│ ├── brooklyn-slr-2050-11.parq
│ ├── brooklyn-slr-2050-16.parq
│ ├── brooklyn-slr-2050-21.parq
│ ├── brooklyn-slr-2050-30.parq
│ ├── brooklyn-turfs.parq
│ ├── primst-turfs-counts.parq
│ └── primst-with-alters.parq
├── db
│ ├── popids2.p
│ └── popids.p
├── DCP
│ ├── slr-2050-08.parq
│ ├── slr-2050-11.parq
│ ├── slr-2050-16.parq
│ ├── slr-2050-21.parq
│ ├── slr-2050-30.parq
│ └── slr_metadata.pdf
├── DEP
│ ├── 2021-311.parq
│ ├── brooklyn-extreme.parq
│ ├── catch-basins.parq
│ ├── Data_Dictionary_ExtremeFlood.xlsx
│ ├── extreme-flood-map.parq
│ ├── moderate-flood-map.parq
│ ├── ms4-drainage.parq
│ └── ms4-outfalls.parq
├── office-locations.parq
├── SN
│ ├── connections.parq
│ └── elements.parq
└── turfs.parq
7 directories, 58 files
It contains 3.2G. You can look at the notebooks and see what goes into the transformations.
Note that I am trying to use Parquet files wherever I can. They are much faster to read and more economical on disk.
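The raw-to-processed step can be sketched roughly as follows. This is an assumption about the general shape of the transformation, not the code in the notebooks: `processed_path` and `convert_to_parquet` are hypothetical helpers, and the real notebooks clean and filter the data along the way. It assumes geopandas is installed with pyarrow available for Parquet support.

```python
from pathlib import Path

RAW_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")


def processed_path(group, name):
    """Build the output path, e.g. data/processed/DEP/catch-basins.parq."""
    return PROCESSED_DIR / group / f"{name}.parq"


def convert_to_parquet(raw_file, group, name):
    """Read one raw geospatial file and write it out as Parquet."""
    # Imported lazily so the path helper works without geopandas installed.
    import geopandas as gpd

    gdf = gpd.read_file(raw_file)
    out = processed_path(group, name)
    out.parent.mkdir(parents=True, exist_ok=True)
    gdf.to_parquet(out)
    return out


# Example call with a hypothetical raw file name:
# convert_to_parquet(RAW_DIR / "DEP" / "some-layer.shp", "DEP", "some-layer")
```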
3. For the first release, I am including data/processed/brooklyn/
This directory contains 72M. Somewhat more manageable.
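To get started with the released files, something like the sketch below should work. `list_datasets` and `load_dataset` are hypothetical helpers, not part of the repository. The loader assumes the file carries geometry metadata; for a plain table without geometry, `pandas.read_parquet` would be the call to use instead.

```python
from pathlib import Path

BROOKLYN_DIR = "data/processed/brooklyn"


def list_datasets(folder=BROOKLYN_DIR):
    """Return the sorted dataset names (file stems) found in `folder`."""
    return sorted(p.stem for p in Path(folder).glob("*.parq"))


def load_dataset(name, folder=BROOKLYN_DIR):
    """Load one dataset by name as a GeoDataFrame."""
    # Imported lazily so listing works without geopandas installed.
    import geopandas as gpd

    return gpd.read_parquet(Path(folder) / f"{name}.parq")


print(list_datasets())
# boundary = load_dataset("brooklyn-boundary")
```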