Goal: of the > 3000 counties in the US, want to identify those most likely to become the next 'hotspots'. By this I mean places showing anomalous the infection rates and/or death rates. I do not mean places that simply have 'a lot' of cases, such as NYC, which was bound to have high numbers, but also has high resources. This could identify a subset of places for investigators to focus their attention and for disaster relief managers to be vigilent about sending support.
Solution: combine COVID-19 dataset from the New York Times's public repository, with US Census Bureau's most recent (2018) estimate of population by county. Build features, run Local Outlier Factor analysis.
Need python 3, pandas, scikit-learn, numpy, csv, requests.
There are some geographic exceptions which we will have. In particular, Kansas City and New York City are counted separately from the counties. Also, the fips
identifiers are concatenated state + county codes.
We create new fips
'nycty'
and 'kscty'
, and also remove 'Unkown'
cases.
Have to create 5-digit concatenated fips
codes from state and county fips
. Also, have to extract relevant data, which is total population by county (columns YEAR=11
, AGEGRP=0
), can discard the rest.
Have to aggregate the counties surrounding NYC and Kansas City to make consistent with NYT's dataset.
Features defined in src/define_features.py
.
They are build from: n_days_window
-length series of new_cases
, new_deaths
, new_case_baseline
(how many cases were there n_days_for_incubation
ago?), new_death_baseline
(how many cases were there n_days_to_death
ago?), and population
.
Features are parameterized by n_days_window
, n_days_for_incubation
, n_days_to_death
, n_days_of_infection
. end_date
determines which day we are doing the analysis for, defaults to today.
Any counties whose case record does not exceed min_num_cases
over the window defined by n_days_window
and end_date
are ignored.
Local Outlier Factor estimation. Computes the local density of points using a nearest-neighbor algorithm, and identifies outliers based on how the local density at each point compares to the average density across all sample points.