Add Social Vulnerability Index (SVI) subpackage #169
Conversation
If the data source is inherently spatial, would it make sense for
Thanks for asking! I was planning on bringing this up as I got further along with this PR. My thinking is: as a user, I want the ability to have the geospatial data or not to have the geospatial data. The geospatial data column will contribute more to memory usage, and I would like the ability to limit memory consumption if my problem does not require geospatial data.

However, as I read back over what I just wrote and consider the meaning of "canonical," I think I am persuading myself towards your viewpoint. We have a canonical format and I think it makes sense to include all fields from the canonical format when a dataset has those fields. Do you share that view? And, what are your thoughts on having a method that returns data that includes the geospatial data and one that excludes the geospatial data?
Emphasis on inherently spatial. If the original data source is a spatial format like a Shapefile, GeoJSON, or GeoCSV, then I think we'd want to include a
For this work, I agree with you and I'll make the change to only return
I think we have some leeway since objects like

For example, the
CDC_Social_Vulnerability_Index_2018 (FeatureServer)
Looking back at this work, I think it would be helpful for me moving forward if we could decide what SVI fields our "flagship" api will support. I would like to support a "raw" svi api for users who want to use our tools to just get svi data in a dataframe format. This comment should provide background and enough information to get us started. I'll include my thoughts / opinions in a separate comment below.

Background

To, hopefully, start the conversation, the SVI has been released 5 times (2000, 2010, 2014, 2016, and 2018). The SVI calculates a relative percentile ranking in four theme categories for a given geographic extent:

- Socioeconomic Status
- Household Composition & Disability
- Minority Status & Language
- Housing Type & Transportation
The values used to calculate the percentile ranking for each of the four themes are summed, for each record, to calculate an overall percentile ranking. Rankings are calculated relative to a given state or the entire United States. For all editions of the SVI, rankings are computed at the census tract spatial scale. Meaning, if you were to retrieve the 2018 SVI at the census tract scale, at the state coverage for the state of Alabama, you would receive 1180 records (the number of census tracts in AL in the 2010 census) where each ranked percentile is calculated relative to census tracts in Alabama.

From 2014 onward, the SVI is also offered at the county scale in both the state and U.S. coverage products. The state coverage products allow intra-state comparison and the U.S. coverage products allow national comparison at the census tract or county scales (2014 onward).

Luckily, the facets used to calculate each theme are included in all datasets. A facet is one of the fields contributing to the calculation of a theme's value (e.g. per capita income). In the 2018 edition, there are 124 column headers. This number has fluctuated (mainly only increased) with new releases of the SVI. Number of columns for each release:
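As a rough illustration of the relative percentile ranking described above, here is a sketch in pandas. The column names are hypothetical, and it assumes the (rank − 1) / (n − 1) percentile formula described in the SVI documentation:

```python
import pandas as pd

# Hypothetical summed theme values for five census tracts in one state
df = pd.DataFrame({
    "fips": ["01001", "01002", "01003", "01004", "01005"],
    "theme_sum": [10.0, 40.0, 20.0, 30.0, 50.0],
})

# Percentile rank relative to the records in the frame:
# (rank - 1) / (n - 1) maps the lowest record to 0.0 and the highest to 1.0
n = len(df)
df["rank"] = (df["theme_sum"].rank(method="min") - 1) / (n - 1)
print(df["rank"].tolist())  # → [0.0, 0.75, 0.25, 0.5, 1.0]
```

Restricting the frame to one state's tracts (as in the Alabama example) is what makes the ranking relative to that state.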
Question

The main question I would like to address through discussion is: what fields should be included in the canonical output for the flagship svi client?

Additional Information

Below are links for SVI documentation broken down by year:

- "2000": https://www.atsdr.cdc.gov/placeandhealth/svi/documentation/pdf/SVI2000Documentation-H.pdf
My thoughts are this flagship

state_fips: State FIPS code

As always, this is a conversation and I hope we can find what's best for hydrotools users together! I would really value some opinions on this topic! Pinging @jarq6c.
That synopsis is extremely useful! I may share that widely. Thanks a lot. I have no objections to this initial subset for

Something like this:

```
   state_fips state_name county_name  fips svi_edition geometry    value_theme  value
0        0000         ZZ         foo  1111        2018  POLYGON  Socioeconomic   99.9
1        4444         XX         bar  2222        2018  POLYGON            All   26.3
```

As for additional metadata, you might consider:
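A long format like the one sketched in that table can be produced from a wide frame with `pandas.melt`. A minimal sketch, with illustrative column names rather than the client's actual schema:

```python
import pandas as pd

# Hypothetical wide-format SVI records: one ranking column per theme
wide = pd.DataFrame({
    "fips": ["01001", "01003"],
    "svi_edition": ["2018", "2018"],
    "Socioeconomic": [99.9, 12.3],
    "All": [26.3, 45.6],
})

# Melt the per-theme columns into (value_theme, value) pairs, one row each
long_df = wide.melt(
    id_vars=["fips", "svi_edition"],
    value_vars=["Socioeconomic", "All"],
    var_name="value_theme",
    value_name="value",
)
print(long_df)
```

Each input row becomes one output row per theme column, which is why the long form trades column count for row count.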
Thanks for putting this together!
Great! Please do!

Right, I see that we share the same irk. There are just sooo many columns. I like your proposed solution. I think long term

Moving forward, I think we are on the same page for next steps for this PR. I will work
I use MultiIndexes frequently for intermediate processing, but often have to break out of MultiIndexes due to compatibility issues, mostly when attempting to write these dataframes to disk. The only methods I know of that return MultiIndex dataframes by default are

I agree, I think sticking to a more data-source-native wide format is the best way to go in the short term.
Would it make sense to extend the

```python
# get the basic data (default fields, "wide format")
basic_data = client.get(arg1, arg2)

# get the basic data (default fields, "long format")
long_data = client.get(arg1, arg2, long_form=True)

# list the available fields for a given year
all_fields = client.available_fields(year)

# get all data
all_data = client.get(arg1, arg2, fields=all_fields)

# get a single field
single_data = client.get(arg1, arg2, fields=all_fields[0])

# get data with multi index
data = client.get(arg1, arg2, multi_index=True)
```

The backend can be organized around some common functions/transformations, and the
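One way such keyword arguments could dispatch to a chain of shared backend transformations is sketched below. All function and column names here are hypothetical, not the package's actual implementation:

```python
import pandas as pd

def _to_long(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical wide-to-long transformation shared by the backend
    return df.melt(id_vars=["fips"], var_name="value_theme", value_name="value")

def _to_multi_index(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical index transformation
    return df.set_index(["fips"])

def get(df: pd.DataFrame, long_form: bool = False, multi_index: bool = False) -> pd.DataFrame:
    # Apply each requested transformation in a fixed, documented order
    if long_form:
        df = _to_long(df)
    if multi_index:
        df = _to_multi_index(df)
    return df

wide = pd.DataFrame({"fips": ["01001"], "Socioeconomic": [99.9]})
print(get(wide, long_form=True))
```

Keeping each transformation as a small pure function makes the kwarg combinations easy to test independently.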
Yeah, this makes sense to me.
type encapsulates hydrotools canonical svi fields. in the future this should be separated to a hydrotools canonical module once a provider class has been created.
…uture svi providers will specify a mapping from a location abbreviation to their naming needs
…l change once providers concept is added
I am not keen to use kwargs to change the output format in the
```python
valid_years = typing.get_args(Year)
if year_str not in valid_years:
    valid_years = sorted(valid_years)
```
This doesn't work because `valid_years` contains `str` and `int` types. See below:

```python
from hydrotools.svi_client import SVIClient

client = SVIClient()
print(client.svi_documentation_url(2020))
```

```
Traceback (most recent call last):
  File "main.py", line 7, in <module>
    print(client.svi_documentation_url(2020))
  File "/home/jregina/Projects/hydrotools/python/svi_client/src/hydrotools/svi_client/clients.py", line 199, in svi_documentation_url
    year = utilities.validate_year(year)
  File "/home/jregina/Projects/hydrotools/python/svi_client/src/hydrotools/svi_client/types/utilities.py", line 46, in validate_year
    valid_years = sorted(valid_years)
TypeError: '<' not supported between instances of 'int' and 'str'
```
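One way to avoid the mixed-type comparison would be to sort with a string key. This is only a sketch, assuming `Year` is a `typing.Literal` mixing `int` and `str` values; it is not necessarily the fix that was ultimately applied:

```python
import typing

# Stand-in for the package's Year Literal, which mixes int and str
Year = typing.Literal[2000, 2010, "2014", "2016", "2018"]

valid_years = typing.get_args(Year)

# sorted(valid_years) raises TypeError ('<' unsupported between int and str);
# sorting by string representation sidesteps the mixed-type comparison
sorted_years = sorted(valid_years, key=str)
print(sorted_years)  # → [2000, 2010, '2014', '2016', '2018']
```

Alternatively, normalizing everything to `str` up front (e.g. `sorted(map(str, valid_years))`) avoids the issue entirely at the cost of changing the element types.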
You might consider using categories for some values. Categories reduced the memory footprint from 15.0 MB to 1.0 MB in this example.

```python
from hydrotools.svi_client import SVIClient

client = SVIClient()
gdf = client.get("AL", "census_tract", "2018")

print("BEFORE CATEGORIZATION")
print(gdf.info(memory_usage="deep"))

str_cols = gdf.select_dtypes(include=object).columns
gdf[str_cols] = gdf[str_cols].astype("category")

print("AFTER CATEGORIZATION")
print(gdf.info(memory_usage="deep"))
```

```
BEFORE CATEGORIZATION
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 29500 entries, 0 to 29499
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   state_name          29500 non-null  object
 1   state_abbreviation  29500 non-null  object
 2   county_name         29500 non-null  object
 3   state_fips          29500 non-null  object
 4   county_fips         29500 non-null  object
 5   fips                29500 non-null  object
 6   theme               29500 non-null  object
 7   rank                29500 non-null  float64
 8   value               29500 non-null  float64
 9   svi_edition         29500 non-null  object
 10  geometry            29500 non-null  geometry
dtypes: float64(2), geometry(1), object(8)
memory usage: 15.0 MB
None
AFTER CATEGORIZATION
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 29500 entries, 0 to 29499
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   state_name          29500 non-null  category
 1   state_abbreviation  29500 non-null  category
 2   county_name         29500 non-null  category
 3   state_fips          29500 non-null  category
 4   county_fips         29500 non-null  category
 5   fips                29500 non-null  category
 6   theme               29500 non-null  category
 7   rank                29500 non-null  float64
 8   value               29500 non-null  float64
 9   svi_edition         29500 non-null  category
 10  geometry            29500 non-null  geometry
dtypes: category(8), float64(2), geometry(1)
memory usage: 1.0 MB
None
```
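The saving comes from pandas storing each distinct string once and replacing the column values with small integer codes, which pays off when a column has few unique values (state names, FIPS codes, theme labels). A standalone illustration on synthetic data, independent of the SVI client:

```python
import pandas as pd

# 30,000 rows drawn from only two distinct strings
s = pd.Series(["Alabama", "Wyoming"] * 15000)
cat = s.astype("category")

obj_bytes = s.memory_usage(deep=True)
cat_bytes = cat.memory_usage(deep=True)
print(f"object: {obj_bytes} bytes, category: {cat_bytes} bytes")
```

The categorical series keeps one copy of each string plus an int8 code per row, so its deep memory usage is a small fraction of the object-dtype version.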
```python
# lowercase and strip all leading and trailing white spaces from str columns for consistent
# output and quality control
df_dtypes = df.dtypes
str_cols = df_dtypes[df_dtypes == "object"].index
```
This might be a dumb question, but does this always select only string columns?
I don't think it was a dumb question. I don't think it will always return only string columns; I think datetime columns will also be included. Because of this comment, I changed this check to use `pd.DataFrame.select_dtypes` as you showed above.
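For what it's worth, the two selection approaches behave the same on a given frame; the subtlety is that native `datetime64` columns are not object dtype, while Python `date`/`datetime` objects stored in a column are. A small synthetic illustration:

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "b"],
    # Python date objects are stored with object dtype
    "when": [datetime.date(2018, 1, 1), datetime.date(2018, 1, 2)],
    # native datetime64 columns are NOT object dtype
    "stamp": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "value": [1.0, 2.0],
})

# Both selections return the same object-dtype columns
by_compare = df.dtypes[df.dtypes == "object"].index.tolist()
by_select = df.select_dtypes(include=object).columns.tolist()
print(by_compare, by_select)  # ['name', 'when'] ['name', 'when']
```

So a lowercase-and-strip pass over object columns can still hit non-string values; `select_dtypes` mainly makes the intent explicit rather than changing which columns are picked.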
I like the way this package is structured. The only real "bug" is that the intended error is never raised for invalid years. There could be some memory optimizations and we'll want to check that the current output is what's expected (and possibly document why it's like that).
Overall, great work!
Thanks for finding this issue, @jarq6c! The issue was how I was melting the dataframes. I've pushed a change that resolves the issue you brought to light. Thanks!
Awesome! Thanks for suggesting this! I've pushed a change that now casts string columns to categories!
It looks like pip removed
I removed this option when installing subpackages in all our github actions.
Also, it seems like pytest does not like it when two test files have the same name, even if they are in separate directories. See this gh actions log.
Alright, so with the additions I made today, this is pretty much ready to be merged. I just need to add documentation in the form of a readme and add a description to the PR. However, the tool is now functional! I'll get that done tomorrow! Have a great weekend
@jarq6c, just updated the readme and PR description. Once you and others are satisfied with the code, we should be ready to merge and release!
Looks good to me! Passed all tests and output now makes sense. The input validation also appears to be working.

The only gotcha I ran into (which is my fault) is that `geopandas.GeoDataFrame.to_file` is not compatible with `CategoricalDtype`. This can be resolved by converting categories to strings before writing the file.

```python
from hydrotools.svi_client import SVIClient

client = SVIClient()
gdf = client.get(
    location="WY",
    geographic_scale="county",
    year="2000",
    geographic_context="national"
)

# Convert categorical columns to strings before writing
cols = gdf.select_dtypes("category")
for c in cols:
    gdf.loc[:, c] = gdf[c].astype(str)

gdf.to_file("wyoming_svi_county.geojson", driver="GeoJSON")
```
We'll pin a release before and after merging this in. I'll pin a release after #191 is merged.
Great find regarding
Might be a good idea to at least add to the
Live on pypi. |
Thanks! I was going to take this chance to look over the other packages before pinning a HydroTools v3.0.0 release with the
Sorry if I jumped the gun. I should have communicated with you before pushing to PyPI. As for pet issues, I would like to update the
This PR adds a client for programmatically accessing the Centers for Disease Control and Prevention's (CDC) Social Vulnerability Index (SVI).
The SVI has been released 5 times (2000, 2010, 2014, 2016, and 2018) and calculates a relative percentile ranking in four theme categories, plus an overall ranking, at a given geographic context and geographic scale. The themes are:

- Socioeconomic Status
- Household Composition & Disability
- Minority Status & Language
- Housing Type & Transportation
Rankings are calculated relative to a geographic context: a single state or all states (the United States). Meaning, for example, a ranking calculated for some location at the United States geographic context is relative to all other locations in the United States where rankings were calculated. Similarly, SVI rankings are calculated at two geographic scales: the census tract and county scales. Meaning, the rankings correspond to either a census tract or a county. For completeness, for example, if you were to retrieve the 2018 SVI at the census tract scale, at the state context for the state of Alabama, you would receive 1180 records (the number of census tracts in AL in the 2010 census) where each ranked percentile is calculated relative to census tracts in Alabama. The tool released in this PR only supports querying for rankings calculated at the United States geographic context. Future work will add support for retrieving rankings at the state geographic context.
Documentation for each yearly release of the SVI is linked below:
Example
Additions
Testing
Todos
Checklist