-
Notifications
You must be signed in to change notification settings - Fork 44
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add tests to make sure each source makes it to the score correctly #1878
Changes from 9 commits
b26d359
ec79f87
d90292b
c0c2064
9cd5eb6
98a02c4
0d919fd
358a1fd
17669c9
4459a77
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,12 +2,218 @@ | |
import pytest | ||
from data_pipeline.config import settings | ||
from data_pipeline.score import field_names | ||
from data_pipeline.etl.score import constants | ||
|
||
GEOID_TRACT_FIELD_NAME = field_names.GEOID_TRACT_FIELD | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. an extremely annoying nit: we set this constant here and then lower, we use There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thank you. |
||
|
||
|
||
@pytest.fixture(scope="session") | ||
def final_score_df(): | ||
return pd.read_csv( | ||
settings.APP_ROOT / "data" / "score" / "csv" / "full" / "usa.csv", | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Question: will this test fail until the user has run the full score pipeline at least once on their local? If so, that could potentially make it hard for new developers who are just "opening the box" on the CEJST. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It will, but any tests that use it are marked as smoketest and skipped by default. I should add something to the readme tho, so I will do so. |
||
dtype={field_names.GEOID_TRACT_FIELD: str}, | ||
dtype={GEOID_TRACT_FIELD_NAME: str}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def census_df(): | ||
census_csv = constants.DATA_PATH / "dataset" / "census_acs_2019" / "usa.csv" | ||
return pd.read_csv( | ||
census_csv, | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def ejscreen_df(): | ||
ejscreen_csv = constants.DATA_PATH / "dataset" / "ejscreen" / "usa.csv" | ||
return pd.read_csv( | ||
ejscreen_csv, | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def hud_housing_df(): | ||
hud_housing_csv = ( | ||
constants.DATA_PATH / "dataset" / "hud_housing" / "usa.csv" | ||
) | ||
Comment on lines
+38
to
+41
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Idea: why re-write the paths for each dataset, when there's a class variable for it here? I'm mostly thinking about keeping it DRY? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (this would apply to all the df's here) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. They're all instance variables and it just didn't make sense to me to import and instantiate all those classes to just get the paths. I tried to factor them up as classvars but they rely on a method so like it didn't make a ton of sense. So in this case I figured keeping the coupling lower and relying on a pretty thorough convention was worth the repetition. Open to arguments otherwise, but that as my logic. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Makes sense, Matt, thanks! I've seen this repetition pattern in fixtures often to totally decouple things. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🙇♂️ |
||
return pd.read_csv( | ||
hud_housing_csv, | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def cdc_places_df(): | ||
cdc_places_csv = constants.DATA_PATH / "dataset" / "cdc_places" / "usa.csv" | ||
return pd.read_csv( | ||
cdc_places_csv, | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def census_acs_median_incomes_df(): | ||
census_acs_median_incomes_csv = ( | ||
constants.DATA_PATH | ||
/ "dataset" | ||
/ "census_acs_median_income_2019" | ||
/ "usa.csv" | ||
) | ||
return pd.read_csv( | ||
census_acs_median_incomes_csv, | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def cdc_life_expectancy_df(): | ||
cdc_life_expectancy_csv = ( | ||
constants.DATA_PATH / "dataset" / "cdc_life_expectancy" / "usa.csv" | ||
) | ||
return pd.read_csv( | ||
cdc_life_expectancy_csv, | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def doe_energy_burden_df(): | ||
doe_energy_burden_csv = ( | ||
constants.DATA_PATH / "dataset" / "doe_energy_burden" / "usa.csv" | ||
) | ||
return pd.read_csv( | ||
doe_energy_burden_csv, | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def national_risk_index_df(): | ||
return pd.read_csv( | ||
constants.DATA_PATH / "dataset" / "national_risk_index" / "usa.csv", | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @mattbowen-usds a design question -- for some of these, we have the class method to get the df. I assumed that loading the df fresh was helpful because it gets like, the most final possible output -- but I'm wondering if that's the right assumption here? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I mostly just didn't want to couple to all those classes, since the overall design of the system assumes these CSVs exist. I am not like deeply committed to not using the classes --- just try to import as little of the system under test in my fixtures as I can to prevent bugs there from crashing or otherwise harming my tests. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Maybe this is related to the question I posted and makes sense. |
||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def dot_travel_disadvantage_df(): | ||
return pd.read_csv( | ||
constants.DATA_PATH / "dataset" / "travel_composite" / "usa.csv", | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def fsf_fire_df(): | ||
return pd.read_csv( | ||
constants.DATA_PATH / "dataset" / "fsf_wildfire_risk" / "usa.csv", | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def fsf_flood_df(): | ||
return pd.read_csv( | ||
constants.DATA_PATH / "dataset" / "fsf_flood_risk" / "usa.csv", | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def nature_deprived_df(): | ||
return pd.read_csv( | ||
constants.DATA_PATH / "dataset" / "nlcd_nature_deprived" / "usa.csv", | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def eamlis_df(): | ||
return pd.read_csv( | ||
constants.DATA_PATH / "dataset" / "eamlis" / "usa.csv", | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def fuds_df(): | ||
return pd.read_csv( | ||
constants.DATA_PATH / "dataset" / "us_army_fuds" / "usa.csv", | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def geocorr_urban_rural_df(): | ||
geocorr_urban_rural_csv = ( | ||
constants.DATA_PATH / "dataset" / "geocorr" / "usa.csv" | ||
) | ||
return pd.read_csv( | ||
geocorr_urban_rural_csv, | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def census_decennial_df(): | ||
census_decennial_csv = ( | ||
constants.DATA_PATH / "dataset" / "census_decennial_2010" / "usa.csv" | ||
) | ||
return pd.read_csv( | ||
census_decennial_csv, | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def census_2010_df(): | ||
census_2010_csv = ( | ||
constants.DATA_PATH / "dataset" / "census_acs_2010" / "usa.csv" | ||
) | ||
return pd.read_csv( | ||
census_2010_csv, | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def hrs_df(): | ||
hrs_csv = constants.DATA_PATH / "dataset" / "historic_redlining" / "usa.csv" | ||
|
||
return pd.read_csv( | ||
hrs_csv, | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
) | ||
|
||
|
||
@pytest.fixture() | ||
def national_tract_df(): | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We should add in the datasets that have been added since then, right? Like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yup! Added tribal overlap in #1904 since that's where it actually has output. |
||
national_tract_csv = constants.DATA_CENSUS_CSV_FILE_PATH | ||
return pd.read_csv( | ||
national_tract_csv, | ||
names=[GEOID_TRACT_FIELD_NAME], | ||
dtype={GEOID_TRACT_FIELD_NAME: "string"}, | ||
low_memory=False, | ||
header=None, | ||
) |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -28,7 +28,6 @@ def full_percentile_column_name(self): | |
return self.percentile_column_name | ||
|
||
|
||
### TODO: we need to blow this out for all eight categories | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. thank you for the cleanup! |
||
def _check_percentile_against_threshold(df, config: PercentileTestConfig): | ||
"""Note - for the purpose of testing, this fills with False""" | ||
is_minimum_flagged_ok = ( | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you!