Add tests to make sure each source makes it to the score correctly #1878

data/data-pipeline/data_pipeline/etl/score/etl_score.py (13 changes: 0 additions & 13 deletions)

@@ -42,7 +42,6 @@ def __init__(self):
        self.doe_energy_burden_df: pd.DataFrame
        self.national_risk_index_df: pd.DataFrame
        self.geocorr_urban_rural_df: pd.DataFrame
-       self.persistent_poverty_df: pd.DataFrame
        self.census_decennial_df: pd.DataFrame
        self.census_2010_df: pd.DataFrame
        self.national_tract_df: pd.DataFrame
@@ -159,16 +158,6 @@ def extract(self) -> None:
            low_memory=False,
        )

-       # Load persistent poverty
Contributor: thank you!

-       persistent_poverty_csv = (
-           constants.DATA_PATH / "dataset" / "persistent_poverty" / "usa.csv"
-       )
-       self.persistent_poverty_df = pd.read_csv(
-           persistent_poverty_csv,
-           dtype={self.GEOID_TRACT_FIELD_NAME: "string"},
-           low_memory=False,
-       )

        # Load decennial census data
        census_decennial_csv = (
            constants.DATA_PATH
@@ -359,7 +348,6 @@ def _prepare_initial_df(self) -> pd.DataFrame:
            self.doe_energy_burden_df,
            self.ejscreen_df,
            self.geocorr_urban_rural_df,
-           self.persistent_poverty_df,
            self.national_risk_index_df,
            self.census_acs_median_incomes_df,
            self.census_decennial_df,
@@ -484,7 +472,6 @@ def _prepare_initial_df(self) -> pd.DataFrame:

        non_numeric_columns = [
            self.GEOID_TRACT_FIELD_NAME,
-           field_names.PERSISTENT_POVERTY_FIELD,
            field_names.TRACT_ELIGIBLE_FOR_NONNATURAL_THRESHOLD,
            field_names.AGRICULTURAL_VALUE_BOOL_FIELD,
        ]
data/data-pipeline/data_pipeline/tests/score/fixtures.py (208 changes: 207 additions & 1 deletion)

@@ -2,12 +2,218 @@
import pytest
from data_pipeline.config import settings
from data_pipeline.score import field_names
from data_pipeline.etl.score import constants

GEOID_TRACT_FIELD_NAME = field_names.GEOID_TRACT_FIELD
Contributor: An extremely annoying nit: we set this constant here, and then lower down we use field_names.GEOID_TRACT_FIELD anyway.

Contributor (Author): thank you.



@pytest.fixture(scope="session")
def final_score_df():
    return pd.read_csv(
        settings.APP_ROOT / "data" / "score" / "csv" / "full" / "usa.csv",
Contributor: Question: will this test fail until the user has run the full score pipeline at least once on their local?

If so, that could potentially make it hard for new developers who are just "opening the box" on the CEJST.

Contributor (Author): It will, but any tests that use it are marked as smoketest and skipped by default. I should add something to the README, though, so I will do so.

-       dtype={field_names.GEOID_TRACT_FIELD: str},
+       dtype={GEOID_TRACT_FIELD_NAME: str},
        low_memory=False,
    )
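A side note on the smoketest marker mentioned in the thread above: the snippet below is a minimal sketch of how such a marker is commonly wired up in pytest so that marked tests are skipped unless explicitly requested. The option name, marker name, and conftest.py contents here are assumptions for illustration, not code from this repository.

```python
# conftest.py (sketch; assumed names, not this repo's actual wiring)
import pytest


def pytest_addoption(parser):
    parser.addoption(
        "--run-smoketests",
        action="store_true",
        default=False,
        help="run tests that need a full local score pipeline run",
    )


def pytest_collection_modifyitems(config, items):
    # By default, skip anything marked smoketest; opt in via the flag above.
    if config.getoption("--run-smoketests"):
        return
    skip_smoke = pytest.mark.skip(reason="requires score pipeline output")
    for item in items:
        if "smoketest" in item.keywords:
            item.add_marker(skip_smoke)
```

With wiring like this, `pytest --run-smoketests` would run the full suite once the pipeline output exists locally.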


@pytest.fixture()
def census_df():
    census_csv = constants.DATA_PATH / "dataset" / "census_acs_2019" / "usa.csv"
    return pd.read_csv(
        census_csv,
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def ejscreen_df():
    ejscreen_csv = constants.DATA_PATH / "dataset" / "ejscreen" / "usa.csv"
    return pd.read_csv(
        ejscreen_csv,
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def hud_housing_df():
    hud_housing_csv = (
        constants.DATA_PATH / "dataset" / "hud_housing" / "usa.csv"
    )
Comment on lines +38 to +41:

Contributor: Idea: why rewrite the paths for each dataset when there's a class variable for it here? I'm mostly thinking about keeping it DRY.

Contributor: (This would apply to all the dfs here.)

Contributor (Author): They're all instance variables, and it just didn't make sense to me to import and instantiate all those classes just to get the paths. I tried to factor them up as classvars, but they rely on a method, so it didn't make a ton of sense. So in this case, I figured keeping the coupling lower and relying on a pretty thorough convention was worth the repetition. Open to arguments otherwise, but that was my logic.

Contributor: Makes sense, Matt, thanks! I've seen this repetition pattern in fixtures often, to totally decouple things.

Contributor (Author): 🙇‍♂️

    return pd.read_csv(
        hud_housing_csv,
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )
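On the DRY thread above: if the repetition were ever worth collapsing without coupling to the ETL classes, a small loader helper is one option. The helper below is hypothetical and not part of this PR; it just factors out the path convention the fixtures already rely on.

```python
# Hypothetical helper (not in this PR): factors out the shared
# DATA_PATH / "dataset" / <name> / "usa.csv" convention.
import pandas as pd
from data_pipeline.etl.score import constants
from data_pipeline.score import field_names


def _load_dataset_csv(name: str) -> pd.DataFrame:
    csv_path = constants.DATA_PATH / "dataset" / name / "usa.csv"
    return pd.read_csv(
        csv_path,
        dtype={field_names.GEOID_TRACT_FIELD: "string"},
        low_memory=False,
    )


# A fixture would then reduce to, e.g.:
# @pytest.fixture()
# def cdc_places_df():
#     return _load_dataset_csv("cdc_places")
```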


@pytest.fixture()
def cdc_places_df():
    cdc_places_csv = constants.DATA_PATH / "dataset" / "cdc_places" / "usa.csv"
    return pd.read_csv(
        cdc_places_csv,
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def census_acs_median_incomes_df():
    census_acs_median_incomes_csv = (
        constants.DATA_PATH
        / "dataset"
        / "census_acs_median_income_2019"
        / "usa.csv"
    )
    return pd.read_csv(
        census_acs_median_incomes_csv,
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def cdc_life_expectancy_df():
    cdc_life_expectancy_csv = (
        constants.DATA_PATH / "dataset" / "cdc_life_expectancy" / "usa.csv"
    )
    return pd.read_csv(
        cdc_life_expectancy_csv,
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def doe_energy_burden_df():
    doe_energy_burden_csv = (
        constants.DATA_PATH / "dataset" / "doe_energy_burden" / "usa.csv"
    )
    return pd.read_csv(
        doe_energy_burden_csv,
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def national_risk_index_df():
    return pd.read_csv(
        constants.DATA_PATH / "dataset" / "national_risk_index" / "usa.csv",
Contributor: @mattbowen-usds a design question: for some of these, we have the class method to get the df. I assumed that loading the df fresh was helpful because it gets the most final possible output, but I'm wondering if that's the right assumption here?

Contributor (Author): I mostly just didn't want to couple to all those classes, since the overall design of the system assumes these CSVs exist. I'm not deeply committed to not using the classes; I just try to import as little of the system under test in my fixtures as I can, to prevent bugs there from crashing or otherwise harming my tests.

Contributor: Maybe this is related to the question I posted, and makes sense.

        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def dot_travel_disadvantage_df():
    return pd.read_csv(
        constants.DATA_PATH / "dataset" / "travel_composite" / "usa.csv",
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def fsf_fire_df():
    return pd.read_csv(
        constants.DATA_PATH / "dataset" / "fsf_wildfire_risk" / "usa.csv",
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def fsf_flood_df():
    return pd.read_csv(
        constants.DATA_PATH / "dataset" / "fsf_flood_risk" / "usa.csv",
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def nature_deprived_df():
    return pd.read_csv(
        constants.DATA_PATH / "dataset" / "nlcd_nature_deprived" / "usa.csv",
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def eamlis_df():
    return pd.read_csv(
        constants.DATA_PATH / "dataset" / "eamlis" / "usa.csv",
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def fuds_df():
    return pd.read_csv(
        constants.DATA_PATH / "dataset" / "us_army_fuds" / "usa.csv",
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def geocorr_urban_rural_df():
    geocorr_urban_rural_csv = (
        constants.DATA_PATH / "dataset" / "geocorr" / "usa.csv"
    )
    return pd.read_csv(
        geocorr_urban_rural_csv,
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def census_decennial_df():
    census_decennial_csv = (
        constants.DATA_PATH / "dataset" / "census_decennial_2010" / "usa.csv"
    )
    return pd.read_csv(
        census_decennial_csv,
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def census_2010_df():
    census_2010_csv = (
        constants.DATA_PATH / "dataset" / "census_acs_2010" / "usa.csv"
    )
    return pd.read_csv(
        census_2010_csv,
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def hrs_df():
    hrs_csv = constants.DATA_PATH / "dataset" / "historic_redlining" / "usa.csv"

    return pd.read_csv(
        hrs_csv,
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
    )


@pytest.fixture()
def national_tract_df():
Contributor: We should add in the datasets that have been added since then, right? Like tribal_overlap.

Contributor (Author): Yup! Added tribal overlap in #1904, since that's where it actually has output.

    national_tract_csv = constants.DATA_CENSUS_CSV_FILE_PATH
    return pd.read_csv(
        national_tract_csv,
        names=[GEOID_TRACT_FIELD_NAME],
        dtype={GEOID_TRACT_FIELD_NAME: "string"},
        low_memory=False,
        header=None,
    )
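For context on how these fixtures get used, here is a sketch of the kind of per-source check the PR title describes: each source's tracts should survive the merge into the final score. The test name, marker, and assertion are illustrative; the PR's actual tests may check more (for example, column presence and value equality).

```python
# Illustrative sketch only; the real tests in this PR may differ.
import pytest
from data_pipeline.score import field_names


@pytest.mark.smoketest  # assumed marker; needs a full local score run
def test_source_tracts_reach_final_score(final_score_df, census_df):
    source_tracts = set(census_df[field_names.GEOID_TRACT_FIELD])
    score_tracts = set(final_score_df[field_names.GEOID_TRACT_FIELD])
    # Every tract present in the source should appear in the final score.
    assert source_tracts <= score_tracts
```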
@@ -28,7 +28,6 @@ def full_percentile_column_name(self):
        return self.percentile_column_name


-### TODO: we need to blow this out for all eight categories

Contributor: thank you for the cleanup!

def _check_percentile_against_threshold(df, config: PercentileTestConfig):
"""Note - for the purpose of testing, this fills with False"""
is_minimum_flagged_ok = (