You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Is your feature request related to a problem? Please describe.
In #3531 we've learned that the addfips package we rely on to map state/county data into a FIPS code is irregularly maintained. This is causing problems for downstream users, and we should consider deprecating this package and moving to a different solution.
Describe the solution you'd like
There are a few different use cases for FIPS encoding that we could hypothetically imagine for PUDL.
Right now, add_fips_ids in pudl.src.helpers.py is the only place we import the addfips package. This method is expects a county and state name, and a FIPS vintage:
def add_fips_ids(
df: pd.DataFrame,
state_col: str = "state",
county_col: str = "county",
vintage: int = 2015,
) -> pd.DataFrame:
"""Add State and County FIPS IDs to a dataframe.
To just add State FIPS IDs, make county_col = None.
"""
# force the columns to be the nullable string types so we have a consistent
# null value to filter out before feeding to addfips
df = df.astype({state_col: pd.StringDtype()})
if county_col:
df = df.astype({county_col: pd.StringDtype()})
af = addfips.AddFIPS(vintage=vintage)
# Lookup the state and county FIPS IDs and add them to the dataframe:
df["state_id_fips"] = df.apply(
lambda x: (
af.get_state_fips(state=x[state_col]) if pd.notnull(x[state_col]) else pd.NA
),
axis=1,
)
# force the code columns to be nullable strings - the leading zeros are
# important
df = df.astype({"state_id_fips": pd.StringDtype()})
logger.info(
f"Assigned state FIPS codes for "
f"{len(df[df.state_id_fips.notnull()])/len(df):.2%} of records."
)
if county_col:
df["county_id_fips"] = df.apply(
lambda x: (
af.get_county_fips(state=x[state_col], county=x[county_col])
if pd.notnull(x[county_col]) and pd.notnull(x[state_col])
else pd.NA
),
axis=1,
)
# force the code columns to be nullable strings - the leading zeros are
# important
df = df.astype({"county_id_fips": pd.StringDtype()})
logger.info(
f"Assigned county FIPS codes for "
f"{len(df[df.county_id_fips.notnull()])/len(df):.2%} of records."
)
return df
Approaches to resolving the problem
County and state names -> FIPS Code
This is our current FIPS encoding use case. We could:
create a fork of addfips, and implement the fixes we want to see that solve the problem identified in Incorrect county FIPS code for Bedford, VA #3531. We'd relatively easily benefit from any upstream changes, and if the project returns to being more actively maintained it wouldn't be particularly complex to switch back.
write our own method to convert state/county codes to FIPS codes drawing on the addfips methodology. As is noted in Add FIPS codes to power plant database #338 and below, the slight string cleaning implemented in addfips is definitely of relevance to us, so we'd want to replicate it at minimum but it shouldn't require heavy lifting to recreate.
explore a maximalist version, which involves encoding FIPS from addresses or lat/lon coordinates rather than county/state. However, this function is currently used for EIA 860, 861 and 923 data, and at least 861 and 923 data don't have complete addresses or lat/lon coordinates, so this likely won't meet our needs. The pygrispackage seems like a great API-free option for implementing this functionality down the line, though!
Address -> Lat lon -> FIPS Code or lat/lon -> FIPS code
As mentioned above, this wouldn't address the use case for 2/3 EIA datasets we're currently using the addfips package for. See pygrispackage, which seems interesting if we want to explore this further, but this is out of scope for the actual problem we're trying to resolve. See also #3531 for some discussion on other ways to implement this.
Proposed solutions
Fork addfips and implement our desired changes, then do our own release so we can import this as a package.
Pros: We don't have to write much code, and we can keep up to date with any changes made in addfips.
Cons: Extremely annoying and we're likely to forget about this. We're still relying on a largely unmaintained package.
Recreate the logic of addfips in pudl.helpers.py using Census data directly (e.g., this 2023 data or by calling the Census's geometry API, see [here](https://www.census.gov/data/developers/data-sets/geo-info.html) for more info).
Pros: We are our own maintainers.
Cons: We have to recreate the fuzzy-matching logic in the package we're currently using to prevent performance loss.
Find some other new Python package to meet our needs.
The cardinal sin of addfips is to not maintain the source census data or handle the slight changes over time, which is the only thing that needs maintenance! But I'll stop there because I think I've written more lines complaining about addfips than there are lines of code in the entire package 😂
Is your feature request related to a problem? Please describe.
In #3531 we've learned that the
addfips
package we rely on to map state/county data into a FIPS code is irregularly maintained. This is causing problems for downstream users, and we should consider deprecating this package and moving to a different solution.Describe the solution you'd like
There are a few different use cases for FIPS encoding that we could hypothetically imagine for PUDL.
Right now,
add_fips_ids
inpudl.src.helpers.py
is the only place we import theaddfips
package. This method is expects a county and state name, and a FIPS vintage:Approaches to resolving the problem
County and state names -> FIPS Code
This is our current FIPS encoding use case. We could:
addfips
, and implement the fixes we want to see that solve the problem identified in Incorrect county FIPS code for Bedford, VA #3531. We'd relatively easily benefit from any upstream changes, and if the project returns to being more actively maintained it wouldn't be particularly complex to switch back.addfips
methodology. As is noted in Add FIPS codes to power plant database #338 and below, the slight string cleaning implemented inaddfips
is definitely of relevance to us, so we'd want to replicate it at minimum but it shouldn't require heavy lifting to recreate.pygris
package seems like a great API-free option for implementing this functionality down the line, though!Address -> Lat lon -> FIPS Code or lat/lon -> FIPS code
As mentioned above, this wouldn't address the use case for 2/3 EIA datasets we're currently using the
addfips
package for. Seepygris
package, which seems interesting if we want to explore this further, but this is out of scope for the actual problem we're trying to resolve. See also #3531 for some discussion on other ways to implement this.Proposed solutions
Fork
addfips
and implement our desired changes, then do our own release so we can import this as a package.Pros: We don't have to write much code, and we can keep up to date with any changes made in
addfips
.Cons: Extremely annoying and we're likely to forget about this. We're still relying on a largely unmaintained package.
Recreate the logic of
addfips
inpudl.helpers.py
using Census data directly (e.g., this 2023 data or by calling the Census's geometry API, see[here](https://www.census.gov/data/developers/data-sets/geo-info.html)
for more info).Pros: We are our own maintainers.
Cons: We have to recreate the fuzzy-matching logic in the package we're currently using to prevent performance loss.
Find some other new Python package to meet our needs.
us
is maintained but doesn't yet support counties despite an open PR on this topic from 2017: https://github.com/unitedstates/python-us/. Thecensus
package depends on it, so there's no county support there either.Tasks
The text was updated successfully, but these errors were encountered: