Adds number of <18, 18-64, >64 people per household sorted by head age and income bracket, using PSID and CPS data #37
Update 3/15
Returns the smoothed number of <18, 18-64, >64 by age/income
Updating fork
Codecov Report
@@ Coverage Diff @@
## master #37 +/- ##
=======================================
Coverage 63.13% 63.13%
=======================================
Files 8 8
Lines 1188 1188
=======================================
Hits 750 750
Misses 438 438
Thanks @prrathi, here are some other suggestions besides the code suggestion as we went through today:
- Run black code formatting
- Move csv outputs into a new folder, e.g. ogusa_calibrate/outputs/household_calibration/, and subfolders here for csv and images (though @rickecon and @jdebacker not sure how you want these produced, or even if they should be part of the repo rather than just being functions that could be called to produce them)
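A minimal sketch of the folder layout suggested above, using only the standard library. The helper name make_output_dirs is hypothetical and not part of the PR; the root path in the usage comment is the one proposed in the comment.

```python
from pathlib import Path


def make_output_dirs(root):
    """Create csv/ and images/ subfolders under root and return their paths.

    Hypothetical helper for illustration; not part of the PR.
    """
    root = Path(root)
    dirs = {name: root / name for name in ("csv", "images")}
    for path in dirs.values():
        # parents=True builds intermediate folders; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
    return dirs


# Example usage (assumes a DataFrame named cps already exists):
# dirs = make_output_dirs("ogusa_calibrate/outputs/household_calibration")
# cps.to_csv(dirs["csv"] / "cps.csv", index=False)
```

Returning the paths keeps the csv and image destinations in one place, so the scripts do not need to repeat path strings.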
import microdf as mdf
import matplotlib.pyplot as plt
import statsmodels.api as sm
lowess = sm.nonparametric.lowess
I'd drop this and refer to it directly for clarity (should only need to be referenced once)
# from taxcalc output.
])

def add65(age_spouse):
See suggestions in #30 on replacing these functions and lambdas with one-liner vectorized functions
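As a hedged illustration of that kind of replacement: the body of add65 is not shown in this diff, so the row-wise version in the comment below is an assumption about what it does (flag a spouse aged 65 or older). The data and the output column names spouse65 and spouse1864 are made up for the example.

```python
import pandas as pd

# Toy data; only age_head and age_spouse appear in the PR, the values are made up.
df = pd.DataFrame({"age_head": [30, 70, 66], "age_spouse": [28, 66, 40]})

# Assumed row-wise pattern being replaced:
# df["spouse65"] = df["age_spouse"].apply(lambda a: 1 if a >= 65 else 0)

# Vectorized one-liners: compare the whole column at once, then cast to int.
df["spouse65"] = (df["age_spouse"] >= 65).astype(int)
df["spouse1864"] = df["age_spouse"].between(18, 64).astype(int)
```

Series.between is inclusive on both ends by default, which matches the 18 to 64 grouping used elsewhere in this PR.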
cps2.reset_index(inplace = True)
cps2[cps2['age_head'].between(20,80)]

smoothed18 = []
Suggested replacement for the remainder of the script (see colab):
def smooth(x, y, frac=0.4):
    """ Produces LOESS smoothed data.
    """
    return pd.Series(lowess(y, x, frac=frac)[:, 1], index=x)


def smooth_all(data):
    """ Return smoothed versions of nu18, n1864, n65.
    """
    return data.groupby(["income_bin", "age_group"]).apply(
        lambda x: smooth(x.age_head, x.n)
    )


cps_long = cps4.drop(columns="index").melt(
    ["age_head", "income_bin"], var_name="age_group", value_name="n"
)
smoothed_wide = smooth_all(cps_long).reset_index()
smoothed_long = smoothed_wide.melt(
    ["age_group", "income_bin"], var_name="age_head", value_name="n"
)

# Stack with raw.
cps_long["smoothed"] = False
smoothed_long["smoothed"] = True
combined_long = pd.concat([cps_long, smoothed_long])

# Add the household head. NB: age_head starts at 20 so no need to do for nu18.
combined_long["add_head"] = (
    # n1864 and head age between 18 and 64.
    ((combined_long.age_group == "n1864") &
     combined_long.age_head.between(18, 64)) |
    # n65 and head age exceeds 64.
    ((combined_long.age_group == "n65") &
     (combined_long.age_head > 64)))
combined_long.n += combined_long.add_head


def plot(data):
    """ Produces and exports a plot of household size by age_head, with lines for each income bin.
    The title and filename reflect the age group and whether the data is smoothed based on the first record.
    """
    age_group = data.age_group.iloc[0]
    smoothed = data.smoothed.iloc[0]
    title = "Average number of people aged "
    # TODO: Add folder.
    fname = "cps_" + age_group
    if age_group == "nu18":
        title += "0 to 17"
    elif age_group == "n1864":
        title += "18 to 64"
    else:
        title += "65 or older"
    if smoothed:
        title += " (smoothed)"
        fname += "_smoothed"
    # Pivot the group passed in so each income bin becomes one line.
    data.pivot_table("n", "age_head", "income_bin").plot()
    plt.title(title)
    plt.savefig(fname + ".png")


# Create and export all plots.
combined_long.groupby(["age_group", "smoothed"]).apply(plot)
I'd stack the PSID data with this too and then add data_source as a groupby everywhere to minimize the code. Then just export combined_long to a csv.
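One way the stacking idea might look, as a sketch with toy data. The frame psid_long and the data_source values "cps" and "psid" are assumptions for illustration; only the column names age_head, income_bin, age_group, n, and data_source come from this discussion.

```python
import pandas as pd

# Toy long-format frames standing in for the real CPS and PSID data.
cps_long = pd.DataFrame({
    "age_head": [20, 21],
    "income_bin": [1, 1],
    "age_group": ["nu18", "nu18"],
    "n": [1.0, 2.0],
})
psid_long = pd.DataFrame({
    "age_head": [20, 21],
    "income_bin": [1, 1],
    "age_group": ["nu18", "nu18"],
    "n": [0.5, 1.5],
})

# Tag each frame with its source, then stack into a single frame.
combined_long = pd.concat(
    [cps_long.assign(data_source="cps"), psid_long.assign(data_source="psid")],
    ignore_index=True,
)

# data_source becomes one more groupby key, so both datasets share one code path.
means = combined_long.groupby(["data_source", "age_group"])["n"].mean()

# Export for sharing (left commented out so the sketch writes no files):
# combined_long.to_csv("combined_long.csv", index=False)
```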
I would suggest that the csv and image files not be a part of this repo, but it'd be good to share useful images in this discussion. BTW, here's a study talking about creating tax units from the PSID.
This PR has been superseded by PR #39. Closing.
The psidhousehold.py and cpshousehold.py files added to the ogusa_calibrate folder contain the scripts that read data (from psid_data_setup.py for PSID and PSL's cps dataset for CPS) and output csv files and images for each one. Per suggestion by @rickecon and @MaxGhenis, the psid.csv and cps.csv files are in their respective folders within ogusa_calibrate/data. Each file contains, ordered by head age and income bracket, the average number of people in each age group, first as originally computed and then after smoothing. The images depict these transformations for each age group of each data type and are output to ogusa_calibrate/data/images.