Adds number of <18, 18-64, >64 people per household sorted by head age and income bracket, using PSID and CPS data #37

Closed
wants to merge 19 commits into from

Contributor

@prrathi prrathi commented May 6, 2021

The psidhousehold.py and cpshousehold.py scripts added to the ogusa_calibrate folder read the household data (from psid_data_setup.py for PSID, and from PSL's CPS dataset for CPS) and output csv files and images for each source. Per the suggestion by @rickecon and @MaxGhenis, the psid.csv and cps.csv files are in their respective folders within ogusa_calibrate/data. Ordered by head age and income bracket, they contain the average number of people in each age group, first as originally computed and then after smoothing. The images depict these transformations for each age group and each data source, and are written to ogusa_calibrate/data/images.

@codecov-commenter

codecov-commenter commented May 6, 2021

Codecov Report

Merging #37 (f70a4f3) into master (93cab3d) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master      #37   +/-   ##
=======================================
  Coverage   63.13%   63.13%           
=======================================
  Files           8        8           
  Lines        1188     1188           
=======================================
  Hits          750      750           
  Misses        438      438           
Flag Coverage Δ
unittests 63.13% <ø> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

@MaxGhenis MaxGhenis left a comment

Thanks @prrathi. In addition to the code suggestion we went through today, here are some other suggestions:

  • Run black code formatting
  • Move csv outputs into a new folder, e.g. ogusa_calibrate/outputs/household_calibration/, with subfolders for csv and images (though @rickecon and @jdebacker, I'm not sure how you want these produced, or whether they should be part of the repo at all rather than produced by functions that can be called on demand)

import microdf as mdf
import matplotlib.pyplot as plt
import statsmodels.api as sm
lowess = sm.nonparametric.lowess

I'd drop this and refer to it directly for clarity (should only need to be referenced once)

# from taxcalc output.
])

def add65(age_spouse):

See suggestions in #30 on replacing these functions and lambdas with one-liner vectorized functions
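
As a rough illustration of what such a vectorized one-liner could look like, the sketch below counts household members aged 65 or older. The body of add65 is not shown in this thread, so the column names (age_head, age_spouse), the sample values, and the counting rule are all assumptions made for the example, not the PR's actual logic.

```python
import numpy as np
import pandas as pd

# Hypothetical household records; column names and values are assumed
# for illustration only.
households = pd.DataFrame(
    {
        "age_head": [34, 67, 70, 52],
        "age_spouse": [31, 66, np.nan, 65],  # NaN means no spouse present
    }
)

# Row-wise style that the review suggests replacing (assumed shape):
# households["n65"] = households.apply(
#     lambda row: int(row.age_head >= 65) + int(row.age_spouse >= 65), axis=1
# )

# Vectorized one-liner: compare whole columns at once and sum the booleans.
# Comparisons against NaN evaluate to False, so a missing spouse adds nothing.
households["n65"] = (households[["age_head", "age_spouse"]] >= 65).sum(axis=1)
```

The same pattern extends to the other age groups by swapping the comparison, for example `.between(18, 64)` applied per column.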

cps2.reset_index(inplace = True)
cps2[cps2['age_head'].between(20,80)]

smoothed18 = []

@MaxGhenis MaxGhenis May 7, 2021

Suggested replacement for the remainder of the script (see colab):

def smooth(x, y, frac=0.4):
    """ Produces LOESS smoothed data.
    """
    return pd.Series(lowess(y, x, frac=frac)[:, 1], index=x)


def smooth_all(data):
    """ Return smoothed versions of nu18, n1864, n65.
    """
    return data.groupby(["income_bin", "age_group"]).apply(
        lambda x: smooth(x.age_head, x.n)
    )

cps_long = cps4.drop(columns="index").melt(["age_head", "income_bin"], var_name="age_group", value_name="n")
smoothed_wide = smooth_all(cps_long).reset_index()
smoothed_long = smoothed_wide.melt(["age_group", "income_bin"], var_name="age_head", value_name="n")

# Stack with raw.
cps_long["smoothed"] = False
smoothed_long["smoothed"] = True
combined_long = pd.concat([cps_long, smoothed_long])

# Add the household head. NB: age_head starts at 20 so no need to do for nu18.
combined_long["add_head"] = (
    # n1864 and head age between 18 and 64.
    ((combined_long.age_group == "n1864") &
     combined_long.age_head.between(18, 64)) | 
    # n65 and head age exceeds 64.
    ((combined_long.age_group == "n65") &
     (combined_long.age_head > 64)))
combined_long.n += combined_long.add_head

def plot(data):
    """ Produces and exports a plot of household size by age_head, with lines for each income bin.
        The title and filename reflect the age group and whether the data is smoothed based on the first record.
    """
    age_group = data.age_group.iloc[0]
    smoothed = data.smoothed.iloc[0]
    title = "Average number of people aged "
    # TODO: Add folder.
    fname = "cps_" + age_group
    if age_group == "nu18":
        title += "0 to 17"
    elif age_group == "n1864":
        title += "18 to 64"
    else:
        title += "65 or older"
    if smoothed:
        title += " (smoothed)"
        fname += "_smoothed"
    # Pivot the group passed in (one line per income bin) and plot it.
    data.pivot_table(values="n", index="age_head", columns="income_bin").plot()
    plt.title(title)
    plt.savefig(fname + ".png")

# Create and export all plots.
combined_long.groupby(["age_group", "smoothed"]).apply(plot)

I'd stack the PSID data with this too and then add data_source as a groupby everywhere to minimize the code. Then just export combined_long to a csv.
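
A minimal sketch of that stacking step, assuming both sources have already been reshaped into the same long format as combined_long. The frame names cps_combined and psid_combined, the sample values, and the output file name are hypothetical:

```python
import pandas as pd

# Hypothetical long-format frames with the same columns as combined_long;
# the values are made up purely to show the reshaping.
cps_combined = pd.DataFrame(
    {
        "age_head": [30, 30, 31, 31],
        "income_bin": [1, 1, 1, 1],
        "age_group": ["nu18", "n1864", "nu18", "n1864"],
        "smoothed": [False, False, False, False],
        "n": [1.2, 1.9, 1.3, 1.8],
    }
)
psid_combined = cps_combined.assign(n=[1.0, 2.0, 1.1, 2.1])

# Stack both sources, tagging each row with where it came from.
stacked = pd.concat(
    [
        cps_combined.assign(data_source="cps"),
        psid_combined.assign(data_source="psid"),
    ],
    ignore_index=True,
)

# Any per-group step now just adds data_source to the groupby keys.
group_sizes = stacked.groupby(["data_source", "age_group", "smoothed"]).size()

# A single export covers both sources (the file name is an assumption).
# stacked.to_csv("household_composition.csv", index=False)
```

With data_source as a column, the smoothing and plotting functions would not need per-source copies; they would only need data_source added to their groupby keys and, for plots, to the title and file name.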

@jdebacker
Member

I would suggest that the csv and image files not be a part of this repo, but it'd be good to share useful images in this discussion.

BTW, here's a study talking about creating tax units from the PSID.

@rickecon
Member

rickecon commented Jun 5, 2021

This PR has been superseded by PR #39. Closing.

@rickecon rickecon closed this Jun 5, 2021
5 participants