
Adds household_structure.py with smoothed averages of number of <18, 18-64, and 65+ people per family sorted by head age and income bracket #30

Closed
wants to merge 19 commits

Conversation


@prrathi prrathi commented Mar 15, 2021

This code first determines the average number of people aged <18, 18-64, and 65+ per household by head age and income bracket, then uses a KDE for smoothing. The outputs are currently saved as NumPy arrays titled nu18, n1864, and n65. For example, nu18 is a two-dimensional array whose rows are head ages from 20 to 80 and whose columns are the income brackets used throughout OG-USA; the value of each cell is the smoothed average number of people under 18 in a household with that head age and income bracket.
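For reference, a minimal sketch of how the saved arrays could be loaded and inspected (the .npy extension comes from np.save's default; the row-to-age mapping shown here is an assumption):

import numpy as np

# Load the smoothed household composition arrays saved by household_structure.py.
nu18 = np.load('nu18.npy')    # average people under 18 per household
n1864 = np.load('n1864.npy')  # average people aged 18-64 per household
n65 = np.load('n65.npy')      # average people 65+ per household

print(nu18.shape)  # rows = head ages (20 to 80), columns = OG-USA income brackets

# Hypothetical lookup: household with a 35-year-old head in the third income bracket,
# assuming row 0 corresponds to head age 20.
print(nu18[35 - 20, 2])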

Per line 16 in the code, the distribution, particularly the number of people aged 18-64, depends on the new variable num_family that is pulled in from the R dataset. Because all the variables from the R dataset are carried through and saved in the pickle output by psid_data_setup.py, num_family would be included in this data. For testing, however, I assumed there to be 4 people in every household, so line 16 read panel_li.insert(len(panel_li.columns),"num_family",4). Here are the results of the smoothing for each of <18, 18-64, and 65+ under this assumption: Results.zip. There were definitely some irregular results; here are a few observations for the nu18 array:

  • 142 of the 7*60=420 total results had a difference between the smoothed and actual values greater than 1 in magnitude
  • 15 of the smoothed values were extreme, at least 5 more than the averages from the data, like 7 or 8 people under 18 for households of that type
  • there were significantly more values close to 0 than expected, so overall the smoothing produced a lot more extreme values

Again, the nu18 array wasn't affected by the assumed 4 people per household; these results are purely the product of smoothing. On the topic of smoothing, I used the same function that was suggested for consolidation in #25. Looking forward to everyone's thoughts!
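A rough sketch of how the comparison counts above could be reproduced, assuming the pre- and post-smoothing averages are available as same-shaped arrays named nu18init and nu18final (the names used for the CSVs attached later in the thread):

import numpy as np

# nu18init: raw averages from the PSID panel; nu18final: KDE-smoothed values.
# Both assumed to be head-age-by-income-bracket arrays (60 x 7 = 420 cells).
diff = nu18final - nu18init

# Cells where smoothing moved the value by more than 1 person (reported above as 142 of 420).
print(np.sum(np.abs(diff) > 1))

# Cells where the smoothed value is at least 5 above the raw average (reported above as 15).
print(np.sum(diff >= 5))

# Share of smoothed cells near zero; the 0.1 cutoff here is illustrative only.
print(np.mean(nu18final < 0.1))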

@prrathi prrathi changed the title Adds household_structure.py outputting the smoothed averages of number of <18, 18-64, and 65+ year olds per family sorted by head age and income bracket Adds household_structure.py with smoothed averages of number of <18, 18-64, and 65+ people per family sorted by head age and income bracket Mar 15, 2021

@MaxGhenis MaxGhenis left a comment


Thanks @prrathi, I left some suggestions to simplify the code. If you can drop the files in as CSVs, that would help diagnose the issues you brought up.

I think a Jupyter notebook visualizing the raw and smoothed values would also be helpful. For example, a plot of nu18 by head_age, with dots for actuals and a line for the smoothed value, and the same by lifetime income bucket.
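A rough sketch of such a plot, with nu18init and nu18final standing in for the raw and smoothed head-age-by-income-bracket arrays (names assumed):

import numpy as np
import matplotlib.pyplot as plt

head_ages = np.arange(20, 20 + nu18init.shape[0])  # assuming row 0 is head age 20

# Average across income brackets so each head age has one actual and one smoothed value.
plt.scatter(head_ages, nu18init.mean(axis=1), s=12, label='actual')
plt.plot(head_ages, nu18final.mean(axis=1), label='smoothed')
plt.xlabel('head_age')
plt.ylabel('average people under 18 (nu18)')
plt.legend()
plt.title('Raw vs. smoothed nu18 by head age')
plt.show()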

    if(spouse_age>=65):
        count += 1
    return count #assumes only head or spouse of head can be 65+
panel_li['n65'] = panel_li.apply(lambda x: add65(x['head_age'],x['spouse_age']), axis=1)


Suggested change
panel_li['n65'] = panel_li.apply(lambda x: add65(x['head_age'],x['spouse_age']), axis=1)
panel_li['n65'] = np.where(panel_li.head_age > 64, 1, 0) + np.where(panel_li.spouse_age > 64, 1, 0)

can replace the add65 function


panel_li = pickle.load(open('psid_lifetime_income.pkl', 'rb')) #created by psid_data_setup.py

panel_li.insert(len(panel_li.columns),"weight",1) #create column of only 1s which is used as weights for taking microdf average

@MaxGhenis MaxGhenis Mar 15, 2021


microdf doesn't require weights; you can leave the weights argument empty in any function to have it be unweighted (or, for this file, skip the microdf import entirely since it's unnecessary).

@@ -0,0 +1,146 @@



remove these empty lines

    return count
panel_li['n1864'] = panel_li.apply(lambda x: add1864(x['head_age'],x['spouse_age'],x['num_family'],x['num_children_under18']), axis=1)

panel_li['nu18'] = panel_li.apply(lambda x: x['num_children_under18'], axis=1) #assumes only children can be <18


Suggested change
panel_li['nu18'] = panel_li.apply(lambda x: x['num_children_under18'], axis=1) #assumes only children can be <18
panel_li['nu18'] = panel_li.num_children_under18

    if(spouse_age>=65):
        count -= 1
    return count
panel_li['n1864'] = panel_li.apply(lambda x: add1864(x['head_age'],x['spouse_age'],x['num_family'],x['num_children_under18']), axis=1)


Suggested change
panel_li['n1864'] = panel_li.apply(lambda x: add1864(x['head_age'],x['spouse_age'],x['num_family'],x['num_children_under18']), axis=1)
panel_li['n1864'] = panel_li.num_family - panel_li.n65 - panel_li.nu18

after moving the nu18 line above this, or just use num_children_under18


panel_li['nu18'] = panel_li.apply(lambda x: x['num_children_under18'], axis=1) #assumes only children can be <18

panel_li2 = panel_li.reset_index()


Suggested change
panel_li2 = panel_li.reset_index()
panel_li.reset_index(inplace=True)
panel_li_20_80 = panel_li[panel_li.head_age.between(20, 80)]

replacing the lines below too

panel_li2 = panel_li.reset_index()
panel_li3 = panel_li2[panel_li2['head_age'] <= 80]
panel_li3 = panel_li3[panel_li3['head_age'] >= 20]
panel_li4 = panel_li3.groupby(['head_age', 'li_group']).apply(


Suggested change
panel_li4 = panel_li3.groupby(['head_age', 'li_group']).apply(
panel_li_group = panel_li_20_80.groupby(['head_age', 'li_group'])[["nu18", "n1864", "n65"]].mean()

This takes care of much of the below code. Don't need microdf since nothing's weighted.
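Putting these suggestions together, a minimal sketch of the simplified construction (column names taken from the diff context; a sketch rather than a drop-in replacement for the full script):

import pickle
import numpy as np

# Panel produced by psid_data_setup.py.
panel_li = pickle.load(open('psid_lifetime_income.pkl', 'rb'))

# Household composition counts without the apply-based helper functions.
panel_li['n65'] = np.where(panel_li.head_age > 64, 1, 0) + np.where(panel_li.spouse_age > 64, 1, 0)
panel_li['nu18'] = panel_li.num_children_under18
panel_li['n1864'] = panel_li.num_family - panel_li.n65 - panel_li.nu18

# Restrict to head ages 20-80 and take unweighted means by head age and lifetime income group.
panel_li.reset_index(inplace=True)
panel_li_20_80 = panel_li[panel_li.head_age.between(20, 80)]
panel_li_group = panel_li_20_80.groupby(['head_age', 'li_group'])[['nu18', 'n1864', 'n65']].mean()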

result65 = MVKDE(60, 7, temp65)
result65 = result65*panelFinal3

np.save('nu18', result18)


Could you include these files, as well as temp18 etc., in the PR?


prrathi commented Mar 16, 2021

@MaxGhenis Thanks for the edits, I will go through those. Here are the average pre- and post-smoothing values for each of the age groups, again by head age and income bracket, as CSVs: for example, nu18init is pre and nu18final is post. I think the visualizations for comparison are a good idea; I can work on that.
distributions.zip


codecov-commenter commented May 2, 2021

Codecov Report

Merging #30 (1fdba4f) into master (93cab3d) will decrease coverage by 0.08%.
The diff coverage is n/a.

❗ Current head 1fdba4f differs from pull request most recent head f70a4f3. Consider uploading reports for the commit f70a4f3 to get more accurate results

@@            Coverage Diff             @@
##           master      #30      +/-   ##
==========================================
- Coverage   63.13%   63.04%   -0.09%     
==========================================
  Files           8        8              
  Lines        1188     1188              
==========================================
- Hits          750      749       -1     
- Misses        438      439       +1     
Flag        Coverage Δ
unittests   63.04% <ø> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files                          Coverage Δ
ogusa_calibrate/tests/test_txfunc.py    53.50% <0.00%> (-0.44%) ⬇️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 93cab3d...f70a4f3.


prrathi commented May 6, 2021

Moving this to PR #37
