Adds household_structure.py with smoothed averages of number of <18, 18-64, and 65+ people per family sorted by head age and income bracket #30
Update 3/15
Returns the smoothed number of <18, 18-64, >64 by age/income
Thanks @prrathi, left some suggestions to simplify the code, and if you can drop the files in as csv's that'd help diagnose the issues you brought up.
I think a Jupyter notebook visualizing the raw and smoothed values would also be helpful. For example, a plot of nu18 by head_age, with dots for actuals and a line for the smoothed value, and the same by lifetime income bucket.
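A minimal sketch of such a plot, assuming two Series indexed by head_age (one of actual averages, one of smoothed values). The function name plot_raw_vs_smoothed, the synthetic data, and the rolling mean used as a stand-in smoother are all hypothetical and not part of this PR.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_raw_vs_smoothed(raw, smoothed, ylabel="nu18"):
    """Plot actuals as dots and smoothed values as a line, by head_age."""
    fig, ax = plt.subplots()
    ax.scatter(raw.index, raw.values, s=12, label="actual")
    ax.plot(smoothed.index, smoothed.values, color="C1", label="smoothed")
    ax.set_xlabel("head_age")
    ax.set_ylabel(ylabel)
    ax.legend()
    return ax


# Synthetic stand-in data (hypothetical; not the PSID values).
ages = np.arange(20, 81)
rng = np.random.default_rng(0)
noise = rng.normal(0, 0.1, ages.size)
raw = pd.Series(1.5 * np.exp(-((ages - 38) / 12.0) ** 2) + noise, index=ages)

# A centered rolling mean stands in for the PR's KDE smoother here.
smoothed = raw.rolling(window=5, center=True, min_periods=1).mean()

ax = plot_raw_vs_smoothed(raw, smoothed)
```

The same function could be reused per lifetime income bucket by passing the slice of values for one li_group at a time.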
    if(spouse_age>=65):
        count += 1
    return count #assumes only head or spouse of head can be 65+
panel_li['n65'] = panel_li.apply(lambda x: add65(x['head_age'],x['spouse_age']), axis=1)
Suggested change:
- panel_li['n65'] = panel_li.apply(lambda x: add65(x['head_age'],x['spouse_age']), axis=1)
+ panel_li['n65'] = np.where(panel_li.head_age > 64, 1, 0) + np.where(panel_li.spouse_age > 64, 1, 0)
can replace the add65 function
panel_li = pickle.load(open('psid_lifetime_income.pkl', 'rb')) #created by psid_data_setup.py

panel_li.insert(len(panel_li.columns),"weight",1) #create column of only 1s which is used as weights for taking microdf average
microdf doesn't require weights; you can leave the weights argument in any function empty to have it be unweighted (or, for this file, skip the microdf import, as it's unnecessary).
@@ -0,0 +1,146 @@
remove these empty lines
    return count
panel_li['n1864'] = panel_li.apply(lambda x: add1864(x['head_age'],x['spouse_age'],x['num_family'],x['num_children_under18']), axis=1)

panel_li['nu18'] = panel_li.apply(lambda x: x['num_children_under18'], axis=1) #assumes only children can be <18
Suggested change:
- panel_li['nu18'] = panel_li.apply(lambda x: x['num_children_under18'], axis=1) #assumes only children can be <18
+ panel_li['nu18'] = panel_li.num_children_under18
    if(spouse_age>=65):
        count -= 1
    return count
panel_li['n1864'] = panel_li.apply(lambda x: add1864(x['head_age'],x['spouse_age'],x['num_family'],x['num_children_under18']), axis=1)
Suggested change:
- panel_li['n1864'] = panel_li.apply(lambda x: add1864(x['head_age'],x['spouse_age'],x['num_family'],x['num_children_under18']), axis=1)
+ panel_li['n1864'] = panel_li.num_family - panel_li.n65 - panel_li.nu18
after moving the nu18 line above this, or just use num_children_under18
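Put together, the three vectorized columns might look like the sketch below. The toy rows are hypothetical; the column names follow panel_li as used in this PR, and the same assumptions carry over (only children are under 18, and only the head or spouse can be 65+).

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; column names follow the PR's panel_li.
panel_li = pd.DataFrame({
    "head_age": [30, 70, 66, 45],
    "spouse_age": [28, 66, 50, np.nan],  # NaN: no spouse present
    "num_family": [4, 2, 3, 1],
    "num_children_under18": [2, 0, 1, 0],
})

# Assumes only children can be under 18.
panel_li["nu18"] = panel_li.num_children_under18

# Assumes only the head or spouse can be 65+.
# A NaN spouse_age compares as False, so it contributes 0.
panel_li["n65"] = (
    np.where(panel_li.head_age > 64, 1, 0)
    + np.where(panel_li.spouse_age > 64, 1, 0)
)

# Everyone who is neither under 18 nor 65+ is counted as 18-64.
panel_li["n1864"] = panel_li.num_family - panel_li.n65 - panel_li.nu18

print(panel_li[["nu18", "n1864", "n65"]])
```

Defining nu18 before n1864 is what lets the last line reuse it instead of num_children_under18.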
panel_li['nu18'] = panel_li.apply(lambda x: x['num_children_under18'], axis=1) #assumes only children can be <18

panel_li2 = panel_li.reset_index()
Suggested change:
- panel_li2 = panel_li.reset_index()
+ panel_li.reset_index(inplace=True)
+ panel_li_20_80 = panel_li[panel_li.head_age.between(20, 80)]
replacing below lines too
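For reference, the pandas method for this range filter is Series.between, which is inclusive on both ends by default. The sketch below uses hypothetical ages to check that it keeps the same rows as the two-step filter it replaces.

```python
import pandas as pd

# Hypothetical head ages straddling both cutoffs.
panel_li = pd.DataFrame({"head_age": [18, 20, 45, 80, 81]})

# Series.between includes both endpoints by default.
panel_li_20_80 = panel_li[panel_li.head_age.between(20, 80)]

# The original two-step filter, for comparison.
two_step = panel_li[panel_li["head_age"] <= 80]
two_step = two_step[two_step["head_age"] >= 20]

print(panel_li_20_80.head_age.tolist())
```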
panel_li2 = panel_li.reset_index()
panel_li3 = panel_li2[panel_li2['head_age'] <= 80]
panel_li3 = panel_li3[panel_li3['head_age'] >= 20]
panel_li4 = panel_li3.groupby(['head_age', 'li_group']).apply(
Suggested change:
- panel_li4 = panel_li3.groupby(['head_age', 'li_group']).apply(
+ panel_li_group = panel_li_20_80.groupby(['head_age', 'li_group'])[["nu18", "n1864", "n65"]].mean()
This takes care of much of the below code. Don't need microdf since nothing's weighted.
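A small sketch of what the unweighted groupby mean returns, using hypothetical rows rather than the PSID data. The result is indexed by (head_age, li_group), with one column per age category.

```python
import pandas as pd

# Hypothetical rows: two head ages, two lifetime-income groups.
panel_li_20_80 = pd.DataFrame({
    "head_age": [30, 30, 30, 31],
    "li_group": [1, 1, 2, 1],
    "nu18": [2, 0, 1, 3],
    "n1864": [2, 2, 1, 2],
    "n65": [0, 0, 0, 0],
})

cols = ["nu18", "n1864", "n65"]

# Unweighted mean of each count within every (head_age, li_group) cell.
panel_li_group = panel_li_20_80.groupby(["head_age", "li_group"])[cols].mean()

print(panel_li_group)
```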
result65 = MVKDE(60, 7, temp65)
result65 = result65*panelFinal3

np.save('nu18', result18)
Could you include these files, as well as temp18 etc., in the PR?
@MaxGhenis Thanks for the edits, I'll go through those. Here are the average pre- and post-smoothing values for each of the age groups, again by head age and income bracket, as CSVs, for example
Codecov Report
@@ Coverage Diff @@
## master #30 +/- ##
==========================================
- Coverage 63.13% 63.04% -0.09%
==========================================
Files 8 8
Lines 1188 1188
==========================================
- Hits 750 749 -1
- Misses 438 439 +1
Moving this to PR #37
This code first determines the average number of people <18, 18-64, and 65+ per household by head age and income bracket, then uses a KDE for smoothing. The outputs of the code are currently saved as numpy arrays titled nu18, n1864, and n65. For example, nu18 is a two-dimensional array whose rows are the head ages from 20 to 80 and whose columns are the income brackets used throughout OG-USA; each cell holds the smoothed average number of people under 18 in a household with that head age and income bracket.

Per line 16 in the code, the distribution, particularly the number of people aged between 18 and 64, depends on the new variable num_family that is pulled into the R dataset. Because all the variables from the R dataset pass through, and are saved in, the pickle output by psid_data_setup.py, it would be included in this data. For testing, however, I assumed 4 people in every household, so line 16 read panel_li.insert(len(panel_li.columns),"num_family",4). Here are the results of the smoothing for each of <18, 18-64, and 65+ using this testing: Results.zip. There were definitely some irregular results; here are a few observations for the nu18 array:

Again, the nu18 array wasn't affected by the assumed 4 people per household; this was only the product of smoothing. On the topic of smoothing, I used the same function that was suggested to be consolidated in #25. Look forward to everyone's thoughts!
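To make the array layout concrete, here is a minimal sketch of turning grouped means into a two-dimensional array with head ages as rows and income brackets as columns. The three brackets and the values are hypothetical stand-ins, and the smoothing step is left out.

```python
import numpy as np
import pandas as pd

# Hypothetical grouped means indexed by (head_age, li_group).
ages = range(20, 81)
groups = [1, 2, 3]  # stand-in for the OG-USA income brackets
index = pd.MultiIndex.from_product(
    [ages, groups], names=["head_age", "li_group"]
)
nu18_mean = pd.Series(np.linspace(0.0, 2.0, len(index)), index=index, name="nu18")

# Rows: head_age 20..80; columns: income bracket.
nu18_table = nu18_mean.unstack("li_group").reindex(index=list(ages))
nu18 = nu18_table.to_numpy()

print(nu18.shape)  # (61, 3)
```

Reindexing on the full age range keeps one row per head age even if some age has no observations, in which case that row would hold NaN until it is filled or smoothed.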