Extrapolate benefits #1500
@hdoupe said:
My first reaction has nothing to do with the logic of the approach in #1500, which I'll discuss in a subsequent comment. I think your choice of file and variable names needs to change. Why not just And more importantly, can you change |
@martinholmer said:
Thanks for pointing this out. Benefit definitely sounds more proper than welfare for our analysis. |
@martinholmer Thanks for your feedback. I made the changes that you suggested. |
Pull request #1500 includes the following lines:
I have a couple of basic questions about this code. First, I don't understand why you are setting 2013 benefit variables to benefit values for 2014. Second, I understand that the Third, how big is the |
@martinholmer said
That sounds sensible to me.
@Amy-Xu is this interpretation correct?
|
@hdoupe said:
OK, but you need to name this variable something else to avoid confusing people who have experience studying transfer programs. I think you want something like this for the
and perhaps something like this in the
The term Now a question about substance. What are the two filing units like that contain seven SSI recipients? That sounds pretty unusual. When you look at those two filing units, what is it about their situation that leads to seven of the filing unit's members getting SSI benefits? Are we sure this is not some kind of data-preparation error? And finally, just to confirm, you're saying the |
@martinholmer said
That sounds reasonable to me. I'll make the suggested changes.
That is worrisome. @Amy-Xu could you look into this?
Yes, it is about 1.3 MB. Sorry for not making that clear. |
@martinholmer asked:
I'll look into this issue on raw CPS first, but eventually might need @andersonfrailey's help to see why those two tax units have so many recipients. |
@andersonfrailey @Amy-Xu have you figured out what's going on here: @Amy-Xu said
|
@hdoupe @martinholmer So I looked into the imputed SSI data. There's one household headed by a never-married single mother, with no income (fwsval) and 7 children. All SSI participation is imputed. I wouldn't be surprised if in the tax unit dataset this household's weight was split in two. |
@Amy-Xu Yes, this seems like a pretty rare case. What do you mean when you say:
|
@hdoupe What I had in mind was split-to-match-income, but then I realized that was for CPS-PUF match. This is a pure CPS dataset. Then I'm not sure why there're two records like this. @andersonfrailey Is there a way to check whether the two records are from household [h_seq==59152]? And if so, why are there two records instead of one? |
@Amy-Xu said:
I don't understand the SSI eligibility rules well enough to understand how her seven children are SSI recipients. Under which provisions of the SSI program do the kids qualify to receive SSI benefits even though their mother does not receive SSI benefits? |
@Amy-Xu because the CPS dataset is a combination of three CPS files |
@martinholmer Thanks for catching this. I got confused as well, since children have to be blind or disabled to qualify for SSI, but none of the kids in this family has been marked as disabled or blind. I looked into my imputation code, and it turns out that we use a combined disability indicator that includes natural disability and work disability. Natural disability is based on CPS disability variables, and work disability is based on the CPS ASEC work disability definition. In this set of work disability rules, item 4 says that if someone is younger than 65 and is covered by Medicare, then this person is classified as work disabled. In this particular family, everyone is less than 65 years old and (somehow) covered by Medicare; I frankly don't know whether this is a reporting error or something else. Thus they all fulfill the disability requirement in my imputation routine, although in reality I don't think they belong to the 'work disability' category or meet SSI eligibility rules. It is indeed possible that only children get SSI but not parents, when children are disabled. However, we have all children of this family imputed mainly because the number of children getting SSI is way too low, so children are more likely to get imputed than adults. I'm aware this imputation routine is far from perfect, and will keep improving the algorithm. Martin, do you think it's ok for us just to remove this family for now since it is a tiny fraction of all imputed individuals? |
It sounds to me as if "someone younger than 65 and covered by Medicare" is referring to adults not children.
What do you mean by imputed? What is being imputed? Medicare receipt?
No, I don't think this family should be removed. That would make your CPS sample different from your benefits sample, which I don't think you want to do. Plus, it's not just this one filing unit: there is another one with 7 SSI beneficiaries, and many with 6 or 5 SSI beneficiaries. So, it seems as if this one filing unit is the tip of an iceberg of poor imputation. I think you should look at the imputation algorithm again. Here is the tally done by @hdoupe:
|
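The tally referred to above is not reproduced in this copy of the thread. As a minimal sketch of how such a tally could be computed, assuming a per-filing-unit count of SSI recipients is available (the function name and the example counts below are hypothetical and are not the actual CPS figures):

```python
from collections import Counter


def tally_recipients(recipient_counts):
    """Map each number of SSI recipients to the number of filing units with that count."""
    return dict(sorted(Counter(recipient_counts).items()))


# Made-up counts for illustration only; these are NOT the real CPS numbers.
example_counts = [0, 0, 1, 0, 2, 1, 0, 7, 0, 1]

tally = tally_recipients(example_counts)
for n_recipients, n_units in tally.items():
    print(f"{n_recipients} SSI recipients: {n_units} filing units")
```

A table like this makes implausible outliers (for example, filing units with five or more recipients) easy to spot before looking at individual records.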
I'm adding adult to the constraints to see what I can get.
SSI benefits and participation, because original CPS SSI data is under-reported, and we imputed participants and benefits to match administrative totals. Medicare coverage is a part of CPS ASEC -- no imputation from our end. |
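To make the rule under discussion concrete, here is a minimal sketch of the combined disability indicator with and without an adult constraint. It is not the actual C-TAM imputation code: the `Person` fields, the function names, and the age-18 threshold are assumptions, and only item 4 of the work disability rules is modeled.

```python
from dataclasses import dataclass


@dataclass
class Person:
    age: int
    natural_disability: bool  # stands in for the CPS disability variables
    on_medicare: bool         # reported Medicare coverage


ADULT_AGE = 18  # assumed threshold for the adult constraint


def work_disabled_item4(person, require_adult):
    """Item 4 only: younger than 65 and covered by Medicare."""
    flagged = person.age < 65 and person.on_medicare
    if require_adult:
        flagged = flagged and person.age >= ADULT_AGE
    return flagged


def disability_indicator(person, require_adult=True):
    """Combined indicator: natural disability OR work disability (item 4)."""
    return person.natural_disability or work_disabled_item4(person, require_adult)


# Hypothetical members of a family like the one described above.
child = Person(age=9, natural_disability=False, on_medicare=True)
mother = Person(age=34, natural_disability=False, on_medicare=True)

print(disability_indicator(child, require_adult=False))  # child flagged without the constraint
print(disability_indicator(child))                       # child not flagged with the constraint
print(disability_indicator(mother))                      # adult still flagged
```

Under these assumptions, the adult constraint stops children from being flagged through Medicare coverage alone, while a child with a recorded natural disability would still qualify.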
@Amy-Xu said:
You have to be kidding about that being a "near-term goal" given the state of TaxBrain. There are TaxBrain bugs that were reported months ago that have not yet been fixed, so I don't see adding something major to TaxBrain as very likely in the "near-term". |
@Amy-Xu said:
OK, there's no public record of the review process you describe above. So, I stand corrected: the CPS benefit has undergone more than minimal review. But the fact that I could quickly spot problems with the SSI imputations after those reviews should make you wonder exactly how effective that review process was. |
@Amy-Xu said:
But the key question is how sensible are the imputed benefit data. To say that the imputed benefit data produce about the same results as earlier benefit data (used in the working paper) says nothing about the quality of the imputed data. It just means that the imputed data haven't changed much since the working paper. Anyway, my proposed approach (leaving open pull request #1500 until more progress is made) does nothing to slow down your data-imputation checking or improvement process and does nothing to slow down your code development work. |
It sounds like the primary objections to proceeding with this PR are:
With regard to the "uncertain nature of the data", I agree with @Amy-Xu that improving the data quality should be and is an ongoing process, rather than a one-time task or objective. As Amy notes, we have already incorporated several rounds of feedback from experts in the field, and we will continue to do so. Yes, the file should be considered 'beta,' and yes, we want to continue to find new ways to improve the file, but that does not mean that the file is not useful as is. Generally, I think the appropriate question about quality should be, "are these data better than the next best alternative, and is the documentation about the state of the data clear." The only alternative to this file is the raw CPS data, and I am quite certain that these data are better than the raw CPS. The C-TAM documentation is clear about the state of the project, as is the CPS.csv documentation.
With regard to the "limited applicability of the data", I strongly disagree with the premise. These data are applicable to all users who want to repeal and replace benefit programs with tax programs or a new program like a UBI. Given the strong interaction and conceptual similarity between benefit programs and tax programs, these seem basic and essential. The forthcoming MTR data will further improve the quality of our behavioral analyses for even tax-only reforms. We currently systematically misstate the behavioral responses to tax reforms because we don't include MTRs from state and local taxes and benefit programs.
With regard to the "size of the data", I agree that this is a problem. Given that the size of the data is the fundamental problem here, at least in my thinking as described above, it doesn't seem that leaving the data in a separate branch is the right approach. It provides no path forward for dealing with that fundamental problem. I wonder if it would make more sense to include these data in a conda package that would be an optional dependency for tax-calculator.
Perhaps we made a mistake by not following this approach for the cps.csv file more generally. Including data of any sort in the tax-calculator repository creates a fundamental tension between adding new, useful variables and minimizing the size of the repo. |
This needs to be accomplished by the end of January for external reasons, so ideally it would be accomplished before the December holidays. I think we are on schedule to do that, even while prioritizing the bug fixes ahead of time. In fact, I don't think we would get the CPS integration done any sooner even if we prioritized it ahead of the bug fixes, as the bug fixes are providing training for all of us (in particular Hank and Sean) about webapp-public. |
I said:
@Amy-Xu reminded me via email that one of the good suggestions in Producing Open Source Software is to provide clarity about grant requirements as early as possible. To that end: it is a grant requirement for TaxBrain to have the capability to repeal and replace benefit programs with a UBI by the end of March. To facilitate that, I would like to have a version of those capabilities up by the end of January, which we will clearly mark as "alpha" or "beta". My apologies for not being clearer about these requirements earlier in the process. I recommend that anyone who hasn't read the book do so; it contains numerous good suggestions. |
@MattHJensen said in the discussion of pull request #1500:
I think moving the My view is that we did not make a mistake when we included |
The size of cps_benefits.csv.gz was one of my initial concerns. If moving this data file to a separate conda package works for TB, it is certainly a very sensible solution. |
I have a couple of questions about the CPS-benefits data file.
|
@martinholmer asked:
That file is produced by Hank @hdoupe, using the code in taxdata PR #108, which has not been merged into the taxdata repo yet. Even though the title only mentions SSI, I believe it is already capable of handling all five programs we included in the cps_benefits.csv file.
This cps_benefits.csv works in the same way as weights files we have for puf and cps. As you can see, those weights files don't have any ID column either, because the sequence of records is guaranteed to match puf or cps in its production process. Same applies to the benefit file here. |
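The positional matching described above can be sketched as follows. This is an illustration, not Tax-Calculator's actual loading code: `attach_benefits` and the column names `wages` and `ssi_ben` are hypothetical. Because there is no ID column to join on, the only structural check available is that both files have the same number of rows.

```python
def attach_benefits(cps_rows, benefit_rows):
    """Combine two record lists row by row; row i of each must describe the same filing unit."""
    if len(cps_rows) != len(benefit_rows):
        raise ValueError(
            f"row count mismatch: {len(cps_rows)} CPS rows vs {len(benefit_rows)} benefit rows"
        )
    # Without an ID column, alignment relies entirely on identical row order.
    return [{**cps, **ben} for cps, ben in zip(cps_rows, benefit_rows)]


# Made-up rows for illustration only.
cps_rows = [{"wages": 30000.0}, {"wages": 0.0}]
benefit_rows = [{"ssi_ben": 0.0}, {"ssi_ben": 8800.0}]

merged = attach_benefits(cps_rows, benefit_rows)
print(merged)
```

A length check catches truncated or mismatched files, but it cannot detect rows that were reordered, which is the trade-off of relying on sequence rather than an explicit ID.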
@Amy-Xu said:
But that pull request does not contain the |
@martinholmer said
This is the same file. I just copied it into the Tax-Calculator repo with what I thought was a better name. Should I change the name in |
@hdoupe said:
Fine. But where is the latest version of the CPS-benefit data file? The one with at most four (not seven) SSI recipients per filing unit? Your pull request #108 contains only one Python file and no csv output. |
@martinholmer Both TC PR #1500 and taxdata PR #108 have been updated. |
@hdoupe said:
Thanks very much. But given the recent conversation in #1500 about putting I had interpreted #1500 comments in the last day or so by @MattHJensen, @martinholmer and @Amy-Xu as all endorsing the idea of moving the Am I confused and we do not have a consensus on the issue of how to distribute the |
@martinholmer I think we do have a consensus here. Since the dataset is already available on the taxdata PR, Hank @hdoupe, could you remove the cps_benefits.csv.gz file from this PR? |
@Amy-Xu @martinholmer All set. Sorry for causing the confusion. |
No need to feel bad: this has been an extended conversation that has gone on while you've been focusing most of your creative energy on fixing TaxBrain. You're making substantial progress on both fronts, so all is good. And your TaxBrain work revealed one of my bugs, so it's me who has been "causing the confusion" with respect to the new Tax-Calculator error/warning messages. |
@Amy-Xu, could you propose a repository name for the data files and suggest who should have write access initially? I will create it in Open Source Economics and set the appropriate permissions. |
It seems to me the name should include a few key words including benefit, tax unit, and CPS. But I'm not sure whether CPS should be there since we're about to add institutional ACS data. So may just In terms of write access, Hank is the only person who directly deals with the dataset, so he should have write access. And you can add whoever else you think is appropriate. |
@Amy-Xu and @hdoupe, see https://github.com/open-source-economics/tax-unit_benefits. I gave you both write access. We can change the name anytime if you'd like. |
Closing in favor of #1719 |
This initial PR is a prototype that uses SSI data to demonstrate how benefits data could be incorporated into the Records object as discussed in the first proposal here. Ultimately, if we decide to go this route, benefits from SNAP, Social Security, Medicare, and Medicaid would also be included. Currently, the data set benefit_extrapolation.csv.gz only includes tax and SSI benefit data, but could easily be expanded to include participation and benefits data on each filing unit for each program from 2014 to 2026.
This is only a demo of a potential code modification to the Tax-Calculator. @Amy-Xu and I are very interested in hearing other opinions on this.
@martinholmer