[BUG] marginal distribution with rake #80

Open
EmanueleCeglia opened this issue May 10, 2024 · 11 comments
Labels
bug Something isn't working

Comments

@EmanueleCeglia

While attempting to calibrate the margins of a sample derived from a survey (df dataframe), I encounter the error displayed at the end of the code flow.

The margins used for calibration are real totals of country x size and country x sector, in the same order as obtained through the command sorted(set(df['ctrysize/sect'])).
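One way to keep the totals aligned with the category order is to attach them to the sorted levels explicitly. The following is only a sketch: the level codes mirror the ones in this issue, while the column contents and the totals are made-up illustration values.

```python
import pandas as pd

# Hypothetical sample; only the column name comes from the issue.
df = pd.DataFrame({"ctrysize": ["IT1", "DE2", "DE1", "IT1", "DE2"]})

# Same ordering as described above: sorted unique levels.
levels = sorted(set(df["ctrysize"]))

# Made-up population totals, listed in the same order as `levels`.
totals = [1200.0, 800.0, 500.0]

# Fail early if a level has no total (or there is a total too many).
assert len(levels) == len(totals)

# Labelled totals make any ordering mistake visible at a glance.
target = pd.Series(totals, index=levels, name="population_total")
print(target)
```

Indexing the totals by level means later lookups go by label rather than by position.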

[Eight screenshots of the code and the resulting error; images not preserved.]
@EmanueleCeglia EmanueleCeglia added the bug Something isn't working label May 10, 2024
@talgalili
Contributor

Hi @EmanueleCeglia
Thanks for the report.
Any chance you could prepare some sample data (self-contained code, no files) that reproduces your issue?
I need to run it locally to be able to reproduce and fix it.

Thanks.

@EmanueleCeglia
Author

EmanueleCeglia commented May 13, 2024

Hi @talgalili, I didn't know how to do that.
I created a public repository where you can run the code yourself and see the bug:
https://github.com/EmanueleCeglia/marginal-distribution-with-rake.git
I hope that works for you.

Thanks :)

@EmanueleCeglia
Author

@talgalili Hi, sorry to bother you.
Do you have any news about the bug?
If the repository doesn't work for you, we can find another solution.
Best regards,
Emanuele

@talgalili
Contributor

Hi @EmanueleCeglia
The simplest solution for me to work with would be code that I can run (without external files) and that reproduces the problem.
You can use .to_list() on a DataFrame to create such a piece of code, and then use pd.DataFrame(the_list) to get it back into a DataFrame.
The challenge for you is to create the smallest possible case that reproduces the issue (so that the code you paste won't be too long).
Could you try and do that?

Thanks!
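A minimal sketch of that round trip is shown below. Since `to_list()` is a Series method, this sketch uses `DataFrame.to_dict(orient="list")` instead; that substitution, the sample values, and the 20-row cut are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the real survey sample.
df = pd.DataFrame(
    {
        "ctrysize": ["DE1", "DE2", "IT1"],
        "ctrysect": ["DEB", "DEC", "ITC"],
        "weight": [1.0, 2.5, 0.7],
    }
)

# Keep only as many rows as are needed to trigger the problem.
small = df.head(20)

# Plain Python literal that can be pasted into an issue comment.
as_dict = small.to_dict(orient="list")
print(as_dict)

# Whoever reproduces the bug rebuilds the frame from the pasted literal.
rebuilt = pd.DataFrame(as_dict)
```

The printed dictionary is the only thing that has to be pasted; no external files are involved.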

@EmanueleCeglia
Author

Hi @talgalili
I understand. I will try to do this as soon as possible and get back to you.
Thanks a lot for your availability.
Best,
Emanuele

@crispy-wonton

crispy-wonton commented May 21, 2024

Hi @talgalili and @EmanueleCeglia,
We ran into a similar issue recently. Ours stemmed from the ipfn package. We forked the ipfn repo with a fix; see here: Dirguis/ipfn@master...nestauk:ipfn:master
It seems this error occurs when using rake with a pandas DataFrame in which a particular feature category has only one instance in the sample.
If a category has only 1 row, the result gets converted into a numpy array when you .loc for that category. I think the error has something to do with this .loc step going wrong on the numpy array because of some kind of recursiveness (?).
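The general pandas behaviour behind this can be seen in isolation. The sketch below is not the ipfn code; it only illustrates, on made-up values, that `.loc` with a single label returns a different type depending on how often that label occurs in the index.

```python
import pandas as pd

# Made-up values; "DE1" appears twice in the index, "IT1" only once.
counts = pd.Series([10.0, 20.0, 5.0], index=["DE1", "DE1", "IT1"])

many = counts.loc["DE1"]  # repeated label: a Series comes back
one = counts.loc["IT1"]   # unique label: a numpy scalar comes back

print(type(many).__name__)
print(type(one).__name__)

# Selecting with a list of labels keeps the Series type in both cases.
one_as_series = counts.loc[["IT1"]]
```

Code that assumes a Series after `.loc` can therefore break as soon as a category has exactly one row.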

@talgalili
Contributor

talgalili commented May 21, 2024 via email

@talgalili
Contributor

Oh, I now see that this is a bug in ipfn (not in balance).

I think it's possible to fix this issue in balance using a monkey patch, as was done here:

# Allows us to control exactly where monkey patching is applied (e.g.: for better code readability and exceptions tracking).

(until ipfn fixes the issue)

@crispy-wonton do you want to try a PR on adding this hack to balance? (or do you think it's easier to redirect the installation to just use your repo, WDYT?)
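For readers unfamiliar with the idea, a scoped monkey patch could look roughly like the sketch below. Everything in it is hypothetical: `fake_lib`, `select`, and `patched_select` are stand-ins, not names from ipfn or balance.

```python
import contextlib
import types

# Stand-in for a third-party module with a function that misbehaves.
fake_lib = types.SimpleNamespace(select=lambda values, key: values[key])


def patched_select(values, key):
    """Hypothetical replacement that always returns a list."""
    result = values[key]
    return result if isinstance(result, list) else [result]


@contextlib.contextmanager
def scoped_patch(obj, name, replacement):
    """Swap obj.<name> for the duration of the with-block only."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        # Always restore the original, even if the block raises.
        setattr(obj, name, original)


data = {"DE1": [10, 20], "IT1": 5}

with scoped_patch(fake_lib, "select", patched_select):
    inside = fake_lib.select(data, "IT1")  # patched behaviour

outside = fake_lib.select(data, "IT1")  # original behaviour again
```

The standard library's `unittest.mock.patch.object` offers the same scoping; the hand-written context manager is used here only to make the mechanism visible.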

@EmanueleCeglia
Author

Hi @talgalili @crispy-wonton, thanks for your feedback.
I tried these combinations:
1: remove categories that have only one observation (and their related margins) -> same error
2: update the ipfn.py file with the recommended changes (keeping all categories) -> same error
3: update the ipfn.py file and remove categories that have only one observation (and their related margins) -> works
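Removing the single-observation categories can be sketched as follows. The helper name `drop_singletons` and the sample rows are assumptions for illustration; the matching margins would still have to be removed separately.

```python
import pandas as pd

# Hypothetical sample in which "DE2" / "DEC" occur only once.
df = pd.DataFrame(
    {
        "ctrysize": ["DE1", "DE1", "DE2", "IT1", "IT1"],
        "ctrysect": ["DEB", "DEB", "DEC", "ITC", "ITC"],
    }
)


def drop_singletons(frame, column):
    """Keep only rows whose level in `column` occurs more than once."""
    counts = frame[column].value_counts()
    keep = counts[counts > 1].index
    return frame[frame[column].isin(keep)]


# Apply the filter to both raking variables in turn.
filtered = drop_singletons(drop_singletons(df, "ctrysize"), "ctrysect")
remaining_levels = sorted(filtered["ctrysize"].unique())
print(remaining_levels)
```

Dropping rows for one variable can create new single-observation levels in the other, so on real data the filter may need to be repeated until nothing changes.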

Now the only thing I still have to explore is why some categories are grouped together, so that in the end they are not balanced.

INFO (2024-05-21 16:30:13,119) [rake/rake (line 154)]: Final covariates and levels that will be used in raking: {'ctrysize': ['_lumped_other', 'DE4', 'DE3', 'DE2', 'FR4', 'DE1', 'IT1'], 'ctrysect': ['_lumped_other', 'ESC', 'DEB', 'FRC', 'ITC', 'DEC', 'DED']}.


@talgalili
Contributor

talgalili commented May 21, 2024 via email

@EmanueleCeglia
Author

Hi @talgalili, here is a quick update: the library no longer gives me any error, even when I keep the categories that have only one observation.
The ipfn.py file is updated with the recommended changes described in the previous messages.
So maybe I was doing something wrong last time.

To avoid _lumped_other (categories grouped together into a generic one), I also changed other parameters inside the library:

  • the convergence rate of the raking algorithm, from 1e-8 to 0.0001
  • the "prop" parameter of two functions, fct_lump_by and fct_lump (util.py), from the default 0.05 to 0.
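To make the effect of "prop" concrete, here is a simplified illustration of proportion-based lumping. It is not balance's implementation of fct_lump: the helper `lump_rare`, its strict "share below prop" rule, and the sample values are all assumptions.

```python
import pandas as pd


def lump_rare(values, prop=0.05, other="_lumped_other"):
    """Replace levels whose share of rows is below `prop` with `other`."""
    shares = values.value_counts(normalize=True)
    rare = shares[shares < prop].index
    # Keep non-rare values as they are, overwrite rare ones.
    return values.where(~values.isin(rare), other)


# 25 made-up rows: "IT1" has a share of 1/25 = 0.04.
s = pd.Series(["DE1"] * 14 + ["DE2"] * 10 + ["IT1"])

default_levels = sorted(lump_rare(s).unique())
no_lump_levels = sorted(lump_rare(s, prop=0).unique())
print(default_levels)
print(no_lump_levels)
```

Under this rule a threshold of 0 lumps nothing, which matches the intent of the change described above.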

I still have a problem: I need to balance two variables in my dataset, ctrysize and ctrysect, but after the calibration only the first one is correctly balanced by the final weights.

[Two plot images, "newplot" and "newplot2"; not preserved.]
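A quick way to check both margins numerically, independent of any plot, is sketched below. The `rake` function is a tiny stand-alone raking loop written for this illustration, not the one in balance or ipfn, and the rows, totals, and helper names are made up.

```python
import pandas as pd


def rake(frame, targets, weight_col="weight", max_iter=1000, tol=1e-8):
    """Tiny raking sketch; `targets` maps column -> Series of totals."""
    w = frame[weight_col].astype(float).copy()
    for _ in range(max_iter):
        max_gap = 0.0
        for col, target in targets.items():
            current = w.groupby(frame[col]).sum()
            factors = target / current  # per-level adjustment
            max_gap = max(max_gap, (factors - 1).abs().max())
            w = w * frame[col].map(factors)
        if max_gap < tol:  # every factor is close to 1: converged
            break
    return w


def margin_gaps(frame, weights, targets):
    """Largest absolute gap between weighted totals and targets."""
    return {
        col: (weights.groupby(frame[col]).sum() - target).abs().max()
        for col, target in targets.items()
    }


# Made-up sample and totals; both margins sum to the same grand total.
df = pd.DataFrame(
    {
        "ctrysize": ["DE1", "DE1", "DE2", "DE2", "DE1"],
        "ctrysect": ["DEB", "DEC", "DEB", "DEC", "DEB"],
        "weight": [1.0, 1.0, 1.0, 1.0, 1.0],
    }
)
targets = {
    "ctrysize": pd.Series({"DE1": 60.0, "DE2": 40.0}),
    "ctrysect": pd.Series({"DEB": 70.0, "DEC": 30.0}),
}

final_weights = rake(df, targets)
gaps = margin_gaps(df, final_weights, targets)
print(gaps)
```

In this sketch each pass matches the margin adjusted last exactly and the other one only approximately, so a looser `tol` leaves a larger gap on the other margin. Whether that is what happens in the case reported here is not established by this example.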
