[BUG] marginal distribution with rake #80

EmanueleCeglia · 2024-05-10T22:44:58Z

While attempting to calibrate the margins of a sample derived from a survey (df dataframe), I encounter the error displayed at the end of the code flow.

The margins used for calibration are real totals of country x size and country x sector, in the same order as obtained through the command sorted(set(df['ctrysize/sect'])).

talgalili · 2024-05-11T07:15:47Z

Hi @EmanueleCeglia
Thanks for the report.
Any chance you could prepare some sample data (self-contained code, no files) that reproduces your issue?
I need to run it locally to be able to reproduce and fix it.

Thanks.

EmanueleCeglia · 2024-05-13T08:05:51Z

Hi @talgalili, I didn't know how to do that.
I created a public repository where you can run the code yourself and see the bug.
https://github.com/EmanueleCeglia/marginal-distribution-with-rake.git
I hope that works for you.

Thanks :)

EmanueleCeglia · 2024-05-15T07:54:30Z

@talgalili Hi, sorry to bother you.
Do you have any news about the bug?
If the repository doesn't work for you, we can find another solution.
Best regards,
Emanuele

talgalili · 2024-05-15T09:18:38Z

Hi @EmanueleCeglia
The simplest solution for me to work with would be code that I can run (without external files) that can reproduce the problem.
You can use .to_list() on a DataFrame to create such a piece of code, and then use pd.DataFrame(the_list) to get it into a DataFrame.
The challenge for you is to create the smallest example that reproduces the issue (so that the code you paste won't be too long).
Could you try and do that?

Thanks!
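
A self-contained round trip of this kind could look like the sketch below. The data is a hypothetical toy stand-in for the real survey sample, and the sketch uses `DataFrame.to_dict(orient="list")`, since pandas exposes `to_dict` on DataFrames (`to_list` exists on Series).

```python
import pandas as pd

# Hypothetical toy data standing in for the real survey sample.
df = pd.DataFrame(
    {
        "ctrysize": ["DE1", "DE1", "IT1", "FR4"],
        "ctrysect": ["DEB", "DEC", "ITC", "FRC"],
    }
)

# Dump the frame to a plain dict of lists. Printing it gives a literal
# that can be pasted directly into an issue comment.
as_dict = df.to_dict(orient="list")
print(as_dict)

# Anyone can rebuild the same frame from the pasted literal.
rebuilt = pd.DataFrame(as_dict)
```

Taking `df.head(n)` or a filtered subset before dumping keeps the pasted literal short.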

EmanueleCeglia · 2024-05-15T09:45:45Z

Hi @talgalili
I understand. I will try to do this as soon as possible and get back to you.
Thanks a lot for your availability.
Best,
Emanuele

crispy-wonton · 2024-05-21T10:41:15Z

Hi @talgalili and @EmanueleCeglia ,
We ran into a similar issue recently. Ours stemmed from the ipfn package. We forked the ipfn repo with a fix - see here: Dirguis/ipfn@master...nestauk:ipfn:master
It seems like this error occurs when using rake with a pandas DataFrame in which a particular feature category has only one instance in the sample.
If you have 1 row for a category, it gets converted into a numpy array when you .loc for that category. The error has something to do with this .loc process going wrong with the numpy array because of some kind of recursiveness (?) I think.
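
The general pandas behaviour behind this can be illustrated with a small sketch. It is not a reproduction of the ipfn code, and the exact type ipfn ends up with may differ; it only shows that a label selected with `.loc` returns a scalar when it occurs once and a Series when it occurs several times.

```python
import pandas as pd

# Toy Series: label "a" occurs once, label "b" occurs twice.
s = pd.Series([10.0, 20.0, 30.0], index=["a", "b", "b"])

single = s.loc["a"]           # one match: a scalar, not a Series
multi = s.loc["b"]            # two matches: a Series of length 2
always_series = s.loc[["a"]]  # list selector: always a Series

print(type(single).__name__, type(multi).__name__, type(always_series).__name__)
```

Code that later calls Series methods on the result of `.loc` will therefore break for single-row categories unless it selects with a list or handles the scalar case.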

talgalili · 2024-05-21T10:59:12Z

Thanks for this! Could you please propose a PR for me to review?


talgalili · 2024-05-21T11:05:26Z

Oh, I now see that this is a bug in ipfn (not in balance).

I think it's possible to fix this issue in balance using a monkey patch, as was done here:

balance/balance/weighting_methods/ipw.py, line 32 in cf22b9f:

    # Allows us to control exactly where monkey patching is applied (e.g.: for better code readability and exceptions tracking).

(until ipfn fixes the issue)

@crispy-wonton do you want to try a PR on adding this hack to balance? (or do you think it's easier to redirect the installation to just use your repo, WDYT?)
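
As a rough illustration of a scoped monkey patch, the sketch below swaps a function on an object only inside a `with` block. Everything in it is hypothetical: `third_party`, `compute`, `fixed_compute` and `patched` are stand-ins, not names from ipfn or balance.

```python
from contextlib import contextmanager
from types import SimpleNamespace

# Hypothetical stand-in for a third-party module with a buggy function.
third_party = SimpleNamespace(compute=lambda x: x - 1)  # pretend this is the bug


def fixed_compute(x):
    """Hypothetical corrected replacement."""
    return x + 1


@contextmanager
def patched(obj, name, replacement):
    """Temporarily replace obj.<name>, restoring the original on exit."""
    original = getattr(obj, name)
    setattr(obj, name, replacement)
    try:
        yield
    finally:
        setattr(obj, name, original)


with patched(third_party, "compute", fixed_compute):
    inside = third_party.compute(1)   # uses the replacement
outside = third_party.compute(1)      # original behaviour is back
```

Restoring the original in `finally` keeps the patch confined to the call site, even if the patched call raises.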

EmanueleCeglia · 2024-05-21T14:41:12Z

Hi @talgalili @crispy-wonton thanks for your feedback.
I tried these combinations:
1: remove categories that have only one observation (and also the related margins) -> usual error
2: update the ipfn.py file with the recommended changes (keeping all categories) -> usual error
3: update the ipfn.py file and remove categories that have only one observation (and also the related margins) -> works
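
Dropping single-observation categories together with their margins could be sketched as follows. The data and the `margins` totals are hypothetical; only the column name `ctrysize` comes from this thread.

```python
import pandas as pd

# Hypothetical sample: "IT1" has a single observation.
df = pd.DataFrame({"ctrysize": ["DE1", "DE1", "DE2", "DE2", "IT1"]})
# Hypothetical population totals per category.
margins = pd.Series({"DE1": 500.0, "DE2": 300.0, "IT1": 200.0})

# Keep only categories observed more than once in the sample.
counts = df["ctrysize"].value_counts()
keep = counts[counts > 1].index

df_filtered = df[df["ctrysize"].isin(keep)]
# Drop the matching margins too, so sample and targets stay aligned.
margins_filtered = margins[margins.index.isin(keep)]
```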

Now the only thing left for me to explore is why some categories are grouped together and therefore end up not being balanced.

INFO (2024-05-21 16:30:13,119) [rake/rake (line 154)]: Final covariates and levels that will be used in raking: {'ctrysize': ['_lumped_other', 'DE4', 'DE3', 'DE2', 'FR4', 'DE1', 'IT1'], 'ctrysect': ['_lumped_other', 'ESC', 'DEB', 'FRC', 'ITC', 'DEC', 'DED']}.
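
The `_lumped_other` level in this log suggests that rare categories are collapsed into one level before raking. The sketch below shows what proportion-based lumping does in general. `lump_rare_levels` is a hypothetical helper, not balance's implementation, and both the `prop` threshold and the strict `<` comparison are assumptions made for illustration.

```python
import pandas as pd


def lump_rare_levels(s, prop=0.05, other="_lumped_other"):
    """Replace levels whose sample share is below `prop` with `other`."""
    shares = s.value_counts(normalize=True)
    rare = shares[shares < prop].index
    return s.where(~s.isin(rare), other)


# Toy data: IT1 makes up 2% of the sample.
s = pd.Series(["DE1"] * 50 + ["DE2"] * 48 + ["IT1"] * 2)

lumped_default = lump_rare_levels(s)       # IT1 falls below 0.05 and is lumped
lumped_off = lump_rare_levels(s, prop=0)   # no share is below 0: nothing is lumped
```

Under these assumptions, a threshold of 0 leaves every level intact, at the cost of raking on very small cells.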

talgalili · 2024-05-21T17:30:14Z

Thank you for the update! Your checks leave me confused. I don't understand why using both solutions is the only thing that works. Do you have any guesses?


EmanueleCeglia · 2024-05-29T12:38:54Z

Hi @talgalili, here are a few updates: the library no longer gives me any error, even though I am keeping the categories that have only one observation.
The ipfn.py file is updated with the recommended changes explained in the previous messages.
So maybe I was doing something wrong last time.

In order to avoid _lumped_other (categories grouped together in a generic one) I also changed other parameters inside the library:

the convergence rate of the raking algorithm, from 1e-8 to 0.0001
the "prop" parameter of two functions in util.py (fct_lump_by and fct_lump), from the default 0.05 to 0

I still have a problem: I need to balance two variables in my dataset, ctrysize and ctrysect, but after the calibration only the first one is correctly balanced by the final weights.
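
One way to check which variable is off is to compare weighted category shares against the targets for each variable. The sketch below uses hypothetical data, weights and target shares; `weighted_shares` is an illustrative helper, not part of balance.

```python
import pandas as pd


def weighted_shares(df, column, weight_col="weight"):
    """Share of total weight falling in each category of `column`."""
    totals = df.groupby(column)[weight_col].sum()
    return totals / totals.sum()


# Hypothetical weighted sample.
df = pd.DataFrame(
    {
        "ctrysize": ["DE1", "DE1", "IT1", "IT1"],
        "ctrysect": ["DEB", "DEC", "ITC", "ITC"],
        "weight": [3.0, 1.0, 2.0, 2.0],
    }
)

# Hypothetical target shares for each variable.
targets = {
    "ctrysize": pd.Series({"DE1": 0.5, "IT1": 0.5}),
    "ctrysect": pd.Series({"DEB": 0.25, "DEC": 0.25, "ITC": 0.5}),
}

# Largest absolute gap between weighted and target shares, per variable.
gaps = {
    col: (weighted_shares(df, col) - target).abs().max()
    for col, target in targets.items()
}
print(gaps)
```

In this toy case ctrysize matches its targets while ctrysect does not. It may also be worth checking that the two sets of margins are mutually consistent (for example, that they sum to the same total), since raking cannot match both otherwise.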

EmanueleCeglia added the "bug" label (Something isn't working) on May 10, 2024
