to_dummies implementation may be incorrect #8246
Comments
Pandas has the same default behavior as Polars does, and the implementation is just fine by definition (in the Python version, and I think it derives that from the Rust version itself). What I would like to see is the same option that Pandas has for dropping the first level (`drop_first`).
I believe that this is what you're alluding to.
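For reference, a minimal sketch of the pandas option being discussed (this uses the standard `pandas.get_dummies` API; the example data is made up):

```python
import pandas as pd

s = pd.Series(["blue", "green", "red", "blue"], name="color")

# Full one-hot encoding: one indicator column per level (n columns).
print(pd.get_dummies(s))

# k-1 encoding: drop_first=True drops the first level ("blue"),
# which then acts as the implicit base case.
print(pd.get_dummies(s, drop_first=True))
```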
@avimallu Thank you for the additional information. Your approach would work. Bottom line, there needs to be a way to do this because without it, the predictions are just plain "off". I suspect that when I use the stats packages in Python that do this transformation on my behalf (string field -> dummies), they are doing k - 1. I would be really surprised if they did not... It's not hard to get the collinearity problem.
I would definitely say don't leave one out: you cannot presume why the user wants dummy variables. Maybe they are inputting it into an ML model, maybe not. I can see the usefulness of a `drop_first`-style option, though.
@mcrumiller ... at a "user-friendly" level, I absolutely concur. But it is a short-term win, because in terms of "hosting information", the default/only approach right now introduces a redundancy, and thus "two sources of truth", which is never a good thing. IMO the default should be drop_first; i.e., default to the long-term win.
TL;DR: polars behavior is fine. It is not a bug. One-hot encoding (aka dummies) has to sum to 1. Here are more details. One-hot encoding of categoricals (aka dummies encoding) has a natural interpretation as a probability vector with a single position having a probability value of 1 (indicating your chosen category) and the rest being exactly zero. This interpretation is widely used in ML, and it is also the reason that dummy encoding mappings are …
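As a quick illustration of the probability-vector reading (a small sketch; the example data is made up):

```python
import polars as pl

dummies = pl.DataFrame({"color": ["blue", "green", "red", "blue"]}).to_dummies()
# Each row is a one-hot vector: exactly one 1 and the rest 0,
# so every row sums to exactly 1.
print(dummies.to_numpy().sum(axis=1))  # [1 1 1 1]
```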
The point @EdmundsEcho was making was that if you're viewing these data in any sort of linear model, your covariance matrix is degenerate because you have n vectors with n-1 degrees of freedom.
This is always the case when you have a linear system with constraints. There are many standard ways (e.g. regularization) to deal with rank-deficient matrices. But it should not be in scope for polars, lest it become a kitchen sink of everything under the sun.
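For instance, ridge regression handles the rank-deficient full-dummy design without any special casing (a sketch with numpy only; the data and penalty value are made up):

```python
import numpy as np

# Intercept plus all three dummy columns: the dummies sum to the intercept,
# so the design matrix is rank-deficient.
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
], dtype=float)
y = np.array([1.0, 2.0, 3.0, 1.1])

lam = 0.1  # ridge penalty; made-up value
# (X^T X + lam*I) is full rank, so the ridge solution is unique
# even though the plain OLS normal equations would be singular.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(beta_ridge)
```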
@slonik-az I completely agree with you, I wasn't trying to play devil's advocate.
The scope of the request is to have the option.
I'm not sure I follow... There is always some level of collinearity. It's also possible to make it worse. Depending on the optimization approach it could be moot. That all said, the interpretation of the intercept becomes unnatural.
Even if we wanted Polars to be more geared toward the linear regression use case, a generic "drop_first" has the same problem, as the interpretation of the dummy loadings depends on which category was dropped (the "base case/level" or whatever you like to call it). How do you determine "first"? Order of occurrence in the original column? That seems very brittle to me.
This is a jargon thing, so maybe you didn't intend to say this, but with "prediction", one usually means the "predicted/fitted y" or "y-hat" in a linear regression context. Those are not affected by collinearity. The individual coefficient estimates are unstable, but as a collective they will produce the same y-hat.
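A quick numerical check of that point (a sketch; the data are made up, and `numpy.linalg.lstsq` is used because it returns a least-squares solution even for the rank-deficient design):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 1.1])

# Intercept + all 3 dummies (rank-deficient) vs. intercept + 2 dummies (k-1).
X_full = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1], [1, 1, 0, 0]], float)
X_km1  = np.array([[1, 0, 0],    [1, 1, 0],    [1, 0, 1],    [1, 0, 0]],    float)

beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
beta_km1,  *_ = np.linalg.lstsq(X_km1,  y, rcond=None)

# The coefficient vectors differ, but the fitted values (y-hat) are identical,
# because both designs span the same column space.
print(np.allclose(X_full @ beta_full, X_km1 @ beta_km1))  # True
```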
@zundertj I'm inclined to agree that each will produce the same "y-hat" (I have directly observed what you are saying). There are statistical power issues, and the coefficients will be different... I have sometimes encountered scenarios where the collinearity introduced by this absolute redundancy can "bug out"/prevent a successful analysis. That all said, I think we have provided enough reason to include the option to opt out of the extra information :))
Polars version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Issue description
Great work on polars. I'm so happy to be moving away from python... you have no idea :))
My understanding of the motivation for to_dummies is to feed a stats/ML model. To encode all of the levels in a series we only need n-1 dummy columns, not n. For instance, to encode a categorical column ["blue", "green"] I only need a single column with [0, 1] values. If it's not blue, it is implied to be green.
Similarly, to encode a categorical column ["blue", "green", "red"], I only need two dummy columns. Note the first row: where the value is neither blue nor green, it is implied to be red.
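For illustration, here is a minimal sketch of the current behaviour and a manual k-1 workaround (assuming a column named `color` and the default `_` separator for the dummy column names; dropping the base level's column by hand is just one possible way to get there):

```python
import polars as pl

df = pl.DataFrame({"color": ["red", "blue", "green", "blue"]})

# Current behaviour: one indicator column per level (n columns).
dummies = df.to_dummies(columns=["color"])
print(dummies)  # color_blue, color_green, color_red

# Manual k-1 encoding: drop one level's column (here "red"), which then
# becomes the implied base case, encoded as all zeros in the remaining columns.
k_minus_1 = dummies.drop("color_red")
print(k_minus_1)
```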
Not only is the logic wrong, it is harmful: if we don't use n-1 dummies, we introduce collinearity errors. It also messes with how to interpret the intercept/bias. In this example, the intercept is an estimate for red.
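Concretely, a small numpy sketch of that collinearity (the design matrix corresponds to the three-colour example above, with an intercept column added; the data layout is made up):

```python
import numpy as np

# Rows: red, blue, green, blue. Columns: intercept, blue, green, red dummies.
# The three dummy columns always sum to the intercept column,
# so the design matrix is perfectly collinear.
X = np.array([
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
])
print(np.linalg.matrix_rank(X))  # 3, not 4
```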
At minimum, may I suggest that we have an option to return n vs n-1 columns. I also noticed that others requested maintaining the original column; I would second that. There is no canonical or other reason to consume the original categorical column to generate the dummies. It might be another good toggle?
Finally, only as an aside, I tried to implement my own trait with a wrapped Series. There might be other ways, but this approach was problematic when trying to engage with multiple columns...
Reproducible example
See above.
Expected behavior
See above.
Installed versions