to_dummies implementation may be incorrect #8246

Closed
EdmundsEcho opened this issue Apr 14, 2023 · 11 comments · Fixed by #9143
Labels
bug Something isn't working rust Related to Rust Polars

Comments

@EdmundsEcho
Contributor

EdmundsEcho commented Apr 14, 2023

Polars version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of Polars.

Issue description

Great work on polars. I'm so happy to be moving away from python... you have no idea :))

My understanding of the motivation for to_dummies is to feed a stats/ML model. To encode all of the levels in a series we only need n-1 dummy columns, not n. For instance, to encode a categorical column ["blue", "green"] I only need a single column with [0, 1] values. If a row is not blue, it is implied to be green.

| blue |
|------|
| 0    |
| 1    |

Similarly, to encode a categorical column ["blue", "green", "red"], I only need two dummy columns. Note the first row, where the value is neither blue nor green: it is implied to be red.

| blue | green |
|------|-------|
| 0    | 0     |
| 1    | 0     |
| 0    | 1     |
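To make the request concrete, here is a minimal Python sketch, assuming polars' Series.to_dummies and a {name}_{value} column-naming scheme, of emulating the n-1 encoding today by dropping one indicator column by hand:

```python
import polars as pl

s = pl.Series("color", ["red", "blue", "green", "red"])

# Current behavior: one indicator column per level (n columns),
# named e.g. "color_red", "color_blue", "color_green" (assumed scheme).
full = s.to_dummies()

# The requested n-1 encoding, emulated by dropping one indicator
# column; "red" becomes the implied base level.
reduced = full.drop("color_red")
```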

Not only is the logic wrong, it is harmful: if we don't use n-1 dummies, we introduce collinearity errors. It also interferes with how to interpret the intercept/bias. In this example, the intercept is an estimate for red.

At minimum, may I suggest an option to return n vs. n-1 dummies. I also noticed that others requested keeping the original column; I would second that. There is no canonical or other reason to consume the original categorical column when generating the dummies. It might be another good toggle?

Finally, only as an aside: I tried to implement my own trait with a wrapped Series. There might be other ways, but this approach was problematic when trying to work with multiple columns...

Reproducible example

See above.

Expected behavior

See above.

Installed versions

"to_dummies"
see above.
@EdmundsEcho EdmundsEcho added bug Something isn't working rust Related to Rust Polars labels Apr 14, 2023
@avimallu
Contributor

avimallu commented Apr 14, 2023

Pandas has the same default behavior as Polars does, and the implementation is just fine by definition (in the Python version, and I think it derives that from the Rust version itself). What I would like to see is the same option that Pandas has for get_dummies: a drop_first argument that defaults to False. From the documentation for the drop_first argument:

Whether to get k-1 dummies out of k categorical levels by removing the first level.

I believe that this is what you're alluding to.
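A minimal pandas sketch of that argument (levels are taken in sorted order, so drop_first drops "blue" here):

```python
import pandas as pd

s = pd.Series(["blue", "green", "red"])

pd.get_dummies(s)                   # three columns: blue, green, red
pd.get_dummies(s, drop_first=True)  # two columns: green, red; "blue" is the implied base
```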

@EdmundsEcho
Contributor Author

@avimallu Thank you for the additional information. Your approach would work. Bottom line, there needs to be a way to do this because without it, the predictions are just plain "off". I suspect that when I use the stats packages in Python that do this transformation on my behalf (string field -> dummies), they are doing k - 1. I would be really surprised if they did not... It's not hard to run into the collinearity problem.

@mcrumiller
Contributor

I would definitely say don't leave one out: you cannot presume why the user wants dummy variables. Maybe they are inputting it into an ML model, maybe not. I can see the usefulness of a drop_first input argument.

@EdmundsEcho
Contributor Author

EdmundsEcho commented Apr 17, 2023

@mcrumiller ... at a "user-friendly" level, I absolutely concur. But it is a short-term win, because in terms of "hosting information" the default/only approach right now introduces a redundancy, and thus "two sources of truth", never a good thing. IMO the default should be drop_first; i.e., default to the long-term win.

@slonik-az
Contributor

TL;DR: polars behavior is fine. It is not a bug. One-hot encoding (aka dummies) has to sum to 1.

Here are more details. One-hot encoding of categoricals (aka dummy encoding) has a natural interpretation as a probability vector with a single position holding a probability value of 1 (indicating the chosen category) and the rest being exactly zero. This interpretation is widely used in ML, and it is also the reason that dummy encodings map to 0, 1 rather than false, true: being numeric rather than boolean allows the probabilistic interpretation.

It is also consistent with soft-max mappings of the output probability vector, where uncertainty in the prediction is mapped into a similar probability vector, but with several non-zero probabilities this time around. Now you can construct meaningful empirical loss functions comparing your input (one-hot) and output (soft-max) probability vectors.

If you really care about saving memory, use a sparse vector representation and only store the index where you have 1. Switching from an n-dimensional vector to an (n-1)-dimensional vector is mighty confusing and does not save memory in a meaningful way anyway.
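As a minimal numpy illustration of that interpretation (a sketch, not polars code): the one-hot input and the soft-max output are both probability vectors over all n categories, so a cross-entropy loss can compare them directly:

```python
import numpy as np

# One-hot target for "green" out of ("blue", "green", "red"):
target = np.array([0.0, 1.0, 0.0])

# Model logits; soft-max turns them into a comparable probability vector.
logits = np.array([0.5, 2.0, 0.1])
probs = np.exp(logits) / np.exp(logits).sum()

# Cross-entropy between the two probability vectors; this pairing
# relies on the target keeping all n positions.
loss = -np.sum(target * np.log(probs))
```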

@mcrumiller
Contributor

This interpretation is widely used in ML

The point @EdmundsEcho was making was that if you're viewing these data in any sort of linear model, your covariance matrix is degenerate because you have n vectors with n-1 degrees of freedom.
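A minimal numpy sketch of that degeneracy: with an intercept, the n dummy columns sum to the intercept column, so the design matrix drops a rank:

```python
import numpy as np

# Six observations, two per level, full n-column dummy encoding.
dummies = np.array([[1, 0, 0], [1, 0, 0],
                    [0, 1, 0], [0, 1, 0],
                    [0, 0, 1], [0, 0, 1]])
X = np.column_stack([np.ones(6), dummies])  # intercept + 3 dummies

# The dummy columns sum to the intercept column, so X'X is singular
# and OLS has no unique solution.
print(np.linalg.matrix_rank(X))  # 3, not 4
```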

@slonik-az
Contributor

This interpretation is widely used in ML

The point @EdmundsEcho was making was that if you're viewing these data in any sort of linear model, your covariance matrix is degenerate because you have n vectors with n-1 degrees of freedom.

This is always the case when you have a linear system with constraints. There are many standard ways (e.g., regularization) to deal with rank-deficient matrices. But it should not be in scope for polars, lest it become a kitchen sink of everything under the sun.
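For concreteness, a minimal numpy sketch of one such standard fix: a small ridge penalty makes the normal equations of the full n-column encoding invertible again:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=30)
y = rng.normal(size=30) + labels

# Intercept plus all n dummy columns: X'X is singular.
X = np.column_stack([np.ones(30), np.eye(3)[labels]])

# A small ridge penalty restores invertibility:
lam = 1e-2
beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
```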

@mcrumiller
Contributor

@slonik-az I completely agree with you; I wasn't trying to play devil's advocate.

@EdmundsEcho
Contributor Author

The scope of the request is to have the option.

This is always the case when you have a linear system with constraints.

I'm not sure I follow... There is always some level of collinearity. It's also possible to make it worse. Depending on the optimization approach it could be moot.

That all said, the interpretation of the intercept becomes unnatural.

@zundertj
Collaborator

zundertj commented May 1, 2023

That all said, the interpretation of the intercept becomes unnatural.

Even if we wanted Polars to be more directed at the linear regression use case, a generic "drop_first" has the same problem, as the interpretation of the dummy loadings depends on which category was dropped (the "base case/level" or whatever you like to call it). How do you determine "first"? Order of occurrence in the original column? That seems very brittle to me.

the predictions are just plain "off"

This is a jargon thing, so maybe you didn't intend to say this, but with "prediction", one usually means the "predicted/fitted y" or "y-hat" in a linear regression context. Those are not affected by collinearity. The individual coefficient estimates are unstable, but as a collective they will produce the same y-hat.
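A minimal numpy sketch of that point: the n-column and (n-1)-column designs span the same column space, so the fitted values coincide even though the individual coefficients differ:

```python
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=30)
y = rng.normal(size=30) + labels

full = np.eye(3)[labels]                             # n dummy columns
X_full = np.column_stack([np.ones(30), full])        # rank-deficient
X_red = np.column_stack([np.ones(30), full[:, 1:]])  # intercept + n-1

beta_full = np.linalg.pinv(X_full) @ y               # minimum-norm solution
beta_red = np.linalg.lstsq(X_red, y, rcond=None)[0]

# Same y-hat from both encodings:
print(np.allclose(X_full @ beta_full, X_red @ beta_red))  # True
```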

@EdmundsEcho
Contributor Author

@zundertj I'm inclined to agree that each will produce the same "y-hat" (I have directly observed what you are saying). There are statistical power issues, and the coefficients will be different... I have sometimes encountered scenarios where the collinearity introduced by this absolute redundancy can "bug out"/prevent a successful analysis. That all said, I think we have provided enough reason to include the option to opt out of the extra information :))
