to_dummies implementation may be incorrect #8246
Comments
Pandas has the same default behavior as Polars does, and the implementation is just fine by definition (in the Python version, and I think it derives that from the Rust version itself). What I would like to see is the same option that Pandas has for dropping the first level (`drop_first`).
I believe that this is what you're alluding to.
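For reference, a minimal sketch of the pandas option being discussed (this uses the standard `pandas.get_dummies` API; the example data is made up):

```python
import pandas as pd

s = pd.Series(["blue", "green", "red", "blue"], name="color")

# Full one-hot encoding: one indicator column per level (n columns).
print(pd.get_dummies(s))

# k-1 encoding: drop_first=True drops the first level ("blue"),
# which then acts as the implicit base case.
print(pd.get_dummies(s, drop_first=True))
```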
@avimallu Thank you for the additional information. Your approach would work. Bottom line, there needs to be a way to do this because without it, the predictions are just plain "off". I suspect that when I use the stats packages in Python that do this transformation on my behalf (string field -> dummies), they are doing k - 1. I would be really surprised if they did not... It's not hard to get the collinearity problem.
I would definitely say don't leave one out: you cannot presume why the user wants dummy variables. Maybe they are inputting it into an ML model, maybe not. I can see the usefulness of a `drop_first`-style option, though.
@mcrumiller ... at a "user-friendly" level, I absolutely concur. But it is a short-term win, because in terms of "hosting information", the default/only approach right now introduces a redundancy, and thus "two sources of truth", which is never a good thing. IMO the default should be drop_first; i.e., default to the long-term win.
TL;DR: polars behavior is fine. It is not a bug. One-hot encoding (aka dummies) has to sum to 1. Here are more details. One-hot encoding of categoricals (aka dummies encoding) has a natural interpretation as a probability vector with a single position having a probability value of 1 (indicating your chosen category) and the rest being exactly zero. This interpretation is widely used in ML, and it is also the reason that dummy encoding mappings are …
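As a quick illustration of the probability-vector reading (a small sketch; the example data is made up):

```python
import polars as pl

dummies = pl.DataFrame({"color": ["blue", "green", "red", "blue"]}).to_dummies()
# Each row is a one-hot vector: exactly one 1 and the rest 0,
# so every row sums to exactly 1.
print(dummies.to_numpy().sum(axis=1))  # [1 1 1 1]
```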
The point @EdmundsEcho was making was that if you're viewing these data in any sort of linear model, your covariance matrix is degenerate because you have n vectors with n-1 degrees of freedom.
This is always the case when you have a linear system with constraints. There are many standard ways (e.g. regularization) to deal with rank-deficient matrices. But it should not be in scope for polars, lest it become a kitchen sink of everything under the sun.
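For instance, ridge regression handles the rank-deficient full-dummy design without any special casing (a sketch with numpy only; the data and penalty value are made up):

```python
import numpy as np

# Intercept plus all three dummy columns: the dummies sum to the intercept,
# so the design matrix is rank-deficient.
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
], dtype=float)
y = np.array([1.0, 2.0, 3.0, 1.1])

lam = 0.1  # ridge penalty; made-up value
# (X^T X + lam*I) is full rank, so the ridge solution is unique
# even though the plain OLS normal equations would be singular.
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(beta_ridge)
```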
@slonik-az I completely agree with you, I wasn't trying to play devil's advocate.
The scope of the request is to have the option.
I'm not sure I follow... There is always some level of collinearity. It's also possible to make it worse. Depending on the optimization approach it could be moot. That all said, the interpretation of the intercept becomes unnatural.
Even if we wanted Polars to be more geared toward the linear regression use case, a generic "drop_first" has the same problem, as the interpretation of the dummy loadings depends on which category was dropped (the "base case/level" or whatever you like to call it). How do you determine "first"? Order of occurrence in the original column? That seems very brittle to me.
This is a jargon thing, so maybe you didn't intend to say this, but with "prediction", one usually means the "predicted/fitted y" or "y-hat" in a linear regression context. Those are not affected by collinearity. The individual coefficient estimates are unstable, but as a collective they will produce the same y-hat.
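A quick numerical check of that point (a sketch; the data are made up, and `numpy.linalg.lstsq` is used because it returns a least-squares solution even for the rank-deficient design):

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 1.1])

# Intercept + all 3 dummies (rank-deficient) vs. intercept + 2 dummies (k-1).
X_full = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1], [1, 1, 0, 0]], float)
X_km1  = np.array([[1, 0, 0],    [1, 1, 0],    [1, 0, 1],    [1, 0, 0]],    float)

beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
beta_km1,  *_ = np.linalg.lstsq(X_km1,  y, rcond=None)

# The coefficient vectors differ, but the fitted values (y-hat) are identical,
# because both designs span the same column space.
print(np.allclose(X_full @ beta_full, X_km1 @ beta_km1))  # True
```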
@zundertj I'm inclined to agree that each will produce the same "y-hat" (I have directly observed what you are saying). There are statistical power issues, and the coefficients will be different... I have sometimes encountered scenarios where the collinearity introduced by this absolute redundancy can "bug out"/prevent a successful analysis. That all said, I think we have provided enough reason to include the option to opt out of the extra information :))
Polars version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of Polars.
Issue description
Great work on polars. I'm so happy to be moving away from python... you have no idea :))
My understanding of the motivation for to_dummies is to feed a stats/ML model. To encode all of the levels in a series we only need n-1 dummy columns, not n. For instance, to encode a categorical column ["blue", "green"] I only need a single column with [0, 1] values. If it's not blue, it is implied to be green.
Similarly, to encode a categorical column ["blue", "green", "red"], I only need two dummy columns. Note the first row: where the value is neither blue nor green, it is implied to be red.
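For illustration, here is a minimal sketch of the current behaviour and a manual k-1 workaround (assuming a column named `color` and the default `_` separator for the dummy column names; dropping the base level's column by hand is just one possible way to get there):

```python
import polars as pl

df = pl.DataFrame({"color": ["red", "blue", "green", "blue"]})

# Current behaviour: one indicator column per level (n columns).
dummies = df.to_dummies(columns=["color"])
print(dummies)  # color_blue, color_green, color_red

# Manual k-1 encoding: drop one level's column (here "red"), which then
# becomes the implied base case, encoded as all zeros in the remaining columns.
k_minus_1 = dummies.drop("color_red")
print(k_minus_1)
```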
Not only is the logic wrong, it is harmful: if we don't use n-1 dummies, we introduce collinearity errors. It also messes with how to interpret the intercept/bias. In this example, the intercept is an estimate for red.
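Concretely, a small numpy sketch of that collinearity (the design matrix corresponds to the three-colour example above, with an intercept column added; the data layout is made up):

```python
import numpy as np

# Rows: red, blue, green, blue. Columns: intercept, blue, green, red dummies.
# The three dummy columns always sum to the intercept column,
# so the design matrix is perfectly collinear.
X = np.array([
    [1, 0, 0, 1],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
])
print(np.linalg.matrix_rank(X))  # 3, not 4
```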
At minimum, may I suggest that we have an option to return n vs n-1 columns. I also noticed that others requested maintaining the original column; I would second that. There is no canonical or other reason to consume the original categorical column to generate the dummies. It might be another good toggle?
Finally, only as an aside, I tried to implement my own trait with a wrapped Series. There might be other ways, but this approach was problematic when trying to engage with multiple columns...
Reproducible example
See above.
Expected behavior
See above.
Installed versions