
[BUG] GroupedPredictor inconsistency for predict_proba having different classes per group #579

Closed
fabioscantamburlo opened this issue Sep 26, 2023 · 3 comments · Fixed by #582
Labels: bug (Something isn't working)


fabioscantamburlo commented Sep 26, 2023

Hello scikit-lego users.
When using `predict_proba` with `GroupedPredictor` and a classifier on a DataFrame whose groups contain different label sets, the resulting matrix is collapsed to the left, without regard for the label order.
This yields inconsistencies in the final output, especially with a high number of labels.
To get a sound result, every label would have to appear at least once in every group, which is rather unrealistic.

Here is a snippet of code:

import pandas as pd
import numpy as np

from sklego.meta import GroupedPredictor
from sklearn.linear_model import LogisticRegression


np.random.seed(43)

group_size = 5
n_groups = 2
df = pd.DataFrame({
    "group": ["A"] * group_size + ["B"] * group_size,
    "x": np.random.normal(size=group_size * n_groups),
    "y": np.hstack([
        np.random.choice([0, 1, 2], size=group_size),
        np.random.choice([0, 2], size=group_size),
        ])
})

print(df.groupby('group').agg({'y': set}))


X, y = df[["x", "group"]], df["y"]
model = GroupedPredictor(LogisticRegression(), groups=["group"])
_ = model.fit(X, y)
y_prob = model.predict_proba(X)

print(y_prob.round(2))

Outputs:

>>>                y
>>> group           
>>> A      {0, 1, 2}
>>> B         {0, 2}

>>> [[0.45 0.19 0.36]#grp A
>>> [0.3  0.23 0.47]
>>> [0.37 0.21 0.42]
>>> [0.35 0.22 0.44]
>>> [0.53 0.16 0.31]
>>> [0.79 0.21  nan]#grp B
>>> [0.8  0.2   nan]
>>> [0.81 0.19  nan]
>>> [0.81 0.19  nan]
>>> [0.79 0.21  nan]]

# Expected:
>>> [[0.45 0.19 0.36]#grp A
>>> [0.3  0.23 0.47]
>>> [0.37 0.21 0.42]
>>> [0.35 0.22 0.44]
>>> [0.53 0.16 0.31]
>>> [0.79 nan  0.21]#grp B
>>> [0.8  nan   0.2]
>>> [0.81 nan  0.19]
>>> [0.81 nan  0.19]
>>> [0.79 nan  0.21]]
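The expected output above can be obtained by realigning each group's probability columns to a shared, global class order. A minimal sketch of the idea (not the actual scikit-lego fix; `align_proba` and its arguments are hypothetical names), assuming each fitted sub-estimator exposes scikit-learn's `classes_` attribute:

```python
import numpy as np

def align_proba(proba, group_classes, all_classes):
    """Map one group's (n_samples, len(group_classes)) probabilities
    onto the global class order; classes the group never saw get nan,
    matching the expected output above."""
    out = np.full((proba.shape[0], len(all_classes)), np.nan)
    col = {c: i for i, c in enumerate(all_classes)}
    for j, c in enumerate(group_classes):
        out[:, col[c]] = proba[:, j]
    return out

# Group B only saw labels {0, 2}, while the global order is [0, 1, 2]:
aligned = align_proba(np.array([[0.79, 0.21]]), [0, 2], [0, 1, 2])
print(aligned)  # [[0.79  nan 0.21]]
```

Whether the missing columns should be `nan` or `0.0` is a design choice; `nan` makes it explicit that the group's estimator never saw that class.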
@fabioscantamburlo fabioscantamburlo added the bug Something isn't working label Sep 26, 2023
@FBruzzesi (Collaborator)

Hey @fabioscantamburlo, thanks for reporting the bug.
At the moment there is no internal check for these edge cases, but it may be worth looking into it and adding such a mechanism.

@fabioscantamburlo (Contributor, Author)

I would like to work on this if possible.

@FBruzzesi (Collaborator)

FBruzzesi commented Sep 28, 2023

Glad to hear that! Looking forward to a PR to address this issue😊

FBruzzesi added a commit that referenced this issue Oct 12, 2023
…/579-grouped-predictor-classifier

Labels fix in `GroupedPredictor.predict_proba`
resolves #579