Grid-search doesn't work with TableVectorizer #709

Closed
LilianBoulard opened this issue Aug 18, 2023 · 1 comment
Labels
bug Something isn't working

Comments

@LilianBoulard
Member

Describe the bug

A known bug introduced by #583: the TableVectorizer cannot be grid-searched with scikit-learn on its column-specific transformers (the specific_transformers parameter).
This is due to two things: first, the inheritance (something to do with set_params and get_params AFAIK), and second, the fact that we have None as default for transformer parameters (e.g. high_card_cat_transformer).
cc @glemaitre
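
To make the two causes above concrete, here is a minimal pure-Python sketch of how double-underscore parameter names get routed. ToyEstimator, ToyEncoder and ToyVectorizer are hypothetical classes, only loosely modeled on the get_params / set_params convention; this is not scikit-learn's or skrub's actual code, and how closely it matches the real inheritance problem is an assumption.

```python
class ToyEstimator:
    """Toy stand-in for an estimator; NOT scikit-learn's BaseEstimator."""

    param_names = ()

    def get_params(self):
        # Only the constructor arguments listed in param_names are exposed.
        return {name: getattr(self, name) for name in self.param_names}

    def set_params(self, **params):
        valid = self.get_params()
        for key, value in params.items():
            # "a__b" means: set parameter "b" on the object stored in "a".
            name, sep, sub_key = key.partition("__")
            if name not in valid:
                raise ValueError(
                    f"Invalid parameter {name!r}. "
                    f"Valid parameters are: {sorted(valid)}."
                )
            if sep:
                # Delegate to the nested object; fails if it is None.
                valid[name].set_params(**{sub_key: value})
            else:
                setattr(self, name, value)
        return self


class ToyEncoder(ToyEstimator):
    param_names = ("n_components",)

    def __init__(self, n_components=10):
        self.n_components = n_components


class ToyVectorizer(ToyEstimator):
    param_names = ("high_card_cat_transformer", "specific_transformers")

    def __init__(self, high_card_cat_transformer=None, specific_transformers=None):
        self.high_card_cat_transformer = high_card_cat_transformer
        self.specific_transformers = specific_transformers


def outcome(estimator, **params):
    """Return "ok" or the name of the exception raised by set_params."""
    try:
        estimator.set_params(**params)
        return "ok"
    except (ValueError, AttributeError) as exc:
        return type(exc).__name__


vec = ToyVectorizer(
    high_card_cat_transformer=ToyEncoder(),
    specific_transformers=[("mh_dep_name", ToyEncoder(), ["department_name"])],
)

# A transformer passed explicitly is reachable by its parameter name.
direct = outcome(vec, high_card_cat_transformer__n_components=30)
# A name that only lives inside the specific_transformers list is not exposed.
by_name = outcome(vec, mh_dep_name__n_components=25)
# With the None default there is no nested object to delegate to.
none_default = outcome(ToyVectorizer(), high_card_cat_transformer__n_components=30)
```

In this toy, the name inside specific_transformers is rejected because get_params never lists it, and the None default fails because there is no nested object to receive the parameter.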

#583 also introduced a new example explaining how to grid-search with the tool. The code itself is fine, but it doesn't run because of the issues above. The PR that fixes this should reintroduce it (it's located in examples/FIXME/).

Steps/Code to Reproduce

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import HistGradientBoostingClassifier
from skrub import GapEncoder, TableVectorizer, MinHashEncoder
from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()

pipeline = make_pipeline(
    TableVectorizer(
        high_card_cat_transformer=GapEncoder(),
        specific_transformers=[
            ("mh_dep_name", MinHashEncoder(), ["department_name"]),
        ],
    ),
    HistGradientBoostingClassifier(),
)

params = {
    "tablevectorizer__high_card_cat_transformer__n_components": [10, 30, 50],
    "tablevectorizer__mh_dep_name__n_components": [25, 50],
}

grid_search = GridSearchCV(pipeline, param_grid=params)

grid_search.fit(dataset.X, dataset.y)

Expected Results

No error is thrown, the parameters are applied to the nested transformers.

Actual Results

ValueError: Invalid parameter 'mh_dep_name' for estimator TableVectorizer(high_card_cat_transformer=GapEncoder(),
                specific_transformers=[('mh_dep_name', MinHashEncoder(),
                                        ['department_name'])]). Valid parameters are: ['auto_cast', 'cardinality_threshold', 'datetime_transformer', 'high_card_cat_transformer', 'impute_missing', 'low_card_cat_transformer', 'n_jobs', 'numerical_transformer', 'remainder', 'sparse_threshold', 'specific_transformers', 'transformer_weights', 'verbose'].
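
For comparison, a short sketch of the behavior one would expect, using scikit-learn's own ColumnTransformer as a stand-in (not TableVectorizer). The step names onehot_dep and scale_year and the column names are placeholders chosen for illustration, and nothing is fitted. Listing get_params(deep=True) is also a quick way to check which names a grid search will accept.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

pipeline = make_pipeline(
    ColumnTransformer(
        transformers=[
            # Placeholder step and column names, for illustration only.
            ("onehot_dep", OneHotEncoder(), ["department_name"]),
            ("scale_year", StandardScaler(), ["year_first_hired"]),
        ]
    ),
    LogisticRegression(),
)

# Every key returned here is a name that set_params (and so GridSearchCV) accepts.
valid_names = sorted(pipeline.get_params(deep=True))

# Named transformers are exposed as "<step>__<name>__<param>".
nested = [n for n in valid_names if n.startswith("columntransformer__onehot_dep__")]

# Setting a nested parameter by name reaches the inner encoder.
pipeline.set_params(columntransformer__onehot_dep__handle_unknown="ignore")
encoder = pipeline.get_params()["columntransformer__onehot_dep"]
```

Here the name given to each transformer appears in the parameter listing, which is what the grid in the reproduction above relies on for mh_dep_name.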

Versions

Main branch, commit b2b3a7cafafe09431568e25570fa69431efafd51
@Vincent-Maladiere
Member

Vincent-Maladiere commented Nov 21, 2023

Closed by #814
