Add built-in column-specific transformers to TableVectorizer #583

LilianBoulard
Member

@LilianBoulard LilianBoulard commented Jun 9, 2023

Fixes part of #554

As discussed in #580, adds the column_specific_transformers parameter to the TableVectorizer.

This allows this kind of functionality:

from skrub import TableVectorizer, MinHashEncoder
from sklearn.compose import make_column_transformer

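# make_column_transformer takes (transformer, columns) tuples as positional
# arguments; remainder is a keyword argument of the function itself.
make_column_transformer(
    (MinHashEncoder(), ["PRODUCTTYPENAME"]),
    remainder=TableVectorizer(),
)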

from the TableVectorizer directly, with this syntax:

TableVectorizer(
    column_specific_transformers=[
        (MinHashEncoder(), ["PRODUCTTYPENAME"])
    ],
)

When the assignments need to be named (e.g. for a grid search), the user can specify a name, using the same syntax as the ColumnTransformer:

TableVectorizer(
    column_specific_transformers=[
        ("mh_product_type", MinHashEncoder(), ["PRODUCTTYPENAME"])
    ],
)
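
To illustrate the two accepted forms above, here is a minimal sketch of how 2-tuples and 3-tuples could be normalized into uniform (name, transformer, columns) triples. The helper `normalize_specs`, the automatic naming scheme (lowercased class name), and the stand-in `MinHashEncoder` class are assumptions made for illustration; this is not the code from this PR.

```python
from typing import Any, List, Sequence, Tuple

# (name, transformer, columns)
NamedSpec = Tuple[str, Any, List[str]]


def normalize_specs(specs: Sequence[tuple]) -> List[NamedSpec]:
    """Turn 2- and 3-tuples into uniform (name, transformer, columns) triples.

    Hypothetical helper: it only illustrates the two syntaxes shown above.
    """
    normalized: List[NamedSpec] = []
    for spec in specs:
        if len(spec) == 3:
            # Named form: ("mh_product_type", MinHashEncoder(), [...])
            name, transformer, columns = spec
        elif len(spec) == 2:
            # Unnamed form: (MinHashEncoder(), [...])
            transformer, columns = spec
            # Assumed naming scheme: the lowercased class name.
            name = type(transformer).__name__.lower()
        else:
            raise ValueError(f"Expected a 2- or 3-tuple, got {spec!r}")
        normalized.append((name, transformer, list(columns)))

    # Names must be unique so that each assignment can be addressed.
    names = [name for name, _, _ in normalized]
    if len(set(names)) != len(names):
        raise ValueError(f"Transformer names must be unique, got {names}")
    return normalized


class MinHashEncoder:
    """Stand-in so the sketch runs without skrub installed."""
```

With this sketch, the unnamed form would get a derived name, while the named form keeps the name chosen by the user, which is what makes it addressable later on.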

@LilianBoulard LilianBoulard added the enhancement New feature or request label Jun 9, 2023
@LilianBoulard LilianBoulard self-assigned this Jun 9, 2023
@LilianBoulard LilianBoulard changed the title from "Add column_specific_transformers to TableVectorizer" to "Add built-in column-specific transformers to TableVectorizer" Jun 9, 2023
@LilianBoulard LilianBoulard marked this pull request as draft June 9, 2023 14:56
@LilianBoulard LilianBoulard marked this pull request as ready for review June 30, 2023 18:05
Member

@GaelVaroquaux GaelVaroquaux left a comment

Very nice! A few minor comments.

One major point: we should use this in an example, perhaps by modifying an existing one.

@GaelVaroquaux
Member

You have failing tests. Can you address them, please?

@LilianBoulard
Member Author

Thanks for reminding me :)
The PR should be ready to be reviewed/merged now. The implementation is complete and the tests are done.

@LilianBoulard LilianBoulard dismissed GaelVaroquaux’s stale review July 28, 2023 14:19

Concerns were addressed

Member

@jovan-stojanovic jovan-stojanovic left a comment

Thanks @LilianBoulard, nice to see this implemented :)
A few comments for the example

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Hey Lilian, here is my first round of reviews! The example looks neat, let's add the plots from the results of the grid search :)

@LilianBoulard
Member Author

LilianBoulard commented Jul 31, 2023

The grid search doesn't work as expected. The ColumnTransformer and the TableVectorizer behave differently (it works with one but not the other), but I don't know why yet. @glemaitre, I think you had a concern about set_params?
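
For context on the set_params question: scikit-learn's grid search applies candidate values through set_params, where a key of the form name__param is routed to the nested object called name. Below is a toy sketch of that routing, using a stand-in class (`ToyEstimator`, hypothetical) rather than any skrub or scikit-learn code. It only illustrates the naming convention and does not diagnose the problem described above.

```python
class ToyEstimator:
    """Toy stand-in mimicking the `name__param` routing convention."""

    def __init__(self, **params):
        self.params = dict(params)

    def set_params(self, **updates):
        for key, value in updates.items():
            # Split "mh_product_type__n_components" into
            # ("mh_product_type", "__", "n_components").
            name, sep, sub_key = key.partition("__")
            if name not in self.params:
                raise ValueError(f"Invalid parameter {name!r}")
            if sep:
                # Nested key: delegate the remainder to the sub-object.
                self.params[name].set_params(**{sub_key: value})
            else:
                # Plain key: set the value on this object.
                self.params[name] = value
        return self


# Usage: a named sub-transformer is reachable through its name.
encoder = ToyEstimator(n_components=30)
vectorizer = ToyEstimator(mh_product_type=encoder)
vectorizer.set_params(mh_product_type__n_components=10)
```

In this toy model, a grid-search key only works if the outer object actually exposes the named sub-object as a parameter; an unknown name raises an error.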

@LilianBoulard
Member Author

The feature has been ready for a while, but this example is blocking the merge. Since it doesn't work yet and needs more work, I've moved it to a temporary directory: it's no longer accessible on the website, but we don't lose the code. I'll open an issue in a few minutes to track the fix.

Member

@jovan-stojanovic jovan-stojanovic left a comment

I agree that we need to move on until we find a good solution for the grid search.
Please add one more small test before I approve this :)

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Thank you @LilianBoulard! After adding the missing test, I'm happy with this PR as-is. Let's investigate the grid-search on a different PR :)

Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Let's merge once the CI is green

@Vincent-Maladiere Vincent-Maladiere merged commit 9f4ca19 into skrub-data:main Aug 31, 2023