Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MAINT use composition in TableVectorizer contn'd #761

Merged
merged 16 commits into from
Oct 30, 2023

Conversation

glemaitre
Copy link
Member

closes #660
closes #675

The same as #675 but handles the split/merge parallelism. It seems that we can simplify the code quite a bit.

I need to add the tests and modify some as in the original PR.

@Vincent-Maladiere
Copy link
Member

Thank you for this!

@LeoGrin
Copy link
Contributor

LeoGrin commented Sep 28, 2023

Awesome, the split/merge logic is much simpler now!

@glemaitre
Copy link
Member Author

Awesome, the split/merge logic is much simpler now!

And we don't need the _split method indeed.

@glemaitre glemaitre marked this pull request as draft September 28, 2023 14:02
@jeromedockes
Copy link
Member

@glemaitre with the PR I opened on your branch the conflicts are resolved and the tests should pass so it should be easier to move forward

@glemaitre
Copy link
Member Author

Thanks @jeromedockes I just merge the PR.

@glemaitre
Copy link
Member Author

I assume that it still miss an implementation of the get_params/set_params as done by @Vincent-Maladiere. I did not yet have the time to check those.

@jeromedockes
Copy link
Member

@glemaitre should the merging of transformers be applied also to named_transformers_ and column_indices_? I think it might be confusing otherwise if the transformers_ don't map to anything in those attributes

also we might want these properties to check if the estimator is fitted otherwise we get some attributeerror a bit harder to interpret

@jeromedockes
Copy link
Member

@glemaitre one option to avoid the split and merge stuff without nested parallelism is to rely on the encoder's built-in parallelization, and have no parallelization at the TableVectorizer level. Indeed, in most cases what takes a lot of time will be encoding the text columns, so performing the numerical passthrough at the same time has little benefit. Advantages would be much simpler code, avoiding the tight coupling between the TableVectorizer and the encoders, treating user-provided or skrub encoders in the same way, and focussing the parallelization & optimization efforts in 1 place only -- for example the minhash's parallelization is a bit more involved than just parallelizing over columns because it takes unique values accross columns, so using its own specialized approach could be better. We can talk about it today but it would be great to also have @LeoGrin 's opinion on this

@jeromedockes
Copy link
Member

a small script to show it doesn't slow down a typical example:

import timeit

from skrub.datasets import fetch_employee_salaries
from skrub import TableVectorizer, GapEncoder, MinHashEncoder

dataset = fetch_employee_salaries()

X = dataset.X
y = dataset.y


print(X.iloc[0])
print(X.shape)

number = 10
n_jobs = 8
for encoder in GapEncoder, MinHashEncoder:
    for n_components in 10, 30, 90:
        print(f"\n{encoder.__name__}, {n_components} components")
        vectorizer = TableVectorizer(
            n_jobs=n_jobs,
            high_card_cat_transformer=encoder(n_components=n_components, n_jobs=1),
        )
        elapsed = timeit.timeit(
            "vectorizer.fit_transform(X)", number=number, globals=globals()
        )
        print(f"parallelize vectorizer: {elapsed / number:.2f}s")

        vectorizer = TableVectorizer(
            n_jobs=1,
            high_card_cat_transformer=encoder(n_components=n_components, n_jobs=n_jobs),
        )
        elapsed = timeit.timeit(
            "vectorizer.fit_transform(X)", number=number, globals=globals()
        )
        print(f"parallelize encoder: {elapsed / number:.2f}s")
gender                                                                     F
department                                                               POL
department_name                                         Department of Police
division                   MSB Information Mgmt and Tech Division Records...
assignment_category                                         Fulltime-Regular
employee_position_title                          Office Services Coordinator
date_first_hired                                                  09/22/1986
year_first_hired                                                        1986
Name: 0, dtype: object
(9228, 8)

GapEncoder, 10 components
parallelize vectorizer: 1.16s
parallelize encoder: 1.14s

GapEncoder, 30 components
parallelize vectorizer: 1.67s
parallelize encoder: 1.71s

GapEncoder, 90 components
parallelize vectorizer: 3.41s
parallelize encoder: 3.49s

MinHashEncoder, 10 components
parallelize vectorizer: 0.15s
parallelize encoder: 0.13s

MinHashEncoder, 30 components
parallelize vectorizer: 0.21s
parallelize encoder: 0.18s

MinHashEncoder, 90 components
parallelize vectorizer: 0.39s
parallelize encoder: 0.24s

We can do more benchmarks if needed

@LeoGrin
Copy link
Contributor

LeoGrin commented Oct 4, 2023

We can talk about it today but it would be great to also have @LeoGrin 's opinion on this

Favouring the inner loop by removing the column transformer parallelism and parallelising each transformer was one of the things we considered, but it was deemed too surprising for the user (for instance the n_jobs attribute wouldn't match what the user provided), see #586. I hadn't thought about it, but with this PR (using composition), I think that this is less of a concern: users might be surprised by how we parallelize, but we can document it well. Split/merge would be faster than favouring the inner loop if we have multiple slow transformers (and maybe if we have a lot of normal transformers, I'm not sure), but in the current situation, I agree that the speed gain will probably be small. All in all I think it may actually be a good idea to go back to the simple solution :)

@jeromedockes
Copy link
Member

All in all I think it may actually be a good idea to go back to the simple solution :)

Thanks for your insights! I had seen the original PR but missed that discussion in the issue (and of course the IRL one). Given that you @LeoGrin agree to parallelizing the encoder rather than the TableVectorizer, and so do @glemaitre @Vincent-Maladiere and @GaelVaroquaux, we'll do that. We'll do it in this PR to avoid adding code and tests that would be removed afterwards.

If I misunderstood someone's preference on this please LMK!

@glemaitre
Copy link
Member Author

I'm going to work on this in the afternoon.

@glemaitre glemaitre marked this pull request as ready for review October 4, 2023 13:52
skrub/_table_vectorizer.py Show resolved Hide resolved
skrub/_table_vectorizer.py Show resolved Hide resolved
skrub/_table_vectorizer.py Outdated Show resolved Hide resolved
Copy link
Member

@jeromedockes jeromedockes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few remarks and questions before it LGTM.

skrub/_table_vectorizer.py Show resolved Hide resolved
# Note that when fitting on a dataframe and transforming on
# the same dataframe with different column names,
# _check_feature_names will raise an error.
self._check_feature_names(X, reset=reset)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this block be put before L699? So that feature_names_in is passed to the dataframe constructor instead of being set afterward. Furthermore, can feature_names be None since we convert X to a dataframe?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can move those checks in fit_transform as done in the ColumnTransformer.

skrub/_table_vectorizer.py Show resolved Hide resolved
return list(ct_feature_names)

return all_trans_feature_names
if name == "remainder" and len(columns) < 20:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a TODO to remove this when scikit-learn/scikit-learn#27533 is closed.

Copy link
Contributor

@LeoGrin LeoGrin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

As discussed during the meeting, #709 will be closed in another PR.

Copy link
Member

@Vincent-Maladiere Vincent-Maladiere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! :)

@jeromedockes
Copy link
Member

awesome, thanks a lot!! @glemaitre all reviewers approved it; can I merge it or were you still planning to push some changes?

@glemaitre
Copy link
Member Author

Let's merge and iterate on the improvements.

@jeromedockes jeromedockes merged commit bcd7cd8 into skrub-data:main Oct 30, 2023
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use composition in the TableVectorizer
4 participants