-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MAINT use composition in TableVectorizer contn'd #761
Conversation
Thank you for this! |
Awesome, the split/merge logic is much simpler now! |
And we don't need the |
@glemaitre with the PR I opened on your branch the conflicts are resolved and the tests should pass so it should be easier to move forward |
fix conflicts & tests
Thanks @jeromedockes I just merge the PR. |
I assume that it still miss an implementation of the |
@glemaitre should the merging of transformers be applied also to also we might want these properties to check if the estimator is fitted otherwise we get some attributeerror a bit harder to interpret |
@glemaitre one option to avoid the split and merge stuff without nested parallelism is to rely on the encoder's built-in parallelization, and have no parallelization at the TableVectorizer level. Indeed, in most cases what takes a lot of time will be encoding the text columns, so performing the numerical passthrough at the same time has little benefit. Advantages would be much simpler code, avoiding the tight coupling between the TableVectorizer and the encoders, treating user-provided or skrub encoders in the same way, and focussing the parallelization & optimization efforts in 1 place only -- for example the minhash's parallelization is a bit more involved than just parallelizing over columns because it takes unique values accross columns, so using its own specialized approach could be better. We can talk about it today but it would be great to also have @LeoGrin 's opinion on this |
a small script to show it doesn't slow down a typical example: import timeit
from skrub.datasets import fetch_employee_salaries
from skrub import TableVectorizer, GapEncoder, MinHashEncoder
dataset = fetch_employee_salaries()
X = dataset.X
y = dataset.y
print(X.iloc[0])
print(X.shape)
number = 10
n_jobs = 8
for encoder in GapEncoder, MinHashEncoder:
for n_components in 10, 30, 90:
print(f"\n{encoder.__name__}, {n_components} components")
vectorizer = TableVectorizer(
n_jobs=n_jobs,
high_card_cat_transformer=encoder(n_components=n_components, n_jobs=1),
)
elapsed = timeit.timeit(
"vectorizer.fit_transform(X)", number=number, globals=globals()
)
print(f"parallelize vectorizer: {elapsed / number:.2f}s")
vectorizer = TableVectorizer(
n_jobs=1,
high_card_cat_transformer=encoder(n_components=n_components, n_jobs=n_jobs),
)
elapsed = timeit.timeit(
"vectorizer.fit_transform(X)", number=number, globals=globals()
)
print(f"parallelize encoder: {elapsed / number:.2f}s")
We can do more benchmarks if needed |
Favouring the inner loop by removing the column transformer parallelism and parallelising each transformer was one of the things we considered, but it was deemed too surprising for the user (for instance the n_jobs attribute wouldn't match what the user provided), see #586. I hadn't thought about it, but with this PR (using composition), I think that this is less of a concern: users might be surprised by how we parallelize, but we can document it well. Split/merge would be faster than favouring the inner loop if we have multiple slow transformers (and maybe if we have a lot of normal transformers, I'm not sure), but in the current situation, I agree that the speed gain will probably be small. All in all I think it may actually be a good idea to go back to the simple solution :) |
Thanks for your insights! I had seen the original PR but missed that discussion in the issue (and of course the IRL one). Given that you @LeoGrin agree to parallelizing the encoder rather than the TableVectorizer, and so do @glemaitre @Vincent-Maladiere and @GaelVaroquaux, we'll do that. We'll do it in this PR to avoid adding code and tests that would be removed afterwards. If I misunderstood someone's preference on this please LMK! |
I'm going to work on this in the afternoon. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a few remarks and questions before it LGTM.
skrub/_table_vectorizer.py
Outdated
# Note that when fitting on a dataframe and transforming on | ||
# the same dataframe with different column names, | ||
# _check_feature_names will raise an error. | ||
self._check_feature_names(X, reset=reset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this block be put before L699? So that feature_names_in
is passed to the dataframe constructor instead of being set afterward. Furthermore, can feature_names
be None since we convert X
to a dataframe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can move those checks in fit_transform
as done in the ColumnTransformer
.
return list(ct_feature_names) | ||
|
||
return all_trans_feature_names | ||
if name == "remainder" and len(columns) < 20: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a TODO to remove this when scikit-learn/scikit-learn#27533 is closed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
As discussed during the meeting, #709 will be closed in another PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! :)
awesome, thanks a lot!! @glemaitre all reviewers approved it; can I merge it or were you still planning to push some changes? |
Let's merge and iterate on the improvements. |
closes #660
closes #675
The same as #675 but handles the split/merge parallelism. It seems that we can simplify the code quite a bit.
I need to add the tests and modify some as in the original PR.