
Better parallelism for TableVectorizer #586

Closed
LeoGrin opened this issue Jun 10, 2023 · 5 comments · Fixed by #592
Labels
enhancement New feature or request

Comments

@LeoGrin
Contributor

LeoGrin commented Jun 10, 2023

Problem Description

Right now, skrub's TableVectorizer parallelism relies on the inherited ColumnTransformer behavior, which creates one job per transformer. This means that calling TableVectorizer(n_jobs=5).fit_transform(X), where X has 3 low-cardinality columns and 2 high-cardinality columns, will only be parallelized on 2 cores instead of 5. It also means that by default we don't use our encoders' own parallelism (e.g. #582) even when using n_jobs > 1 in TableVectorizer.
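
For concreteness, a minimal illustration of the setup (the column names are made up, and in a real dataset the last two columns would need more unique values than the cardinality threshold to actually be routed to a high-cardinality encoder):

import pandas as pd
from skrub import TableVectorizer

# Toy data: 3 low-cardinality and 2 high-cardinality columns (illustrative only).
X = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "country": ["FR", "FR", "DE", "DE"],
    "employed": ["yes", "no", "yes", "yes"],
    "job_title": ["data scientist", "nurse", "teacher", "plumber"],
    "company": ["A corp", "B inc", "C ltd", "D llc"],
})

# With the inherited ColumnTransformer behavior, only 2 of the 5 requested
# workers can ever be busy at once: one per fitted transformer.
TableVectorizer(n_jobs=5).fit_transform(X)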

Note: Why doesn't ColumnTransformer already do this? I hope I'm not missing anything, but my understanding is that sklearn's transformers are fast enough that they're not parallelized internally, so this feature isn't useful for ColumnTransformer.

Feature Description

Split the cores assigned to TableVectorizer among the estimators, in proportion to the number of columns assigned to each. We could also do fancier things, like assigning more cores to the transformer handling the high-cardinality columns, which should be slower.
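
As a rough sketch of what this split could look like (split_n_jobs is a hypothetical helper, not part of skrub):

def split_n_jobs(n_jobs, transformer_to_columns):
    # Give each transformer a share of n_jobs proportional to its number of
    # columns, with at least one core each. Rounding can make the shares sum
    # to slightly more or less than n_jobs; a real implementation would
    # budget this more carefully.
    total = sum(len(cols) for cols in transformer_to_columns.values())
    return {
        name: max(1, round(n_jobs * len(cols) / total))
        for name, cols in transformer_to_columns.items()
    }

# 5 cores, 3 low-cardinality columns and 2 high-cardinality columns:
split_n_jobs(5, {"low_card": ["a", "b", "c"], "high_card": ["d", "e"]})
# -> {'low_card': 3, 'high_card': 2}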

Alternative Solutions

Parallelize the TableVectorizer over each column, instead of each transformer. Pro: we don't have to support parallelization in each estimator. Con: some estimators gain something from fitting several columns at the same time (for instance the MinHashEncoder if there are shared ngrams).

Additional Context

TableVectorizer can be quite slow; the main culprit is the GapEncoder (#342), which is in the process of being parallelized (#582).

@LeoGrin LeoGrin added the enhancement New feature or request label Jun 10, 2023
@GaelVaroquaux
Member

The reason that the ColumnTransformer creates one job per transformer is that some transformers can be multivariate (for instance a PCA, or a feature selection).

I do see your point that parallel computing could be much improved by special casing a few encoders, such as the GapEncoder that must be parallelized.

The challenge in my eyes is: how to do this. One approach is to override the "_iter" method of the ColumnTransformer, in a way similar to (pseudo-code, won't run):

from sklearn.compose import ColumnTransformer

from skrub import GapEncoder, MinHashEncoder

UNIVARIATE_TRANSFORMERS = (GapEncoder, MinHashEncoder)

...

    def _iter(self, fitted=False, replace_strings=False, column_as_strings=False):
        for name, trans, columns, weight in ColumnTransformer._iter(
            self, fitted=fitted, replace_strings=replace_strings,
            column_as_strings=column_as_strings,
        ):
            if isinstance(trans, UNIVARIATE_TRANSFORMERS):
                # Dispatch one job per column for encoders that treat each
                # column independently.
                for column in columns:
                    yield name, trans, (column,), weight
            else:
                yield name, trans, columns, weight

This will need to be very extensively tested, as we are going to be toying with internals (_iter is a private function, and we are clearly putting our fingers a bit deep inside scikit-learn's private code).
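
As one example of the kind of test this would need (a sketch only; toy_df is an assumed pytest fixture with a mix of low- and high-cardinality columns, and exact equality assumes the encoders are configured deterministically, e.g. with a fixed random_state):

import numpy as np
from scipy import sparse

from skrub import TableVectorizer


def _to_dense(X):
    # The output may be sparse depending on the encoders and sparse_threshold.
    return X.toarray() if sparse.issparse(X) else np.asarray(X, dtype=float)


def test_parallel_matches_sequential(toy_df):
    # The per-column dispatch must not change the results, only the wall time.
    ref = TableVectorizer(n_jobs=1).fit_transform(toy_df)
    par = TableVectorizer(n_jobs=4).fit_transform(toy_df)
    np.testing.assert_allclose(_to_dense(ref), _to_dense(par))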

@LeoGrin
Contributor Author

LeoGrin commented Jun 11, 2023

Thanks! Another solution which might be simpler: find all transformers with an n_jobs attribute and set it manually. I'm wondering how easy it is to combine this with the ColumnTransformer's own parallelism (it doesn't seem to work very well when I do it naively on the current TableVectorizer).
What do you think? Here's the pseudo-code I have in mind:

for name, trans, columns in self.transformers:
    if hasattr(trans, "n_jobs"):
        # Assuming we have plenty of cores; a real implementation should
        # budget this more carefully.
        trans.n_jobs = len(columns)
self.n_jobs = ...  # override if necessary

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 11, 2023 via email

@LeoGrin
Contributor Author

LeoGrin commented Jun 12, 2023

If we want to avoid nested parallelism, something very simple that would still be an improvement is to set the TableVectorizer's n_jobs to 1 and pass the n_jobs argument to all transformers which have this attribute. This would be faster than what we have now (there are usually more columns of each type than there are transformers). And in the current situation, where the high-cardinality transformers are much slower than the other transformers, it should be close to your solution in terms of performance. What do you think about implementing this right now, and going back to your solution later if the situation changes?
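
A minimal sketch of what I mean (hypothetical helper, not actual skrub code):

def _propagate_n_jobs(table_vectorizer, n_jobs):
    # Forward the requested n_jobs to every transformer that exposes it,
    # and keep the ColumnTransformer-level dispatch sequential.
    for name, trans, columns in table_vectorizer.transformers:
        if hasattr(trans, "n_jobs"):
            trans.set_params(n_jobs=n_jobs)
    table_vectorizer.n_jobs = 1  # avoid nested parallelism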

@LeoGrin
Contributor Author

LeoGrin commented Jun 12, 2023

Summarizing the meeting discussion, there are two possibilities:

  • If all transformers have an n_jobs attribute and their n_jobs is None, pass the n_jobs argument down to the transformers and set TableVectorizer.n_jobs to 1. Pro: benefits from some encoders' own parallelism (e.g. MinHashEncoder). Con: surprising for the user.
  • Override the "_iter" method. Pro: less surprising, more jobs being created. Con: dependence on a sklearn private function.

The second method was chosen.
