
Better parallelism for TableVectorizer #586

Closed
LeoGrin opened this issue Jun 10, 2023 · 5 comments · Fixed by #592
Labels
enhancement New feature or request

Comments

@LeoGrin
Contributor

LeoGrin commented Jun 10, 2023

Problem Description

Right now, skrub's TableVectorizer parallelism relies on the inherited ColumnTransformer behavior, which creates one job per transformer. This means that calling TableVectorizer(n_jobs=5).fit_transform(X), where X has 3 low-cardinality columns and 2 high-cardinality columns, will only be parallelized on 2 cores instead of 5. It also means that by default we don't use our encoders' own parallelism (e.g. #582) even when using n_jobs > 1 in TableVectorizer.
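
For concreteness, a minimal illustration of the setup (the column names are made up, and in a real dataset the last two columns would need more unique values than the cardinality threshold to actually be routed to a high-cardinality encoder):

import pandas as pd
from skrub import TableVectorizer

# Toy data: 3 low-cardinality and 2 high-cardinality columns (illustrative only).
X = pd.DataFrame({
    "gender": ["M", "F", "F", "M"],
    "country": ["FR", "FR", "DE", "DE"],
    "employed": ["yes", "no", "yes", "yes"],
    "job_title": ["data scientist", "nurse", "teacher", "plumber"],
    "company": ["A corp", "B inc", "C ltd", "D llc"],
})

# With the inherited ColumnTransformer behavior, only 2 of the 5 requested
# workers can ever be busy at once: one per fitted transformer.
TableVectorizer(n_jobs=5).fit_transform(X)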

Note: Why doesn't ColumnTransformer already do this? I hope I'm not missing anything, but my understanding is that sklearn's transformers are fast enough that they're not parallelized internally, so this feature isn't useful for ColumnTransformer.

Feature Description

Split the cores assigned to TableVectorizer among the estimators, in proportion to the number of columns assigned to each. We could also do fancier things, like assigning more cores to the transformer handling the high-cardinality columns, which should be slower.
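
As a rough sketch of what this split could look like (split_n_jobs is a hypothetical helper, not part of skrub):

def split_n_jobs(n_jobs, transformer_to_columns):
    # Give each transformer a share of n_jobs proportional to its number of
    # columns, with at least one core each. Rounding can make the shares sum
    # to slightly more or less than n_jobs; a real implementation would
    # budget this more carefully.
    total = sum(len(cols) for cols in transformer_to_columns.values())
    return {
        name: max(1, round(n_jobs * len(cols) / total))
        for name, cols in transformer_to_columns.items()
    }

# 5 cores, 3 low-cardinality columns and 2 high-cardinality columns:
split_n_jobs(5, {"low_card": ["a", "b", "c"], "high_card": ["d", "e"]})
# -> {'low_card': 3, 'high_card': 2}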

Alternative Solutions

Parallelize the TableVectorizer over each column, instead of each transformer. Pro: we don't have to support parallelization in each estimator. Con: some estimators gain something from fitting several columns at the same time (for instance the MinHashEncoder if there are shared ngrams).

Additional Context

TableVectorizer can be quite slow; the main culprit is the GapEncoder (#342), which is in the process of being parallelized (#582).

@LeoGrin LeoGrin added the enhancement New feature or request label Jun 10, 2023
@GaelVaroquaux
Member

The reason that the ColumnTransformer creates one job per transformer is that some transformers can be multivariate (for instance a PCA, or a feature selection).

I do see your point that parallel computing could be much improved by special casing a few encoders, such as the GapEncoder that must be parallelized.

The challenge in my eyes is: how to do this. One approach is to override the "_iter" method of the ColumnTransformer, in a way similar to (pseudo-code, won't run):

from sklearn.compose import ColumnTransformer

from skrub import GapEncoder, MinHashEncoder

UNIVARIATE_TRANSFORMERS = (GapEncoder, MinHashEncoder)

...

    def _iter(self, fitted=False, replace_strings=False, column_as_strings=False):
        for name, trans, columns, weight in ColumnTransformer._iter(
            self, fitted=fitted, replace_strings=replace_strings,
            column_as_strings=column_as_strings,
        ):
            if isinstance(trans, UNIVARIATE_TRANSFORMERS):
                # Dispatch one job per column for encoders that treat each
                # column independently.
                for column in columns:
                    yield name, trans, (column,), weight
            else:
                yield name, trans, columns, weight

This will need to be very extensively tested, as we are going to be toying with internals (_iter is a private function, and we are clearly putting our fingers a bit deep inside scikit-learn's private code).
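
As one example of the kind of test this would need (a sketch only; toy_df is an assumed pytest fixture with a mix of low- and high-cardinality columns, and exact equality assumes the encoders are configured deterministically, e.g. with a fixed random_state):

import numpy as np
from scipy import sparse

from skrub import TableVectorizer


def _to_dense(X):
    # The output may be sparse depending on the encoders and sparse_threshold.
    return X.toarray() if sparse.issparse(X) else np.asarray(X, dtype=float)


def test_parallel_matches_sequential(toy_df):
    # The per-column dispatch must not change the results, only the wall time.
    ref = TableVectorizer(n_jobs=1).fit_transform(toy_df)
    par = TableVectorizer(n_jobs=4).fit_transform(toy_df)
    np.testing.assert_allclose(_to_dense(ref), _to_dense(par))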

@LeoGrin
Contributor Author

LeoGrin commented Jun 11, 2023

Thanks! Another solution which might be simpler: find all transformers with an n_jobs attribute and set it manually. I'm wondering how easy it is to combine this with the ColumnTransformer's own parallelism (it doesn't seem to work very well when I do it naively on the current TableVectorizer).
What do you think? Here's the pseudo-code I have in mind:

for name, trans, columns in self.transformers:
    if hasattr(trans, "n_jobs"):
        # Assuming we have plenty of cores; a real implementation should
        # budget this more carefully.
        trans.n_jobs = len(columns)
self.n_jobs = ...  # override if necessary

@GaelVaroquaux
Member

GaelVaroquaux commented Jun 11, 2023 via email

@LeoGrin
Contributor Author

LeoGrin commented Jun 12, 2023

If we want to avoid nested parallelism, something very simple that would still be an improvement is to set the TableVectorizer's n_jobs to 1 and pass the n_jobs argument to all transformers which have this attribute. This would be faster than what we have now (there are usually more columns of each type than there are transformers). And in the current situation, where the high-cardinality transformers are much slower than the other transformers, it should be close to your solution in terms of performance. What do you think about implementing this right now, and going back to your solution later if the situation changes?
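
A minimal sketch of what I mean (hypothetical helper, not actual skrub code):

def _propagate_n_jobs(table_vectorizer, n_jobs):
    # Forward the requested n_jobs to every transformer that exposes it,
    # and keep the ColumnTransformer-level dispatch sequential.
    for name, trans, columns in table_vectorizer.transformers:
        if hasattr(trans, "n_jobs"):
            trans.set_params(n_jobs=n_jobs)
    table_vectorizer.n_jobs = 1  # avoid nested parallelism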

@LeoGrin
Contributor Author

LeoGrin commented Jun 12, 2023

Summarizing the meeting discussion, there are two possibilities:

  • If all transformers have an n_jobs attribute and their n_jobs is None, pass the n_jobs argument down to the transformers and set TableVectorizer.n_jobs to 1. Pro: benefits from some encoders' own parallelism (e.g. MinHashEncoder). Con: surprising for the user.
  • Override the "_iter" method. Pro: less surprising, more jobs being created. Con: dependence on a sklearn private function.

The second method was chosen.
