Better parallelism for TableVectorizer #586
Comments
The reason that the ColumnTransformer creates one job per transformer is that some transformers can be multivariate (for instance a PCA, or a feature selection). I do see your point that parallel computing could be much improved by special-casing a few encoders, such as the GapEncoder, that must be parallelized. The challenge in my eyes is: how to do this. One approach is to override the "_iter" method of the ColumnTransformer, in a way similar to this (pseudo-code):

```python
from sklearn.compose import ColumnTransformer
from skrub import GapEncoder, MinHashEncoder

UNIVARIATE_TRANSFORMERS = (GapEncoder, MinHashEncoder)

# This would live on a ColumnTransformer subclass such as TableVectorizer.
def _iter(self, fitted=False, replace_strings=False, column_as_strings=False):
    for name, trans, columns, weight in ColumnTransformer._iter(
        self, fitted=fitted, replace_strings=replace_strings,
        column_as_strings=column_as_strings,
    ):
        if isinstance(trans, UNIVARIATE_TRANSFORMERS):
            # Yield one entry per column so that one job is dispatched per column.
            for column in columns:
                yield (name, trans, (column,), weight)
        else:
            yield (name, trans, columns, weight)
```

This will need to be very extensively tested, as we are going to be toying with internals (_iter is a private function, and we are clearly putting our fingers a bit deep inside scikit-learn's private code).
Thanks! Another solution, which might be simpler: find all transformers with the n_jobs attribute, and set it manually.
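For illustration, a minimal sketch of that manual approach (the helper below is hypothetical and assumes the sub-transformers actually expose an n_jobs parameter, which is not guaranteed):

```python
def set_inner_n_jobs(transformers, n_jobs):
    # Propagate n_jobs to every transformer that exposes it; transformers
    # without an n_jobs parameter are left untouched.
    for transformer in transformers:
        if hasattr(transformer, "n_jobs"):
            transformer.set_params(n_jobs=n_jobs)
```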
I don't think that this would work terribly well: it creates nested parallelism, with barriers, and would probably lead to much starvation.
If we want to avoid nested parallelism, something which would be very simple while still being an improvement is to set the …
Summarizing the meeting discussion, there were two possibilities; the second method was chosen.
Problem Description

Right now, skrub's TableVectorizer parallelism relies on the inherited ColumnTransformer behavior, which creates one job per transformer. This means that calling TableVectorizer(n_jobs=5).fit_transform(X), where X has 3 low-cardinality columns and 2 high-cardinality columns, will only be parallelized on 2 cores instead of 5. It also means that, by default, we don't use our encoders' own parallelism (e.g. #582), even when using n_jobs > 1 in TableVectorizer.

Note: Why doesn't ColumnTransformer already do this? I hope I'm not missing anything, but my understanding is that sklearn's transformers are fast, so they're not parallelized, and this feature is therefore not useful for ColumnTransformer.
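To make the setup concrete, a minimal sketch of the situation described above (the toy data is illustrative and far too small to actually cross TableVectorizer's cardinality threshold; it is only meant to show the call pattern):

```python
import pandas as pd
from skrub import TableVectorizer

# Stand-in data for 3 low-cardinality and 2 high-cardinality columns.
X = pd.DataFrame({
    "gender": ["F", "M", "F", "M"],
    "contract": ["full", "part", "full", "full"],
    "department": ["A", "B", "A", "B"],
    "employee_name": ["Alice Smith", "Bob Jones", "Carol Wu", "Dan Kim"],
    "job_title": ["Data Analyst", "Office Manager", "Engineer", "Accountant"],
})

# With the inherited ColumnTransformer behavior, one job is created per
# transformer (one for the low-cardinality encoder, one for the
# high-cardinality encoder), so at most 2 of the 5 requested workers do work.
TableVectorizer(n_jobs=5).fit_transform(X)
```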
Feature Description

Split the cores assigned to TableVectorizer between the estimators, in proportion to the number of columns assigned to each estimator. We could also do weirder things, like assigning more cores to the transformer handling the high-cardinality columns, which should be slower.
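A rough sketch of the proportional split (the helper name and rounding policy below are assumptions, not an existing skrub API):

```python
def split_n_jobs(n_jobs, n_columns_per_transformer):
    """Allocate cores proportionally to the number of columns per transformer.

    Naive rounding: each transformer gets at least one core, and the rounded
    total may slightly over- or under-shoot n_jobs.
    """
    total_columns = sum(n_columns_per_transformer)
    return [
        max(1, round(n_jobs * n_columns / total_columns))
        for n_columns in n_columns_per_transformer
    ]

# 5 cores split between a transformer on 3 columns and one on 2 columns -> [3, 2]
split_n_jobs(5, [3, 2])
```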
Alternative Solutions

Parallelize the TableVectorizer on each column, instead of each transformer. Pro: we don't have to support parallelization for each estimator. Con: some estimators gain something from fitting several columns at the same time (for instance the MinHashEncoder if there are shared ngrams).
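For reference, a rough sketch of what per-column parallelization could look like with joblib (illustrative only; a real implementation would have to go through ColumnTransformer's internals, as discussed in the comments above):

```python
from joblib import Parallel, delayed
from sklearn.base import clone

def fit_transform_per_column(transformer, X, columns, n_jobs):
    # Fit an independent clone of the transformer on each column, in parallel.
    # Downside (as noted above): encoders such as the MinHashEncoder lose the
    # benefit of sharing work (e.g. common ngrams) across columns.
    return Parallel(n_jobs=n_jobs)(
        delayed(clone(transformer).fit_transform)(X[[column]])
        for column in columns
    )
```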
Additional Context

TableVectorizer can be quite slow, the main culprit being the GapEncoder (#342), which is in the process of being parallelized (#582).