Adding the StringEncoder transformer #1159
base: main
Conversation
Tests fail on minimum requirements because I am using PCA rather than TruncatedSVD for the decomposition, which raises issues with potentially sparse matrices. @jeromedockes suggests using TruncatedSVD directly from the start, rather than adding a check on the version. Also, I am using tf-idf as the vectorizer; should I use something else, maybe HashingVectorizer? (Writing this down so I don't forget.)
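For context, a minimal sketch of the sparse-input issue (illustrative values, not the PR code): the tf-idf output is a scipy sparse matrix, which TruncatedSVD consumes directly, while PCA only accepts sparse input on recent scikit-learn versions, hence the version check that TruncatedSVD would avoid.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["London", "Londres", "Paris", "Paris, France"]
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)).fit_transform(docs)

# X is a scipy sparse matrix: TruncatedSVD handles it as-is, whereas
# PCA raises on sparse input for older scikit-learn versions.
embeddings = TruncatedSVD(n_components=2).fit_transform(X)
print(embeddings.shape)  # (4, 2)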
I'm very happy to see this progressing. Can you benchmark it on the experiments from Leo's paper? This is important for modeling choices (e.g. the hyperparameters).
Where can I find the benchmarks?
Actually, let's keep it simple and use the CARTE datasets, they are good enough: https://huggingface.co/datasets/inria-soda/carte-benchmark You probably want to instantiate a pipeline that uses TableVectorizer + HistGradientBoosting, but embeds one of the string columns with the StringEncoder (the one that has either the highest cardinality, or the most "diverse" entries in the sense of https://arxiv.org/abs/2312.09634).
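For what it's worth, a sketch of such a pipeline, assuming the StringEncoder from this PR is importable from skrub; the dataset loading and the regression task are placeholders:

from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import StringEncoder, TableVectorizer

# Route high-cardinality string columns to StringEncoder, keeping the
# TableVectorizer defaults for every other column type.
vectorizer = TableVectorizer(high_cardinality=StringEncoder(n_components=30))
model = make_pipeline(vectorizer, HistGradientBoostingRegressor())
# model.fit(X, y) on each CARTE table, then compare scores across encoders.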
Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder and GapEncoder? It shows a tiny benchmark on the toxicity dataset.
It's already there, and it shows that StringEncoder has performance similar to that of GapEncoder and runtime similar to that of MinHashEncoder.
That's very interesting!
Good point. I was confusing it with the "add_word" strategy of the GapEncoder (Line 236 in 8a542bb).
One last thing (I always come up with more :D ): we'll have to be very careful to summarize the tradeoffs between the different encoders in a few lines (a few lines, something short and clear :D ) at the top of the corresponding section of the docs. It is very important that we convey to the user what we have learned.
We discussed this PR during this week's meeting, and some points came up:
I'll clean up the code I am using and try to run the experiments in the next few days.
This is something for a separate PR though.
mostly corner cases remaining 🎉 :)
@@ -132,7 +135,7 @@ def plot_gap_feature_importance(X_trans):
 # We set ``n_components`` to 30; however, to achieve the best performance, we would
 # need to find the optimal value for this hyperparameter using either |GridSearchCV|
 # or |RandomizedSearchCV|. We skip this part to keep the computation time for this
-# example small.
+# small example.
to keep the computation time for this ...
skrub/_string_encoder.py
Outdated
First, apply a tf-idf vectorization of the text, then reduce the dimensionality
with a truncated SVD decomposition with the given number of components.
New features will be named `{col_name}_{component}` if the series has a name,
I think you need double backticks
skrub/_string_encoder.py
Outdated
Parameters
----------
n_components : int, default=30
    Number of components to be used for the PCA decomposition. Must be a
to keep the number of acronyms under control maybe we should stick to "SVD" not "PCA". also we could have the expanded acronyms in parentheses the first time we mention them and links to their wikipedia pages in a Notes section
    Number of components to be used for the PCA decomposition. Must be a
    positive integer.
vectorizer : str, "tfidf" or "hashing"
    Vectorizer to apply to the strings, either `tfidf` or `hashing` for
also here not sure what was your desired formatting -- single backticks will be italic, double for monospace
skrub/_string_encoder.py
Outdated
    scikit-learn TfidfVectorizer or HashingVectorizer respectively.
ngram_range : tuple of (int, int) pairs, default=(3,4)
    Whether the feature should be made of word or character n-grams.
looks like the docs for ngram_range and analyzer got swapped
skrub/_string_encoder.py
Outdated
analyzer : str, "char", "word" or "char_wb", default="char_wb"
    The lower and upper boundary of the range of n-values for different
    n-grams to be extracted. All values of n such that min_n <= n <= max_n
    will be used. For example an `ngram_range` of `(1, 1)` means only unigrams,
same comment about rst vs markdown
skrub/_string_encoder.py
Outdated
                ngram_range=self.ngram_range, analyzer=self.analyzer
            ),
        ),
        ("tsvd", TruncatedSVD(n_components=self.n_components)),
as in the TextEncoder, I think we need to handle the case where the smaller dimension of the tf-idf ends up < self.n_components (this could happen, for example, when fitting on a column with few unique words while setting a large n_components and using the word analyzer). In that case we can do the same as TextEncoder, i.e. keep tfidf[:, :self.n_components]
(adding that logic might require you to move the svd out of the pipeline)
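For illustration, a hedged sketch of that fallback, with hypothetical attribute names (vectorizer_, tsvd_), not the PR's actual code:

def fit_transform(self, X, y=None):
    # Fit the vectorizer first so the tf-idf width is known before deciding
    # whether an SVD step makes sense (the reason to move the SVD out of
    # the Pipeline).
    X_tfidf = self.vectorizer_.fit_transform(X)
    if self.n_components < X_tfidf.shape[1]:
        self.tsvd_ = TruncatedSVD(n_components=self.n_components)
        return self.tsvd_.fit_transform(X_tfidf)
    # tf-idf already has at most n_components columns: mimic TextEncoder
    # and keep the densified matrix truncated to n_components.
    return X_tfidf[:, : self.n_components].toarray()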
I tested TableVectorizer with [...]. It's also surprising to see that GapEncoder with [...]
Nice! So what is the conclusion regarding the [...]?
I'm happy that the string encoder looks like a great baseline for short, messy columns and long, free-form text as well.
> we'll have to be very careful to summarize the tradeoffs between the different encoders in a few lines (a few lines, something short and clear :D ) at the top of the corresponding section of the docs. It is very important that we convey to the user what we have learned
> This is something for a separate PR though
I'd rather not. IMHO the docs need to be reorganized as we add complexity to the package. Also, the evidence for this recommendation comes from this PR.
I updated the doc page on the Encoders, but it was only to add the [...]
My feeling is that OrdinalEncoder is just not that good if there is no order in the feature to begin with, while strings that are similar to each other are usually related no matter how they are sliced. I think an interesting experiment would be a dictionary replacement where all strings in the starting table are replaced by random alphanumeric strings, and then checking the performance of the encoders on that. In that case, I can imagine StringEncoder would not do so well compared to OrdinalEncoder.
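As an illustration only, that scrambling could look something like this (randomize_categories is a hypothetical helper, not part of skrub):

import secrets

def randomize_categories(values):
    # Map each unique string to a fixed random alphanumeric token: identical
    # entries stay identical, but n-gram overlap between similar strings
    # is destroyed.
    mapping = {}
    return [mapping.setdefault(v, secrets.token_hex(8)) for v in values]

col = ["senior engineer", "Senior Engineer", "junior engineer", "senior engineer"]
print(randomize_categories(col))
# The first and last entries share a token, so OrdinalEncoder sees the same
# categories as before, while StringEncoder loses all string similarity.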
.. _dirty_categories:

Summary
.......

:class:`StringEncoder` should be used in most cases when working with high-cardinality
StringEncoder is not rendered properly, not sure how to fix this
you need to add it to the reference documentation index:
https://github.com/skrub-data/skrub/blob/main/doc/reference/index.rst?plain=1#L43
so that sphinx can create a page and a link for it
This is a first draft of a PR to address #1121
I looked at GapEncoder to figure out what to do. This is a very early version just to have an idea of the kind of code that's needed.
Things left to do: