Adding the StringEncoder transformer #1159

Open
wants to merge 51 commits into main
Conversation

@rcap107 (Contributor) commented Nov 26, 2024

This is a first draft of a PR to address #1121

I looked at GapEncoder to figure out what to do. This is a very early version, just to give an idea of the kind of code that's needed.

Things left to do:

  • Testing
  • Parameter checking?
  • Default value for the PCA?
  • Docstrings
  • Deciding name of the features

@rcap107 (Contributor, Author) commented Dec 5, 2024

Tests fail on minimum requirements because I am using PCA rather than TruncatedSVD for the decomposition, and that raises issues with potentially sparse matrices.

@jeromedockes suggests using TruncatedSVD directly from the start, rather than adding a version check.

Also, I am using tf-idf as the vectorizer; should I use something else? Maybe HashingVectorizer?

(writing this down so I don't forget)
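For context, a minimal sketch of the tf-idf + TruncatedSVD combination under discussion (illustrative only, not the PR's code): TfidfVectorizer returns a scipy sparse matrix, which TruncatedSVD handles directly, whereas PCA rejects sparse input on older scikit-learn versions.

```python
# Illustrative sketch, not the PR's implementation: tf-idf produces a sparse
# matrix, which TruncatedSVD accepts directly while older PCA versions do not.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

strings = ["open issue", "closed issue", "open pull request", "merged pull request"]

encoder = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)),
    TruncatedSVD(n_components=2),
)
embeddings = encoder.fit_transform(strings)  # dense array of shape (4, 2)
```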

@GaelVaroquaux (Member)

I'm very happy to see this progressing.

Can you benchmark it on the experiments from Leo's paper? This is important for the modeling choices (e.g. the hyperparameters).

@rcap107 (Contributor, Author) commented Dec 9, 2024

I'm very happy to see this progressing.

Can you benchmark it on the experiments from Leo's paper? This is important for the modeling choices (e.g. the hyperparameters).

Where can I find the benchmarks?

@GaelVaroquaux (Member)

Actually, let's keep it simple and use the CARTE datasets; they are good enough: https://huggingface.co/datasets/inria-soda/carte-benchmark

You probably want to instantiate a pipeline that uses TableVectorizer + HistGradientBoosting, but embeds one of the string columns with the StringEncoder (the one that has either the highest cardinality or the most "diverse entries" in the sense of https://arxiv.org/abs/2312.09634).
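A rough sketch of what such a benchmark pipeline could look like (the parameter names, in particular routing through TableVectorizer's high_cardinality argument, and the final StringEncoder API are assumptions here):

```python
# Hypothetical sketch of the suggested benchmark setup; assumes the
# TableVectorizer's high_cardinality parameter and the final StringEncoder API.
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import StringEncoder, TableVectorizer

pipeline = make_pipeline(
    # Send high-cardinality string columns to the StringEncoder, keep the
    # TableVectorizer defaults for everything else.
    TableVectorizer(high_cardinality=StringEncoder(n_components=30)),
    HistGradientBoostingRegressor(),
)
# Then pipeline.fit(X, y) / cross_validate(pipeline, X, y) on a CARTE dataset.
```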

@Vincent-Maladiere (Member)

Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder, and GapEncoder? It shows a tiny benchmark on the toxicity dataset.

@rcap107 (Contributor, Author) commented Dec 9, 2024

Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder, and GapEncoder? It shows a tiny benchmark on the toxicity dataset.

It's already there, and it shows that StringEncoder has performance similar to that of GapEncoder and runtime similar to that of MinHashEncoder.

[plot: encoder comparison on the toxicity dataset]

@Vincent-Maladiere (Member)

That's very interesting!

@GaelVaroquaux (Member)

IIUC, char_wb prevents char n-grams from crossing word boundaries, but they're still only character n-grams, no?

Good point. I was confusing it with the "add_words" strategy of the GapEncoder (

if self.add_words: # Init a word counts vectorizer if needed
). I would be interested if we could also explore this option. I seem to remember that it can help markedly, though it comes at a cost.
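To make the idea concrete, here is one way such an "add words" variant could be explored for the StringEncoder. This is purely an assumption on my part, not the PR's implementation nor GapEncoder's internals: concatenate the character n-gram tf-idf with a word-level tf-idf before the SVD step.

```python
# Sketch of a possible "add words" variant (an assumption, not the PR's code):
# union char n-gram and word-level tf-idf features before reducing with SVD.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline

char_plus_words = FeatureUnion(
    [
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))),
        ("word", TfidfVectorizer(analyzer="word")),  # the extra word features
    ]
)
encoder = make_pipeline(char_plus_words, TruncatedSVD(n_components=30))
```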

@GaelVaroquaux (Member)

One last thing (I always come up with more :D ):

we'll have to be very careful to summarize the tradeoffs between the different encoders in a few lines (a few lines, something short and clear :D ) at the top of the corresponding section of the docs. It is very important that we convey to the user what we have learned

@rcap107 (Contributor, Author) commented Dec 16, 2024

We discussed this PR during this week's meeting, and some points came up:

  • In the employee salary case (second example), the prediction performance may be due mostly to columns other than the one being encoded, so I should try both OrdinalEncoder and simply dropping the column to see what effect the column has on the prediction (see the sketch after this list).
  • The StringEncoder with the current default parameters seems like a good choice as the default high-cardinality encoder.
  • The overhead of HashingVectorizer on the small datasets I considered is probably why it is so slow, and why it is probably not worth using (at least as the default) for our use case.

I'll clean up the code I am using and try to run the experiments in the next few days.
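As a reference for the ablation in the first point, a sketch of the configurations to compare (parameter names are assumed, as in the pipeline sketch earlier in the thread):

```python
# Sketch of the ablation on the employee salary example (assumed parameters):
# same pipeline, only the handling of the high-cardinality column changes.
from sklearn.preprocessing import OrdinalEncoder
from skrub import StringEncoder, TableVectorizer

high_cardinality_options = {
    "string_encoder": StringEncoder(),  # the encoder proposed in this PR
    "ordinal": OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    "drop": "drop",                     # remove the column entirely
}
vectorizers = {
    name: TableVectorizer(high_cardinality=option)
    for name, option in high_cardinality_options.items()
}
```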

@rcap107 (Contributor, Author) commented Dec 16, 2024

One last thing (I always come up with more :D ):

we'll have to be very careful to summarize the tradeoffs between the different encoders in a few lines (a few lines, something short and clear :D ) at the top of the corresponding section of the docs. It is very important that we convey to the user what we have learned

This is something for a separate PR though

@jeromedockes (Member) left a comment

mostly corner cases remaining 🎉 :)

@@ -132,7 +135,7 @@ def plot_gap_feature_importance(X_trans):
# We set ``n_components`` to 30; however, to achieve the best performance, we would
# need to find the optimal value for this hyperparameter using either |GridSearchCV|
# or |RandomizedSearchCV|. We skip this part to keep the computation time for this
# example small.
# small example.
Review comment (Member):

to keep the computation time for this ...

First, apply a tf-idf vectorization of the text, then reduce the dimensionality
with a truncated SVD decomposition with the given number of parameters.
New features will be named `{col_name}_{component}` if the series has a name,
Review comment (Member):

I think you need double backticks

Parameters
----------
n_components : int, default=30
Number of components to be used for the PCA decomposition. Must be a
Review comment (Member):

to keep the number of acronyms under control maybe we should stick to "SVD" not "PCA". also we could have the expanded acronyms in parentheses the first time we mention them and links to their wikipedia pages in a Notes section

Number of components to be used for the PCA decomposition. Must be a
positive integer.
vectorizer : str, "tfidf" or "hashing"
Vectorizer to apply to the strings, either `tfidf` or `hashing` for
Review comment (Member):

also here, I'm not sure what your desired formatting was -- single backticks will be italic, double for monospace

scikit-learn TfidfVectorizer or HashingVectorizer respectively.
ngram_range : tuple of (int, int) pairs, default=(3,4)
Whether the feature should be made of word or character n-grams.
Review comment (Member):

looks like the docs for ngram_range and analyzer got swapped

analyzer : str, "char", "word" or "char_wb", default="char_wb"
The lower and upper boundary of the range of n-values for different
n-grams to be extracted. All values of n such that min_n <= n <= max_n
will be used. For example an `ngram_range` of `(1, 1)` means only unigrams,
Review comment (Member):

same comment about rst vs markdown

skrub/_string_encoder.py (review thread on outdated code, resolved)
ngram_range=self.ngram_range, analyzer=self.analyzer
),
),
("tsvd", TruncatedSVD(n_components=self.n_components)),
Review comment (Member):

As in the TextEncoder, I think we need to handle the case where the smaller dimension of the tf-idf matrix ends up being < self.n_components (this could happen, for example, when fitting on a column with few unique words while setting a large n_components and using the word analyzer). In that case we can do the same as the TextEncoder, i.e. keep tfidf[:, :self.n_components]

Review comment (Member):

(adding that logic might require you to move the svd out of the pipeline)
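A sketch of the fallback being suggested (the function and variable names are illustrative, not the PR's code); keeping the SVD outside the vectorizer pipeline, as noted above, is what makes this kind of check possible:

```python
# Illustrative sketch of the suggested corner-case handling, with made-up names.
from sklearn.decomposition import TruncatedSVD


def reduce_tfidf(tfidf, n_components):
    """Return a dense array with at most n_components columns."""
    if n_components < tfidf.shape[1]:
        # Enough tf-idf features: apply the truncated SVD as usual.
        return TruncatedSVD(n_components=n_components).fit_transform(tfidf)
    # Fewer tf-idf features than requested components: skip the SVD and keep
    # the (densified) tf-idf matrix, mirroring the TextEncoder's behaviour.
    return tfidf.toarray()
```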

@rcap107 (Contributor, Author) commented Dec 16, 2024

I tested TableVectorizer with drop for high_cardinality and the result is pretty bad. I also tested GapEncoder with add_words=True, but it does not seem to help here. Reading the traceback also let me get the OrdinalEncoder running; it seems to provide some benefit over dropping the column outright, but it's still not quite as good as the other encoders (which is a good thing, IMO).

[plots: prediction performance and fit times for the configurations above]

It's also surprising to see that GapEncoder with add_words=True seems to be slightly faster than the default GapEncoder.

@Vincent-Maladiere (Member)

Nice! So what is the conclusion regarding the StringEncoder(1, 1)? How can it perform so well against drop and OrdinalEncoder, when it only considers individual characters?

@Vincent-Maladiere (Member)

I'm happy that the string encoder looks like a great baseline for short, messy columns and long, free-form text as well.

@GaelVaroquaux (Member) commented Dec 16, 2024 via email

@rcap107 (Contributor, Author) commented Dec 17, 2024

we'll have to be very careful to summarize the tradeoffs between the different encoders in a few lines (a few lines, something short and clear :D ) at the top of the corresponding section of the docs. It is very important that we convey to the user what we have learned

This is something for a separate PR though

I'd rather not. IMHO the docs need to be reorganized as we add complexity to the package. Also, the evidence for this recommendation comes from this PR.

I updated the doc page on the Encoders, but it was only to add the StringEncoder and a short summary of the different methods. Looking at the page, I think it would be better to expand on it with more detail for all encoders and maybe an explanation of the parameters, but that's something that would take way more effort (and definitely something for a separate PR).

@rcap107 (Contributor, Author) commented Dec 17, 2024

Nice! So what is the conclusion regarding the StringEncoder(1, 1)? How can it perform so well against drop and OrdinalEncoder, when it only considers individual characters?

[plot: results for StringEncoder(1, 1) compared with drop and OrdinalEncoder]

My feeling is that OrdinalEncoder is just not that good when there is no order in the feature to begin with, while strings that are similar to each other are usually related, no matter how they are sliced.

I think an interesting experiment would be a dictionary replacement where all strings in the starting table are replaced by random alphanumeric strings, and then checking the performance of the encoders on that. In that case, I can imagine StringEncoder would not do so well compared to OrdinalEncoder.
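For reference, one way to set up that random-replacement experiment (a hypothetical helper, not part of skrub or this PR):

```python
# Hypothetical helper for the proposed experiment: map each distinct string to
# a random alphanumeric token so that string similarity carries no information.
import secrets

import pandas as pd


def scramble_column(column: pd.Series) -> pd.Series:
    """Replace every unique value with a random 12-character hex token."""
    mapping = {value: secrets.token_hex(6) for value in column.dropna().unique()}
    return column.map(mapping)  # missing values stay missing
```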

.. _dirty_categories:
Summary
.......
:class:`StringEncoder` should be used in most cases when working with high-cardinality
Review comment (Contributor, Author):

StringEncoder is not rendered properly, not sure how to fix this

Review comment (Member):

you need to add it to the reference documentation index:

https://github.com/skrub-data/skrub/blob/main/doc/reference/index.rst?plain=1#L43

so that sphinx can create a page and a link for it
