Adding the StringEncoder transformer #1159

Open
wants to merge 51 commits into main
Conversation

@rcap107 (Contributor) commented Nov 26, 2024

This is a first draft of a PR to address #1121

I looked at GapEncoder to figure out what to do. This is a very early version, just to give an idea of the kind of code that's needed.

Things left to do:

  • Testing
  • Parameter checking?
  • Default value for the PCA?
  • Docstrings
  • Deciding name of the features

@rcap107 (Contributor, Author) commented Dec 5, 2024

Tests fail on minimum requirements because I am using PCA rather than TruncatedSVD for the decomposition, and that raises issues with potentially sparse matrices.

@jeromedockes suggests using TruncatedSVD directly from the start, rather than adding a version check.

Also, I am using tf-idf as the vectorizer; should I use something else? Maybe HashingVectorizer?

(writing this down so I don't forget)
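For context, a minimal sketch of the tf-idf + TruncatedSVD combination under discussion (illustrative only, not the PR's code): TfidfVectorizer returns a scipy sparse matrix, which TruncatedSVD handles directly, whereas PCA rejects sparse input on older scikit-learn versions.

```python
# Illustrative sketch, not the PR's implementation: tf-idf produces a sparse
# matrix, which TruncatedSVD accepts directly while older PCA versions do not.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

strings = ["open issue", "closed issue", "open pull request", "merged pull request"]

encoder = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4)),
    TruncatedSVD(n_components=2),
)
embeddings = encoder.fit_transform(strings)  # dense array of shape (4, 2)
```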

@GaelVaroquaux (Member)

I'm very happy to see this progressing.

Can you benchmark it on the experiments from Leo's paper? This is important for the modeling choices (e.g. the hyperparameters).

@rcap107 (Contributor, Author) commented Dec 9, 2024

I'm very happy to see this progressing.

Can you benchmark it on the experiments from Leo's paper? This is important for the modeling choices (e.g. the hyperparameters).

Where can I find the benchmarks?

@GaelVaroquaux (Member)

Actually, let's keep it simple and use the CARTE datasets; they are good enough: https://huggingface.co/datasets/inria-soda/carte-benchmark

You probably want to instantiate a pipeline that uses TableVectorizer + HistGradientBoosting, but embeds one of the string columns with the StringEncoder (the one that has either the highest cardinality or the most "diverse entries" in the sense of https://arxiv.org/abs/2312.09634).
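A rough sketch of what such a benchmark pipeline could look like (the parameter names, in particular routing through TableVectorizer's high_cardinality argument, and the final StringEncoder API are assumptions here):

```python
# Hypothetical sketch of the suggested benchmark setup; assumes the
# TableVectorizer's high_cardinality parameter and the final StringEncoder API.
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline
from skrub import StringEncoder, TableVectorizer

pipeline = make_pipeline(
    # Send high-cardinality string columns to the StringEncoder, keep the
    # TableVectorizer defaults for everything else.
    TableVectorizer(high_cardinality=StringEncoder(n_components=30)),
    HistGradientBoostingRegressor(),
)
# Then pipeline.fit(X, y) / cross_validate(pipeline, X, y) on a CARTE dataset.
```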

@Vincent-Maladiere (Member)

Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder, and GapEncoder? It shows a tiny benchmark on the toxicity dataset.

@rcap107 (Contributor, Author) commented Dec 9, 2024

Should we also add this to the text encoder example, alongside the TextEncoder, MinHashEncoder, and GapEncoder? It shows a tiny benchmark on the toxicity dataset.

It's already there, and it shows that StringEncoder has performance similar to that of GapEncoder and runtime similar to that of MinHashEncoder.

[plot: encoder comparison on the toxicity dataset]

@Vincent-Maladiere (Member)

That's very interesting!

@GaelVaroquaux (Member)

IIUC, char_wb prevents char n-grams from crossing word boundaries, but they're still only character n-grams, no?

Good point. I was confusing it with the "add_words" strategy of the GapEncoder (

if self.add_words: # Init a word counts vectorizer if needed
). I would be interested if we could also explore this option. I seem to remember that it can help markedly, though it comes at a cost.
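To make the idea concrete, here is one way such an "add words" variant could be explored for the StringEncoder. This is purely an assumption on my part, not the PR's implementation nor GapEncoder's internals: concatenate the character n-gram tf-idf with a word-level tf-idf before the SVD step.

```python
# Sketch of a possible "add words" variant (an assumption, not the PR's code):
# union char n-gram and word-level tf-idf features before reducing with SVD.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline

char_plus_words = FeatureUnion(
    [
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 4))),
        ("word", TfidfVectorizer(analyzer="word")),  # the extra word features
    ]
)
encoder = make_pipeline(char_plus_words, TruncatedSVD(n_components=30))
```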

@GaelVaroquaux (Member)

One last thing (I always come up with more :D ):

we'll have to be very careful to summarize the tradeoffs between the different encoders in a few lines (a few lines, something short and clear :D ) at the top of the corresponding section of the docs. It is very important that we convey to the user what we have learned

@rcap107 (Contributor, Author) commented Dec 16, 2024

We discussed this PR during this week's meeting, and some points came up:

  • In the employee salary case (second example), the prediction performance may be due mostly to columns other than the one being encoded, so I should try both OrdinalEncoder and simply dropping the column to see what effect the column has on the prediction (see the sketch after this list).
  • The StringEncoder with the current default parameters seems like a good choice as the default high-cardinality encoder.
  • The overhead of HashingVectorizer on the small datasets I considered is probably why it is so slow, and why it is probably not worth using (at least as the default) for our use case.

I'll clean up the code I am using and try to run the experiments in the next few days.
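As a reference for the ablation in the first point, a sketch of the configurations to compare (parameter names are assumed, as in the pipeline sketch earlier in the thread):

```python
# Sketch of the ablation on the employee salary example (assumed parameters):
# same pipeline, only the handling of the high-cardinality column changes.
from sklearn.preprocessing import OrdinalEncoder
from skrub import StringEncoder, TableVectorizer

high_cardinality_options = {
    "string_encoder": StringEncoder(),  # the encoder proposed in this PR
    "ordinal": OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    "drop": "drop",                     # remove the column entirely
}
vectorizers = {
    name: TableVectorizer(high_cardinality=option)
    for name, option in high_cardinality_options.items()
}
```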

@rcap107 (Contributor, Author) commented Dec 16, 2024

One last thing (I always come up with more :D ):

we'll have to be very careful to summarize the tradeoffs between the different encoders in a few lines (a few lines, something short and clear :D ) at the top of the corresponding section of the docs. It is very important that we convey to the user what we have learned

This is something for a separate PR though

@jeromedockes (Member) left a comment

mostly corner cases remaining 🎉 :)

@@ -132,7 +135,7 @@ def plot_gap_feature_importance(X_trans):
# We set ``n_components`` to 30; however, to achieve the best performance, we would
# need to find the optimal value for this hyperparameter using either |GridSearchCV|
# or |RandomizedSearchCV|. We skip this part to keep the computation time for this
# example small.
# small example.
Review comment (Member):

to keep the computation time for this ...

First, apply a tf-idf vectorization of the text, then reduce the dimensionality
with a truncated SVD decomposition with the given number of parameters.
New features will be named `{col_name}_{component}` if the series has a name,
Review comment (Member):

I think you need double backticks

Parameters
----------
n_components : int, default=30
Number of components to be used for the PCA decomposition. Must be a
Review comment (Member):

to keep the number of acronyms under control maybe we should stick to "SVD" not "PCA". also we could have the expanded acronyms in parentheses the first time we mention them and links to their wikipedia pages in a Notes section

Number of components to be used for the PCA decomposition. Must be a
positive integer.
vectorizer : str, "tfidf" or "hashing"
Vectorizer to apply to the strings, either `tfidf` or `hashing` for
Review comment (Member):

also here, I'm not sure what your desired formatting was -- single backticks will be italic, double for monospace

scikit-learn TfidfVectorizer or HashingVectorizer respectively.
ngram_range : tuple of (int, int) pairs, default=(3,4)
Whether the feature should be made of word or character n-grams.
Review comment (Member):

looks like the docs for ngram_range and analyzer got swapped

analyzer : str, "char", "word" or "char_wb", default="char_wb"
The lower and upper boundary of the range of n-values for different
n-grams to be extracted. All values of n such that min_n <= n <= max_n
will be used. For example an `ngram_range` of `(1, 1)` means only unigrams,
Review comment (Member):

same comment about rst vs markdown

skrub/_string_encoder.py (review thread on outdated code, resolved)
ngram_range=self.ngram_range, analyzer=self.analyzer
),
),
("tsvd", TruncatedSVD(n_components=self.n_components)),
Review comment (Member):

As in the TextEncoder, I think we need to handle the case where the smaller dimension of the tf-idf matrix ends up being < self.n_components (this could happen, for example, when fitting on a column with few unique words while setting a large n_components and using the word analyzer). In that case we can do the same as the TextEncoder, i.e. keep tfidf[:, :self.n_components]

Review comment (Member):

(adding that logic might require you to move the svd out of the pipeline)
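A sketch of the fallback being suggested (the function and variable names are illustrative, not the PR's code); keeping the SVD outside the vectorizer pipeline, as noted above, is what makes this kind of check possible:

```python
# Illustrative sketch of the suggested corner-case handling, with made-up names.
from sklearn.decomposition import TruncatedSVD


def reduce_tfidf(tfidf, n_components):
    """Return a dense array with at most n_components columns."""
    if n_components < tfidf.shape[1]:
        # Enough tf-idf features: apply the truncated SVD as usual.
        return TruncatedSVD(n_components=n_components).fit_transform(tfidf)
    # Fewer tf-idf features than requested components: skip the SVD and keep
    # the (densified) tf-idf matrix, mirroring the TextEncoder's behaviour.
    return tfidf.toarray()
```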

@rcap107 (Contributor, Author) commented Dec 16, 2024

I tested TableVectorizer with drop for high_cardinality and the result is pretty bad. I also tested GapEncoder with add_words=True, but it does not seem to help here. Reading the traceback also let me get the OrdinalEncoder running; it seems to provide some benefit over dropping the column outright, but it's still not quite as good as the other encoders (which is a good thing, IMO).

[plots: prediction performance and fit times for the configurations above]

It's also surprising to see that GapEncoder with add_words=True seems to be slightly faster than the default GapEncoder.

@Vincent-Maladiere (Member)

Nice! So what is the conclusion regarding the StringEncoder(1, 1)? How can it perform so well against drop and OrdinalEncoder, when it only considers individual characters?

@Vincent-Maladiere (Member)

I'm happy that the string encoder looks like a great baseline for short, messy columns and long, free-form text as well.

@GaelVaroquaux (Member) commented Dec 16, 2024 via email

@rcap107 (Contributor, Author) commented Dec 17, 2024

we'll have to be very careful to summarize the tradeoffs between the different encoders in a few lines (a few lines, something short and clear :D ) at the top of the corresponding section of the docs. It is very important that we convey to the user what we have learned

This is something for a separate PR though

I'd rather not. IMHO the docs need to be reorganized as we add complexity to the package. Also, the evidence for this recommendation comes from this PR.

I updated the doc page on the Encoders, but it was only to add the StringEncoder and a short summary of the different methods. Looking at the page, I think it would be better to expand on it with more detail for all encoders and maybe an explanation of the parameters, but that's something that would take way more effort (and definitely something for a separate PR).

@rcap107 (Contributor, Author) commented Dec 17, 2024

Nice! So what is the conclusion regarding the StringEncoder(1, 1)? How can it perform so well against drop and OrdinalEncoder, when it only considers individual characters?

[plot: results for StringEncoder(1, 1) compared with drop and OrdinalEncoder]

My feeling is that OrdinalEncoder is just not that good when there is no order in the feature to begin with, while strings that are similar to each other are usually related, no matter how they are sliced.

I think an interesting experiment would be a dictionary replacement where all strings in the starting table are replaced by random alphanumeric strings, and then checking the performance of the encoders on that. In that case, I can imagine StringEncoder would not do so well compared to OrdinalEncoder.
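For reference, one way to set up that random-replacement experiment (a hypothetical helper, not part of skrub or this PR):

```python
# Hypothetical helper for the proposed experiment: map each distinct string to
# a random alphanumeric token so that string similarity carries no information.
import secrets

import pandas as pd


def scramble_column(column: pd.Series) -> pd.Series:
    """Replace every unique value with a random 12-character hex token."""
    mapping = {value: secrets.token_hex(6) for value in column.dropna().unique()}
    return column.map(mapping)  # missing values stay missing
```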

.. _dirty_categories:
Summary
.......
:class:`StringEncoder` should be used in most cases when working with high-cardinality
Review comment (Contributor, Author):

StringEncoder is not rendered properly, not sure how to fix this

Review comment (Member):

you need to add it to the reference documentation index:

https://github.com/skrub-data/skrub/blob/main/doc/reference/index.rst?plain=1#L43

so that sphinx can create a page and a link for it
