skrub-data · Vincent-Maladiere · Aug 31, 2023 · Jun 9, 2023 · Jun 9, 2023 · Jun 30, 2023
diff --git a/CHANGES.rst b/CHANGES.rst
@@ -53,7 +53,9 @@ Major changes
   - scikit-learn >= 1.2.1
   - pandas >= 1.5.3 :pr:`613` by :user:`Lilian Boulard <LilianBoulard>`
 
-* Removed `requests` from the requirements. :pr:`613` by :user:`Lilian Boulard <LilianBoulard>`
+* You can now pass column-specific transformers to :class:`TableVectorizer`
+  using the `specific_transformers` argument.
+  :pr:`583` by :user:`Lilian Boulard <LilianBoulard>`.
 
 * Do not support 1-D array (and pandas Series) in :class:`TableVectorizer`. Pass a
   2-D array (or a pandas DataFrame) with a single column instead. This change is for
@@ -95,6 +97,8 @@ Minor changes
 * Add `get_feature_names_out` method to :class:`MinHashEncoder`.
   :pr:`616` by :user:`Leo Grinsztajn <LeoGrin>`
 
+* Removed `requests` from the requirements. :pr:`613` by :user:`Lilian Boulard <LilianBoulard>`
+
 * :class:`TableVectorizer` now handles mixed types columns without failing
   by converting them to string before type inference.
   :pr:`623`by :user:`Leo Grinsztajn <LeoGrin>`
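
Note on the changelog entry for `specific_transformers`: the example file added in this diff passes entries in two shapes, an unnamed `(transformer, columns)` 2-tuple and a named `(name, transformer, columns)` 3-tuple. The following is a minimal, standard-library-only sketch of how such entries could be brought to one named form. The helper `normalize_specific_transformers` and the generated `specific_<i>` names are hypothetical and are not skrub's implementation; the string "MinHashEncoder()" merely stands in for a real transformer instance.

```python
def normalize_specific_transformers(entries):
    """Return every entry as a named (name, transformer, columns) 3-tuple.

    Hypothetical helper for illustration only; how skrub names
    unnamed entries internally may differ.
    """
    normalized = []
    for index, entry in enumerate(entries):
        if len(entry) == 2:
            transformer, columns = entry
            name = f"specific_{index}"  # assumed placeholder name
        elif len(entry) == 3:
            name, transformer, columns = entry
        else:
            raise ValueError(
                f"expected a 2-tuple or a 3-tuple, got {len(entry)} items"
            )
        normalized.append((name, transformer, list(columns)))
    return normalized


entries = [
    # Unnamed 2-tuple, as in the first specific_transformers example.
    ("MinHashEncoder()", ["department_name"]),
    # Named 3-tuple, as in the grid-search example.
    ("mh_dep_name", "MinHashEncoder()", ["department_name"]),
]
normalized = normalize_specific_transformers(entries)
```

Giving an entry an explicit name is what lets the grid-search part of the example address it through a parameter path.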

diff --git a/examples/FIXME/07_grid_searching_with_the_tablevectorizer.py b/examples/FIXME/07_grid_searching_with_the_tablevectorizer.py
@@ -0,0 +1,142 @@
+"""
+.. _example_grid_search_with_the_tablevectorizer:
+
+=================================================
+Performing a grid-search with the TableVectorizer
+=================================================
+
+In this example, we will see how to customize the |TableVectorizer|
+and how to perform a grid search with it.
+
+
+.. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer`
+
+.. |OneHotEncoder| replace:: :class:`~sklearn.preprocessing.OneHotEncoder`
+
+.. |GapEncoder| replace:: :class:`~skrub.GapEncoder`
+
+.. |MinHashEncoder| replace:: :class:`~skrub.MinHashEncoder`
+
+"""
+
+###############################################################################
+# Customizing the TableVectorizer
+# -------------------------------
+#
+# In this section, we will see two cases where we might want to customize the
+# |TableVectorizer|: when we want to use a custom transformer for a column type
+# and when we want to use a custom transformer for a specific column.
+#
+# The data
+# ........
+#
+# Throughout this example, we will use the employee salaries dataset.
+
+from skrub.datasets import fetch_employee_salaries
+
+dataset = fetch_employee_salaries()
+X = dataset.X
+y = dataset.y
+
+X.head(10)
+
+###############################################################################
+# Let's import the |TableVectorizer| and see what the default assignment is:
+
+from skrub import TableVectorizer
+from pprint import pprint
+
+tv = TableVectorizer()
+tv.fit(X)
+
+pprint(tv.transformers_)
+
+###############################################################################
+# Using a custom transformer for a column type
+# ............................................
+#
+# Say we wanted to use a |MinHashEncoder| instead of the default
+# |GapEncoder| for the high cardinality categorical columns.
+# It is easy to do that by using the dedicated parameter:
+
+from skrub import MinHashEncoder
+
+tv = TableVectorizer(
+    high_card_cat_transformer=MinHashEncoder(),
+)
+tv.fit(X)
+
+pprint(tv.transformers_)
+
+###############################################################################
+# If we want to modify what we classify as a high cardinality categorical
+# column, we can tweak the ``cardinality_threshold`` parameter.
+# Check out the |TableVectorizer|'s documentation for more information.
+#
+# Also have a look at the other types of columns supported by default!
+#
+# Using a custom transformer for a specific column
+# ................................................
+#
+# Say we wanted to use a |MinHashEncoder| instead of the default |GapEncoder|,
+# but only for the column ``department_name``.
+# We can apply a column-specific transformer by using the ``specific_transformers`` parameter.
+
+tv = TableVectorizer(specific_transformers=[(MinHashEncoder(), ["department_name"])])
+tv.fit(X)
+
+pprint(tv.transformers_)
+
+###############################################################################
+# Here, for simplicity, we used the unnamed 2-tuple syntax.
+#
+# You can also give a name to the assignment, as we will see in the next
+# section.
+#
+# Grid-searching with the TableVectorizer
+# ---------------------------------------
+#
+# Grid-searching the hyperparameters of the encoders contained in the
+# |TableVectorizer| is easy!
+# For that, we use the double-underscore separator (``__``), which indicates
+# a nesting layer. For example, to tune the parameter ``n_components`` of the
+# |GapEncoder| stored in the |TableVectorizer| parameter
+# ``high_card_cat_transformer``, we use the name
+# ``tablevectorizer__high_card_cat_transformer__n_components``.
+#
+# We recommend using the 3-tuple syntax for the column-specific transformers,
+# which allows us to give a name to the assignment (here ``mh_dep_name``).
+
+from sklearn.model_selection import GridSearchCV
+from sklearn.pipeline import make_pipeline
+from sklearn.ensemble import HistGradientBoostingRegressor
+from skrub import GapEncoder
+
+pipeline = make_pipeline(
+    TableVectorizer(
+        high_card_cat_transformer=GapEncoder(),
+        specific_transformers=[
+            ("mh_dep_name", MinHashEncoder(), ["department_name"]),
+        ],
+    ),
+    HistGradientBoostingRegressor(),
+)
+
+params = {
+    "tablevectorizer__high_card_cat_transformer__n_components": [10, 30, 50],
+    "tablevectorizer__mh_dep_name__n_components": [25, 50],
+}
+
+grid_search = GridSearchCV(pipeline, param_grid=params)
+
+###############################################################################
+# Conclusion
+# ----------
+#
+# In this example, we saw how to customize the |TableVectorizer| per column
+# type and per column, and how to set up a grid search over its encoders.
+#
+# If you've got any improvement ideas, please open a feature request on
+# `GitHub <https://github.com/skrub-data/skrub/issues/new?labels=enhancement&template=feature_request.yml>`_!
+#
+# We are always happy to see new suggestions from the community :)
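
Note on the parameter names used in the grid-search section: the sketch below uses only the standard library and shows how a name such as `tablevectorizer__high_card_cat_transformer__n_components` reads as a path of nesting layers. The first component, `tablevectorizer`, is the step name that `make_pipeline` derives from the lowercased class name. The helper `split_param_path` is a hypothetical name introduced for illustration and is not part of scikit-learn or skrub.

```python
def split_param_path(param_name):
    """Split a scikit-learn style parameter name on the "__" separator.

    Returns the list of nested owners (outermost first) and the name of
    the hyperparameter itself. Hypothetical helper for illustration.
    """
    *owners, leaf = param_name.split("__")
    return owners, leaf


owners, leaf = split_param_path(
    "tablevectorizer__high_card_cat_transformer__n_components"
)

# owners lists the nesting layers from the pipeline step inward;
# leaf is the hyperparameter that the grid search varies.
print(" -> ".join(owners + [leaf]))
```

Read this way, each `__` descends one level: from the pipeline to its step, from the step to the encoder it holds, and finally to the encoder's own hyperparameter.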