Add built-in column-specific transformers to TableVectorizer #583

Merged
41 commits
0a6c438
[WIP] Add `column_specific_transformers` parameter
LilianBoulard Jun 9, 2023
85c5182
Add changelog entry
LilianBoulard Jun 9, 2023
8890279
Merge with main
LilianBoulard Jun 30, 2023
0b30144
Add specific column assignment feature
LilianBoulard Jun 30, 2023
03c6124
Add test for the functionality
LilianBoulard Jun 30, 2023
96959e9
Fix bug in implementation
LilianBoulard Jun 30, 2023
1036d46
Improve test
LilianBoulard Jun 30, 2023
6e4ef5e
Add `test_deterministic`
LilianBoulard Jun 30, 2023
458a1d7
Fix docstring
LilianBoulard Jun 30, 2023
c855ef6
Fix function call
LilianBoulard Jun 30, 2023
e836e56
Better comments
LilianBoulard Jun 30, 2023
fbdb6ff
Rename `column_specific_transformers` to `specific_transformers`
LilianBoulard Jul 19, 2023
ba69810
Better edge cases handling
LilianBoulard Jul 19, 2023
efe45f6
Merge with main
LilianBoulard Jul 19, 2023
d50d3fe
Remove unused import
LilianBoulard Jul 19, 2023
9c8a8d6
Fix test
LilianBoulard Jul 21, 2023
89612ac
Fix omitted renaming
LilianBoulard Jul 21, 2023
2d02c97
Add unexpected specific transformers test
LilianBoulard Jul 21, 2023
6424c20
Update CHANGES.rst
LilianBoulard Jul 24, 2023
31b4b75
Simplify test
LilianBoulard Jul 24, 2023
e05372a
Merge remote-tracking branch 'fork/add_column_specific_transformers_t…
LilianBoulard Jul 24, 2023
4f1c1b6
Fix test
LilianBoulard Jul 24, 2023
e4073ba
Merge branch 'main' of https://github.com/skrub-data/skrub into add_c…
LilianBoulard Jul 24, 2023
6401979
Remove explicit RST for xref
LilianBoulard Jul 24, 2023
d24d2f1
Reword error message
LilianBoulard Jul 24, 2023
ec2c9c2
Typo
LilianBoulard Jul 24, 2023
170049e
Reword error
LilianBoulard Jul 24, 2023
35b059a
Rename internals accordingly
LilianBoulard Jul 28, 2023
26f9945
Add grid-search example
LilianBoulard Jul 28, 2023
3e1e920
Merge remote-tracking branch 'fork/add_column_specific_transformers_t…
LilianBoulard Jul 28, 2023
d1fcf7e
Fix the damn names
LilianBoulard Jul 28, 2023
2509699
Apply suggestions from code review
LilianBoulard Jul 31, 2023
bf682ab
Merge branch 'add_column_specific_transformers_tv' of https://github.…
LilianBoulard Jul 31, 2023
e21e97e
Better duplicate check
LilianBoulard Jul 31, 2023
25a0bab
Fix deterministic test as advised
LilianBoulard Jul 31, 2023
ede761c
Move example to temp dir for future improvement
LilianBoulard Aug 18, 2023
586ad0b
Merge branch 'main' of https://github.com/skrub-data/skrub into add_c…
LilianBoulard Aug 18, 2023
c678a4c
Apply suggestions from code review
LilianBoulard Aug 25, 2023
140773e
Merge branch 'add_column_specific_transformers_tv' of https://github.…
LilianBoulard Aug 25, 2023
8188702
Merge branch 'main' of https://github.com/skrub-data/skrub into add_c…
LilianBoulard Aug 25, 2023
4347c09
Update skrub/tests/test_table_vectorizer.py
jovan-stojanovic Aug 31, 2023
6 changes: 5 additions & 1 deletion CHANGES.rst
@@ -53,7 +53,9 @@ Major changes
- scikit-learn >= 1.2.1
- pandas >= 1.5.3 :pr:`613` by :user:`Lilian Boulard <LilianBoulard>`

* Removed `requests` from the requirements. :pr:`613` by :user:`Lilian Boulard <LilianBoulard>`
* You can now pass column-specific transformers to :class:`TableVectorizer`
using the `specific_transformers` argument.
:pr:`583` by :user:`Lilian Boulard <LilianBoulard>`.

* Do not support 1-D array (and pandas Series) in :class:`TableVectorizer`. Pass a
2-D array (or a pandas DataFrame) with a single column instead. This change is for
@@ -95,6 +97,8 @@ Minor changes
* Add `get_feature_names_out` method to :class:`MinHashEncoder`.
:pr:`616` by :user:`Leo Grinsztajn <LeoGrin>`

* Removed `requests` from the requirements. :pr:`613` by :user:`Lilian Boulard <LilianBoulard>`

* :class:`TableVectorizer` now handles mixed types columns without failing
by converting them to string before type inference.
:pr:`623` by :user:`Leo Grinsztajn <LeoGrin>`
142 changes: 142 additions & 0 deletions examples/FIXME/07_grid_searching_with_the_tablevectorizer.py
@@ -0,0 +1,142 @@
"""
.. _example_grid_search_with_the_tablevectorizer:

=================================================
Performing a grid-search with the TableVectorizer
=================================================

In this example, we will see how to customize the |TableVectorizer|,
and see how we can perform a grid-search with it.


.. |TableVectorizer| replace:: :class:`~skrub.TableVectorizer`

.. |OneHotEncoder| replace:: :class:`~sklearn.preprocessing.OneHotEncoder`

.. |GapEncoder| replace:: :class:`~skrub.GapEncoder`

.. |MinHashEncoder| replace:: :class:`~skrub.MinHashEncoder`

"""

###############################################################################
# Customizing the TableVectorizer
# -------------------------------
#
# In this section, we will see two cases where we might want to customize the
# |TableVectorizer|: when we want to use a custom transformer for a column type
# and when we want to use a custom transformer for a specific column.
#
# The data
# ........
#
# Throughout this example, we will use the employee salaries dataset.

from skrub.datasets import fetch_employee_salaries

dataset = fetch_employee_salaries()
X = dataset.X
y = dataset.y

X.head(10)

###############################################################################
# Let's import the |TableVectorizer| and see what the default assignment is:

from skrub import TableVectorizer
from pprint import pprint

tv = TableVectorizer()
tv.fit(X)

pprint(tv.transformers_)

###############################################################################
# Using a custom Transformer for a column type
# ............................................
#
# Say we wanted to use a |MinHashEncoder| instead of the default
# |GapEncoder| for the high cardinality categorical columns.
# It is easy to do that by using the dedicated parameter:

from skrub import MinHashEncoder

tv = TableVectorizer(
    high_card_cat_transformer=MinHashEncoder(),
)
tv.fit(X)

pprint(tv.transformers_)

###############################################################################
# If we want to modify what we classify as a high cardinality categorical
# column, we can tweak the ``cardinality_threshold`` parameter.
# Check out the |TableVectorizer|'s doc for more information.
#
# Also have a look at the other types of columns supported by default!
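#
# To make the role of ``cardinality_threshold`` concrete, here is a rough
# pure-Python sketch of the idea. It is not skrub's implementation: the helper
# name, the toy data, the default of 40 and the "strictly below the threshold
# means low cardinality" rule are assumptions made for illustration.

```python
def split_by_cardinality(columns, cardinality_threshold=40):
    """Return (low_card, high_card) lists of column names.

    ``columns`` maps a column name to the list of values it holds.
    Hypothetical helper: columns with fewer distinct values than the
    threshold are treated as low cardinality, the others as high.
    """
    low_card, high_card = [], []
    for name, values in columns.items():
        n_unique = len(set(values))
        if n_unique < cardinality_threshold:
            low_card.append(name)
        else:
            high_card.append(name)
    return low_card, high_card


# Toy data, made up for this sketch.
toy_columns = {
    "gender": ["F", "M", "F", "M", "F", "M"],
    "employee_id": ["a1", "b2", "c3", "d4", "e5", "f6"],
}

low, high = split_by_cardinality(toy_columns, cardinality_threshold=3)
print("low cardinality:", low)
print("high cardinality:", high)
```

###############################################################################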
#
# Using a custom Transformer for a specific column
# ................................................
#
# Say we wanted to use a |MinHashEncoder| instead of the default |GapEncoder|,
# but only for the column ``department_name``.
# We can apply a column-specific transformer by using the ``specific_transformers`` parameter.

tv = TableVectorizer(specific_transformers=[(MinHashEncoder(), ["department_name"])])
tv.fit(X)

pprint(tv.transformers_)

###############################################################################
# Here, for simplicity, we used the unnamed 2-tuple syntax.
#
# You can also give a name to the assignment, as we will see in the next
# section.
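#
# The two accepted shapes can be pictured as follows. This is a minimal
# sketch, not skrub's code: ``normalize_specific_transformers``, the
# ``specific_<index>`` naming scheme and the ``division`` column are
# hypothetical, and plain strings stand in for transformer instances.
# The duplicate-column check is likewise an assumption about reasonable
# behaviour rather than a description of the library.

```python
def normalize_specific_transformers(specific_transformers):
    """Turn 2-tuples and 3-tuples into (name, transformer, columns)."""
    normalized = []
    seen_columns = set()
    for index, entry in enumerate(specific_transformers):
        if len(entry) == 2:
            transformer, columns = entry
            # Placeholder name, invented for this sketch.
            name = f"specific_{index}"
        elif len(entry) == 3:
            name, transformer, columns = entry
        else:
            raise ValueError(
                f"Expected a 2-tuple or a 3-tuple, got {len(entry)} items."
            )
        # A column should be handled by a single specific transformer.
        duplicates = seen_columns.intersection(columns)
        if duplicates:
            raise ValueError(f"Columns assigned twice: {sorted(duplicates)}")
        seen_columns.update(columns)
        normalized.append((name, transformer, list(columns)))
    return normalized


entries = [
    ("minhash", ["department_name"]),       # unnamed 2-tuple
    ("mh_div", "minhash", ["division"]),    # named 3-tuple
]
normalized_entries = normalize_specific_transformers(entries)
print(normalized_entries)
```

###############################################################################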
#
# Grid-searching with the TableVectorizer
# ---------------------------------------
#
# Grid-searching the hyperparameters of the encoders contained in the
# |TableVectorizer| is easy!
# For that, we use the double-underscore (``__``) separator, where each
# ``__`` indicates one nesting layer.
# That means that for tuning the parameter ``n_components`` of the
# |GapEncoder| saved in the |TableVectorizer| attribute
# ``high_card_cat_transformer``, we use the syntax
# ``tablevectorizer__high_card_cat_transformer__n_components``.
#
# We recommend using the 3-tuple syntax for the column-specific transformers,
# which allows us to give a name to the assignment (here ``mh_dep_name``).
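#
# As a rough illustration of how such a nested name is read, the sketch below
# walks a parameter path one ``__`` at a time. Nested dictionaries stand in
# for the pipeline and its estimators, and ``set_nested_param`` is a
# hypothetical helper, not a scikit-learn or skrub function.

```python
def set_nested_param(params, param_name, value):
    """Set ``value`` at the location described by a ``__``-separated name."""
    *parents, leaf = param_name.split("__")
    target = params
    for key in parents:
        # Each segment before the last one goes down one nesting layer.
        target = target[key]
    target[leaf] = value


# Stand-in for a pipeline holding a TableVectorizer holding an encoder.
pipeline_params = {
    "tablevectorizer": {
        "high_card_cat_transformer": {"n_components": 10},
    },
}

name = "tablevectorizer__high_card_cat_transformer__n_components"
print(name.split("__"))

set_nested_param(pipeline_params, name, 30)
print(pipeline_params)
```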

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import HistGradientBoostingRegressor
from skrub import GapEncoder

# The target (annual salary) is continuous, so we use a regressor.
pipeline = make_pipeline(
    TableVectorizer(
        high_card_cat_transformer=GapEncoder(),
        specific_transformers=[
            ("mh_dep_name", MinHashEncoder(), ["department_name"]),
        ],
    ),
    HistGradientBoostingRegressor(),
)

params = {
    "tablevectorizer__high_card_cat_transformer__n_components": [10, 30, 50],
    "tablevectorizer__mh_dep_name__n_components": [25, 50],
}

grid_search = GridSearchCV(pipeline, param_grid=params)
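
###############################################################################
# The example stops once ``grid_search`` is built; calling
# ``grid_search.fit(X, y)`` would run the search. To get a feel for its cost
# beforehand, the sketch below enumerates the candidate combinations of the
# grid with the standard library only. ``expand_grid`` is a hypothetical
# helper, the ordering of candidates may differ from scikit-learn's, and the
# 5 folds assume ``GridSearchCV`` is left at its default ``cv``.

```python
from itertools import product

param_grid = {
    "tablevectorizer__high_card_cat_transformer__n_components": [10, 30, 50],
    "tablevectorizer__mh_dep_name__n_components": [25, 50],
}


def expand_grid(grid):
    """List every combination of the values given for each parameter."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


candidates = expand_grid(param_grid)
n_folds = 5  # assumption: default cross-validation of GridSearchCV
n_fits = len(candidates) * n_folds

print(f"{len(candidates)} candidates, {n_fits} fits (excluding the final refit)")
for candidate in candidates:
    print(candidate)
```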

###############################################################################
# Conclusion
# ----------
#
# In this example, we saw how to customize the |TableVectorizer| with
# type-specific and column-specific transformers, and how to set up a
# grid-search over their hyperparameters.
#
# If you've got any improvement ideas, please open a feature request on
# `GitHub <https://github.com/skrub-data/skrub/issues/new?labels=enhancement&template=feature_request.yml>`_!
#
# We are always happy to see new suggestions from the community :)