[WIP] `get_feature_names_out` for `sklego.preprocessing`. #544

CarloLepelaars · 2022-10-11T12:21:33Z

This PR solves issue #543 and implements `get_feature_names_out` for all relevant transformers in `sklego.preprocessing` (i.e. transformers that do not contain the `TrainOnlyTransformerMixin`).

Functionality is implemented by adding `_ClassNamePrefixFeaturesOutMixin` to the class and making sure `self._n_features_out` is defined in `.fit`. This is also generally how scikit-learn implements `get_feature_names_out` for many of its transformers (Example). Unit tests are added for all new functionality.
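To make the mechanism concrete, here is a minimal pure-Python sketch of the idea: a mixin builds output names from the lowercased class name plus an index, using the `_n_features_out` attribute set in `fit`. The names `ClassNamePrefixNamesMixin` and `PairSums` are hypothetical stand-ins, and the sketch raises a plain `ValueError` when unfitted; it is not scikit-learn's actual implementation.

```python
class ClassNamePrefixNamesMixin:
    """Hypothetical stand-in for the mixin idea; not scikit-learn's code."""

    def get_feature_names_out(self, input_features=None):
        # The mixin relies on fit() having set self._n_features_out.
        if not hasattr(self, "_n_features_out"):
            raise ValueError("Call fit before get_feature_names_out.")
        prefix = type(self).__name__.lower()
        return [f"{prefix}{i}" for i in range(self._n_features_out)]


class PairSums(ClassNamePrefixNamesMixin):
    """Toy transformer: emits the sum of each adjacent pair of columns."""

    def fit(self, X, y=None):
        n_features_in = len(X[0])
        # n input columns produce n - 1 adjacent-pair sums.
        self._n_features_out = n_features_in - 1
        return self

    def transform(self, X):
        return [[a + b for a, b in zip(row, row[1:])] for row in X]


names = PairSums().fit([[1, 2, 3], [4, 5, 6]]).get_feature_names_out()
print(names)  # ['pairsums0', 'pairsums1']
```

The transformer itself only has to report how many columns it produces; the naming logic lives entirely in the mixin.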

P.S. Don't pay attention to the commit history before October 10th. These changes have already been merged into koaning/scikit-lego/main but are still displayed here as commit history. I will try to fix this. Suggestions for removing these redundant commits from the history of the CarloLepelaars/scikit-lego fork are welcome.

- ~~Find an alternative to `_ClassNamePrefixFeaturesOutMixin` so it works with scikit-learn on Python 3.7. (Remove Python 3.7 support)~~
- Add the implementation of `get_feature_names_out` to the contributing guidelines so people implement it for each new preprocessor that is not TrainOnly.
- Remove the Python 3.7 GitHub Actions pipelines and update the optional dependencies GitHub Actions pipeline to use Python 3.8.
- Add a general unit test that checks whether `get_feature_names_out` can be called for all relevant preprocessors and `EstimatorTransformer`.


CarloLepelaars · 2022-10-11T12:51:48Z

_ClassNamePrefixFeaturesOutMixin is not available for the scikit-learn version on Python 3.7. 😭
Will find a different solution.

koaning · 2022-10-11T12:58:12Z

@MBrouns for how long do we want to keep Python 3.7 support? I think it's end of life in about 8 months, no?

MBrouns · 2022-10-11T14:11:47Z

I think I'm okay with dropping 3.7 support. EOL in 8 months is soon enough and if it makes this feature easier to implement I'd say go for it

CarloLepelaars · 2022-10-11T16:29:14Z

@koaning @MBrouns I added contributing guidelines to the readme. The idea is that contributors of a new preprocessor will make sure to add `get_feature_names_out` functionality. We might want to consider creating a CONTRIBUTING.MD file like HuggingFace transformers has.

Unfortunately it will take quite some work to get this working for Python 3.7, since in that case we don't have access to the convenient Mixins that implement `get_feature_names_out`.

If you are fine with dropping Python 3.7 support then the PR seems ready to be merged.

koaning · 2022-10-11T17:10:49Z

Feel free to remove 3.7 support. Maybe upgrade the github action that checks the optional deps.

koaning · 2022-10-11T17:11:12Z

Is there a general unit test we might write that can check every component if it's implemented?

CarloLepelaars · 2022-10-11T18:10:03Z

I don't think I have permissions to update the optional deps actions pipeline and remove the Python 3.7 pipelines. Isn't that only available for maintainers?

> Is there a general unit test we might write that can check every component if it's implemented?

Good point! Will add a general unit test that checks if get_feature_names_out can be called for all relevant preprocessors (and EstimatorTransformer).
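A rough sketch of what such a general check could look like, written in plain Python so it runs without any dependencies. Everything here is a hypothetical stand-in: `NamedPassthrough`, `TrainOnlyStandIn`, the `train_only` flag, and `missing_feature_names_out` are illustrative names, not sklego code. In the real test suite the exclusion would more likely be based on `TrainOnlyTransformerMixin`.

```python
class NamedPassthrough:
    """Stand-in for a preprocessor that implements get_feature_names_out."""

    def fit(self, X, y=None):
        self.n_features_in_ = len(X[0])
        return self

    def get_feature_names_out(self, input_features=None):
        return [f"x{i}" for i in range(self.n_features_in_)]


class TrainOnlyStandIn:
    """Stand-in for a train-only transformer, which the check skips."""

    train_only = True  # hypothetical marker used only in this sketch

    def fit(self, X, y=None):
        return self


def missing_feature_names_out(transformer_classes, X):
    """Return names of checked classes lacking a usable get_feature_names_out."""
    failures = []
    for cls in transformer_classes:
        if getattr(cls, "train_only", False):
            continue  # train-only transformers are out of scope
        fitted = cls().fit(X)
        method = getattr(fitted, "get_feature_names_out", None)
        if not callable(method) or len(method()) == 0:
            failures.append(cls.__name__)
    return failures


failures = missing_feature_names_out(
    [NamedPassthrough, TrainOnlyStandIn], [[1.0, 2.0]]
)
print(failures)  # []
```

Looping over a list of classes keeps the check general: a newly added preprocessor only needs to be appended to the list to be covered.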

koaning · 2022-10-12T05:44:23Z

You can change the files in .github in your PR right?

Calmcode has a github actions course that gives more details.

CarloLepelaars · 2022-10-12T19:08:01Z

Aha, of course, I forgot I can adjust files in .github. Bumped the minimum Python version from 3.7 to 3.8 in .github and .gitpod.yml. Also added the general unit test.

@koaning The Github Actions pipeline still shows a build (3.7.) job. I don't know why this one still shows up.

koaning · 2022-10-13T06:34:44Z

> I don't know why this one still shows up.

That's a setting that only I can change, but feel free to ignore that.

koaning · 2022-10-13T06:40:51Z

tests/test_preprocessing/test_interval_encoder.py

-            # estimator_checks.check_dtype_object,
-            # estimator_checks.check_sample_weights_pandas_series,
-            # estimator_checks.check_sample_weights_list,
-            # estimator_checks.check_sample_weights_invariance,


Please don't remove these. By leaving these comments here, we're conscious of the sklearn tests that we've manually removed.

Ok, no problem. Added them back in.

I noticed some print statements in test_patsy_transformer.py. Are they also supposed to be left in?

koaning · 2022-10-13T07:24:22Z

readme.md

+
+### Implementing a new preprocessor
+
+When creating a completely new transformer in `sklego.preprocessing`, make sure to implement `get_feature_names_out` functionality. 
+
+The preferred method is to add [_ClassNamePrefixFeaturesOutMixin](https://github.com/scikit-learn/scikit-learn/blob/626b4608d4f840af7c37bff2ccb38fcfd2ef594f/sklearn/base.py#L868) as a base class\
+and make sure `self._n_features_out` (i.e. the number of output features) is defined in the `fit` method of the new preprocessor.
+


@MBrouns I'm wondering. Would it be better to have folks import a hidden mixin, or just to have them implement get_feature_names_out themselves? Part of me is uneasy about relying on a non-public part of the sklearn api.

@CarloLepelaars also, seen this?

interesting that this is part of the private API. Maybe copy the sklearn implementation into one of our own files then and use that instead for the time being?

Fair point, the Mixin seems like the cleanest implementation, but I agree it is not a good idea to use Mixins from the non-public API. It would be interesting to ask a sklearn core maintainer about this decision. I think we have good arguments for making these kinds of Mixins part of the public API.

@MBrouns, that sounds like a valid option. The tricky thing is that `get_feature_names_out` in this Mixin also calls a hidden function in `sklearn.utils.validation`, `_generate_get_feature_names_out`, which in turn calls another hidden function, `_check_feature_names_in`. I'm afraid that copying these over to sklego as well will pollute the codebase.

@koaning oops, didn't notice the contribution documentation yet. That already looks great! Perhaps we can add a link to here in the "New Features" section of the readme?

> That already looks great! Perhaps we can add a link to here in the "New Features" section of the readme?

Sounds good.

Mhm, yeah, this is tricky territory.

If their internal class has many more links to internal objects it'll be even riskier to rely on it.

Open to ideas though, there doesn't seem to be an "obvious" solution here.

MBrouns · 2022-10-18T17:04:12Z

It seems that there is also a standard test in scikit-learn's `check_estimator` that verifies this feature_names_out behaviour: https://github.com/scikit-learn/scikit-learn/blob/36958fb24/sklearn/utils/estimator_checks.py#L3921

We should probably go over the test cases from estimator_checks to see if any others got added since we made our own version

CarloLepelaars · 2022-10-18T19:29:24Z

Nice find, @MBrouns! I adjusted the general tests to use `check_transformer_get_feature_names_out` and `check_transformer_get_feature_names_out_pandas`.

Unfortunately they both break on the first line with `TypeError: _get_tags() missing 1 required positional argument: 'self'`. Should every estimator have `_get_tags` implemented? If so, why doesn't it seem to be part of `BaseEstimator` or `TransformerMixin`?
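For reference, this wording is what Python produces when an instance method is called on the class itself rather than on an instance, which would happen if a check were handed a class where it expects an instance. Whether that was the cause here is not confirmed in this thread; the sketch below only reproduces the shape of the error with a hypothetical `TaggedEstimator` class and `read_tags` helper.

```python
class TaggedEstimator:
    """Hypothetical stand-in with an instance method named _get_tags."""

    def _get_tags(self):
        return {"requires_fit": True}


def read_tags(estimator):
    # Mirrors a check that expects an *instance* and asks it for its tags.
    return estimator._get_tags()


try:
    read_tags(TaggedEstimator)  # class passed by mistake, no instance
except TypeError as exc:
    error_message = str(exc)
    print(error_message)

tags = read_tags(TaggedEstimator())  # passing an instance works
print(tags)
```

If this is the cause, instantiating the transformer before handing it to the check would be the fix to try first.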

CarloLepelaars · 2022-11-01T19:59:04Z

@koaning As discussed we can keep this PR on hold for now and revisit later. The reason is that even when _ClassNamePrefixFeaturesOutMixin is made part of the public API, it would still restrict people to using the latest version of sklearn.

Some things that we should still consider:

- The PR for `EstimatorTransformer` (get_feature_names_out for EstimatorTransformer #539) is done, tests are passing (also on Python 3.7), and it does not use the Mixin. Shall we go ahead and merge that one?
- Do we still want to drop support for Python 3.7? I can make a separate PR to adjust the GitHub Actions pipelines, test configs, etc.
- Perhaps we can still implement `get_feature_names_out` for `sklego.preprocessing` ourselves? At least we already have the tests ready. If we implement it this way, it is likely that the tests would still pass on Python 3.7.
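As an illustration of the last option, a hand-written `get_feature_names_out` for a one-to-one transformer could look roughly like the sketch below. It depends on no mixin at all, so it is independent of the scikit-learn version. `OneToOneNames` is a hypothetical toy class, not sklego's `IdentityTransformer`, and the `x0, x1, ...` fallback names are an assumption chosen to resemble scikit-learn's default naming.

```python
class OneToOneNames:
    """Toy transformer whose output columns map one-to-one to its inputs."""

    def fit(self, X, y=None):
        self.n_features_in_ = len(X[0])
        return self

    def transform(self, X):
        return X

    def get_feature_names_out(self, input_features=None):
        n_seen = getattr(self, "n_features_in_", None)
        if input_features is not None:
            names = list(input_features)
            # When fitted, the supplied names must match the fitted width.
            if n_seen is not None and len(names) != n_seen:
                raise ValueError(f"Expected {n_seen} names, got {len(names)}.")
            return names
        if n_seen is None:
            # Neither fitted nor given names: nothing to derive names from.
            raise ValueError("Unfitted: pass input_features or call fit first.")
        return [f"x{i}" for i in range(n_seen)]


default_names = OneToOneNames().fit([[1, 2], [3, 4]]).get_feature_names_out()
given_names = OneToOneNames().get_feature_names_out(["age", "income"])
print(default_names)  # ['x0', 'x1']
print(given_names)    # ['age', 'income']
```

The trade-off is that each transformer carries its own naming logic, which is more code than the mixin approach but avoids relying on private scikit-learn helpers.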

FYI @MBrouns

CarloLepelaars and others added 30 commits September 12, 2022 13:29

Pass along arbitrary parameters to fit EstimatorTransformer

22da776

Remove *args option from EstimatorTransformer.fit()

955a0ee

Setup test for passing additional arguments in `EstimatorTransformer.…

8d93361

…fit`.

Test if EstimatorTransformer fit+transform is the same with sample_…

082ce5f

…weight=1.

EstimatorTransformer test_kwargs comments

237edb4

Use array to test passing of sample_weight in EstimatorTransformer

3eff767

Use more simple LinearRegression in test_kwargs

8ef9f70

Update tests/test_meta/test_estimatortransformer.py

ae7b061

Co-authored-by: MBrouns <[email protected]>

Update tests/test_meta/test_estimatortransformer.py

f06a7af

Co-authored-by: MBrouns <[email protected]>

Use unittest.Mock to check if fit method works with added kwargs

b0ca1a0

Merge branch 'main' into main

15e25fc

Working solution to test EstimatorTransformer.fit with added kwargs

c471311

Fix Python3.7 issue with Mock().call_args for non-keyword args.

9af3a6d

Simplify test_kwargs so passing of kwargs is tested.

50b4b06

Remove redundant whitespace at bottom of tests file

2102583

Fix Python3.7 issue for Mock().call_args

c7df2aa

Merge branch 'koaning:main' into main

aa526aa

PoC for get_feature_names_out for EstimatorTransformer

999c197

Refine get_feature_names_out for EstimatorTransformer. Tests for …

78016af

…`FeatureUnion`.

Custom check_is_fitted requirements.

ea9627a

Remove redundant imports

4325348

Remove redundant check in __sklearn_.is_fitted

7404a7f

Clean up tests for EstimatorTransformer

7978e1c

Merge branch 'main' into feature/meta-feature-names-out

b9706fd

New lines in docstrings

e8f1d19

Merge branch 'koaning:main' into main

f6cc211

get_feature_names_out+test for ColumnCapper

6bf9abb

get_feature_names_out implementations for ColumnCapper and `DictM…

e471b6c

…apper`.

ValueError check for get_feature_names_out call without input_featu…

5aa735f

…res if not fitted.

get_feature_names_out for IdentityTransformer and test simplifica…

410f98a

…tion.

CarloLepelaars added 2 commits October 11, 2022 18:23

Simplify get_feature_names_out for IntervalEncoder and contributi…

fcb3058

…on guidelines for new preprocessors

Finetune contribution guidelines for new preprocessors

54fecb2

CarloLepelaars changed the title ~~[WIP] get_feature_names_out for sklego.preprocessing.~~ get_feature_names_out for sklego.preprocessing. Oct 11, 2022

CarloLepelaars changed the title ~~get_feature_names_out for sklego.preprocessing.~~ [WIP] get_feature_names_out for sklego.preprocessing. Oct 11, 2022

CarloLepelaars added 3 commits October 12, 2022 20:40

Bump minimum Python version from 3.7 to 3.8 in Github Actions pipelines.

6c42050

Bump build to Python 3.8 in .gitpod.yml

9e52aae

General test to check if get_feature_names_out is implemented for a…

d46ae69

…ll preprocessors (except TrainOnly)

CarloLepelaars changed the title ~~[WIP] get_feature_names_out for sklego.preprocessing.~~ get_feature_names_out for sklego.preprocessing. Oct 12, 2022

koaning reviewed Oct 13, 2022

CarloLepelaars added 2 commits October 13, 2022 10:43

Put back commented checks in test_interval_encoder

96b39b3

Link to contribution docs in readme

5a6d2b8

CarloLepelaars changed the title ~~get_feature_names_out for sklego.preprocessing.~~ [WIP] get_feature_names_out for sklego.preprocessing. Oct 13, 2022

Use sklearn checks to check get_feature_names_out for `sklego.preproc…

328af9a

…essing` objects.

CarloLepelaars mentioned this pull request Nov 4, 2022

[FEATURE] get_feature_names_out for sklego.preprocessing transformers. #543

Open

13 tasks

CarloLepelaars closed this Apr 30, 2024

CarloLepelaars deleted the feature/preprocessing-feature-names-out branch April 30, 2024 10:42
