Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] get_feature_names_out for sklego.preprocessing. #544

Closed
wants to merge 47 commits into from
Closed

[WIP] get_feature_names_out for sklego.preprocessing. #544

wants to merge 47 commits into from

Conversation

CarloLepelaars
Copy link
Contributor

@CarloLepelaars CarloLepelaars commented Oct 11, 2022

This PR solves issue #543 and implements get_feature_names_out for all relevant transformers in sklego.preprocessing (i.e. transformers that do not contain the TrainOnlyTransformerMixin).

Functionality is implemented through adding the _ClassNamePrefixFeaturesOutMixin to the class and making sure self._n_features_out is defined in .fit. This is also generally how scikit-learn implements get_feature_names_out for many of its transformers (Example). Unit tests are added for all new functionality.

P.S. Don't pay attention to the commit history before October 10th. These changes have already been merged into koaning/scikit-lego/main, but is still displayed here as commit history. Will try to fix this. Suggestions to remove these redundant commits from the commit history of this CarloLepelaars/scikit-lego/ fork are welcome.

  • Find alternative solution for using _ClassNamePrefixFeaturesOutMixin so it works with scikit-learn on Python 3.7. (Remove Python 3.7. support)
  • Add implementation of get_feature_names_out to contributing guidelines so people implement this for each new preprocessor that is not TrainOnly.
  • Remove Python 3.7. GitHub Actions pipelines and update Optional dependencies GitHub Actions pipeline to use Python 3.8.
  • Add general unit test that checks if get_feature_names_out can be called for all relevant preprocessors and EstimatorTransformer.

CarloLepelaars and others added 30 commits September 12, 2022 13:29
@CarloLepelaars
Copy link
Contributor Author

_ClassNamePrefixFeaturesOutMixin is not available for the scikit-learn version on Python 3.7. 😭
Will find a different solution.

@koaning
Copy link
Owner

koaning commented Oct 11, 2022

@MBrouns for how long do we want to keep Python3.7 support? I think it's end of life in about 8 months, no?

@MBrouns
Copy link
Collaborator

MBrouns commented Oct 11, 2022

I think I'm okay with dropping 3.7 support. EOL in 8 months is soon enough and if it makes this feature easier to implement I'd say go for it

@CarloLepelaars CarloLepelaars changed the title [WIP] get_feature_names_out for sklego.preprocessing. get_feature_names_out for sklego.preprocessing. Oct 11, 2022
@CarloLepelaars
Copy link
Contributor Author

CarloLepelaars commented Oct 11, 2022

@koaning @MBrouns I added contributing guidelines to the readme. Idea is that contributors of a new preprocessor will make sure to add get_feature_names_out functionality. We might want to consider creating a CONTRIBUTING.MD file like HuggingFace transformer has.

Unfortunately it will take quite some work to get this working for Python 3.7. since in that case we don't have access to the convenient Mixins that implement get_feature_names_out.

If you are fine with dropping Python 3.7. support then the PR seems ready to be merged.

@koaning
Copy link
Owner

koaning commented Oct 11, 2022

Feel free to remove 3.7 support. Maybe upgrade the github action that checks the optional deps.

@koaning
Copy link
Owner

koaning commented Oct 11, 2022

Is there a general unit test we might write that can check every component if it's implemented?

@CarloLepelaars CarloLepelaars changed the title get_feature_names_out for sklego.preprocessing. [WIP] get_feature_names_out for sklego.preprocessing. Oct 11, 2022
@CarloLepelaars
Copy link
Contributor Author

Don't think I have permissions to update the optional deps actions pipeline and remove Python 3.7. pipelines. Isn't that only available for maintainers?

Is there a general unit test we might write that can check every component if it's implemented?

Good point! Will add a general unit test that checks if get_feature_names_out can be called for all relevant preprocessors (and EstimatorTransformer).

@koaning
Copy link
Owner

koaning commented Oct 12, 2022

You can change the files in .github in your PR right?

Calmcode has a github actions course that gives more details.

@CarloLepelaars
Copy link
Contributor Author

CarloLepelaars commented Oct 12, 2022

Aha of course, forgot I can adjust files in .github. Bumped minimum Python versions from 3.7. to 3.8 in .github and .gitpod.yml. Also added the general unit test.

@koaning The Github Actions pipeline still shows a build (3.7.) job. I don't know why this one still shows up.

@CarloLepelaars CarloLepelaars changed the title [WIP] get_feature_names_out for sklego.preprocessing. get_feature_names_out for sklego.preprocessing. Oct 12, 2022
@koaning
Copy link
Owner

koaning commented Oct 13, 2022

I don't know why this one still shows up.

That's a setting that only I can change, but feel free to ignore that.

Comment on lines 18 to 21
# estimator_checks.check_dtype_object,
# estimator_checks.check_sample_weights_pandas_series,
# estimator_checks.check_sample_weights_list,
# estimator_checks.check_sample_weights_invariance,
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't remove these. By leaving these comments here, we're conscious of the sklearn tests that we've manually removed.

Copy link
Contributor Author

@CarloLepelaars CarloLepelaars Oct 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, no problem. Added them back in.

I noticed some print statements in test_patsy_transformer.py. Are they also supposed to be left in?

readme.md Outdated
Comment on lines 151 to 158

### Implementing a new preprocessor

When creating a completely new transformer in `sklego.preprocessing`, make sure to implement `get_feature_names_out` functionality.

The preferred method is to add [_ClassNamePrefixFeaturesOutMixin](https://github.com/scikit-learn/scikit-learn/blob/626b4608d4f840af7c37bff2ccb38fcfd2ef594f/sklearn/base.py#L868) as a base class\
and make sure `self._n_features_out` (i.e. the number of output features) is defined in the `fit` method of the new preprocessor.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MBrouns I'm wondering. Would it be better to have folks import a hidden mixin, or just to have them implement get_feature_names_out themselves? Part of me is uneasy about relying on a non-public part of the sklearn api.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CarloLepelaars also, seen this?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting that this is part of the private API. Maybe copy the sklearn implementation into one of our own files then and use that instead for the time being?

Copy link
Contributor Author

@CarloLepelaars CarloLepelaars Oct 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point, the Mixin seems like the most clean implementation, but agree it is not a good idea to use Mixins from the non-public API. Would be interesting to ask a sklearn core maintainer about this decision. I think we have good arguments to make these kinds of Mixins part of the public API.

@MBrouns, that sounds like a valid option. The tricky thing is that get_feature_names_out in this Mixin also calls a hidden function in sklearn.utils.validation _generate_get_feature_names_out, which calls another hidden function _check_feature_names_in. I'm afraid also copying these over to sklego will pollute the codebase.

@koaning oops, didn't notice the contribution documentation yet. That already looks great! Perhaps we can add a link to here in the "New Features" section of the readme?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That already looks great! Perhaps we can add a link to here in the "New Features" section of the readme?

Sounds good.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mhm, yeah, this is tricky territory.

If their internal class has many more links to internal objects it'll be even riskier to rely on it.

Open to ideas though, there doesn't seem to be an "obvious" solution here.

@CarloLepelaars CarloLepelaars changed the title get_feature_names_out for sklego.preprocessing. [WIP] get_feature_names_out for sklego.preprocessing. Oct 13, 2022
@MBrouns
Copy link
Collaborator

MBrouns commented Oct 18, 2022

It seems that there is also a standard test in scikit-learns check_estimator that verifies this feature_names_out behaviour: https://github.com/scikit-learn/scikit-learn/blob/36958fb24/sklearn/utils/estimator_checks.py#L3921

We should probably go over the test cases from estimator_checks to see if any others got added since we made our own version

@CarloLepelaars
Copy link
Contributor Author

CarloLepelaars commented Oct 18, 2022

Nice find, @MBrouns! I adjusted the general tests to use check_transformer_get_feature_names_out and check_transformer_get_feature_names_out_pandas.

Unfortunately they both break on the 1st line with TypeError: _get_tags() missing 1 required positional argument: 'self'. Should every estimator have _get_tags implemented? If so, why doesn't it seem to be part of BaseEstimator or TransformerMixin?

@CarloLepelaars
Copy link
Contributor Author

@koaning As discussed we can keep this PR on hold for now and revisit later. The reason is that even when _ClassNamePrefixFeaturesOutMixin is made part of the public API, it would still restrict people to using the latest version of sklearn.

Some things that we should still consider:

  1. The PR for EstimatorTransformer get_feature_names_out for EstimatorTransformer #539 is done, tests are passing (also for Python 3.7) and it does not use the Mixin. Shall we go ahead and merge that one?
  2. Do we still want to drop support for Python 3.7.? I can make a separate PR to adjust the Github Actions pipelines, test configs, etc.
  3. Perhaps we can still implement get_feature_names_out for sklego.preprocessing ourselves? At least we already have the tests ready. If we implement it this way it is likely that the tests would still pass for Python 3.7.

FYI @MBrouns

@CarloLepelaars CarloLepelaars deleted the feature/preprocessing-feature-names-out branch April 30, 2024 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants