[WIP] `get_feature_names_out` for `sklego.preprocessing`. #544
Co-authored-by: MBrouns <[email protected]>
…res if not fitted.
_ClassNamePrefixFeaturesOutMixin is not available for the
@MBrouns for how long do we want to keep Python 3.7 support? I think it's end of life in about 8 months, no?
I think I'm okay with dropping 3.7 support. EOL in 8 months is soon enough, and if it makes this feature easier to implement I'd say go for it.
…on guidelines for new preprocessors
@koaning @MBrouns I added contributing guidelines to the readme. The idea is that contributors of a new preprocessor will make sure to add
Unfortunately it will take quite some work to get this working for Python 3.7, since in that case we don't have access to the convenient Mixins that implement
If you are fine with dropping Python 3.7 support then the PR seems ready to be merged.
Feel free to remove 3.7 support. Maybe upgrade the GitHub action that checks the optional deps.
Is there a general unit test we might write that can check every component if it's implemented?
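A general check of this kind could be sketched as follows. This is only a minimal illustration: `missing_feature_names_out` and the two stand-in classes are hypothetical names, not part of scikit-lego, and a real test would iterate over the actual preprocessors in `sklego.preprocessing` (for example via `pytest.mark.parametrize`).

```python
def missing_feature_names_out(transformer_classes):
    """Return the names of classes that lack a callable get_feature_names_out.

    Hypothetical helper for illustration; it only inspects the class,
    it does not fit anything.
    """
    return [
        cls.__name__
        for cls in transformer_classes
        if not callable(getattr(cls, "get_feature_names_out", None))
    ]


# Hypothetical stand-ins for real preprocessors.
class WithNames:
    def get_feature_names_out(self, input_features=None):
        return ["withnames0"]


class WithoutNames:
    pass


print(missing_feature_names_out([WithNames, WithoutNames]))  # ['WithoutNames']
```

A test built on this idea would simply assert that the returned list is empty for every transformer that is expected to support the method.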
Don't think I have permissions to update the optional deps actions pipeline and remove the Python 3.7 pipelines. Isn't that only available for maintainers?
Good point! Will add a general unit test that checks if
You can change the files in .github in your PR, right? Calmcode has a GitHub Actions course that gives more details.
Aha of course, forgot I can adjust files in
@koaning The GitHub Actions pipeline still shows a
That's a setting that only I can change, but feel free to ignore that.
    # estimator_checks.check_dtype_object,
    # estimator_checks.check_sample_weights_pandas_series,
    # estimator_checks.check_sample_weights_list,
    # estimator_checks.check_sample_weights_invariance,
Please don't remove these. By leaving these comments here, we're conscious of the sklearn tests that we've manually removed.
Ok, no problem. Added them back in.
I noticed some print statements in `test_patsy_transformer.py`. Are they also supposed to be left in?
readme.md (Outdated)

### Implementing a new preprocessor

When creating a completely new transformer in `sklego.preprocessing`, make sure to implement `get_feature_names_out` functionality.

The preferred method is to add [_ClassNamePrefixFeaturesOutMixin](https://github.com/scikit-learn/scikit-learn/blob/626b4608d4f840af7c37bff2ccb38fcfd2ef594f/sklearn/base.py#L868) as a base class\
and make sure `self._n_features_out` (i.e. the number of output features) is defined in the `fit` method of the new preprocessor.
@MBrouns I'm wondering: would it be better to have folks import a hidden mixin, or just to have them implement `get_feature_names_out` themselves? Part of me is uneasy about relying on a non-public part of the sklearn API.
@CarloLepelaars also, seen this?
Interesting that this is part of the private API. Maybe copy the sklearn implementation into one of our own files and use that instead for the time being?
Fair point. The Mixin seems like the cleanest implementation, but I agree it is not a good idea to use Mixins from the non-public API. It would be interesting to ask a sklearn core maintainer about this decision; I think we have good arguments for making these kinds of Mixins part of the public API.
@MBrouns, that sounds like a valid option. The tricky thing is that `get_feature_names_out` in this Mixin also calls a hidden function in `sklearn.utils.validation`, `_generate_get_feature_names_out`, which calls another hidden function, `_check_feature_names_in`. I'm afraid that copying these over to sklego as well will pollute the codebase.
@koaning oops, didn't notice the contribution documentation yet. That already looks great! Perhaps we can add a link to here in the "New Features" section of the readme?
> That already looks great! Perhaps we can add a link to here in the "New Features" section of the readme?

Sounds good.
Mhm, yeah, this is tricky territory.
If their internal class has many more links to internal objects it'll be even riskier to rely on it.
Open to ideas though, there doesn't seem to be an "obvious" solution here.
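For comparison, the "implement it themselves" option raised in this thread might look like the dependency-free sketch below. `PrefixedFeatureNamesMixin` and `PairSums` are hypothetical names, not scikit-learn or scikit-lego classes. The sketch mimics the class-name-prefix naming convention but deliberately skips scikit-learn's input validation, raises a plain `RuntimeError` instead of `NotFittedError`, and returns a list rather than a NumPy object array.

```python
class PrefixedFeatureNamesMixin:
    """Hypothetical in-house mixin that avoids private scikit-learn imports."""

    def get_feature_names_out(self, input_features=None):
        # input_features is accepted for API compatibility but ignored here.
        if not hasattr(self, "_n_features_out"):
            raise RuntimeError(f"{type(self).__name__} is not fitted yet.")
        prefix = type(self).__name__.lower()
        return [f"{prefix}{i}" for i in range(self._n_features_out)]


class PairSums(PrefixedFeatureNamesMixin):
    """Toy transformer: sums each pair of adjacent columns."""

    def fit(self, X, y=None):
        # One output column per adjacent pair of input columns.
        self._n_features_out = len(X[0]) - 1
        return self

    def transform(self, X):
        return [[a + b for a, b in zip(row, row[1:])] for row in X]


names = PairSums().fit([[1, 2, 3], [4, 5, 6]]).get_feature_names_out()
print(names)  # ['pairsums0', 'pairsums1']
```

The trade-off is the one discussed above: this keeps the codebase independent of scikit-learn internals, at the cost of maintaining a small amount of duplicated logic.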
It seems that there is also a standard test in
We should probably go over the test cases from
Nice find, @MBrouns! I adjusted the general tests to use
Unfortunately they both break on the 1st line with
@koaning As discussed we can keep this PR on hold for now and revisit later. The reason is that even when
Some things that we should still consider:
FYI @MBrouns
This PR solves issue #543 and implements `get_feature_names_out` for all relevant transformers in `sklego.preprocessing` (i.e. transformers that do not contain the `TrainOnlyTransformerMixin`).

Functionality is implemented by adding the `_ClassNamePrefixFeaturesOutMixin` to the class and making sure `self._n_features_out` is defined in `.fit`. This is also generally how `scikit-learn` implements `get_feature_names_out` for many of its transformers (Example). Unit tests are added for all new functionality.

P.S. Don't pay attention to the commit history before October 10th. These changes have already been merged into `koaning/scikit-lego/main`, but are still displayed here as commit history. Will try to fix this. Suggestions to remove these redundant commits from the commit history of this `CarloLepelaars/scikit-lego/` fork are welcome.

- Find alternative solution for using `_ClassNamePrefixFeaturesOutMixin` so it works with scikit-learn on Python 3.7. (Remove Python 3.7 support)
- `get_feature_names_out` to contributing guidelines so people implement this for each new preprocessor that is not TrainOnly.
- `get_feature_names_out` can be called for all relevant preprocessors and `EstimatorTransformer`.