-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python-package] make a shallow copy on dataframe rename (fixes #4596) #5254
Conversation
@jmoralez I want to remind you that every PR needs to have a label in https://github.com/microsoft/LightGBM/blob/59f59c9fbe060eae6114b21d6aa5f7b0ad80b4e7/.github/release-drafter.yml, so that it can be correctly categorized in the release notes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the very thorough investigation with profiling!
I have one question...does this change mean that lightgbm
will now modify the input pandas
DataFrame in place? Like if I bring a pandas
DataFrame with integer column names to lightgbm
, and I use feature_name="auto"
, will those now be changed to strings?
If no, then I totally support this change.
If yes then I still support it (I think the memory savings is worth the potential challenges caused by silent side-effects like changing the type of column names), but think we should label this PR breaking
and maybe include a documentation change and/or warning in that situation. I think "changing your DataFrame's column names when lightgbm
previously used to leave it unmodified" is a breaking change that could affect users' code that works on that DataFrame post-training.
@@ -648,9 +648,7 @@ def test_no_copy_when_single_float_dtype_dataframe(dtype): | |||
pd = pytest.importorskip('pandas') | |||
X = np.random.rand(10, 2).astype(dtype) | |||
df = pd.DataFrame(X) | |||
# feature names are required to not make a copy (rename makes a copy) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test currently tests "when you pass a pandas
DataFrame and explicit list of feature names, a copy is not made".
This PR is proposing changing it to "when you pass a pandas
DataFrame and feature_names="auto"
, a copy is not made".
To test the broader behavior "when you pass a pandas
DataFrame, a copy is not made, regardless of how you specify feature names", would you consider instead changing the tests touched in this PR such that both cases (feature_name
and "auto"
) are tested?
No, the argument |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes look great, and thanks for the answer to my question about whether the rename will have side effects outside of LightGBM.
Looks good!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for the nice improvement and test enhancements!
This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this. |
Adds on #5225. Fixes #4596.