Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[python-package] make a shallow copy on dataframe rename (fixes #4596) #5254

Merged
merged 2 commits into from
Jun 5, 2022

Conversation

jmoralez
Copy link
Collaborator

Adds on #5225. Fixes #4596.

@jameslamb
Copy link
Collaborator

@jmoralez I want to remind you that every PR needs to have a label in https://github.com/microsoft/LightGBM/blob/59f59c9fbe060eae6114b21d6aa5f7b0ad80b4e7/.github/release-drafter.yml, so that it can be correctly categorized in the release notes.

Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the very thorough investigation with profiling!

I have one question...does this change mean that lightgbm will now modify the input pandas DataFrame in place? Like if I bring a pandas DataFrame with integer column names to lightgbm, and I use feature_name="auto", will those now be changed to strings?

If no, then I totally support this change.

If yes then I still support it (I think the memory savings is worth the potential challenges caused by silent side-effects like changing the type of column names), but think we should label this PR breaking and maybe include a documentation change and/or warning in that situation. I think "changing your DataFrame's column names when lightgbm previously used to leave it unmodified" is a breaking change that could affect users' code that works on that DataFrame post-training.

@@ -648,9 +648,7 @@ def test_no_copy_when_single_float_dtype_dataframe(dtype):
pd = pytest.importorskip('pandas')
X = np.random.rand(10, 2).astype(dtype)
df = pd.DataFrame(X)
# feature names are required to not make a copy (rename makes a copy)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test currently tests "when you pass a pandas DataFrame and explicit list of feature names, a copy is not made".

This PR is proposing changing it to "when you pass a pandas DataFrame and feature_names="auto", a copy is not made".

To test the broader behavior "when you pass a pandas DataFrame, a copy is not made, regardless of how you specify feature names", would you consider instead changing the tests touched in this PR such that both cases (feature_name and "auto") are tested?

@jmoralez
Copy link
Collaborator Author

jmoralez commented Jun 1, 2022

does this change mean that lightgbm will now modify the input pandas DataFrame in place?

No, the argument copy to rename is actually passed as the deep argument to DataFrame.copy, links here: #4596 (comment). So it will still make a copy but a shallow one, i.e. it won't copy the underlying arrays and the index, but things like renaming the columns will only modify the copy.

@jmoralez jmoralez changed the title [python-package] don't copy dataframe on rename (fixes #4596) [python-package] make a shallow copy on dataframe rename (fixes #4596) Jun 1, 2022
Copy link
Collaborator

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look great, and thanks for the answer to my question about whether the rename will have side effects outside of LightGBM.

Looks good!

Copy link
Collaborator

@StrikerRUS StrikerRUS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the nice improvement and test enhancements!

@github-actions
Copy link

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 19, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Avoid dataframe deep copy while creating dataset
3 participants