Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: fix long string representation #36638

Merged
merged 4 commits into from
Sep 26, 2020

Conversation

ivanovmg
Copy link
Member

@ivanovmg ivanovmg commented Sep 25, 2020

  • closes PERF: large perf regression in DataFrame repr #36636

  • tests added / passed

  • passes black pandas

  • passes git diff upstream/master -u -- "*.py" | flake8 --diff

  • whatsnew entry

  • Fix long string representation for large dataframes.

  • Eliminate for loop, which was filtering out the proper rows/columns to be displayed.

  • Revert to the original implementation with concat-ing head+tail and left+right parts.

@WillAyd
Copy link
Member

WillAyd commented Sep 25, 2020

Seems more reasonable - how does the performance look?

@WillAyd WillAyd added the Performance Memory or execution speed performance label Sep 25, 2020
@ivanovmg
Copy link
Member Author

Before refactoring:

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(np.random.randn(1_000_000, 10))
In [4]: %timeit repr(df)
19.8 ms ± 850 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

After refactoring:

In [4]: %timeit repr(df)
2.36 s ± 47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After this fix:

In [4]: %timeit repr(df)
103 ms ± 707 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! Changes look good.

Would be nice to maybe further profile before refactor / after this PR to see where the additional slowdown is coming from (but doesn't necessarily need to be here)

@ivanovmg
Copy link
Member Author

I guess, I see some area for improvement. In _truncate() we create a copy of dataframe to tr_frame. However, when making terminal output, this function is called once again, so the copy is created one more time.
If we move self.tr_frame = self.frame.copy() to init, then execution time will become close to 50 ms (twice as small as what this PR suggests right now).
However, if we use the original approach, when there was no copy created whatsoever (self.tr_frame = self.frame), then the same becomes 12 ms.

When refactoring I was concerned with the statement self.tr_frame = self.frame as the upcoming changes in tr_frame would have effect on the original dataframe. Therefore I decided to deal with the copy. If it is not an issue, then I will get rid of copy at all.

@jorisvandenbossche
Copy link
Member

When refactoring I was concerned with the statement self.tr_frame = self.frame as the upcoming changes in tr_frame would have effect on the original dataframe. Therefore I decided to deal with the copy. If it is not an issue, then I will get rid of copy at all.

With "upcoming changes", you mean code that is now not yet in formats.py, but you are wanting to do in future PRs? (and if so, can you give some examples of what you have in mind?) Or is there now already some code that mutates tr_frame?

@ivanovmg
Copy link
Member Author

By the "upcoming changes" I meant changes in tr_frame in the code, inside DataFrameFormatter.
Particularly, self.tr_frame = self.tr_frame.iloc[whatever] would change self.tr_frame. However if self.tr_frame and self.frame point to the same object, that would change self.frame as well, which is not what we expect when displaying the object. Or probably I do not fully understand the slicing via iloc.

Regarding the future RPs on the topic. Right now I am working on restructuring formatters, in an attempt to have them more aligned with each other. PR in progress #36510.

@jorisvandenbossche
Copy link
Member

Thanks for the explanation!

self.tr_frame = self.tr_frame.iloc[whatever] would change self.tr_frame. However if self.tr_frame and self.frame point to the same object, that would change self.frame as well

Actually, with python assignment/reference semantics, the self.tr_frame.iloc[whatever] creates a new object and then assigning it to self.tr_frame = .. makes that the self.tr_frame attribute now refers to this new object, but it doesn't change the original object that self.tr_frame was pointing to. So basically the assignment only lets the self.tr_frame variable point to the new object without changing any object inplace.
So for this, we should normally not need to worry about mutation, and a copy should not be needed.

@jreback
Copy link
Contributor

jreback commented Sep 25, 2020

yeah i would remove the .copy as it is not necessary (you could also add a test to assert that we don't mutate the inpute), but doesn't need to be in this PR

@jreback jreback added this to the 1.2 milestone Sep 25, 2020
@jreback jreback added the Output-Formatting __repr__ of pandas objects, to_string label Sep 25, 2020
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivanovmg thanks for the quick follow-up!

@jorisvandenbossche jorisvandenbossche merged commit d04b343 into pandas-dev:master Sep 26, 2020
@ivanovmg ivanovmg deleted the bug_36636 branch October 4, 2020 13:21
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Output-Formatting __repr__ of pandas objects, to_string Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PERF: large perf regression in DataFrame repr
4 participants