PERF: fix long string representation #36638

ivanovmg · 2020-09-25T15:55:36Z

closes PERF: large perf regression in DataFrame repr #36636
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry
Fix long string representation for large dataframes.
Eliminate the for loop that was filtering out the rows/columns to be displayed.
Revert to the original implementation, which concatenates the head+tail and left+right parts.
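
For readers unfamiliar with the approach, here is a minimal sketch of the head+tail / left+right idea. The helper name truncate_frame and its parameters are hypothetical and are not the actual code in DataFrameFormatter; the sketch also assumes max_rows and max_cols are even and at least 2. It only illustrates slicing with iloc and concatenating the outer parts instead of looping over rows or columns.

```python
import numpy as np
import pandas as pd


def truncate_frame(frame: pd.DataFrame, max_rows: int, max_cols: int) -> pd.DataFrame:
    """Hypothetical helper: keep only the outer rows/columns of a frame.

    Assumes max_rows and max_cols are even and >= 2.
    """
    n_rows, n_cols = frame.shape

    if n_cols > max_cols:
        col_num = max_cols // 2
        left = frame.iloc[:, :col_num]     # first col_num columns
        right = frame.iloc[:, -col_num:]   # last col_num columns
        frame = pd.concat([left, right], axis=1)

    if n_rows > max_rows:
        row_num = max_rows // 2
        head = frame.iloc[:row_num, :]     # first row_num rows
        tail = frame.iloc[-row_num:, :]    # last row_num rows
        frame = pd.concat([head, tail])

    return frame


df = pd.DataFrame(np.random.randn(1_000, 20))
small = truncate_frame(df, max_rows=10, max_cols=6)
print(small.shape)  # (10, 6)
```

The cost of the slices and concatenations depends on the number of rows/columns kept, not on the size of the original frame, which is presumably why this is cheaper than a per-row loop.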

WillAyd · 2020-09-25T15:58:45Z

Seems more reasonable - how does the performance look?

ivanovmg · 2020-09-25T16:25:38Z

Before refactoring:

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(np.random.randn(1_000_000, 10))
In [4]: %timeit repr(df)
19.8 ms ± 850 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

After refactoring:

In [4]: %timeit repr(df)
2.36 s ± 47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

After this fix:

In [4]: %timeit repr(df)
103 ms ± 707 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
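
Outside of IPython, a comparable measurement could be taken with the standard timeit module. This is only a sketch of the same benchmark; the numbers will vary by machine and pandas version, so no expected output is given.

```python
import timeit

import numpy as np
import pandas as pd

# Same frame shape as in the IPython session above.
df = pd.DataFrame(np.random.randn(1_000_000, 10))

# Run repr(df) once per measurement, seven measurements in total.
timings = timeit.repeat(lambda: repr(df), number=1, repeat=7)
print(f"best of {len(timings)}: {min(timings) * 1e3:.1f} ms")
```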

jorisvandenbossche

Thanks! Changes look good.

It would be nice to profile further (before the refactor vs. after this PR) to see where the remaining slowdown is coming from, but that doesn't necessarily need to happen here.

ivanovmg · 2020-09-25T19:28:36Z

I think I see some room for improvement. In _truncate() we copy the dataframe into tr_frame. However, when producing terminal output, this function is called a second time, so the copy is created once more.
If we move self.tr_frame = self.frame.copy() to __init__, the execution time drops to about 50 ms (half of what this PR gives right now).
However, if we use the original approach, where no copy was created at all (self.tr_frame = self.frame), it drops to 12 ms.

When refactoring, I was concerned about the statement self.tr_frame = self.frame, as the upcoming changes in tr_frame would affect the original dataframe. Therefore I decided to work with a copy. If that is not an issue, I will get rid of the copy altogether.

jorisvandenbossche · 2020-09-25T19:32:31Z

When refactoring, I was concerned about the statement self.tr_frame = self.frame, as the upcoming changes in tr_frame would affect the original dataframe. Therefore I decided to work with a copy. If that is not an issue, I will get rid of the copy altogether.

By "upcoming changes", do you mean code that is not yet in formats.py but that you want to add in future PRs? (If so, can you give some examples of what you have in mind?) Or is there already some code that mutates tr_frame?

ivanovmg · 2020-09-25T20:31:49Z

By "upcoming changes" I meant changes to tr_frame in the code inside DataFrameFormatter.
In particular, self.tr_frame = self.tr_frame.iloc[whatever] would change self.tr_frame. However if self.tr_frame and self.frame point to the same object, that would change self.frame as well, which is not what we expect when displaying the object. Or perhaps I do not fully understand slicing via iloc.

Regarding future PRs on the topic: right now I am working on restructuring the formatters, in an attempt to align them with each other. PR in progress: #36510.

jorisvandenbossche · 2020-09-25T20:41:50Z

Thanks for the explanation!

self.tr_frame = self.tr_frame.iloc[whatever] would change self.tr_frame. However if self.tr_frame and self.frame point to the same object, that would change self.frame as well

Actually, with Python assignment/reference semantics, self.tr_frame.iloc[whatever] creates a new object, and assigning it via self.tr_frame = ... makes the self.tr_frame attribute refer to this new object. It doesn't change the original object that self.tr_frame was pointing to. So the assignment only rebinds self.tr_frame to the new object, without changing any object in place.
So for this, we should normally not need to worry about mutation, and a copy should not be needed.
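
A small sketch of the rebinding behaviour described above. The Formatter class here is a hypothetical stand-in, not the real DataFrameFormatter; it only shows that assigning the result of iloc to an attribute leaves the original frame untouched. In-place writes into the sliced object would be a separate question that this sketch does not cover.

```python
import pandas as pd


class Formatter:
    """Hypothetical stand-in for a formatter that holds a frame."""

    def __init__(self, frame: pd.DataFrame) -> None:
        self.frame = frame
        self.tr_frame = frame  # same object, no copy

    def truncate(self, n: int) -> None:
        # iloc returns a new DataFrame; the assignment only rebinds
        # the tr_frame attribute and does not touch self.frame.
        self.tr_frame = self.tr_frame.iloc[:n]


df = pd.DataFrame({"a": range(10)})
fmt = Formatter(df)
assert fmt.tr_frame is fmt.frame      # both names refer to one object

fmt.truncate(3)
assert fmt.tr_frame is not fmt.frame  # tr_frame now refers to a new object
print(len(fmt.frame), len(fmt.tr_frame))  # 10 3
```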

jreback · 2020-09-25T21:41:27Z

yeah i would remove the .copy as it is not necessary (you could also add a test to assert that we don't mutate the input), but doesn't need to be in this PR
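
One possible shape for such a test, as a sketch only: the test name and frame size are assumptions, and it uses the public pandas.testing.assert_frame_equal rather than whatever helper the pandas test suite itself would use.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal


def test_repr_does_not_mutate_input():
    # Frame large enough to trigger truncation under the options below.
    df = pd.DataFrame(np.random.randn(200, 30))
    expected = df.copy()

    with pd.option_context("display.max_rows", 10, "display.max_columns", 6):
        repr(df)

    # The frame must be unchanged after building its string representation.
    assert_frame_equal(df, expected)


test_repr_does_not_mutate_input()
```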

jorisvandenbossche

@ivanovmg thanks for the quick follow-up!

ivanovmg added 3 commits September 25, 2020 21:49

PERF: remove long running loop for rows filtering

2ece866

CLN: extract variables head and tail

6ec02c5

PERF: remove potentially long running loop on cols

8375a74

ivanovmg requested a review from jorisvandenbossche September 25, 2020 15:57

WillAyd added the Performance (Memory or execution speed performance) label Sep 25, 2020

jorisvandenbossche reviewed Sep 25, 2020

jreback added this to the 1.2 milestone Sep 25, 2020

jreback added the Output-Formatting (__repr__ of pandas objects, to_string) label Sep 25, 2020

PERF: eliminate copying dataframe

14ad17c

jorisvandenbossche approved these changes Sep 26, 2020

jorisvandenbossche merged commit d04b343 into pandas-dev:master Sep 26, 2020

ivanovmg deleted the bug_36636 branch October 4, 2020 13:21

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

PERF: fix long string representation (pandas-dev#36638)

c20aa23
