
PERF: Test suite #51520

Closed
jbrockmendel opened this issue Feb 21, 2023 · 7 comments
Labels
Performance (Memory or execution speed performance) · Testing (pandas testing functions or related to the test suite)

Comments

@jbrockmendel
Member

jbrockmendel commented Feb 21, 2023

The runtime of the test suite has roughly doubled in the past year. A lot of this reflects new features (pyarrow dtypes, non-nanosecond dt64/td64) or just improved test coverage, so it isn't a bad thing in itself. But it is a pain point for my workflow, and I expect it is not a coincidence that we see so many CI timeouts (xref #51409).

Some observations:

  1. Just collecting the test suite has a footprint of 1-2GB. And IIUC test collection doesn't parallelize, so doing pytest -n=many doesn't go very far unless you've got loads of memory.
  2. xfails are really expensive since pytest renders the tracebacks even when it doesn't print them.
  3. Using --durations=20000 shows many tests taking over 0.01 seconds, but when re-running subsets of those tests in isolation they are usually faster.
  4. assert_frame_equal is slow. Profiling the pyarrow test_unstack tests takes 27s, of which 6.6s is in unstack and 7.7s is in assert_frame_equal (for very small frames, AFAICT).
  5. Test call overhead is hard to measure. When doing the test_unstack profiling, the total runtime is 27s, but only 20.4s shows up in the profile output. Of that, 19.4s is inside test_unstack. So the overhead is either 5% or 18% depending on which denominator we use.
  6. Many tests have unreasonably large parametrizations. test_dt64arr_add_sub_DateOffsets doesn't need 1056 variants. (many of the worst offenders use tz_naive_fixture, so my bad!)
    • I don't see a nice way to look for tests with many parametrizations (a rough counting approach is sketched after this list).
  7. test_multiindex_get_loc is spending a lot of time in maybe_promote, which seems weird because that should be cached.

* Some reported timings are with --profile enabled, which can throw off the big-picture numbers. Take these with a grain of salt where appropriate.
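
Not from the original thread: one rough way to surface heavily parametrized tests is to parse pytest's quiet collection output and count node IDs per base test name. A minimal sketch, assuming the suite lives under pandas/tests (the helper name and default path are illustrative, not an existing pandas tool):

```python
# Hypothetical helper: count parametrized variants per test function using
# pytest's -q collection output (one node ID per line, e.g.
# "pandas/tests/foo/test_bar.py::test_baz[param]").
import subprocess
from collections import Counter

def count_parametrizations(pytest_args=("pandas/tests",)):
    # --collect-only -q prints node IDs without running any tests
    out = subprocess.run(
        ["pytest", "--collect-only", "-q", *pytest_args],
        capture_output=True, text=True, check=False,
    ).stdout

    counts = Counter()
    for line in out.splitlines():
        if "::" not in line:
            continue  # skip blank lines and the trailing collection summary
        # strip the "[...]" parametrization suffix so variants group together
        base = line.split("[", 1)[0]
        counts[base] += 1
    return counts

if __name__ == "__main__":
    for test, n in count_parametrizations().most_common(20):
        print(f"{n:6d}  {test}")
```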

cc @mroeschke @phofl

@jbrockmendel added the Bug and Needs Triage labels on Feb 21, 2023
@phofl added the Testing and Performance labels on Feb 21, 2023
@lithomas1 removed the Bug and Needs Triage labels on Feb 21, 2023
@harsimran44

Hey, I am new. Can you guide me on how to work on this, please? I would like to take it on.

@jbrockmendel
Member Author

Welcome @harsimran44! We suggest that newcomers look for issues with the "good first issue" label.

@rhshadrach
Member

Many tests have unreasonably large parametrizations

This is an issue in groupby as well, especially for some tests I've added or directed others to add. The issue is that there are so many combinations:

  • ops (e.g. sum)
  • how that op is used (.sum() vs agg, apply, or transform)
  • SeriesGroupBy vs DataFrameGroupBy (SGB vs DFGB)
  • groupby args (as_index, observed, dropna)
  • input column dtype
  • input index dtype (single vs multi)
  • whether the grouping is done on indices, columns, or not in the grouping object

A good example demonstrating this is tests/groupby/test_raises.py, which just makes sure we are raising correctly on bad dtypes. It has over 10k tests. Perhaps some of it can be whittled down, but I think the bulk of it cannot be without losing coverage.
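
For readers less familiar with how these dimensions compound: stacked @pytest.mark.parametrize decorators take the cross product of their argument lists, so each added dimension multiplies the collected test count. A toy sketch (illustrative only, not an actual pandas test):

```python
import pytest

@pytest.mark.parametrize("how", ["agg", "apply", "transform"])  # 3 variants
@pytest.mark.parametrize("op", ["sum", "mean", "min", "max"])    # 4 variants
@pytest.mark.parametrize("as_index", [True, False])              # 2 variants
@pytest.mark.parametrize("dropna", [True, False])                # 2 variants
def test_groupby_combinations(how, op, as_index, dropna):
    # 3 * 4 * 2 * 2 = 48 collected tests from one function; adding dtype and
    # index-type dimensions keeps multiplying the count.
    assert isinstance(op, str) and isinstance(how, str)
```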

@rhshadrach
Member

rhshadrach commented Mar 18, 2023

Another example is tests/groupby/test_groupby_dropna.py::test_no_sort_keep_na. This has 5832 parametrizations. The original test in #46584 only covered one sequence of values, and a pretty bad regression slipped through since the op failed if the NA value occurred in a different place. I added a parametrization that tests all combinations of NA/non-NA values of length 4 (81 combinations).

One thing I've been wondering is whether, in this case (and others like it), we should replace "all possible sequences" with a randomly chosen sequence or a small (5?) set of sequences. This would be better for performance, but could also introduce a flaky test.

With -n0, on my machine the test takes 10s.
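
To make the 81-combination figure concrete: with three candidate values per position (for example two non-NA values plus NaN, an assumption here; the actual values in test_no_sort_keep_na may differ), an exhaustive length-4 parametrization is a small itertools.product:

```python
import itertools
import numpy as np

# Assumed candidate values; the real test's values may differ.
candidates = [1.0, 2.0, np.nan]
sequences = list(itertools.product(candidates, repeat=4))
print(len(sequences))  # 3 ** 4 == 81
```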

@mroeschke
Member

I don't think having large parametrized tests is in itself a bad thing, but I suspect those tests may render other (older) tests redundant.

One thing I've been wondering is whether, in this case (and others like it), we should replace "all possible sequences" with a randomly chosen sequence or a small (5?) set of sequences.

IMO I dislike randomly generated data in tests for the flakiness you mention; it makes PR review a pain. I would rather be thorough or use hypothesis in these cases.
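
For context on the hypothesis suggestion: property-based tests let the framework generate inputs, shrink failures, and replay them deterministically, which avoids the flakiness of ad-hoc random data. A minimal sketch of the idea (the test name and property are illustrative, not an existing pandas test):

```python
import numpy as np
import pandas as pd
from hypothesis import given, strategies as st

# Generate length-4 sequences mixing NaN and finite floats instead of
# enumerating every combination by hand.
@given(st.lists(
    st.one_of(st.just(np.nan), st.floats(allow_nan=False, allow_infinity=False)),
    min_size=4, max_size=4,
))
def test_groupby_sum_handles_na_positions(values):
    ser = pd.Series(values)
    result = ser.groupby([0, 0, 1, 1]).sum()
    assert len(result) == 2  # both groups survive regardless of NA placement
```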

@jbrockmendel
Member Author

I'm of mixed mind on randomly generated test data, but trending in an "anti" direction.

For tz_(naive|aware)_fixture in particular, we could in principle use all of pytz.all_timezones (596 entries locally) and presumably multiply that by 3 to include dateutil/zoneinfo counterparts. And arbitrarily many fixed offsets. Clearly we're not going to add all of those to the fixtures. A few years back I toyed with the idea of using a randomized subset. At this point I could be convinced, but don't see enough value here to pursue it myself.

One place where I've poked at generating randomized data is generating "any" Index (possibly for use in tests.indexes.test_any_index). I think there are a lot of corner cases we likely miss in the status quo (increasing, decreasing, non-monotonic, mixed-Xs-and-Ys, values near implementation bounds, int64 values near float64 rounding bounds, infs, ...). It's been a while since I looked at this. IIRC I tried implementing something with hypothesis and found a) enumerating all the cases I wanted was tough, b) the runtime would be really rough, c) there are relatively few tests where we want genuinely any index as opposed to "any datetime index", "any datetimeIndex with a freq", ... and enumerating all of those any_foos would be annoying.
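
A hedged sketch of the randomized-subset idea mentioned above (hypothetical; this is not how the actual pandas tz fixtures work): sampling pytz.all_timezones with a fixed seed keeps collection deterministic while covering more zones than a hand-picked list. Rotating the seed deliberately, rather than re-sampling per run, would broaden coverage over time without making individual runs flaky.

```python
import random

import pytest
import pytz

# Fixed seed so the parametrization is stable across runs and machines.
_rng = random.Random(20230221)
_TZ_SUBSET = _rng.sample(pytz.all_timezones, 10)

@pytest.fixture(params=[None, *_TZ_SUBSET])
def tz_naive_fixture_subset(request):
    # Hypothetical fixture name; yields None plus a seeded sample of zone names.
    return request.param
```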

@jbrockmendel
Member Author

Closing as "no action"
