How to handle tests that "should work" but don't? #252

gjoseph92 · 2022-08-17T18:28:59Z

In #243 (comment), I'm adding a workload that should work—and in fact does work on a fresh cluster—but always fails in CI right now due to some combination of

Are reference cycles a performance problem? dask/distributed#4987 (comment)
Workers run twice as many root tasks as they should, causing memory pressure dask/distributed#5223

In general, how to we want to handle these sorts of tests?

1. Don't run them (skip / don't add at all)

This is simple since then we don't have to change anything right now.

If they're skipped, when we think the issue might be fixed, we could hopefully remember to go un-skip them, see if that passes, and if so, merge the un-skip.

However, that manual process (remember to un-skip this) won't scale well to many tests, is easy to forget, and also doesn't exactly confirm that the thing that you think fixed the problem actually did, since (without even more manual work) you don't have a record of the test failing before the suspected fix was merged.

2. Allow some tests to fail (xfail?), and have a way of displaying that on the dashboard

This would be a little more work, since we'd need to figure out a way to represent failure on the dashboard. But that would also tell a nice story when a lot of Xs turned into a normal line.

It's also nice to know if something unintentionally fixes the test (though this is probably pretty rare). Plus, I think it's generally good to be upfront about things that don't work.

A downside here is increased runtime (and cost) for the tests, running something all the time that we know is going to fail (and the failure mode is probably that it times out after 10min of runtime or something).

Another thing @fjetter mentioned is that we don't want to have a bunch of xfailed tests hanging around forever. If a test is going to be xfailed, it must have an issue opened for it, and the expectation that we'd actually work on that issue in the near term. Otherwise, if something doesn't work and we don't expect it will for a long time, it should probably be skipped or not added at all.

cc @hayesgb @jrbourbeau @ian-r-rose @ncclementi

ian-r-rose · 2022-08-17T20:36:22Z

I said a version of this over in #243, but re-stating here:

I very much like the idea of having failing tests also show up in whatever charts we are showing, so that we can note when they are fixed. We are already tracking whether a test succeeded or failed, so we could do something like change the shape or color of a marker for failures.

However, I'm not sure exactly how useful the specific measures would be. If we are looking at duration, then we are likely to just see the timeout time over and over until it is fixed. If we are looking at a memory metric, then there might be more signal, but I'd still be reticent to interpret specific values: if memory pressure is widely felt, we might just see peak memory close to the worker limit. If memory pressure is acutely felt on one or a few workers, we might not see much at all. In neither case would I really trust the output of a failed test.

So, I think that it is okay to skip failing tests and open corresponding issues for them, as you have done in #253

gjoseph92 · 2022-08-17T20:53:34Z

I'm not sure exactly how useful the specific measures would be

Yeah, I agree that the measures wouldn't be useful, or potentially even recorded (if a test crashes a cluster, say, we'd never know its memory use).

I was thinking that we (somehow) wouldn't display the measures in the visualization at all, or maybe that they'd have dummy/default values. Just that they'd be displayed with a different mark type and color (like ❌ ).

Mostly I just like the idea of clearly seeing on the dashboard when a test started working.

And I guess more broadly, if a test that we thought should pass fails, I might also like that to show up as an ❌ on the dashboard. Right now, I'd imagine that run is just dropped? I think the visual indicator is valuable.

ian-r-rose · 2022-08-17T20:54:56Z

And I guess more broadly, if a test that we thought should pass fails, I might also like that to show up as an x on the dashboard. Right now, I'd imagine that run is just dropped? I think the visual indicator is valuable.

Right now we still show it, including the probably-nonsense-measure, but it's been on my backlog to fix that.

fjetter · 2022-08-18T08:03:47Z

A bit more context to what Gabe already cited me on

In my experience most tests that are not even runnable will not automagically be fixed. most of them require a couple of steps that have us analyze the workload, narrow the problems down to what is actually wrong and then work on a fix. Most of our benchmarks are still that crude that we will likely need rather big efforts to fix problems.
I don't see the value in having them part of CI since that eats up cost and CI runtime, i.e. our jobs are always 5 min slower

I could see us trying to run these things less frequently, e.g. "known failure jobs" but I think it's more important to have the code available somewhere such that we can engage on the analyze+fix cycle.

gjoseph92 added the discussion Discussion about an idea or problem. Nothing actionable so far. label Aug 17, 2022

gjoseph92 mentioned this issue Aug 17, 2022

Benchmarks for some common workloads #243

Merged

gjoseph92 added the dx Developer experience label Aug 19, 2022

ncclementi mentioned this issue Aug 30, 2022

Revisit test_rechunk_out_of_memory - currently skipping #55

Open

gjoseph92 mentioned this issue Oct 15, 2022

OLS Workload #443

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to handle tests that "should work" but don't? #252

How to handle tests that "should work" but don't? #252

gjoseph92 commented Aug 17, 2022

ian-r-rose commented Aug 17, 2022

gjoseph92 commented Aug 17, 2022

ian-r-rose commented Aug 17, 2022

fjetter commented Aug 18, 2022

How to handle tests that "should work" but don't? #252

How to handle tests that "should work" but don't? #252

Comments

gjoseph92 commented Aug 17, 2022

1. Don't run them (skip / don't add at all)

2. Allow some tests to fail (xfail?), and have a way of displaying that on the dashboard

ian-r-rose commented Aug 17, 2022

gjoseph92 commented Aug 17, 2022

ian-r-rose commented Aug 17, 2022

fjetter commented Aug 18, 2022