Test htex_auto_scale partial scaling-in #3097
Conversation
The existing scaling-in test parsl/tests/test_scaling/test_scale_down.py only tests full scaling-in, which is implemented in a separate code path from partial scaling-in (case 4b vs. case 1a in parsl/jobs/strategy.py).
This is a good test, but I'd like to see if we can speed it up. As it stands, it takes 27s locally, which is a big proportion of time when the hundreds of other tests and configurations take ~10m as a whole.
while ready_path.read_text().count("\n") < _max_blocks:
    time.sleep(0.5)

assert len(dfk.executors['htex_local'].connected_managers()) == _max_blocks
Consider (for new code, anyway) use of try_assert:
def test_scale_out(try_assert, ...):
    ...
    try_assert(lambda: ready_path.read_text().count("\n") >= _max_blocks)
    ...
It doesn't save much SLOC-wise, but it might prevent a hung test at some point (e.g., file not getting written [for reason]) and removes the mental context of the loop.
Alternatively, I think what matters for this test is that .connected_managers() rises to _max_blocks? Perhaps just test strictly that:
def test_scale_out(try_assert, ...):
    dfk = parsl.dfk()
    htex = dfk.executors['htex_local']
    ...
    try_assert(lambda: len(htex.connected_managers()) == _max_blocks, "Verify test setup")
I switched this to try_assert.
On the immediately following connected managers assert, I added a note to future readers who are trying to understand why the assert fails. What's really needed here is to check how many blocks are successfully running at least to the registration stage, but that info isn't so readily available - running one worker per manager and one manager per block gives two different proxies for that.
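A rough sketch of the shape this ends up in, assuming the names used in the snippets above; the comment wording and the ready_path setup here are illustrative rather than quoted from the PR:

import pathlib
import parsl

_max_blocks = 4  # assumed module-level constant, as in the snippets above

def test_scale_out(try_assert, tmp_path: pathlib.Path):
    ready_path = tmp_path / "ready"  # hypothetical: each worker appends a line here on startup
    dfk = parsl.dfk()
    htex = dfk.executors['htex_local']
    ...
    # Wait, with try_assert's bounded retry, for all workers to report ready.
    try_assert(lambda: ready_path.read_text().count("\n") >= _max_blocks)

    # Note for future readers: if the next assert fails, fewer than
    # _max_blocks blocks made it as far as manager registration.
    # connected_managers() is only a proxy for that - one manager per block
    # and one worker per manager are what make these counts line up.
    assert len(htex.connected_managers()) == _max_blocks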
    timeout_ms=15000,
)
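For context, the hunk above is the tail of one of these try_assert calls; reconstructed roughly (the waited-on condition is an assumption, not read from the diff), it looks like:

try_assert(
    lambda: len(htex.connected_managers()) < _max_blocks,  # assumed condition: partial scale-in has removed some managers
    timeout_ms=15000,
)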
I've mentioned this in Slack, but I think we should do something about these timeouts. That is, clearly they're needed for the test to pass, but I'm wondering if we can engineer the test setup so that (a) we don't need to up the value to 15s and, more importantly, (b) the test spends ~no time unnecessarily waiting. That is, we're clearly waiting for (an amalgamation) of some loops internally ... can we short-circuit those loops somehow and still ensure this test is valid and useful?
To put some meat to my displeasure with this, when running this test locally, it took 27s.
(In my experience, the usual avenues for this kind of request are mocking, subclassing, and judicious test setups.)
Goes down to about 12 seconds if I set the strategy to run every 1s, 0.5s, or 0.1s by poking directly into the strategy timer.
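The direct poking described there is roughly this shape; the attribute path is my assumption about where the DFK-private JobStatusPoller keeps its timer interval, not a supported API (making this configurable properly is what the follow-up PR addresses):

import parsl

dfk = parsl.dfk()
# Assumed private attributes: shorten the live JobStatusPoller's timer so the
# scaling strategy runs every 0.1s instead of the hard-coded 5s.
dfk.job_status_poller.interval = 0.1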
(I'm not poking at the strategy code timing deeper than this: this PR is part of a sponsored project to bugfix a single bug in scaling in, not refactor it for testability)
PR #3097 introduced a more comprehensive test for the htex_auto_scale strategy, based on this test, and prior to this PR, test_scale_down only tested the 'simple' strategy parts of htex_auto_scale.
This is initially driven by a desire to run strategy polling faster in tests: there's no fundamental reason why the previous hard-coded value of 5 seconds needs to set the timescale for test execution. This was demonstrated previously in parsl/tests/test_scaling/test_scale_down_htex_auto_scale.py in PR #3097, which modified the internals of a live DFK-private JobStatusPoller. Work I've done on tests elsewhere benefits from strategy polling period reconfiguration too, so this PR makes that facility a publicly exposed feature.

This change allows the interval to be set before the job status poller starts running, which means a racy initial first 5s poll in the above-mentioned test_scale_down_htex_auto_scale.py is avoided: median runtime of that test on my laptop goes from 11s before this PR to 6s after it (dropping by exactly the 5s initial poll that is now avoided).

It's reasonable to expect some users to want this facility too: perhaps a user doesn't want to wait 5 seconds before the scaling code notices their workload, or perhaps they would rather run the strategy code much less frequently (for example, when running workloads on the scale of hours or days, to reduce e.g. debug log load).
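A usage sketch of the publicly exposed version; the parameter name strategy_period is an assumption about what this PR adds to Config, so check the Config docstring for the exact spelling:

import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

config = Config(
    executors=[HighThroughputExecutor(label='htex_local')],
    strategy='htex_auto_scale',
    strategy_period=0.5,  # assumed name: poll the scaling strategy every 0.5s instead of the 5s default
)

parsl.load(config)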