Test htex_auto_scale partial scaling-in #3097
Conversation
The existing scaling-in test parsl/tests/test_scaling/test_scale_down.py only tests full scaling-in, which is implemented in a separate code path from partial scaling-in (case 4b vs. case 1a in parsl/jobs/strategy.py).
This is a good test, but I'd like to see if we can speed it up. As it stands, it takes 27s locally, which is a big proportion of time when the hundreds of other tests and configurations take ~10m as a whole.
while ready_path.read_text().count("\n") < _max_blocks:
    time.sleep(0.5)

assert len(dfk.executors['htex_local'].connected_managers()) == _max_blocks
Consider (for new code, anyway) use of try_assert:
def test_scale_out(try_assert, ...):
    ...
    try_assert(lambda: ready_path.read_text().count("\n") >= _max_blocks)
    ...
It doesn't save much SLOC-wise, but it might prevent a hung test at some point (e.g., file not getting written [for reason]) and removes the mental context of the loop.
Alternatively, I think what matters for this test is that .connected_managers() rises to _max_blocks? Perhaps just test strictly that:
def test_scale_out(try_assert, ...):
    dfk = parsl.dfk()
    htex = dfk.executors['htex_local']
    ...
    try_assert(lambda: len(htex.connected_managers()) == _max_blocks, "Verify test setup")
I switched this to try_assert.
On the immediately following connected managers assert, I added a note to future readers who are trying to understand why the assert fails. What's really needed here is to check how many blocks are successfully running at least to the registration stage, but that info isn't so readily available - running one worker per manager and one manager per block gives two different proxies for that.
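A rough sketch of the shape this ends up in, assuming the names used in the snippets above; the comment wording and the ready_path setup here are illustrative rather than quoted from the PR:

import pathlib
import parsl

_max_blocks = 4  # assumed module-level constant, as in the snippets above

def test_scale_out(try_assert, tmp_path: pathlib.Path):
    ready_path = tmp_path / "ready"  # hypothetical: each worker appends a line here on startup
    dfk = parsl.dfk()
    htex = dfk.executors['htex_local']
    ...
    # Wait, with try_assert's bounded retry, for all workers to report ready.
    try_assert(lambda: ready_path.read_text().count("\n") >= _max_blocks)

    # Note for future readers: if the next assert fails, fewer than
    # _max_blocks blocks made it as far as manager registration.
    # connected_managers() is only a proxy for that - one manager per block
    # and one worker per manager are what make these counts line up.
    assert len(htex.connected_managers()) == _max_blocks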
    timeout_ms=15000,
)
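For context, the hunk above is the tail of one of these try_assert calls; reconstructed roughly (the waited-on condition is an assumption, not read from the diff), it looks like:

try_assert(
    lambda: len(htex.connected_managers()) < _max_blocks,  # assumed condition: partial scale-in has removed some managers
    timeout_ms=15000,
)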
I've mentioned this in Slack, but I think we should do something about these timeouts. That is, clearly they're needed for the test to pass, but I'm wondering if we can engineer the test setup so that (a) we don't need to up the value to 15s and, more importantly, (b) the test spends ~no time unnecessarily waiting. That is, we're clearly waiting for (an amalgamation) of some loops internally ... can we short-circuit those loops somehow and still ensure this test is valid and useful?
To put some meat to my displeasure with this, when running this test locally, it took 27s.
(In my experience, the usual avenues for this kind of request are mocking, subclassing, and judicious test setups.)
Goes down to about 12 seconds if I set the strategy to run every 1s, 0.5s, or 0.1s by poking directly into the strategy timer.
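The direct poking described there is roughly this shape; the attribute path is my assumption about where the DFK-private JobStatusPoller keeps its timer interval, not a supported API (making this configurable properly is what the follow-up PR addresses):

import parsl

dfk = parsl.dfk()
# Assumed private attributes: shorten the live JobStatusPoller's timer so the
# scaling strategy runs every 0.1s instead of the hard-coded 5s.
dfk.job_status_poller.interval = 0.1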
(I'm not poking at the strategy code timing deeper than this: this PR is part of a sponsored project to bugfix a single bug in scaling in, not refactor it for testability)
PR #3097 introduced a more comprehensive test for the htex_auto_scale strategy, based on this test, and prior to this PR, test_scale_down only tested the 'simple' strategy parts of htex_auto_scale.
This is initially driven by a desire to run strategy polling faster in tests: there's no fundamental reason why the previous hard-coded value of 5 seconds needs to set the timescale for test execution. This was demonstrated previously in parsl/tests/test_scaling/test_scale_down_htex_auto_scale.py in PR #3097, which modified the internals of a live DFK-private JobStatusPoller. Work I've done on tests elsewhere benefits from strategy polling period reconfiguration too, so this PR makes that facility a publicly exposed feature.

This change allows the interval to be set before the job status poller starts running, which means a racy initial first 5s poll in the above-mentioned test_scale_down_htex_auto_scale.py is avoided: median runtime of that test on my laptop goes from 11s before this PR to 6s after it (dropping by exactly the 5s initial poll that is now avoided).

It's reasonable to expect some users to want this facility too: perhaps a user doesn't want to wait 5 seconds before the scaling code notices their workload, or perhaps they would rather run the strategy code much less frequently (for example, when running workloads on the scale of hours or days, to reduce e.g. debug log load).
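A usage sketch of the publicly exposed version; the parameter name strategy_period is an assumption about what this PR adds to Config, so check the Config docstring for the exact spelling:

import parsl
from parsl.config import Config
from parsl.executors import HighThroughputExecutor

config = Config(
    executors=[HighThroughputExecutor(label='htex_local')],
    strategy='htex_auto_scale',
    strategy_period=0.5,  # assumed name: poll the scaling strategy every 0.5s instead of the 5s default
)

parsl.load(config)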