Test htex_auto_scale partial scaling-in #3097

Merged
merged 8 commits into master from benc-ghent-scalein-htex-partial on Mar 1, 2024

Conversation

benclifford (Collaborator):

The existing scaling-in test parsl/tests/test_scaling/test_scale_down.py only tests full scaling-in, which is implemented in a separate code path from partial scaling-in (case 4b vs case 1a in parsl/jobs/strategy.py).

Type of change

  • Code maintenance/cleanup

@benclifford force-pushed the benc-ghent-scalein-htex-partial branch from 5864489 to 142a155 on February 22, 2024 10:51
@khk-globus (Collaborator) left a comment:

This is a good test, but I'd like to see if we can speed it up. As it stands, it takes 27s locally, which is a big slice of the total given that the hundreds of other tests and configurations take ~10m as a whole.

Comment on lines 81 to 84
while ready_path.read_text().count("\n") < _max_blocks:
time.sleep(0.5)

assert len(dfk.executors['htex_local'].connected_managers()) == _max_blocks
khk-globus (Collaborator):

Consider (for new code, anyway) use of try_assert:

def test_scale_out(try_assert, ...):
    ...
    try_assert(lambda: ready_path.read_text().count("\n") >= _max_blocks)
    ...

It doesn't save much SLOC-wise, but it might prevent a hung test at some point (e.g., the file not getting written for some reason) and removes the mental overhead of tracking the loop.

Alternatively, I think what matters for this test is that the number of .connected_managers() rises to _max_blocks? Perhaps just test that directly:

def test_scale_out(try_assert, ...):
    dfk = parsl.dfk()
    htex = dfk.executors['htex_local']
    ...
    try_assert(lambda: len(htex.connected_managers()) == _max_blocks, "Verify test setup")
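For readers unfamiliar with the fixture: try_assert retries a predicate until it returns true or a timeout expires, failing the test otherwise. Below is a minimal sketch of that polling-assert pattern; it is illustrative only, not Parsl's actual fixture, and the helper name poll_until and its parameters are assumptions.

import time


def poll_until(predicate, fail_msg="condition not met", timeout_s=5.0, interval_s=0.1):
    """Retry ``predicate`` until it returns True or ``timeout_s`` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return
        time.sleep(interval_s)
    raise AssertionError(fail_msg)


# Roughly how the suggestion above would read with such a helper:
# poll_until(lambda: len(htex.connected_managers()) == _max_blocks,
#            "Verify test setup")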

benclifford (Collaborator, Author):

I switched this to try_assert.

On the immediately following connected-managers assert, I added a note for future readers trying to understand why the assert might fail. What's really needed here is to check how many blocks have successfully reached at least the registration stage, but that info isn't readily available; running one worker per manager and one manager per block gives two different proxies for it.

Comment on lines 103 to 104
timeout_ms=15000,
)
khk-globus (Collaborator):

I've mentioned this in Slack, but I think we should do something about these timeouts. Clearly they're needed for the test to pass, but I'm wondering if we can engineer the test setup so that (a) we don't need to bump the value up to 15s and, more importantly, (b) the test spends ~no time unnecessarily waiting. We're clearly waiting on some amalgamation of internal loops ... can we short-circuit those loops somehow and still keep this test valid and useful?

To put some numbers behind my displeasure: running this test locally took 27s.

(In my experience, the usual avenues for this kind of request are mocking, subclassing, and judicious test setups.)

benclifford (Collaborator, Author):

The runtime goes down to about 12 seconds if I set the strategy to run every 1s, 0.5s, or 0.1s by poking directly into the strategy timer.
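As a rough sketch of what that poking could look like inside a test body: the attribute names below (dfk.job_status_poller and its interval field) are assumptions about DFK internals drawn from the later commit messages, not a stable public API.

import parsl

# Sketch only: reach into the live DataFlowKernel and shrink the strategy
# polling period so scaling decisions happen sooner during the test.
dfk = parsl.dfk()                 # the currently loaded DataFlowKernel
poller = dfk.job_status_poller    # assumed name of the DFK-private JobStatusPoller
poller.interval = 0.1             # assumed attribute holding the poll period, in seconds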

benclifford (Collaborator, Author):

(I'm not poking at the strategy code timing any deeper than this: this PR is part of a sponsored project to fix a single bug in scaling in, not to refactor it for testability.)

@benclifford merged commit c1e0d2c into master on Mar 1, 2024 (6 checks passed)
@benclifford deleted the benc-ghent-scalein-htex-partial branch on March 1, 2024 08:52
benclifford added a commit that referenced this pull request Mar 1, 2024
PR #3097 introduced a more comprehensive test for the htex_auto_scale
strategy, based on this test; prior to this PR, test_scale_down only
tested the 'simple'-strategy parts of htex_auto_scale.
benclifford added a commit that referenced this pull request Mar 13, 2024
This is initially driven by a desire to run strategy polling faster in
tests: there's no fundamental reason why the previous hard-coded value
of 5 seconds needs to set the timescale for test execution. This was
demonstrated previously in
parsl/tests/test_scaling/test_scale_down_htex_auto_scale.py in PR #3097,
which modified the internals of a live, DFK-private JobStatusPoller.
Work I've done on tests elsewhere benefits from strategy polling period
reconfiguration too, so this PR makes that facility a publicly exposed
feature.

This change allows the interval to be set before the job status poller
starts running, which means a racy initial first 5s poll in the above
mentioned test_scale_down_htex_auto_scale.py is avoided: median runtime
of that test on my laptop goes from 11s before this PR to 6s after this
PR (dropping by exactly the 5s initial poll that is now avoided).

It's reasonable to expect some users to want to use this facility too:
perhaps a user doesn't want to wait 5 seconds before the scaling code
notices their workload, or perhaps they are more interested in running
the strategy code much less frequently (for example, when running
workloads on the scale of hours or days, to reduce e.g. debug log load).
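As a sketch of how that publicly exposed knob might be used: the commit message doesn't name the parameter, so the strategy_period name below is an assumption about how the polling period is surfaced on Config.

from parsl.config import Config
from parsl.executors import HighThroughputExecutor

# Sketch: ask the scaling strategy to run every 0.1s rather than the
# previously hard-coded 5s, so a test (or a latency-sensitive user)
# spends less time waiting on the poller.
config = Config(
    executors=[HighThroughputExecutor(label="htex_local")],
    strategy="htex_auto_scale",
    strategy_period=0.1,  # assumed name of the exposed polling period, in seconds
)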