[ci][tune][train] Update release test compute configs to not schedule work on head node #48103

Merged

justinvyu (Contributor) commented Oct 18, 2024

Why are these changes needed?

This PR updates the compute configs for the benchmark release tests so that workers are no longer scheduled onto the head node. This follows the best practice of keeping heavy work off the head node for cluster stability.

Test results for the flaky air_benchmark_tune_torch_mnist.aws:

train_times = [278.915740425, 269.3746311839999, 271.06450245400015]
train_mean = 273.1182913543334
tune_times = [291.6967464280001, 292.411954035, 290.2445501280001]
tune_mean = 291.4510835303334

291 / 273 ≈ 1.07, well below the 1.35 test pass threshold. Previously the test was giving a ratio of around 1.33 because the single trial scheduled on the head node was a straggler.
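The comparison above can be reproduced with a short script. This is a minimal sketch, not the release test's actual code: the helper name `tune_to_train_ratio`, the use of a plain mean over the three runs, and the strict `<` comparison against the threshold are assumptions made for illustration. Only the timings and the 1.35 threshold come from the results reported here.

```python
from statistics import mean

# Timings in seconds, copied from the test results reported in this PR.
train_times = [278.915740425, 269.3746311839999, 271.06450245400015]
tune_times = [291.6967464280001, 292.411954035, 290.2445501280001]

# Pass threshold quoted in the PR description.
PASS_THRESHOLD = 1.35


def tune_to_train_ratio(tune, train):
    """Ratio of mean Tune time to mean Train time (hypothetical helper)."""
    return mean(tune) / mean(train)


ratio = tune_to_train_ratio(tune_times, train_times)

# Assumption: the test passes when the ratio is strictly below the threshold.
passed = ratio < PASS_THRESHOLD

print(f"train_mean = {mean(train_times):.3f}")
print(f"tune_mean  = {mean(tune_times):.3f}")
print(f"ratio      = {ratio:.3f} (threshold {PASS_THRESHOLD}), passed = {passed}")
```

With the earlier ratio of roughly 1.33, the same check would pass only narrowly, which is consistent with the test being described as flaky.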

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

justinvyu enabled auto-merge (squash) October 28, 2024 23:23
github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Oct 28, 2024
justinvyu merged commit d0c6c60 into ray-project:master Oct 29, 2024
7 checks passed
justinvyu deleted the deflake_tune_train_release_test branch October 29, 2024 04:41
can-anyscale added a commit that referenced this pull request Oct 29, 2024
Jay-ju pushed a commit to Jay-ju/ray that referenced this pull request Nov 5, 2024
… work on head node (ray-project#48103)

This PR updates the compute configs for benchmark release tests to not
schedule workers onto the head node. This reflects the best practice not
to schedule heavy work on the head node for cluster stability.

---------

Signed-off-by: Justin Yu <[email protected]>
JP-sDEV pushed a commit to JP-sDEV/ray that referenced this pull request Nov 14, 2024
… work on head node (ray-project#48103)
mohitjain2504 pushed a commit to mohitjain2504/ray that referenced this pull request Nov 15, 2024
… work on head node (ray-project#48103)
Signed-off-by: mohitjain2504 <[email protected]>
Labels
go add ONLY when ready to merge, run all tests
4 participants