[release-3.8] Reorder instance to Slurm node assignation steps #595 #597

Merged: 5 commits into aws:release-3.8 on Nov 8, 2023

Conversation

lukeseawalker

Description of changes

Reorder instance to Slurm node assignation steps

  • from: scontrol update node / write to DynamoDB / write to Route53
  • to: write to DynamoDB / write to Route53 / scontrol update node

so that the instance is assigned to the Slurm node as the last step, once all the required data has been written
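
To illustrate why the order matters, here is a minimal Python sketch of the reordered sequence. It is not the code from this PR: the function names, the callable-based wiring, and the sample node and instance identifiers are all hypothetical, and the three steps are stand-ins for the real DynamoDB, Route53, and `scontrol update` calls.

```python
from typing import Callable, List

# Each step takes (node_name, instance_id). All names here are hypothetical.
Step = Callable[[str, str], None]


def assign_instance_to_node(
    node_name: str,
    instance_id: str,
    write_dynamodb: Step,
    write_route53: Step,
    update_slurm_node: Step,
) -> None:
    """Run the three steps in the reordered sequence described above."""
    # 1. Persist the node-to-instance mapping.
    write_dynamodb(node_name, instance_id)
    # 2. Publish the DNS record.
    write_route53(node_name, instance_id)
    # 3. Tell Slurm about the node last (stand-in for `scontrol update`).
    update_slurm_node(node_name, instance_id)


def make_recorder(log: List[str], name: str, fail: bool = False) -> Step:
    """Build a fake step that records its name, or raises when fail=True."""
    def step(node_name: str, instance_id: str) -> None:
        if fail:
            raise RuntimeError(f"{name} failed for {node_name}")
        log.append(name)
    return step


# Happy path: all three steps run, Slurm update last.
ok_log: List[str] = []
assign_instance_to_node(
    "queue1-dy-compute-1",
    "i-0123456789abcdef0",
    make_recorder(ok_log, "dynamodb"),
    make_recorder(ok_log, "route53"),
    make_recorder(ok_log, "slurm"),
)

# Failure path: a Route53 error stops the sequence before Slurm is touched.
failed_log: List[str] = []
try:
    assign_instance_to_node(
        "queue1-dy-compute-1",
        "i-0123456789abcdef0",
        make_recorder(failed_log, "dynamodb"),
        make_recorder(failed_log, "route53", fail=True),
        make_recorder(failed_log, "slurm"),
    )
except RuntimeError:
    pass
```

In this sketch, a failure in an earlier step means Slurm never sees a node whose supporting data is incomplete, which is the property the reordering aims for.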

Tests

  • unit tests added
  • manually tested on running cluster

References

  • Link to impacted open issues.
  • Link to related PRs in other packages (i.e. cookbook, node).
  • Link to documentation useful to understand the changes.

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop add the branch name as prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Signed-off-by: Luca Carrogu <[email protected]>
(cherry picked from commit 25cce71)
Add error code and message on ClientError when launching instances, for RunInstances and CreateFleet API calls

Signed-off-by: Luca Carrogu <[email protected]>
(cherry picked from commit 418fa7b)
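
As a rough illustration of the commit above, the sketch below pulls the error code and message out of a failed launch call. It assumes the exception exposes a `response` dictionary shaped like botocore's `ClientError` (`{"Error": {"Code": ..., "Message": ...}}`). `FakeClientError` and `describe_launch_error` are hypothetical names used so the example runs without AWS dependencies, and the sample code and message are illustrative values only.

```python
from typing import Dict, Tuple


class FakeClientError(Exception):
    """Hypothetical stand-in mimicking the `response` attribute of botocore's ClientError."""

    def __init__(self, response: Dict, operation_name: str) -> None:
        super().__init__(f"{operation_name} failed")
        self.response = response
        self.operation_name = operation_name


def describe_launch_error(error: FakeClientError) -> Tuple[str, str]:
    """Return (code, message) from the error, with fallbacks if keys are missing."""
    details = error.response.get("Error", {})
    code = details.get("Code", "Unknown")
    message = details.get("Message", str(error))
    return code, message


# Simulated failure of a RunInstances call (values are illustrative only).
error = FakeClientError(
    {"Error": {"Code": "InsufficientInstanceCapacity", "Message": "No capacity available"}},
    "RunInstances",
)
code, message = describe_launch_error(error)

# An error with no "Error" payload falls back to generic values.
bare_code, bare_message = describe_launch_error(FakeClientError({}, "CreateFleet"))
```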
Add exponential backoff to the EC2 instance launch call on throttling.
This is especially useful during all-or-nothing scaling, in the all-in optimization call, to avoid quitting the all-in call and entering the job loop.
The longer retry time requires increasing orphaned_instance_timeout by 1 minute, from 120 to 180 seconds.

Signed-off-by: Luca Carrogu <[email protected]>
(cherry picked from commit 322b376)
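
The retry behaviour described in the commit above can be sketched as follows. This is a generic exponential backoff loop, not the implementation from this PR: `ThrottlingError`, `call_with_backoff`, the attempt count, and the delay values are all assumptions chosen for illustration.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class ThrottlingError(Exception):
    """Hypothetical marker for a throttled launch call."""


def call_with_backoff(
    launch: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `launch` on throttling, doubling the wait after each failed attempt."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return launch()
        except ThrottlingError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the throttling error.
            # Wait base_delay, 2*base_delay, 4*base_delay, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise AssertionError("unreachable")


# Demo: a launch call that is throttled twice and then succeeds.
attempts = {"count": 0}
delays: List[float] = []


def flaky_launch() -> str:
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise ThrottlingError("request limit exceeded")
    return "i-0123456789abcdef0"


# Record the delays instead of actually sleeping.
result = call_with_backoff(flaky_launch, sleep=delays.append)
```

Because the waits add up across attempts, a loop like this lengthens the worst-case launch time, which is consistent with the commit's note that the longer retry time calls for a higher orphaned_instance_timeout.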
Reorder instance to Slurm node assignation steps
* from: scontrol update node / write to DynamoDB / write to Route53
* to: write to DynamoDB / write to Route53 / scontrol update node

so that the instance is assigned to the Slurm node as the last step, once all the required data has been written

Signed-off-by: Luca Carrogu <[email protected]>
lukeseawalker enabled auto-merge (rebase) November 8, 2023 15:38
lukeseawalker merged commit fae5fe7 into aws:release-3.8 Nov 8, 2023
10 checks passed