Avoid setting nodes into DOWN if no nodes are passed as input
Avoid setting nodes into DOWN, and hence avoid calling Slurm scontrol update, if the node list is empty.
The log line this avoids is:
```
2023-09-19 10:56:39,439 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x0) [] with reason: (Code:LimitedInstanceCapacity)Failure when resuming nodes
```

Signed-off-by: Luca Carrogu <[email protected]>
lukeseawalker committed Sep 19, 2023
1 parent 0d033ef commit 125c7c0
Showing 1 changed file with 13 additions and 12 deletions.
25 changes: 13 additions & 12 deletions src/slurm_plugin/resume.py
@@ -157,18 +157,19 @@ def _handle_failed_nodes(node_list, reason="Failure when resuming nodes"):
     To save time, should explicitly set nodes to DOWN in ResumeProgram so clustermgtd can maintain failed nodes.
     Clustermgtd will be responsible for running full DOWN -> POWER_DOWN process.
     """
-    try:
-        log.info(
-            "Setting following failed nodes into DOWN state %s with reason: %s", print_with_count(node_list), reason
-        )
-        set_nodes_down(node_list, reason=reason)
-    except Exception as e:
-        log.error(
-            "Failed to place nodes %s into DOWN for reason %s with exception: %s",
-            print_with_count(node_list),
-            reason,
-            e,
-        )
+    if node_list:
+        try:
+            log.info(
+                "Setting following failed nodes into DOWN state %s with reason: %s", print_with_count(node_list), reason
+            )
+            set_nodes_down(node_list, reason=reason)
+        except Exception as e:
+            log.error(
+                "Failed to place nodes %s into DOWN for reason %s with exception: %s",
+                print_with_count(node_list),
+                reason,
+                e,
+            )


def _resume(arg_nodes, resume_config, slurm_resume):
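
The effect of the new `if node_list:` guard can be sketched in isolation. The snippet below is a minimal, self-contained illustration, not the plugin's actual code: `print_with_count` and `set_nodes_down` are hypothetical stand-ins (the real ones live in the plugin and call Slurm), the `(xN) [...]` format is inferred from the log line in the commit message, and the node name is made up.

```python
import logging

log = logging.getLogger("resume_sketch")

# Records every simulated "set DOWN" request so the guard's effect is observable.
calls = []


def print_with_count(items):
    # Hypothetical stand-in; output format inferred from the "(x0) []" log line above.
    return f"(x{len(items)}) {list(items)}"


def set_nodes_down(node_list, reason):
    # Hypothetical stand-in: the real helper invokes Slurm; here the call is only recorded.
    calls.append((list(node_list), reason))


def handle_failed_nodes(node_list, reason="Failure when resuming nodes"):
    # Guard from the commit: skip both the log line and the Slurm call for an empty list.
    if node_list:
        try:
            log.info(
                "Setting following failed nodes into DOWN state %s with reason: %s",
                print_with_count(node_list),
                reason,
            )
            set_nodes_down(node_list, reason=reason)
        except Exception as e:
            log.error(
                "Failed to place nodes %s into DOWN for reason %s with exception: %s",
                print_with_count(node_list),
                reason,
                e,
            )


handle_failed_nodes([])                        # no log line, no simulated Slurm call
handle_failed_nodes(["queue1-dy-example-1"])   # made-up node name; one call recorded
```

After both calls, `calls` holds a single entry, for the non-empty list. Because the guard is a truthiness check, `None` would be skipped in the same way as an empty list.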
