
[develop] Job Level Scaling for Node Sharing #564

Merged: 6 commits, Sep 26, 2023

Conversation

@lukeseawalker (Contributor) commented Sep 18, 2023

Description of changes

  • Reset the failure for nodes that were launched successfully and for which it was possible to assign an instance.
    This covers the node-sharing (oversubscribe) case, where nodes that failed in a job call
    are actually launched (and assigned to instances) in a later iteration of the job loop.

  • Add job-level scaling for the node-sharing case.
    Before entering the job loop, perform the same optimizations done for the exclusive-job case
    (see the sketch after this section):

    • scale best-effort for all single-node jobs
    • scale all for all multi-node jobs
  • Avoid setting nodes to DOWN, and hence avoid calling Slurm scontrol update, if the node list is empty.
    The avoided log line is:

    2023-09-19 10:56:39,439 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x0) [] with reason: (Code:LimitedInstanceCapacity)Failure when resuming nodes
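
As a rough, runnable sketch of the pre-loop flow described above (helper names are hypothetical; the real logic lives in slurm_plugin/instance_manager.py):

```python
from typing import Dict, List

def launch(nodes: List[str]) -> Dict[str, str]:
    # Hypothetical stand-in for the fleet launch: pretend every node gets an instance id.
    return {n: f"i-{abs(hash(n)) % 10**8:08x}" for n in nodes}

def scale_for_jobs(single_node_jobs: List[List[str]], multi_node_jobs: List[List[str]]) -> Dict[str, str]:
    launched: Dict[str, str] = {}
    # Pre-loop: one best-effort launch covering every single-node job, de-duplicated
    # while preserving order (the dict.fromkeys idiom discussed in the review below).
    launched.update(launch(list(dict.fromkeys(job[0] for job in single_node_jobs))))
    # Pre-loop: one launch covering all nodes of all multi-node jobs.
    launched.update(launch(list(dict.fromkeys(n for job in multi_node_jobs for n in job))))
    # Job loop: each job first "books" instances launched above, then launches only what
    # is still missing; a node that failed earlier but gets an instance here has its
    # failure reset instead of being set DOWN.
    for job in multi_node_jobs:
        launched.update(launch([n for n in job if n not in launched]))
    return launched
```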
    

Tests

  • unit tests added
  • manual tests performed on a running cluster with the following submission command:

```
sbatch --wrap "sleep 10000" -N 4 --constraint="[(c5.4xlarge)*3&(p4d.24xlarge)*1]" -p q4
sbatch --wrap "sleep 10000" -N 4 --constraint="[(c5.4xlarge)*3&(p4d.24xlarge)*1]" -p q4
sbatch --wrap "sleep 10000" -N 3 --constraint="[(c5.4xlarge)*3]" -p q4
```

where there is capacity for c5.4xlarge but not for p4d.24xlarge, the two scaling strategies were tested:
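
One observable difference between the two runs below is the MinTargetCapacity passed to the create_fleet call; a minimal sketch of that relationship (illustration only, not the plugin's actual code):

```python
def min_target_capacity(total_target_capacity: int, all_or_nothing_batch: bool) -> int:
    # All-or-nothing asks EC2 for the full count or nothing; best-effort accepts partial capacity.
    return total_target_capacity if all_or_nothing_batch else 1

assert min_target_capacity(3, True) == 3   # seen in the all_or_nothing_batch = true log below
assert min_target_capacity(3, False) == 1  # seen in the all_or_nothing_batch = false log below
```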

all_or_nothing_batch = true
expected nodes running at the end of the resume call: (x3) q4-dy-c4-1-*

resume log:

```
2023-09-19 10:56:32,549 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2023-09-19 10:56:32,550 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2023-09-19 10:56:32,551 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='us-east-1', cluster_name='bootstrap', dynamodb_table='parallelcluster-slurm-bootstrap', hosted_zone='Z09815256PBUS3QRIMRV', dns_domain='bootstrap.pcluster.', use_private_hostname=False, head_node_private_ip='192.168.24.99', head_node_hostname='ip-192-168-24-99.ec2.internal', launch_max_batch_size=500, assign_node_max_batch_size=500, terminate_max_batch_size=1000, update_node_address=True, all_or_nothing_batch=True, job_level_scaling=True, temp_jls_for_node_sharing=False, fleet_config={'q1': {'c1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q2': {'c2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.2xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q3': {'c3': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q4': {'c4-1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}, 'c4-2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'p4d.24xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}}, run_instances_overrides={}, create_fleet_overrides={}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x7efe7e5ecd60>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf', head_node_instance_id='i-0145afe796a5e375a')
2023-09-19 10:56:32,551 - [slurm_plugin.resume:_get_slurm_resume] - INFO - Slurm Resume File content: {'jobs': [{'extra': None, 'job_id': 252, 'features': '[(c5.4xlarge)*3&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'oversubscribe': 'OK', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 253, 'features': '[(c5.4xlarge)*3&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'oversubscribe': 'OK', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 254, 'features': '[(c5.4xlarge)*3]', 'nodes_alloc': 'q4-dy-c4-1-[1-3]', 'nodes_resume': 'q4-dy-c4-1-[1-3]', 'oversubscribe': 'OK', 'partition': 'q4', 'reservation': None}], 'all_nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1'}
2023-09-19 10:56:32,555 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2023-09-19 10:56:00.366160+00:00
2023-09-19 10:56:32,556 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: q4-dy-c4-1-[1-3],q4-dy-c4-2-1
2023-09-19 10:56:32,609 - [slurm_plugin.resume:_resume] - INFO - Current state of Slurm nodes to resume: [('q4-dy-c4-1-1', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-2', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-3', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-1', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP')]
2023-09-19 10:56:32,634 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: bootstrap-RoleHeadNode-NKATKTSA4IIU
2023-09-19 10:56:32,675 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching all-or-nothing instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']
2023-09-19 10:56:32,676 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-1', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'c5.4xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 3, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 3, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-19 10:56:35,637 - [slurm_plugin.fleet_manager:launch_ec2_instances] - INFO - Launched the following instances (x3) ['i-01fa6f17d69b9f86a', 'i-032040429aa3571b1', 'i-0553c576b546f1d1d']
2023-09-19 10:56:35,638 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching all-or-nothing instances for nodes (x1) ['q4-dy-c4-2-1']
2023-09-19 10:56:35,638 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-19 10:56:36,709 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - Error in CreateFleet request (19e5fdb0-c13d-4634-8c12-81678a5ddb1a): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-19 10:56:36,810 - [slurm_plugin.instance_manager:_scaling_for_jobs] - INFO - JobID 252 - The nodes_resume list from Slurm Resume File is (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-2-1']
2023-09-19 10:56:36,810 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 252 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']:
2023-09-19 10:56:36,810 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 252 - Launching all-or-nothing instances for nodes (x1) ['q4-dy-c4-2-1']
2023-09-19 10:56:36,810 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 252 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-19 10:56:37,857 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 252 - Error in CreateFleet request (542a63d6-8e0e-41eb-ad47-ebefe7e49450): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-19 10:56:37,957 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 252 - Releasing launched and booked instances (x3) ["('q4', 'c4-1', EC2Instance(id='i-01fa6f17d69b9f86a', private_ip='192.168.109.104', hostname='ip-192-168-109-104', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))", "('q4', 'c4-1', EC2Instance(id='i-032040429aa3571b1', private_ip='192.168.104.153', hostname='ip-192-168-104-153', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))", "('q4', 'c4-1', EC2Instance(id='i-0553c576b546f1d1d', private_ip='192.168.110.129', hostname='ip-192-168-110-129', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-19 10:56:37,957 - [slurm_plugin.instance_manager:_scaling_for_jobs] - INFO - JobID 253 - The nodes_resume list from Slurm Resume File is (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-2-1']
2023-09-19 10:56:37,958 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 253 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']:
2023-09-19 10:56:37,958 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 253 - Launching all-or-nothing instances for nodes (x1) ['q4-dy-c4-2-1']
2023-09-19 10:56:37,958 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 253 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-19 10:56:38,992 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 253 - Error in CreateFleet request (7654c2d2-e5fe-4d5f-a8bc-7404045f3618): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-19 10:56:39,093 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 253 - Releasing launched and booked instances (x3) ["('q4', 'c4-1', EC2Instance(id='i-01fa6f17d69b9f86a', private_ip='192.168.109.104', hostname='ip-192-168-109-104', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))", "('q4', 'c4-1', EC2Instance(id='i-032040429aa3571b1', private_ip='192.168.104.153', hostname='ip-192-168-104-153', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))", "('q4', 'c4-1', EC2Instance(id='i-0553c576b546f1d1d', private_ip='192.168.110.129', hostname='ip-192-168-110-129', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-19 10:56:39,093 - [slurm_plugin.instance_manager:_scaling_for_jobs] - INFO - JobID 254 - The nodes_resume list from Slurm Resume File is (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']
2023-09-19 10:56:39,093 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 254 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']:
2023-09-19 10:56:39,111 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 254 - Nodes are now configured with instances (x3) ["('q4-dy-c4-1-1', EC2Instance(id='i-01fa6f17d69b9f86a', private_ip='192.168.109.104', hostname='ip-192-168-109-104', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-2', EC2Instance(id='i-032040429aa3571b1', private_ip='192.168.104.153', hostname='ip-192-168-104-153', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-3', EC2Instance(id='i-0553c576b546f1d1d', private_ip='192.168.110.129', hostname='ip-192-168-110-129', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-19 10:56:39,111 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 254 - Saving assigned hostnames in DynamoDB
2023-09-19 10:56:39,146 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 254 - Database update: COMPLETED
2023-09-19 10:56:39,146 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 254 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster.
2023-09-19 10:56:39,420 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 254 - DNS records update: COMPLETED
2023-09-19 10:56:39,421 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 254 - Successful launched all instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']
2023-09-19 10:56:39,422 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']
2023-09-19 10:56:39,422 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to DOWN: (x1) ['q4-dy-c4-2-1']
2023-09-19 10:56:39,422 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x1) ['q4-dy-c4-2-1'] with reason: (Code:InsufficientInstanceCapacity)Failure when resuming nodes
2023-09-19 10:56:39,439 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x0) [] with reason: (Code:LimitedInstanceCapacity)Failure when resuming nodes
2023-09-19 10:56:39,442 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
```

all_or_nothing_batch = false
expected nodes running at the end of the resume call: (x3) q4-dy-c4-1-*

resume log:

```
2023-09-19 12:30:03,047 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2023-09-19 12:30:03,048 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2023-09-19 12:30:03,049 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='us-east-1', cluster_name='bootstrap', dynamodb_table='parallelcluster-slurm-bootstrap', hosted_zone='Z09815256PBUS3QRIMRV', dns_domain='bootstrap.pcluster.', use_private_hostname=False, head_node_private_ip='192.168.24.99', head_node_hostname='ip-192-168-24-99.ec2.internal', launch_max_batch_size=500, assign_node_max_batch_size=500, terminate_max_batch_size=1000, update_node_address=True, all_or_nothing_batch=False, job_level_scaling=True, temp_jls_for_node_sharing=False, fleet_config={'q1': {'c1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q2': {'c2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.2xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q3': {'c3': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q4': {'c4-1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}, 'c4-2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'p4d.24xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}}, run_instances_overrides={}, create_fleet_overrides={}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x7f11e1f2fd60>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf', head_node_instance_id='i-0145afe796a5e375a')
2023-09-19 12:30:03,049 - [slurm_plugin.resume:_get_slurm_resume] - INFO - Slurm Resume File content: {'jobs': [{'extra': None, 'job_id': 260, 'features': '[(c5.4xlarge)*3&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'oversubscribe': 'OK', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 261, 'features': '[(c5.4xlarge)*3&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'oversubscribe': 'OK', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 262, 'features': '[(c5.4xlarge)*3]', 'nodes_alloc': 'q4-dy-c4-1-[1-3]', 'nodes_resume': 'q4-dy-c4-1-[1-3]', 'oversubscribe': 'OK', 'partition': 'q4', 'reservation': None}], 'all_nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1'}
2023-09-19 12:30:03,054 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2023-09-19 12:29:03.613945+00:00
2023-09-19 12:30:03,054 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: q4-dy-c4-1-[1-3],q4-dy-c4-2-1
2023-09-19 12:30:03,109 - [slurm_plugin.resume:_resume] - INFO - Current state of Slurm nodes to resume: [('q4-dy-c4-1-1', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-2', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-3', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-1', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP')]
2023-09-19 12:30:03,135 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: bootstrap-RoleHeadNode-NKATKTSA4IIU
2023-09-19 12:30:03,176 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching best-effort instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']
2023-09-19 12:30:03,176 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-1', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'c5.4xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 3, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-19 12:30:06,736 - [slurm_plugin.fleet_manager:launch_ec2_instances] - INFO - Launched the following instances (x3) ['i-083f7c31d25b7430a', 'i-061dc215a811fe1ed', 'i-0a4d69c19b6ad8322']
2023-09-19 12:30:06,737 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching best-effort instances for nodes (x1) ['q4-dy-c4-2-1']
2023-09-19 12:30:06,737 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-19 12:30:07,799 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - Error in CreateFleet request (b0c51c67-eed1-4b15-8872-4e390327aca7): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-19 12:30:07,900 - [slurm_plugin.instance_manager:_scaling_for_jobs] - INFO - JobID 260 - The nodes_resume list from Slurm Resume File is (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-2-1']
2023-09-19 12:30:07,900 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 260 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']:
2023-09-19 12:30:07,900 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 260 - Launching best-effort instances for nodes (x1) ['q4-dy-c4-2-1']
2023-09-19 12:30:07,901 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 260 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2023-09-19 12:30:08,949 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 260 - Error in CreateFleet request (09653663-7ccc-45ac-9366-3fdc0299e86b): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b.
2023-09-19 12:30:09,067 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 260 - Nodes are now configured with instances (x3) ["('q4-dy-c4-1-1', EC2Instance(id='i-083f7c31d25b7430a', private_ip='192.168.111.219', hostname='ip-192-168-111-219', launch_time=datetime.datetime(2023, 9, 19, 12, 30, 5, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-2', EC2Instance(id='i-061dc215a811fe1ed', private_ip='192.168.104.231', hostname='ip-192-168-104-231', launch_time=datetime.datetime(2023, 9, 19, 12, 30, 5, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-3', EC2Instance(id='i-0a4d69c19b6ad8322', private_ip='192.168.109.180', hostname='ip-192-168-109-180', launch_time=datetime.datetime(2023, 9, 19, 12, 30, 5, tzinfo=tzlocal()), slurm_node=None))"]
2023-09-19 12:30:09,067 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 260 - Saving assigned hostnames in DynamoDB
2023-09-19 12:30:09,106 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 260 - Database update: COMPLETED
2023-09-19 12:30:09,106 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 260 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster.
2023-09-19 12:30:09,331 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 260 - DNS records update: COMPLETED
2023-09-19 12:30:09,332 - [slurm_plugin.instance_manager:best_effort_node_assignment] - INFO - JobID 260 - Successful launched partial instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']
2023-09-19 12:30:09,332 - [slurm_plugin.instance_manager:_scaling_for_jobs] - INFO - JobID 261 - The nodes_resume list from Slurm Resume File is (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-2-1']
2023-09-19 12:30:09,332 - [slurm_plugin.instance_manager:_parse_nodes_resume_list] - INFO - JobID 261 - Discarding NodeName already assigned to running instance: q4-dy-c4-1-1
2023-09-19 12:30:09,332 - [slurm_plugin.instance_manager:_parse_nodes_resume_list] - INFO - JobID 261 - Discarding NodeName already assigned to running instance: q4-dy-c4-1-2
2023-09-19 12:30:09,332 - [slurm_plugin.instance_manager:_parse_nodes_resume_list] - INFO - JobID 261 - Discarding NodeName already assigned to running instance: q4-dy-c4-1-3
2023-09-19 12:30:09,332 - [slurm_plugin.instance_manager:_parse_nodes_resume_list] - INFO - JobID 261 - Discarding NodeName already assigned to running instance: q4-dy-c4-2-1
2023-09-19 12:30:09,333 - [slurm_plugin.instance_manager:_scaling_for_jobs] - INFO - JobID 262 - The nodes_resume list from Slurm Resume File is (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']
2023-09-19 12:30:09,333 - [slurm_plugin.instance_manager:_parse_nodes_resume_list] - INFO - JobID 262 - Discarding NodeName already assigned to running instance: q4-dy-c4-1-1
2023-09-19 12:30:09,333 - [slurm_plugin.instance_manager:_parse_nodes_resume_list] - INFO - JobID 262 - Discarding NodeName already assigned to running instance: q4-dy-c4-1-2
2023-09-19 12:30:09,333 - [slurm_plugin.instance_manager:_parse_nodes_resume_list] - INFO - JobID 262 - Discarding NodeName already assigned to running instance: q4-dy-c4-1-3
2023-09-19 12:30:09,333 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']
2023-09-19 12:30:09,333 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to DOWN: (x1) ['q4-dy-c4-2-1']
2023-09-19 12:30:09,333 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x1) ['q4-dy-c4-2-1'] with reason: (Code:InsufficientInstanceCapacity)Failure when resuming nodes
2023-09-19 12:30:09,369 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
```

References

n/a

Checklist

  • Make sure you are pointing to the right branch.
  • If you're creating a patch for a branch other than develop, add the branch name as a prefix in the PR title (e.g. [release-3.6]).
  • Check all commits' messages are clear, describing what and why vs how.
  • Make sure to have added unit tests or integration tests to cover the new/modified code.
  • Check if documentation is impacted by this change.

Please review the guidelines for contributing and Pull Request Instructions.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Reset the failure for nodes that were launched successfully and for which it was possible to assign an instance.
This covers the node-sharing (oversubscribe) case, where nodes that failed in a job call
are actually launched (and assigned to instances) in a later iteration of the job loop.

Signed-off-by: Luca Carrogu <[email protected]>
@lukeseawalker force-pushed the wip/nodeSharingJLS branch 3 times, most recently from c0f3aff to 125c7c0, on September 19, 2023 11:05
codecov bot commented Sep 19, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison: base (e4908ee) 89.73% vs head (3a6a470) 89.92%.

Additional details and impacted files
```
@@             Coverage Diff             @@
##           develop     #564      +/-   ##
===========================================
+ Coverage    89.73%   89.92%   +0.18%     
===========================================
  Files           16       16              
  Lines         2688     2689       +1     
===========================================
+ Hits          2412     2418       +6     
+ Misses         276      271       -5     
```
Flag Coverage Δ
unittests 89.92% <100.00%> (+0.18%) ⬆️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
src/slurm_plugin/instance_manager.py 100.00% <100.00%> (ø)
src/slurm_plugin/resume.py 80.83% <100.00%> (+4.16%) ⬆️


Add job-level scaling for the node-sharing case.
Before entering the job loop, perform the same optimizations done for the exclusive-job case:
* scale best-effort for all single-node jobs
* scale all for all multi-node jobs


Signed-off-by: Luca Carrogu <[email protected]>
@lukeseawalker force-pushed the wip/nodeSharingJLS branch 4 times, most recently from 92dc440 to 2819732, on September 19, 2023 12:54
Avoid setting nodes to DOWN, and hence avoid calling Slurm scontrol update, if the node list is empty.
The avoided log line is:
```
2023-09-19 10:56:39,439 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x0) [] with reason: (Code:LimitedInstanceCapacity)Failure when resuming nodes
```

Signed-off-by: Luca Carrogu <[email protected]>
Signed-off-by: Luca Carrogu <[email protected]>
@lukeseawalker marked this pull request as ready for review on September 19, 2023 19:11
@lukeseawalker requested review from a team as code owners on September 19, 2023 19:11
Remove temporary resume setting used during development of the node-sharing job-level scaling feature

Signed-off-by: Luca Carrogu <[email protected]>
```python
            reason,
            e,
        )
    if node_list:
```
Contributor:

Out of curiosity, which code path results in us calling _handle_failed_nodes with an empty node list?

Contributor Author:

When we have reset the node failures with _reset_failed_nodes, it can happen that the error key is still there but there are no more nodes associated with that error.
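
A minimal runnable sketch of the guard discussed in this thread; helper and parameter names are hypothetical, while the real code lives in slurm_plugin/resume.py:

```python
from typing import Dict, Set

def set_nodes_down(nodes: Set[str], reason: str) -> None:
    # Hypothetical stand-in for the real scontrol-based helper.
    print(f"scontrol update nodename={','.join(sorted(nodes))} state=DOWN reason='{reason}'")

def handle_failed_nodes(failed_nodes: Dict[str, Set[str]]) -> None:
    for error_code, node_list in failed_nodes.items():
        # After _reset_failed_nodes, an error key can remain with an empty set of nodes;
        # skip it so we neither log "(x0) []" nor call scontrol with no nodes.
        if node_list:
            set_nodes_down(node_list, reason=f"(Code:{error_code})Failure when resuming nodes")
```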

Comment on lines +636 to 637
```python
single_nodes = list(dict.fromkeys([job.nodes_resume[0] for job in job_list]))
self._add_instances_for_nodes(
```
Contributor:

Any reason to use list(dict.fromkeys(...)) instead of list(set(...))? We're expecting only a single node, right?

Contributor Author:

I wanted to preserve the order

Contributor:

I think we'll have only a single node. dict.fromkeys is fine, I just wanted to understand if there was another reason other than the order.
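
For reference, a quick illustration of the trade-off discussed above: dict.fromkeys de-duplicates while keeping first-seen order (guaranteed since Python 3.7), whereas set does not:

```python
nodes = ["q4-dy-c4-1-1", "q4-dy-c4-2-1", "q4-dy-c4-1-1"]

# Order-preserving de-duplication: keeps the first-seen order of the node names.
print(list(dict.fromkeys(nodes)))  # ['q4-dy-c4-1-1', 'q4-dy-c4-2-1']

# set() also de-duplicates, but its iteration order is arbitrary.
print(list(set(nodes)))  # same two elements, in no guaranteed order
```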

```diff
@@ -660,36 +657,24 @@ def _add_instances_for_resume_file(
         self._clear_unused_launched_instances()

         self._scaling_for_jobs_single_node(
-            job_list=slurm_resume_data.jobs_single_node_no_oversubscribe,
+            job_list=slurm_resume_data.jobs_single_node_no_oversubscribe
```
Contributor:

At this stage, can SlurmResumeData contain a property jobs_single_node that already combines both "oversubscribe" and "no oversubscribe"?

Contributor Author:

Absolutely, I think we can drop the distinction between "oversubscribe" and "no oversubscribe", now that we are able to manage both types. I'm considering this for the next PR.
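
One possible shape for that follow-up, as a sketch (the merged property is hypothetical; only the field names come from the diffs in this review):

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class SlurmResumeData:
    jobs_single_node_no_oversubscribe: List[Any] = field(default_factory=list)
    jobs_single_node_oversubscribe: List[Any] = field(default_factory=list)

    @property
    def jobs_single_node(self) -> List[Any]:
        # One list covering both cases, so callers no longer need the distinction.
        return self.jobs_single_node_no_oversubscribe + self.jobs_single_node_oversubscribe
```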

```diff
-            job_list=slurm_resume_data.jobs_multi_node_no_oversubscribe,
-            node_list=slurm_resume_data.multi_node_no_oversubscribe,
+            job_list=slurm_resume_data.jobs_multi_node_no_oversubscribe
+            + slurm_resume_data.jobs_multi_node_oversubscribe,
```
Contributor:

Same here, having a jobs_multi_node property in SlurmResumeData.

Contributor Author:

yes, see other comment

Comment on lines 962 to +963
```python
self._update_dict(self.nodes_assigned_to_instances, nodes_resume_mapping)
self._reset_failed_nodes(set(nodes_resume_list))
```
Contributor:

NIT: It seems that both best-effort and a successful all-or-nothing handle the successfully launched nodes roughly the same way. Maybe we can have shared behaviour for both cases?

```
def handle_successfully_launched_nodes:
    - Update the node mapping dictionary
    - Reset the failed nodes
```

Contributor Author:

Not really, they are different. Let's sync on this.

Fix missing parameter assign_node_batch_size for _add_instances_for_nodes

Signed-off-by: Luca Carrogu <[email protected]>
@lukeseawalker enabled auto-merge (rebase) on September 26, 2023 08:12