[develop] Job Level Scaling for Node Sharing #564
Conversation
Reset the failure for nodes that were launched successfully and for which it was possible to assign an instance. This covers the node-sharing (oversubscribe) case, where nodes that failed in one job call are actually launched (and assigned to instances) in a later iteration of the job loop. Signed-off-by: Luca Carrogu <[email protected]>
Force-pushed from c0f3aff to 125c7c0.
Codecov Report: All modified lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## develop #564 +/- ##
===========================================
+ Coverage 89.73% 89.92% +0.18%
===========================================
Files 16 16
Lines 2688 2689 +1
===========================================
+ Hits 2412 2418 +6
+ Misses 276 271 -5
Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
Add job-level scaling for the node sharing case. Before entering the job loop, perform the same optimizations done for the exclusive job case:
* scale best-effort for all single node jobs
* scale all for all multi node jobs

Manual tests performed on a running cluster given the following submission command:
```
sbatch --wrap "sleep 10000" -N 4 --constraint="[(c5.4xlarge)*3&(p4d.24xlarge)*1]" -p q4; sbatch --wrap "sleep 10000" -N 4 --constraint="[(c5.4xlarge)*3&(p4d.24xlarge)*1]" -p q4; sbatch --wrap "sleep 10000" -N 3 --constraint="[(c5.4xlarge)*3]" -p q4
```
where there is capacity for c5.4xlarge but not for p4d.24xlarge. The two scaling strategies were tested:

all_or_nothing_batch = true
expected nodes running at the end of the resume call: (x3) q4-dy-c4-1-*
resume log:
```
2023-09-19 10:56:32,549 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup.
2023-09-19 10:56:32,550 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf
2023-09-19 10:56:32,551 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='us-east-1', cluster_name='bootstrap', dynamodb_table='parallelcluster-slurm-bootstrap', hosted_zone='Z09815256PBUS3QRIMRV', dns_domain='bootstrap.pcluster.', use_private_hostname=False, head_node_private_ip='192.168.24.99', head_node_hostname='ip-192-168-24-99.ec2.internal', launch_max_batch_size=500, assign_node_max_batch_size=500, terminate_max_batch_size=1000, update_node_address=True, all_or_nothing_batch=True, job_level_scaling=True, temp_jls_for_node_sharing=False, fleet_config={'q1': {'c1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q2': {'c2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.2xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q3': {'c3': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q4': {'c4-1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}, 'c4-2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'p4d.24xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}}, run_instances_overrides={}, create_fleet_overrides={}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x7efe7e5ecd60>, logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf', head_node_instance_id='i-0145afe796a5e375a')
2023-09-19 10:56:32,551 - [slurm_plugin.resume:_get_slurm_resume] - INFO - Slurm Resume File content: {'jobs': [{'extra': None, 'job_id': 252, 'features': '[(c5.4xlarge)*3&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'oversubscribe': 'OK', 'partition': 'q4', 'reservation':
None}, {'extra': None, 'job_id': 253, 'features': '[(c5.4xlarge)*3&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'oversubscribe': 'OK', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 254, 'features': '[(c5.4xlarge)*3]', 'nodes_alloc': 'q4-dy-c4-1-[1-3]', 'nodes_resume': 'q4-dy-c4-1-[1-3]', 'oversubscribe': 'OK', 'partition': 'q4', 'reservation': None}], 'all_nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1'} 2023-09-19 10:56:32,555 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2023-09-19 10:56:00.366160+00:00 2023-09-19 10:56:32,556 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: q4-dy-c4-1-[1-3],q4-dy-c4-2-1 2023-09-19 10:56:32,609 - [slurm_plugin.resume:_resume] - INFO - Current state of Slurm nodes to resume: [('q4-dy-c4-1-1', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-2', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-3', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-1', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP')] 2023-09-19 10:56:32,634 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: bootstrap-RoleHeadNode-NKATKTSA4IIU 2023-09-19 10:56:32,675 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching all-or-nothing instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3'] 2023-09-19 10:56:32,676 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-1', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'c5.4xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 3, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 3, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-19 10:56:35,637 - [slurm_plugin.fleet_manager:launch_ec2_instances] - INFO - Launched the following instances (x3) ['i-01fa6f17d69b9f86a', 'i-032040429aa3571b1', 'i-0553c576b546f1d1d'] 2023-09-19 10:56:35,638 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching all-or-nothing instances for nodes (x1) ['q4-dy-c4-2-1'] 2023-09-19 10:56:35,638 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-19 10:56:36,709 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - Error in CreateFleet request (19e5fdb0-c13d-4634-8c12-81678a5ddb1a): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). 
Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b. 2023-09-19 10:56:36,810 - [slurm_plugin.instance_manager:_scaling_for_jobs] - INFO - JobID 252 - The nodes_resume list from Slurm Resume File is (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-2-1'] 2023-09-19 10:56:36,810 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 252 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']: 2023-09-19 10:56:36,810 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 252 - Launching all-or-nothing instances for nodes (x1) ['q4-dy-c4-2-1'] 2023-09-19 10:56:36,810 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 252 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-19 10:56:37,857 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 252 - Error in CreateFleet request (542a63d6-8e0e-41eb-ad47-ebefe7e49450): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b. 2023-09-19 10:56:37,957 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 252 - Releasing launched and booked instances (x3) ["('q4', 'c4-1', EC2Instance(id='i-01fa6f17d69b9f86a', private_ip='192.168.109.104', hostname='ip-192-168-109-104', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))", "('q4', 'c4-1', EC2Instance(id='i-032040429aa3571b1', private_ip='192.168.104.153', hostname='ip-192-168-104-153', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))", "('q4', 'c4-1', EC2Instance(id='i-0553c576b546f1d1d', private_ip='192.168.110.129', hostname='ip-192-168-110-129', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))"] 2023-09-19 10:56:37,957 - [slurm_plugin.instance_manager:_scaling_for_jobs] - INFO - JobID 253 - The nodes_resume list from Slurm Resume File is (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-2-1'] 2023-09-19 10:56:37,958 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 253 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']: 2023-09-19 10:56:37,958 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 253 - Launching all-or-nothing instances for nodes (x1) ['q4-dy-c4-2-1'] 2023-09-19 10:56:37,958 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 253 - Launching instances with create_fleet API. 
Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-19 10:56:38,992 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 253 - Error in CreateFleet request (7654c2d2-e5fe-4d5f-a8bc-7404045f3618): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b. 2023-09-19 10:56:39,093 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 253 - Releasing launched and booked instances (x3) ["('q4', 'c4-1', EC2Instance(id='i-01fa6f17d69b9f86a', private_ip='192.168.109.104', hostname='ip-192-168-109-104', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))", "('q4', 'c4-1', EC2Instance(id='i-032040429aa3571b1', private_ip='192.168.104.153', hostname='ip-192-168-104-153', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))", "('q4', 'c4-1', EC2Instance(id='i-0553c576b546f1d1d', private_ip='192.168.110.129', hostname='ip-192-168-110-129', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))"] 2023-09-19 10:56:39,093 - [slurm_plugin.instance_manager:_scaling_for_jobs] - INFO - JobID 254 - The nodes_resume list from Slurm Resume File is (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3'] 2023-09-19 10:56:39,093 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 254 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']: 2023-09-19 10:56:39,111 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 254 - Nodes are now configured with instances (x3) ["('q4-dy-c4-1-1', EC2Instance(id='i-01fa6f17d69b9f86a', private_ip='192.168.109.104', hostname='ip-192-168-109-104', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-2', EC2Instance(id='i-032040429aa3571b1', private_ip='192.168.104.153', hostname='ip-192-168-104-153', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-3', EC2Instance(id='i-0553c576b546f1d1d', private_ip='192.168.110.129', hostname='ip-192-168-110-129', launch_time=datetime.datetime(2023, 9, 19, 10, 56, 34, tzinfo=tzlocal()), slurm_node=None))"] 2023-09-19 10:56:39,111 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 254 - Saving assigned hostnames in DynamoDB 2023-09-19 10:56:39,146 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 254 - Database update: COMPLETED 2023-09-19 10:56:39,146 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 254 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster. 
2023-09-19 10:56:39,420 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 254 - DNS records update: COMPLETED 2023-09-19 10:56:39,421 - [slurm_plugin.instance_manager:all_or_nothing_node_assignment] - INFO - JobID 254 - Successful launched all instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3'] 2023-09-19 10:56:39,422 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3'] 2023-09-19 10:56:39,422 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to DOWN: (x1) ['q4-dy-c4-2-1'] 2023-09-19 10:56:39,422 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x1) ['q4-dy-c4-2-1'] with reason: (Code:InsufficientInstanceCapacity)Failure when resuming nodes 2023-09-19 10:56:39,439 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x0) [] with reason: (Code:LimitedInstanceCapacity)Failure when resuming nodes 2023-09-19 10:56:39,442 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished. ``` all_or_nothing_batch = false expected nodes running at the end of the resume call: (x3) q4-dy-c4-1-* resume log: ``` 2023-09-19 12:30:03,047 - [slurm_plugin.resume:main] - INFO - ResumeProgram startup. 2023-09-19 12:30:03,048 - [slurm_plugin.resume:_get_config] - INFO - Reading /etc/parallelcluster/slurm_plugin/parallelcluster_slurm_resume.conf 2023-09-19 12:30:03,049 - [slurm_plugin.resume:main] - INFO - ResumeProgram config: SlurmResumeConfig(region='us-east-1', cluster_name='bootstrap', dynamodb_table='parallelcluster-slurm-bootstrap', hosted_zone='Z09815256PBUS3QRIMRV', dns_domain='bootstrap.pcluster.', use_private_hostname=False, head_node_private_ip='192.168.24.99', head_node_hostname='ip-192-168-24-99.ec2.internal', launch_max_batch_size=500, assign_node_max_batch_size=500, terminate_max_batch_size=1000, update_node_address=True, all_or_nothing_batch=False, job_level_scaling=True, temp_jls_for_node_sharing=False, fleet_config={'q1': {'c1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q2': {'c2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.2xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q3': {'c3': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}, 'q4': {'c4-1': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'c5.4xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}, 'c4-2': {'Api': 'create-fleet', 'CapacityType': 'on-demand', 'AllocationStrategy': 'lowest-price', 'Instances': [{'InstanceType': 'p4d.24xlarge'}], 'Networking': {'SubnetIds': ['subnet-0b48ed99988e56110']}}}}, run_instances_overrides={}, create_fleet_overrides={}, clustermgtd_timeout=300, clustermgtd_heartbeat_file_path='/opt/slurm/etc/pcluster/.slurm_plugin/clustermgtd_heartbeat', _boto3_retry=1, _boto3_config={'retries': {'max_attempts': 1, 'mode': 'standard'}}, boto3_config=<botocore.config.Config object at 0x7f11e1f2fd60>, 
logging_config='/opt/parallelcluster/pyenv/versions/3.9.16/envs/node_virtualenv/lib/python3.9/site-packages/slurm_plugin/logging/parallelcluster_resume_logging.conf', head_node_instance_id='i-0145afe796a5e375a') 2023-09-19 12:30:03,049 - [slurm_plugin.resume:_get_slurm_resume] - INFO - Slurm Resume File content: {'jobs': [{'extra': None, 'job_id': 260, 'features': '[(c5.4xlarge)*3&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'oversubscribe': 'OK', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 261, 'features': '[(c5.4xlarge)*3&(p4d.24xlarge)*1]', 'nodes_alloc': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1', 'oversubscribe': 'OK', 'partition': 'q4', 'reservation': None}, {'extra': None, 'job_id': 262, 'features': '[(c5.4xlarge)*3]', 'nodes_alloc': 'q4-dy-c4-1-[1-3]', 'nodes_resume': 'q4-dy-c4-1-[1-3]', 'oversubscribe': 'OK', 'partition': 'q4', 'reservation': None}], 'all_nodes_resume': 'q4-dy-c4-1-[1-3],q4-dy-c4-2-1'} 2023-09-19 12:30:03,054 - [slurm_plugin.common:is_clustermgtd_heartbeat_valid] - INFO - Latest heartbeat from clustermgtd: 2023-09-19 12:29:03.613945+00:00 2023-09-19 12:30:03,054 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: q4-dy-c4-1-[1-3],q4-dy-c4-2-1 2023-09-19 12:30:03,109 - [slurm_plugin.resume:_resume] - INFO - Current state of Slurm nodes to resume: [('q4-dy-c4-1-1', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-2', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-1-3', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP'), ('q4-dy-c4-2-1', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP')] 2023-09-19 12:30:03,135 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: bootstrap-RoleHeadNode-NKATKTSA4IIU 2023-09-19 12:30:03,176 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching best-effort instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3'] 2023-09-19 12:30:03,176 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-1', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'c5.4xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 3, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-19 12:30:06,736 - [slurm_plugin.fleet_manager:launch_ec2_instances] - INFO - Launched the following instances (x3) ['i-083f7c31d25b7430a', 'i-061dc215a811fe1ed', 'i-0a4d69c19b6ad8322'] 2023-09-19 12:30:06,737 - [slurm_plugin.instance_manager:_launch_instances] - INFO - Launching best-effort instances for nodes (x1) ['q4-dy-c4-2-1'] 2023-09-19 12:30:06,737 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. 
Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-19 12:30:07,799 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - Error in CreateFleet request (b0c51c67-eed1-4b15-8872-4e390327aca7): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b. 2023-09-19 12:30:07,900 - [slurm_plugin.instance_manager:_scaling_for_jobs] - INFO - JobID 260 - The nodes_resume list from Slurm Resume File is (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-2-1'] 2023-09-19 12:30:07,900 - [slurm_plugin.instance_manager:_resize_slurm_node_list] - INFO - JobID 260 - Booking already launched instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3']: 2023-09-19 12:30:07,900 - [slurm_plugin.instance_manager:_launch_instances] - INFO - JobID 260 - Launching best-effort instances for nodes (x1) ['q4-dy-c4-2-1'] 2023-09-19 12:30:07,901 - [slurm_plugin.fleet_manager:create_fleet] - INFO - JobID 260 - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'bootstrap-q4-c4-2', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p4d.24xlarge', 'SubnetId': 'subnet-0b48ed99988e56110'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}} 2023-09-19 12:30:08,949 - [slurm_plugin.fleet_manager:_launch_instances] - ERROR - JobID 260 - Error in CreateFleet request (09653663-7ccc-45ac-9366-3fdc0299e86b): InsufficientInstanceCapacity - We currently do not have sufficient p4d.24xlarge capacity in the Availability Zone you requested (us-east-1d). Our system will be working on provisioning additional capacity. You can currently get p4d.24xlarge capacity by not specifying an Availability Zone in your request or choosing us-east-1a, us-east-1b. 
2023-09-19 12:30:09,067 - [slurm_plugin.instance_manager:_update_slurm_node_addrs] - INFO - JobID 260 - Nodes are now configured with instances (x3) ["('q4-dy-c4-1-1', EC2Instance(id='i-083f7c31d25b7430a', private_ip='192.168.111.219', hostname='ip-192-168-111-219', launch_time=datetime.datetime(2023, 9, 19, 12, 30, 5, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-2', EC2Instance(id='i-061dc215a811fe1ed', private_ip='192.168.104.231', hostname='ip-192-168-104-231', launch_time=datetime.datetime(2023, 9, 19, 12, 30, 5, tzinfo=tzlocal()), slurm_node=None))", "('q4-dy-c4-1-3', EC2Instance(id='i-0a4d69c19b6ad8322', private_ip='192.168.109.180', hostname='ip-192-168-109-180', launch_time=datetime.datetime(2023, 9, 19, 12, 30, 5, tzinfo=tzlocal()), slurm_node=None))"] 2023-09-19 12:30:09,067 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 260 - Saving assigned hostnames in DynamoDB 2023-09-19 12:30:09,106 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - JobID 260 - Database update: COMPLETED 2023-09-19 12:30:09,106 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 260 - Updating DNS records for Z09815256PBUS3QRIMRV - bootstrap.pcluster. 2023-09-19 12:30:09,331 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - JobID 260 - DNS records update: COMPLETED 2023-09-19 12:30:09,332 - [slurm_plugin.instance_manager:best_effort_node_assignment] - INFO - JobID 260 - Successful launched partial instances for nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3'] 2023-09-19 12:30:09,332 - [slurm_plugin.instance_manager:_scaling_for_jobs] - INFO - JobID 261 - The nodes_resume list from Slurm Resume File is (x4) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3', 'q4-dy-c4-2-1'] 2023-09-19 12:30:09,332 - [slurm_plugin.instance_manager:_parse_nodes_resume_list] - INFO - JobID 261 - Discarding NodeName already assigned to running instance: q4-dy-c4-1-1 2023-09-19 12:30:09,332 - [slurm_plugin.instance_manager:_parse_nodes_resume_list] - INFO - JobID 261 - Discarding NodeName already assigned to running instance: q4-dy-c4-1-2 2023-09-19 12:30:09,332 - [slurm_plugin.instance_manager:_parse_nodes_resume_list] - INFO - JobID 261 - Discarding NodeName already assigned to running instance: q4-dy-c4-1-3 2023-09-19 12:30:09,332 - [slurm_plugin.instance_manager:_parse_nodes_resume_list] - INFO - JobID 261 - Discarding NodeName already assigned to running instance: q4-dy-c4-2-1 2023-09-19 12:30:09,333 - [slurm_plugin.instance_manager:_scaling_for_jobs] - INFO - JobID 262 - The nodes_resume list from Slurm Resume File is (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3'] 2023-09-19 12:30:09,333 - [slurm_plugin.instance_manager:_parse_nodes_resume_list] - INFO - JobID 262 - Discarding NodeName already assigned to running instance: q4-dy-c4-1-1 2023-09-19 12:30:09,333 - [slurm_plugin.instance_manager:_parse_nodes_resume_list] - INFO - JobID 262 - Discarding NodeName already assigned to running instance: q4-dy-c4-1-2 2023-09-19 12:30:09,333 - [slurm_plugin.instance_manager:_parse_nodes_resume_list] - INFO - JobID 262 - Discarding NodeName already assigned to running instance: q4-dy-c4-1-3 2023-09-19 12:30:09,333 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x3) ['q4-dy-c4-1-1', 'q4-dy-c4-1-2', 'q4-dy-c4-1-3'] 2023-09-19 12:30:09,333 - [slurm_plugin.resume:_resume] - ERROR - Failed to launch following nodes, setting nodes to DOWN: (x1) ['q4-dy-c4-2-1'] 2023-09-19 12:30:09,333 - 
[slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x1) ['q4-dy-c4-2-1'] with reason: (Code:InsufficientInstanceCapacity)Failure when resuming nodes
2023-09-19 12:30:09,369 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.
```
Signed-off-by: Luca Carrogu <[email protected]>
Force-pushed from 92dc440 to 2819732.
Avoid setting nodes to DOWN, and hence avoid calling Slurm scontrol update, when the node list is empty. The avoided log line is:
```
2023-09-19 10:56:39,439 - [slurm_plugin.resume:_handle_failed_nodes] - INFO - Setting following failed nodes into DOWN state (x0) [] with reason: (Code:LimitedInstanceCapacity)Failure when resuming nodes
```
Signed-off-by: Luca Carrogu <[email protected]>
Signed-off-by: Luca Carrogu <[email protected]>
Force-pushed from 2819732 to 2107880.
Remove the temporary resume setting used during development of the node-sharing job-level scaling feature. Signed-off-by: Luca Carrogu <[email protected]>
reason,
e,
)
if node_list: |
Out of curiosity, which code path is resulting in us calling `_handle_failed_nodes` with an empty node list?
When we reset the node failures with `_reset_failed_nodes`, it can happen that the error key is still there but there are no longer any nodes associated with that error.
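To make the scenario concrete, here is a minimal sketch (the data layout and names are assumptions, not the actual plugin code) of how a reset can leave an error key behind with no nodes attached:

```python
# failed_nodes maps an error code to the set of nodes that hit that error
# (illustrative layout only).
failed_nodes = {"InsufficientInstanceCapacity": {"q4-dy-c4-1-1", "q4-dy-c4-2-1"}}

def reset_failed_nodes(launched_nodes):
    """Drop nodes that were eventually launched/assigned in a later job iteration."""
    for nodes in failed_nodes.values():
        nodes -= launched_nodes  # the error key survives even if its node set empties

reset_failed_nodes({"q4-dy-c4-1-1", "q4-dy-c4-2-1"})

# Later, per-error handling would run with an empty list unless guarded:
for error_code, nodes in failed_nodes.items():
    node_list = sorted(nodes)
    if node_list:  # the guard added in this PR skips the no-op scontrol update
        print(f"setting {node_list} DOWN with reason {error_code}")
```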
single_nodes = list(dict.fromkeys([job.nodes_resume[0] for job in job_list]))
self._add_instances_for_nodes( |
Any reason to use `list(dict.fromkeys(...))` instead of `list(set(...))`? We're expecting only a single node, right?
I wanted to preserve the order
I think we'll have only a single node. `dict.fromkeys` is fine, I just wanted to understand if there was a reason other than preserving the order.
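For reference, a quick illustration of the difference (plain Python behaviour, not project code): `dict.fromkeys` deduplicates while keeping insertion order, whereas `set` gives no ordering guarantee.

```python
nodes = ["q4-dy-c4-1-3", "q4-dy-c4-1-1", "q4-dy-c4-1-3", "q4-dy-c4-1-2"]

print(list(dict.fromkeys(nodes)))  # ['q4-dy-c4-1-3', 'q4-dy-c4-1-1', 'q4-dy-c4-1-2'] -- duplicates dropped, order kept
print(list(set(nodes)))            # same unique names, but in no guaranteed order
```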
@@ -660,36 +657,24 @@ def _add_instances_for_resume_file(
self._clear_unused_launched_instances()

self._scaling_for_jobs_single_node(
job_list=slurm_resume_data.jobs_single_node_no_oversubscribe,
job_list=slurm_resume_data.jobs_single_node_no_oversubscribe |
At this stage, can SlurmResumeData contain a property `jobs_single_node` that already combines both "oversubscribe" and "no oversubscribe"?
Absolutely, I think we can drop the distinction between "oversubscribe" and "no oversubscribe" now that we are able to manage both types. I'm considering this for the next PR.
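A minimal sketch of what such a property could look like (a simplified stand-in for the real SlurmResumeData, with plain lists instead of job objects); the same shape would also cover the multi-node case discussed below:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SlurmResumeDataSketch:
    """Simplified stand-in: the real class holds job objects rather than strings."""
    jobs_single_node_no_oversubscribe: List[str] = field(default_factory=list)
    jobs_single_node_oversubscribe: List[str] = field(default_factory=list)
    jobs_multi_node_no_oversubscribe: List[str] = field(default_factory=list)
    jobs_multi_node_oversubscribe: List[str] = field(default_factory=list)

    @property
    def jobs_single_node(self) -> List[str]:
        """All single-node jobs, regardless of oversubscribe mode."""
        return self.jobs_single_node_no_oversubscribe + self.jobs_single_node_oversubscribe

    @property
    def jobs_multi_node(self) -> List[str]:
        """All multi-node jobs, regardless of oversubscribe mode."""
        return self.jobs_multi_node_no_oversubscribe + self.jobs_multi_node_oversubscribe
```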
job_list=slurm_resume_data.jobs_multi_node_no_oversubscribe,
node_list=slurm_resume_data.multi_node_no_oversubscribe,
job_list=slurm_resume_data.jobs_multi_node_no_oversubscribe
+ slurm_resume_data.jobs_multi_node_oversubscribe, |
Same here: having a `jobs_multi_node` property in SlurmResumeData.
yes, see other comment
self._update_dict(self.nodes_assigned_to_instances, nodes_resume_mapping)
self._reset_failed_nodes(set(nodes_resume_list)) |
NIT: It seems that both best-effort and a successful all-or-nothing handle the successfully launched nodes roughly the same way. Maybe we can have shared behaviour for both cases?
def handle_successfully_launched_nodes:
- Update the node mapping dictionary
- Reset the failed nodes
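A rough sketch of that shared step, written free-standing for illustration (the helper name and the exact shape of the arguments are assumptions, though the two operations mirror the diff above):

```python
def handle_successfully_launched_nodes(nodes_assigned_to_instances, failed_nodes, nodes_resume_mapping):
    """Shared bookkeeping for nodes that ended up with an instance, whether the
    launch was best-effort or all-or-nothing."""
    # 1. Update the node -> instance mapping dictionary
    nodes_assigned_to_instances.update(nodes_resume_mapping)
    # 2. Reset the failure bookkeeping for the now-launched nodes
    launched = set(nodes_resume_mapping)
    for nodes in failed_nodes.values():
        nodes -= launched
```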
not really, they are different. Let's sync on this
Fix missing parameter assign_node_batch_size for _add_instances_for_nodes. Signed-off-by: Luca Carrogu <[email protected]>
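For context on what this parameter governs, a generic sketch of batched processing (the helper and names are illustrative, not the plugin's actual implementation): node-to-instance assignments are handled in chunks of at most assign_node_batch_size items.

```python
from itertools import islice

def batched(iterable, batch_size):
    """Yield successive chunks of at most batch_size items."""
    it = iter(iterable)
    while chunk := list(islice(it, batch_size)):
        yield chunk

# Hypothetical usage: process node assignments in bounded batches, which is the
# kind of cap a parameter like assign_node_batch_size places on each call.
nodes_to_assign = [f"q4-dy-c4-1-{i}" for i in range(1, 8)]
for node_batch in batched(nodes_to_assign, batch_size=3):
    print(node_batch)  # stand-in for the per-batch assignment work
```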
Description of changes
Reset the failure for nodes that were launched successfully and for which it was possible to assign an instance.
This covers the node-sharing (oversubscribe) case, where nodes that failed in one job call are actually launched (and assigned to instances) in a later iteration of the job loop.
Add job-level scaling for the node-sharing case.
Before entering the job loop, perform the same optimizations done for the exclusive-job case:
* scale best-effort for all single-node jobs
* scale all for all multi-node jobs
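A simplified sketch of that pre-loop pass, just to illustrate the flow (all names are hypothetical; `launch` stands in for the create-fleet calls, and whether a launch is best-effort or all-or-nothing follows the all_or_nothing_batch setting):

```python
def scale_before_job_loop(jobs, launch):
    """Launch capacity up front so the per-job loop only has to 'book' it."""
    single = [job for job in jobs if len(job["nodes_resume"]) == 1]
    multi = [job for job in jobs if len(job["nodes_resume"]) > 1]

    # Single-node jobs: one deduplicated, best-effort launch; partial capacity
    # is still useful because each of these jobs needs exactly one node.
    single_nodes = list(dict.fromkeys(job["nodes_resume"][0] for job in single))
    launched = launch(single_nodes)

    # Multi-node jobs: launch for all of their requested nodes.
    multi_nodes = list(dict.fromkeys(n for job in multi for n in job["nodes_resume"]))
    launched.update(launch(multi_nodes))

    return launched  # node name -> instance, reused ("booked") by the per-job loop


# Toy usage: pretend every requested node gets an instance.
jobs = [
    {"nodes_resume": ["q4-dy-c4-1-1", "q4-dy-c4-1-2", "q4-dy-c4-1-3"]},
    {"nodes_resume": ["q4-dy-c4-1-1"]},
]
print(scale_before_job_loop(jobs, lambda nodes: {n: f"instance-for-{n}" for n in nodes}))
```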
Avoid setting nodes to DOWN, and hence avoid calling Slurm scontrol update, when the node list is empty; the avoided log line is shown in the corresponding commit message above.
Tests
Manual tests were performed on a running cluster with the submission commands shown in the commit message above, where there is capacity for c5.4xlarge but not for p4d.24xlarge. The two scaling strategies were tested:
all_or_nothing_batch = true
expected nodes running at the end of the resume call: (x3) q4-dy-c4-1-*
resume log:
all_or_nothing_batch = false
expected nodes running at the end of the resume call: (x3) q4-dy-c4-1-*
resume log:
References
n/a
Checklist
Target branch: develop. Add the branch name as prefix in the PR title (e.g. [release-3.6]).
Please review the guidelines for contributing and Pull Request Instructions.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.