Improved fleet error handling + smaller fixes #388

alexander-veit · 2023-03-21T10:40:16Z

This PR increases the CloudWatch metrics collection interval to 120s (from 60s). When many long workflows are run, a 60s interval is too costly. The plot_metrics command has been adjusted accordingly.
We enforce a new version of Benchmark, which resolves issue "Tibanna instance type error with snakemake" #382.
instance_type now takes precedence over cpu and mem. When both are specified, cpu and mem are ignored and Benchmark is not used. To use Benchmark, instance_type must not be specified in the job description.
Added instance termination commands at various places in the check_task lambda as a failsafe in case the instance does not shut down correctly after the workflow is run.
Improved fleet error handling by looking at all returned error codes.
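
The precedence rule for instance_type over cpu and mem can be sketched as follows. This is a minimal illustration only, not the code from this PR: the function name `resolve_instance_selection` and the shape of the returned dict are hypothetical, and `config` stands in for the relevant part of the job description.

```python
def resolve_instance_selection(config: dict) -> dict:
    """Apply the precedence rule described above to a job config.

    Hypothetical helper: `config` stands in for the job description's
    config section; the returned keys are chosen for illustration.
    """
    instance_type = config.get("instance_type")
    if instance_type:
        # instance_type wins: cpu and mem are ignored, Benchmark is skipped.
        return {"instance_type": instance_type, "use_benchmark": False}
    # No instance_type given: Benchmark is eligible; cpu/mem pass through.
    return {
        "instance_type": None,
        "use_benchmark": True,
        "cpu": config.get("cpu"),
        "mem": config.get("mem"),
    }


# Both specified: cpu and mem are dropped.
print(resolve_instance_selection({"instance_type": "t3.large", "cpu": 4, "mem": 16}))
```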

alexander-veit · 2023-03-21T10:45:26Z

tibanna/check_task.py

@@ -77,6 +77,8 @@ def run(self):
        if does_key_exist(bucket_name, job_aborted):
            try:
                self.handle_postrun_json(bucket_name, jobid, self.input_json, public_read=public_postrun_json)
+                # Instance should already be terminated here. Sending a second signal just in case
+                boto3.client('ec2').terminate_instances(InstanceIds=[instance_id]) 


I am just adding termination commands here. Since the check_task lambda is independent from the actual workflow, that should be good enough in my opinion.
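
One way to keep such a failsafe from breaking the lambda itself is to wrap the call so that it never raises. The sketch below is an assumption about how this could look, not what the PR does: `terminate_instance_failsafe` is a hypothetical helper, and the EC2 client is passed in (for example `boto3.client('ec2')`) so the function can be exercised without AWS access.

```python
import logging

logger = logging.getLogger(__name__)


def terminate_instance_failsafe(ec2_client, instance_id):
    """Send a best-effort termination signal and never raise to the caller.

    Hypothetical helper. `ec2_client` is expected to behave like
    boto3.client('ec2'). Returns True if the call went through.
    """
    if not instance_id:
        return False
    try:
        # Repeating the call for an instance that is already shutting down
        # or terminated is harmless, so a second signal is safe to send.
        ec2_client.terminate_instances(InstanceIds=[instance_id])
        return True
    except Exception as e:
        # For example, the instance ID may no longer be known to EC2.
        logger.warning("Failsafe termination of %s failed: %s", instance_id, e)
        return False
```

A call site would then read `terminate_instance_failsafe(boto3.client('ec2'), instance_id)`, and a failure of the second signal would only be logged.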

alexander-veit · 2023-03-21T10:47:26Z

awsf3/cloudwatch_agent_config.json

@@ -19,15 +19,15 @@
 				"measurement": [
 					"usage_active"
 				],
-				"metrics_collection_interval": 60,
+				"metrics_collection_interval": 120,


I chose 120s rather than a larger interval here. When troubleshooting workflows (e.g., memory or storage issues), a 5 min interval might not collect enough data if there are sudden spikes.
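
The trade-off can be made concrete with a little arithmetic. The sketch below assumes simple point sampling at a fixed interval, which is a simplification of how the CloudWatch agent works, and uses a 180s spike purely as an illustrative value.

```python
def datapoints_per_day(interval_s: int) -> int:
    """Number of samples per metric per day at a fixed collection interval."""
    return 24 * 60 * 60 // interval_s


def spike_always_sampled(spike_s: int, interval_s: int) -> bool:
    """Under point sampling, a spike is guaranteed to contain at least one
    sample only if it lasts at least one full interval."""
    return spike_s >= interval_s


for interval in (60, 120, 300):
    print(interval, datapoints_per_day(interval), spike_always_sampled(180, interval))
# 60 1440 True
# 120 720 True
# 300 288 False
```

Going from 60s to 120s halves the number of data points, while a 3 minute spike is still guaranteed to be seen; at 300s it could be missed entirely.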

willronchetti

I have 2 important comments that should be addressed before merge

willronchetti · 2023-03-21T12:29:40Z

tibanna/ec2_utils.py

+                elif num_unique_errors == 1 and 'InvalidFleetConfiguration' in error_codes:
+                    # This error code includes the "Your requested instance type (xxx) is not supported in your requested Availability Zone (xxx)" error
+                    # In this case there must be an issue with the general setup, otherwise we would get additional error codes, e.g., InsufficientInstanceCapacity
+                    self.delete_launch_template()
+                    raise Exception(f"Invalid fleet configuration. Result from create_fleet command: {json.dumps(fleet_result)}")


Not sure I'm understanding the desired behavior here. I think a valid configuration could see only this error, and in fact may be a semi-common case in which you don't want to fail? I think you only want to fail in this case if you got invalid fleet configuration for every instance type in every available subnet. So this would fail unnecessarily quite frequently?

When there is no successful instance launch (checked first) and we get into the error branch, I would expect that the reason has to do with Spot availability or capacity. I would not expect only InvalidFleetConfiguration errors in the response.
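
The condition under discussion can be sketched in isolation. The diff does not show how `error_codes` and `num_unique_errors` are computed, so the helper below is an assumption: it reads the `Errors` and `Instances` lists from a create_fleet result and treats the set of unique error codes as the basis for the decision. The name `summarize_fleet_errors` is hypothetical.

```python
def summarize_fleet_errors(fleet_result: dict) -> dict:
    """Collect the unique error codes from a create_fleet result.

    Hypothetical helper; `fleet_result` is assumed to carry the
    'Instances' and 'Errors' lists of the EC2 CreateFleet response.
    """
    errors = fleet_result.get("Errors", [])
    error_codes = {e["ErrorCode"] for e in errors if e.get("ErrorCode")}
    return {
        "has_instance": bool(fleet_result.get("Instances")),
        "error_codes": error_codes,
        # Equivalent to: num_unique_errors == 1 and
        # 'InvalidFleetConfiguration' in error_codes
        "only_invalid_config": error_codes == {"InvalidFleetConfiguration"},
    }


# Mixed errors: a capacity problem is present, so this is not treated
# as a pure configuration failure.
mixed = {
    "Instances": [],
    "Errors": [
        {"ErrorCode": "InvalidFleetConfiguration"},
        {"ErrorCode": "InsufficientInstanceCapacity"},
    ],
}
print(summarize_fleet_errors(mixed)["only_invalid_config"])  # False
```

With this reading, the branch fails only when every returned error is InvalidFleetConfiguration and no instance was launched.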

willronchetti · 2023-03-21T12:46:34Z

tibanna/ec2_utils.py

+                    continue
+
+                elif 'InvalidLaunchTemplate' in error_codes and invalid_launch_template_retries >= 5:
+                    self.delete_launch_template()


To be very clear, deleting the launch template has the effect of deleting the spot request, right? It does not appear to do so directly and it is not obvious to me that is what happens. Seems like you may also need to call: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_CancelSpotFleetRequests.html

Deleting the launch template does not delete the fleet. I do this separately in line 558 whenever there is any error (and no instance).
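
A sketch of the two-step cleanup, under assumptions: the fleet is taken to have been created with create_fleet (as the exception message in the diff suggests), in which case the matching boto3 call is `delete_fleets` rather than `cancel_spot_fleet_requests`. The helper name `cleanup_failed_fleet` and its parameters are hypothetical and do not reflect the actual code at line 558.

```python
def cleanup_failed_fleet(ec2_client, fleet_id, launch_template_name):
    """Remove both the fleet and its launch template after a failed launch.

    Hypothetical helper. `ec2_client` is expected to behave like
    boto3.client('ec2'). Deleting the launch template alone leaves the
    fleet in place, so the two resources are deleted explicitly.
    """
    if fleet_id:
        # TerminateInstances=True also stops anything the fleet did launch.
        ec2_client.delete_fleets(FleetIds=[fleet_id], TerminateInstances=True)
    if launch_template_name:
        ec2_client.delete_launch_template(LaunchTemplateName=launch_template_name)
```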

alexander-veit added 4 commits March 20, 2023 11:18

Improved error handing

48f31c9

Version bump

ec92925

Fix tests

9c7b634

Enforce newer Benchmark

2b41267

alexander-veit changed the title ~~Error handling~~ Improved fleet error handling + smaller fixes Mar 21, 2023

Revert idle timeout parameters

a157003

alexander-veit requested a review from willronchetti March 21, 2023 10:48

willronchetti reviewed Mar 21, 2023

willronchetti approved these changes Mar 21, 2023

alexander-veit merged commit d9da66d into master Mar 22, 2023
