Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved fleet error handling + smaller fixes #388

Merged
merged 5 commits into from
Mar 22, 2023
Merged

Conversation

alexander-veit
Copy link
Member

  • This PR Increases the metrics interval collection by Cloudwatch to 120s (from 60s). When many and long workflows are run a 60s interval is too costly. The plot_metrics command has been adjusted accordingly.
  • We enforce a new version of Benchmark, which resolves issue Tibanna instance type error with snakemake #382.
  • instance_type now takes precedence over cpu,mem. When both is specified, cpu,mem will be ignored and Benchmark will not be used. To use Benchmark, instance_type may not be specified in the job description.
  • Added instance termination commands at various places in the check_task lambda as failsafe in case the instance does not correctly shut down after the workflow is run.
  • Improved fleet error handling by looking at all returned error codes

@alexander-veit alexander-veit changed the title Error handling Improved fleet error handling + smaller fixes Mar 21, 2023
@@ -77,6 +77,8 @@ def run(self):
if does_key_exist(bucket_name, job_aborted):
try:
self.handle_postrun_json(bucket_name, jobid, self.input_json, public_read=public_postrun_json)
# Instance should already be terminated here. Sending a second signal just in case
boto3.client('ec2').terminate_instances(InstanceIds=[instance_id])
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just adding termination commands here. Since the check_task lambda is independent from the actual workflow, that should be good enough in my opinion.

@@ -19,15 +19,15 @@
"measurement": [
"usage_active"
],
"metrics_collection_interval": 60,
"metrics_collection_interval": 120,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose 120s instead of a larger interval here. For troubleshooting workflows (i.e., memory or storage issues) a 5 min interval might not collect enough data if there are sudden spikes.

Copy link
Member

@willronchetti willronchetti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have 2 important comments that should be addressed before merge

Comment on lines +570 to +574
elif num_unique_errors == 1 and 'InvalidFleetConfiguration' in error_codes:
# This error code includes the "Your requested instance type (xxx) is not supported in your requested Availability Zone (xxx)" error
# In this case there must be an issue with the general setup, otherwise we would get additional error codes, e.g., InsufficientInstanceCapacity
self.delete_launch_template()
raise Exception(f"Invalid fleet configuration. Result from create_fleet command: {json.dumps(fleet_result)}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I'm understanding the desired behavior here. I think a valid configuration could see only this error, and in fact may be a semi-common case in which you don't want to fail? I think you only want to fail in this case if you got invalid fleet configuration for every instance type in every available subnet. So this would fail unnecessarily quite frequently?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When there is no successful instance launch (checked first) and we get into the error branch, I would expect that the reason has to do with Spot availability or capacity. I would not expect only InvalidFleetConfiguration errors in the response.

continue

elif 'InvalidLaunchTemplate' in error_codes and invalid_launch_template_retries >= 5:
self.delete_launch_template()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be very clear, deleting the launch template has the effect of deleting the spot request, right? It does not appear to do so directly and it is not obvious to me that is what happens. Seems like you may also need to call: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_CancelSpotFleetRequests.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleting the launch template does not delete the fleet. I do this separately in l. 558 whenever there is any error (and no instance)

@alexander-veit alexander-veit merged commit d9da66d into master Mar 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants