-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved fleet error handling + smaller fixes #388
Conversation
@@ -77,6 +77,8 @@ def run(self): | |||
if does_key_exist(bucket_name, job_aborted): | |||
try: | |||
self.handle_postrun_json(bucket_name, jobid, self.input_json, public_read=public_postrun_json) | |||
# Instance should already be terminated here. Sending a second signal just in case | |||
boto3.client('ec2').terminate_instances(InstanceIds=[instance_id]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just adding termination commands here. Since the check_task
lambda is independent from the actual workflow, that should be good enough in my opinion.
@@ -19,15 +19,15 @@ | |||
"measurement": [ | |||
"usage_active" | |||
], | |||
"metrics_collection_interval": 60, | |||
"metrics_collection_interval": 120, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose 120s instead of a larger interval here. For troubleshooting workflows (i.e., memory or storage issues) a 5 min interval might not collect enough data if there are sudden spikes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have 2 important comments that should be addressed before merge
elif num_unique_errors == 1 and 'InvalidFleetConfiguration' in error_codes: | ||
# This error code includes the "Your requested instance type (xxx) is not supported in your requested Availability Zone (xxx)" error | ||
# In this case there must be an issue with the general setup, otherwise we would get additional error codes, e.g., InsufficientInstanceCapacity | ||
self.delete_launch_template() | ||
raise Exception(f"Invalid fleet configuration. Result from create_fleet command: {json.dumps(fleet_result)}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I'm understanding the desired behavior here. I think a valid configuration could see only this error, and in fact may be a semi-common case in which you don't want to fail? I think you only want to fail in this case if you got invalid fleet configuration for every instance type in every available subnet. So this would fail unnecessarily quite frequently?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When there is no successful instance launch (checked first) and we get into the error branch, I would expect that the reason has to do with Spot availability or capacity. I would not expect only InvalidFleetConfiguration
errors in the response.
continue | ||
|
||
elif 'InvalidLaunchTemplate' in error_codes and invalid_launch_template_retries >= 5: | ||
self.delete_launch_template() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be very clear, deleting the launch template has the effect of deleting the spot request, right? It does not appear to do so directly and it is not obvious to me that is what happens. Seems like you may also need to call: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_CancelSpotFleetRequests.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleting the launch template does not delete the fleet. I do this separately in l. 558 whenever there is any error (and no instance)
plot_metrics
command has been adjusted accordingly.instance_type
now takes precedence overcpu
,mem
. When both is specified,cpu
,mem
will be ignored and Benchmark will not be used. To use Benchmark,instance_type
may not be specified in the job description.check_task
lambda as failsafe in case the instance does not correctly shut down after the workflow is run.