[Bug]: aws_sagemaker_endpoint maximum_execution_timeout_in_seconds value is ignored #39040
Comments
@r-archer37 I am not too familiar with SageMaker, but looking at the documentation it seems that The timeout error you see pertains specifically to the creation of the endpoint itself and is dictated by the provider. According to the code, the wait time for the endpoint to come into service is hardcoded to 10 minutes. The same goes for waiting for endpoint deletion. I am not sure how long creating an endpoint generally takes, but we have the option to increase the timeout to a value within reason. Is there any way you can try to create the same endpoint in the Management Console or CLI, just to see how much longer than 10 minutes it takes? |
I'm still facing the same issue when trying to create an endpoint through Terraform (using the latest version of the AWS provider). It takes more than 10 minutes, and as a result, the GitHub Actions workflow fails. However, the endpoint is already created on the AWS side. If I run my Terraform plan again, it attempts to recreate the resources because the apply failed and the endpoint was not saved in the state, even though the resource already exists. |
Hmmm, I tried to match the acceptance test case to @r-archer37's config as closely as possible, but I still can't get it to time out. Unless someone can create the endpoint manually and measure the time, it might be best to just double the timeout for the endpoint to come into service from 10 minutes to 20 minutes. I'll submit a PR shortly. |
Hi @acwwat, thanks for your reply! Also, sorry that mine is so delayed. Referring back to your first comment, I wanted to be sure I was being clear: the
I don't think this is the same timeout you're referring to, or if it is, then it should be overridden by the value we're supplying to this argument, and increasing it would not help. I am still able to reproduce the error by going up versions from 5.62.0 to 5.63.1, and to help demonstrate that, I am attaching two files showing as much. In each of them you can see the deprecation warning that demonstrates which version of the provider is in use. Additionally, in the file with the error you can see that before running the apply, we upgrade to 5.63.1. |
@r-archer37 Thanks for your reply. To confirm, I understand that you are changing My earlier tests were aligned with configuration that includes these two arguments in the It also looks like other folks are having a similar timeout issue, but for creation, which took slightly longer than 10 minutes to complete. I would thus recommend that we see if 20 minutes is sufficient for now. If not, I think we should look into configurable timeouts for more flexibility, since YMMV. I hope this assessment of the situation is correct, but let me know if I am not viewing it from the right angle. Thanks. |
@acwwat Sorry for the confusion. I was referring to the creation of the endpoint, not the blue-green setting. The first two times I tried to create the endpoint through GH Actions, it took more than 10 minutes and timed out. However, the API call was still ongoing on the AWS side, so after a few moments, the endpoint was created. Unfortunately, in the Terraform state, it was marked as tainted. As a result, if I retried the job, it would attempt to create the endpoint again, but since it was already present in AWS, the attempt would fail. On the third attempt, the creation of the endpoint took just 7 minutes. I’m not sure why it took longer on the first two occasions and was quicker on the third one. I guess it's luck? Everything was tried on us-east-1 ... |
@admirationmr Thanks, yours is just another scenario that triggers the same problem with the timeout being too short :) #39090 is another similar case I spotted while working on the PR. Perhaps the 10-minute wait for creation is borderline and YMMV depending on time of use, region, endpoint configuration, etc. We probably won't know, so we just need to pad the timeout a bit. For updates, 10 minutes is likely not sufficient if existing compute resources are being shifted/rotated. I hope 20 minutes can address both cases, but if it doesn't, then we'll need configurable timeouts to address it once and for all. Hopefully the low-effort fix is sufficient for the time being. |
Thanks @acwwat, yes you have the correct understanding of my situation. Now please help me check my own understanding! It looks like the value for But it seems you weren't able to reproduce my error? You can see in the text files in my last comment that with v5.62.0 my endpoint took ~18 minutes to update, but that with v5.63.1 it timed out right at 10 minutes. Regardless of the specific value, I believe this is because we are overriding the user-supplied Apologies for not being very go-literate, but can we confirm via code whether that parameter is being supplied correctly? |
I did some GPT-assisted digging and I am more confident in the conclusion that the original title I gave this report is correct: the value supplied in Instead, This doesn't align with AWS or Terraform's documentation, and should be considered a bug. |
Terraform Core Version
1.19
AWS Provider Version
5.63.0, 5.63.1
Affected Resource(s)
aws_sagemaker_endpoint
Expected Behavior
Modifying a SageMaker endpoint should time out if maximum_execution_timeout_in_seconds is reached, and not before.
Actual Behavior
Modifying a SageMaker endpoint times out after 600 seconds, regardless of the value of maximum_execution_timeout_in_seconds.
Relevant Error/Panic Output Snippet
Terraform Configuration Files
Sample
main.tf
Sample
vars.tf
Sample
terragrunt.hcl
Steps to Reproduce
Using AWS provider v5.63.0 or v5.63.1, run an update of a SageMaker endpoint with Blue-Green deploy configured like so: