-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Error creating resources: unable to wait for service state change: context deadline exceeded #1054
Comments
hey, @AaronFriel! 👋 thanks for submitting this! I have performed an evaluation of this issue, and it is currently unclear how to reproduce it :( I will be checking internally to see if we're able to confirm whether these API endpoints were taking longer than usual to set a service up
looks like this issue is irreproducible this way, everything works flawlessly on our side—we use Terraform internally in Aiven, and have never received this kind of feedback from our internal customers, nor has this ever been submitted to our GitHub Issues previously by anyone else
as for that, we're not sure what would be the impact for us in adapting Pulumi as a way to perform acceptance tests as a Hashicorp Partner, we are adhering to their documentation in regards to testing the provider, and we have full acceptance tests coverage for all the resources provided via our Terraform Provider those are daily run jobs, that perform testing of all kind of resources that our provider implements, you should be able to see the log of these runs here: https://github.com/aiven/terraform-provider-aiven/actions/workflows/acceptance-tests.yml n.b. they currently fail for a while due to API flakiness, but it seems not to be related to your case since the resources aren't created at all during this failure also, looking forward to hear back from you and your ideas on reproducing this in a more reliable manner |
Hey @Serpentiel, Thanks for looking into this. I have an email thread ongoing with folks at Aiven as well, just to make sure we don't get our wires crossed. Re: adapting our code, I mean having the same or identical test but in HCL/TF. Nothing about how Pulumi works should - or could, as far as I know - change how long a network call can take. I believe the network timeout setting you've set is much too low, and timeouts should be set to very long values that are a multiple of 100th percentile times. Looking through your acceptance test history, I can find several tests that ran for in excess of 10 to 15 minutes, though not always. If the timeout is 20 minutes, your users are a bad roll of the die away from failing and resulting in unnecessary user pain. Nothing about this recommendation to increase timeouts should violate any best practices from Hashicorp. Best, Friel |
thanks for your reply, @AaronFriel!
I'm in that email thread too :) we are, of course, able to increase the timeout—is that something you'd be able to test on your own with the Pulumi setup if I prepare a branch with the updated timeout? |
Yes, though I think we can do that without you making a branch, we have a system for vendoring providers in and applying patches. If that suffices to validate this for you folks, we can do that. |
hey again! great to hear that, looking forward then for hearing back about this matter once you guys check it! also, is there anything to do from our side to provide with an ability to configure timeouts? iirc, timeouts are also configurable via Terraform manifests |
Hi @AaronFriel pls let us know if configurable timeout is an acceptable solution to this issue. thanks. |
Hey @Serpentiel sorry for not getting back to you. By configure timeouts, I'm not sure. I think two options are:
|
hey, @AaronFriel! 👋
Terraform protocol provides with a way to define custom timeouts when creating resources all our resources now support those user-definable timeouts; when using from user-mode, it looks like this: resource "aiven_pg" "pg1" {
project = data.aiven_project.m3db-project1.project
cloud_name = "google-europe-west1"
service_name = "alekses-postgres1"
plan = "startup-4"
pg_user_config {
pg_version = 14
}
timeouts {
create = "30m"
read = "30m"
}
} could you please check if this will work for your case?
timeout values are declared here in the code, will this be sufficient for you to be able to patch them during compile time? iirc, you mentioned something about being able to patch some things on compile time we could as well move the thank you! |
Moving default timeout minutes to a global would be perfect - that's easier for us to patch in one place. |
As a maintainer of Pulumi Aiven, we run tests against the Aiven source code in our provider regularly on cronjobs and on pull requests. Beginning in November, we began seeing intermittent failures with the error:
This error occurred 20 minutes after the start of execution, and it appears this is due to a timeout set in the Aiven provider.
Suggested resolution
Pulumi's perspective: we recommend removing or significantly increasing timeouts to at least a large multiple of observed 100th percentile times.
Hypothetically, setting latencies to values such as the 99th percentile ensures that 1 in 100 programs with one resource will fail, and a program managing ten resources fails 1 in 10 times (1-.9910). When a provider times out, the result is usually an inconsistent state left in the service it manages and this leads to surprising behavior and manual steps to resolve. We believe infra-as-code tools should optimize for reducing manual work, even at the expense of potentially long running CI jobs.
Because of this, Pulumi defaults to not setting a timeout on the providers we author, relying on the user or the CI environment to decide when they have waited too long. This helps keep Pulumi resilient to long tail latencies.
Steps to reproduce
I believe that either infra-as-code tool would reproduce this error, if you're more familiar with HCL, it's possible to adapt our acceptance test code below to create a similar TF configuration.
This is one of the examples tests that we saw intermittent errors on:
From https://github.com/pulumi/pulumi-aiven/blob/master/examples/service/index.ts
To get started with this example, install Pulumi and then in an empty directory run:
This will create the TypeScript project and its dependencies. You'll need to set the environment variables
AIVEN_PROJECT_NAME
andAIVEN_TOKEN
, and then you can run:Expected behavior
The program should create resources, which can then be destroyed via
pulumi destroy
. The total billing time for these resources should be close to the smallest possible billing unit (1 hour, 1 minute, etc.)Actual behavior
Intermittently, as we observed on the following dates, the program will fail with the error "unable to wait for service state change: context deadline exceeded". In the month of November, we saw this error thirteen times, involving resources created at:
These resources then created outside of the test loop and we were billed for the remainder of the month.
The text was updated successfully, but these errors were encountered: