Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error creating resources: unable to wait for service state change: context deadline exceeded #1054

Closed
AaronFriel opened this issue Feb 8, 2023 · 9 comments · Fixed by #1098
Assignees
Labels
bug Something isn't working

Comments

@AaronFriel
Copy link

AaronFriel commented Feb 8, 2023

As a maintainer of Pulumi Aiven, we run tests against the Aiven source code in our provider regularly on cronjobs and on pull requests. Beginning in November, we began seeing intermittent failures with the error:

unable to wait for service state change: context deadline exceeded

This error occurred 20 minutes after the start of execution, and it appears this is due to a timeout set in the Aiven provider.

Suggested resolution

Pulumi's perspective: we recommend removing or significantly increasing timeouts to at least a large multiple of observed 100th percentile times.

Hypothetically, setting latencies to values such as the 99th percentile ensures that 1 in 100 programs with one resource will fail, and a program managing ten resources fails 1 in 10 times (1-.9910). When a provider times out, the result is usually an inconsistent state left in the service it manages and this leads to surprising behavior and manual steps to resolve. We believe infra-as-code tools should optimize for reducing manual work, even at the expense of potentially long running CI jobs.

Because of this, Pulumi defaults to not setting a timeout on the providers we author, relying on the user or the CI environment to decide when they have waited too long. This helps keep Pulumi resilient to long tail latencies.

Steps to reproduce

I believe that either infra-as-code tool would reproduce this error, if you're more familiar with HCL, it's possible to adapt our acceptance test code below to create a similar TF configuration.

This is one of the examples tests that we saw intermittent errors on:

From https://github.com/pulumi/pulumi-aiven/blob/master/examples/service/index.ts

import * as pulumi from "@pulumi/pulumi";
import * as aiven from "@pulumi/aiven";

const config = new pulumi.Config();
const projectName = config.require("projectName");


const service = new aiven.Grafana("my-new-service", {
    project: projectName,
    cloudName: "google-europe-west1",
    plan:"startup-4",
    grafanaUserConfig: {
        publicAccess: {
            grafana: "true",
        }
    }
});

export const serviceName = service.serviceName;

To get started with this example, install Pulumi and then in an empty directory run:

pulumi new https://github.com/pulumi/pulumi-aiven/examples/service

This will create the TypeScript project and its dependencies. You'll need to set the environment variables AIVEN_PROJECT_NAME and AIVEN_TOKEN, and then you can run:

pulumi up

Expected behavior

The program should create resources, which can then be destroyed via pulumi destroy. The total billing time for these resources should be close to the smallest possible billing unit (1 hour, 1 minute, etc.)

Actual behavior

Intermittently, as we observed on the following dates, the program will fail with the error "unable to wait for service state change: context deadline exceeded". In the month of November, we saw this error thirteen times, involving resources created at:

  • 2022-11-02T06:17Z
  • 2022-11-04T06:11Z
  • 2022-11-07T06:13Z
  • 2022-11-08T06:11Z
  • 2022-11-09T06:13Z
  • 2022-11-10T06:15Z
  • 2022-11-11T06:12Z
  • ...

These resources then created outside of the test loop and we were billed for the remainder of the month.

@Serpentiel Serpentiel self-assigned this Feb 16, 2023
@Serpentiel Serpentiel added needs fix This issue has been confirmed and requires to be worked on needs info This issue is missing the necessary information and removed needs fix This issue has been confirmed and requires to be worked on labels Feb 16, 2023
@Serpentiel
Copy link
Contributor

Serpentiel commented Feb 16, 2023

hey, @AaronFriel! 👋

thanks for submitting this!

I have performed an evaluation of this issue, and it is currently unclear how to reproduce it :(

I will be checking internally to see if we're able to confirm whether these API endpoints were taking longer than usual to set a service up

I believe that either infra-as-code tool would reproduce this error

looks like this issue is irreproducible this way, everything works flawlessly on our side—we use Terraform internally in Aiven, and have never received this kind of feedback from our internal customers, nor has this ever been submitted to our GitHub Issues previously by anyone else

it's possible to adapt our acceptance test code

as for that, we're not sure what would be the impact for us in adapting Pulumi as a way to perform acceptance tests

as a Hashicorp Partner, we are adhering to their documentation in regards to testing the provider, and we have full acceptance tests coverage for all the resources provided via our Terraform Provider

those are daily run jobs, that perform testing of all kind of resources that our provider implements, you should be able to see the log of these runs here: https://github.com/aiven/terraform-provider-aiven/actions/workflows/acceptance-tests.yml

n.b. they currently fail for a while due to API flakiness, but it seems not to be related to your case since the resources aren't created at all during this failure

also, looking forward to hear back from you and your ideas on reproducing this in a more reliable manner

@Serpentiel Serpentiel added the bug Something isn't working label Feb 16, 2023
@AaronFriel
Copy link
Author

Hey @Serpentiel,

Thanks for looking into this. I have an email thread ongoing with folks at Aiven as well, just to make sure we don't get our wires crossed.

Re: adapting our code, I mean having the same or identical test but in HCL/TF.

Nothing about how Pulumi works should - or could, as far as I know - change how long a network call can take.

I believe the network timeout setting you've set is much too low, and timeouts should be set to very long values that are a multiple of 100th percentile times.

Looking through your acceptance test history, I can find several tests that ran for in excess of 10 to 15 minutes, though not always. If the timeout is 20 minutes, your users are a bad roll of the die away from failing and resulting in unnecessary user pain.

Nothing about this recommendation to increase timeouts should violate any best practices from Hashicorp.

Best,

Friel

@Serpentiel
Copy link
Contributor

Serpentiel commented Feb 16, 2023

thanks for your reply, @AaronFriel!

I have an email thread ongoing with folks at Aiven as well

I'm in that email thread too :)

we are, of course, able to increase the timeout—is that something you'd be able to test on your own with the Pulumi setup if I prepare a branch with the updated timeout?

@AaronFriel
Copy link
Author

Yes, though I think we can do that without you making a branch, we have a system for vendoring providers in and applying patches. If that suffices to validate this for you folks, we can do that.

@Serpentiel
Copy link
Contributor

hey again!

great to hear that, looking forward then for hearing back about this matter once you guys check it!

also, is there anything to do from our side to provide with an ability to configure timeouts?

iirc, timeouts are also configurable via Terraform manifests

@dewan-ahmed
Copy link
Contributor

Hi @AaronFriel pls let us know if configurable timeout is an acceptable solution to this issue. thanks.

@AaronFriel
Copy link
Author

Hey @Serpentiel sorry for not getting back to you. By configure timeouts, I'm not sure.

I think two options are:

  • Increasing the default timeouts, which you may not want to impose on your users
  • Declare the timeout values in a global variable, so that we can override with ldflags at compile time

@Serpentiel
Copy link
Contributor

hey, @AaronFriel! 👋

By configure timeouts, I'm not sure.

Terraform protocol provides with a way to define custom timeouts when creating resources

all our resources now support those user-definable timeouts; when using from user-mode, it looks like this:

resource "aiven_pg" "pg1" {
  project      = data.aiven_project.m3db-project1.project
  cloud_name   = "google-europe-west1"
  service_name = "alekses-postgres1"
  plan         = "startup-4"

  pg_user_config {
    pg_version = 14
  }

  timeouts {
    create = "30m"
    read   = "30m"
  }
}

could you please check if this will work for your case?

Declare the timeout values in a global variable, so that we can override with ldflags at compile time

timeout values are declared here in the code, will this be sufficient for you to be able to patch them during compile time? iirc, you mentioned something about being able to patch some things on compile time

we could as well move the DefaultTimeoutMinutes constant one level above to be able to modify it externally, as well as changing it to be a variable

thank you!

@AaronFriel
Copy link
Author

Moving default timeout minutes to a global would be perfect - that's easier for us to patch in one place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants