Error creating resources: unable to wait for service state change: context deadline exceeded #1054

AaronFriel · 2023-02-08T16:34:56Z

As a maintainer of Pulumi Aiven, we run tests against the Aiven source code in our provider regularly on cronjobs and on pull requests. Beginning in November, we began seeing intermittent failures with the error:

unable to wait for service state change: context deadline exceeded

This error occurred 20 minutes after the start of execution, and it appears this is due to a timeout set in the Aiven provider.

Suggested resolution

Pulumi's perspective: we recommend removing or significantly increasing timeouts to at least a large multiple of observed 100th percentile times.

Hypothetically, setting latencies to values such as the 99th percentile ensures that 1 in 100 programs with one resource will fail, and a program managing ten resources fails 1 in 10 times (1-.99¹⁰). When a provider times out, the result is usually an inconsistent state left in the service it manages and this leads to surprising behavior and manual steps to resolve. We believe infra-as-code tools should optimize for reducing manual work, even at the expense of potentially long running CI jobs.

Because of this, Pulumi defaults to not setting a timeout on the providers we author, relying on the user or the CI environment to decide when they have waited too long. This helps keep Pulumi resilient to long tail latencies.

Steps to reproduce

I believe that either infra-as-code tool would reproduce this error, if you're more familiar with HCL, it's possible to adapt our acceptance test code below to create a similar TF configuration.

This is one of the examples tests that we saw intermittent errors on:

From https://github.com/pulumi/pulumi-aiven/blob/master/examples/service/index.ts

import * as pulumi from "@pulumi/pulumi";
import * as aiven from "@pulumi/aiven";

const config = new pulumi.Config();
const projectName = config.require("projectName");


const service = new aiven.Grafana("my-new-service", {
    project: projectName,
    cloudName: "google-europe-west1",
    plan:"startup-4",
    grafanaUserConfig: {
        publicAccess: {
            grafana: "true",
        }
    }
});

export const serviceName = service.serviceName;

To get started with this example, install Pulumi and then in an empty directory run:

pulumi new https://github.com/pulumi/pulumi-aiven/examples/service

This will create the TypeScript project and its dependencies. You'll need to set the environment variables AIVEN_PROJECT_NAME and AIVEN_TOKEN, and then you can run:

pulumi up

Expected behavior

The program should create resources, which can then be destroyed via pulumi destroy. The total billing time for these resources should be close to the smallest possible billing unit (1 hour, 1 minute, etc.)

Actual behavior

Intermittently, as we observed on the following dates, the program will fail with the error "unable to wait for service state change: context deadline exceeded". In the month of November, we saw this error thirteen times, involving resources created at:

2022-11-02T06:17Z
2022-11-04T06:11Z
2022-11-07T06:13Z
2022-11-08T06:11Z
2022-11-09T06:13Z
2022-11-10T06:15Z
2022-11-11T06:12Z
...

These resources then created outside of the test loop and we were billed for the remainder of the month.

The text was updated successfully, but these errors were encountered:

Serpentiel · 2023-02-16T14:40:29Z

hey, @AaronFriel! 👋

thanks for submitting this!

I have performed an evaluation of this issue, and it is currently unclear how to reproduce it :(

I will be checking internally to see if we're able to confirm whether these API endpoints were taking longer than usual to set a service up

I believe that either infra-as-code tool would reproduce this error

looks like this issue is irreproducible this way, everything works flawlessly on our side—we use Terraform internally in Aiven, and have never received this kind of feedback from our internal customers, nor has this ever been submitted to our GitHub Issues previously by anyone else

it's possible to adapt our acceptance test code

as for that, we're not sure what would be the impact for us in adapting Pulumi as a way to perform acceptance tests

as a Hashicorp Partner, we are adhering to their documentation in regards to testing the provider, and we have full acceptance tests coverage for all the resources provided via our Terraform Provider

those are daily run jobs, that perform testing of all kind of resources that our provider implements, you should be able to see the log of these runs here: https://github.com/aiven/terraform-provider-aiven/actions/workflows/acceptance-tests.yml

n.b. they currently fail for a while due to API flakiness, but it seems not to be related to your case since the resources aren't created at all during this failure

also, looking forward to hear back from you and your ideas on reproducing this in a more reliable manner

AaronFriel · 2023-02-16T16:15:09Z

Hey @Serpentiel,

Thanks for looking into this. I have an email thread ongoing with folks at Aiven as well, just to make sure we don't get our wires crossed.

Re: adapting our code, I mean having the same or identical test but in HCL/TF.

Nothing about how Pulumi works should - or could, as far as I know - change how long a network call can take.

I believe the network timeout setting you've set is much too low, and timeouts should be set to very long values that are a multiple of 100th percentile times.

Looking through your acceptance test history, I can find several tests that ran for in excess of 10 to 15 minutes, though not always. If the timeout is 20 minutes, your users are a bad roll of the die away from failing and resulting in unnecessary user pain.

Nothing about this recommendation to increase timeouts should violate any best practices from Hashicorp.

Best,

Friel

Serpentiel · 2023-02-16T16:20:30Z

thanks for your reply, @AaronFriel!

I have an email thread ongoing with folks at Aiven as well

I'm in that email thread too :)

we are, of course, able to increase the timeout—is that something you'd be able to test on your own with the Pulumi setup if I prepare a branch with the updated timeout?

AaronFriel · 2023-02-16T23:00:16Z

Yes, though I think we can do that without you making a branch, we have a system for vendoring providers in and applying patches. If that suffices to validate this for you folks, we can do that.

Serpentiel · 2023-02-17T11:12:37Z

hey again!

great to hear that, looking forward then for hearing back about this matter once you guys check it!

also, is there anything to do from our side to provide with an ability to configure timeouts?

iirc, timeouts are also configurable via Terraform manifests

dewan-ahmed · 2023-03-03T02:05:15Z

Hi @AaronFriel pls let us know if configurable timeout is an acceptable solution to this issue. thanks.

AaronFriel · 2023-03-15T21:38:53Z

Hey @Serpentiel sorry for not getting back to you. By configure timeouts, I'm not sure.

I think two options are:

Increasing the default timeouts, which you may not want to impose on your users
Declare the timeout values in a global variable, so that we can override with ldflags at compile time

Serpentiel · 2023-03-16T11:25:52Z

hey, @AaronFriel! 👋

By configure timeouts, I'm not sure.

Terraform protocol provides with a way to define custom timeouts when creating resources

all our resources now support those user-definable timeouts; when using from user-mode, it looks like this:

resource "aiven_pg" "pg1" {
  project      = data.aiven_project.m3db-project1.project
  cloud_name   = "google-europe-west1"
  service_name = "alekses-postgres1"
  plan         = "startup-4"

  pg_user_config {
    pg_version = 14
  }

  timeouts {
    create = "30m"
    read   = "30m"
  }
}

could you please check if this will work for your case?

Declare the timeout values in a global variable, so that we can override with ldflags at compile time

timeout values are declared here in the code, will this be sufficient for you to be able to patch them during compile time? iirc, you mentioned something about being able to patch some things on compile time

we could as well move the DefaultTimeoutMinutes constant one level above to be able to modify it externally, as well as changing it to be a variable

thank you!

AaronFriel · 2023-03-17T03:42:44Z

Moving default timeout minutes to a global would be perfect - that's easier for us to patch in one place.

Serpentiel self-assigned this Feb 16, 2023

Serpentiel added needs fix This issue has been confirmed and requires to be worked on needs info This issue is missing the necessary information and removed needs fix This issue has been confirmed and requires to be worked on labels Feb 16, 2023

Serpentiel added the bug Something isn't working label Feb 16, 2023

Serpentiel mentioned this issue Feb 17, 2023

feat(service): unify timeouts #1063

Merged

Serpentiel mentioned this issue Mar 17, 2023

refactor(schemautil): global default timeout minutes #1098

Merged

ivan-savciuc closed this as completed in #1098 Mar 17, 2023

Serpentiel removed the needs info This issue is missing the necessary information label Mar 17, 2023

AaronFriel mentioned this issue Mar 23, 2023

Resources not cleaned up from test, incurring ongoing charges in Aiven account pulumi/pulumi-aiven#235

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Error creating resources: unable to wait for service state change: context deadline exceeded #1054

Error creating resources: unable to wait for service state change: context deadline exceeded #1054

AaronFriel commented Feb 8, 2023 •

edited

Loading

Serpentiel commented Feb 16, 2023 •

edited

Loading

AaronFriel commented Feb 16, 2023

Serpentiel commented Feb 16, 2023 •

edited

Loading

AaronFriel commented Feb 16, 2023

Serpentiel commented Feb 17, 2023

dewan-ahmed commented Mar 3, 2023

AaronFriel commented Mar 15, 2023

Serpentiel commented Mar 16, 2023

AaronFriel commented Mar 17, 2023

Error creating resources: unable to wait for service state change: context deadline exceeded #1054

Error creating resources: unable to wait for service state change: context deadline exceeded #1054

Comments

AaronFriel commented Feb 8, 2023 • edited Loading

Suggested resolution

Steps to reproduce

Expected behavior

Actual behavior

Serpentiel commented Feb 16, 2023 • edited Loading

AaronFriel commented Feb 16, 2023

Serpentiel commented Feb 16, 2023 • edited Loading

AaronFriel commented Feb 16, 2023

Serpentiel commented Feb 17, 2023

dewan-ahmed commented Mar 3, 2023

AaronFriel commented Mar 15, 2023

Serpentiel commented Mar 16, 2023

AaronFriel commented Mar 17, 2023

AaronFriel commented Feb 8, 2023 •

edited

Loading

Serpentiel commented Feb 16, 2023 •

edited

Loading

Serpentiel commented Feb 16, 2023 •

edited

Loading