
DNS Record state discrepancy #910

Open
mikocot opened this issue Sep 19, 2024 · 7 comments
Labels
awaiting/core Blocked on a missing bug or feature in pulumi/pulumi (except codegen) awaiting-upstream The issue cannot be resolved without action in another repository (may be owned by Pulumi). kind/enhancement Improvements or new features

Comments

@mikocot

mikocot commented Sep 19, 2024

Describe what happened

Yesterday cloudflare had an outage that caused an error when creating a DNS record.

Upon retry, Pulumi does not seem to be aware that this record was actually created and tries to add it again, which causes the update to fail.

Refresh did not help.

Sample program

new Pulumi.Cloudflare.Record(name, new RecordArgs
{
    ZoneId = zoneId,
    Name = name,
    Content = domainVerificationId,
    Type = "TXT",
    Ttl = 1,
    Proxied = false,
});

Log output

original error:

  • cloudflare:index:Record asuid.app-ingestion.eastus2.pr1242 creating (1s) error: error reading from server: EOF
  • cloudflare:index:Record asuid.app-ingestion.eastus2.pr1242 creating failed error: error reading from server: EOF

some other records we tried to create gave a slightly different error:

  • cloudflare:index:Record asuid.app-ingestion.pr1242 creating (1s) error: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:39483: connect: connection refused"
  • cloudflare:index:Record asuid.app-ingestion.pr1242 creating failed error: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:39483: connect: connection refused"

now, every time we retry the update (before and after refresh):

cloudflare:index:Record (asuid.app-ingestion.eastus2.pr1242):
error: sdk-v2/provider2.go:385: sdk.helper_schema: expected DNS record to not already be present but already exists: [email protected]
error: 1 error occurred:
* expected DNS record to not already be present but already exists

Affected Resource(s)

cloudflare:index:Record

Output of pulumi about

we use https://github.com/pulumi/actions, so this isn't applicable

Additional context

This bug might not be reproducible in an easy way without mocking Cloudflare responses, but it points to a bigger underlying issue that we have discussed with Pulumi numerous times. It seems that when Pulumi fails to get a successful response from a provider, it assumes the resource was simply not created, while often it was, or will be soon after. Instead of verifying this the next time around (or during refresh), Pulumi then tries to create that resource again, which either fails or creates a duplicate. We were assured that this scenario has been taken into account, but we still see cases like this one where it is clearly a problem.

Unfortunately, in such a situation there is very little we can do: the resource is created, the deployment won't work, and Pulumi won't clean the resource up because it's not aware of its existence (not that we could destroy the stack in many cases). We're aware that we could import the resource, etc., but these deployments are all part of an automated solution that deploys sometimes hundreds of times per day, and we use Pulumi exactly to avoid having to manage all such cases ourselves.

Contributing

Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).

@mikocot mikocot added kind/bug Some behavior is incorrect or out of spec needs-triage Needs attention from the triage team labels Sep 19, 2024
@iwahbe iwahbe added awaiting-upstream The issue cannot be resolved without action in another repository (may be owned by Pulumi). and removed needs-triage Needs attention from the triage team labels Sep 20, 2024
@iwahbe
Member

iwahbe commented Sep 20, 2024

Hi @mikocot. I'm sorry you're hitting this error. Unfortunately, this is almost certainly a bug in the upstream provider: cloudflare/terraform-provider-cloudflare.

A Pulumi-level fix would be very similar to the resolution of pulumi/pulumi#3388.

@mikocot
Author

mikocot commented Sep 24, 2024

I agree it comes from an upstream issue, but the very reason for using Pulumi is to become independent of such events. If we need to work around Pulumi getting desynchronized from the underlying provider, it becomes easier to just use the provider directly.

I don't understand why Pulumi can't record the identifier of the resource it is creating when the request does not succeed, and check for its existence the next time around instead of trying to create it again. That probably doesn't make sense when the response is a 404 or a bad request, but it's something absolutely expected when the response is not clear about the actual result.

@mikocot
Author

mikocot commented Sep 26, 2024

Also, looking at it from a customer perspective: if we did not use Pulumi, such an incident with a provider like Cloudflare or Azure would be a non-event; we'd just retry the creation. When it happens with Pulumi, we're actually stuck. There isn't any good way to deal with such state discrepancies; we've had to do it before, and it's a super error-prone and tricky operation that makes the whole deployment fragile.

Therefore Pulumi should do everything possible to avoid such state discrepancies, and it does not seem to. The fact that such issues are not fixed by a refresh is even more worrying, and they are not because Pulumi doesn't seem to keep any record of having tried to create the resource in the first place.

@mjeffryes
Member

I totally agree; when providers are not resilient to create failures and the resource was actually created in the cloud, the declarative model breaks down a bit. To be fair, if you were not using Pulumi, you would still get an error when you retry and have to decide how to handle it (though maybe the tools to resolve it would be more familiar?).

The problem is fundamentally tricky to solve at the Pulumi layer for a few reasons:

  1. The (upstream, terraform) provider doesn't hand Pulumi any id/handle for the thing it was trying to create when it fails. This is why refresh can't work; we don't have enough information to try to read the state in the cloud.
  2. When you run pulumi up a second time, the service returns a hard error, but not enough information for the provider to determine that its state matches the resource you're trying to create.

The best solution would be for the upstream provider to fix 1, so that the subsequent refresh would just work. (This is not always possible for all resources depending on the way the service API works, but based on the create error message it seems likely to me that it would be for this resource.)

At the Pulumi level, I think @iwahbe is right that pulumi/pulumi#3388 is the best solution we've come up with. That option would let you instruct the provider to check whether the resource already exists before attempting a create. (We're also very open to your ideas for what would be an intuitive way to address these sorts of errors!)

I hope that is helpful. We can also try to help by urging the maintainers of https://github.com/cloudflare/terraform-provider-cloudflare to see if they can fix the return on error for this resource, but your concern sounds like it may be more about create errors in the declarative IaC model generally than about the behavior of this specific resource.

@mjeffryes mjeffryes added kind/enhancement Improvements or new features awaiting/core Blocked on a missing bug or feature in pulumi/pulumi (except codegen) and removed kind/bug Some behavior is incorrect or out of spec labels Sep 30, 2024
@mikocot
Author

mikocot commented Oct 15, 2024

Sorry, didn't notice your response.

To be fair, if you were not using pulumi, you would still get an error when you retry and have to decide how to handle it. (but maybe the tools to resolve it would be more familiar?)

yup, but you CAN handle it; with Pulumi we just get broken state, and it becomes both difficult and risky to move on.

The (upstream, terraform) provider doesn't hand Pulumi any id/handle for the thing it was trying to create when it fails. This is why refresh can't work; we don't have enough information to try to read the state in the cloud.

I see, but the same happens e.g. with the azure-native provider, which does not use TF, right? If creation times out or its process gets killed, from my experience Pulumi just pretends the resource never existed.

The best solution would be for the upstream provider to fix 1, so that the subsequent refresh would just work. (This is not always possible for all resources depending on the way the service API works, but based on the create error message it seems likely to me that it would be for this resource.)

But would that really be enough? Does Pulumi have logic to persist the ids of resources it tried to create but 'potentially failed', so it can verify later whether they exist or not? Because we've gone through several issues with the azure-native provider where resources would either be duplicated in such cases or cause similar conflicts. Unless something has changed in this area in recent months, I don't think just 'being able to know those ids' is enough for Pulumi to solve the problem we've faced.

your concern sounds like it may be more about create errors in the declarative IaC model generally rather than the behavior of this specific resource.

Definitely, I find it a general issue in Pulumi that I don't see being addressed. Although I might be wrong, and it might be a unique case that is failing due to provider limitations.

I guess pulumi/pulumi#3388 could help, but a) it's been there, open, since 2019, and b) AFAIU it suggests a declarative, per-resource solution.


Here I'm talking about a system that should work on every resource, basically:

  1. Generate a resource ID (if not set by the user), add it to state in 'draft' status
  2. Try to create the resource
  3. a) if success, add to state
    b) if failed beyond doubt (e.g. 400 bad request), fail the deployment, set status as failed / remove from state
    c) if failed but the result is inconclusive (e.g. timeout, server error, network error), fail the deployment, set status as 'inconclusive'
    d) pulumi process killed, action timed out or cancelled, hardware issue, whatever -> do nothing

Then the next time Pulumi tries to create a resource and sees a resource in the state with status 'draft' or 'inconclusive', it first queries the provider to see if the resource exists and matches what was requested. If yes, the status is updated to 'created' and Pulumi moves on to the next. If not, it retries creating it. @mjeffryes, or does Pulumi follow a similar workflow already?
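The workflow above can be sketched in a few lines. This is a toy simulation, not Pulumi's actual engine code: all names (`ensure_created`, `FlakyProvider`, the `Status` values) are hypothetical, and a small in-memory provider stands in for the cloud API. The `FlakyProvider` reproduces the incident in this issue: the first create lands server-side, but the response is lost.

```python
from enum import Enum

class Status(Enum):
    DRAFT = "draft"
    CREATED = "created"
    FAILED = "failed"
    INCONCLUSIVE = "inconclusive"

class BadRequestError(Exception):
    """The API rejected the request outright: failure is certain."""

class InconclusiveError(Exception):
    """Timeout / EOF / connection reset: the create may or may not have landed."""

def ensure_created(rid, desired, state, provider):
    """One reconciliation step for a single resource, following the numbered workflow."""
    prior = state.get(rid)
    if prior and prior["status"] is Status.CREATED:
        return  # already converged
    if prior and prior["status"] in (Status.DRAFT, Status.INCONCLUSIVE):
        # An earlier attempt may have succeeded: read before re-creating.
        actual = provider.read(rid)
        if actual == desired:
            state[rid] = {"status": Status.CREATED, "props": actual}
            return
    state[rid] = {"status": Status.DRAFT, "props": desired}  # step 1: record intent
    try:
        provider.create(rid, desired)                        # step 2
    except BadRequestError:
        state[rid]["status"] = Status.FAILED                 # step 3b
        raise
    except InconclusiveError:
        state[rid]["status"] = Status.INCONCLUSIVE           # step 3c
        raise
    state[rid]["status"] = Status.CREATED                    # step 3a

class FlakyProvider:
    """Toy cloud API: the first create lands server-side, but the
    response is lost (simulating the 'error reading from server: EOF')."""
    def __init__(self):
        self.records = {}
        self._drop_next_response = True
    def read(self, rid):
        return self.records.get(rid)
    def create(self, rid, props):
        self.records[rid] = props  # the record IS created...
        if self._drop_next_response:
            self._drop_next_response = False
            # ...but the caller never learns that.
            raise InconclusiveError("error reading from server: EOF")
```

With this sketch, the first `ensure_created` call fails and marks the state entry inconclusive; the retry reads the record, sees it matches, and converges without attempting a duplicate create.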

@mjeffryes
Member

mjeffryes commented Oct 30, 2024

Here I'm talking about a system that should work on every resource, basically:

1. Generate a resource ID (if not set by the user), add it to state in 'draft' status

2. Try to create the resource

3. a) if success, add to state
   b) if failed beyond doubt (e.g. 400 bad request), fail the deployment, set status as failed / remove from state
   c) if failed but the result is inconclusive (e.g. timeout, server error, network error), fail the deployment, set status as 'inconclusive'
   d) pulumi process killed, action timed out or cancelled, hardware issue, whatever -> do nothing

Then the next time Pulumi tries to create a resource and sees a resource in the state with status 'draft' or 'inconclusive', it first queries the provider to see if the resource exists and matches what was requested. If yes, the status is updated to 'created' and Pulumi moves on to the next. If not, it retries creating it. @mjeffryes, or does Pulumi follow a similar workflow already?

This is unfortunately quite difficult to do generically. The problem is that it's not universally possible to generate an id in step 1 that can be used to read the resource after an inconclusive create. The id required for a read is often only returned on a successful create and can't be predicted for many resource types.

Now, in specific cases like this DNS Record resource, it's likely that the provider could predict the id before the call to create. We currently have two options for addressing this situation in the provider:

  1. Attempt the create, detect that the error means the resource already exists, and try to read the state.

  2. Attempt to read before creating, skipping the create if the read succeeds.

Both options can cause problems if the resource was actually created by a different stack: you end up with two stacks thinking they own the resource. pulumi/pulumi#3388 (Support declaring that a resource should be created only if it does not already exist) is intended as an explicit opt-in to mitigate this side effect.
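Option 2 with an explicit opt-in in the spirit of pulumi/pulumi#3388 could be sketched like this. Again a hypothetical illustration, not the engine's actual API: the `allow_existing` flag and both class/function names are made up for the example.

```python
class InMemoryProvider:
    """Toy cloud API backed by a dict."""
    def __init__(self, records=None):
        self.records = dict(records or {})
    def read(self, rid):
        return self.records.get(rid)
    def create(self, rid, props):
        if rid in self.records:
            raise RuntimeError(f"expected DNS record to not already be present: {rid!r}")
        self.records[rid] = props

def create_if_absent(rid, desired, provider, allow_existing=False):
    """Read-before-create. `allow_existing` is the explicit opt-in:
    without it, finding a matching resource is treated as a conflict,
    because the resource may be owned by another stack."""
    existing = provider.read(rid)
    if existing is not None:
        if allow_existing and existing == desired:
            return existing  # adopt the pre-existing resource into this stack's state
        raise RuntimeError(f"resource {rid!r} already exists (possibly owned by another stack)")
    provider.create(rid, desired)
    return desired
```

The default path preserves the conflict error, so accidentally adopting another stack's resource requires a deliberate opt-in per resource.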

I think you are right that the core engine could be updated to give the provider a third option:

  1. Before attempting the create, the provider returns the id it expects to use. The engine then stores this in state, as you suggested, and if the create fails in an ambiguous way, the engine attempts a read before create when the update is retried.

It's worth noting that this is not a perfect fix though:

  • individual resources' create behavior would have to be updated, and this is only possible if the id can be predicted ahead of time
  • it would still be possible to read a resource that you didn't actually create and end up with two stacks trying to manage the same resource (though with this third option that should be very rare)

FWIW, we are exploring how to support communicating these expected ids to the engine. It's relevant to problems like Resource Create methods may be idempotent (pulumi/9925) and Structural resources (pulumi/918) as well. I just don't want to over-promise what we can do here.

@mikocot
Author

mikocot commented Dec 13, 2024

true, the issue of the resource possibly being created by another stack could be a problem, although, as you said, in many cases this is acceptable, so it could be configurable.

as for the ids not being deterministic, that's more of a problem, with the solution you've mentioned being a potential fix. On the other hand, I believe that with many providers, like Azure, you already know what the id is going to be, and Azure is going to respect it.
