DNS Record state discrepancy #910
Comments
Hi @mikocot. I'm sorry you're hitting this error. Unfortunately, this is almost certainly a bug in the upstream provider: cloudflare/terraform-provider-cloudflare. A Pulumi-level fix would be very similar to the resolution of pulumi/pulumi#3388.
I agree it comes from an upstream issue, but the very reason for using Pulumi is to become independent from such events. If we need to work around Pulumi getting desynchronized from the underlying provider, it becomes easier to just work with the provider directly. I don't understand why Pulumi can't register the identifier of the resource it creates when the request does not succeed, and check for its existence the next time around instead of trying to create it again. That probably doesn't make sense when the response is a 404 or a bad request, but it's absolutely to be expected when the response is not clear about the actual result.
Also, looking from a customer perspective, I think that if we did not use Pulumi, such an incident with a provider like Cloudflare or Azure would be a non-event; we'd just retry the creation. When this happens with Pulumi, we're actually stuck. There isn't any good way to deal with such state discrepancies; we had to do it before, and it's a very error-prone and tricky operation that makes the whole deployment fragile. Therefore Pulumi should do everything possible to avoid such state discrepancies, and it does not seem to. The fact that such issues are not fixed by a 'refresh' is even more worrying. And they are not, because Pulumi doesn't seem to have any record of having tried to create the resource in the first place.
I totally agree; when providers are not resilient to create failures and the resource was actually created in the cloud, the declarative model breaks down a bit. To be fair, if you weren't using Pulumi, you would still get an error when you retry and have to decide how to handle it (but maybe the tools to resolve it would be more familiar?). The problem is fundamentally tricky to solve at the Pulumi layer for a few reasons:
The best solution would be for the upstream provider to fix 1, so that a subsequent refresh would just work. (This is not always possible for all resources, depending on the way the service API works, but based on the create error message it seems likely to me that it would be for this resource.) At the Pulumi level, I think @iwahbe is right that pulumi/pulumi#3388 is the best solution we've come up with. That option would let you instruct the provider to check whether the resource already exists before attempting a create. (We're also very open to your ideas for what would be an intuitive way to address these sorts of errors!) I hope that is helpful. We can also try to help by urging the maintainers of https://github.com/cloudflare/terraform-provider-cloudflare to see if they can fix the return on error for this resource as well, but your concern sounds like it may be more about create errors in the declarative IaC model generally rather than the behavior of this specific resource.
Sorry, didn't notice your response.
Yup, but you CAN handle it; with Pulumi we just get a broken state and it becomes both difficult and risky to move on.
I see, but the same happens e.g. with the azure-native provider, which does not use TF, right? If creation times out or its process gets killed, in my experience Pulumi just pretends the resource never existed.
But would that really be enough? Does Pulumi have logic to persist the ids of resources it tried to create but 'potentially failed', so it can verify later whether they exist? Because we've gone through several issues with the azure-native provider where resources would either be duplicated in such cases or cause similar conflicts. Unless something has changed in this area in recent months, I don't think just 'being able to know those ids' is enough for Pulumi to solve the problem we've faced.
Definitely, I find it a general issue in Pulumi that I don't see being addressed. Although I might be wrong, and it might be a unique case that is failing due to provider limitations. I guess pulumi/pulumi#3388 could help, but a) it's been there, open, since 2019, and b) AFAIU it suggests a declarative, per-resource solution. Here I'm talking about a mechanism that should work for every resource, basically:
Then, the next time Pulumi tries to create a resource and sees an entry with that status in the state, it would first verify whether the resource actually exists before trying to create it again.
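A rough sketch of the bookkeeping I have in mind (all names are hypothetical; this is not Pulumi's actual engine code, just an illustration of the flow):

```typescript
// Hypothetical sketch: record inconclusive creates and verify them before retrying.
type PendingCreate = { urn: string; candidateId?: string };

interface Checkpoint {
    pending: PendingCreate[];
}

async function createWithPendingState(
    urn: string,
    checkpoint: Checkpoint,
    create: () => Promise<{ id: string }>,
    exists: (id: string) => Promise<boolean>,
): Promise<string> {
    // If an earlier run recorded an inconclusive create, verify it before creating again.
    const prior = checkpoint.pending.find(p => p.urn === urn);
    const priorId = prior?.candidateId;
    if (priorId !== undefined && await exists(priorId)) {
        checkpoint.pending = checkpoint.pending.filter(p => p.urn !== urn);
        return priorId; // the resource is already there: adopt it, don't re-create it
    }

    try {
        const result = await create();
        checkpoint.pending = checkpoint.pending.filter(p => p.urn !== urn);
        return result.id;
    } catch (err) {
        // Outcome unknown (timeout, 5xx, ambiguous response): remember the attempt so
        // the next run can verify it instead of blindly creating a duplicate. This only
        // helps if some identifier is known or predictable without a successful create,
        // which is exactly the limitation discussed below.
        checkpoint.pending.push({ urn });
        throw err;
    }
}
```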
This is unfortunately quite difficult to do generically. The problem is that it's not universally possible to generate an id in step 1 that can be used to read the resource after an inconclusive create; the id required for a read is typically only assigned by the service once the create succeeds. Now, in specific cases like this DNS Record resource, it's likely that the provider could predict the id before the call to create. We currently have 2 options for addressing this situation in the provider:
Both options can cause problems if the resource was actually created by a different stack: you end up with two stacks thinking they own this resource. pulumi/pulumi#3388 ("Support declaring that a resource should be created only if it does not already exist") is intended as an explicit opt-in to mitigate this side effect. I think you are right that the core engine could be updated to give the provider a third option:
It's worth noting that this is not a perfect fix though:
FWIW, we are exploring how to support communicating these expected ids to the engine. It's relevant to problems like pulumi/pulumi#9925 ("Resource Create methods may be idempotent") and pulumi/pulumi#918 ("Structural resources") as well. I just don't want to over-promise what we can do here.
True, the issue of the resource possibly being created by another stack could be a problem, although as you have said, in many cases this is acceptable, so it could be configurable. As for the ids not being deterministic, that's more of a problem, with the solution you've mentioned being a potential fix. On the other hand, I believe that with many providers, like Azure, you already know what the id is going to be, and Azure is going to respect it.
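For a resource that can be looked up from its inputs (plausible for a DNS record keyed by zone, name, and type), the "check before create" idea might look roughly like the sketch below; every name here is illustrative and assumed, not the provider's actual code:

```typescript
// Illustrative only; not the cloudflare provider's actual implementation.
interface RecordInputs {
    zoneId: string;
    name: string;
    type: string;
}

async function createOrAdopt(
    inputs: RecordInputs,
    lookupExisting: (i: RecordInputs) => Promise<string | undefined>, // e.g. list records by zone/name/type
    create: (i: RecordInputs) => Promise<string>,
): Promise<{ id: string; adopted: boolean }> {
    // The "third option": read first and adopt the resource if it already exists.
    const existingId = await lookupExisting(inputs);
    if (existingId !== undefined) {
        // Caveat from the thread: this may adopt a record actually owned by another
        // stack, which is why it should be an explicit opt-in.
        return { id: existingId, adopted: true };
    }
    return { id: await create(inputs), adopted: false };
}
```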
Describe what happened
Yesterday Cloudflare had an outage that caused an error when creating a DNS record.
Upon retry Pulumi seems to not be aware that this record was actually created and tries to add it again, which causes the update to fail.
Refresh did not help.
Sample program
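(The program from the original report is not included; the following is a minimal sketch of the kind of resource involved. The zone id and record values are placeholders, and depending on the provider version the value field may be named `content`.)

```typescript
import * as cloudflare from "@pulumi/cloudflare";

// Placeholder values; not the reporter's actual program.
const record = new cloudflare.Record("app-record", {
    zoneId: "YOUR_ZONE_ID",        // id of an existing Cloudflare zone
    name: "app.example.com",
    type: "CNAME",
    value: "target.example.net",   // newer provider versions may call this `content`
    ttl: 300,
    proxied: false,
});

export const recordId = record.id;
```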
Log output
Original error:
Some other records we tried to create gave a slightly different error:
Now, every time we retry the update (before and after refresh):
Affected Resource(s)
cloudflare:index:Record
Output of `pulumi about`
We use https://github.com/pulumi/actions, so it's irrelevant.
Additional context
This bug might not be reproducible in an easy way without mocking CF responses, but it shows a bigger underlying issue that we have discussed with Pulumi numerous times. It seems that when Pulumi fails to get a successful response from a provider, it assumes the resource was simply not created, while often it was, or will be soon after. Instead of verifying this the next time around (or during refresh), Pulumi then tries to create that resource again, which either fails or creates a duplicate. We were assured that this scenario has been taken into account, but we still see cases like this one where it clearly is a problem.
Unfortunately, in such a situation there is really little we can do: the resource is created, the deployment won't work, and Pulumi won't clean it up either, as it's not aware of its existence (not that we could destroy the stack in many cases anyway). We're aware that we could import the resource, etc., but those deployments are all part of an automated solution that deploys sometimes hundreds of times per day, and we use Pulumi exactly to avoid having to manage all such cases.
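(For completeness, the manual escape hatch alluded to above is adopting the already-created record, e.g. via the `import` resource option or `pulumi import`. The id format shown below is an assumption based on the provider's usual `<zoneId>/<recordId>` import convention, and the declared inputs must match the existing record exactly or the import fails.)

```typescript
import * as cloudflare from "@pulumi/cloudflare";

// Adopt the record that already exists in Cloudflare instead of creating it again.
// The import id format ("<zoneId>/<recordId>") is an assumption; check the provider docs.
const record = new cloudflare.Record("app-record", {
    zoneId: "YOUR_ZONE_ID",
    name: "app.example.com",
    type: "CNAME",
    value: "target.example.net",
    ttl: 300,
    proxied: false,
}, { import: "YOUR_ZONE_ID/EXISTING_RECORD_ID" });
```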
Contributing
Vote on this issue by adding a 👍 reaction.
To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already).