Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TF tries to destroy/recreate older projects with GAE enabled upon upgrade #1561

Closed
danawillow opened this issue May 29, 2018 · 25 comments
Closed
Assignees
Labels

Comments

@danawillow
Copy link
Contributor

See report at #1503 (comment)

Answer is probably just to set app_engine to computed in the project resource.

@danawillow
Copy link
Contributor Author

(assigned to both paddy and vincent since vincent is onduty this week, up to both of you who wants to take it)

@endemics
Copy link

Hi, I'm having the same issue as @stevewolter in #1503 (comment).

Could the BACKWARDS INCOMPATIBILITIES / NOTES section of the changelog for 1.13.0 be updated with a link to this issue as a warning?

@danawillow
Copy link
Contributor Author

Good idea, will do!

@paddycarver
Copy link
Contributor

Hi @stevewolter / @endemics! This is just happening because there's no App Engine block defined inside your google_project resource, but App Engine is enabled on the project. From Terraform's perspective, this is "drift", and should be corrected. It should be resolved if you just add the block to your config. From my perspective, I'm assuming this is a quick/easy fix, but is there a reason this is harder/more difficult than I'm assuming? I'd definitely love to hear about it, if so!

@rosbo rosbo removed their assignment Jun 6, 2018
@stevewolter
Copy link

Hi @paddycarver, got it. I knew how to fix this, and the fix is fine, but my problem is rather the surprise impact. My Terraform GCP plugin updated automatically, my config didn't change, and TF would have suddenly deleted all my projects because it thought it needed to disable GAE. That is an unpleasant surprise, and any mitigation of the surprise would be appreciated.

@paddycarver
Copy link
Contributor

Hi @stevewolter, thanks for the clarification. And my sincerest apologies for the surprise. I'll be honest, it was just an oversight on my part, and had I properly considered that scenario, I probably would have done the release a bit differently (either holding it for 2.0.0, or messaging it much more strongly). I can definitely understand how that's an unpleasant surprise to find after an upgrade.

As for mitigation of the surprise, I'm weighing my options, but I'm not seeing any super good ones:

  • I could pull App Engine support, which would reset things to how they were, but would be an unpleasant surprise for anyone who started using it or added it to their config, so it'd take some convincing for us to decide that's the best course.
  • I could mark App Engine as computed, so it wouldn't treat it missing from your config as a diff, and would just accept the API response as fine. Unfortunately, that means there's no way to disable an App Engine application once you've added it to your project. (I mean, technically, you can't disable an App Engine application, you have to destroy and recreate the project, but at least Terraform can tell you that, today.) It also means that if someone enables App Engine outside of Terraform, we have no way to message that drift for people or give them a way to remediate it. I'm more open to this path forward, but I'm still not sold on making behavior intentionally incorrect to correct for a mistake.
  • I could do some really complicated stuff to make sure App Engine is only treated as a diff if it has ever been in your config, but that still has the problem of ignoring drift. We've also done that in the past, and it always ends up being buggy and a mess to debug and maintain. I'd need a lot of convincing that this is the correct behavior in this case for that to happen.
  • We could change tack and pull App Engine applications out to be their own resource. This has the same drawbacks as the first option, unless we wait for 2.0.0 to do this, but at least keeps the functionality intact. Unfortunately, because App Engine has no way to delete an application, we'd have no way of telling people when a new project is necessary, and would just have to silently fail with some messages in the logs, and that's not ideal. I have an issue open with the App Engine team about this to see if we can find a way to support that, because I think that would be the ideal solution here, and then at least version 2.0.0 of the provider could correct this.

If there's a possibility I'm missing, I'd definitely love to hear about it. And again, my sincerest apologies for overlooking that scenario during the release, that was a far more unpleasant surprise than I had intended.

@endemics
Copy link

endemics commented Jun 8, 2018

Same for us here, the main issue is that it is a breaking change in a minor release, and was not advertised.
I have not yet tried to fix our config to check if that would fix it in our case (will try next week), and have resorted to lock the provider version for now.

@stevewolter
Copy link

@paddycarver My apologies if I came across too strongly. My world is a much better world with the App Engine support, I'm seriously happy about it being launched, and I just had some feedback about the rollout. Thank you for working on this. On my end, a well-placed "prevent_destroy" flag on the project prevented real damage.

I realize that all of the options have some drawbacks. Might it be feasible to select between two options with a flag? e.g. have a "turn_down_unexpected" flag that controls the behavior when GAE is found but not configured in TF? The flag might be false by default for now and the default could be switched to true after some announcement and lead time. I'm not sure whether that's an option with the way TF plugins interoperate with TF's core, though.

@endemics
Copy link

Hi,

I've checked adding

  app_engine {
    location_id = "us-central"
  }

to my google_project resource and it seems that terraform apply would be happy now.

While that breaking change was unfortunate I understand that fixing the situation in code at this point is not desirable, especially since the affected user population is probably quite limited.

Hence, I wouldn't mind having this bug closed provided that the CHANGELOG's BACKWARDS INCOMPATIBILITIES / NOTES is updated to mention that the project resource needs to be updated before upgrade if one uses app engine.

While not directly related to this issue, but somewhat tangent since related to #1503, I also believe that the resource documentation itself should be updated with a note about the need to add the "App Engine Admin API" to the list of enabled APIs if one uses google_project_services as it would be easy to overlook.

@danawillow
Copy link
Contributor Author

Hey @endemics, I updated the wording of the warning in the CHANGELOG. Let me know what you think.

@paddycarver
Copy link
Contributor

My apologies if I came across too strongly. My world is a much better world with the App Engine support, I'm seriously happy about it being launched, and I just had some feedback about the rollout. Thank you for working on this.

No apologies necessary :) Honestly, this is just a really bad oversight to have on my part, because the consequences can be catastrophic. I feel bad, and want to fix the situation, but I'm not quite sure I can glue this vase that I broke back together. I may have to settle for just being more careful around vases in the future.

On my end, a well-placed "prevent_destroy" flag on the project prevented real damage.

I am interested in this. Did terraform plan (or apply, which now shows the plan output and asks for confirmation) not show the re-creation? Are you using a setup that doesn't show you plan output, effectively blindly applying? (That sounds accusatory, but I don't mean it to be, I'm just trying to make sure my assumptions about what feedback users are seeing are accurate.)

To be clear, the question isn't to absolve us of responsibility for not shipping surprising things, but just so my mental model of user workflows is correct.

I realize that all of the options have some drawbacks. Might it be feasible to select between two options with a flag? e.g. have a "turn_down_unexpected" flag that controls the behavior when GAE is found but not configured in TF? The flag might be false by default for now and the default could be switched to true after some announcement and lead time. I'm not sure whether that's an option with the way TF plugins interoperate with TF's core, though.

I'd be open to that, but that kind of exists today? That's what lifecycle.ignore_changes does. And I don't know that it would have prevented the surprise, because you'd need to edit your config to take advantage of it. Am I misunderstanding?

@paddycarver
Copy link
Contributor

While not directly related to this issue, but somewhat tangent since related to #1503, I also believe that the resource documentation itself should be updated with a note about the need to add the "App Engine Admin API" to the list of enabled APIs if one uses google_project_services as it would be easy to overlook.

I could see an argument for this, but we don't really document any of the other APIs that our resources need, partially because they could change over time, and keeping on top of that is tricky. Is there something about the App Engine Admin API you believe makes it different and deserving of special handling?

@stevewolter
Copy link

stevewolter commented Jun 20, 2018 via email

@paddycarver
Copy link
Contributor

So what I'm thinking is: Is there a way to make ignore_changes or
prevent_destroy the default behavior only for the case where the absence of
a GAE config would recreate the project? So that a legacy config that
doesn't mention GAE or additional flags at all wouldn't try to destroy a
project that has GAE enabled? I realize that the answer might be "no".

There is, but it changes the default behavior again, and I'm unsure about the potential for benefit here; I think, as you mentioned, the damage was kind of done at this point, unfortunately.

Happy to explain. We are using a setup that applies blindly. For background, We are a dev team of ~20 people, each of which runs a personal dev project, and then we have a couple of nightly setups for trusted testers. We have no customer-facing prod setup yet. Of the devs, ~3 know enough about Terraform to actually check the plan, and the rest just learned to type "yes" and ask no questions, so we disabled the confirmations and started to apply blindly both in the developer-run update
scripts and in the CI.

I do want to be really explicit that this is playing with fire and while it may be appropriate for a dev setup where downtime is tolerable, it's really not an appropriate or supported practice where occasional downtime is not acceptable. (No judgement on which your situation is, that's for you to decide, I just want to be really explicit on this point.) Terraform's plan output is a really important guardrail that we rely on for the safe operation of Terraform. We definitely have a responsibility to avoid surprising breaking changes like this, and I don't want to detract from that at all, but infrastructure can be complicated, and things can surprise you sometimes even without releases, and verifying the changes that are going to happen are what you actually want is an important step.

All this being said, I'm struggling to come up with an action item for this issue, so I'm going to close it. Again, my sincerest apologies for the oversight. Thank you all for your patience and understanding.

@stevewolter
Copy link

Sounds good to me, and thank you for the warning. Thank you so much for implementing this feature in the first place (the bumps in the rollout are really a minor issue in a great big picture), and for listening to me. I really appreciate it.

@sjungwirth
Copy link
Contributor

sjungwirth commented Aug 30, 2018

I don't think recreating the project in order to "remove" app engine will actually work, as projects don't really get deleted from GCP for maybe ~30 days. When TF "destroys" a project it just gets moved to a pending deletion status, and the real kicker, you can't create a new project with the same ID, because the old project with that same ID still exists.

I don't think the current behavior of trying to recreate a project in order to disable/remove app engine makes sense in any context.

I could mark App Engine as computed, so it wouldn't treat it missing from your config as a diff, and would just accept the API response as fine.

This seems like the most reasonable action to me, although I agree it is not ideal, just that the other options make even less sense.

[edit new issue added here #1973]

@paddycarver
Copy link
Contributor

Hey @sjungwirth, a new issue would be the best place to discuss this.

@roman-cnd
Copy link

@paddycarver thanks for the detailed explanation, I came here with the same problem of terraform suddenly trying to wipe out every project.

Where's google take on this? Are they even interested or incentivized to make their cloud resources more suited for config management? terraforming resources in google cloud is a real pain atm. Its a mess with IAM, with "enabled" APIs, now mess with the projects and app engine.

As a terraform user, I want as many resources managed in the provider as possible with the code. However, I'd much prefer that the features that can blow up high-level resources are not getting into the provider.

@roman-cnd
Copy link

roman-cnd commented Sep 26, 2018

Also, terraform behaves properly (not trying to re-create projects) after this snippet is added to project, terraform plan and then deleted.

#   app_engine {
#     location_id = "us-central"
#   }

PS never mind, it had app engine somehow enabled outside of terraform

@paddycarver
Copy link
Contributor

Where's google take on this? Are they even interested or incentivized to make their cloud resources more suited for config management? terraforming resources in google cloud is a real pain atm. Its a mess with IAM, with "enabled" APIs, now mess with the projects and app engine.

Google's been really receptive to making their resources work well with infrastructure as code tools, and has a team internally working on this provider to make it great. They have a tight working relationship with the people that build the products and APIs, and because of their involvement, we're able to do a lot more than we could with just HashiCorp and outside contributors working on this. A lot of our wins go unnoticed, but that tight partnership has warded off a bunch of user pain before it ever got released, and opened a lot of doors that we'd otherwise consider non-starters.

That said, some of these APIs--for example, App Engine--predate Google Cloud itself, and so don't get to take advantage of some of the lessons learned there. Work is being done there, and the teams are doing what they can, but APIs are by nature slow-moving things, because it's important to not break consumer implementations. Overall, however, we're trending in the right direction.

As a terraform user, I want as many resources managed in the provider as possible with the code. However, I'd much prefer that the features that can blow up high-level resources are not getting into the provider.

I think that's fair, and is feedback we'll take under advisement. I'm currently circulating a proposal internally for another approach to this, which I'll post here as an issue to gather community feedback on. I'd love to hear your feedback on it. I'll try to link it to this issue when it's posted.

@leighmhart
Copy link

For reference, if your terraform binary isn't updated to the latest, the provider may silently destroy your project without mentioning the change/intent in a plan. Just happened to me with 0.10.7

@morgante
Copy link

morgante commented Oct 2, 2018

if your terraform binary isn't updated to the latest, the provider may silently destroy your project without mentioning the change/intent in a plan.

Is there a bug related to this? That is very surprising and sounds very different from this issue (where projects get recreated, but that is absolutely mentioned in the plan).

@paddycarver
Copy link
Contributor

I'm very interested in how that could happen as well, that's not intended behaviour at all. I'm sorry for the inconvenience, I had no idea that would happen.

I'm curious, but I think we've also established the current design is not working. #2118 looks to be the way we're moving forward. I need to update the thread with the results of implementation, so #2147 isn't an entirely faithful implementation of that proposal, but it's the closest I could get. I'm hoping it's part of the next release.

@leighmhart
Copy link

@morgante see #2118 (comment) for RCA. Likely the only bug relates to targeted apply not doing what was expected, and some PEBKAC on my part.

@ghost
Copy link

ghost commented Nov 16, 2018

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost ghost locked and limited conversation to collaborators Nov 16, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

No branches or pull requests

9 participants