Improve container engine / node pool handling #285
Adding/removing node pools is really complicated with regard to the default node pool. Let's say I have only the default node pool in my cluster, and I want to add another one (assume for now that we supported updating node pools in clusters). But our terraform config doesn't have the default node pool in it, so terraform would want to delete it, which isn't our intent. This means that everyone who wants to make changes to node pools would then have to add the default one back into their config, which is kind of gross. If we removed node pools entirely from clusters and left them as separate resources, then we lose the ability to create a node pool that isn't the default one on cluster creation. So no solution from me right now either, but I'll be thinking about this since it's important. Happy to have this conversation though, it's something I've been meaning to get to for a while :)
Okay, here are my thoughts, after more study of the available API and the current code we have. We create a schemaNodePool in the same spirit as schemaNodeConfig.

For create:
For get:
For update: loop through nodePools[] from the returned Cluster object; if machine_type for a nodePool has changed: forceNew.
For delete:

Looking forward to hearing where my logic is wrong, or any arguments for/against this approach :)
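To make the shape of that proposal a bit more concrete, here is a minimal sketch of what a cluster with an embedded node pool block could look like under such a schema. The field names mirror the existing node_config block; the exact attribute set and behaviour are assumptions on my part, not a description of the provider as it stood at the time:

```hcl
resource "google_container_cluster" "example" {
  name = "example-cluster"
  zone = "us-central1-a"

  # Hypothetical embedded pool in the spirit of schemaNodeConfig;
  # the exact attribute set is an assumption.
  node_pool {
    name               = "pool-1"
    initial_node_count = 3

    node_config {
      # A change to machine_type would force a new node pool (forceNew).
      machine_type = "n1-standard-1"
    }
  }
}
```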
Let's table the conversation about Update for now, since there's not much of a design decision to happen there. As time has gone on, GKE has allowed updating of more fields, and they just haven't made it into Terraform yet because we've been more focused on other things people have asked for :) (feel free to file a separate issue about it though!) With regards to node pools, your approach is pretty similar to the one I came up with when I started thinking about this. I had actually planned on deprecating the separate node pools resource and then got pulled on to other things and never swung back to finish it up. I actually do like the idea of insisting the user create a node pool rather than allowing the API to create a default one, since the user is still specifying the same attributes they would with the default one; it just appears elsewhere. And then we don't run into diffs because the number of node pools server-side is different from what's in our config file.
#299 is going in that direction.
Regarding update: I will create separate issues for the fields that don't need to force recreation anymore, and maybe also take a stab at some PRs to fix those issues :) Sounds good to me to deprecate separate node pools and insist the user specify them in the cluster resource :) Can we break this down into smaller steps, or does it need to be one big patch?
You can probably break it down into smaller steps. Give it a try and see how it goes :) Here are some docs about contributing; let me know if you have questions or need help: https://github.com/hashicorp/terraform/blob/master/.github/CONTRIBUTING.md
Actually, let's back up a step to make sure we need to do this. What exactly are you trying to do with the current infrastructure? It sounds like you want to make container cluster importable and you're running into issues there because node pools on clusters in terraform don't accept a node config. That's going to be fixed in #299. Once that's fixed, what's the next problem that leads container clusters to not be importable? Or am I misunderstanding somewhere?
Mostly my purpose is to be able to manage our GCP infrastructure with Terraform. There used to be a bunch of problems where the Google provider made assumptions about how people use GCP which didn't match how we had set up our infrastructure. But the provider seems to have been vastly improved over the last couple of months (kudos for that!), and along with us having changed our setup a bit, that means we are mostly down to not being able to import certain of our resources, the most important of them being Container Engine Clusters / Node Pools.
One more thing that I guess this ticket might also address: changes to node_pools should not need to recreate the entire cluster :)
So it seems like there are a few different things getting conflated here. On the topic of Terraform not having certain functionality that the API exposes, that's almost always a case of "nobody got around to implementing it yet" rather than us assuming you won't use certain features. Likewise, certain things get added to the API later (like all the properties of container cluster that are now updatable), which can easily lead to resources that are out-of-date solely due to the passage of time and not because of any decisions explicitly made to keep them that way. So now, in terms of deprecating the node pool resource, is there a scenario that having both makes impossible? If not, it'll be much easier to just leave things the way they are so users don't have to change their existing configs. Having node pools as a separate resource also allows users to easily add/remove node pools after the default one has been created. There have been a lot of situations in other resources where we've implemented similar behavior (allowing two different ways of specifying the same thing) in order to handle different use cases (see #171 for a good example). We can certainly update our docs to make sure people are using the right resource from the beginning, but if there's no broken behavior I'd rather not have us spend the effort on this. As for removing ...
I think the current pair of resources to manage a GKE cluster makes things confusing. The default pool config can only be configured through google_container_cluster. Personally I prefer to manage everything through ... I mean, upgrading node pools with

version 1:

version 2:

version 3:

could be achieved with

version 2:

as long as the resource creates the new node pools before destroying the old ones during the delta. (Pods will be rescheduled on the new node pool assuming the scheduling constraints are met and that the capacity of pool-v2 is at least the same as pool-v1.) As a second step, we can change the resource to update the node pools when possible (changing node versions is a different operation than changing node pool size, so updates to the two cannot be done at the same time; the resource needs to detect these and warn the user).
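As an illustration of the rollover being described, here is a sketch of what such versioned pool definitions could look like. The pool names and sizes are made up, and the two blocks are successive revisions of the same file rather than one config; the ordering guarantee asked for above is that the provider create pool-v2 before destroying pool-v1:

```hcl
# Revision 1: only pool-v1 exists.
resource "google_container_cluster" "example" {
  name = "example-cluster"
  zone = "us-central1-a"

  node_pool {
    name               = "pool-v1"
    initial_node_count = 3
  }
}

# Revision 2: pool-v1 is replaced by pool-v2. Pods are rescheduled
# onto pool-v2, assuming it has at least the capacity of pool-v1.
resource "google_container_cluster" "example" {
  name = "example-cluster"
  zone = "us-central1-a"

  node_pool {
    name               = "pool-v2"
    initial_node_count = 3
  }
}
```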
Hey everyone, this looks like it's getting lengthy and confusing. Let's see if we can't all get on the same page here. @drzero42, it sounds like you're looking for GKE clusters and node pools to be importable. Is that accurate? It doesn't really matter what steps we take to get there; you're happy as long as the cluster and node pools are importable, right? (Assuming we don't lose current functionality, etc.) There seems to be a separate, parallel conversation here about whether it makes sense to have a separate google_container_node_pool resource at all. Finally, there seems to be a conversation around this, as well:
Why can't the default pool config be managed through google_container_node_pool? Managing only non-default pools through google_container_node_pool (until #299 gets merged) does seem to be the case, and is unfortunate. But there seems to be a straightforward fix for that pending. So at the end of the day, I'm considering this issue to be mostly about "how do we get google_container_cluster and google_container_node_pool to be importable?" My thoughts on that are pretty straightforward: I think we can make them importable just as we would any other resource. Users shouldn't be using both resources to manage the same node pool. Of course, that comes with some problems. The biggest I can see right now is (e.g.) one team creating a cluster using Terraform, specifying no node pools, and another team creating a node pool under that cluster, using Terraform. The first team clearly owns the cluster and its administration, but wants to delegate the node pool ownership to the second team. One way around this limitation would be for the cluster to set ignore_changes on its node pools. This is, admittedly, all pretty confusing, and we should probably do a better job of documenting how these resources relate to each other. Hopefully that clears some things up, and doesn't add to the confusion. And hopefully I actually have an accurate understanding of what's being discussed here. Please correct me if I don't. :)
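A minimal sketch of what that delegation could look like; reading the truncated suggestion as Terraform's lifecycle ignore_changes mechanism is my assumption, and the syntax is shown in the 0.10-era string-list form:

```hcl
# Team A owns the cluster but does not manage its node pools.
resource "google_container_cluster" "shared" {
  name               = "shared-cluster"
  zone               = "us-central1-a"
  initial_node_count = 1

  lifecycle {
    # Ignore node pools added or changed outside this config,
    # e.g. by another team using google_container_node_pool.
    ignore_changes = ["node_pool"]
  }
}

# Team B manages its own pool against the same cluster.
resource "google_container_node_pool" "team_b" {
  name               = "team-b-pool"
  zone               = "us-central1-a"
  cluster            = "${google_container_cluster.shared.name}"
  initial_node_count = 2
}
```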
First, I should apologize for my latency - life called and demanded my attendance :) @danawillow Yes, the missing functionality is definitely related to the API evolving and nobody having gotten around to implementing it yet. I didn't open this ticket to complain - I just wanted to get the train rolling on getting these things fixed and added :) With regard to the separate node pool resources debate, if we can let the default pool be handled by the cluster resource and then add node pools with separate node pool resources, I don't have any issues with that. I just have a feeling it will be problematic to implement, since a describe/get call to the API for a cluster will return a node pools list which includes all node pools in the cluster. How does the code actually map that to resources, if those node pools could be either specified inside the cluster resource or as separate node pools? The way I see it, it would seem to be quite error-prone if we don't mandate one approach or the other. I have no particular feelings towards one or the other - I just want to be able to import my current setups and start managing everything through Terraform ;) With regard to changing node pool size, there is a limitation in the API that makes it difficult for us. I haven't found a way to actually retrieve the current number of nodes in a pool, only the initial count. Does anybody here have a way to contact the Container Engine team and request this information be added to the API? :) @paddycarver Yes, this is getting lengthy and a bit confusing, but I think a lot of very good points have come from this debate so far :)
For changing node pool size, I'm already working on it; the issue for that is #25 (please claim issues if you're working on them so we don't duplicate work!)
I'm not sure it makes much sense to only manage some of the infrastructure. Is the pattern you're suggesting that users create clusters manually and then use terraform to manage node pools? The expectation should be that terraform manages both, because it should be able to manage an entire infrastructure. I think the confusion is that gcloud and the console UI both do something automatic that isn't intuitive: they create a node pool. In my opinion, terraform should attempt to avoid that kind of automatic/hidden behavior. The least surprising, most declarative implementation is for google_container_cluster to manage only the cluster itself and for node pools to be managed exclusively through google_container_node_pool.
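A sketch of that split, under the assumption that the cluster resource would not create any pool implicitly (resource and pool names here are illustrative):

```hcl
# The cluster resource manages only cluster-level settings.
resource "google_container_cluster" "primary" {
  name = "primary"
  zone = "us-central1-a"
}

# Every node pool, including the "first" one, is its own resource.
resource "google_container_node_pool" "workers" {
  name               = "workers"
  zone               = "us-central1-a"
  cluster            = "${google_container_cluster.primary.name}"
  initial_node_count = 3
}
```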
@coreypobrien I kind of agree, but if ...
I'm not so much suggesting a pattern, to be honest. I think more in terms of what users are trying to do:

If there's something people want that's not covered by that, I definitely want to hear about it, and we'll want to work out a solution for it. I also know that last suggestion isn't very pretty, and we're working on ways around it, but the ideal way is currently held up pending some core improvements.

Could you help me understand why? What could you do if it worked that way that you cannot do today?

This, to me, is the crux of what this issue is asking for. And as far as I can tell, we can make these resources importable without any of the proposed restructuring. Apologies, I think I'm just confused about why these changes are being proposed in the first place and what the goals are. I'd love to have more info on that.
My use case is:

Right now that isn't possible because of this idea that you can either manage node pools and ignore the cluster state, or manage the cluster but ignore the node pools. In my opinion, designing Terraform to only partially manage infrastructure is counter-intuitive. The default for Terraform providers should be to expect to manage everything. Anything that isn't managed is an exception to the default. It doesn't matter to me whether node pools end up defined inside google_container_cluster or as separate google_container_node_pool resources, as long as Terraform can manage both the cluster and all of its node pools.
@paddycarver The only reason I would vote to remove ...
I feel like keeping ... I agree with @drzero42, however, that either of the two approaches is better than the current broken behaviour.
I understand that's not possible: one can't create a container_cluster without the default node_pool, due to how the API works.
According to the API docs I don't think you have to have the default node pool. I think the key is that if you specify the nodePools field in the cluster creation request, only the pools you define get created, rather than a default one.
You're right, I misunderstood: @rochdev wanted to create a cluster without any pool, and manage the pools through google_container_node_pool.
I just made #473, which might be yet another reason to change the current way of doing it. The node pools created by google_container_cluster don't have the same set of options available as the ones created by google_container_node_pool resources; in this case, autoscaling. It seems to me it would be less confusing for everybody (including devs who want to add new features to node pools) to just choose one way of doing it: either maintain all node pools for a cluster as proper google_container_node_pool resources (might be problematic due to the way Terraform works?) or maintain the full config for all node pools inside of the google_container_cluster resource they are a part of.
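For reference, this is roughly what the autoscaling option looks like on the standalone resource (a sketch based on the google_container_node_pool schema; at the time of #473 the equivalent was missing from pools defined inside google_container_cluster):

```hcl
resource "google_container_node_pool" "autoscaled" {
  name               = "autoscaled-pool"
  zone               = "us-central1-a"
  cluster            = "${google_container_cluster.primary.name}"
  initial_node_count = 1

  # Cluster autoscaler bounds for this pool.
  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }
}
```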
@drzero42, I just want to clarify something at a high level. When a resource in Terraform doesn't support a certain set of features, the reason is almost always going to be that we just haven't gotten around to it yet. Please give us the benefit of the doubt, and be assured that it is much easier to add features than it is to deprecate large portions of our codebase that likely already have users that would need to migrate 😃
I second what @drzero42 has said. I think having both ... I think that the ... I also think that keeping ...
Hi folks,

My gosh, this issue has gotten long and gone on for a long time. I'm starting to have trouble following what exactly people are asking for, and I think it's because a lot of related but different things are being asked for in this thread. In an effort to make sure everyone's feedback is taken into account, and to make sure we can track actionable chunks of work, I'm going to close this issue and split it into several related issues, each tracking a more granularly-scoped problem. Here's a breakdown of the new issues:

If there's a problem you're having that I didn't cover, feel free to open an issue for it. As you do so, keep in mind that it's definitely easiest for us to get on the same page if you tell us about the problem you're having. Things like "I want to be able to XXX but I can't because YYY" are incredibly helpful in this regard. Finally, I notice a lot of proposals and votes to deprecate resources in this issue. I'm not going to comment on any specific resources, or any specific comments, but I think it's worth mentioning that our goal has always been to support the infrastructure design that makes sense to each org, not to declare one right way to do it. We value everyone's feedback and ideas, and are listening carefully, but do also keep in mind that there are a lot of diverse organizations that use Terraform, and we want to meet everyone's needs the best we can.
@danawillow I always give you the benefit of the doubt - I didn't mean anything disrespectful. My point was more that it seems like extra work to maintain the same kinds of features in multiple places :)
Terraform Version
v0.10.0
Affected Resource(s)
google_container_cluster, google_container_node_pool
During my work on bringing our Google Cloud Platform infrastructure under Terraform administration, I ran into what seems like an artificial limitation in the provider. We run Container Engine clusters with multiple node pools, which have different node configurations. Currently the google_container_cluster resource only allows me to set names and initial_node_count for the node pools I want to create, which will then all be created with the same node config. I see that google_container_node_pool resources in the current master branch actually support separate node configs, even though the documentation on terraform.io doesn't reflect it yet. This is a step in the right direction, as I can let google_container_cluster create the default pool and add a second one next to it. This, however, still makes it difficult to make container_cluster resources importable, as the get/describe call will return all of the node pools attached to a cluster.
I am not experienced enough with Terraform or GCP yet to suggest a perfect solution for this, but my (maybe naive) thought for handling it would be to have a node_pools argument for google_container_cluster resources, which is a list of hash references to google_container_node_pool resources. I think this leaves us with a chicken/egg or circular-dependency problem, where a hash ref for a google_container_node_pool resource doesn't exist until that resource is created, but node pools can't be created until a cluster exists. The API expects the call to create a cluster to include at least one node pool (or a node config, which will be used to create a pool called default-pool). This leads me to think that maybe node pools shouldn't be independent resources, but rather compound arguments on google_container_cluster resources. This would simplify import, but would require additional code to handle changes to node pools.
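For comparison, a sketch of the compound-argument approach being proposed, with two pools of different machine types defined entirely inside the cluster. The attribute names follow the existing node_config block and are assumptions about the eventual schema rather than the provider's behaviour at the time:

```hcl
resource "google_container_cluster" "multi_pool" {
  name = "multi-pool-cluster"
  zone = "us-central1-a"

  # First pool: general-purpose nodes.
  node_pool {
    name               = "standard-pool"
    initial_node_count = 3

    node_config {
      machine_type = "n1-standard-1"
    }
  }

  # Second pool: larger nodes with a different node config.
  node_pool {
    name               = "highmem-pool"
    initial_node_count = 2

    node_config {
      machine_type = "n1-highmem-4"
    }
  }
}
```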
Maybe this ticket could be a discussion where we reach a consensus on how to handle this?