
container: fix missing updates for google_container_cluster.node_config subfields #12014

Merged

Conversation

wyardley
Contributor

@wyardley wyardley commented Oct 16, 2024

This resolves an issue where many subfields of node_config in a cluster (which affects the default node pool "default-pool" when remove_default_node_pool is left unset or set to false) did not support in-place updates properly. It also allows three subfields, which had previously been set to force recreation of the default node pool (because updates were non-functional), to be updated in place.
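
To illustrate the scenario, here is a sketch with invented names and values (written as an HCL string embedded in Go, the way the acceptance-test configs are): when remove_default_node_pool is left at its default of false, node_config on the cluster is what configures "default-pool", so these subfields need working in-place update paths.

// Sketch only (invented names/values): node_config here applies to the
// cluster's default node pool, because remove_default_node_pool defaults to
// false and so "default-pool" is kept.
package container_test

const defaultPoolNodeConfigSketch = `
resource "google_container_cluster" "primary" {
  name               = "example-cluster"
  location           = "us-central1-a"
  initial_node_count = 1

  # Before this change, editing several of these subfields either silently
  # did nothing or forced recreation of the default node pool.
  node_config {
    machine_type = "e2-medium"
    disk_size_gb = 50
    tags         = ["example-tag"]
    labels = {
      env = "example"
    }
  }
}
`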

Some acceptance tests are added, and some existing tests have been adjusted to confirm that the behavior is the same between google_container_cluster.node_config and google_container_node_pool.node_config.

Fixes hashicorp/terraform-provider-google#19225
Fixes hashicorp/terraform-provider-google#18208
Fixes hashicorp/terraform-provider-google#16054
Fixes hashicorp/terraform-provider-google#13872

Possible fix for hashicorp/terraform-provider-google#17522 roaks3: This helps but I don't think it's a fix per se
Partial / possible fix for hashicorp/terraform-provider-google#12966 roaks3: Agreed, might fix, but we shouldn't close it yet

Followup to #11826 where the code used by the regular node pool update code was broken out.

Release Note Template for Downstream PRs (will be copied)

container: fixed missing in-place updates for some `google_container_cluster.node_config` subfields
container: added in-place update support for `labels`, `resource_manager_tags` and `workload_metadata_config` in `google_container_cluster.node_config` 

References

See above for some other issues that I believe may be fully or partially resolved by this.

b/361634104

@modular-magician modular-magician added the awaiting-approval label Oct 16, 2024
@wyardley wyardley changed the title container: fix missing updates for many google_container_cluster.node_config subfields container: fix missing updates for google_container_cluster.node_config subfields Oct 16, 2024
@wyardley wyardley marked this pull request as ready for review October 16, 2024 05:27

Hello! I am a robot. Tests will require approval from a repository maintainer to run.

@roaks3, a repository maintainer, has been assigned to review your changes. If you have not received review feedback within 2 business days, please leave a comment on this PR asking them to take a look.

You can help make sure that review is quick by doing a self-review and by running impacted tests locally.

@github-actions github-actions bot requested a review from roaks3 October 16, 2024 05:28
@modular-magician modular-magician removed the awaiting-approval label Oct 17, 2024
Contributor

@roaks3 roaks3 left a comment


I've kicked off tests, but @wyardley this is a little tough to understand out of context, and seems potentially dangerous. Could you possibly either:

  1. Describe what you've actually changed in the code, and why that solves the issue, or
  2. Loop in someone else who may have prior context on these changes

@modular-magician
Collaborator

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

google provider: Diff ( 1 file changed, 5 insertions(+), 123 deletions(-))
google-beta provider: Diff ( 1 file changed, 5 insertions(+), 123 deletions(-))

@modular-magician
Collaborator

Tests analytics

Total tests: 212
Passed tests: 199
Skipped tests: 13
Affected tests: 0

Click here to see the affected service packages
  • container

🟢 All tests passed!

View the build log

@wyardley
Contributor Author

wyardley commented Oct 17, 2024

@roaks3 Sure!

  1. The node_config subfield of google_container_cluster controls the settings for the default node pool when remove_default_node_pool is left unset or set to false. The linked issue mentions how many of these fields previously didn't support updates when they should have, causing an inability to control those settings (which would be updatable if they were on a standalone node pool). It's actually the exact same resource. As suggested, I laid the foundation for this change in container: refactor node_config node pool update #11826 by moving the code that handles changes to node_config in the container_node_pool resource into a standalone function. container: add support for kubelet read only port #11272 also has some background / context. In theory, the tests for updates on some of these fields in the node pool resource should behave the same way, and some of the existing tests (including, I believe, TestAccContainerCluster_withNodeConfigKubeletConfigSettingsUpdates) should also cover this scenario, but I'm happy to add or update some existing tests (within reason) if that helps create more confidence.
  2. Have you read through the linked issues / PRs? Those have got a lot of discussion that hopefully outlines both what the issue is, and some other PRs that have addressed part of this issue. Some of the folks at Google who have been involved in previous discussions / PRs related to this are @rileykarson and @melinath, who probably have the most context here. In particular, see this comment. I've linked to some related issues directly in the description now as well under "References".
  3. I think the biggest thing is: there are obviously some test gaps here, most of which relate to things that never worked in this context (and maybe in some cases, aren't tested in the container_node_pool resource either for the use case of updating). Again, within reason, I am happy to add some additional test coverage if it helps improve confidence in this change.

So, tl;dr: this should make the behavior of the default node pool within the container_cluster resource more consistent with the behavior of independently defined or additional node pools, and fix some cases where an existing setting on default-pool could not be updated in place, instead failing or producing a permadiff.

Let me know if that helps in terms of what you were looking for.

A bit broader context: google_container_cluster having the ability to define a node pool resource within the cluster resource (or outside of it) has been a persistent cause of headaches with this resource (one reason that many people choose to delete the default pool and manage nodepools separately).

@github-actions github-actions bot requested a review from roaks3 October 17, 2024 17:57
}

nodePoolNodeConfigUpdate(d, config, nodePoolInfo, "", defaultPool, d.Timeout(schema.TimeoutUpdate))
Contributor Author


Happy to adjust this to define the timeout above, and/or to hard-code "default-pool" rather than defining it above.

@roaks3
Contributor

roaks3 commented Oct 18, 2024

@wyardley thanks for the response!

Yea, I think I understand the intent to update this config, and some of the nuance where this is shared between resources. But I am still wondering if you could give a brief description of the actual code changes here. (I could probably deduce the answer by reading all of the related references, but it would take quite a bit longer for me to wrap my head around everything for this review). For example, why are all of these lines being removed? Are they effectively already copied under the function you are replacing them with, or do I need to review those functions as well to determine what behavior has changed? In short, what could break?

Acknowledged on the testing. Once I know a bit more about what is changing, that will help me weigh in, but I'm tempted to say this is the type of change that would be covered well enough by existing tests. We may want to add some of these fields to existing tests though, if there are any that aren't already included.

@wyardley
Contributor Author

wyardley commented Oct 18, 2024

Yea, I think I understand the intent to update this config, and some of the nuance where this is shared between resources. But I am still wondering if you could give a brief description of the actual code changes here.

So the bug here is that not all of the fields that support update in place were being updated when defined for the default pool via node_config. Some of the code, such as the fields I added in #11272, had the update implemented and had tests for it (because @melinath asked me to add them). In those cases, they would have been separately implemented in both places already, but in other cases, the code was already missing from one place.

The plan therefore was to first extract out the node_config updates in container_node_pool to a separate function as one step, and then call that same function for node_config within container_cluster.

So the simple answer to your question is that the code being pulled out should be a functionally equivalent subset of what's now being called. Some of the code should be identical or close to identical (other than the semantics around prefix being a little different and so on) to the code in nodePoolNodeConfigUpdate() that's now being called (e.g., node_config.0.image_type and node_config.0.kubelet_config, which I fixed in #11697 after adding part of it for #11272, and gcfs_config, fixed in #11717), whereas some of the code (essentially, the missing code to do the remainder of the updates) was simply not present here previously (e.g., node_config.0.tags, node_config.0.resource_manager_tags, the block with disk_size_gb, disk_type, etc.), which is the part that's being fixed. But the actual "thing" being operated on is the same in both cases: a node pool.

For the parts that were already implemented in both places, you should be able to compare the code and see that those sections are pretty similar.

For example, compare:

if d.HasChange("node_config.0.kubelet_config") {
defaultPool := "default-pool"
timeout := d.Timeout(schema.TimeoutCreate)
nodePoolInfo, err := extractNodePoolInformationFromCluster(d, config, clusterName)
if err != nil {
return err
}
// Acquire write-lock on nodepool.
npLockKey := nodePoolInfo.nodePoolLockKey(defaultPool)
// Still should be further consolidated / DRYed up
// See b/361634104
it := d.Get("node_config.0.kubelet_config")
// While we're getting the value from fields in
// node_config.kubelet_config, the actual setting that needs to be
// updated is on the default nodepool.
req := &container.UpdateNodePoolRequest{
Name: defaultPool,
KubeletConfig: expandKubeletConfig(it),
}
updateF := func() error {
clusterNodePoolsUpdateCall := config.NewContainerClient(userAgent).Projects.Locations.Clusters.NodePools.Update(nodePoolInfo.fullyQualifiedName(defaultPool), req)
if config.UserProjectOverride {
clusterNodePoolsUpdateCall.Header().Add("X-Goog-User-Project", nodePoolInfo.project)
}
op, err := clusterNodePoolsUpdateCall.Do()
if err != nil {
return err
}
// Wait until it's updated
return ContainerOperationWait(config, op, nodePoolInfo.project, nodePoolInfo.location,
"updating GKE node pool kubelet_config", userAgent, timeout)
}
if err := retryWhileIncompatibleOperation(timeout, npLockKey, updateF); err != nil {
return err
}
log.Printf("[INFO] GKE cluster %s: kubelet_config updated", d.Id())
}

to
if d.HasChange(prefix + "node_config.0.kubelet_config") {
	req := &container.UpdateNodePoolRequest{
		NodePoolId: name,
		KubeletConfig: expandKubeletConfig(
			d.Get(prefix + "node_config.0.kubelet_config")),
	}
	if req.KubeletConfig == nil {
		req.KubeletConfig = &container.NodeKubeletConfig{}
		req.ForceSendFields = []string{"KubeletConfig"}
	}
	updateF := func() error {
		clusterNodePoolsUpdateCall := config.NewContainerClient(userAgent).Projects.Locations.Clusters.NodePools.Update(nodePoolInfo.fullyQualifiedName(name), req)
		if config.UserProjectOverride {
			clusterNodePoolsUpdateCall.Header().Add("X-Goog-User-Project", nodePoolInfo.project)
		}
		op, err := clusterNodePoolsUpdateCall.Do()
		if err != nil {
			return err
		}
		// Wait until it's updated
		return ContainerOperationWait(config, op,
			nodePoolInfo.project,
			nodePoolInfo.location,
			"updating GKE node pool kubelet_config", userAgent,
			timeout)
	}
	if err := retryWhileIncompatibleOperation(timeout, npLockKey, updateF); err != nil {
		return err
	}
	log.Printf("[INFO] Updated kubelet_config for node pool %s", name)
}

I checked and when the default node pool is defined via node_pool {} within container_cluster, that already functions as it should.

linux_node_config and the disk size etc. ones may need to be double-checked (I can play with it a little), though I think this code should handle it correctly too - in my understanding, if updating it for externally defined nodepools works, it should also work for the default pool. And, to whatever extent these things are tested under container_node_pool, my understanding is that those tests should more or less cover their use here.

However, some of these resources may not have tests in either place that test the path where the resource is updated in-place.

@melinath
Member

Drive-by comment: @wyardley will existing tests catch this error? or is this fixing bugs that were not caught by the existing tests?

@wyardley
Contributor Author

wyardley commented Oct 18, 2024

@melinath: There were presumably no (update) tests for the things that were broken before, at least not directly within container_cluster.node_config, since they’d be failing, and I think when we discussed, there was a preference to not add (skipped) failing tests.

I’m not sure how comprehensive the tests under container_node_pool.node_config are - I’d guess there are missing ones there as well. But in any case where they do exist, presumably testing one should test the other once they’re using the same code.

I ran through the tests that do have update tests, but agree there are probably gaps.

Edit: See the summary below of what I found (and added) test-wise.

@wyardley wyardley force-pushed the wyardley/19225/node_config_updates branch from 53caf85 to 62e2a1b October 19, 2024 00:05
@modular-magician modular-magician added the awaiting-approval label Oct 19, 2024
@wyardley
Contributor Author

wyardley commented Oct 19, 2024

Ok @roaks3 @melinath: I tried to do a quick audit of existing test coverage. It's kind of a patchwork, though there's actually a reasonable amount of coverage already, especially if we assume the same behavior between the default pool and other node pools once this change is in place.

I'm adding a simple tweak to an existing test that will already cover tags and labels for container_cluster.node_config.

Here's what I found (may not be perfect)

  • logging_variant update tests: TestAccContainerCluster_withLoggingVariantUpdates and TestAccContainerNodePool_withLoggingVariantUpdates
  • disk_size_gb, disk_type, machine_type, storage_pools: Seems like TestAccContainerNodePool_withMachineDiskStoragePoolsUpdate tests this? And TestAccContainerCluster_storagePoolsWithNodePool and TestAccContainerNodePool_storagePools have storage_pools parameterized but only called once. Added a change here to TestAccContainerCluster_withNodeConfig for disk_size_gb in another commit (and it would probably be pretty easy to change one or two of the other ones here too).
  • taint: Tested in TestAccContainerNodePool_withTaintsUpdate. I adjusted TestAccContainerCluster_withNodeConfig so it should now test updates too.
  • tags: Not sure updates are tested in either place, but could be missing it? Adding a change in TestAccContainerCluster_withNodeConfig
  • resource_manager_tags: TestAccContainerCluster_resourceManagerTags exists but doesn't test updates; TestAccContainerNodePool_resourceManagerTags does test updates. Edit: Added an update to testAccContainerCluster_resourceManagerTags in this PR that I think will work.
  • resource_labels: Looks like this is tested in TestAccContainerCluster_misc and TestAccContainerNodePool_withNodeConfig
  • labels: TestAccContainerCluster_withNodeConfig (updates in config testAccContainerCluster_withNodeConfigUpdate) - I think labels didn't change, but I'm pushing a commit so that it does change. This is a pretty good test where we could test some other small / generic stuff pretty easily since it's just two different configs vs. being parameterized.
  • image_type: Update covered in TestAccContainerNodePool_withNodeConfig and TestAccContainerCluster_withNodeConfig
  • workload_metadata_config: Update tested in TestAccContainerNodePool_withWorkloadIdentityConfig. Don't believe update is / was tested in container_cluster. Edit: added an update in this PR to test updates in TestAccContainerCluster_withWorkloadMetadataConfig too. Came across an interesting quirk there, and filed "node pool node_config update doesn't notice or fail on 400 API response" (hashicorp/terraform-provider-google#19935) for it.
  • gcfs_config: (already covered, see above)
  • kubelet_config: (already covered, see above)
  • linux_node_config: Updates tested in TestAccContainerNodePool_withLinuxNodeConfig. Don't think they're currently covered in container_cluster.
  • fast_socket: Updates tested in TestAccContainerNodePool_fastSocket. Don't think they're currently covered in container_cluster. Edit: Since this also requires gvnic to be set, I added a new test for this one, TestAccContainerCluster_withNodeConfigFastSocket vs. piling yet more stuff onto the existing test with a bunch of settings.
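
For the fast_socket item above, a rough sketch of the config shape involved (not the exact new test's config; names and values are invented), since fast_socket requires gvnic to be enabled:

// Rough sketch only (invented names/values): fast_socket requires gvnic, so
// the two blocks appear together under node_config.
package container_test

const fastSocketNodeConfigSketch = `
resource "google_container_cluster" "primary" {
  name               = "example-cluster"
  location           = "us-central1-a"
  initial_node_count = 1

  node_config {
    gvnic {
      enabled = true
    }
    fast_socket {
      enabled = true
    }
  }
}
`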

@wyardley wyardley force-pushed the wyardley/19225/node_config_updates branch 2 times, most recently from d0d2365 to 0b5b6a1 October 19, 2024 04:14
@wyardley
Contributor Author

I was able to extend TestAccContainerCluster_withNodeConfig to include a bunch of fields that previously would have failed on update, like taint, machine_type, disk_type, disk_size_gb, tags, and labels, which dramatically reduces the testing gaps with only minimal added / modified code. I also added a ConfigPlanChecks block.
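
For reference, the kind of ConfigPlanChecks block I mean looks roughly like this (a self-contained sketch, not the actual test: the config strings are invented, and provider factories / precheck are omitted). It uses the plancheck helpers from terraform-plugin-testing to assert that the second step plans an in-place update rather than a replacement:

// Sketch only: asserts that changing node_config subfields plans an Update
// of the cluster, not a destroy/recreate. Config strings and names invented.
package container_test

import (
	"testing"

	"github.com/hashicorp/terraform-plugin-testing/helper/resource"
	"github.com/hashicorp/terraform-plugin-testing/plancheck"
)

func TestAccContainerCluster_nodeConfigUpdateSketch(t *testing.T) {
	baseConfig := `
resource "google_container_cluster" "primary" {
  name               = "example-cluster"
  location           = "us-central1-a"
  initial_node_count = 1
  node_config {
    disk_size_gb = 50
    labels = { env = "example" }
  }
}
`
	updatedConfig := `
resource "google_container_cluster" "primary" {
  name               = "example-cluster"
  location           = "us-central1-a"
  initial_node_count = 1
  node_config {
    disk_size_gb = 60
    labels = { env = "example-updated" }
  }
}
`
	resource.Test(t, resource.TestCase{
		// Provider factories / PreCheck omitted for brevity.
		Steps: []resource.TestStep{
			{Config: baseConfig},
			{
				Config: updatedConfig,
				ConfigPlanChecks: resource.ConfigPlanChecks{
					PreApply: []plancheck.PlanCheck{
						// Changed node_config subfields should plan as an in-place
						// Update, not a destroy/recreate of the cluster.
						plancheck.ExpectResourceAction(
							"google_container_cluster.primary",
							plancheck.ResourceActionUpdate,
						),
					},
				},
			},
		},
	})
}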

With the changes, things pass, and, even with all the separate node pool updates, it's not horribly slow:

--- PASS: TestAccContainerCluster_withNodeConfig (808.25s)
PASS
ok  	github.com/hashicorp/terraform-provider-google/google/services/container	809.408s

The same test fails as expected without the code changes.

Some of the fields (labels, workload_metadata_config, resource_manager_tags) were originally marked as fields to force recreation, presumably because of the lack of code to update those fields; I thought about trying to remove that separately, either before or after, but I think it's all intertwined, so I suggest we pull this stuff out as well as part of these changes, and then maybe add a second release note if necessary:
https://github.com/GoogleCloudPlatform/magic-modules/blob/3f06b2ed38e474ba08cca2d8553d29a741ec1810/mmv1/third_party/terraform/services/container/resource_container_cluster.go.tmpl#L107C2-L110

This is a good example of where having tests directly for container_cluster, plus the added plan check, was useful, and (assuming this change is safe and makes sense) it will also help make everything more consistent.

@wyardley
Contributor Author

wyardley commented Oct 19, 2024

I've added / adjusted tests for almost all of the fields directly in google_container_cluster (see above for more detailed notes on where tests were added / adjusted). Since fast_socket requires the gvnic setting, and since there are already so many different fields checked in the same TestAccContainerCluster_withNodeConfig test, I did the fast_socket tests in a new test. The resource manager tags update tests are a little tricky - if someone knows more about this than I do, feel free to double-check that it makes sense (basically, I made two tag / value sets, and then swapped from 1 to 2 for the update, which was simplest without making two entirely separate configs the way the node pool test does it).

I can add linux_node_config tests if desired as well, though the existing tests do seem to confirm that the behavior is consistent between node pool and cluster defined node pool.

All of these passed when I ran them manually; would be great if someone can kick off the full suite again Monday.

I left the commits mostly atomic for easier review, but let me know if I should squash something.

--- PASS: TestAccContainerCluster_resourceManagerTags (763.72s)
--- PASS: TestAccContainerCluster_withNodeConfig (776.55s)
--- PASS: TestAccContainerCluster_withWorkloadMetadataConfig (963.77s)
--- PASS: TestAccContainerCluster_withLoggingVariantUpdates (570.99s)
--- PASS: TestAccContainerCluster_withNodeConfigFastSocket (846.52s)
--- PASS: TestAccContainerCluster_withNodeConfigKubeletConfigSettingsUpdates (1280.76s)
--- PASS: TestAccContainerCluster_misc (1257.44s)

forceNewClusterNodeConfigFields = []string{
	"labels",
	"workload_metadata_config",
	"resource_manager_tags",
Contributor Author


If this list had been perfectly maintained / up to date, I think this "bug" wouldn't have existed (because the default pool would have gotten recreated if any of those fields changed), but it also seems better to update the fields in-place if we can, which it seems like we can, and I'm pretty confident at least some fields were missing from this list.

@wyardley wyardley force-pushed the wyardley/19225/node_config_updates branch 2 times, most recently from 6625353 to 036c349 October 19, 2024 18:14
@wyardley wyardley force-pushed the wyardley/19225/node_config_updates branch from 10658fe to 91c0c4d October 24, 2024 18:08
@modular-magician modular-magician added the awaiting-approval label Oct 24, 2024
@wyardley
Contributor Author

  • Tested and then pushed up a fix for TestAccContainerCluster_withNodeConfigLinuxNodeConfig. Also, UNSPECIFIED seems to be the default, so I went from unset => v2 => v1, and checked the latter two explicitly.
  • Tried a run in RECORDING mode and a couple of runs in REPLAYING mode locally (and then another run in RECORDING mode) for TestAccContainerCluster_resourceManagerTags, and it seems to be OK.

@modular-magician modular-magician removed the awaiting-approval label Oct 25, 2024
@modular-magician
Collaborator

Hi there, I'm the Modular magician. I've detected the following information about your changes:

Diff report

Your PR generated some diffs in downstreams - here they are.

google provider: Diff ( 3 files changed, 312 insertions(+), 181 deletions(-))
google-beta provider: Diff ( 3 files changed, 312 insertions(+), 181 deletions(-))

Missing test report

Your PR includes resource fields which are not covered by any test.

Resource: google_container_cluster (415 total tests)
Please add an acceptance test which includes these fields. The test should include the following:

resource "google_container_cluster" "primary" {
  node_config {
    resource_manager_tags = # value needed
  }
}

@modular-magician
Collaborator

Tests analytics

Total tests: 214
Passed tests: 200
Skipped tests: 13
Affected tests: 1

Click here to see the affected service packages
  • container

Action taken

Found 1 affected test(s) by replaying old test recordings. Starting RECORDING based on the most recent commit. Click here to see the affected tests
  • TestAccContainerCluster_withNodeConfigLinuxNodeConfig

Get to know how VCR tests work

@modular-magician
Collaborator

🟢 Tests passed during RECORDING mode:
TestAccContainerCluster_withNodeConfigLinuxNodeConfig [Debug log]

🟢 No issues found for passed tests after REPLAYING rerun.


🟢 All tests passed!

View the build log or the debug log for each test

@wyardley
Contributor Author

wyardley commented Oct 25, 2024

@roaks3 any idea why it didn’t re-run TestAccContainerCluster_resourceManagerTags this time? I guess because it already ran successfully in recording mode once, right, despite the replay failure last time?

As far as why it’s still complaining about the missing test, I've normally seen that when the test doesn't compile at all, but maybe here it's because it's not the first attribute in the nested block?

@roaks3
Contributor

roaks3 commented Oct 25, 2024

Yea, TestAccContainerCluster_resourceManagerTags was recorded properly, but didn't pass replay earlier because it was waiting for the bootstrap permissions to be fully available. On subsequent replay, the permissions were correct and it succeeded.

I'm not sure on the missing test detector. It is reporting a correct requirement, but we also clearly have it covered in a test. It might be hitting some edge case it isn't expecting.

@christhegrand

I thought this change might fix hashicorp/terraform-provider-google#18208, but I tried to add the sysctl config again with version 6.12.0 of the Google provider, and it's still not working for me.

amanMahendroo pushed a commit to amanMahendroo/magic-modules that referenced this pull request Dec 17, 2024