Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GKE node pool default for logging variant is not respected due to terraform's client-side defaulting of the logging_variant field #13706

Open
giuliano-sider opened this issue Feb 10, 2023 · 8 comments
Labels
persistent-bug Hard to diagnose or long lived bugs for which resolutions are more like feature work than bug work service/container service/terraform

Comments

@giuliano-sider
Copy link

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment or link the pull request to this issue.

Terraform Version

Terraform v1.3.1
on linux_amd64
+ provider registry.terraform.io/hashicorp/google v4.44.0

v4.44.0 is the first google terraform provider version to carry the logging_variant field

Affected Resource(s)

  • google_container_cluster
  • google_container_node_pool

Terraform Configuration Files (if applicable)

Consider the following experiment:

resource "google_container_cluster" "with_logging_variant_node_pool_default" {
  provider           = google
  name               = "example-cluster-with-htl"
  location           = "us-central1-a"
  initial_node_count = 1
  release_channel {
    channel = "RAPID" # Provision a cluster which has the high throughput logging agent available.
  }
  remove_default_node_pool = true # We want to manage our node pools separately.

  node_pool_defaults {
    node_config_defaults {
      logging_variant = "MAX_THROUGHPUT"
    }
  }
}

# OK: We expect this to create a MAX_THROUGHPUT node pool due to the node_pool_defaults for this cluster, which is what happens. Since node_config is entirely missing, the node_config.logging_variant default value of DEFAULT is not filled in.
resource "google_container_node_pool" "with_logging_variant_from_node_pool_defaults-2" {
  provider = google
  name     = "example-node-pool-with-htl-3"
  cluster  = google_container_cluster.with_logging_variant_node_pool_default.name
  location = "us-central1-a"
  node_count = 1
}

# WHOOPS: We expect this to create a MAX_THROUGHPUT node pool due to the node_pool_defaults for this cluster. But instead, this will create a node pool with the DEFAULT logging_variant due to the client-side Terraform default value of DEFAULT filled in for the node_config.logging_variant.
resource "google_container_node_pool" "with_logging_variant_from_node_pool_defaults" {
  provider = google
  name     = "example-node-pool-with-htl-2"
  cluster  = google_container_cluster.with_logging_variant_node_pool_default.name
  location = "us-central1-a"
  node_count = 1

  node_config { 
    machine_type = "e2-standard-8" # Provide extra CPU for the high throughput logging agent.
  }
}


# OK: We expect this to create a MAX_THROUGHPUT `default-pool` node pool due to the node_pool_defaults for this cluster, which is what happens. Since node_config is entirely missing, the node_config.logging_variant default value of DEFAULT is not filled in.
resource "google_container_cluster" "test-node-pool-default" {
  provider           = google
  name               = "test-node-pool-default"
  location           = "us-central1-a"
  initial_node_count = 1
  release_channel {
    channel = "RAPID" # Provision a cluster which has the high throughput logging agent available.
  }
  
  node_pool_defaults {
    node_config_defaults {
      logging_variant = "MAX_THROUGHPUT"
    }
  }
}

# WHOOPS: We expect this to create a MAX_THROUGHPUT `default-pool` node pool due to the node_pool_defaults for this cluster. But instead, this will create a node pool with the DEFAULT logging_variant due to the client-side Terraform default value of DEFAULT filled in for the node_config.logging_variant.
resource "google_container_cluster" "test-node-pool-default-2" {
  provider           = google
  name               = "test-node-pool-default-2"
  location           = "us-central1-a"
  initial_node_count = 1
  release_channel {
    channel = "RAPID" # Provision a cluster which has the high throughput logging agent available.
  }

  node_config { 
    machine_type = "e2-standard-8"
  }

  node_pool_defaults {
    node_config_defaults {
      logging_variant = "MAX_THROUGHPUT"
    }
  }
}

The terraform plan for the above resources shows that node_config.logging_variant is defaulted with a value of DEFAULT when any other field in the node_config (e.g. machine_type) is specified: https://gist.github.com/giuliano-sider/2064705c25f608f55f8b1a4377996ed3. This is undesirable because the cluster resource specifies a default value in its node_pool_defaults.node_config_defaults that ought to be respected.

Issue Description

The google_container_cluster resource's node_pool_defaults.node_config_defaults.logging_variant field mirrors an existing field in the GKE API (https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1beta1/projects.locations.clusters#Cluster.NodePoolDefaults) that provides a default value for the logging variant to use in the creation of new node pools in that cluster. However, in certain cases, this google_container_cluster field is not being respected because terraform provides a client-side default value for the node_config.logging_variant field of google_container_node_pool (see https://github.com/GoogleCloudPlatform/magic-modules/blob/f319c33c2aea0610c27a4cefb51f499ac91d8ce5/mmv1/third_party/terraform/utils/node_config.go.erb#L33). When this client-side defaulting doesn't happen, then the default from node_pool_defaults.node_config_defaults.logging_variant is respected. See the terraform config and its terraform plan from the previous section for an example of when this doesn't work as intended (i.e. when any other field in node_config is set, the terraform client will apply a default of DEFAULT for the logging_variant field).

In practice, this bug makes the node_pool_defaults.node_config_defaults.logging_variant unusable, because one should always set a node_config.machine_type with enough CPU for a node pool that has a MAX_THROUGHPUT logging_variant. With this bug in place, customers should refrain from using node_pool_defaults.node_config_defaults.logging_variant and instead always specify the node_config.logging_variant they want in the google_container_node_pool.

Would it be possible to fix this? Would we remove this default from https://github.com/GoogleCloudPlatform/magic-modules/blob/f319c33c2aea0610c27a4cefb51f499ac91d8ce5/mmv1/third_party/terraform/utils/node_config.go.erb#L33 ? Or would this be considered backward incompatible? Note that if nothing is specified for logging_variant in either the node pool defaults or in the node pool, the server will default the logging variant of the node pool to DEFAULT (but maybe not explicitly).

Important Factoids

References

@edwardmedia
Copy link
Contributor

edwardmedia commented Feb 10, 2023

@giuliano-sider I found below words under node_pool_defaults. Do you see a different behavior from what it says?

These settings (under node_pool_defaults) are overridden if specified on the specific NodePool object.

I noticed your below words.

This is undesirable because the cluster resource specifies a default value in its node_pool_defaults.node_config_defaults that ought to be respected.

Help me to understand where the bug is. If you propose a new behavior, maybe file an enhancement instead?

@giuliano-sider
Copy link
Author

giuliano-sider commented Feb 14, 2023

To sum it up, the bug is that the google_container_cluster resource's node_pool_defaults.node_config_defaults.logging_variant field isn't being respected because of the fact that the google_container_node_pool's node_config.logging_variant has a Terraform default value. This means that if you specify any other field in a google_container_node_pool's node_config (e.g. machine_type), the logging_variant field will assume a Terraform (that is, a client-side) default value, and the field node_pool_defaults.node_config_defaults.logging_variant will be perfectly useless.

I think my mistake in the implementation (see GoogleCloudPlatform/magic-modules#6744 and https://github.com/GoogleCloudPlatform/magic-modules/blob/f319c33c2aea0610c27a4cefb51f499ac91d8ce5/mmv1/third_party/terraform/utils/node_config.go.erb#L33) was to specify a Terraform default for this field.

@giuliano-sider
Copy link
Author

Documentation in https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#node_pool_defaults says:

The node_config_defaults block supports:

[logging_variant](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_cluster#logging_variant) (Optional) The type of logging agent that is deployed by default for newly created node pools in the cluster. Valid values include DEFAULT and MAX_THROUGHPUT. See [Increasing logging agent throughput](https://cloud.google.com/stackdriver/docs/solutions/gke/managing-logs#throughput) for more information.

but the value of node_pool_defaults.node_config_defaults.logging_variant, as I explained above, is not respected.

@edwardmedia
Copy link
Contributor

@giuliano-sider I see what you mean. Since you are the original owner of that feature, assign this to you then.

@rileykarson rileykarson added breaking-change persistent-bug Hard to diagnose or long lived bugs for which resolutions are more like feature work than bug work and removed bug labels Mar 16, 2023
@rileykarson rileykarson added this to the 5.0.0 Under Consideration milestone Mar 16, 2023
@rileykarson
Copy link
Collaborator

Terraform's not great at distinguishing between unset and default values. I'd agree with your initial assumption, @giuliano-sider, that this would be a breaking change to fix given the clientside default. This shouldn't be too hard to change in 5.0.0.

@c2thorn
Copy link
Collaborator

c2thorn commented Sep 26, 2023

closed with GoogleCloudPlatform/magic-modules#8967

@c2thorn c2thorn closed this as completed Sep 26, 2023
@c2thorn
Copy link
Collaborator

c2thorn commented Sep 29, 2023

Reopening as I don't think removing the Terraform default of node_config.logging_variant actually enables the node_pool_defaults.node_config_defaults.logging_variant to take effect.

After removing the default, I tried the provided configuration. The nodepool POST request was sent with the following body:

 {
  "nodePool": {
   "config": {
    "machineType": "e2-standard-8"
   },
   "initialNodeCount": 1,
   "name": "example-node-pool-with-htl-3",
   "networkConfig": {}
  }
 }

This looks correct as we are no longer sending a field for logging variant. However, when the nodepool is created, the API does not return a value for logging variant, thus Terraform interprets it as DEFAULT still.

@giuliano-sider are you able to confirm via the API that creating a nodepool while node_pool_defaults.node_config_defaults.logging_variant is set in the parent cluster will correctly inherit the logging variant default?

@c2thorn c2thorn reopened this Sep 29, 2023
@c2thorn c2thorn modified the milestones: 5.0.0, Near-Term Goals Sep 29, 2023
@c2thorn
Copy link
Collaborator

c2thorn commented Sep 29, 2023

Updating the labels/milestone as the "breaking-change" portion is still going into 5.0.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
persistent-bug Hard to diagnose or long lived bugs for which resolutions are more like feature work than bug work service/container service/terraform
Projects
None yet
Development

No branches or pull requests

5 participants