
intra-egress firewall rule depends on google_container_cluster.primary #1124

Closed · tomasgareau opened this issue Jan 12, 2022 · 8 comments
Labels: bug (Something isn't working), P4 (low priority issues), triaged (Scoped and ready for work)

@tomasgareau (Contributor) commented Jan 12, 2022

Summary

The intra-cluster egress firewall rule appears to depend on google_container_cluster.primary. This means it will not be created until after the default node pool is created and deleted. If this node pool requires the firewall rule in order to report back to the control plane (as is the case in a private cluster in a VPC network with a default-deny-all rule), the default node pool can never get created.

Context

I'm deploying a safer-cluster in a VPC network that enforces a default deny-all rule. Exceptions are granted through the use of network tags.

The safer-cluster module includes an add_cluster_firewall_rules variable that adds firewall rules for this use-case (in particular, one allowing intra-cluster egress, which is required for node pools to communicate back to the control plane).

Through #1118 & #1123 I added the gke-${var.name} network tag to the default pool so that this firewall rule would apply to it.

Expected behaviour

I would now expect the following sequence of actions:

  • the intra-cluster egress firewall rule is created
  • the google_container_cluster.primary resource is created
  • the default node pool gets created with network tags marking it as a target of the previously created firewall rule (I'm running the branch from #1123, which adds the gke-${var.name} tag to the default pool)
  • the default node pool reports back to the control plane
  • the default node pool gets deleted
  • Terraform continues creating my custom node pool

Actual behaviour

  • the google_container_cluster.primary resource starts being created
  • the intra-cluster egress firewall rule is waiting on the google_container_cluster.primary resource to be created
  • the default node pool is spun up but there is no firewall rule allowing it to report back to the control plane
  • GKE waits for the default node pool to check in and eventually times out

Suspected root cause

fff0078 added TPU support to beta clusters. Part of this change included adding google_container_cluster.primary.tpu_ipv4_cidr_block to the destination_ranges of the gke-intra-egress firewall rule:

```
{% if beta_cluster %}
  destination_ranges = compact([
    local.cluster_endpoint_for_nodes,
    local.cluster_subnet_cidr,
    local.cluster_alias_ranges_cidr[var.ip_range_pods],
    google_container_cluster.primary.tpu_ipv4_cidr_block,
  ])
```

I suspect this creates a Terraform dependency on the google_container_cluster.primary resource. In my case, I have not enabled TPU (and have no need for it). I reverted fff0078 on my test branch and was able to bring up my safer-cluster instance successfully (i.e., the intra-cluster egress firewall rule was created before the default pool).
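To make the mechanism concrete, here is a minimal, self-contained sketch (the resource name, network, tags, and CIDR are placeholders, not the module's actual code): referencing any attribute of google_container_cluster.primary in the rule's arguments is enough for Terraform to order the rule after the cluster.

```hcl
# Illustrative sketch only -- resource name, network, tags, and ranges are placeholders.
resource "google_compute_firewall" "intra_egress_example" {
  name      = "gke-example-intra-egress"
  network   = "example-network"
  direction = "EGRESS"

  destination_ranges = compact([
    "10.0.0.0/20",
    # This single attribute reference creates an implicit dependency on the
    # cluster, so Terraform waits for it even when TPUs are disabled and the
    # value ends up empty.
    google_container_cluster.primary.tpu_ipv4_cidr_block,
  ])

  allow {
    protocol = "all"
  }

  target_tags = ["gke-example"]
}
```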

Potential fix

This could maybe be addressed by creating two firewall rules -- one that can be created before the cluster (without the TPU IPv4 block) and one that is created after the cluster (for the TPU IPv4 block). Don't know if this is zooming in too narrowly on my use-case, however (-:
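For illustration, a rough sketch of what that split could look like (a sketch under assumed names: the network reference, the enable_tpu guard, and the tags are placeholders, not the module's templated code):

```hcl
# Sketch only: names, network, tags, and the enable_tpu guard are assumptions.

# Rule for the cluster overall: no reference to the cluster resource,
# so it can be created before the cluster and its default node pool.
resource "google_compute_firewall" "intra_egress" {
  name      = "gke-example-intra-egress"
  network   = "example-network"
  direction = "EGRESS"

  destination_ranges = compact([
    local.cluster_endpoint_for_nodes,
    local.cluster_subnet_cidr,
    local.cluster_alias_ranges_cidr[var.ip_range_pods],
  ])

  allow {
    protocol = "all"
  }

  target_tags = ["gke-example"]
}

# TPU-specific rule: only this one references the cluster, so only this one
# waits for google_container_cluster.primary to finish creating.
resource "google_compute_firewall" "intra_egress_tpu" {
  count = var.enable_tpu ? 1 : 0

  name      = "gke-example-intra-egress-tpu"
  network   = "example-network"
  direction = "EGRESS"

  destination_ranges = [google_container_cluster.primary.tpu_ipv4_cidr_block]

  allow {
    protocol = "all"
  }

  target_tags = ["gke-example"]
}
```

With the TPU range isolated in its own resource, the base rule no longer carries an implicit dependency on the cluster, which restores the ordering described under "Expected behaviour" above.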

@morgante (Contributor) commented:

Honestly, this seems like a bit of an edge case. Since you already have a way to grant the correct network tag, shouldn't your existing firewall rule be able to allow the traffic?

@tomasgareau (Contributor, Author) commented:

My problem is that the firewall rule is never getting created -- there's a Terraform dependency causing it to wait for the google_container_cluster.primary resource. The google_container_cluster.primary resource, however, needs the firewall rule in order to finish creating, so my apply operation blocks until GKE times out.

This comment from the firewall.tf.tmpl seems to suggest that this is the intended use case for add_cluster_firewall_rules (creating a cluster in a VPC network with a default deny-all rule). I'm not 100% sure if/where my use-case is different from expected.

That said -- I can certainly copy these firewall rules and create them myself (without the implicit dependency on the google_container_cluster.primary resource) but it feels a bit strange to be duplicating this functionality from the terraform-google-kubernetes-engine modules.

@morgante added the bug, P4, and triaged labels on Jan 12, 2022
@morgante (Contributor) commented Jan 12, 2022

Hmm, the firewall rules were originally meant as "shadow" rules to enable logging. #741 has more context.

@rux616 Since you added the TPU support in #810, can you confirm this current setup is actually working for you? I'm actually unsure why it would.

> This could maybe be addressed by creating two firewall rules -- one that can be created before the cluster (without the TPU IPv4 block) and one that is created after the cluster (for the TPU IPv4 block). Don't know if this is zooming in too narrowly on my use-case, however (-:

We could probably split into a rule for the cluster overall, then a rule specific to just TPUs.

@tomasgareau (Contributor, Author) commented:

I think the intra-egress firewall rule was introduced in #470 (git blame pointed me to 16bdd6e) -- not sure that it's necessarily related to the "shadow" rules.

If the rule split would work I'd be happy to take a crack at a PR tomorrow!

@rux616 (Contributor) commented Jan 13, 2022

> @rux616 Since you added the TPU support in #810, can you confirm this current setup is actually working for you? I'm actually unsure why it would.

I'm currently on holiday, but will dig into this when I get back next week. Off the top of my head, though, I believe we haven't had to touch the cluster definition code on our end in a while, as it has been working for us.

@legal90 (Contributor) commented Dec 21, 2023

Since #1126 is merged (thanks to @tomasgareau!), this issue can be closed as fixed.

@DrFaust92 (Contributor) commented:

@apeabody ^ can we close this one?

@apeabody (Contributor) commented:

Completed in #1126
