
node_pools within same subnet are created sequentially #26933

Closed
aescrob opened this issue Aug 5, 2024 · 4 comments · Fixed by #27583
Labels
enhancement service/kubernetes-cluster upstream/microsoft/waiting-on-service-team This label is applicable when waiting on the Microsoft Service Team v/3.x

Comments

@aescrob

aescrob commented Aug 5, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave comments along the lines of "+1", "me too" or "any updates", they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment and review the contribution guide to help.

Terraform Version

1.5.7

AzureRM Provider Version

3.114.0

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster_node_pool

Terraform Configuration Files

resource "azurerm_kubernetes_cluster_node_pool" "nodepool1" {
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks_cluster.id
  name                  = "az1${local.nodepool_name_suffix}"
  vm_size               = local.nodepool_immutable_config.vm_size
  vnet_subnet_id        = local.nodepool_immutable_config.subnet_id
  zones                 = ["1"]
  enable_auto_scaling   = true
  min_count             = var.agents_min_count
  max_count             = var.agents_max_count
  node_count            = var.agents_node_count
  orchestrator_version  = var.cluster_version
  snapshot_id           = local.nodepool_immutable_config.nodepool_snapshot_id
  kubelet_config {
    container_log_max_size_mb = local.kubelet_config.container_log_max_size_mb
  }

  upgrade_settings {
    max_surge = var.upgrade_settings_max_surge
  }

  lifecycle {
    create_before_destroy = true
    ignore_changes = [
      node_count, # managed by cluster autoscaler
    ]

  }
  depends_on = [azurerm_kubernetes_cluster.aks_cluster]
}

resource "azurerm_kubernetes_cluster_node_pool" "nodepool2" {
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks_cluster.id
  name                  = "az2${local.nodepool_name_suffix}"
  vm_size               = local.nodepool_immutable_config.vm_size
  vnet_subnet_id        = local.nodepool_immutable_config.subnet_id
  zones                 = ["2"]
  enable_auto_scaling   = true
  min_count             = var.agents_min_count
  max_count             = var.agents_max_count
  node_count            = var.agents_node_count
  orchestrator_version  = var.cluster_version
  snapshot_id           = local.nodepool_immutable_config.nodepool_snapshot_id
  kubelet_config {
    container_log_max_size_mb = local.kubelet_config.container_log_max_size_mb
  }
  upgrade_settings {
    max_surge = var.upgrade_settings_max_surge
  }

  lifecycle {
    create_before_destroy = true
    ignore_changes = [
      node_count, # managed by cluster autoscaler
    ]

  }
  depends_on = [azurerm_kubernetes_cluster.aks_cluster]
}

resource "azurerm_kubernetes_cluster_node_pool" "nodepool3" {
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks_cluster.id
  name                  = "az3${local.nodepool_name_suffix}"
  vm_size               = local.nodepool_immutable_config.vm_size
  vnet_subnet_id        = local.nodepool_immutable_config.subnet_id
  zones                 = ["3"]
  enable_auto_scaling   = true
  min_count             = var.agents_min_count
  max_count             = var.agents_max_count
  node_count            = var.agents_node_count
  orchestrator_version  = var.cluster_version
  snapshot_id           = local.nodepool_immutable_config.nodepool_snapshot_id
  kubelet_config {
    container_log_max_size_mb = local.kubelet_config.container_log_max_size_mb
  }
  upgrade_settings {
    max_surge = var.upgrade_settings_max_surge
  }

  lifecycle {
    create_before_destroy = true
    ignore_changes = [
      node_count, # managed by cluster autoscaler
    ]

  }
  depends_on = [azurerm_kubernetes_cluster.aks_cluster]
}

#----------------------------
# Dedicated node pool for ingress nginx
#----------------------------
resource "azurerm_kubernetes_cluster_node_pool" "nodepool_ingress_nginx" {
  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks_cluster.id
  name                  = "ingress${substr(local.nodepool_name_suffix, 0, 5)}"
  vm_size               = local.nodepool_immutable_config.vm_size
  vnet_subnet_id        = local.nodepool_immutable_config.subnet_id
  zones                 = ["1", "2", "3"]
  enable_auto_scaling   = false
  node_count            = local.stages[local.cluster_stage].ingress_count
  orchestrator_version  = var.cluster_version
  snapshot_id           = local.nodepool_immutable_config.nodepool_snapshot_id
  kubelet_config {
    container_log_max_size_mb = local.kubelet_config.container_log_max_size_mb
  }

  # Label (for node selector) and taint to only schedule the ingress on this node pool
  node_labels = {
    "kubernetes.post.ch/ingress-node" = "true"
  }
  node_taints = [
    "kubernetes.post.ch/ingress-node=true:NoSchedule"
  ]

  upgrade_settings {
    max_surge = "50%"
  }

  lifecycle {
    create_before_destroy = true
  }
  depends_on = [azurerm_kubernetes_cluster.aks_cluster]
}
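Editor's note: the three zonal pools above differ only in their zone and name. A sketch (not part of the original report) of how they could be collapsed with `for_each`, assuming the same locals and variables as above; note that `for_each` does not change creation ordering, since the provider's per-subnet lock still serializes the pools:

```hcl
# Sketch: collapse the three identical zonal pools with for_each.
# The per-subnet lock in the provider still serializes their creation.
resource "azurerm_kubernetes_cluster_node_pool" "zonal" {
  for_each = toset(["1", "2", "3"])

  kubernetes_cluster_id = azurerm_kubernetes_cluster.aks_cluster.id
  name                  = "az${each.key}${local.nodepool_name_suffix}"
  vm_size               = local.nodepool_immutable_config.vm_size
  vnet_subnet_id        = local.nodepool_immutable_config.subnet_id
  zones                 = [each.key]
  enable_auto_scaling   = true
  min_count             = var.agents_min_count
  max_count             = var.agents_max_count
  node_count            = var.agents_node_count
  orchestrator_version  = var.cluster_version
  snapshot_id           = local.nodepool_immutable_config.nodepool_snapshot_id

  kubelet_config {
    container_log_max_size_mb = local.kubelet_config.container_log_max_size_mb
  }

  upgrade_settings {
    max_surge = var.upgrade_settings_max_surge
  }

  lifecycle {
    create_before_destroy = true
    ignore_changes        = [node_count] # managed by cluster autoscaler
  }
}
```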

Debug Output/Panic Output

TestDeploy/Terraform_Init&Apply 2024-08-05T08:56:05+02:00 logger.go:66: module.aks_post.azurerm_kubernetes_cluster_node_pool.nodepool3: Creating...
TestDeploy/Terraform_Init&Apply 2024-08-05T08:56:05+02:00 logger.go:66: module.aks_post.azurerm_kubernetes_cluster_node_pool.nodepool1: Creating...
TestDeploy/Terraform_Init&Apply 2024-08-05T08:56:05+02:00 logger.go:66: module.aks_post.azurerm_kubernetes_cluster_node_pool.nodepool2: Creating...
TestDeploy/Terraform_Init&Apply 2024-08-05T08:56:05+02:00 logger.go:66: module.aks_post.azurerm_kubernetes_cluster_node_pool.nodepool_ingress_nginx: Creating...

:

TestDeploy/Terraform_Init&Apply 2024-08-05T09:03:15+02:00 logger.go:66: module.aks_post.azurerm_kubernetes_cluster_node_pool.nodepool1: Creation complete after 7m10s [id=/subscriptions/****/resourceGroups/rg-aks-ci-m98ln0012-j6l58k/providers/Microsoft.ContainerService/managedClusters/aks-ci-m98ln0012-j6l58k/agentPools/az1e7897d581]

TestDeploy/Terraform_Init&Apply 2024-08-05T09:10:05+02:00 logger.go:66: module.aks_post.azurerm_kubernetes_cluster_node_pool.nodepool3: Creation complete after 14m0s [id=/subscriptions/****/resourceGroups/rg-aks-ci-m98ln0012-j6l58k/providers/Microsoft.ContainerService/managedClusters/aks-ci-m98ln0012-j6l58k/agentPools/az3e7897d581]

TestDeploy/Terraform_Init&Apply 2024-08-05T09:17:18+02:00 logger.go:66: module.aks_post.azurerm_kubernetes_cluster_node_pool.nodepool_ingress_nginx: Creation complete after 21m13s [id=/subscriptions/****/resourceGroups/rg-aks-ci-m98ln0012-j6l58k/providers/Microsoft.ContainerService/managedClusters/aks-ci-m98ln0012-j6l58k/agentPools/ingresse7897]

TestDeploy/Terraform_Init&Apply 2024-08-05T09:24:30+02:00 logger.go:66: module.aks_post.azurerm_kubernetes_cluster_node_pool.nodepool2: Creation complete after 28m24s [id=/subscriptions/****/resourceGroups/rg-aks-ci-m98ln0012-j6l58k/providers/Microsoft.ContainerService/managedClusters/aks-ci-m98ln0012-j6l58k/agentPools/az2e7897d581]

Expected Behaviour

Node pools are created in parallel, finishing with nearly the same 'Creation complete after …' timestamp/duration.

Actual Behaviour

All three node pools start 'Creating...' at the same time but are processed sequentially, causing a longer execution time for our various tests.

Steps to Reproduce

na

Important Factoids

No response

References

Possibly caused by the locks on the subnet in kubernetes_cluster_node_pool_resource.go, introduced with kubernetes_cluster_node_pool: Fix race condition with virtual network status when creating node pool #25888

@github-actions github-actions bot added the v/3.x label Aug 5, 2024
@zioproto
Contributor

zioproto commented Aug 5, 2024

Cc: @lonegunmanb @ms-henglu @stephybun

@aescrob
Author

aescrob commented Aug 6, 2024

Hi @ms-henglu - thank you for your PR #26939 - @zioproto fyi

We use the same vnet_subnet_id for all our node pools.
I doubt that this change will allow them to be built in parallel, since it still uses locks.ByID(subnetID.ID()), which is obviously identical for all node pools.
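Editor's note: since the provider lock is keyed by the subnet ID, one possible workaround is to give each node pool its own subnet so the lock keys differ. A sketch only; the subnet resources, names, and `var.aks_cidr` below are hypothetical and not part of this issue:

```hcl
# Sketch: per-zone subnets so each node pool acquires a different
# locks.ByID key. All resource names and variables here are hypothetical.
resource "azurerm_subnet" "nodepool" {
  for_each             = toset(["1", "2", "3"])
  name                 = "snet-aks-az${each.key}"
  resource_group_name  = azurerm_resource_group.aks.name
  virtual_network_name = azurerm_virtual_network.aks.name
  address_prefixes     = [cidrsubnet(var.aks_cidr, 2, tonumber(each.key) - 1)]
}

# Each pool then references its own subnet, e.g.:
#   vnet_subnet_id = azurerm_subnet.nodepool["1"].id
```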

@rcskosir
Contributor

Adding the link to the upstream issue here: Azure/AKS#4522

@rcskosir rcskosir added upstream/microsoft/waiting-on-service-team This label is applicable when waiting on the Microsoft Service Team and removed question labels Sep 25, 2024
@stephybun stephybun linked a pull request Oct 15, 2024 that will close this issue
14 tasks

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 16, 2024