Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

azurerm_network_security_group_association - prevent deadlock between association and network interface creation #4501

Merged
merged 3 commits into from
Oct 5, 2019

Conversation

nitzanm
Copy link
Contributor

@nitzanm nitzanm commented Oct 3, 2019

I believe we have found a bug in the Azure Resource Manager provider for Terraform, having to do with the azurerm_subnet_network_security_group_association resource and its interaction with azurerm_network_interface.

The issue is that the azurerm_subnet_network_security_group_association resource, when being created, locks the following resources, in the following order:

  1. Network security group
  2. Virtual network
  3. Subnet

However, the azurerm_network_interface resource, when being created, locks the following resources, in the following order:

  1. Network security group
  2. Subnet
  3. Virtual network

You will notice that both lock the virtual network and subnet, but in the opposite order. This means that if both resources happen to be created at the same time in two threads, the following could happen:

  1. The association successfully locks the virtual network.
  2. The network interface successfully locks the subnet.
  3. The association now tries to lock the subnet, but it can't, because the network interface already locked it.
  4. The network interface now tries to lock the virtual network, but it can't, because the association already locked it.

In this situation, the two resources will wait indefinitely for each other until Terraform is terminated - a deadlock. Our trigger for finding this bug was that Terraform would be trying to create interfaces and associations forever. Of course, this only happens intermittently, because it depends on the exact order above happening. For example (resource names redacted):

azurerm_network_interface.interface1: Still creating... [58m30s elapsed]
azurerm_network_interface.interface2: Still creating... [58m30s elapsed]
azurerm_network_interface.interface3: Still creating... [58m30s elapsed]
azurerm_subnet_network_security_group_association.association1: Still creating... [58m20s elapsed]

To fix this, we simply need to lock the three resources in the same order across both resources. As a workaround, we are currently having all network interfaces depend on our NSG association, since that will ensure both are never created at the same time.

As an aside, is there any way to review all locks across the provider, to ensure this situation doesn't happen elsewhere? Perhaps there could be a policy to always lock resources in alphabetical order so that in the future this doesn't happen again?

…_interface, caused by them locking the same resources, but in a different order.
@ghost ghost added the size/XS label Oct 3, 2019
Copy link
Collaborator

@katbyte katbyte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fantastic spot @nitzanm!

Thank you very much for this fix, LGTM and I'll review all other lock orders in the provider now.

@katbyte katbyte changed the title Fix deadlock between NSG association and network interface creation azurerm_network_security_group_association - prevent deadlock between association and network interface creation Oct 5, 2019
@katbyte katbyte merged commit 4c93e42 into hashicorp:master Oct 5, 2019
katbyte added a commit that referenced this pull request Oct 5, 2019
abhinavdahiya added a commit to abhinavdahiya/installer that referenced this pull request Oct 8, 2019
…ndle private_dns zone

Using the upstream azurerm provider is not possible for now because of following reasons:

1) There is not srv record resource for private dns zone

2) The version of provider that has the private dns zone resources `1.34.0` has a lot of bugs like
    * hashicorp/terraform-provider-azurerm#4452
    * hashicorp/terraform-provider-azurerm#4453
    * hashicorp/terraform-provider-azurerm#4501
    Some of these bugs are fixed, and some are in flight.

    Another reason moving to `1.36.0` which might have all the fixes we need is the provider has moved to using
    `standalone terraform plugin SDK v1.1.1` [1]. Because we vendor both terraform and providers, this causes errors like
    `panic: gob: registering duplicate types for "github.com/zclconf/go-cty/cty.primitiveType": cty.primitiveType != cty.primitiveType`

   Therefore, we would have to move towards a single vendor for terraform and plugins for correct inter-operation, which is tricker due to conflicts elsewhere

A simple 4 resource plugin that re-uses the already vendored azurerm provider as library and carries over the required resources seems like an easy fix for now.

[1]: hashicorp/terraform-provider-azurerm#4474
@ghost
Copy link

ghost commented Oct 29, 2019

This has been released in version 1.36.0 of the provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. As an example:

provider "azurerm" {
    version = "~> 1.36.0"
}
# ... other configuration ...

@ghost
Copy link

ghost commented Nov 4, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

@ghost ghost locked and limited conversation to collaborators Nov 4, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants