
Chef provisioner hangs when provisioning more than one resource #22722

Closed
mcascone opened this issue Sep 6, 2019 · 10 comments
Labels
bug needs-maintainer This code is currently unmaintained. Please submit a PR against our CODEOWNERS to volunteer. provisioner/chef v0.12 Issues (primarily bugs) reported against v0.12 releases

Comments

mcascone commented Sep 6, 2019

Terraform Version

Terraform v0.12.8
+ provider.null v2.1.2
+ provider.vra7 v0.4.1

Terraform Configuration Files

  • Provisioner inside the resource:
resource "vra7_deployment" "vra-vm" {
 ...
  resource_configuration = {
    "vSphere_Machine_1.name" = ""
    "vSphere_Machine_1.ip_address" = ""
    "vSphere_Machine_1.description" = "Terraform ICE SQL"
  }
  ...

  provisioner "chef" {
    # This is for TF to talk to the new node
    connection {
      host = self.resource_configuration["vSphere_Machine_1.ip_address"]
      type = "winrm"
      user = var.KT_USER
      password = var.KT_PASS
      insecure = true
    }
  
    # This is for TF to talk to the chef_server
    # Note! the version constraint doesn't work
    server_url = var.chef_server_url
    node_name  = "ICE-SQL-${self.resource_configuration["vSphere_Machine_1.name"]}"
    run_list   = var.sql_run_list
    recreate_client = true
    environment = "_default"
    ssl_verify_mode = ":verify_none"
    version = "~> 12"
    user_name  = local.username
    user_key   = file("${local.user_key_path}")
  }
}
  • Provisioner using null_resource block:
resource "vra7_deployment" "ICE-SQL" {
  count = var.sql_count # will be 1/on or 0/off
  ...
  resource_configuration = {
    "vSphere_Machine_1.name" = ""
    "vSphere_Machine_1.ip_address" = ""
    "vSphere_Machine_1.description" = "Terraform ICE SQL"
  }
}

locals {
    sql_ip   = vra7_deployment.ICE-SQL[0].resource_configuration["vSphere_Machine_1.ip_address"]
    sql_name = vra7_deployment.ICE-SQL[0].resource_configuration["vSphere_Machine_1.name"]
  }

resource "null_resource" "sql-chef" { 
  # we can use count to switch creating this on or off for testing
  count = 1

  provisioner "chef" {
    # This is for TF to talk to the new node
    connection {
      host = local.sql_ip
      type = "winrm"
      user = var.KT_USER
      password = var.KT_PASS
      insecure = true
    }
  
    # This is for TF to talk to the chef_server
    # Don't use the local var here, so TF knows to create the dependency
    server_url = var.chef_server_url
    node_name  = "ICE-SQL-${vra7_deployment.ICE-SQL[0].resource_configuration["vSphere_Machine_1.name"]}"
    run_list   = var.sql_run_list
    recreate_client = true
    environment = "_default"
    ssl_verify_mode = ":verify_none"
    version = "12"
    user_name  = local.username
    user_key   = file("${local.user_key_path}")
    client_options = var.chef_client_options
  }
}
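
As a side note on the dependency comment in the snippet above: a hypothetical variant (not from the original report) could use an explicit `depends_on` instead, so the locals can be used everywhere while the ordering stays intact:

```hcl
# Sketch only: the explicit depends_on replaces the implicit dependency
# that the interpolation in node_name created above.
resource "null_resource" "sql-chef" {
  count      = 1
  depends_on = [vra7_deployment.ICE-SQL]

  provisioner "chef" {
    connection {
      host     = local.sql_ip
      type     = "winrm"
      user     = var.KT_USER
      password = var.KT_PASS
      insecure = true
    }

    server_url = var.chef_server_url
    node_name  = "ICE-SQL-${local.sql_name}"
    run_list   = var.sql_run_list
  }
}
```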
  • Provisioner inside a module:
### main.tf
module "SQL" {
  source   = "./modules/vra-chef"
  VRA_USER = var.VRA_USER
  VRA_PASS = var.VRA_PASS
  KT_USER  = var.KT_USER
  KT_PASS  = var.KT_PASS

  description = "ICE SQL"
  run_list    = var.sql_run_list
}

### modules/vra-chef/main.tf
resource "vra7_deployment" "vra-chef" {
  count = var.server_count
...
  resource_configuration = {
    "vSphere_Machine_1.name"       = var.resource_name
    "vSphere_Machine_1.ip_address"  = var.resource_ip
    "vSphere_Machine_1.description" = "${var.description}-${count.index}"
  }

  provisioner "chef" {
    # This is for TF to talk to the new node
    connection {
      host = self.resource_configuration["vSphere_Machine_1.ip_address"]
      type = "winrm"
      user = var.KT_USER
      password = var.KT_PASS
      insecure = true
    }
  
    # This is for TF to talk to the chef_server
    server_url = var.chef_server_url
    node_name  = self.resource_configuration["vSphere_Machine_1.name"]
    run_list   = var.run_list
    recreate_client = true
    environment = "_default"
    ssl_verify_mode = ":verify_none"
    version = "~> 12"
    user_name  = local.username
    user_key   = file(local.user_key_path)
    client_options = [ "chef_license  'accept'" ]

    # pass custom attributes to the new node
    attributes_json = var.input_json
  }
}

Debug Output

https://gist.github.com/mcascone/41d1514b05ebd675a700f00ce5948995
NOTE: The REMOTE machine provisions just fine. The chef provisioner dies on the MASTER machine at line 9701 of the gist.

Crash Output

Expected Behavior

Chef provisions all machines in the apply run.

Actual Behavior

It doesn't even actually "fail", it just stops responding:

vra7_deployment.ICE-SQL[0]: Creating...
vra7_deployment.ICE-MASTER[0]: Creating...
vra7_deployment.ICE-REMOTE[0]: Creating...
...
vra7_deployment.ICE-REMOTE[0]: Still creating... [9m30s elapsed]
vra7_deployment.ICE-SQL[0]: Still creating... [9m30s elapsed]
vra7_deployment.ICE-MASTER[0]: Still creating... [9m30s elapsed]
vra7_deployment.ICE-MASTER[0]: Creation complete after 9m39s [id=feecf983-48d5-425e-b713-65a1a05fa3ba]
vra7_deployment.ICE-REMOTE[0]: Still creating... [9m40s elapsed]
vra7_deployment.ICE-SQL[0]: Still creating... [9m40s elapsed]
...
vra7_deployment.ICE-SQL[0]: Still creating... [12m10s elapsed]
vra7_deployment.ICE-REMOTE[0]: Still creating... [12m10s elapsed]
vra7_deployment.ICE-REMOTE[0]: Creation complete after 12m11s [id=df64f5ab-af12-4493-8e7d-d7debd93780d]
vra7_deployment.ICE-SQL[0]: Still creating... [12m20s elapsed]
...
vra7_deployment.ICE-SQL[0]: Still creating... [13m10s elapsed]
vra7_deployment.ICE-SQL[0]: Creation complete after 13m11s [id=08ec31f4-124d-470e-b2ba-1833a6f22792]
null_resource.sql-chef[0]: Creating...
null_resource.master-chef[0]: Creating...
null_resource.remote-chef[0]: Creating...
null_resource.sql-chef[0]: Provisioning with 'chef'...
null_resource.master-chef[0]: Provisioning with 'chef'...
null_resource.remote-chef[0]: Provisioning with 'chef'...
null_resource.master-chef[0] (chef): Connecting to remote host via WinRM...
null_resource.master-chef[0] (chef):   Host: my-ip 
null_resource.master-chef[0] (chef):   Port: 5985
null_resource.master-chef[0] (chef):   User: my-user 
null_resource.master-chef[0] (chef):   Password: true
null_resource.master-chef[0] (chef):   HTTPS: false
null_resource.master-chef[0] (chef):   Insecure: true
null_resource.master-chef[0] (chef):   NTLM: false
null_resource.master-chef[0] (chef):   CACert: false
null_resource.sql-chef[0] (chef): Connecting to remote host via WinRM...
null_resource.sql-chef[0] (chef):   Host: my-ip 
null_resource.sql-chef[0] (chef):   Port: 5985
null_resource.sql-chef[0] (chef):   User: my-user
null_resource.sql-chef[0] (chef):   Password: true
null_resource.sql-chef[0] (chef):   HTTPS: false
null_resource.sql-chef[0] (chef):   Insecure: true
null_resource.sql-chef[0] (chef):   NTLM: false
null_resource.sql-chef[0] (chef):   CACert: false
null_resource.remote-chef[0] (chef): Connecting to remote host via WinRM...
null_resource.remote-chef[0] (chef):   Host: my-ip 
null_resource.remote-chef[0] (chef):   Port: 5985
null_resource.remote-chef[0] (chef):   User: my-user
null_resource.remote-chef[0] (chef):   Password: true
null_resource.remote-chef[0] (chef):   HTTPS: false
null_resource.remote-chef[0] (chef):   Insecure: true
null_resource.remote-chef[0] (chef):   NTLM: false
null_resource.remote-chef[0] (chef):   CACert: false
null_resource.sql-chef[0] (chef): Connected!
null_resource.remote-chef[0] (chef): Connected!
null_resource.master-chef[0] (chef): Connected!
null_resource.remote-chef[0] (chef): Downloading Chef Client...
null_resource.sql-chef[0] (chef): Downloading Chef Client...
null_resource.remote-chef[0] (chef): Installing Chef Client...
null_resource.sql-chef[0] (chef): Installing Chef Client...
null_resource.remote-chef[0]: Still creating... [10s elapsed]
null_resource.master-chef[0]: Still creating... [10s elapsed]
null_resource.sql-chef[0]: Still creating... [10s elapsed]
null_resource.sql-chef[0] (chef): Creating configuration files...
null_resource.remote-chef[0] (chef): Creating configuration files...
null_resource.master-chef[0] (chef): Downloading Chef Client...
null_resource.master-chef[0] (chef): Installing Chef Client...
null_resource.master-chef[0] (chef): Creating configuration files...
null_resource.remote-chef[0]: Still creating... [20s elapsed]
null_resource.master-chef[0]: Still creating... [20s elapsed]
null_resource.sql-chef[0]: Still creating... [20s elapsed]
null_resource.remote-chef[0]: Still creating... [30s elapsed]
null_resource.sql-chef[0]: Still creating... [30s elapsed]
null_resource.master-chef[0]: Still creating... [30s elapsed]
null_resource.remote-chef[0]: Still creating... [40s elapsed]
null_resource.sql-chef[0]: Still creating... [40s elapsed]
null_resource.master-chef[0]: Still creating... [40s elapsed]
null_resource.remote-chef[0]: Still creating... [50s elapsed]
null_resource.sql-chef[0]: Still creating... [50s elapsed]
null_resource.master-chef[0]: Still creating... [50s elapsed]
null_resource.remote-chef[0]: Still creating... [1m0s elapsed]
null_resource.sql-chef[0]: Still creating... [1m0s elapsed]
null_resource.master-chef[0]: Still creating... [1m0s elapsed]
...
loops waiting forever...
...

Steps to Reproduce

  1. terraform init
  2. terraform plan
  3. terraform apply

Additional Context

#21194

@hashibot hashibot added bug provisioner/chef v0.12 Issues (primarily bugs) reported against v0.12 releases labels Sep 6, 2019

mcascone commented Sep 9, 2019

I've been testing this some more. I split my resources into the vra7 deployments and the chef provisioning of them using null_resource, so I'm able to iterate a lot faster on the chef part, because I can re-provision a machine without having to spin up a new one.

What I've found is that it seems to not like chef-provisioning more than one machine at a time. So far I've found cases where 1 out of 4 machines will provision perfectly, and the others just hang after they all print the creating configuration files... status. Leaving the first one active, on the next run, the other three will all hang again at the same place. Finally, I tweaked the code to only re-provision one of the machines, and it worked perfectly. To be clear: the exact same code that hangs on a prior run will execute perfectly when run by itself. I think that's a critical clue to debugging this.

To reiterate: when it gets stuck, the chef provisioning always hangs at the creating configuration files... step. If it gets past that, it always works.

Here is a gist of a chef run using null_resource on two resources, both of which hang: https://gist.github.com/mcascone/0b71948f50d52648389e661d00c8e31c

And this is one of a successful, 1-resource run: https://gist.github.com/mcascone/858855b5bd9d5d1cf655d5e10df67801
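A minimal sketch of the split-out pattern described in this comment, assuming a hypothetical `rerun_token` variable: a `triggers` map on the null_resource lets the chef provisioner be re-run (by changing the trigger value, or via `terraform taint null_resource.sql-chef`) without recreating the VM.

```hcl
# Sketch only: re-run chef without rebuilding the VM.
# var.rerun_token is a hypothetical variable used purely to force recreation;
# the remaining names are taken from the snippets earlier in this issue.
resource "null_resource" "sql-chef" {
  triggers = {
    rerun = var.rerun_token # change this value to re-provision
  }

  provisioner "chef" {
    connection {
      host     = local.sql_ip
      type     = "winrm"
      user     = var.KT_USER
      password = var.KT_PASS
      insecure = true
    }

    server_url = var.chef_server_url
    node_name  = "ICE-SQL-${local.sql_name}"
    run_list   = var.sql_run_list
  }
}
```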

@mcascone (Author)

I keep thinking this is an issue with the same provisioner being called multiple times in the same main.tf file. I'm calling the chef provisioner 3+ times in one apply run. Could it be that the multiple instances of the provisioner are colliding with each other? Or is there no actual support for multiple runs of the same provisioner, so they all get instantiated in the same instance and corrupt each other?

@mcascone mcascone changed the title Chef provisioner hangs intermittently Chef provisioner hangs when provisioning more than one resource Sep 13, 2019
@mcascone (Author)

Some more testing. This time I tried using a single chef provisioner block with a count = 2 iterator:

resource "null_resource" "remote-chef" {
  count = var.remote_count # == 2

  provisioner "chef" {
    # This is for TF to talk to the new node
    connection {
      host = vra7_deployment.ICE-REMOTE[count.index].resource_configuration["vSphere_Machine_1.name"]
      type = "winrm"
      user = var.KT_USER
      password = var.KT_PASS
      insecure = true
    }
  
    # This is for TF to talk to the chef_server
    # use sql name to differentiate between environments
    server_url = var.chef_server_url
    node_name  = "ICE-REMOTE-${local.sql_num}-${count.index}"
    run_list   = var.remote_run_list
    recreate_client = true
    environment = "_default"
    ssl_verify_mode = ":verify_none"
    # version = "12"
    user_name  = local.username
    user_key   = file("${local.user_key_path}")
    client_options = var.chef_client_options # needed to accept the new license
    attributes_json = <<EOF
    {
      "ice": {
        "SQL_SERVER": "${local.sql_name}"
      }
    }
    EOF
  }
}

Same result. It hangs on creating configuration files... for both connections.

I did notice one new thing: the configuration files are indeed getting created on both target VMs, EXCEPT that one of them does NOT create the first-boot.json file.
Other than that, the client.rb is properly created, the .pem files are moved over, and the first-boot.json that does get created is correct.
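Given that single-resource runs reliably succeed, one possible workaround (my suggestion, not something tried in this thread) is to serialize the chef runs, either with `terraform apply -parallelism=1` or by chaining the null_resources with `depends_on` so only one provisioner executes at a time:

```hcl
# Sketch: force the chef runs to execute one at a time by chaining
# dependencies between the null_resources (names taken from the report).
resource "null_resource" "master-chef" {
  # ... chef provisioner as above ...
}

resource "null_resource" "remote-chef" {
  depends_on = [null_resource.master-chef]
  # ... chef provisioner as above ...
}

resource "null_resource" "sql-chef" {
  depends_on = [null_resource.remote-chef]
  # ... chef provisioner as above ...
}
```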

@mcascone (Author)

This appears to only affect v0.12. I reverted all the syntax and downgraded to v0.11.14, and it provisions in parallel like a charm. Something broke in the provisioners in v0.12.

@wardbryan

Any updates on this issue?


ndunn990 commented Apr 8, 2020

We were experiencing this issue as well when trying to build Windows VMs in VMware. We use Terragrunt along with Terraform, mostly against VMware with some Azure here and there. I noticed a possible workaround and I'm curious whether it works for others.

Environment

VSphere: 6.7.0.42000
Terraform: v0.12.24
provider.vsphere: 1.16.0
Provisioner: Chef

Summary

Basically, I just stopped using Terraform's count functionality and split my various servers into separate modules. I realize this is not ideal, but I was curious to see whether I would encounter the same issue. I was surprised to find that it seems to be a workaround, and possibly a helpful indicator of where the issue might lie.

Snippet of Original Configuration:

This would result in the issue everyone's describing

main.tf

module "WindowsServer" {
  source      = "../../../../modules/core-non-prod"
  servercount = 4
  # ...etc
}

module source

resource "vsphere_virtual_machine" "WindowsVM" {
  count = var.servercount
  name  = var.vmname
  # ...etc
}

Snippet of Alteration:

This seems to actually succeed as expected.

main.tf

module "WindowsServer1" {
  source = "../../../../modules/core-non-prod"
  vmname = "Server1"
  # ...etc
}
module "WindowsServer2" {
  source = "../../../../modules/core-non-prod"
  vmname = "Server2"
  # ...etc
}

module source

resource "vsphere_virtual_machine" "WindowsVM" {
  name = var.vmname
  # ...etc
}


mcascone commented Apr 9, 2020

@ndunn990, I agree with your hunch. I believe something gets lost when count > 1; somewhere an iterator is not iterating correctly, or at all, or something like that.

@pkolyvas pkolyvas added the needs-maintainer This code is currently unmaintained. Please submit a PR against our CODEOWNERS to volunteer. label Apr 14, 2020
@darrens280

FYI - Using Terraform 0.12.26 fixes it for me (with multiple file provisioners) as per #22006 (comment)

Hope it helps you
Cheers
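
If 0.12.26 does resolve it, the minimum Terraform version can be pinned in configuration so older binaries refuse to run (a standard `required_version` constraint; the exact version number is taken from the comment above):

```hcl
terraform {
  # Refuse to run on Terraform versions that predate the reported fix.
  required_version = ">= 0.12.26"
}
```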

@danieldreier (Contributor)

I'm closing this issue because it looks like it's already fixed, and we also announced tool-specific (vendor or 3rd-party) provisioner deprecation in mid-September 2020. Additionally, we added a deprecation notice for tool-specific provisioners in 0.13.4. On a practical level this means we will no longer be reviewing or merging PRs for built-in plugins like the chef provisioner.


ghost commented Nov 15, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked as resolved and limited conversation to collaborators Nov 15, 2020