v3.0.1-rc3: IP not found and agent_timeout not being used #1106

Open
arty-hlr opened this issue Sep 17, 2024 · 25 comments
Assignees
Labels
issue/confirmed, resource/qemu, type/bug

Comments

@arty-hlr

Contrary to the PR #1016 and to what the documentation https://registry.terraform.io/providers/telmate/proxmox/3.0.1-rc3/docs/resources/vm_qemu says, agent_timeout is not being used, which results in VMs being created and then the following error:
image

I do use agent_timeout in the module:
image

but it's not being shown in the plan output:
image

The only difference from the previous templates I tried to clone is that SSH is now installed. I am not sure why it cannot find the IP address, as the GUI shows it. I would expect agent_timeout to help, as the error suggests, but unfortunately it does not do anything. It also sometimes works on one VM or another, seemingly at random, so running terraform apply multiple times helps, kind of.

Here is the full output with TF_LOG=DEBUG:
image

To reproduce:

  • create a windows template with SSH server installed and clone it (linked clone) with terraform
  • create a VM from a template with terraform and use agent_timeout
@GMZwinge

GMZwinge commented Oct 2, 2024

Test without Cloud-init (os_type = "centos"):

  • 3.0.1-rc3 provides an IP address:
2024-10-02T10:56:03.512-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc3.exe: 2024/10/02 10:56:03 [DEBUG] VM is running, checking the IP: timestamp=2024-10-02T10:56:03.512-0400
2024-10-02T10:56:03.512-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc3.exe: 2024/10/02 10:56:03 [INFO][initConnInfo] trying to get vm ip address for provisioner: timestamp=2024-10-02T10:56:03.512-0400
2024-10-02T10:56:03.512-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc3.exe: 2024/10/02 10:56:03 [DEBUG][initConnInfo] retrying for at most  20m0s minutes before giving up: timestamp=2024-10-02T10:56:03.512-0400
2024-10-02T10:56:03.512-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc3.exe: 2024/10/02 10:56:03 [DEBUG][initConnInfo] retries will end at 2024-10-02 11:16:03.5122731 -0400 EDT m=+1224.259660501: timestamp=2024-10-02T10:56:03.512-0400
2024-10-02T10:56:06.525-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc3.exe: 2024/10/02 10:56:06 [INFO][getPrimaryIP] check ip result error 500 QEMU guest agent is not running: timestamp=2024-10-02T10:56:06.524-0400
2024-10-02T10:56:14.566-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc3.exe: 2024/10/02 10:56:14 [INFO][getPrimaryIP] check ip result error 500 QEMU guest agent is not running: timestamp=2024-10-02T10:56:14.565-0400
2024-10-02T10:56:19.712-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc3.exe: 2024/10/02 10:56:19 [INFO][getPrimaryIP] QEMU Agent interfaces found: [{00:00:00:00:00:00 [127.0.0.1 ::1] lo <nil>} {<MacAddress> [<Ipv6Addresses>] ens18 <nil>}]: timestamp=2024-10-02T10:56:19.712-0400
2024-10-02T10:56:24.922-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc3.exe: 2024/10/02 10:56:24 [INFO][getPrimaryIP] QEMU Agent interfaces found: [{00:00:00:00:00:00 [127.0.0.1 ::1] lo <nil>} {<MacAddress> [<Ipv4Address> <Ipv6Addresses>] ens18 <nil>}]: timestamp=2024-10-02T10:56:24.922-0400
2024-10-02T10:56:24.922-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc3.exe: 2024/10/02 10:56:24 [DEBUG][initConnInfo] this is the vm configuration: <Ipv4Address> 22: timestamp=2024-10-02T10:56:24.922-0400
  • 3.0.1-rc4 doesn't provide an IP address:
2024-10-02T10:53:14.311-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc4.exe: 2024/10/02 10:53:14 [DEBUG] VM is running, checking the IP: timestamp=2024-10-02T10:53:14.311-0400
2024-10-02T10:53:14.311-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc4.exe: 2024/10/02 10:53:14 [INFO][initConnInfo] trying to get vm ip address for provisioner: timestamp=2024-10-02T10:53:14.311-0400
2024-10-02T10:53:14.311-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc4.exe: 2024/10/02 10:53:14 [DEBUG][initConnInfo] retrying for at most  20m0s minutes before giving up: timestamp=2024-10-02T10:53:14.311-0400
2024-10-02T10:53:14.311-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc4.exe: 2024/10/02 10:53:14 [DEBUG][initConnInfo] retries will end at 2024-10-02 11:13:14.3111935 -0400 EDT m=+1221.822484201: timestamp=2024-10-02T10:53:14.311-0400
2024-10-02T10:53:14.311-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc4.exe: 2024/10/02 10:53:14 [INFO][getPrimaryIP] vm has a cloud-init configuration: timestamp=2024-10-02T10:53:14.311-0400
2024-10-02T10:53:14.311-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc4.exe: 2024/10/02 10:53:14 [DEBUG][initConnInfo] this is the vm configuration:  22: timestamp=2024-10-02T10:53:14.311-0400

Also, 3.0.1-rc4 displays these warnings: "Warning: Cloud-init is enabled but no IP config is set" and "Cloud-init is enabled in your configuration but no static IP address is set, nor is the DHCP option enabled." I was able to get rid of those warnings with define_connection_info = false, but it still doesn't provide an IP address:

2024-10-02T11:09:14.978-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc4.exe: 2024/10/02 11:09:14 [DEBUG] VM is running, checking the IP: timestamp=2024-10-02T11:09:14.978-0400
2024-10-02T11:09:14.978-0400 [INFO]  provider.terraform-provider-proxmox_v3.0.1-rc4.exe: 2024/10/02 11:09:14 [INFO][initConnInfo] define_connection_info is false, no further action: timestamp=2024-10-02T11:09:14.978-0400

Not sure if that's the correct way to NOT use Cloud-init though.

@Tinyblargon
Collaborator

Found the commit that most likely broke it.
Ironically it separated the logic so it would be easier to test.

18095e5#diff-104365919f693375882979581d9f36f5266991eef361948e56c220931fe886ddL1916

@Tinyblargon
Collaborator

@GMZwinge could you check if #1120 fixes your issue?

@Tinyblargon
Collaborator

@arty-hlr

agent_timeout is being used, it could be that endTime on L1970 is shorter, but by default this is 20 minutes.

for time.Now().Before(endTime) {
	var interfaces []pxapi.AgentNetworkInterface
	interfaces, err = vmr.GetAgentInformation(client, false)
	if err != nil {
		if !strings.Contains(err.Error(), ErrorGuestAgentNotRunning) {
			return primaryIPs{}, diag.FromErr(err)
		}
		log.Printf("[INFO][getPrimaryIP] check ip result error %s", err.Error())
		logger.Debug().Int("vmid", vmr.VmId()).Msgf("check ip result error %s", err.Error())
	} else { // vm is running and reachable
		if len(interfaces) > 0 { // agent returned some information
			log.Printf("[INFO][getPrimaryIP] QEMU Agent interfaces found: %v", interfaces)
			logger.Debug().Int("vmid", vmr.VmId()).Msgf("QEMU Agent interfaces found: %v", interfaces)
			conn = conn.parsePrimaryIPs(interfaces, primaryMacAddress)
			if conn.hasRequiredIP() {
				return conn.IPs, diag.Diagnostics{}
			}
		}
		if waitedTime > agentTimeout {
			break
		}
		waitedTime += additionalWait
	}
	time.Sleep(time.Duration(additionalWait) * time.Second)
}

However, the log you showed does show the information in its raw form.
I think we are parsing the MAC address wrong and therefore we can't match it with an interface returned by the guest-agent.

if _, ok := vmConfig["net"+strconv.Itoa(i)]; ok {
	primaryMacAddress = macAddressRegex.FindString(vmConfig["net"+strconv.Itoa(i)].(string))
	break
}

Upstream I'm working on re-implementing the network interfaces to get rid of this regex parsing of the MAC address on L1966.
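
As an illustration of the matching step described above, here is a minimal Go sketch of pulling a MAC address out of a netN config string and comparing it with one reported by the guest agent. The regex, the helper names (extractMAC, sameMAC), and the sample net0 value are all assumptions made for this example and are not taken from the provider's source. The case-insensitive comparison shows one way a mismatch could arise; the thread does not confirm that this is the actual cause.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// macAddressRegex matches six colon-separated hex octets in either letter case.
// This pattern is an assumption; the provider's own regex may differ.
var macAddressRegex = regexp.MustCompile(`(?i)([0-9a-f]{2}:){5}[0-9a-f]{2}`)

// extractMAC returns the first MAC address found in a netN config string,
// or "" if the string contains none.
func extractMAC(netConfig string) string {
	return macAddressRegex.FindString(netConfig)
}

// sameMAC compares two MAC addresses without regard to letter case.
// An empty address never matches.
func sameMAC(a, b string) bool {
	return a != "" && strings.EqualFold(a, b)
}

func main() {
	// Hypothetical values: the exact formats are assumptions for illustration.
	net0 := "virtio=BC:24:11:AA:BB:CC,bridge=vmbr0,tag=10"
	agentMAC := "bc:24:11:aa:bb:cc"

	mac := extractMAC(net0)
	fmt.Println("parsed MAC:", mac)
	fmt.Println("matches agent interface:", sameMAC(mac, agentMAC))
}
```

A plain == comparison between the two hypothetical strings would report no match, which is why the sketch uses strings.EqualFold.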

@Tinyblargon
Collaborator

@arty-hlr can you check if #1120 fixes your issue?

@arty-hlr
Author

@Tinyblargon How do I tell terraform/opentofu to use your branch? I only see source and version in the terraform provider config. Should I compile it, and copy the executable to the plugin directory like here?

@arty-hlr
Author

about agent_timeout: I did try to change its value, but it didn't make any difference, so I assumed it wasn't used at all as it wasn't in the logs.

@Tinyblargon
Collaborator

@arty-hlr when you compile the branch, the compiled binary has to be renamed to linux_amd64; then you can put it at .terraform/providers/terraform.local/local/proxmox/1.0.0/linux_amd64 inside your Terraform project. developer.md has more information about this.

@arty-hlr
Author

Hi @Tinyblargon, that didn't work, I had to change the .terraformrc to add a filesystem mirror, unfortunately that's not mentioned anywhere in the docs.

I just ran a test with your branch and unfortunately it still doesn't work, here's the relevant part of the output:

grafik

It seems that looping over terraform apply (which worked before) doesn't anymore either.

grafik

@Tinyblargon
Collaborator

@arty-hlr

Checking if I have the correct situation.

  • os_type == "cloud-init"
  • a cloud-init disk is configured
  • skip_ipv4 == true
  • skip_ipv6 == false

The error is telling me that ipconfig0 is not configured.

@arty-hlr
Author

@Tinyblargon The same happens with os_type as cloud-init or cloud-init not set at all, the above screenshots from the run were with cloud-init not set at all. There is no cloud-init disk configured, the VMs are just cloned from the template and get a dynamic IP address from the DHCP server, so I don't use ipconfig0 as there's no actual cloud-init.

I'm actually not sure why it says Cloud-init is enabled because it is not in the terraform config and the templates don't have a cloud-init drive.

grafik

@arty-hlr
Author

The code here https://github.com/Telmate/terraform-provider-proxmox/blob/master/proxmox/resource_vm_qemu.go#L1935 seems to indicate that this shouldn't happen when cloud-init is not set

Tinyblargon added a commit to Tinyblargon/terraform-provider-proxmox that referenced this issue Oct 31, 2024
@Tinyblargon
Collaborator

@arty-hlr made some changes and tested locally.
It now works as expected and reliably gets an IP address.
Did 2 tests: one VM with cloud-init, and one VM without cloud-init with pre-configured DHCP.

Tinyblargon added a commit to Tinyblargon/terraform-provider-proxmox that referenced this issue Nov 19, 2024
@Tinyblargon
Collaborator

@GMZwinge @arty-hlr could you test if the latest release solves your issue?

Tinyblargon added the type/bug, resource/qemu, and issue/confirmed labels and removed the issue/investigate label on Nov 27, 2024
@arty-hlr
Author

Sorry for the late answer, will do today and let you know

@arty-hlr
Author

@Tinyblargon Unfortunately same problem with the latest release (rc6) 😢 the provider couldn't get the IP of one of the VMs even though it was up and showed an IP on proxmox.

image

Here is the next test run. I was checking the VMs in Proxmox as they were created, and all 5 got an IP and displayed it, but somehow the provider couldn't find the IP for 3 of them:

image

@Tinyblargon
Collaborator

@arty-hlr could you set the agent_timeout really high and see if it has different results?
Does the vm have a net0 network interface? I believe it only checks for that one. The GUI does display the ip of the lowest net adapter.

@arty-hlr
Author

They have a net0 network interface, I had agent_timeout at 120, just set it to 3600, it made no difference unfortunately. The EX-1 VM was not the last to be created, had an IP in proxmox, but the provider couldn't get it.

image

image

This is the full terraform file for a windows VM we use, if that helps:

terraform {
  required_providers {
    proxmox = {
      source  = "telmate/proxmox"
      version = "3.0.1-rc6"
    }
  }
}

resource "proxmox_vm_qemu" "windows-vm" {
  name          = var.name
  target_node   = var.proxmox_host
  clone         = var.template
  full_clone    = false
  agent         = 1
  agent_timeout = 3600
  os_type       = "cloud-init"
  cores         = 4
  sockets       = 1
  cpu_type      = "host"
  bios          = "ovmf"
  memory        = 4096
  scsihw        = "virtio-scsi-pci"
  bootdisk      = "scsi0"
  pool          = var.pool
  skip_ipv6     = true

  disks {
    sata {
      sata0 {
        disk {
          size       = "128G"
          storage    = "loc_pve-xxx_r1"
          emulatessd = true
          discard    = true
        }
      }
    }
  }

  network {
    id     = 0
    model  = "virtio"
    bridge = var.nic_name
    tag    = var.vlan_num
  }
  lifecycle {
    ignore_changes = [
      network,
    ]
  }
}

@Tinyblargon
Collaborator

@arty-hlr Could you check if the debug logs show the following errors? It will show the raw data of the qemu-guest-agent.

https://github.com/Telmate/terraform-provider-proxmox/blame/7e667094714d35f66c8d45ef08400895e1ad6804/proxmox/resource_vm_qemu.go#L1679-L1680

@arty-hlr
Author

@Tinyblargon So, I think we're getting closer. This is what it looks like when the IP is found:
image

and this is when the IP is not found:
image

Here's the whole log file: https://gist.github.com/arty-hlr/6c25344a04e9fcfdebcbb02c5a1830f5

To me it looks like the qemu guest agent returns a temporary IP address (169.254.XXX.XXX), which is then not parsed correctly somehow. I'd suggest maybe waiting until an actual private IP is returned by the qemu guest agent?

@Tinyblargon
Collaborator

@arty-hlr 192.254.xxx.xxx is used by windows for local addresses when it didn't get dhcp yet?

@arty-hlr
Author

@Tinyblargon 169.254.xxx.xxx, yes. When I look at the VM in proxmox in real time, I usually see a second or so with that IP before the actual IP is given by the DHCP server. It could be a "slow" race condition between when the provider checks and when the IP is given? Then maybe checking if the IP is in 10.0.0.0/8, 172.16.0.0/12, or 192.168.0.0/16 could be worth doing? But that implies it's supposed to be a private IP though. Then maybe just ignoring 169.254.xxx.xxx would be better? Just throwing ideas
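
To make the two ideas above concrete (accept only private ranges, or simply reject 169.254.xxx.xxx), here is a small Go sketch that classifies addresses with the standard library's net/netip package. The classify helper and the sample addresses are hypothetical and exist only for this illustration.

```go
package main

import (
	"fmt"
	"net/netip"
)

// classify labels an address using the standard library's range checks.
func classify(s string) string {
	addr, err := netip.ParseAddr(s)
	if err != nil {
		return "invalid"
	}
	switch {
	case addr.IsLoopback():
		return "loopback"
	case addr.IsLinkLocalUnicast(): // 169.254.0.0/16 and fe80::/10
		return "link-local"
	case addr.IsPrivate(): // 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16, fc00::/7
		return "private"
	default:
		return "other"
	}
}

func main() {
	samples := []string{
		"169.254.10.20", "192.168.1.15", "10.0.0.5",
		"172.16.4.2", "8.8.8.8", "fe80::1",
	}
	for _, s := range samples {
		fmt.Printf("%-15s %s\n", s, classify(s))
	}
}
```

An allow-list of private ranges would also reject VMs with public addresses (the "other" case here). Rejecting only link-local addresses avoids that assumption.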

@Tinyblargon
Collaborator

@arty-hlr After doing some more research these are Link-Local addresses and should be ignored.
https://en.wikipedia.org/wiki/Link-local_address

@arty-hlr
Author

@Tinyblargon Yes exactly. Do you mean they should already be ignored, or it should be fixed so that they are? From what I can see it is reported by getPrimaryIP but not printed in initConnInfo.

@Tinyblargon
Collaborator

@arty-hlr it should be fixed so they are ignored.
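
One possible shape for such a fix, sketched in Go under stated assumptions: when choosing an address from what the guest agent reports, skip loopback and link-local entries and tell the caller to keep polling if nothing usable is left. The function name firstUsableIPv4 and the sample addresses are hypothetical; this is not the change that was made in the provider.

```go
package main

import (
	"fmt"
	"net/netip"
)

// firstUsableIPv4 returns the first IPv4 address that is neither loopback nor
// link-local. ok is false when no such address exists yet, in which case the
// caller would be expected to keep polling.
func firstUsableIPv4(addrs []string) (ip string, ok bool) {
	for _, s := range addrs {
		addr, err := netip.ParseAddr(s)
		if err != nil || !addr.Is4() {
			continue // skip unparsable entries and IPv6 addresses
		}
		if addr.IsLoopback() || addr.IsLinkLocalUnicast() {
			continue // skip 127.0.0.0/8 and 169.254.0.0/16
		}
		return addr.String(), true
	}
	return "", false
}

func main() {
	// Hypothetical agent reports: before and after the DHCP lease arrives.
	early := []string{"169.254.83.107", "fe80::1"}
	later := []string{"10.20.30.40", "fe80::1"}

	ip, ok := firstUsableIPv4(early)
	fmt.Printf("early: ip=%q ok=%v\n", ip, ok)

	ip, ok = firstUsableIPv4(later)
	fmt.Printf("later: ip=%q ok=%v\n", ip, ok)
}
```

With the early report the function returns ok=false, so a polling loop would retry instead of treating the temporary address as the primary IP.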


3 participants