Windows workloads cannot be deleted post upgrade, stuck in Terminating #5551

Closed · HarrisonWAffel opened this issue Mar 5, 2024 · 10 comments
Labels: area/windows, kind/bug

@HarrisonWAffel (Contributor)

Environmental Info:
RKE2 Version: v1.25.16, upgrading to v1.26; this appears to reproduce for upgrades between any two versions.

rke2 version v1.25.16+rke2r1 (3fe54b9)
go version go1.20.11 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

Server: Linux haffel-testing-linux-server-0 5.4.0-1109-azure #115~18.04.1-Ubuntu SMP Mon May 22 20:06:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Windows Worker: Microsoft Windows [Version 10.0.20348.2322] (2022 Azure datacenter)

Cluster Configuration:

1 server, all roles
1 windows worker

Describe the bug:

Windows workloads will not properly delete after an rke2 upgrade. For instance, deploying a simple IIS web server via a Deployment on 1.25 and then attempting to delete a pod spawned from that Deployment after upgrading to 1.26 will result in the pod never completely terminating. kubectl describe pod shows the following error message:

error killing pod: failed to "KillPodSandbox" for "18d4a5fe-e570-4257-a3f4-0c6cb54f3a79" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to remove network namespace for sandbox \"8a0c4392b09910f8a944ab51a7639185e829c884a568e4a3d9bacae43db63d97\": hcnDeleteNamespace failed in Win32: The specified request is unsupported. (0x803b0015) {\"Success\":false,\"Error\":\"The specified request is unsupported. \",\"ErrorCode\":2151350293}

This issue was encountered while debugging rancher/rancher#42414, but it reproduces for workloads other than the Rancher monitoring chart. It does not appear to be version specific and has been reproduced across a number of different rke2 upgrade paths.
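
To inspect the underlying HNS state on the affected Windows worker, something like the following helps (a sketch: Get-HnsNetwork and Get-HnsEndpoint come from Microsoft's hns.psm1 helper module in the microsoft/SDN repo and are assumed to be imported; hcsdiag.exe ships with the Containers feature):

hcsdiag list                                      # compute systems still holding the namespace
Get-HnsNetwork | Select-Object Name, Id, Type     # networks recreated on restart
Get-HnsEndpoint | Select-Object IPAddress, State  # endpoints with reassigned IPs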

Steps To Reproduce:

  • Create 1 Linux server node running 1.25 (though this is not specific to 1.25)
  • Create 1 Windows worker node running 1.25
  • Create a Windows IIS web server deployment and wait for the pods to be created
  • Upgrade the Linux server to 1.26 and wait for the upgrade to complete
  • Upgrade the Windows worker to 1.26
  • Attempt to delete a pod created by the IIS web server deployment (see the sketch below)
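
A minimal sketch of that last step, assuming kubectl access to the cluster (the pod name is illustrative):

kubectl delete pod win-iis-7d9c5b6f4-abcde --wait=false
kubectl get pod win-iis-7d9c5b6f4-abcde        # remains in Terminating indefinitely
kubectl describe pod win-iis-7d9c5b6f4-abcde   # reports the KillPodSandbox error above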

Expected behavior:
The pod is fully deleted

Actual behavior:
The pod is stuck in Terminating

Additional context / logs:

@HarrisonWAffel (Contributor, Author)

From what I can determine, the root cause of this issue stems from RKE2 deleting and recreating all Calico networks each time it restarts. Calico/Felix tracks HNS endpoints by their IP addresses; when networks are recreated, endpoints are recreated as well and assigned new IP addresses. This makes it impossible to delete the original endpoints, which in turn prevents references to containers from being removed from HNS namespaces.

The error shown when running kubectl describe on a pod stuck in Terminating ({\"Success\":false,\"Error\":\"The specified request is unsupported. \",\"ErrorCode\":2151350293}) indicates a failure to delete an HNS namespace because containers are still running within that namespace (as described in the linked microsoft/Windows-Containers issue).

It looks like the upstream Calico scripts that handle starting the calico-node service on Windows only delete and recreate networks when starting the node after a reboot. RKE2 may need to adopt the same behavior.
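
A minimal PowerShell sketch of that reboot check (the marker-file path is an assumption for illustration; upstream Calico implements the equivalent in its node start scripts):

$marker = 'C:\var\lib\rancher\rke2\agent\last-boot-time'
$lastBoot = (Get-CimInstance Win32_OperatingSystem).LastBootUpTime.ToString('o')
if (-not (Test-Path $marker) -or (Get-Content $marker) -ne $lastBoot) {
    # the node rebooted since the last run: HNS state is gone, so deleting and
    # recreating the Calico networks is safe (and necessary)
    # ...delete/recreate networks here...
    Set-Content -Path $marker -Value $lastBoot
} else {
    # plain service restart: leave existing networks and endpoints alone so HNS
    # namespaces can still be torn down when pods are deleted
}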

@brandond (Member)

It looks like prior to #3615 that function was called deleteAllNetworksOnNodeRestart, so clearly that was the intent at some point. However, @manuelbuil noticed that it didn't actually contain any code to detect a reboot and simply deleted all the networks unconditionally whenever it was called, so he corrected the function name.

The incorrect behavior goes back to the original, rushed Windows implementation from Jamie in #1268.

brandond added this to the August 2024 Release Cycle milestone on Jul 18, 2024
@brandond (Member)

@caroline-suse-rancher can we get someone from the network team on this?

@rbrtbnfgl (Contributor) commented Jul 30, 2024

I think we are already doing it here: https://github.com/rancher/rke2/blob/master/pkg/windows/calico.go#L305

Edit: I see Brad's comment already says that.

@mdrahman-suse (Contributor) commented Aug 19, 2024

@HarrisonWAffel Do you have an example Windows IIS workload that I can use? I was not able to replicate this issue using this deployment for testing: https://raw.githubusercontent.com/rancher/distros-test-framework/main/workloads/amd64/windows_app_deployment.yaml
I also followed the steps below to upgrade rke2 with a Windows agent:

Setup:

1 Linux server, 1 Linux worker, 1 Windows (2019) worker

Steps:

  • Installed rke2 v1.30.2+rke2r1 on all the nodes
  • Ensured the cluster is up and running
  • Deployed the workload mentioned above (pod/windows-app-deployment-*)
  • Installed v1.30.3+rke2r1 on the server / worker nodes
  • Restarted services on the nodes
  • On Windows node
    • Stopped rke2 service
    • Installed v1.30.3+rke2r1
    • Restarted rke2 service (see the PowerShell sketch after these steps)
  • Ensured the cluster is upgraded
  • Deleted the deployment successfully

Please advise if any changes are needed in the above steps.
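
For reference, the Windows-node steps above map roughly to the following PowerShell (a sketch; the install-script invocation is an assumption, check the rke2 docs for pinning a specific version):

Stop-Service rke2
Invoke-WebRequest -Uri https://raw.githubusercontent.com/rancher/rke2/master/install.ps1 -OutFile install.ps1
./install.ps1    # installs rke2; pin the target version per the rke2 docs
Start-Service rke2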

@HarrisonWAffel (Contributor, Author) commented Aug 20, 2024

@mdrahman-suse

I was able to reproduce this on v1.25 using the following workload:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: win-webserver
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      containers:
        - name: windowswebserver
          image: mcr.microsoft.com/windows/servercore:ltsc2022
          command:
            - powershell.exe
            - -command
            - "<#code used from https://gist.github.com/19WAS85/5424431#> ; $$listener = New-Object System.Net.HttpListener ; $$listener.Prefixes.Add('http://*:80/') ; $$listener.Start() ; $$callerCounts = @{} ; Write-Host('Listening at http://*:80/') ; while ($$listener.IsListening) { ;$$context = $$listener.GetContext() ;$$requestUrl = $$context.Request.Url ;$$clientIP = $$context.Request.RemoteEndPoint.Address ;$$response = $$context.Response ;Write-Host '' ;Write-Host('> {0}' -f $$requestUrl) ;  ;$$count = 1 ;$$k=$$callerCounts.Get_Item($$clientIP) ;if ($$k -ne $$null) { $$count += $$k } ;$$callerCounts.Set_Item($$clientIP, $$count) ;$$ip=(Get-NetAdapter | Get-NetIpAddress); $$header='<html><body><H1>Windows Container Web Server</H1>' ;$$callerCountsString='' ;$$callerCounts.Keys | % { $$callerCountsString+='<p>IP {0} callerCount {1} ' -f $$ip[1].IPAddress,$$callerCounts.Item($$_) } ;$$footer='</body></html>' ;$$content='{0}{1}{2}' -f $$header,$$callerCountsString,$$footer ;Write-Output $$content ;$$buffer = [System.Text.Encoding]::UTF8.GetBytes($$content) ;$$response.ContentLength64 = $$buffer.Length ;$$response.OutputStream.Write($$buffer, 0, $$buffer.Length) ;$$response.Close() ;$$responseStatus = $$response.StatusCode ;Write-Host('< {0}' -f $$responseStatus)  } ; "
      nodeSelector:
        kubernetes.io/os: windows

This should be reproducible both after an upgrade of the rke2 version and by simply restarting the rke2 service with PowerShell (sketched below). The image in the above YAML can be changed to ltsc2019 depending on your setup, but ideally this would be tested against both 2019 and 2022.
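
The service-restart repro, sketched (run on the Windows worker; the pod name is illustrative):

Restart-Service rke2
# then, from a machine with kubectl access:
kubectl delete pod win-webserver-6f5c9d4b8-abcde
kubectl get pods    # the pod stays in Terminating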

@mdrahman-suse (Contributor) commented Aug 20, 2024

Update: I was able to replicate the issue. Will validate and update in the respective release branches.
As per @rbrtbnfgl:

the issue is related to the RKE2 agent: when it restarts, it deletes the Windows virtual network, so the pods that are currently running are not able to communicate. So try nslookup on the Windows pod instead of deletion.
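
A hedged example of that check (the pod name is illustrative):

kubectl exec win-webserver-6f5c9d4b8-abcde -- nslookup kubernetes.default
# if the virtual network was recreated underneath the pod, resolution fails
# even though the pod still reports Running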

@mdrahman-suse (Contributor) commented Aug 20, 2024

(Quoting @HarrisonWAffel's workload comment above in full.)

Also, FYI: 1.25 is EOL, so this fix won't be available in that version. It will be in 1.27+.

@HarrisonWAffel (Contributor, Author)

Yep, totally understand that 1.25 is well past EOL. I just wanted to note that the workload I provided was last used to reproduce the issue on 1.25. After adding that comment I retested on 1.29 and reproduced it there as well.

@mdrahman-suse (Contributor) commented Aug 21, 2024

Validated on all the release branches with the latest RCs, except v1.27 (validated on a commit build). Closing this issue.
