Calico no connectivity from windows to linux pods #5539

Closed
MBcom opened this issue Jan 29, 2022 · 21 comments

Comments

@MBcom

MBcom commented Jan 29, 2022

Expected Behavior

  • Windows pods can reach the Kubernetes internal DNS server and other Pods/Services scheduled on Linux nodes.

Current Behavior

  • Requests from Windows Pods to Kubernetes Services time out.
  • Requests from Windows Pods to, e.g., internet or intranet resources work.
  • Requests from Linux Nodes/ Pods to Services/ Pods on Windows Node work.

Possible Solution

Steps to Reproduce (for bugs)

Invoke-WebRequest https://docs.projectcalico.org/scripts/install-calico-windows.ps1 -OutFile C:\install-calico-windows.ps1
C:\install-calico-windows.ps1 -DownloadOnly yes -KubeVersion 1.23.3 -ServiceCidr 10.152.183.0/24 -DNSServerIPs 10.152.183.10
C:\CalicoWindows\install-calico.ps1
C:\CalicoWindows\kubernetes\install-kube-services.ps1

We changed the log level in C:\k\cni\config\10-calico.conf to error because Calico reports that a file named 'mtu' was not found, which makes kubelet think there is an error. Then we ran:

Start-Service kubelet
Start-Service kube-proxy

Context

Our Windows pods need to access S3 storage deployed on Linux nodes, so they need access to other Kubernetes services/pods.
The HNSEndpoints look good:

ActivityId         : 603DB0F9-F200-4233-B67F-69CDEF1F114E
AdditionalParams   :
EncapOverhead      : 50
Health             : @{LastErrorCode=0; LastUpdateTime=132879468690721008}
ID                 : 3034206C-BF5F-49E4-A261-7B5366695FE9
IPAddress          : 10.1.88.194
IsRemoteEndpoint   : True
MacAddress         : 00:50:56:ba:96:d1
Name               : Calico_ep
Policies           : {@{PA=10.19.11.3; Type=PA}}
PrefixLength       : 26
Resources          : @{AdditionalParams=; AllocationOrder=1; Allocators=System.Collections.ArrayList; Health=; ID=603DB0F9-F200-4233-B67F-69CDEF1F114E; PortOperationTime=0;
                     State=1; SwitchOperationTime=0; VfpOperationTime=0; parentId=27768260-1B84-415D-99FC-A997955FF364}
SharedContainers   : {}
State              : 1
Type               : Overlay
Version            : 38654705669
VirtualNetwork     : 82CEB1BC-E638-4433-A470-53974EE6D7D7
VirtualNetworkName : Calico
RunspaceId         : 640dce56-8fe2-4726-a885-51183e7c5b88

ActivityId                : 00F8E23A-DD3E-45B8-9249-7E939DDC8980
AdditionalParams          :
CreateProcessingStartTime : 132879468710080324
DNSServerList             : 10.152.183.10
DNSSuffix                 : test.svc.cluster.local,svc.cluster.local,cluster.local
EncapOverhead             : 50
GatewayAddress            : 10.1.88.193
Health                    : @{LastErrorCode=0; LastUpdateTime=132879468710064560}
ID                        : E64A101E-3869-4E27-AAB1-D84F51A23DCE
IPAddress                 : 10.1.88.245
MacAddress                : 0E-2A-0a-01-58-f5
Name                      : 2406b68077f0fe9181d0b71b377d0cff4a36553a2bdadc5ca2e061db52b732c7_Calico
Policies                  : {@{ExceptionList=System.Collections.ArrayList; Type=OutBoundNAT}, @{DestinationPrefix=10.152.183.0/24; NeedEncap=True; Type=ROUTE}, @{PA=10.19.11.3;
                            Type=PA}, @{Action=Allow; Direction=In; Id=allow-host-to-endpoint; InternalPort=0; LocalAddresses=; LocalPort=0; Priority=900; Protocol=256;
                            RemoteAddresses=10.19.11.3/32,172.22.96.1/32; RemotePort=0; RuleType=Switch; Scope=0; ServiceName=; Type=ACL}…}
PrefixLength              : 26
Resources                 : @{AdditionalParams=; AllocationOrder=13; Allocators=System.Collections.ArrayList; Health=; ID=00F8E23A-DD3E-45B8-9249-7E939DDC8980;
                            PortOperationTime=0; State=1; SwitchOperationTime=0; VfpOperationTime=0; parentId=27768260-1B84-415D-99FC-A997955FF364}
SharedContainers          : {2406b68077f0fe9181d0b71b377d0cff4a36553a2bdadc5ca2e061db52b732c7, 7e2257a99ca038d3f562920658a95176a32af1209005f1ecab1e4fe07d18667f}
StartTime                 : 132879468711043526
State                     : 3
Type                      : Overlay
Version                   : 38654705669
VirtualNetwork            : 82CEB1BC-E638-4433-A470-53974EE6D7D7
VirtualNetworkName        : Calico
RunspaceId                : 640dce56-8fe2-4726-a885-51183e7c5b88

We saw startup/startup.go 907: Selected default IP pool is '192.168.0.0/16' in calico-node.log, but the IPPool is configured as

apiVersion: projectcalico.org/v3
items:
- apiVersion: projectcalico.org/v3
  kind: IPPool
  metadata:
    creationTimestamp: "2021-08-31T13:50:55Z"
    name: default-ipv4-ippool
    resourceVersion: "395"
    uid: b3690751-d0d2-4817-b7ab-d318ca0ecd3a
  spec:
    blockSize: 26
    cidr: 10.1.0.0/16
    ipipMode: Never
    natOutgoing: true
    nodeSelector: all()
    vxlanMode: Always
kind: IPPoolList
metadata:
  resourceVersion: "21325002"
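
As a quick sanity check on the dumps above, here is a minimal Python sketch that tests whether the two endpoint addresses fall inside the configured pool rather than the 192.168.0.0/16 pool named in the log. The values are copied from the output above; the helper name `containing_block` is hypothetical and only for illustration.

```python
import ipaddress

# Values copied from the HNS endpoint dump and the IPPool shown above.
configured_pool = ipaddress.ip_network("10.1.0.0/16")
logged_default_pool = ipaddress.ip_network("192.168.0.0/16")
block_size = 26
endpoint_ips = ["10.1.88.194", "10.1.88.245"]


def containing_block(ip, pool, prefix_len):
    """Return the /prefix_len block inside `pool` that contains `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    if addr not in pool:
        return None
    # strict=False masks the host bits so we get the enclosing block.
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)


for ip in endpoint_ips:
    block = containing_block(ip, configured_pool, block_size)
    in_default = ipaddress.ip_address(ip) in logged_default_pool
    print(ip, "block:", block, "in 192.168.0.0/16:", in_default)
```

Both addresses land in a /26 block of 10.1.0.0/16 and not in 192.168.0.0/16, which is consistent with the endpoints being allocated from the existing pool even though the startup log mentions the default CIDR.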

Your Environment

  • Calico version v3.21.4
  • Orchestrator version (e.g. kubernetes, mesos, rkt): v1.23.3 (microk8s)
  • Operating System and version: Microsoft Windows Server 2019 Datacenter 10.0.17763
  • Link to your project (optional):
@caseydavenport
Member

Looks likely to be a general Windows networking issue - @song-jiang @lmm should be able to help

@lmm
Contributor

lmm commented Feb 4, 2022

@MBcom This log:

startup/startup.go 907: Selected default IP pool is '192.168.0.0/16'

means that no default ippool CIDR was defined with CALICO_IPV4POOL_CIDR and so the default CIDR was used.
Did you define the env var CALICO_IPV4POOL_CIDR?

If you have an existing ippool you want to use, then you can use the env var NO_DEFAULT_POOLS
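
To make the behaviour described here concrete, a simplified Python model of how the startup pool CIDR is picked from the environment. This is only a sketch of the rule as stated in this comment, not Calico's actual startup code; the function name `startup_ipv4_pool` is hypothetical, and treating only the string "true" as enabling NO_DEFAULT_POOLS is an assumption.

```python
# Default CIDR as reported in the calico-node log line quoted above.
DEFAULT_IPV4_POOL = "192.168.0.0/16"


def startup_ipv4_pool(env):
    """Return the CIDR a default pool would be created with, or None.

    Simplified model: NO_DEFAULT_POOLS suppresses pool creation,
    CALICO_IPV4POOL_CIDR overrides the built-in default.
    """
    if env.get("NO_DEFAULT_POOLS", "").lower() == "true":
        return None
    return env.get("CALICO_IPV4POOL_CIDR") or DEFAULT_IPV4_POOL


print(startup_ipv4_pool({}))
print(startup_ipv4_pool({"CALICO_IPV4POOL_CIDR": "10.1.0.0/16"}))
print(startup_ipv4_pool({"NO_DEFAULT_POOLS": "true"}))
```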

@lmm
Contributor

lmm commented Feb 4, 2022

Actually, I see now that in your linked PR, you've already set

            - name: CALICO_IPV4POOL_CIDR
              value: "10.1.0.0/16"

Was this the manifest you used to produce this issue?

@MBcom
Author

MBcom commented Feb 14, 2022

@lmm thanks for your response and sorry for my late response

I have set the environment variable you mentioned now but the connection problem persists. The mentioned startup message is gone now.

Yes, the linked PR contains the working manifest for the Linux cluster nodes.

@lmm
Contributor

lmm commented Feb 14, 2022

@MBcom are you seeing the connection problem now with new Windows pods that are using your 10.1.0.0/16 ippool?

@MBcom
Author

MBcom commented Feb 15, 2022

@lmm Yes, the connection problem is still there. Do you need some more logs?
Thanks in advance

@lmm
Contributor

lmm commented Feb 16, 2022

@MBcom do you see any errors in the calico-felix log file in c:\CalicoWindows\logs? Are there any errors or hints in the kube-proxy log in c:\k\?

@MBcom
Author

MBcom commented Feb 18, 2022

Hi @lmm,
I see the following in kube-proxy.err.log:

server.go:225] "Warning, all flags other than --config, --write-config-to, and --cleanup are deprecated, please begin using a config file ASAP"
run.go:74] "command failed" err="failed complete: cannot set feature gate IPv6DualStack to false, feature is locked to true"

I can't see any errors in kube-proxy.log, but kube-proxy restarts very often (~8200 times in 10 h), so it looks like it crashes.
So I commented out line 92 in c:\calicowindows\kubernetes\kube-proxy-service.ps1, which lets kube-proxy start again.

#    $extraFeatures += "IPv6DualStack=false"
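
Since kube-proxy.log itself showed no errors, the failure is easy to miss. Here is a minimal Python sketch that scans kube-proxy.err.log lines for this kind of locked feature gate failure. The regular expression is derived only from the single error line quoted above, so treating that wording as stable across Kubernetes versions is an assumption; the function name is hypothetical.

```python
import re

# Pattern derived from the error line quoted above (assumed wording).
LOCKED_GATE = re.compile(
    r"cannot set feature gate (\w+) to (true|false), "
    r"feature is locked to (true|false)"
)


def locked_feature_gates(log_lines):
    """Return {gate: (requested_value, locked_value)} for each match."""
    found = {}
    for line in log_lines:
        match = LOCKED_GATE.search(line)
        if match:
            gate, requested, locked = match.groups()
            found[gate] = (requested, locked)
    return found


sample = [
    'server.go:225] "Warning, all flags other than --config, '
    '--write-config-to, and --cleanup are deprecated, please begin '
    'using a config file ASAP"',
    'run.go:74] "command failed" err="failed complete: cannot set '
    'feature gate IPv6DualStack to false, feature is locked to true"',
]
print(locked_feature_gates(sample))  # {'IPv6DualStack': ('false', 'true')}
```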

Thanks @lmm

In summary:

I'm not sure if it is the best solution to PR the above changes because there could be things I missed?

@lmm
Contributor

lmm commented Feb 19, 2022

Hi @MBcom, thanks, you are right: our kube-proxy configuration is broken for k8s 1.23+. I tested this myself and kube-proxy will just loop forever in this state doing nothing. This explains why the egress traffic from your Windows pods was not working.

The workaround of adding the IPv6DualStack feature flag is not needed anymore. The fix for the underlying issue is in k8s 1.21.6 (and newer).

So I think the immediate thing to PR is to remove $extraFeatures += "IPv6DualStack=false". If you could PR that change, that would be awesome.

@lmm
Contributor

lmm commented Feb 19, 2022

@MBcom as for your other notes:

the log level in C:\k\cni\config\10-calico.conf have to be set to error because any log message to stdout is treated as an error by kubelet

Could you please explain more about the issue you're seeing here? My understanding of the log_level param is that it controls the Calico cni-plugin logging. But for Windows we don't enable logging to file by default.

kube-proxy should be configured by a config file

Yes I think we should do this but perhaps in a separate PR. Maybe @song-jiang might have ideas about this too.

@lmm
Contributor

lmm commented Feb 23, 2022

@MBcom would you be up for putting up the fix to remove the IPv6DualStack flag? We'd like to get the fix in this week if possible.

@lmm
Contributor

lmm commented Feb 24, 2022

@MBcom I've put up the PR for the kube-proxy service fix.

@MBcom
Author

MBcom commented Mar 6, 2022

@lmm thanks for creating the PR. I was too busy to come back here.

Regarding the other problem:
Kubelet didn't create the pods if we left the log level unchanged. Kubelet's error message contained an info-level message from Calico indicating that Calico didn't find a file for MTU in /var/lib/calico/mtu (or similar). The pods are created successfully if we reduce the log level to hide info messages.

@ckrueger1979

ckrueger1979 commented Mar 10, 2022

Hi,

we have exactly the same problem. We have Calico 3.22.1 on Windows Server 2019. We tried VXLAN and BGP.
It makes no difference, from a windows pod it's not possible to reach the ClusterIP.
I assume that either routing or NAT is broken on Windows nodes.

If I can provide any useful logs, I'll be happy to post them.

greetings
Carsten

@lmm
Contributor

lmm commented Mar 10, 2022

kube-proxy should be configured by a config file

I looked a bit further and that warning log is actually quite old (5 years old).

Kubelet didn't create the pods if we left the log level unchanged.

@MBcom that's strange, the log level should not be preventing kubelet from doing its work. Is there anything in the kubelet logs that suggest anything? What does a kubectl describe on the pods show?

@lmm
Contributor

lmm commented Mar 10, 2022

@ckruegerkpmg could you please provide logs? See this comment: #5539 (comment)

The kube-proxy config bug should have been fixed in Calico v3.22.1

@ckrueger1979

ckrueger1979 commented Mar 11, 2022

@lmm
Diagnostics + Calico Logs

https://github.com/ckruegerkpmg/logs/blob/main/logs.zip?raw=true

PS:
Something odd I saw in the install-calico-windows.ps1 script:
It downloads an ancient version (4 years old) of the hns PowerShell module:
DownloadFile -Url https://github.com/Microsoft/SDN/raw/master/Kubernetes/windows/hns.psm1 -Destination $BaseDir\hns.psm1

A much newer version (2 years old) is available as a nuget package:
https://www.powershellgallery.com/packages/HNS/

@ckrueger1979

@lmm:

The bug IS fixed with 3.22.1 but the
https://projectcalico.docs.tigera.io/scripts/install-calico-windows.ps1
script is stupid.

It doesn't download the new version even if specified via
$ReleaseBaseURL and $ReleaseFile

if (!(Test-Path $CalicoZip))
{
    Write-Host "$CalicoZip not found, downloading Calico for Windows release..."
    DownloadFile -Url $ReleaseBaseURL/$ReleaseFile -Destination c:\calico-windows.zip
}

Is install-calico-windows.ps1 available somewhere in a public git repo?
I'd like to open some pull requests.

@lmm
Contributor

lmm commented Mar 14, 2022

It downloads an ancient version (4 years old) of the hns PowerShell module

@ckruegerkpmg yeah you're right, that file is pretty old. Thanks, I didn't know that nuget package existed! We'll have to see if that package works as a drop-in replacement.

It doesn't download the new version even if specified via

Ugh... the script has the wrong version embedded for the v3.22.1 release: https://github.com/projectcalico/calico/blob/v3.22.1/calico/scripts/install-calico-windows.ps1#L23-L24

Thanks for raising that, I'll fix that now. I'll also need to fix how our install script is handled during our release. (This is all fallout from this which I should consider reverting.)

It should be possible to override $ReleaseBaseURL and $ReleaseFile with something like:

c:\install-calico-windows.ps1 -ReleaseBaseURL "https://github.com/projectcalico/calico/releases/download/v3.21.4/" -ReleaseFile "calico-windows-v3.21.4.zip"
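
One detail worth checking with this override: the snippet quoted earlier builds the download URL as $ReleaseBaseURL/$ReleaseFile, so a base URL that already ends in a slash would produce a double slash. Whether that matters for the download is not established in this thread. A minimal Python sketch of a join that avoids it (the function name `release_url` is hypothetical and not part of the install script):

```python
def release_url(base_url, release_file):
    """Join base URL and file name with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + release_file.lstrip("/")


# Values taken from the example command above.
base = "https://github.com/projectcalico/calico/releases/download/v3.21.4/"
name = "calico-windows-v3.21.4.zip"

print(base + "/" + name)        # naive join, as in the quoted script
print(release_url(base, name))  # normalized join
```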

@lmm
Contributor

lmm commented Mar 14, 2022

@ckruegerkpmg I've opened #5749 to track the bug you raised.

Is install-calico-windows.ps1 available somewhere in a public git repo?
I'd like to open some pull requests.

Please take a look here. In my previous comment I linked to the file in the current release branch.

@lmm
Contributor

lmm commented Apr 22, 2022

The original kube-proxy service bug has been fixed.

@lmm lmm closed this as completed Apr 22, 2022
@lmm lmm unassigned lmm and song-jiang Apr 22, 2022