Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove support for Canal and the vxlan Flannel backend #8614

Closed
wants to merge 1 commit into from

Conversation

johngmyers
Copy link
Member

Due to flannel-io/flannel#1243

Targeting kops 1.17

Fixes #8562

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 22, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: johngmyers
To complete the pull request process, please assign justinsb
You can assign the PR to them by writing /assign @justinsb in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@johngmyers johngmyers force-pushed the flannel-issue branch 2 times, most recently from 017d8c1 to 96e044d Compare February 23, 2020 00:02
@johngmyers
Copy link
Member Author

There's activity on the underlying bug, so
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 23, 2020
@johngmyers
Copy link
Member Author

/test pull-kops-verify

@joshbranham
Copy link
Contributor

Yeah this thread also has some activity, the maintainer is referenced as well: flannel-io/flannel#1245

@jhohertz
Copy link
Contributor

It would be great if the flannel folks get this fixed up, but there doesn't seem to be a lot of movement.

@johngmyers
Copy link
Member Author

I think I'll asses the state in 2-3 weeks. If no progress by then I'll un-hold.

This should be a 1.17 blocker.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 27, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 28, 2020
@gamer22026
Copy link

How does this affect running flannel in AWS ? Even with disabling Source/Destination Check on the nodes, I was not able to get flannel with host-gw backend working (k8s 1.16.7) and aws-vpc is a non-starter.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 7, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 13, 2020
@johngmyers
Copy link
Member Author

I don't see any progress.
/hold cancel

@k8s-ci-robot k8s-ci-robot added area/api area/documentation and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Mar 13, 2020
@johngmyers
Copy link
Member Author

Decision from Kops Office Hours is to defer for two weeks.
/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 13, 2020
@johngmyers
Copy link
Member Author

Decision of kops office hours is to proceed.
/hold cancel

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Apr 10, 2020
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 10, 2020
@johngmyers
Copy link
Member Author

/retest

@andersosthus
Copy link
Contributor

There's an open PR now to fix this issue at flannel-io/flannel#1282.
For our part we use Flannel to run Windows Worker nodes, and Flannel is afaik the only supported overlay (https://kubernetes.io/docs/setup/production-environment/windows/intro-windows-in-kubernetes/#networking)

@justinsb
Copy link
Member

I've also (so far) only been able to reproduce the problem on rhel 7.8 images.

@justinsb
Copy link
Member

justinsb commented Apr 20, 2020

And I actually think that #8381 might be a fix.

@justinsb
Copy link
Member

justinsb commented Apr 20, 2020

I'm trying to get some flannel tests going on "every distro": kubernetes/test-infra#17300

@johngmyers
Copy link
Member Author

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 20, 2020
@jhohertz
Copy link
Contributor

Just a data point, I tried adding the sysctls from the CentOS7 fix to a Flatcar-based cluster and didn't see any change in flannel's behaviour.

Trying to figure out the best way to shoehorn the ethtool/checksum fix in for a test now...

@hakman
Copy link
Member

hakman commented May 1, 2020

Seems that the proper fix was accepted: kubernetes/kubernetes#88986 (comment). Getting it to already released distros will probably take a long time, if it ever happens.

The patch comment says that the problem seems to be there from day one, although it was probably not visible before UDP tunnels were implemented.

For now the best hope for the near future is the flannel-io/flannel#1282, which will require a new Flannel release.

@jhohertz
Copy link
Contributor

jhohertz commented May 5, 2020

I'm not sure the "proper" fix is the whole story here. I rolled a version of flatcar with the cited kernel patch and see no difference in behaviour for flannel vxlan / canal.

@jhohertz
Copy link
Contributor

jhohertz commented May 5, 2020

I can however now confirm using ethtool to disable offloading works.

@hakman
Copy link
Member

hakman commented May 5, 2020

@jhohertz just curious, how did you try flatcar with that patch?

@jhohertz
Copy link
Contributor

jhohertz commented May 6, 2020

@hakman I forked the manifest and coreos-overlay repositories to add the patch. Running cork create for my attempt looks like: cork create --manifest-branch viafoura-build-2345.3.1 --manifest-url https://github.com/viafoura/flatcar-manifest.git. There are additional steps to actually turn the result into an AMI.

The backport had one minor difference for 4.19 in call signatures but otherwise the logic was identical. There might be other differences that need accounting for, but the patch is so tiny... At any rate, my attempt lives here in the context of the build setup above: https://github.com/viafoura/flatcar-coreos-overlay/blob/viafoura-build-2345.3.1/sys-kernel/coreos-sources/files/4.19/z0004-netfilter-nat-never-update-the-UDP-checksum-when-it-.patch

@jhohertz
Copy link
Contributor

jhohertz commented May 6, 2020

Re: the ethtool thing, I'm going to try to adapt this into a hooks yaml fragment suitable for a cluster spec. Not really used that bit of functionality before, I'll share if I get a working example.

@jhohertz
Copy link
Contributor

jhohertz commented May 6, 2020

At first glance, this seems to work in a cluster spec to implement the workaround in a not terrible way. FAR from extensively tested. And of course if this is actually disabling real hardware offload, this isn't an ideal fix, but it may be the best we can get for now.

spec:
  hooks:
  - name: flannel-cksum-offload-disable.service
    useRawManifest: true
    roles: [Master, Node]
    manifest: |
      [Unit]
      Description=Turn off checksum offload on flannel.1
      After=sys-devices-virtual-net-flannel.1.device
      [Install]
      WantedBy=sys-devices-virtual-net-flannel.1.device
      [Service]
      Type=oneshot
      ExecStart=/sbin/ethtool -K flannel.1 tx-checksum-ip-generic off

@hakman
Copy link
Member

hakman commented May 6, 2020

Re: the ethtool thing, I'm going to try to adapt this into a hooks yaml fragment suitable for a cluster spec. Not really used that bit of functionality before, I'll share if I get a working example.

@jhohertz That is not such a bad idea. May be something to think about adding in Kops as a workaround until a new version of Flannel is released.

@johngmyers
Copy link
Member Author

Closing in favor of #9074

@johngmyers johngmyers closed this May 6, 2020
@johngmyers johngmyers deleted the flannel-issue branch May 6, 2020 15:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/api area/documentation cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1.17 alpha versions causing regression for kiam?
8 participants