NFSv4.2 is broken across different hosts #1565

MichaelEischer · 2024-10-15T15:07:21Z

Description

With Flatcar 3975.2.1, NFS 4.2 misbehaves: one pod writes a file, but a pod on a different host cannot see the content that was just written.

NFS 3 and 4.1 work as expected (4.0 has not been tested). Flatcar 3815.2.5 is also unaffected.

Impact

NFS 4.2 mounts are unusable.

Environment and steps to reproduce

Set-up:

At least two nodes in a k8s cluster running Flatcar 3975.2.1.
Set up nfs-ganesha:

helm repo add nfs-ganesha-server-and-external-provisioner https://kubernetes-sigs.github.io/nfs-ganesha-server-and-external-provisioner/
helm install my-release nfs-ganesha-server-and-external-provisioner/nfs-server-provisioner

Update the mount options in the `StorageClass` `nfs`:

allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    meta.helm.sh/release-name: my-release
    meta.helm.sh/release-namespace: default
  labels:
    app: nfs-server-provisioner
    app.kubernetes.io/managed-by: Helm
    chart: nfs-server-provisioner-1.8.0
    heritage: Helm
    release: my-release
  name: nfs
mountOptions:
- hard
- retrans=3
- proto=tcp
- nfsvers=4.2
- rsize=4096
- wsize=4096
- noatime
- nodiratime
provisioner: cluster.local/my-release-nfs-server-provisioner
reclaimPolicy: Delete
volumeBindingMode: Immediate

Create a PVC:

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: test-dynamic-volume-claim
spec:
  storageClassName: "nfs"
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 100Mi

Create the pods (they must be scheduled on different hosts):

apiVersion: v1
kind: Pod
metadata:
  name: test-pod-1
  labels:
    app: nginx
spec:
  containers:
    - name: test
      image: nginx
      volumeMounts:
        - name: config
          mountPath: /test
  volumes:
    - name: config
      persistentVolumeClaim:
        claimName: test-dynamic-volume-claim
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: nginx
        topologyKey: "kubernetes.io/hostname"
---
apiVersion: v1
kind: Pod
metadata:
  name: test-pod-2
  labels:
    app: nginx
spec:
  containers:
    - name: test
      image: nginx
      volumeMounts:
        - name: config
          mountPath: /test
  volumes:
    - name: config
      persistentVolumeClaim:
        claimName: test-dynamic-volume-claim
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: nginx
        topologyKey: "kubernetes.io/hostname"

Action(s):
a. kubectl exec -it test-pod-1 -- bash -c 'echo "def" > /test/testfile'
b. kubectl exec -it test-pod-2 -- bash -c 'cat /test/testfile'
Error: The call to cat should return "def", but returns nothing. Note that both pods see accurate metadata for the file (using ls -la /test).

Expected behavior

cat from test-pod-2 should return the content that was just written. Note that test-pod-1 is able to read the file content.

MichaelEischer · 2024-10-16T09:51:57Z

Just repeated the test with 3975.2.2; it is also affected.

Edit: https://cdn.kernel.org/pub/linux/kernel/v6.x/ChangeLog-6.6.55 contains a few NFS-related fixes, but it's unclear to me whether they would resolve this issue.

ader1990 · 2024-10-16T11:22:31Z

Hello, I could reproduce this behaviour on a two-node ARM64 environment running the latest Flatcar alpha (4116.0.0).

I think this is a host-level problem rather than a pod-level one, because the file also reads as empty when accessed directly from the second host:

# from k8s node 2
cat /var/lib/kubelet/pods/2cf32713-9a7a-412e-b4fc-998741deb125/volumes/kubernetes.io~nfs/pvc-8f47140d-1164-42a3-816f-05b41a9633c9/test

# empty output

ader1990 · 2024-10-16T12:50:05Z

Flatcar main with kernel 6.6.56 is also affected.

ader1990 · 2024-10-18T09:14:34Z

Tested Flatcar with kernel 6.10.9 and the issue is present there too. This looks like a Linux kernel regression, or possibly a tooling / containerd issue. The next step is to reproduce it outside of k8s to pinpoint the actual cause.

ader1990 · 2024-10-18T12:50:21Z

torvalds/linux@9cf2744#diff-a24af2ce5442597efe8051684905db2be615f41703247fbce9a446e77f2e9587R214 -> in the Linux tree, this is the only change I can see that might affect NFS in 6.6 or 6.10 compared with earlier kernels.

MichaelEischer · 2024-10-18T14:44:12Z

I'm wondering whether the underlying issue might be a bug in the ganesha NFS server that is now exposed by the read_plus default change.

Edit: I did some additional testing and the output of cat /test/testfile is actually not empty, but rather consists only of null bytes with the expected length.
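
The difference between "empty" and "only null bytes" is easy to miss in a terminal, since cat prints nothing visible in either case. A minimal sketch of a check that tells the two apart is shown here; the function name and the result labels are made up for illustration, and the bytes would have to be read from /test/testfile on the second pod:

```python
def classify_read(data: bytes, expected: bytes) -> str:
    """Classify what a reader got back for a file another client just wrote.

    The labels returned here are illustrative, not part of any tool.
    """
    if data == expected:
        return "ok"
    if len(data) == 0:
        return "empty"
    # Symptom described in this issue: right length, but every byte is 0x00.
    if len(data) == len(expected) and data.count(0) == len(data):
        return "nul-filled"
    return "mismatch"


if __name__ == "__main__":
    # 'echo "def"' writes the three letters plus a trailing newline.
    print(classify_read(b"\x00\x00\x00\x00", b"def\n"))
```

Piping the file through a hex dump on the affected pod would show the same thing without any code.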

ader1990 · 2024-10-21T06:39:34Z

> I'm wondering whether the underlying issue might be a bug in the ganesha NFS server that is now exposed by the read_plus default change.
>
> Edit: I did some additional testing and the output of cat /test/testfile is actually not empty, but rather consists only of null bytes with the expected length.

I am now trying to build a kernel with read_plus disabled; let's see how that goes.

tormath1 · 2024-10-21T08:02:31Z

We might want to update our NFS test to catch further regressions like this. (https://github.com/flatcar/mantle/blob/02348d65a5f9bd72f3e7412da54a688b7f972790/kola/tests/kubeadm/kubeadm.go#L237)

ader1990 · 2024-10-21T08:13:11Z

Tested with NFS_V4_2_READ_PLUS=n and the issue is gone. This is an upstream issue, either in the kernel or in the NFS server implementation, and needs to be reported properly. Any idea where best to report it?
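
For anyone who wants to check which setting a given node's kernel was built with, here is a rough sketch that parses kernel config text. It assumes the usual .config syntax (`CONFIG_X=y`, or `# CONFIG_X is not set` when disabled). Reading /proc/config.gz additionally assumes the kernel was built with CONFIG_IKCONFIG_PROC, which may not hold on every image. The function names are made up for illustration.

```python
import gzip
import re

OPTION = "CONFIG_NFS_V4_2_READ_PLUS"


def option_state(config_text: str, option: str = OPTION) -> str:
    """Return the option's value ("y", "m", ...), "n" if explicitly
    not set, or "absent" if the option is not mentioned at all."""
    for line in config_text.splitlines():
        line = line.strip()
        if line == f"# {option} is not set":
            return "n"
        match = re.fullmatch(rf"{re.escape(option)}=(\S+)", line)
        if match:
            return match.group(1)
    return "absent"


def running_kernel_state(path: str = "/proc/config.gz") -> str:
    # Assumption: the running kernel exposes its config at this path.
    with gzip.open(path, "rt") as handle:
        return option_state(handle.read())
```

On a node, calling `running_kernel_state()` reports the setting of the running kernel, provided the config file is exposed there.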

jepio · 2024-10-21T08:48:02Z

Normally https://lore.kernel.org/linux-nfs and/or the upstream of the server implementation, but the nfs-ganesha-server-and-external-provisioner repo (https://github.com/kubernetes-sigs/nfs-ganesha-server-and-external-provisioner) is still on Ganesha V4.0.8, whereas upstream just released V6. So there is some reason to think that this might already be fixed in newer versions.

ader1990 · 2024-10-21T10:01:08Z

kubernetes-sigs/nfs-ganesha-server-and-external-provisioner#152 -> there is a PR to update the chart to use Ganesha v6.

ader1990 · 2024-10-21T10:09:10Z

nfs-ganesha/nfs-ganesha@24da5c3#diff-d4e3191eebe00b04019cafa02691fef13becc8cb3cc098ae6c177653cea40561R776 -> this commit is the best candidate to have a fix for this issue.

Disable the CONFIG_NFS_V4_2_READ_PLUS kernel config option: Linux kernel >= 6.6 enables it by default, and nfs-ganesha <= 6.1 mishandles the read_plus operation.
See: nfs-ganesha/nfs-ganesha@24da5c3
See: flatcar/Flatcar#1565
See: nfs-ganesha/nfs-ganesha#1188
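
The version ranges stated above can be written down as a small check. This is only a sketch of the rule as described in this thread (kernel >= 6.6 with read_plus enabled, nfs-ganesha <= 6.1), not an authoritative compatibility matrix; the function names are hypothetical.

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the numeric components, e.g. "V4.0.8" -> (4, 0, 8)."""
    return tuple(int(part) for part in re.findall(r"\d+", text))


def likely_affected(kernel: str, ganesha: str,
                    read_plus_enabled: bool = True) -> bool:
    """Apply the rule from this thread: READ_PLUS enabled on the client,
    client kernel >= 6.6, and nfs-ganesha server <= 6.1."""
    kernel_mm = parse_version(kernel)[:2]    # compare major.minor only
    ganesha_mm = parse_version(ganesha)[:2]
    return read_plus_enabled and kernel_mm >= (6, 6) and ganesha_mm <= (6, 1)


if __name__ == "__main__":
    print(likely_affected("6.6.56", "V4.0.8"))
```

Only major.minor is compared, so a patch release such as 6.1.x of nfs-ganesha is still treated as <= 6.1, in line with the "< 6.2" wording used in the upstream report.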

MichaelEischer · 2024-11-18T14:44:21Z

Is there any update here yet?

ader1990 · 2024-11-18T14:56:51Z

Hello, the just-released Flatcar alpha, beta and stable versions from https://www.flatcar.org/releases include the Linux kernel fix.

MichaelEischer · 2024-11-22T15:28:36Z

I just gave the new Flatcar version a try and NFS works again. Thanks!

MichaelEischer added the kind/bug Something isn't working label Oct 15, 2024

github-project-automation bot added this to Flatcar tactical, release planning, and roadmap Oct 15, 2024

github-project-automation bot moved this to 📝 Needs Triage in Flatcar tactical, release planning, and roadmap Oct 15, 2024

ader1990 mentioned this issue Oct 16, 2024

Upgrade Linux Kernel for main from 6.6.54 to 6.6.56 flatcar/scripts#2370

Merged

ader1990 self-assigned this Oct 16, 2024

ader1990 mentioned this issue Oct 21, 2024

Linux kernel >= 6.6 default value of CONFIG_NFS_V4_2_READ_PLUS=y breaks nfs-ganesha < 6.2 nfs-ganesha/nfs-ganesha#1188

Open

github-actions bot mentioned this issue Oct 22, 2024

Monthly contributions report 2024-09-22 - 2024-10-21 #1568

Open

ader1990 mentioned this issue Oct 22, 2024

sys-kernel/coreos-modules: disable CONFIG_NFS_V4_2_READ_PLUS flatcar/scripts#2390

Merged

dongsupark moved this from 📝 Needs Triage to ⚒️ In Progress in Flatcar tactical, release planning, and roadmap Oct 25, 2024

dongsupark moved this from ⚒️ In Progress to ✅ Testing / in Review in Flatcar tactical, release planning, and roadmap Nov 4, 2024
