
Crash boot loop because of "Transport endpoint is not connected", umount -l #10

Open

guysoft opened this issue Mar 16, 2021 · 7 comments

guysoft (Contributor) commented Mar 16, 2021

Hey,
Sometimes I get an issue that puts the pod into CrashLoopBackOff.
The workaround is to ssh to the node and run umount -l (lazy unmount) on the mount point, then delete the pod and let it get recreated (sketch below).
During that time the mount is down.

Debugging results:

  • Trying to unmount returns "device or resource busy"
  • kubectl describe on the pod shows "Failed to create directory": the mount directory already exists
  • When I ran lsof (which had to be installed with yum), I got the error "Transport endpoint is not connected"
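
For reference, a sketch of the manual recovery steps, assuming the host-side mount point (/mnt/data-s3-fs/root) from the DaemonSet posted later in this thread; adjust paths and names to your setup:

# On the affected node: lazily detach the stale FUSE mount.
# umount -l detaches it immediately and cleans up once it is no longer busy.
sudo umount -l /mnt/data-s3-fs/root

# Then delete the crash-looping pod so the DaemonSet controller recreates it
# and s3fs mounts the bucket again.
kubectl delete pod <s3-provider-pod-name>
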
freegroup (Owner) commented

I already unmount in a "preStop" hook. Does your deployment contain this step as well?

          preStop:
            exec:
              command: ["/bin/sh","-c","umount -f /var/s3"]

guysoft (Contributor, Author) commented Mar 16, 2021

Yes, it looks like this here:

          preStop:
            exec:
              command: ["bash", "-c", "umount -f /srv/website-eu/root"]

BTW, I mount a sub-folder, because that way the pods don't lose the transport connection when you re-mount (the re-mount replaces the sub-folder, not the root path the pods map to).

I guess it should try something like

["bash", "-c", "umount -f /srv/website-eu/root || umount -l /srv/website-eu/root"]

I am not sure if that syntax works. I can test it unless you have a better suggestion.
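
For what it's worth, || is valid shell syntax here: the lazy unmount only runs if the forced one fails. A sketch of the preStop hook with that fallback, using the /srv/website-eu/root mount point from above:

          preStop:
            exec:
              # Force-unmount first; if that fails (mount busy or already
              # disconnected), fall back to a lazy unmount so the pod can stop cleanly.
              command: ["bash", "-c", "umount -f /srv/website-eu/root || umount -l /srv/website-eu/root"]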

guysoft (Contributor, Author) commented Apr 26, 2021

Ok, I think I solved it.
It seems like sometimes the pod loses its connection to the mount, resulting in "Transport endpoint is not connected".
The workaround I found is to add an init container that lazily unmounts the folder before the main container starts. That seems to fix the issue. Will let it run and see if it comes back:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: s3-provider
  name: s3-provider
spec:
  selector:
    matchLabels:
      app: s3-provider
  template:
    metadata:
      labels:
        app: s3-provider
    spec:
      initContainers:
      - name: init-myservice
        image: bash
        command: ['bash', '-c', 'umount -l /mnt/data-s3-fs/root ; true']
        securityContext:
          privileged: true
          capabilities:
            add:
            - SYS_ADMIN
        # use ALL  entries in the config map as environment variables
        envFrom:
        - configMapRef:
            name: s3-config
        volumeMounts:
        - name: devfuse
          mountPath: /dev/fuse
        - name: mntdatas3fs-init
          mountPath: /mnt:shared
      containers:
      - name: s3fuse
        image: 963341077747.dkr.ecr.us-east-1.amazonaws.com/kube-s3:1.0
        imagePullPolicy: Always
        lifecycle:
          preStop:
            exec:
              command: ["bash", "-c", "umount -f /srv/s3-mount/root"]
        securityContext:
          privileged: true
          capabilities:
            add:
            - SYS_ADMIN
        # use ALL  entries in the config map as environment variables
        envFrom:
        - configMapRef:
            name: s3-config
        env:
        - name: S3_BUCKET
          value: s3-mount
        - name: MNT_POINT
          value: /srv/s3-mount/root
        - name: IAM_ROLE
          value: none
        volumeMounts:
        - name: devfuse
          mountPath: /dev/fuse
        - name: mntdatas3fs
          mountPath: /srv/s3-mount/root:shared
      volumes:
      - name: devfuse
        hostPath:
          path: /dev/fuse
      - name: mntdatas3fs
        hostPath:
          type: DirectoryOrCreate
          path: /mnt/data-s3-fs/root
      - name: mntdatas3fs-init
        hostPath:
          type: DirectoryOrCreate
          path: /mnt
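
A quick sanity check on the node (assuming the /mnt/data-s3-fs/root host path above) to tell a healthy mount from a stale one:

# A live s3fs mount shows up in the mount table...
mount | grep s3fs

# ...and the mount point is readable. A stale mount typically still appears in
# the table but fails here with "Transport endpoint is not connected".
ls /mnt/data-s3-fs/root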

gaul (Contributor) commented Apr 26, 2021

Transport endpoint is not connected

Usually means that s3fs exited unexpectedly. I would check whether the process is still running. If not, it would help to gather the logs or attach gdb before the crash to get a backtrace.
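
A sketch of what that could look like on the node; the s3fs options are illustrative (credentials and the exact mount options this image uses are omitted):

# Check whether the s3fs process is still running
pgrep -a s3fs

# Re-run s3fs in the foreground with verbose logging to capture why it exits
# (-f keeps it in the foreground, dbglevel/curldbg raise the log verbosity)
s3fs <bucket> /mnt/data-s3-fs/root -f -o dbglevel=info -o curldbg

# Or attach gdb to the running process and grab a backtrace after the crash
gdb -p "$(pgrep s3fs)"
(gdb) continue
(gdb) bt   # once it crashes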

guysoft (Contributor, Author) commented Apr 26, 2021

I saw no logs, neither in kubectl describe nor in the pod logs. It might be traceable with gdb, but I found no way to reproduce this other than waiting for it to happen, so I am not sure. It's also hard to keep this kind of process under a tracer right up until it crashes, because I don't know how to trigger the crash.

fenwuyaoji commented

I think most crashes are caused by resource contention, for example the node running low on CPU or memory. Something like the following should be added to the yaml:

resources:
  limits:
    cpu: "2"
    memory: 8Gi
  requests:
    cpu: "1"
    memory: 4Gi
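
In the DaemonSet posted above this would sit under the s3fuse container; the values are the ones suggested here, untested:

      containers:
      - name: s3fuse
        # ... image, lifecycle, securityContext, env as in the manifest above ...
        resources:
          requests:
            cpu: "1"
            memory: 4Gi
          limits:
            cpu: "2"
            memory: 8Gi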
