
Crash boot loop because of "Transport endpoint is not connected", umount -l #10

Open

guysoft opened this issue Mar 16, 2021 · 7 comments

guysoft (Contributor) commented Mar 16, 2021

Hey,
Sometimes I get an issue that puts the pod into CrashLoopBackOff.
The workaround is to ssh to the node and run umount -l (lazy unmount) on the mount point, then delete the pod and let it get recreated (sketch below).
During that time the mount is down.

Debugging results:

  • Trying to unmount returns "device or resource busy"
  • kubectl describe on the pod shows "Failed to create directory": the mount directory already exists
  • When I ran lsof (which had to be installed with yum), I got the error "Transport endpoint is not connected"
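
For reference, a sketch of the manual recovery steps, assuming the host-side mount point (/mnt/data-s3-fs/root) from the DaemonSet posted later in this thread; adjust paths and names to your setup:

# On the affected node: lazily detach the stale FUSE mount.
# umount -l detaches it immediately and cleans up once it is no longer busy.
sudo umount -l /mnt/data-s3-fs/root

# Then delete the crash-looping pod so the DaemonSet controller recreates it
# and s3fs mounts the bucket again.
kubectl delete pod <s3-provider-pod-name>
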
freegroup (Owner) commented

I already unmount in a "preStop" hook. Does your deployment contain this step as well?

          preStop:
            exec:
              command: ["/bin/sh","-c","umount -f /var/s3"]

guysoft (Contributor, Author) commented Mar 16, 2021

Yes, it looks like this here:

          preStop:
            exec:
              command: ["bash", "-c", "umount -f /srv/website-eu/root"]

BTW, I mount a sub-folder, because that way the pods don't lose the transport connection when you re-mount (the re-mount replaces the sub-folder, not the root path the pods map to).

I guess it should try something like

["bash", "-c", "umount -f /srv/website-eu/root || umount -l /srv/website-eu/root"]

I am not sure if that syntax works. I can test it unless you have a better suggestion.
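
For what it's worth, || is valid shell syntax here: the lazy unmount only runs if the forced one fails. A sketch of the preStop hook with that fallback, using the /srv/website-eu/root mount point from above:

          preStop:
            exec:
              # Force-unmount first; if that fails (mount busy or already
              # disconnected), fall back to a lazy unmount so the pod can stop cleanly.
              command: ["bash", "-c", "umount -f /srv/website-eu/root || umount -l /srv/website-eu/root"]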

guysoft (Contributor, Author) commented Apr 26, 2021

Ok, I think I solved it.
It seems like sometimes the pod loses its connection to the mount, resulting in "Transport endpoint is not connected".
The workaround I found is to add an init container that lazily unmounts the folder before the main container starts. That seems to fix the issue. Will let it run and see if it comes back:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: s3-provider
  name: s3-provider
spec:
  selector:
    matchLabels:
      app: s3-provider
  template:
    metadata:
      labels:
        app: s3-provider
    spec:
      initContainers:
      - name: init-myservice
        image: bash
        command: ['bash', '-c', 'umount -l /mnt/data-s3-fs/root ; true']
        securityContext:
          privileged: true
          capabilities:
            add:
            - SYS_ADMIN
        # use ALL  entries in the config map as environment variables
        envFrom:
        - configMapRef:
            name: s3-config
        volumeMounts:
        - name: devfuse
          mountPath: /dev/fuse
        - name: mntdatas3fs-init
          mountPath: /mnt:shared
      containers:
      - name: s3fuse
        image: 963341077747.dkr.ecr.us-east-1.amazonaws.com/kube-s3:1.0
        imagePullPolicy: Always
        lifecycle:
          preStop:
            exec:
              command: ["bash", "-c", "umount -f /srv/s3-mount/root"]
        securityContext:
          privileged: true
          capabilities:
            add:
            - SYS_ADMIN
        # use ALL  entries in the config map as environment variables
        envFrom:
        - configMapRef:
            name: s3-config
        env:
        - name: S3_BUCKET
          value: s3-mount
        - name: MNT_POINT
          value: /srv/s3-mount/root
        - name: IAM_ROLE
          value: none
        volumeMounts:
        - name: devfuse
          mountPath: /dev/fuse
        - name: mntdatas3fs
          mountPath: /srv/s3-mount/root:shared
      volumes:
      - name: devfuse
        hostPath:
          path: /dev/fuse
      - name: mntdatas3fs
        hostPath:
          type: DirectoryOrCreate
          path: /mnt/data-s3-fs/root
      - name: mntdatas3fs-init
        hostPath:
          type: DirectoryOrCreate
          path: /mnt
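
A quick sanity check on the node (assuming the /mnt/data-s3-fs/root host path above) to tell a healthy mount from a stale one:

# A live s3fs mount shows up in the mount table...
mount | grep s3fs

# ...and the mount point is readable. A stale mount typically still appears in
# the table but fails here with "Transport endpoint is not connected".
ls /mnt/data-s3-fs/root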

gaul (Contributor) commented Apr 26, 2021

Transport endpoint is not connected

Usually means that s3fs exited unexpectedly. I would check whether the process is still running. If not, it would help to gather the logs or attach gdb before the crash to get a backtrace.
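
A sketch of what that could look like on the node; the s3fs options are illustrative (credentials and the exact mount options this image uses are omitted):

# Check whether the s3fs process is still running
pgrep -a s3fs

# Re-run s3fs in the foreground with verbose logging to capture why it exits
# (-f keeps it in the foreground, dbglevel/curldbg raise the log verbosity)
s3fs <bucket> /mnt/data-s3-fs/root -f -o dbglevel=info -o curldbg

# Or attach gdb to the running process and grab a backtrace after the crash
gdb -p "$(pgrep s3fs)"
(gdb) continue
(gdb) bt   # once it crashes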

guysoft (Contributor, Author) commented Apr 26, 2021

I saw no logs, neither in kubectl describe nor in the pod logs. It might be traceable with gdb, but I found no way to reproduce this other than waiting for it to happen, so I am not sure. It's also hard to keep this kind of process under a tracer right up until it crashes, because I don't know how to trigger the crash.

fenwuyaoji commented

I think most crashes are caused by resource contention, for example the node running low on CPU or memory. Something like the following should be added to the yaml:

resources:
  limits:
    cpu: "2"
    memory: 8Gi
  requests:
    cpu: "1"
    memory: 4Gi
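
In the DaemonSet posted above this would sit under the s3fuse container; the values are the ones suggested here, untested:

      containers:
      - name: s3fuse
        # ... image, lifecycle, securityContext, env as in the manifest above ...
        resources:
          requests:
            cpu: "1"
            memory: 4Gi
          limits:
            cpu: "2"
            memory: 8Gi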
