
rebase: update minikube to latest version #1811

Merged: 3 commits merged into ceph:master on Feb 10, 2021

Conversation

@Madhu-1 (Collaborator) commented Dec 21, 2020

minikube 1.17.1 has been released; update minikube to the latest available version.

Signed-off-by: Madhu Rajanna [email protected]

@Madhu-1 added the labels rebase (update the version of an external component) and component/testing (Additional test cases or CI work) on Dec 21, 2020
build.env Outdated
@@ -36,7 +36,7 @@ SNAPSHOT_VERSION=v3.0.1
HELM_VERSION=v3.1.2

# minikube settings
-MINIKUBE_VERSION=v1.14.1
+MINIKUBE_VERSION=v1.16.0
A reviewer asked:

why not use latest?

@Madhu-1 (Collaborator, Author) replied:

@obnoxxx Currently we stick to a release that has already been tested in the CI. If there is a regression in the latest available release, we might end up with CI issues that block merging of PRs.
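
For reference, a pinned MINIKUBE_VERSION such as the one in build.env above is typically consumed by a download step along the following lines. This is only a sketch; the actual install script is not shown in this thread, and only the release URL pattern is taken from minikube's official releases.

    # Sketch only: fetch the pinned minikube release (assumed install step).
    MINIKUBE_VERSION=v1.16.0
    curl -Lo minikube \
        "https://storage.googleapis.com/minikube/releases/${MINIKUBE_VERSION}/minikube-linux-amd64"
    chmod +x minikube && sudo mv minikube /usr/local/bin/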

@Madhu-1 (Collaborator, Author) commented Dec 23, 2020

/retest all

@Madhu-1 (Collaborator, Author) commented Dec 23, 2020

/test ci/centos/mini-e2e-helm

@Madhu-1 (Collaborator, Author) commented Dec 23, 2020

@Mergifyio rebase

mergify bot (Contributor) commented Dec 23, 2020

Command rebase: success

Branch has been successfully rebased

@nixpanic (Member) commented Jan 5, 2021

@Mergifyio rebase

The logs of the CI jobs have been removed, we will need new logs in order to fix issues.

mergify bot (Contributor) commented Jan 5, 2021

Command rebase: success

Branch has been successfully rebased

@nixpanic (Member) commented Jan 5, 2021

/test ci/centos/mini-e2e-helm

@Madhu-1 (Collaborator, Author) commented Jan 19, 2021

@Mergifyio rebase

The logs of the CI jobs have been removed, we will need new logs in order to fix issues.

mergify bot (Contributor) commented Jan 19, 2021

Command rebase: success

Branch has been successfully rebased

@nixpanic (Member) commented Feb 1, 2021

https://github.com/kubernetes/minikube/tree/v1.17.1 has been released and includes a fix for #1840

@Madhu-1 (Collaborator, Author) commented Feb 1, 2021

/test ci/centos/mini-e2e-helm

@Madhu-1 (Collaborator, Author) commented Feb 1, 2021

/test ci/centos/mini-e2e

@Madhu-1 (Collaborator, Author) commented Feb 1, 2021

/test ci/centos/upgrade-tests-cephfs

nixpanic previously approved these changes Feb 1, 2021
@mergify bot dismissed nixpanic’s stale review February 1, 2021 08:20

Pull request has been modified.

nixpanic previously approved these changes Feb 1, 2021
@nixpanic (Member) commented Feb 1, 2021

The e2e tests seem to fail consistently with the following error:

Feb  1 09:40:18.655: INFO: Waiting up to 10m0s for all daemonsets in namespace 'cephcsi-e2e-3b6671624651' to start
Feb  1 09:40:18.658: INFO: 1 / 1 pods ready in namespace 'cephcsi-e2e-3b6671624651' in daemonset 'csi-cephfsplugin' (0 seconds elapsed)
STEP: check static PVC
Feb  1 09:40:18.665: INFO: ExecWithOptions {Command:[/bin/sh -c ceph fsid] Namespace:rook-ceph PodName:rook-ceph-tools-5455675849-qxgb2 ContainerName:rook-ceph-tools Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Feb  1 09:40:18.665: INFO: >>> kubeConfig: /root/.kube/config
Feb  1 09:40:21.320: INFO: ExecWithOptions {Command:[/bin/sh -c ceph fs subvolumegroup create myfs testGroup] Namespace:rook-ceph PodName:rook-ceph-tools-5455675849-qxgb2 ContainerName:rook-ceph-tools Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Feb  1 09:40:21.320: INFO: >>> kubeConfig: /root/.kube/config
Feb  1 09:53:13.588: INFO: stdErr occurred: Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 165, in get_fs_handle
    conn.connect()
  File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 88, in connect
    self.fs.mount(filesystem_name=self.fs_name.encode('utf-8'))
  File "cephfs.pyx", line 739, in cephfs.LibCephFS.mount
cephfs.Error: error calling ceph_mount: Connection timed out [Errno 110]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1177, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 426, in handle_command
    return handler(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 34, in wrap
    return f(self, inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 452, in _cmd_fs_subvolumegroup_create
    uid=cmd.get('uid', None), gid=cmd.get('gid', None))
  File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 480, in create_subvolume_group
    with open_volume(self, volname) as fs_handle:
  File "/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 316, in open_volume
    fs_handle = vc.connection_pool.get_fs_handle(volname)
  File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 171, in get_fs_handle
    raise VolumeException(-e.args[0], e.args[1])
TypeError: bad operand type for unary -: 'str'


Feb  1 09:53:13.588: FAIL: failed to validate CephFS static pv with error command terminated with exit code 22

Possibly minikube has tightened its network policy and the node-plugin can not connect to the MDS anymore? During e2e testing we use two different namespaces, one for the Ceph cluster and one for the ceph-csi services under test. Maybe that is problematic...

@Madhu-1 (Collaborator, Author) commented Feb 4, 2021

@nixpanic something has changed in minikube 1.17.1; I am not able to run ceph fs commands from the toolbox pod.

@nixpanic (Member) commented Feb 9, 2021

@nixpanic something has changed in minikube 1.17.1; I am not able to run ceph fs commands from the toolbox pod.

The toolbox pod can access the CephFS MDS, I think. Commands like ceph fs status work just fine.
However, when trying to create a subvolume group, the ceph command needs to talk to the CephMgr. Connecting to the CephMgr works as well; the logs of the CephMgr pod show that the command is received.

debug 2021-02-05T08:46:23.577+0000 7fc29d02b700  0 log_channel(audit) log [DBG] : from='client.11024 -' entity='client.admin' cmd=[{"prefix": "fs subvolumegroup create", "vol_name": "myfs", "target": ["mgr", ""], "group_name": "testGroup"}]: dispatch

However, it seems to run into a timeout:

debug 2021-02-05T08:50:14.554+0000 7fc29c82a700  0 [volumes ERROR volumes.module] Failed _cmd_fs_subvolumegroup_create(group_name:testGroup, prefix:fs subvolumegroup create, target:['mgr', ''], vol_name:myfs) < "":
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 165, in get_fs_handle
    conn.connect()
  File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 88, in connect
    self.fs.mount(filesystem_name=self.fs_name.encode('utf-8'))
  File "cephfs.pyx", line 739, in cephfs.LibCephFS.mount
cephfs.Error: error calling ceph_mount: Connection timed out [Errno 110]

Testing communication between the toolbox, MDS and MGR pods does not show any restrictions. I installed nc inside the running containers and ran it like this:

[root@rook-ceph-mgr-a-799d4886bd-4gmcs /]# nc -l -p 6902 --verbose
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Listening on :::6902
Ncat: Listening on 0.0.0.0:6902
Ncat: Connection from 172.17.0.15.
Ncat: Connection from 172.17.0.15:49760.
ping
[root@rook-ceph-mds-myfs-a-f474f4b68-c8xxh /]# nc --verbose 172.17.0.10 6902 <<< "ping"
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 172.17.0.10:6902.
Ncat: 5 bytes sent, 0 bytes received in 0.03 seconds.

I have not been able to identify the issue when running ceph fs subvolumegroup create myfs testGroup, but using the --cni=bridge parameter when starting minikube makes things work.
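
For illustration, selecting the bridge CNI when starting minikube looks roughly like this; the driver, memory and CPU values below are placeholders, not the CI's actual settings.

    # Hedged example: force the "bridge" CNI instead of minikube's automatic choice.
    minikube start --driver=virtualbox --cni=bridge --memory=4096 --cpus="$(nproc)"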

@Madhu-1 (Collaborator, Author) commented Feb 9, 2021

ci/centos/mini-e2e/k8s-1.20 failed with:

Feb  9 10:20:21.798: INFO: csi-cephfs-demo-pod app  to be deleted (600 seconds elapsed)
Feb  9 10:20:21.798: FAIL: failed to validate CephFS pvc and application  binding with error timed out waiting for the condition

Not sure what the cause is; see the CI logs for details.

Analysis

I0209 10:10:23.315530       1 utils.go:132] ID: 42 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0209 10:10:23.315688       1 utils.go:133] ID: 42 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 GRPC request: {"target_path":"/var/lib/kubelet/pods/9d150e08-6b25-478d-96a6-03bfd96f907a/volumes/kubernetes.io~csi/pvc-2a3650cd-8454-40e8-ae4b-31a7f6034dec/mount","volume_id":"0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3"}
I0209 10:14:06.241861       1 cephcmds.go:59] ID: 42 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 command succeeded: umount [/var/lib/kubelet/pods/9d150e08-6b25-478d-96a6-03bfd96f907a/volumes/kubernetes.io~csi/pvc-2a3650cd-8454-40e8-ae4b-31a7f6034dec/mount]
I0209 10:14:06.242222       1 nodeserver.go:277] ID: 42 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 cephfs: successfully unbinded volume 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 from /var/lib/kubelet/pods/9d150e08-6b25-478d-96a6-03bfd96f907a/volumes/kubernetes.io~csi/pvc-2a3650cd-8454-40e8-ae4b-31a7f6034dec/mount
I0209 10:14:06.242277       1 utils.go:138] ID: 42 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 GRPC response: {}

It took around 4 minutes to umount the targetPath; after that, the next NodeUnpublishVolume call fails with the error below:

I0209 10:14:31.025118       1 utils.go:132] ID: 56 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0209 10:14:31.025255       1 utils.go:133] ID: 56 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 GRPC request: {"target_path":"/var/lib/kubelet/pods/9d150e08-6b25-478d-96a6-03bfd96f907a/volumes/kubernetes.io~csi/pvc-2a3650cd-8454-40e8-ae4b-31a7f6034dec/mount","volume_id":"0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3"}
I0209 10:14:31.031176       1 cephcmds.go:53] ID: 56 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 an error (exit status 32) occurred while running umount args: [/var/lib/kubelet/pods/9d150e08-6b25-478d-96a6-03bfd96f907a/volumes/kubernetes.io~csi/pvc-2a3650cd-8454-40e8-ae4b-31a7f6034dec/mount]
E0209 10:14:31.031215       1 utils.go:136] ID: 56 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 GRPC error: rpc error: code = Internal desc = an error (exit status 32) occurred while running umount args: [/var/lib/kubelet/pods/9d150e08-6b25-478d-96a6-03bfd96f907a/volumes/kubernetes.io~csi/pvc-2a3650cd-8454-40e8-ae4b-31a7f6034dec/mount]

As the umount is already done, cephcsi should return success, not an Internal error.
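
To illustrate the idempotency expected here, a shell-level sketch (explicitly not the ceph-csi implementation, which is written in Go) could check whether the target is still a mountpoint before unmounting, so that a repeated NodeUnpublishVolume does not turn umount's exit status 32 ("not mounted") into an Internal error:

    # Hypothetical sketch: make unmounting idempotent by treating an
    # already-unmounted target path as success.
    unmount_if_mounted() {
        local target="$1"
        if ! mountpoint -q "${target}"; then
            return 0    # nothing mounted, an earlier umount already succeeded
        fi
        umount "${target}"
    }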

@nixpanic (Member) commented Feb 9, 2021

ci/centos/mini-e2e/k8s-1.20 failed with a timeout in the test suite again:

Feb  9 11:52:47.562: INFO: rbd-32742 app  is in Pending phase expected to be in Running  state (8 seconds elapsed)

panic: test timed out after 1h0m0s

Will increase the timeout to 90 minutes.
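
The "panic: test timed out after 1h0m0s" message comes from the Go testing framework when the -timeout passed to go test is exceeded. Assuming the e2e suite is invoked through go test (the exact CI job definition is not shown in this thread), the change would be along these lines:

    # Sketch only: raise the e2e test timeout from 60 to 90 minutes.
    go test -timeout 90m -v ./e2e/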

@mergify mergify bot dismissed stale reviews from nixpanic and humblec February 9, 2021 12:05

Pull request has been modified.

@nixpanic (Member) commented Feb 9, 2021

ci/centos/mini-e2e-helm/k8s-1.20 failed with an unexpected error:

Feb  9 13:15:34.629: INFO: Deleting PersistentVolumeClaim csi-cephfs-pvc on namespace cephfs-3274
Feb  9 13:15:34.645: INFO: waiting for PVC csi-cephfs-pvc in state &PersistentVolumeClaimStatus{Phase:Bound,AccessModes:[ReadWriteMany],Capacity:ResourceList{storage: {{1073741824 0} {<nil>} 1Gi BinarySI},},Conditions:[]PersistentVolumeClaimCondition{},} to be deleted (0 seconds elapsed)
Feb  9 13:15:36.684: INFO: waiting for PVC csi-cephfs-pvc in state &PersistentVolumeClaimStatus{Phase:Bound,AccessModes:[ReadWriteMany],Capacity:ResourceList{storage: {{1073741824 0} {<nil>} 1Gi BinarySI},},Conditions:[]PersistentVolumeClaimCondition{},} to be deleted (2 seconds elapsed)
[AfterEach] cephfs
  /go/src/github.com/ceph/ceph-csi/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:175
Feb  9 13:15:36.690: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready

See the CI logs for details.

@nixpanic (Member) commented Feb 9, 2021

/retest ci/centos/mini-e2e-helm/k8s-1.20

@nixpanic (Member) commented Feb 9, 2021

/retest ci/centos/mini-e2e/k8s-1.20

@nixpanic (Member) commented Feb 9, 2021

/retest ci/centos/mini-e2e/k8s-1.20

Failed due to #1795

@nixpanic (Member) commented Feb 9, 2021

/retest ci/centos/mini-e2e-helm/k8s-1.20

@nixpanic (Member) commented Feb 9, 2021

/retest ci/centos/mini-e2e-helm/k8s-1.20

Resizing a CephFS PVC failed:

Feb  9 15:42:41.257: FAIL: failed to resize PVC with error timed out waiting for the condition
...
Feb  9 15:42:41.271: INFO: At 2021-02-09 15:21:29 +0000 GMT - event for cephfs-32740: {kubelet minikube} FailedMount: MountVolume.MountDevice failed for volume "pvc-75f9da6d-8b2e-431c-a5a0-3a191339f8fe" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Feb  9 15:42:41.271: INFO: At 2021-02-09 15:21:30 +0000 GMT - event for cephfs-32740: {kubelet minikube} FailedMount: MountVolume.MountDevice failed for volume "pvc-75f9da6d-8b2e-431c-a5a0-3a191339f8fe" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0024-f047ded8-7850-46dd-abc2-87233fa3069d-0000000000000001-30e2817d-6aea-11eb-a6c4-863f529beb9b already exists
...

Maybe the provisioner got hung or something, but I did not immediately see it in the logs.

@Madhu-1 (Collaborator, Author) commented Feb 10, 2021

/retest ci/centos/mini-e2e-helm/k8s-1.20

@Madhu-1 (Collaborator, Author) commented Feb 10, 2021

/test ci/centos/mini-e2e-helm/k8s-1.20

@Madhu-1 (Collaborator, Author) commented Feb 10, 2021

@nixpanic it looks like merging this PR can affect the CephFS e2e testing. Looking at multiple logs, I feel CephFS is now taking a lot of time for each operation. Merging the PR will make the CI flaky.

@Madhu-1 (Collaborator, Author) commented Feb 10, 2021

@Mergifyio rebase

mergify bot (Contributor) commented Feb 10, 2021

Sorry but I didn't understand the command.

@Madhu-1 (Collaborator, Author) commented Feb 10, 2021

@Mergifyio rebase

Madhu-1 and others added 3 commits February 10, 2021 06:38
minikube 1.17.1 has been released; update minikube
to the latest available version.

Signed-off-by: Madhu Rajanna <[email protected]>
It seems that recent minikube versions changed something in the
networking, and that prevents

    $ ceph fs subvolumegroup create myfs testGroup

from working. Strangely RBD is not impacted. Possibly something is
confusing the CephMgr pod that handles the CephFS admin commands.

Using the "bridge" CNI seems to help; CephFS admin commands work with
this in minikube.

Signed-off-by: Niels de Vos <[email protected]>
Sometimes testing takes more than 60 minutes. When that is the case, the
60 minute timeout causes a golang panic in the test suite.

Signed-off-by: Niels de Vos <[email protected]>
mergify bot (Contributor) commented Feb 10, 2021

Command rebase: success

Branch has been successfully rebased

@@ -162,6 +162,7 @@ CONTAINER_CMD=${CONTAINER_CMD:-"docker"}
MEMORY=${MEMORY:-"4096"}
CPUS=${CPUS:-"$(nproc)"}
VM_DRIVER=${VM_DRIVER:-"virtualbox"}
CNI=${CNI:-"bridge"}
A Collaborator commented:

@nixpanic what's the default CNI? If it's not bridge, maybe the change of the CNI to bridge is causing the E2E to take a lot of time to complete; in other words, we are hitting some performance issue with the bridge CNI.

A Member replied:

The default CNI is "auto"... There seems to be some logic in minikube somewhere that decides what CNI to use (maybe dependent on the Kubernetes version and hypervisor?).

It is possible that the minikube VM needs more resources with this rebase. Those are settings in the ci/centos branch, so we could increase those, depending on the current values and bare metal systems in the CI.
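
If more resources do turn out to be needed, the variables shown in the diff above can simply be overridden from the environment when the minikube helper script is invoked. The script path and its "up" action are assumptions here; only the variable names come from the diff:

    # Hedged example: give the minikube VM more memory and CPUs than the
    # defaults (MEMORY=4096, CPUS=$(nproc)) without editing the script.
    MEMORY=8192 CPUS=4 VM_DRIVER=virtualbox ./scripts/minikube.sh up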

@nixpanic (Member) commented:

/retest ci/centos/mini-e2e-helm/k8s-1.20

@nixpanic (Member) commented:

/retest ci/centos/mini-e2e-helm/k8s-1.20

Failed to delete a volume; the PVC that needs to be deleted seems to be gone already. Looks like a bug in the test case:

Feb 10 07:28:36.559: INFO: waiting for PVC rbd-13182 in state &PersistentVolumeClaimStatus{Phase:,AccessModes:[],Capacity:ResourceList{},Conditions:[]PersistentVolumeClaimCondition{},} to be deleted (600 seconds elapsed)
Feb 10 07:28:36.564: INFO: failed to delete PVC and application (rbd-13182): timed out waiting for the condition
Feb 10 07:28:36.564: FAIL: deleting PVCs and applications failed, 1 errors were logged

@Madhu-1 added the ready-to-merge label (This PR is ready to be merged and it doesn't need second review; backports only) on Feb 10, 2021
@Madhu-1 (Collaborator, Author) left a comment:

LGTM

mergify bot merged commit 6256be0 into ceph:master on Feb 10, 2021
Labels
component/testing (Additional test cases or CI work)
ready-to-merge (This PR is ready to be merged and it doesn't need second review; backports only)
rebase (update the version of an external component)
4 participants