
rebase: update minikube to latest version #1811

Merged: 3 commits merged into ceph:master on Feb 10, 2021

Conversation

@Madhu-1 (Collaborator) commented Dec 21, 2020

minikube 1.17.1 has been released; update minikube to the latest available version.

Signed-off-by: Madhu Rajanna [email protected]

@Madhu-1 added the labels rebase (update the version of an external component) and component/testing (Additional test cases or CI work) on Dec 21, 2020
build.env Outdated
@@ -36,7 +36,7 @@ SNAPSHOT_VERSION=v3.0.1
HELM_VERSION=v3.1.2

# minikube settings
-MINIKUBE_VERSION=v1.14.1
+MINIKUBE_VERSION=v1.16.0
A reviewer asked:

why not use latest?

@Madhu-1 (Collaborator, Author) replied:

@obnoxxx Currently we stick to a release that has already been tested in the CI. If there is a regression in the latest available release, we might end up with CI issues that block merging of PRs.
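
For reference, a pinned MINIKUBE_VERSION such as the one in build.env above is typically consumed by a download step along the following lines. This is only a sketch; the actual install script is not shown in this thread, and only the release URL pattern is taken from minikube's official releases.

    # Sketch only: fetch the pinned minikube release (assumed install step).
    MINIKUBE_VERSION=v1.16.0
    curl -Lo minikube \
        "https://storage.googleapis.com/minikube/releases/${MINIKUBE_VERSION}/minikube-linux-amd64"
    chmod +x minikube && sudo mv minikube /usr/local/bin/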

@Madhu-1 (Collaborator, Author) commented Dec 23, 2020

/retest all

@Madhu-1 (Collaborator, Author) commented Dec 23, 2020

/test ci/centos/mini-e2e-helm

@Madhu-1 (Collaborator, Author) commented Dec 23, 2020

@Mergifyio rebase

mergify bot (Contributor) commented Dec 23, 2020

Command rebase: success

Branch has been successfully rebased

@nixpanic (Member) commented Jan 5, 2021

@Mergifyio rebase

The logs of the CI jobs have been removed, we will need new logs in order to fix issues.

mergify bot (Contributor) commented Jan 5, 2021

Command rebase: success

Branch has been successfully rebased

@nixpanic (Member) commented Jan 5, 2021

/test ci/centos/mini-e2e-helm

@Madhu-1 (Collaborator, Author) commented Jan 19, 2021

@Mergifyio rebase

The logs of the CI jobs have been removed, we will need new logs in order to fix issues.

mergify bot (Contributor) commented Jan 19, 2021

Command rebase: success

Branch has been successfully rebased

@nixpanic (Member) commented Feb 1, 2021

https://github.com/kubernetes/minikube/tree/v1.17.1 has been released and includes a fix for #1840

@Madhu-1 (Collaborator, Author) commented Feb 1, 2021

/test ci/centos/mini-e2e-helm

@Madhu-1 (Collaborator, Author) commented Feb 1, 2021

/test ci/centos/mini-e2e

@Madhu-1 (Collaborator, Author) commented Feb 1, 2021

/test ci/centos/upgrade-tests-cephfs

nixpanic previously approved these changes Feb 1, 2021
@mergify bot dismissed nixpanic’s stale review February 1, 2021 08:20

Pull request has been modified.

nixpanic previously approved these changes Feb 1, 2021
@nixpanic (Member) commented Feb 1, 2021

The e2e tests seem to fail consistently with the following error:

Feb  1 09:40:18.655: INFO: Waiting up to 10m0s for all daemonsets in namespace 'cephcsi-e2e-3b6671624651' to start
Feb  1 09:40:18.658: INFO: 1 / 1 pods ready in namespace 'cephcsi-e2e-3b6671624651' in daemonset 'csi-cephfsplugin' (0 seconds elapsed)
STEP: check static PVC
Feb  1 09:40:18.665: INFO: ExecWithOptions {Command:[/bin/sh -c ceph fsid] Namespace:rook-ceph PodName:rook-ceph-tools-5455675849-qxgb2 ContainerName:rook-ceph-tools Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Feb  1 09:40:18.665: INFO: >>> kubeConfig: /root/.kube/config
Feb  1 09:40:21.320: INFO: ExecWithOptions {Command:[/bin/sh -c ceph fs subvolumegroup create myfs testGroup] Namespace:rook-ceph PodName:rook-ceph-tools-5455675849-qxgb2 ContainerName:rook-ceph-tools Stdin:<nil> CaptureStdout:true CaptureStderr:true PreserveWhitespace:true Quiet:false}
Feb  1 09:40:21.320: INFO: >>> kubeConfig: /root/.kube/config
Feb  1 09:53:13.588: INFO: stdErr occurred: Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 165, in get_fs_handle
    conn.connect()
  File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 88, in connect
    self.fs.mount(filesystem_name=self.fs_name.encode('utf-8'))
  File "cephfs.pyx", line 739, in cephfs.LibCephFS.mount
cephfs.Error: error calling ceph_mount: Connection timed out [Errno 110]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1177, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 426, in handle_command
    return handler(inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 34, in wrap
    return f(self, inbuf, cmd)
  File "/usr/share/ceph/mgr/volumes/module.py", line 452, in _cmd_fs_subvolumegroup_create
    uid=cmd.get('uid', None), gid=cmd.get('gid', None))
  File "/usr/share/ceph/mgr/volumes/fs/volume.py", line 480, in create_subvolume_group
    with open_volume(self, volname) as fs_handle:
  File "/lib64/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 316, in open_volume
    fs_handle = vc.connection_pool.get_fs_handle(volname)
  File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 171, in get_fs_handle
    raise VolumeException(-e.args[0], e.args[1])
TypeError: bad operand type for unary -: 'str'


Feb  1 09:53:13.588: FAIL: failed to validate CephFS static pv with error command terminated with exit code 22

Possibly minikube has tightened its network policy and the node-plugin can not connect to the MDS anymore? During e2e testing we use two different namespaces, one for the Ceph cluster and one for the ceph-csi services under test. Maybe that is problematic...

@Madhu-1 (Collaborator, Author) commented Feb 4, 2021

@nixpanic something has changed in minikube 1.17.1; I am not able to run ceph fs commands from the toolbox pod.

@nixpanic (Member) commented Feb 9, 2021

@nixpanic something has changed in minikube 1.17.1; I am not able to run ceph fs commands from the toolbox pod.

The toolbox pod can access the CephFS MDS, I think. Commands like ceph fs status work just fine.
However, when trying to create a subvolume group, the ceph command needs to talk to the CephMgr. Connecting to the CephMgr works as well; the logs of the CephMgr pod show that the command is received.

debug 2021-02-05T08:46:23.577+0000 7fc29d02b700  0 log_channel(audit) log [DBG] : from='client.11024 -' entity='client.admin' cmd=[{"prefix": "fs subvolumegroup create", "vol_name": "myfs", "target": ["mgr", ""], "group_name": "testGroup"}]: dispatch

However, it seems to run into a timeout:

debug 2021-02-05T08:50:14.554+0000 7fc29c82a700  0 [volumes ERROR volumes.module] Failed _cmd_fs_subvolumegroup_create(group_name:testGroup, prefix:fs subvolumegroup create, target:['mgr', ''], vol_name:myfs) < "":
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 165, in get_fs_handle
    conn.connect()
  File "/usr/share/ceph/mgr/volumes/fs/operations/volume.py", line 88, in connect
    self.fs.mount(filesystem_name=self.fs_name.encode('utf-8'))
  File "cephfs.pyx", line 739, in cephfs.LibCephFS.mount
cephfs.Error: error calling ceph_mount: Connection timed out [Errno 110]

Testing communication between the toolbox, MDS and MGR pods does not show any restrictions. I installed nc inside the running containers and ran it like this:

[root@rook-ceph-mgr-a-799d4886bd-4gmcs /]# nc -l -p 6902 --verbose
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Listening on :::6902
Ncat: Listening on 0.0.0.0:6902
Ncat: Connection from 172.17.0.15.
Ncat: Connection from 172.17.0.15:49760.
ping
[root@rook-ceph-mds-myfs-a-f474f4b68-c8xxh /]# nc --verbose 172.17.0.10 6902 <<< "ping"
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 172.17.0.10:6902.
Ncat: 5 bytes sent, 0 bytes received in 0.03 seconds.

I have not been able to identify the issue when running ceph fs subvolumegroup create myfs testGroup, but using the --cni=bridge parameter when starting minikube makes things work.
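
For illustration, selecting the bridge CNI when starting minikube looks roughly like this; the driver, memory and CPU values below are placeholders, not the CI's actual settings.

    # Hedged example: force the "bridge" CNI instead of minikube's automatic choice.
    minikube start --driver=virtualbox --cni=bridge --memory=4096 --cpus="$(nproc)"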

@Madhu-1 (Collaborator, Author) commented Feb 9, 2021

ci/centos/mini-e2e/k8s-1.20 failed with:

Feb  9 10:20:21.798: INFO: csi-cephfs-demo-pod app  to be deleted (600 seconds elapsed)
Feb  9 10:20:21.798: FAIL: failed to validate CephFS pvc and application  binding with error timed out waiting for the condition

Not sure what the cause is; see the CI logs for details.

Analysis

I0209 10:10:23.315530       1 utils.go:132] ID: 42 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0209 10:10:23.315688       1 utils.go:133] ID: 42 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 GRPC request: {"target_path":"/var/lib/kubelet/pods/9d150e08-6b25-478d-96a6-03bfd96f907a/volumes/kubernetes.io~csi/pvc-2a3650cd-8454-40e8-ae4b-31a7f6034dec/mount","volume_id":"0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3"}
I0209 10:14:06.241861       1 cephcmds.go:59] ID: 42 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 command succeeded: umount [/var/lib/kubelet/pods/9d150e08-6b25-478d-96a6-03bfd96f907a/volumes/kubernetes.io~csi/pvc-2a3650cd-8454-40e8-ae4b-31a7f6034dec/mount]
I0209 10:14:06.242222       1 nodeserver.go:277] ID: 42 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 cephfs: successfully unbinded volume 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 from /var/lib/kubelet/pods/9d150e08-6b25-478d-96a6-03bfd96f907a/volumes/kubernetes.io~csi/pvc-2a3650cd-8454-40e8-ae4b-31a7f6034dec/mount
I0209 10:14:06.242277       1 utils.go:138] ID: 42 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 GRPC response: {}

It took around 4 minutes to umount the targetPath; after that, the next NodeUnpublishVolume call fails with the error below:

I0209 10:14:31.025118       1 utils.go:132] ID: 56 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 GRPC call: /csi.v1.Node/NodeUnpublishVolume
I0209 10:14:31.025255       1 utils.go:133] ID: 56 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 GRPC request: {"target_path":"/var/lib/kubelet/pods/9d150e08-6b25-478d-96a6-03bfd96f907a/volumes/kubernetes.io~csi/pvc-2a3650cd-8454-40e8-ae4b-31a7f6034dec/mount","volume_id":"0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3"}
I0209 10:14:31.031176       1 cephcmds.go:53] ID: 56 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 an error (exit status 32) occurred while running umount args: [/var/lib/kubelet/pods/9d150e08-6b25-478d-96a6-03bfd96f907a/volumes/kubernetes.io~csi/pvc-2a3650cd-8454-40e8-ae4b-31a7f6034dec/mount]
E0209 10:14:31.031215       1 utils.go:136] ID: 56 Req-ID: 0001-0024-c3feb7b7-cf6b-4154-b66f-fb09df8d20b7-0000000000000001-fc81f537-6abe-11eb-9ad3-e6ef196d01c3 GRPC error: rpc error: code = Internal desc = an error (exit status 32) occurred while running umount args: [/var/lib/kubelet/pods/9d150e08-6b25-478d-96a6-03bfd96f907a/volumes/kubernetes.io~csi/pvc-2a3650cd-8454-40e8-ae4b-31a7f6034dec/mount]

As the umount is already done, cephcsi should return success, not an Internal error.
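
To illustrate the idempotency expected here, a shell-level sketch (explicitly not the ceph-csi implementation, which is written in Go) could check whether the target is still a mountpoint before unmounting, so that a repeated NodeUnpublishVolume does not turn umount's exit status 32 ("not mounted") into an Internal error:

    # Hypothetical sketch: make unmounting idempotent by treating an
    # already-unmounted target path as success.
    unmount_if_mounted() {
        local target="$1"
        if ! mountpoint -q "${target}"; then
            return 0    # nothing mounted, an earlier umount already succeeded
        fi
        umount "${target}"
    }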

@nixpanic (Member) commented Feb 9, 2021

ci/centos/mini-e2e/k8s-1.20 failed with a timeout in the test suite again:

Feb  9 11:52:47.562: INFO: rbd-32742 app  is in Pending phase expected to be in Running  state (8 seconds elapsed)

panic: test timed out after 1h0m0s

Will increase the timeout to 90 minutes.
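
The "panic: test timed out after 1h0m0s" message comes from the Go testing framework when the -timeout passed to go test is exceeded. Assuming the e2e suite is invoked through go test (the exact CI job definition is not shown in this thread), the change would be along these lines:

    # Sketch only: raise the e2e test timeout from 60 to 90 minutes.
    go test -timeout 90m -v ./e2e/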

@mergify mergify bot dismissed stale reviews from nixpanic and humblec February 9, 2021 12:05

Pull request has been modified.

@nixpanic (Member) commented Feb 9, 2021

ci/centos/mini-e2e-helm/k8s-1.20 failed with an unexpected error:

Feb  9 13:15:34.629: INFO: Deleting PersistentVolumeClaim csi-cephfs-pvc on namespace cephfs-3274
Feb  9 13:15:34.645: INFO: waiting for PVC csi-cephfs-pvc in state &PersistentVolumeClaimStatus{Phase:Bound,AccessModes:[ReadWriteMany],Capacity:ResourceList{storage: {{1073741824 0} {<nil>} 1Gi BinarySI},},Conditions:[]PersistentVolumeClaimCondition{},} to be deleted (0 seconds elapsed)
Feb  9 13:15:36.684: INFO: waiting for PVC csi-cephfs-pvc in state &PersistentVolumeClaimStatus{Phase:Bound,AccessModes:[ReadWriteMany],Capacity:ResourceList{storage: {{1073741824 0} {<nil>} 1Gi BinarySI},},Conditions:[]PersistentVolumeClaimCondition{},} to be deleted (2 seconds elapsed)
[AfterEach] cephfs
  /go/src/github.com/ceph/ceph-csi/vendor/k8s.io/kubernetes/test/e2e/framework/framework.go:175
Feb  9 13:15:36.690: INFO: Waiting up to 3m0s for all (but 0) nodes to be ready

See the CI logs for details.

@nixpanic (Member) commented Feb 9, 2021

/retest ci/centos/mini-e2e-helm/k8s-1.20

@nixpanic (Member) commented Feb 9, 2021

/retest ci/centos/mini-e2e/k8s-1.20

@nixpanic (Member) commented Feb 9, 2021

/retest ci/centos/mini-e2e/k8s-1.20

Failed due to #1795

@nixpanic (Member) commented Feb 9, 2021

/retest ci/centos/mini-e2e-helm/k8s-1.20

@nixpanic (Member) commented Feb 9, 2021

/retest ci/centos/mini-e2e-helm/k8s-1.20

Resizing a CephFS PVC failed:

Feb  9 15:42:41.257: FAIL: failed to resize PVC with error timed out waiting for the condition
...
Feb  9 15:42:41.271: INFO: At 2021-02-09 15:21:29 +0000 GMT - event for cephfs-32740: {kubelet minikube} FailedMount: MountVolume.MountDevice failed for volume "pvc-75f9da6d-8b2e-431c-a5a0-3a191339f8fe" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
Feb  9 15:42:41.271: INFO: At 2021-02-09 15:21:30 +0000 GMT - event for cephfs-32740: {kubelet minikube} FailedMount: MountVolume.MountDevice failed for volume "pvc-75f9da6d-8b2e-431c-a5a0-3a191339f8fe" : rpc error: code = Aborted desc = an operation with the given Volume ID 0001-0024-f047ded8-7850-46dd-abc2-87233fa3069d-0000000000000001-30e2817d-6aea-11eb-a6c4-863f529beb9b already exists
...

Maybe the provisioner got hung or something, but I did not immediately see it in the logs.

@Madhu-1 (Collaborator, Author) commented Feb 10, 2021

/retest ci/centos/mini-e2e-helm/k8s-1.20

@Madhu-1 (Collaborator, Author) commented Feb 10, 2021

/test ci/centos/mini-e2e-helm/k8s-1.20

@Madhu-1 (Collaborator, Author) commented Feb 10, 2021

@nixpanic it looks like merging this PR can affect the CephFS e2e testing. Looking at multiple logs, I feel CephFS is now taking a lot of time for each operation. Merging the PR will make the CI flaky.

@Madhu-1 (Collaborator, Author) commented Feb 10, 2021

@Mergifyio rebase

mergify bot (Contributor) commented Feb 10, 2021

Sorry but I didn't understand the command.

@Madhu-1 (Collaborator, Author) commented Feb 10, 2021

@Mergifyio rebase

Madhu-1 and others added 3 commits February 10, 2021 06:38
minikube 1.17.1 has been released; update minikube
to the latest available version.

Signed-off-by: Madhu Rajanna <[email protected]>
It seems that recent minikube versions changed something in the
networking, and that prevents

    $ ceph fs subvolumegroup create myfs testGroup

from working. Strangely RBD is not impacted. Possibly something is
confusing the CephMgr pod that handles the CephFS admin commands.

Using the "bridge" CNI seems to help; CephFS admin commands work with
this in minikube.

Signed-off-by: Niels de Vos <[email protected]>
Sometimes testing takes more than 60 minutes. When that is the case, the
60 minute timeout causes a golang panic in the test suite.

Signed-off-by: Niels de Vos <[email protected]>
mergify bot (Contributor) commented Feb 10, 2021

Command rebase: success

Branch has been successfully rebased

@@ -162,6 +162,7 @@ CONTAINER_CMD=${CONTAINER_CMD:-"docker"}
MEMORY=${MEMORY:-"4096"}
CPUS=${CPUS:-"$(nproc)"}
VM_DRIVER=${VM_DRIVER:-"virtualbox"}
CNI=${CNI:-"bridge"}
A Collaborator commented:

@nixpanic what's the default CNI? If it's not bridge, maybe the change of the CNI to bridge is causing the E2E to take a lot of time to complete; in other words, we are hitting some performance issue with the bridge CNI.

A Member replied:

The default CNI is "auto"... There seems to be some logic in minikube somewhere that decides what CNI to use (maybe dependent on the Kubernetes version and hypervisor?).

It is possible that the minikube VM needs more resources with this rebase. Those are settings in the ci/centos branch, so we could increase those, depending on the current values and bare metal systems in the CI.
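
If more resources do turn out to be needed, the variables shown in the diff above can simply be overridden from the environment when the minikube helper script is invoked. The script path and its "up" action are assumptions here; only the variable names come from the diff:

    # Hedged example: give the minikube VM more memory and CPUs than the
    # defaults (MEMORY=4096, CPUS=$(nproc)) without editing the script.
    MEMORY=8192 CPUS=4 VM_DRIVER=virtualbox ./scripts/minikube.sh up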

@nixpanic (Member) commented:

/retest ci/centos/mini-e2e-helm/k8s-1.20

@nixpanic (Member) commented:

/retest ci/centos/mini-e2e-helm/k8s-1.20

Failed to delete a volume; the PVC that needs to be deleted seems to be gone already. Looks like a bug in the test case:

Feb 10 07:28:36.559: INFO: waiting for PVC rbd-13182 in state &PersistentVolumeClaimStatus{Phase:,AccessModes:[],Capacity:ResourceList{},Conditions:[]PersistentVolumeClaimCondition{},} to be deleted (600 seconds elapsed)
Feb 10 07:28:36.564: INFO: failed to delete PVC and application (rbd-13182): timed out waiting for the condition
Feb 10 07:28:36.564: FAIL: deleting PVCs and applications failed, 1 errors were logged

@Madhu-1 added the ready-to-merge label (This PR is ready to be merged and it doesn't need second review; backports only) on Feb 10, 2021
@Madhu-1 (Collaborator, Author) left a comment:

LGTM

mergify bot merged commit 6256be0 into ceph:master on Feb 10, 2021
Labels
component/testing (Additional test cases or CI work)
ready-to-merge (This PR is ready to be merged and it doesn't need second review; backports only)
rebase (update the version of an external component)
4 participants