Hang in unit test test_async.py::test_versions #379

Closed
jimthompson5802 opened this issue Jan 4, 2022 · 13 comments

@jimthompson5802

What happened:

Attempted to run the dask-kubernetes unit test suite. The test suite appears to hang in the first test case: dask_kubernetes/tests/test_async.py::test_versions

What you expected to happen:

I expected to see all 67 test cases run w/o issue.

Minimal Complete Verifiable Example:

I was able to create the hang condition with this command from the project's root directory:

pytest -v dask_kubernetes/tests

Anything else we need to know?:

I captured the following log output.

pytest invocation

$ pytest -v  dask_kubernetes/tests
====================================================================================== test session starts =======================================================================================
platform darwin -- Python 3.9.7, pytest-6.2.5, py-1.10.0, pluggy-1.0.0 -- /Users/jim/.conda/envs/dask-kubernetes/bin/python
cachedir: .pytest_cache
rootdir: /Users/jim/Desktop/dask/dask-kubernetes, configfile: setup.cfg
plugins: asyncio-0.16.0, xdist-2.5.0, forked-1.4.0, pycharm-0.7.0, kind-21.1.3
collected 67 items

dask_kubernetes/tests/test_async.py::test_versions

The pytest output stops at the above line.
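A hang like this is easier to diagnose when a wait on cluster state gives up after a deadline instead of blocking forever. The following is only a generic sketch of a bounded wait; `wait_until` is a hypothetical helper and is not taken from dask-kubernetes or pytest-kind.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float, interval: float = 0.5) -> None:
    """Poll `predicate` until it returns True, or raise TimeoutError.

    Hypothetical helper for illustration; not part of dask-kubernetes.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

With a wrapper like this, a pod that never leaves Pending would surface as a TimeoutError in the test report rather than as a silent stall.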

Status of the kind cluster, as shown by the kubectl get all -A command

$ kubectl get all -A
NAMESPACE            NAME                                                    READY   STATUS             RESTARTS   AGE
default              pod/dask-jim-0bdc4023-bmqgvf                            0/1     Pending            0          10m
kube-system          pod/coredns-74ff55c5b-pvsv9                             0/1     Pending            0          11m
kube-system          pod/coredns-74ff55c5b-wjv4p                             0/1     Pending            0          11m
kube-system          pod/etcd-pytest-kind-control-plane                      1/1     Running            0          11m
kube-system          pod/kindnet-kmkr8                                       1/1     Running            3          11m
kube-system          pod/kube-apiserver-pytest-kind-control-plane            1/1     Running            0          11m
kube-system          pod/kube-controller-manager-pytest-kind-control-plane   1/1     Running            0          11m
kube-system          pod/kube-proxy-m8rp8                                    0/1     CrashLoopBackOff   7          11m
kube-system          pod/kube-scheduler-pytest-kind-control-plane            1/1     Running            0          11m
local-path-storage   pod/local-path-provisioner-78776bfc44-2l8mv             0/1     Pending            0          11m

NAMESPACE     NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP                  11m
kube-system   service/kube-dns     ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   11m

NAMESPACE     NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   daemonset.apps/kindnet      1         1         1       1            1           <none>                   11m
kube-system   daemonset.apps/kube-proxy   1         1         0       1            0           kubernetes.io/os=linux   11m

NAMESPACE            NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
kube-system          deployment.apps/coredns                  0/2     2            0           11m
local-path-storage   deployment.apps/local-path-provisioner   0/1     1            0           11m

NAMESPACE            NAME                                                DESIRED   CURRENT   READY   AGE
kube-system          replicaset.apps/coredns-74ff55c5b                   2         2         0       11m
local-path-storage   replicaset.apps/local-path-provisioner-78776bfc44   1         1         0       11m
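In a listing this long it is easy to miss which pods are unhealthy. As a rough sketch, the pod rows of the text output can be filtered down to those whose STATUS is not Running or Completed. The function name `unhealthy_pods` is made up for this example, and the parsing assumes the default column layout shown above.

```python
def unhealthy_pods(table: str) -> list[tuple[str, str, str]]:
    """Return (namespace, name, status) for pod rows that are not Running/Completed.

    Assumes the default `kubectl get all -A` columns:
    NAMESPACE NAME READY STATUS RESTARTS AGE
    """
    problems = []
    for line in table.splitlines():
        fields = line.split()
        # Only pod rows carry the "pod/" prefix in the NAME column.
        if len(fields) >= 4 and fields[1].startswith("pod/"):
            namespace, name, _ready, status = fields[:4]
            if status not in ("Running", "Completed"):
                problems.append((namespace, name, status))
    return problems


# A few rows copied from the output above.
sample = """\
NAMESPACE     NAME                                   READY   STATUS             RESTARTS   AGE
kube-system   pod/coredns-74ff55c5b-pvsv9            0/1     Pending            0          11m
kube-system   pod/etcd-pytest-kind-control-plane     1/1     Running            0          11m
kube-system   pod/kube-proxy-m8rp8                   0/1     CrashLoopBackOff   7          11m
"""

for namespace, name, status in unhealthy_pods(sample):
    print(namespace, name, status)
```

Here the CrashLoopBackOff row for kube-proxy is the one worth looking at first, since the other flagged pods are only Pending.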

Log output from the crashed pod/kube-proxy-m8rp8

$ kubectl logs pod/kube-proxy-m8rp8 -n kube-system
I0104 01:15:18.746378       1 node.go:172] Successfully retrieved node IP: 172.30.0.2
I0104 01:15:18.746449       1 server_others.go:142] kube-proxy node IP is an IPv4 address (172.30.0.2), assume IPv4 operation
I0104 01:15:18.774456       1 server_others.go:185] Using iptables Proxier.
I0104 01:15:18.775208       1 server.go:650] Version: v1.20.2
I0104 01:15:18.776129       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 163840
F0104 01:15:18.776174       1 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

Searching on the error message led to this issue: kubernetes-sigs/kind#2240
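The kube-proxy output uses the klog line format, where the first character is the severity (I, W, E or F), so the fatal line can be picked out mechanically. Below is a minimal sketch; the regular expression and the `fatal_messages` helper are written for this example and only cover lines shaped like the ones above.

```python
import re

# klog header: Lmmdd hh:mm:ss.uuuuuu threadid file:line] msg
KLOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>\d+) (?P<source>[^\s:]+:\d+)\] (?P<message>.*)$"
)


def fatal_messages(log_text: str) -> list[tuple[str, str]]:
    """Return (source, message) for every fatal (severity F) klog line."""
    found = []
    for line in log_text.splitlines():
        match = KLOG_LINE.match(line.strip())
        if match and match.group("severity") == "F":
            found.append((match.group("source"), match.group("message")))
    return found


# Two lines copied from the kube-proxy log above.
log = """\
I0104 01:15:18.776129       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 163840
F0104 01:15:18.776174       1 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied
"""

for source, message in fatal_messages(log):
    print(source, message)
```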

Environment:

  • Dask version: From what I can tell, the unit test installs from the main branch.
  • Python version: Python 3.9.7
  • Operating System: Unit test is run on MacOS 11.6.1
  • Install method (conda, pip, source): N/A
  • Other software
    • Docker Desktop 4.2.0 (Docker Engine 20.10.10)
    • kubectl client 1.21.7
    • Helm 3.7.1
@jacobtomlinson
Member

Sorry to hear you are having trouble. My guess is that you have not installed the test dependencies with pip install -r requirements-test.txt.

It looks like this is fixed in kind 0.11.1. Sadly, while that is the version the pytest-kind source now uses, there hasn't been a release over there for a while, and if you do pip install pytest-kind you get kind 0.10.0. So we have to install pytest-kind from source, which our requirements-test.txt file should do.

Could you give that a go and let me know how you get on?

@jimthompson5802
Author

@jacobtomlinson Thank you for the quick response.

I did as you requested: pip install -r requirements-test.txt. This is what pip list looks like for pytest-related packages:

$ pip list | grep pytest
pytest             6.2.5
pytest-asyncio     0.16.0
pytest-forked      1.4.0
pytest-kind        21.1.3
pytest-pycharm     0.7.0
pytest-timeout     2.0.2
pytest-xdist       2.5.0

Unfortunately, no change in symptoms when I run pytest -v dask_kubernetes/tests

$ kubectl get all -A
NAMESPACE            NAME                                                    READY   STATUS             RESTARTS   AGE
default              pod/dask-jim-184c3d5b-eds2bw                            0/1     Pending            0          71s
default              pod/dask-jim-45f924b1-2m6cr9                            0/1     Pending            0          4m22s
kube-system          pod/coredns-74ff55c5b-4vljh                             0/1     Pending            0          5m34s
kube-system          pod/coredns-74ff55c5b-6vgbj                             0/1     Pending            0          5m34s
kube-system          pod/etcd-pytest-kind-control-plane                      1/1     Running            0          5m47s
kube-system          pod/kindnet-2chkv                                       1/1     Running            1          5m34s
kube-system          pod/kube-apiserver-pytest-kind-control-plane            1/1     Running            0          5m47s
kube-system          pod/kube-controller-manager-pytest-kind-control-plane   1/1     Running            0          5m47s
kube-system          pod/kube-proxy-2pnmz                                    0/1     CrashLoopBackOff   5          5m34s
kube-system          pod/kube-scheduler-pytest-kind-control-plane            1/1     Running            0          5m47s
local-path-storage   pod/local-path-provisioner-78776bfc44-j2w9x             0/1     Pending            0          5m34s

NAMESPACE     NAME                 TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP                  5m49s
kube-system   service/kube-dns     ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   5m47s

NAMESPACE     NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
kube-system   daemonset.apps/kindnet      1         1         1       1            1           <none>                   5m45s
kube-system   daemonset.apps/kube-proxy   1         1         0       1            0           kubernetes.io/os=linux   5m47s

NAMESPACE            NAME                                     READY   UP-TO-DATE   AVAILABLE   AGE
kube-system          deployment.apps/coredns                  0/2     2            0           5m47s
local-path-storage   deployment.apps/local-path-provisioner   0/1     1            0           5m44s

NAMESPACE            NAME                                                DESIRED   CURRENT   READY   AGE
kube-system          replicaset.apps/coredns-74ff55c5b                   2         2         0       5m35s
local-path-storage   replicaset.apps/local-path-provisioner-78776bfc44   1         1         0       5m35s


$ kubectl logs pod/kube-proxy-2pnmz -n kube-system
I0104 11:41:06.239102       1 node.go:172] Successfully retrieved node IP: 172.31.0.2
I0104 11:41:06.239168       1 server_others.go:142] kube-proxy node IP is an IPv4 address (172.31.0.2), assume IPv4 operation
I0104 11:41:06.256907       1 server_others.go:185] Using iptables Proxier.
I0104 11:41:06.257237       1 server.go:650] Version: v1.20.2
I0104 11:41:06.259804       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 163840
F0104 11:41:06.259888       1 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

Any thoughts on this issue: kubernetes-sigs/kind#2240? The symptoms described appear to be the same as what I see in the log message for kube-proxy. Right now I'm trying to see if I can specify a different version of kind for pytest.

@jacobtomlinson
Member

jacobtomlinson commented Jan 4, 2022

If you already have pytest-kind installed, you'll need to do pip install --upgrade -r requirements-test.txt. Upgrading pytest-kind will also upgrade the version of kind it uses.

In the issue you linked the last comment mentions that this is fixed in kind 0.11.1.

@jimthompson5802
Author

Got it...ran pip install --upgrade -r requirements-test.txt

$ pip install --upgrade -r requirements-test.txt
Collecting git+https://codeberg.org/hjacobs/pytest-kind.git (from -r requirements-test.txt (line 6))
  Cloning https://codeberg.org/hjacobs/pytest-kind.git to /private/var/folders/8_/h2tjc94d2y5fwx8dhbcrnbfc0000gn/T/pip-req-build-u23h8gyq
  Running command git clone -q https://codeberg.org/hjacobs/pytest-kind.git /private/var/folders/8_/h2tjc94d2y5fwx8dhbcrnbfc0000gn/T/pip-req-build-u23h8gyq
  Resolved https://codeberg.org/hjacobs/pytest-kind.git to commit 12bd425e54932d33485c4652aacfec8448131122
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Requirement already satisfied: flake8>=3.7 in /Users/jim/.conda/envs/dask-kubernetes/lib/python3.9/site-packages (from -r requirements-test.txt (line 1)) (4.0.1)
Requirement already satisfied: black>=18.9b0 in /Users/jim/.conda/envs/dask-kubernetes/lib/python3.9/site-packages (from -r requirements-test.txt (line 2)) (21.12b0)
<<<<<<< REMOVED INTERMEDIATE MESSAGES >>>>>>>>
Requirement already satisfied: heapdict in /Users/jim/.conda/envs/dask-kubernetes/lib/python3.9/site-packages (from zict>=0.1.3->distributed->dask-ctl>=2021.3.0->-r requirements-test.txt (line 3)) (1.0.1)
Requirement already satisfied: MarkupSafe>=2.0 in /Users/jim/.conda/envs/dask-kubernetes/lib/python3.9/site-packages (from jinja2->distributed->dask-ctl>=2021.3.0->-r requirements-test.txt (line 3)) (2.0.1)


$ pip list | grep pytest
pytest             6.2.5
pytest-asyncio     0.16.0
pytest-forked      1.4.0
pytest-kind        21.1.3
pytest-pycharm     0.7.0
pytest-timeout     2.0.2
pytest-xdist       2.5.0

No change in behavior; kube-proxy still fails.

kube-system          pod/kube-proxy-rhzvk                                    0/1     CrashLoopBackOff   5          5m45s

$ kubectl logs pod/kube-proxy-rhzvk
Error from server (NotFound): pods "kube-proxy-rhzvk" not found
Jim-MacBook-Pro:jim pytest-kind[584]$ kubectl logs pod/kube-proxy-rhzvk -n kube-system
I0104 12:20:27.560294       1 node.go:172] Successfully retrieved node IP: 192.168.0.2
I0104 12:20:27.560417       1 server_others.go:142] kube-proxy node IP is an IPv4 address (192.168.0.2), assume IPv4 operation
I0104 12:20:27.583833       1 server_others.go:185] Using iptables Proxier.
I0104 12:20:27.584393       1 server.go:650] Version: v1.20.2
I0104 12:20:27.584806       1 conntrack.go:100] Set sysctl 'net/netfilter/nf_conntrack_max' to 163840
F0104 12:20:27.584884       1 server.go:495] open /proc/sys/net/netfilter/nf_conntrack_max: permission denied

If it helps, here is the docker image that is downloaded when the kind cluster is started.

$ docker images | grep kind
kindest/node                     <none>           094599011731   11 months ago   1.17GB

@jimthompson5802
Author

I forgot to mention that before running pytest this last time, I did remove the docker image cited above to avoid the possibility of reusing an old image. The image shown above was freshly downloaded.

@jacobtomlinson
Member

Hrm ok it looks like you have the right pytest-kind now. Can you confirm that it is definitely installing kind 0.11.1?
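The pip list output earlier in the thread reports pytest-kind 21.1.3 both before and after the upgrade, so the version string alone may not show whether the package came from PyPI or from the git repository. One way to check, sketched here under the assumption that pip recorded a PEP 610 direct_url.json file for the install, is to read that file through importlib.metadata. The helper names `describe_origin` and `origin_of` are made up for this example.

```python
import json
from importlib import metadata
from typing import Optional


def describe_origin(direct_url_text: Optional[str]) -> str:
    """Summarize a PEP 610 direct_url.json payload (None means no record)."""
    if direct_url_text is None:
        return "no direct_url.json: not recorded as a direct URL install"
    info = json.loads(direct_url_text)
    vcs = info.get("vcs_info")
    if vcs:
        return f"{vcs.get('vcs')} checkout of {info.get('url')} at commit {vcs.get('commit_id')}"
    return f"direct install from {info.get('url')}"


def origin_of(package: str) -> str:
    """Report how an installed distribution was installed, if that is recorded."""
    try:
        dist = metadata.distribution(package)
    except metadata.PackageNotFoundError:
        return "not installed"
    # read_text returns None when the file is absent from the dist-info directory.
    return describe_origin(dist.read_text("direct_url.json"))


print(origin_of("pytest-kind"))
```

A git checkout result would suggest the source install took effect; it says nothing about which kind binary has been downloaded, which still needs to be checked separately.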

@jimthompson5802
Author

Can you confirm that it is definitely installing kind 0.11.1?

I assume you mean running kind version with the executable found in the .pytest-kind/pytest-kind/ directory. If this is correct, here is the output.

Jim-MacBook-Pro:jim pytest-kind[615]$ pwd
/Users/jim/Desktop/dask/dask-kubernetes/.pytest-kind/pytest-kind
Jim-MacBook-Pro:jim pytest-kind[616]$ ls -l
total 14496
-rwxr-xr-x  1 jim  staff  7396096 Jan  4 09:47 kind
-rw-------  1 jim  staff     5552 Jan  4 09:48 kubeconfig
Jim-MacBook-Pro:jim pytest-kind[617]$ ./kind version
kind v0.10.0 go1.15.7 darwin/amd64

It appears I do not have the correct version of kind. I have 0.10.0 and not 0.11.1.

Is there a config parameter that I'm missing?
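The same check can be scripted so that an outdated binary is caught before the test run. This is a sketch only: `parse_kind_version` and `kind_is_at_least` are hypothetical helpers, the binary path is the one from the listing above, and the parsing assumes output shaped like the `kind version` line shown here.

```python
import re
import subprocess


def parse_kind_version(output: str) -> tuple[int, ...]:
    """Extract (major, minor, patch) from output such as 'kind v0.10.0 go1.15.7 darwin/amd64'."""
    match = re.search(r"\bv(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"no version found in {output!r}")
    return tuple(int(part) for part in match.groups())


def kind_is_at_least(binary: str, minimum: tuple[int, ...]) -> bool:
    """Run `<binary> version` and compare the result against `minimum`."""
    result = subprocess.run(
        [binary, "version"], capture_output=True, text=True, check=True
    )
    return parse_kind_version(result.stdout) >= minimum


# Using the output reported above instead of running the binary:
found = parse_kind_version("kind v0.10.0 go1.15.7 darwin/amd64")
print(found, found >= (0, 11, 1))
```

Comparing integer tuples rather than strings keeps the ordering numeric, so a version such as 0.9.0 would not be mistaken for something newer than 0.10.0.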

@jimthompson5802
Copy link
Author

And just to confirm...here is git log for my local repo. I believe this is the current version.

commit bf66618346840ccfbd35ab68df7625bf365fd3cc (HEAD -> main, upstream/main, origin/main)
Author: andrethrill <[email protected]>
Date:   Thu Nov 11 17:41:04 2021 +0000

    Convert scheduler_pod_template from str and dict (#374)

    * Update core.py

    * removed spaces

@jacobtomlinson
Member

Yup so that's the problem. Can you delete the binary and try again? pytest-kind should pull down the correct version on first run.

@jimthompson5802
Author

Before the last run I reported, I ran rm -fr .pytest-kind/, so I deleted the entire sub-directory.

What I showed above is after the deletion. It looks like pytest-kind still pulls down kind 0.10.0.

@jacobtomlinson
Member

Strange because it definitely should be pulling down 0.11.1.

https://codeberg.org/hjacobs/pytest-kind/src/branch/main/pytest_kind/cluster.py#L18

@jimthompson5802
Author

Thank you for the insight. I may have found the root cause. It appears my conda environment may be broken. When I look in the conda env library, I see this:

[two screenshots]

Let me kill the conda environment and recreate. I'll get back to you on this.

I really appreciate your patience and guidance in tracking this down.

@jimthompson5802
Author

@jacobtomlinson Thank you very much for your help. I am now able to run the unit tests.

It looks like my conda environment was broken. After deleting and recreating the conda environment and installing the requirements.txt and requirements-test.txt dependencies, the pytest tests now run.
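The thread does not record exactly what was broken in the environment. One kind of breakage that can be checked for, offered here purely as an assumption about what might have gone wrong, is the same distribution being present in more than one version. The sketch below lists such duplicates with importlib.metadata; `duplicated` and `installed_pairs` are hypothetical helper names.

```python
import re
from collections import defaultdict
from importlib import metadata
from typing import Iterable


def normalize(name: str) -> str:
    """Normalize a project name as in PEP 503 (pytest_kind -> pytest-kind)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def duplicated(pairs: Iterable[tuple[str, str]]) -> dict[str, list[str]]:
    """Map each normalized name to its versions, keeping only names seen with 2+ versions."""
    versions = defaultdict(set)
    for name, version in pairs:
        versions[normalize(name)].add(version)
    return {name: sorted(found) for name, found in versions.items() if len(found) > 1}


def installed_pairs() -> list[tuple[str, str]]:
    """Collect (name, version) for every distribution visible on sys.path."""
    pairs = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip damaged metadata directories that have no name
            pairs.append((name, dist.version))
    return pairs


print(duplicated(installed_pairs()))
```

An empty result does not prove the environment is healthy. As in this thread, recreating the environment is often the quicker fix.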

Here is the pytest summary line

FAILED dask_kubernetes/tests/test_helm.py::test_discovery - AssertionError: assert 'helmcluster' in {'proxycluster': {'discover': <function discover at 0x7fb529a...
=========================================== 1 failed, 55 passed, 6 skipped, 5 xfailed, 61 warnings in 368.41s (0:06:08) ============================================

Since I can run the unit tests, I'm closing this issue. Thank you again for your help.
