NSM pods can freeze on latest k8s versions #1041

Open
denis-tingaikin opened this issue Nov 19, 2024 · 9 comments

denis-tingaikin commented Nov 19, 2024

Description

On the newest Kubernetes clusters, NSM pods can hang while being deleted, which makes the tests run into their timeout.

Incidents

  1. https://github.com/networkservicemesh/integration-k8s-kind/actions/runs/11687420643
  2. https://github.com/networkservicemesh/integration-k8s-kind/actions/runs/11705850858
  3. https://github.com/networkservicemesh/integration-k8s-kind/actions/runs/11723342924/job/32654812715

Logs

time=2024-11-07T13:18:16Z level=info msg=kubectl delete pod -n nsm-system ${FORWARDER} TestRunHealSuite/TestRemote_forwarder_death_ip=stdin
panic: test timed out after 2h30m0s
running tests:
	TestRunHealSuite (2h15m21s)
	TestRunHealSuite/TestRemote_forwarder_death_ip (2h4m25s)

goroutine 379 [running]:
testing.(*M).startAlarm.func1()
	/opt/hostedtoolcache/go/1.20.5/x64/src/testing/testing.go:2241 +0x219
created by time.goFunc
	/opt/hostedtoolcache/go/1.20.5/x64/src/time/sleep.go:176 +0x48

goroutine 1 [chan receive, 136 minutes]:
testing.(*T).Run(0xc000134b60, {0xe6b3d5, 0x10}, 0xeb96c8)
	/opt/hostedtoolcache/go/1.20.5/x64/src/testing/testing.go:1630 +0x82e
testing.runTests.func1(0x0?)
	/opt/hostedtoolcache/go/1.20.5/x64/src/testing/testing.go:2036 +0x8e
testing.tRunner(0xc000134b60, 0xc00015bb48)
	/opt/hostedtoolcache/go/1.20.5/x64/src/testing/testing.go:1576 +0x217
testing.runTests(0xc00011f400?, {0x13e6340, 0xa, 0xa}, {0x1c?, 0x4a90f9?, 0x13ed8c0?})
	/opt/hostedtoolcache/go/1.20.5/x64/src/testing/testing.go:2034 +0x87d
testing.(*M).Run(0xc00011f400)
	/opt/hostedtoolcache/go/1.20.5/x64/src/testing/testing.go:1906 +0xb45
main.main()
	_testmain.go:65 +0x2ea

goroutine 328 [chan receive, 125 minutes]:
testing.(*T).Run(0xc000151040, {0xd9cff9, 0x1d}, 0xc0000fe990)
	/opt/hostedtoolcache/go/1.20.5/x64/src/testing/testing.go:1630 +0x82e
github.com/stretchr/testify/suite.runTests({0xf5a1b0, 0xc000151040}, {0xc000004300?, 0x1c, 0x20})
	/home/runner/work/integration-k8s-kind/integration-k8s-kind/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:242 +0x19d
github.com/stretchr/testify/suite.Run(0xc000151040, {0xf560e0?, 0xc0000bb500})
	/home/runner/work/integration-k8s-kind/integration-k8s-kind/pkg/mod/github.com/stretchr/[email protected]/suite/suite.go:215 +0x9b3
github.com/networkservicemesh/integration-k8s-kind/tests_single.TestRunHealSuite(0x0?)
	/home/runner/work/integration-k8s-kind/integration-k8s-kind/src/github.com/networkservicemesh/integration-k8s-kind/tests_single/heal_test.go:28 +0x45
testing.tRunner(0xc000151040, 0xeb96c8)
	/opt/hostedtoolcache/go/1.20.5/x64/src/testing/testing.go:1576 +0x217

goroutine 315 [IO wait, 135 minutes]:
internal/poll.runtime_pollWait(0x7fa38e55b358, 0x72)
	/opt/hostedtoolcache/go/1.20.5/x64/src/runtime/netpoll.go:306 +0x89
internal/poll.(*pollDesc).wait(0xc00020f2d8, 0xc0003d1400?, 0x1)
	/opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_poll_runtime.go:84 +0xbd
internal/poll.(*pollDesc).waitRead(...)
	/opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00020f2c0, {0xc0003d1400, 0x400, 0x400})
	/opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_unix.go:167 +0x405
os.(*File).read(...)
	/opt/hostedtoolcache/go/1.20.5/x64/src/os/file_posix.go:31
os.(*File).Read(0xc0000148c8, {0xc0003d1400, 0x400, 0x400})
	/opt/hostedtoolcache/go/1.20.5/x64/src/os/file.go:118 +0xc8
github.com/networkservicemesh/gotestmd/pkg/bash.(*Bash).extractMessagesFromPipe(0xc0000cb900, {0xf53dc0, 0xc0000148c8}, 0xc0002c0420)
	/home/runner/work/integration-k8s-kind/integration-k8s-kind/pkg/mod/github.com/networkservicemesh/[email protected]/pkg/bash/bash.go:140 +0x105
created by github.com/networkservicemesh/gotestmd/pkg/bash.(*Bash).Init
	/home/runner/work/integration-k8s-kind/integration-k8s-kind/pkg/mod/github.com/networkservicemesh/[email protected]/pkg/bash/bash.go:131 +0xc06

goroutine 368 [IO wait, 125 minutes]:
internal/poll.runtime_pollWait(0x7fa38e55b448, 0x72)
	/opt/hostedtoolcache/go/1.20.5/x64/src/runtime/netpoll.go:306 +0x89
internal/poll.(*pollDesc).wait(0xc000176918, 0xc00012d422?, 0x1)
	/opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_poll_runtime.go:84 +0xbd
internal/poll.(*pollDesc).waitRead(...)
	/opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000176900, {0xc00012d422, 0x3de, 0x3de})
	/opt/hostedtoolcache/go/1.20.5/x64/src/internal/poll/fd_unix.go:167 +0x405
os.(*File).read(...)
	/opt/hostedtoolcache/go/1.20.5/x64/src/os/file_posix.go:31
os.(*File).Read(0xc0000140f0, {0xc00012d422, 0x3de, 0x3de})
	/opt/hostedtoolcache/go/1.20.5/x64/src/os/file.go:118 +0xc8
github.com/networkservicemesh/gotestmd/pkg/bash.(*Bash).extractMessagesFromPipe(0xc00014a080, {0xf53dc0, 0xc0000140f0}, 0xc0002c02a0)
	/home/runner/work/integration-k8s-kind/integration-k8s-kind/pkg/mod/github.com/networkservicemesh/[email protected]/pkg/bash/bash.go:140 +0x105
created by github.com/networkservicemesh/gotestmd/pkg/bash.(*Bash).Init
	/home/runner/work/integration-k8s-kind/integration-k8s-kind/pkg/mod/github.com/networkservicemesh/[email protected]/pkg/bash/bash.go:130 +0xab2
FAIL	github.com/networkservicemesh/integration-k8s-kind/tests_single	9000.113s

Affected versions

v1.31.1
v1.30.4
v1.29.8
v1.28.12
v1.27.16
v1.26.15

denis-tingaikin commented Dec 2, 2024

Status update:

The newest k8s versions reproduce the same problem:

          - v1.31.2
          - v1.30.5
          - v1.29.9
          - v1.28.13
          - v1.27.17
          - v1.26.16
          - v1.25.17
          - v1.24.18

We also managed to reproduce the problem with the older NSM releases v1.13.0 and v1.14.0: https://github.com/networkservicemesh/integration-k8s-kind/actions/runs/11701859018/job/32588822494

I've also managed to run the previous fixed set of versions successfully 3 times in a row:

          - v1.28.0
          - v1.27.2
          - v1.26.4
          - v1.25.11
          - v1.24.15
          - v1.23.17
          - v1.22.17
          - v1.21.14


arp-est commented Dec 5, 2024

Hi,
I believe this is related to containerd/containerd#10589.
The node image used in the CI, 'kindest/node:v1.31.1', seems to be using containerd v1.7.18.

I ran some tests on my local machine: I built kind node images with different containerd versions and ran a script that constantly reloads the forwarder daemonset to check.
With containerd v1.7.15, v1.7.18, and v1.7.21, the forwarder got stuck in ~15 minutes.
With containerd v1.7.22 and v1.7.23, it ran for more than 30 minutes (for v1.7.23, it was about 2 hours before I deleted the cluster).

So I think, as suggested in the linked issue, a solution could be to run a custom node image with containerd v1.7.22 until a kind image with a more recent containerd is released.


arp-est commented Dec 5, 2024

There already seems to be a pull request to update the containerd version in kind: kubernetes-sigs/kind#3801
I guess it's just a matter of time then. Hopefully that will fix it.


denis-tingaikin commented Dec 6, 2024

Is it possible to check if kubernetes-sigs/kind#3801 fixes our problem?


arp-est commented Dec 7, 2024

I can build a Docker image for kind with containerd 1.7.23. If it's uploaded to some public space, then, if I see it correctly, changing the node_image URL in these places to the custom-built one should do the trick:

node_image: kindest/node:${{ vars.NSM_KUBERNETES_VERSION }}

node_image: kindest/node:${{ matrix.image }}

node_image: kindest/node:${{ vars.NSM_KUBERNETES_VERSION }}

node_image: kindest/node:${{ vars.NSM_KUBERNETES_VERSION }}

node_image: kindest/node:${{ vars.NSM_KUBERNETES_VERSION }}

and then run the CI with these changes.

EDIT:
I'm having some difficulty with the 'find a public space to upload the image' part today, though.


arp-est commented Dec 7, 2024

Szilard uploaded an image here: registry.nordix.org/cloud-native/kind-node:containerd1723
It has containerd 1.7.23 running in it, so yes, changing the node_image in the YAML to this and then running the CI should do the trick.


denis-tingaikin commented Dec 9, 2024

FYI: I created a PR that uses the previous versions: https://github.com/networkservicemesh/integration-k8s-kind/pull/1036

Am I understanding correctly that, for now, all we need to do is wait for releases that include the fix? Do we know of any ETA?


arp-est commented Dec 9, 2024

Yes, I believe so. I'm not sure how long it will take. Now that I look at it, based on the dates of the version tags, they seem to do quarterly releases, so if it gets merged, then probably February? (I was hoping it would be earlier.)

@denis-tingaikin
Member Author

> Yes, I believe so. I'm not sure how long it will take. Now that I look at it, based on the dates of the version tags, they seem to do quarterly releases, so if it gets merged, then probably February? (I was hoping it would be earlier.)

The problem sounds CRITICAL for many folks, so is there any ticket where others report the same issue?
