
TestLimiter is flaky #13526

Closed

r0mant opened this issue Jun 15, 2022 · 14 comments · Fixed by #14102 or #31605

Comments

@r0mant
Collaborator

r0mant commented Jun 15, 2022

From the CI:

FAIL: github.com/gravitational/teleport/lib/srv/regular
FAIL: github.com/gravitational/teleport/lib/srv/regular.TestLimiter
...
    sshserver_test.go:1491: 
        	Error Trace:	sshserver_test.go:1491
        	Error:      	Expected value not to be nil.
        	Test:       	TestLimiter

CI Logs:

Investigate how to remove the test's reliance on time and make it non-flaky.
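
One common way to remove a test's reliance on wall-clock time is to poll for the expected state rather than sleeping for a fixed duration. A minimal sketch using testify's require.Eventually (connCount and expectedConns are hypothetical names, not the test's real API):

	// Sketch: poll until the limiter reaches the expected state instead of
	// assuming it gets there within a fixed sleep; tolerates slow CI machines.
	require.Eventually(t, func() bool {
		// connCount is a hypothetical accessor for the limiter's current
		// number of tracked connections.
		return connCount() == expectedConns
	}, 10*time.Second, 100*time.Millisecond,
		"limiter never reached the expected connection count")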

@zmb3
Collaborator

zmb3 commented Jun 15, 2022

I've also seen it panic, not just fail.

➜ go test -run TestLimiter -race -count=100 ./lib/srv/regular
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x2 addr=0x0 pc=0x103839518]

goroutine 20317 [running]:
github.com/gravitational/teleport/lib/srv.(*terminal).closeTTY(0x0)
	/Users/zmb/teleport/lib/srv/term.go:273 +0x38
github.com/gravitational/teleport/lib/srv.(*terminal).Close(0x0)
	/Users/zmb/teleport/lib/srv/term.go:265 +0x30
github.com/gravitational/teleport/lib/srv.(*session).Stop(0xc001dcf760)
	/Users/zmb/teleport/lib/srv/sess.go:661 +0x1a4
github.com/gravitational/teleport/lib/srv.(*session).Close(0xc001dcf760)
	/Users/zmb/teleport/lib/srv/sess.go:680 +0x38
github.com/gravitational/teleport/lib/srv.(*SessionRegistry).OpenSession(0xc004620f00?, {0x105234990, 0xc000b50480}, 0xc004620f00)
	/Users/zmb/teleport/lib/srv/sess.go:266 +0x398
github.com/gravitational/teleport/lib/srv.(*TermHandlers).HandleShell(0xc0053d18a8, {0x105234990, 0xc000b50480}, 0xc004bb2980, 0xc004620f00)
	/Users/zmb/teleport/lib/srv/termhandlers.go:115 +0x108
github.com/gravitational/teleport/lib/srv/regular.(*Server).dispatch(0xc001a13b80, {0x1052300c8, 0xc004f533c0}, {0x105234990, 0xc000b50480}, 0xc004bb2980, 0xc004620f00)
	/Users/zmb/teleport/lib/srv/regular/sshserver.go:1575 +0x604
github.com/gravitational/teleport/lib/srv/regular.(*Server).handleSessionRequests(0xc001a13b80, {0x1052300c8, 0xc004f533c0}, _, {{0xc004b3c07d, 0x3}, {0x0, 0x0}, {0xc004989abd, 0x3}, ...}, ...)
	/Users/zmb/teleport/lib/srv/regular/sshserver.go:1485 +0xb14
github.com/gravitational/teleport/lib/srv/regular.(*Server).HandleNewChan.func1()
	/Users/zmb/teleport/lib/srv/regular/sshserver.go:1217 +0xd8
created by github.com/gravitational/teleport/lib/srv/regular.(*Server).HandleNewChan
	/Users/zmb/teleport/lib/srv/regular/sshserver.go:1216 +0xeb0
FAIL	github.com/gravitational/teleport/lib/srv/regular	94.129s
FAIL
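
The (*terminal).closeTTY(0x0) frame shows Close being called on a nil *terminal, i.e. session.Stop tears down a terminal that was never created because the session was rejected. A minimal sketch of the nil-guard pattern that would avoid the SIGSEGV (the field and mutex names are assumptions, not the actual lib/srv code):

	// Sketch only: skip the terminal teardown when the session never got one.
	func (s *session) Stop() {
		s.mu.Lock()
		term := s.term
		s.mu.Unlock()

		// term is nil when the session was rejected (e.g. by the limiter)
		// before a TTY was ever allocated.
		if term != nil {
			term.Close()
		}
	}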

@nklaassen
Contributor

Hit this a couple of times today; added logs to the description.

@r0mant r0mant linked a pull request Jul 8, 2022 that will close this issue
r0mant added a commit that referenced this issue Jul 11, 2022
* Make TestLimiter test less flaky

* Update lib/srv/regular/sshserver_test.go

Co-authored-by: Roman Tkachenko <[email protected]>

* Update lib/srv/regular/sshserver_test.go

Co-authored-by: Roman Tkachenko <[email protected]>

* Remove unneeded helper to wait for server to start and add a comment

Co-authored-by: Roman Tkachenko <[email protected]>
EdwardDowling added a commit that referenced this issue Jul 19, 2022
EdwardDowling added a commit that referenced this issue Jul 21, 2022
@zmb3
Collaborator

zmb3 commented Sep 10, 2022

Reopening this one as it's back among the top failing tests.

@zmb3 zmb3 reopened this Sep 10, 2022
@GavinFrazar
Contributor

Saw this fail today because it failed to clean up a temp dir. The test must have some goroutine that we don't wait on to finish:

log: https://console.cloud.google.com/cloud-build/builds/7749e62a-191d-43dc-a608-9e86e083d030?project=ci-account

relevant snippet:

OUTPUT github.com/gravitational/teleport/lib/srv/regular.TestLimiter
===================================================
=== RUN   TestLimiter
=== PAUSE TestLimiter
=== CONT  TestLimiter
=== CONT  TestLimiter
    testing.go:1090: TempDir RemoveAll cleanup: unlinkat /tmp/TestLimiter672771855/003/log/upload/streaming/default: directory not empty
--- FAIL: TestLimiter (16.40s)

While trying to repro locally, I also noticed that this test creates zombie shell processes: three new ones per run. I now have 177 tty zombies.
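
Both symptoms point at teardown: if the test spawns shells in goroutines and never waits for them, the TempDir RemoveAll races the still-running writers and the child processes are never reaped. A sketch of the pattern that addresses both, with illustrative names (the real test wiring differs):

	// Sketch: track spawned work and wait for it in a cleanup function.
	// t.Cleanup runs LIFO, so registering this after t.TempDir guarantees
	// every writer has exited before the directory is removed.
	dir := t.TempDir()
	var wg sync.WaitGroup
	t.Cleanup(wg.Wait)

	wg.Add(1)
	go func() {
		defer wg.Done()
		cmd := exec.Command("/bin/sh", "-c", "echo done > "+dir+"/out")
		_ = cmd.Run() // Run waits for the child, so no zombie is left behind
	}()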

@zmb3
Collaborator

zmb3 commented Nov 7, 2022

@GavinFrazar I've noticed the same when trying to reproduce TestAgentForward. Something in lib/srv is not cleaning up shell processes correctly.

@GavinFrazar
Contributor

I think this test caused this failure on branch/v11: https://console.cloud.google.com/cloud-build/builds/cd51f11c-8f22-4099-8e7e-01a8b013f145?project=ci-account

Either that or TestInactivityTimeout; it's hard to tell which one caused the test suite timeout.

@zmb3
Collaborator

zmb3 commented Dec 28, 2022

There still seem to be a few issues with this one:

  1. It suffers from TestAgentForward leaves orphaned shell processes behind #13335, so reproducing flakes locally is a pain.
  2. It suffers from the "tempdir cleanup problem", which can be fixed by using newFixtureWithoutDiskBasedLogging(t).
  3. It's still flaky. go test -run TestLimiter -race -count=200 -failfast ./lib/srv/regular failed after about 10 minutes for me locally, but the goroutine dump that it prints is so massive you can't tell what actually went wrong.

@jakule jakule self-assigned this Dec 28, 2022
@ibeckermayer
Contributor

@tigrato
Contributor

tigrato commented Feb 27, 2023

@zmb3
Collaborator

zmb3 commented Apr 12, 2023

This hasn't shown up recently. The two most recent occurrences were the test timeout issue that was recently fixed. I'm calling this one solved.

@zmb3 zmb3 closed this as completed Apr 12, 2023
@zmb3
Collaborator

zmb3 commented Jul 5, 2023

@zmb3 zmb3 reopened this Jul 5, 2023
@zmb3
Collaborator

zmb3 commented Sep 7, 2023

@rosstimothy here's what I've learned so far:

The test times out when starting the very last session (which is supposed to fail due to the limiter):

	_, err = clt.NewSession(ctx) // this deadlocks
	require.Error(t, err)

A goroutine dump shows we are stuck on a channel receive while trying to open the tracing SSH channel inside isTracingSupported.

There is actually a comment about not "blocking forever" in the event that the connection is rejected:

// Note: a channel is used instead of a global request in order prevent blocking
// forever in the event that the connection is rejected. In that case, the server
// doesn't service any global requests and writes the error to the first opened
// channel.

It feels like this is exactly what's happening, though. The logs do show a successful authn followed by a "max rate reached" error:

{"caller":"srv/authhandlers.go:394","component":"node","level":"debug","message":"Successfully authenticated","timestamp":"2023-09-07T12:19:32-06:00","trace.fields":{"fingerprint":"[email protected] SHA256:YE3dUbiTNxxdAJAmruF1h7AFjv5yh98MNBYY3Tdt1/U","local":{"IP":"127.0.0.1","Port":62202,"Zone":""},"remote":{"IP":"127.0.0.1","Port":62208,"Zone":""},"user":"zmb"}}
{"caller":"srv/authhandlers.go:579","component":"node","level":"debug","message":"Checking permissions for (zmb,zmb) to login to node with RBAC checks.","timestamp":"2023-09-07T12:19:32-06:00"}
{"caller":"sshutils/server.go:494","component":"ssh:node","level":"error","message":"max rate reached: retry-in 0s","timestamp":"2023-09-07T12:19:32-06:00"}

Reproduces pretty easily locally. I'm using a lower timeout so I don't have to wait a full 10m for the test to fail:

$ go test -run TestLimiter ./lib/srv/regular -count=10 -failfast -timeout=3m -v
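
For what it's worth, one way to keep the client from hanging here would be to race the channel open against the caller's context. A sketch under the assumption that the probe has access to the ssh.Conn (golang.org/x/crypto/ssh) and a context; the channel name and error handling are illustrative:

	// Sketch: bound the tracing-channel probe with the caller's context so
	// a rejected connection fails fast instead of blocking forever.
	func probeTracing(ctx context.Context, conn ssh.Conn) error {
		type result struct {
			ch  ssh.Channel
			err error
		}
		results := make(chan result, 1) // buffered so the goroutine never blocks on send
		go func() {
			ch, _, err := conn.OpenChannel("tracing", nil)
			results <- result{ch, err}
		}()
		select {
		case r := <-results:
			if r.err != nil {
				return r.err
			}
			return r.ch.Close()
		case <-ctx.Done():
			// A rejected server never services the open; give up when the
			// caller's context expires instead of blocking on the receive.
			return ctx.Err()
		}
	}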

@rosstimothy
Contributor

We could probably simplify isTracingSupported by not opening the tracing channel at all and just checking that the server version is SSH-2.0-Teleport. The channel was mostly a backward-compat check when tracing first landed; since all supported versions of Teleport now support tracing via SSH, we just need to avoid sending tracing payloads to OpenSSH instances.
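
A sketch of that version check (the prefix string comes from the comment above; x/crypto/ssh exposes the remote version via Conn.ServerVersion):

	// Sketch: infer tracing support from the remote version string instead
	// of opening a probe channel, avoiding the blocking channel open entirely.
	func isTracingSupported(conn ssh.Conn) bool {
		return bytes.HasPrefix(conn.ServerVersion(), []byte("SSH-2.0-Teleport"))
	}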

@ravicious
Member

Timeout on v13; does it warrant reopening?

https://github.com/gravitational/teleport/actions/runs/8141493060/job/22248971636#step:6:4446

panic: test timed out after 10m0s
running tests:
	TestLimiter (9m46s)
