Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

acceptance: TestRapidRestarts CI failures #29227

Closed
petermattis opened this issue Aug 29, 2018 · 4 comments · Fixed by #29230
Closed

acceptance: TestRapidRestarts CI failures #29227

petermattis opened this issue Aug 29, 2018 · 4 comments · Fixed by #29230
Assignees
Labels
A-kv-server Relating to the KV-level RPC server C-investigation Further steps needed to qualify. C-label will change.
Milestone

Comments

@petermattis petermattis added C-investigation Further steps needed to qualify. C-label will change. A-kv-server Relating to the KV-level RPC server labels Aug 29, 2018
@petermattis petermattis added this to the 2.1 milestone Aug 29, 2018
@petermattis petermattis self-assigned this Aug 29, 2018
@petermattis
Copy link
Collaborator Author

The failures are all timeouts. Usually the test succeeds in 5-10s. I've been able to reproduce failures locally by setting the timeout to 1m and running in a loop.

@petermattis
Copy link
Collaborator Author

I fooled myself by setting the 1m timeout and running with -count 100. No real local reproduction yet.

But I did find the real error in the failures above:

fatal error: concurrent map read and map write

goroutine 111 [running]:
runtime.throw(0x2a5299c, 0x21)
	/usr/local/go/src/runtime/panic.go:616 +0x81 fp=0xc420a98a18 sp=0xc420a989f8 pc=0x6dc7d1
runtime.mapaccess2_faststr(0x269f180, 0xc4208a9ef0, 0xc420378ce4, 0xd, 0xd, 0xc420a98ae8)
	/usr/local/go/src/runtime/hashmap_fast.go:270 +0x461 fp=0xc420a98a88 sp=0xc420a98a18 pc=0x6bc591
net/http.(*ServeMux).shouldRedirect(0xc420ab86c0, 0xc4202a21f0, 0x9, 0xc420378ce4, 0xd, 0x0)
	/usr/local/go/src/net/http/server.go:2239 +0x114 fp=0xc420a98b48 sp=0xc420a98a88 pc=0x9d8174
net/http.(*ServeMux).redirectToPathSlash(0xc420ab86c0, 0xc4202a21f0, 0x9, 0xc420378ce4, 0xd, 0xc420a6e400, 0xc420ce0000, 0x0)
	/usr/local/go/src/net/http/server.go:2224 +0x57 fp=0xc420a98b90 sp=0xc420a98b48 pc=0x9d7f47
net/http.(*ServeMux).Handler(0xc420ab86c0, 0xc42079c300, 0xc420260850, 0xc420a98d28, 0x18, 0xc420a98d20)
	/usr/local/go/src/net/http/server.go:2293 +0x118 fp=0xc420a98cd8 sp=0xc420a98b90 pc=0x9d8408
net/http.(*ServeMux).ServeHTTP(0xc420ab86c0, 0x2f4e760, 0xc4202ba000, 0xc42079c300)
	/usr/local/go/src/net/http/server.go:2336 +0xfe fp=0xc420a98d18 sp=0xc420a98cd8 pc=0x9d8b7e
github.com/cockroachdb/cockroach/pkg/server.(*Server).ServeHTTP(0xc4208d0000, 0x2f5cae0, 0xc420382380, 0xc42079c300)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1861 +0x11d fp=0xc420a98d58 sp=0xc420a98d18 pc=0x1eaeb2d
net/http.serverHandler.ServeHTTP(0xc4208851e0, 0x2f5cae0, 0xc420382380, 0xc42079c300)
	/usr/local/go/src/net/http/server.go:2694 +0xbc fp=0xc420a98d88 sp=0xc420a98d58 pc=0x9d9bec
net/http.(*conn).serve(0xc420ab6140, 0x2f5f2e0, 0xc420c28500)
	/usr/local/go/src/net/http/server.go:1830 +0x651 fp=0xc420a98fc8 sp=0xc420a98d88 pc=0x9d5e61
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:2361 +0x1 fp=0xc420a98fd0 sp=0xc420a98fc8 pc=0x70be91
created by net/http.(*Server).Serve
	/usr/local/go/src/net/http/server.go:2795 +0x27b

Seems like we're doing something bad early in startup.

@petermattis
Copy link
Collaborator Author

I've probably been staring at the sun too long, but it looks like we're definitely doing something that is not copacetic. http.ServeMux does not support having Handle called after it has been associated with a server. ServeMux.ServeHTTP accesses ServeMux.m without a lock.

Oh, this is apparently a bug in go1.10. Here is the problematic code path in go1.10:

func (mux *ServeMux) redirectToPathSlash(host, path string, u *url.URL) (*url.URL, bool) {
	if !mux.shouldRedirect(host, path) {
		return u, false
	}
	path = path + "/"
	u = &url.URL{Path: path, RawQuery: u.RawQuery}
	return u, true
}

And here is that same code in go1.11:

func (mux *ServeMux) redirectToPathSlash(host, path string, u *url.URL) (*url.URL, bool) {
	mux.mu.RLock()
	shouldRedirect := mux.shouldRedirectRLocked(host, path)
	mux.mu.RUnlock()
	if !shouldRedirect {
		return u, false
	}
	path = path + "/"
	u = &url.URL{Path: path, RawQuery: u.RawQuery}
	return u, true
}

See golang/go@2fd1b52.

I'm not sure what to do in the short term. I suppose we could add our own protection around ServeMux, though that is somewhat unfortunate.

@petermattis
Copy link
Collaborator Author

I believe this test has been flaky since its introduction almost a year ago. Cc @tschottdorf for tickling bugs in the stdlib.

petermattis added a commit to petermattis/cockroach that referenced this issue Aug 29, 2018
In go1.10 and earlier, it was not safe to call `http.ServeMux.ServeHTTP`
concurrently with `http.ServeMux.Handle`. (This is fixed in go1.11). In
the interim, provide our own safeServeMux wrapper that provides proper
locking.

Fixes cockroachdb#29227

Release note: None
craig bot pushed a commit that referenced this issue Aug 29, 2018
29230: server: deflake TestRapidRestarts r=benesch a=petermattis

In go1.10 and earlier, it was not safe to call `http.ServeMux.ServeHTTP`
concurrently with `http.ServeMux.Handle`. (This is fixed in go1.11). In
the interim, provide our own safeServeMux wrapper that provides proper
locking.

Fixes #29227

Release note: None

29236: Revert "storage: enable the merge queue by default" r=tschottdorf a=benesch

This reverts commit 98ca1d0. The merge
queue will be reenabled once flaky tests are fixed.

To reviewers: I'd much prefer to merge #29235. But if that gets stuck in code review or the flakiness reaches a breaking point, feel free to merge this instead.

Co-authored-by: Peter Mattis <[email protected]>
Co-authored-by: Nikhil Benesch <[email protected]>
@craig craig bot closed this as completed in #29230 Aug 29, 2018
petermattis added a commit to petermattis/cockroach that referenced this issue Aug 29, 2018
In go1.10 and earlier, it was not safe to call `http.ServeMux.ServeHTTP`
concurrently with `http.ServeMux.Handle`. (This is fixed in go1.11). In
the interim, provide our own safeServeMux wrapper that provides proper
locking.

Fixes cockroachdb#29227

Release note: None
craig bot pushed a commit that referenced this issue Aug 29, 2018
29259: release-2.1: server: deflake TestRapidRestarts r=benesch a=petermattis

Backport 1/1 commits from #29230.

/cc @cockroachdb/release

---

In go1.10 and earlier, it was not safe to call `http.ServeMux.ServeHTTP`
concurrently with `http.ServeMux.Handle`. (This is fixed in go1.11). In
the interim, provide our own safeServeMux wrapper that provides proper
locking.

Fixes #29227

Release note: None


Co-authored-by: Peter Mattis <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-server Relating to the KV-level RPC server C-investigation Further steps needed to qualify. C-label will change.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant