Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

teamcity: failed tests on master: testrace/TestHeartbeatHealth, test/TestHeartbeatHealth #27340

Closed
cockroach-teamcity opened this issue Jul 10, 2018 · 4 comments
Assignees
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

The following tests appear to have failed:

#771008:

--- FAIL: testrace/TestHeartbeatHealth (0.220s)
context_test.go:268: wanted ErrNotHeartbeated, not rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address notlocalserver: missing port in address"
--- FAIL: test/TestHeartbeatHealth (0.260s)
context_test.go:268: wanted ErrNotHeartbeated, not rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address notlocalserver: missing port in address"

Please assign, take a look and update the issue accordingly.

@cockroach-teamcity cockroach-teamcity added this to the 2.1 milestone Jul 10, 2018
@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jul 10, 2018
@tbg
Copy link
Member

tbg commented Jul 10, 2018

@m-schneider, could you take a look at this? Likely caused by your recent PR.

@m-schneider
Copy link
Contributor

I took a look, it looks like the first breakage was June 13th and I could reproduce it before my PR. It looks like there's a race where the error gets set here

func (ctx *Context) ConnHealth(target string) error {
if ctx.GetLocalInternalServerForAddr(target) != nil {
// The local server is always considered healthy.
return nil
}
conn := ctx.GRPCDial(target)
return conn.heartbeatResult.Load().(heartbeatResult).err
}
func (ctx *Context) runHeartbeat(
conn *Connection, target string, redialChan <-chan struct{},
) error {
maxOffset := ctx.LocalClock.MaxOffset()
clusterID := ctx.ClusterID.Get()
request := PingRequest{
Addr: ctx.Addr,
MaxOffsetNanos: maxOffset.Nanoseconds(),
ClusterID: &clusterID,
ServerVersion: ctx.version.ServerVersion,
}
heartbeatClient := NewHeartbeatClient(conn.grpcConn)
var heartbeatTimer timeutil.Timer
defer heartbeatTimer.Stop()
// Give the first iteration a wait-free heartbeat attempt.
heartbeatTimer.Reset(0)
everSucceeded := false
for {
select {
case <-redialChan:
return grpcutil.ErrCannotReuseClientConn
case <-ctx.Stopper.ShouldStop():
return nil
case <-heartbeatTimer.C:
heartbeatTimer.Read = true
}
goCtx := ctx.masterCtx
var cancel context.CancelFunc
if hbTimeout := ctx.heartbeatTimeout; hbTimeout > 0 {
goCtx, cancel = context.WithTimeout(goCtx, hbTimeout)
}
sendTime := ctx.LocalClock.PhysicalTime()
// NB: We want the request to fail-fast (the default), otherwise we won't
// be notified of transport failures.
response, err := heartbeatClient.Ping(goCtx, &request)
if cancel != nil {
cancel()
}
if err == nil {
err = errors.Wrap(
checkVersion(ctx.version, response.ServerVersion),
"version compatibility check failed on ping response")
}
if err == nil {
everSucceeded = true
receiveTime := ctx.LocalClock.PhysicalTime()
// Only update the clock offset measurement if we actually got a
// successful response from the server.
pingDuration := receiveTime.Sub(sendTime)
maxOffset := ctx.LocalClock.MaxOffset()
if maxOffset != timeutil.ClocklessMaxOffset &&
pingDuration > maximumPingDurationMult*maxOffset {
request.Offset.Reset()
} else {
// Offset and error are measured using the remote clock reading
// technique described in
// http://se.inf.tu-dresden.de/pubs/papers/SRDS1994.pdf, page 6.
// However, we assume that drift and min message delay are 0, for
// now.
request.Offset.MeasuredAt = receiveTime.UnixNano()
request.Offset.Uncertainty = (pingDuration / 2).Nanoseconds()
remoteTimeNow := timeutil.Unix(0, response.ServerTime).Add(pingDuration / 2)
request.Offset.Offset = remoteTimeNow.Sub(receiveTime).Nanoseconds()
}
ctx.RemoteClocks.UpdateOffset(ctx.masterCtx, target, request.Offset, pingDuration)
if cb := ctx.HeartbeatCB; cb != nil {
cb()
}
}
conn.heartbeatResult.Store(heartbeatResult{
everSucceeded: everSucceeded,
err: err,

instead of
func newConnection(stopper *stop.Stopper) *Connection {
c := &Connection{
initialHeartbeatDone: make(chan struct{}),
stopper: stopper,
}
c.heartbeatResult.Store(heartbeatResult{err: ErrNotHeartbeated})

I think both seem valid.

@m-schneider
Copy link
Contributor

@bdarnell I traced the error back to what looks like this PR #24845. It takes a couple of runs of stressrace to reproduce. Is this an actual problem or do we need to change the test?

@bdarnell
Copy link
Contributor

I think the test needs to change. ConnHealth isn't supposed to work with strings that aren't of the form host:port, but it's only validating this asynchronously so it's rare (but possible) for the true error to make it back in time. So the strings non-existent-connection, localserver, and notlocalserver need to be replaced with real addresses which will not respond quickly.

I think the simplest way to do this is to create three listeners with net.Listen() and don't do anything with them, but use their addresses. Hopefully that will all clean up nicely if you close the listener at the end of the test.

m-schneider pushed a commit to m-schneider/cockroach that referenced this issue Jul 11, 2018
This test was sending strings to ConnHealth not of the form host:port,
which was being validated asynchronously and sometimes caused that
validation error to be recieved before an ErrNotHeartbeated error. Now we'll use
addresses of the correct form, so we should always receive the ErrNotHeartbeated
error.

Close cockroachdb#27340

Release note: None
m-schneider pushed a commit to m-schneider/cockroach that referenced this issue Jul 11, 2018
This test was sending strings to ConnHealth not of the form host:port,
which was being validated asynchronously and sometimes caused that
validation error to be recieved before an ErrNotHeartbeated error. Now we'll use
addresses of the correct form, so we should always receive the ErrNotHeartbeated
error.

Close cockroachdb#27340

Release note: None
craig bot pushed a commit that referenced this issue Jul 11, 2018
27408: rpc: fix TestHeartbeatHealth r=m-schneider a=m-schneider

This test was sending strings to ConnHealth not of the form host:port,
which was being validated asynchronously and sometimes caused that
validation error to be recieved before an ErrNotHeartbeated error. Now we'll use
addresses of the correct form, so we should always receive the ErrNotHeartbeated
error.

Close #27340

Release note: None

Co-authored-by: Masha Schneider <[email protected]>
@craig craig bot closed this as completed in #27408 Jul 11, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

No branches or pull requests

4 participants