-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
healthcheck: refactor to use less locking #3919
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After an initial (non-exhaustive) pass on this change one high level reaction is that it seems potentially confusing to have two TabletStats
structs for each downstream tablet, one in the tabletHealth
data structure and another for each healthCheckConn
.
I wonder if it would be possible to remove the one from healthCheckConn
since that's no longer used for making routing decisions and then we can accomplish the goals of the refactor without the duplication.
go/vt/discovery/healthcheck.go
Outdated
tabletStats TabletStats | ||
loggedServingState bool | ||
lastResponseTimestamp time.Time // timestamp of the last healthcheck response | ||
} | ||
|
||
// tabletHealth maintains the health status of a tablet fed by the health check connection. | ||
type tabletHealth struct { | ||
cancelFunc context.CancelFunc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be helpful to have some comments indicating what each of these fields does. In particular I think this used to be called streamCancelFunc
which made it (somewhat) clearer what it did.
Also it's not clear from an initial read what the relationship is between the tabletStats
in this struct vs the one in healthCheckConn.
IMO renaming to something like latestTabletStats
would make it clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cancelFunc
wasn't renamed. It was moved from hcc into here. The streamCancelFunc
that was in hcc is now a local variable within the goroutine that runs the healthcheck.
Explained the relationship between hcc
and th
in the comments.
Renamed tabletStats
->latestTabletStats
go/vt/discovery/healthcheck.go
Outdated
hcc.conn = nil | ||
if hcc.conn != nil { | ||
// Don't use hcc.ctx because it's already closed. | ||
hcc.conn.Close(context.Background()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we found out the hard way, under certain circumstances this could take a long time. so I think it's cleaner to update the stats to mark the connection as not serving before doing the connection close which is how I did it before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Done.
// servingStatus feeds into the serving var, which keeps track of the serving | ||
// status transmitted by the tablet. | ||
servingStatus := make(chan bool, 5) | ||
serving := hcc.tabletStats.Serving |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be clearer if serving
was declared inside the watcher goroutine and not outside.
It's correct as written but since they're only used in the context of the watcher I think that's clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/vt/discovery/healthcheck.go
Outdated
if !serving { | ||
continue | ||
} | ||
timedout = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
timedout
is set here inside the watcher goroutine but is then read outside in the main thread.
This seems like a data race to me since there's no synchronization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. I've changed timedout to use atomic operations.
go/vt/discovery/healthcheck.go
Outdated
// We received a message. Reset the back-off. | ||
retryDelay = hc.retryDelay | ||
servingStatus <- shr.Serving |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although very very unlikely in practice, it seems like bad things would happen if the watcher goroutine got stalled and the servingStatus
channel does actually fill up. Instead of 5 slots would it make more sense to just have one and then select here in the unlikely event it's blocked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
go/vt/discovery/healthcheck.go
Outdated
return | ||
} | ||
oldts := th.tabletStats | ||
th.tabletStats = *ts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of calling copy()
on the way in, it seems more defensive to copy it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing the copy here will reduce some duplication, but it may not separate the concerns correctly. It will introduce a requirement that healthCheckConn should not change the contents of ts
while updateHealth is executing.
But I could be convinced otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most comments addressed, and some responses inline.
On the high level approach of tabletHealth vs healthCheckConn: The separation was needed to simplify the locking scheme. In the new world, healthCheckConn is internal to the checkConn goroutine. For each tabletHealth, a checkConn is launched, which is responsible for providing health updates. So, tabletHealth only cares about the goroutine, and knows nothing of healthCheckConn.
I've added a TODO for moving healthCheckConn and the checkConn to their own file.
go/vt/discovery/healthcheck.go
Outdated
tabletStats TabletStats | ||
loggedServingState bool | ||
lastResponseTimestamp time.Time // timestamp of the last healthcheck response | ||
} | ||
|
||
// tabletHealth maintains the health status of a tablet fed by the health check connection. | ||
type tabletHealth struct { | ||
cancelFunc context.CancelFunc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cancelFunc
wasn't renamed. It was moved from hcc into here. The streamCancelFunc
that was in hcc is now a local variable within the goroutine that runs the healthcheck.
Explained the relationship between hcc
and th
in the comments.
Renamed tabletStats
->latestTabletStats
go/vt/discovery/healthcheck.go
Outdated
return | ||
} | ||
oldts := th.tabletStats | ||
th.tabletStats = *ts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing the copy here will reduce some duplication, but it may not separate the concerns correctly. It will introduce a requirement that healthCheckConn should not change the contents of ts
while updateHealth is executing.
But I could be convinced otherwise.
go/vt/discovery/healthcheck.go
Outdated
hcc.conn = nil | ||
if hcc.conn != nil { | ||
// Don't use hcc.ctx because it's already closed. | ||
hcc.conn.Close(context.Background()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch. Done.
// servingStatus feeds into the serving var, which keeps track of the serving | ||
// status transmitted by the tablet. | ||
servingStatus := make(chan bool, 5) | ||
serving := hcc.tabletStats.Serving |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
go/vt/discovery/healthcheck.go
Outdated
// We received a message. Reset the back-off. | ||
retryDelay = hc.retryDelay | ||
servingStatus <- shr.Serving |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense.
go/vt/discovery/healthcheck.go
Outdated
if !serving { | ||
continue | ||
} | ||
timedout = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch. I've changed timedout to use atomic operations.
I had to move the serving assignment out of the goroutine. It introduced a data race. |
Instead of looping through every stream, this change makes every stream goroutine perform its own timeout check. Signed-off-by: Sugu Sougoumarane <[email protected]>
This change gets rid of all locks in hcc. Instead we use the approach of "sharing by communicating". Whenever there is a change in state, hcc communicates the change to hc, which then performs the necessary updates and handling. Also, now that hc updates are trivial, the lock has been changed to a simple Mutex, which is more efficient that RWMutex. Signed-off-by: Sugu Sougoumarane <[email protected]>
Signed-off-by: Sugu Sougoumarane <[email protected]>
It turns out that moving the serving assignment within the goroutine introduces a data race. I've moved it back out. Also found another incidental data race: did you know that functions like t.Fatalf should not be called from goroutines? Signed-off-by: Sugu Sougoumarane <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made another pass through this after it being in the queue for a long time...
In general I think this refactor is a great step in the right direction, but there are a number of suggestions to improve things even more and one possibly dangerous bug lurking in here that I think should be fixed.
if conn == nil { | ||
var err error | ||
conn, err = tabletconn.GetDialer()(hcc.tabletStats.Tablet, grpcclient.FailFast(true)) | ||
if hcc.conn == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was already an issue before this change, but it's hard to reason about why some methods are attached to the healthCheckConn
(taking the HealthCheckImpl
as a parameter), while others are attached to the impl and take the conn as a parameter.
My intuition is that this would all be a lot clearer if it was consistent -- any functionality that refers to a single connection (checkConn, finalizeConn, stream, etc) would be defined on the conn struct, while functionality related to the overall data structure (UpdateHealth, AddTablet, etc) would be defined on the containing impl struct.
That would also help when you split out healthcheck_conn into a separate file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just tried making this change, but I encountered a readability problem: checkConn
calls hc.connsWG.Done()
, which belongs to HealthCheckImpl
. This would be confusing if we moved this to a different file. Also, checkConn
references multiple other things in hc
. So, we'll probably need to do a deeper refactor to clean up this mess.
But returns are diminishing.
// later. | ||
timedout := sync2.NewAtomicBool(false) | ||
serving := hcc.tabletStats.Serving | ||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this may need to bump up the overall threads running wg and then bump down when the goroutine exits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you see below, after the stream
call, we call streamCancel
which will ensure this goroutine will eventually exit. There's no need for this goroutine to exit before HC is closed. We just have to make sure we don't leak them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But I found a deeper problem. An inherent race between Close
and all the other AddTablet
, etc. I've now changed Close to set addrToHealth
to nil
as a guard to ensure nobody accepts more changes after HC is closed.
Also found a race between Close and other calls that modify state. That case is also fixed. Signed-off-by: Sugu Sougoumarane <[email protected]>
…ade the Golang version to `go1.21.5` (vitessio#3919) Co-authored-by: Florent Poinsard <[email protected]>
This change gets rid of all locks in hcc. Instead we use the approach
of "sharing by communicating". Whenever there is a change in state,
hcc communicates the change to hc, which then performs the necessary
updates and handling.
Now that hc updates are trivial, the lock has been changed to
a simple Mutex, which is more efficient that RWMutex.
Also, instead of looping through every stream, this change
makes every stream goroutine perform its own timeout check.