healthcheck: refactor to use less locking #3919

Merged · 5 commits · Nov 18, 2018

Conversation

@sougou sougou commented May 11, 2018

This change gets rid of all locks in hcc. Instead we use the approach
of "sharing by communicating". Whenever there is a change in state,
hcc communicates the change to hc, which then performs the necessary
updates and handling.

Now that hc updates are trivial, the lock has been changed to
a simple Mutex, which is more efficient than RWMutex.

Also, instead of looping through every stream, this change
makes every stream goroutine perform its own timeout check.
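
For illustration, here is a minimal, self-contained sketch of the "sharing by communicating" pattern (the healthUpdate type, updates channel, and run loop are illustrative names, not the PR's actual code): each connection goroutine reports state changes over a channel, and only the owning loop mutates the shared map, so a plain sync.Mutex suffices for readers.

```go
package main

import (
	"fmt"
	"sync"
)

// healthUpdate is an illustrative message that a per-tablet connection
// goroutine ("hcc") sends to the central health check ("hc") whenever
// its state changes.
type healthUpdate struct {
	tablet  string
	serving bool
}

// healthCheck owns the shared map. Only its run loop writes it, so a
// plain sync.Mutex (to guard concurrent readers) is sufficient.
type healthCheck struct {
	mu      sync.Mutex
	serving map[string]bool
	updates chan healthUpdate
}

func (hc *healthCheck) run() {
	for u := range hc.updates {
		hc.mu.Lock()
		hc.serving[u.tablet] = u.serving
		hc.mu.Unlock()
	}
}

func main() {
	hc := &healthCheck{serving: map[string]bool{}, updates: make(chan healthUpdate)}
	done := make(chan struct{})
	go func() { hc.run(); close(done) }()

	// A connection goroutine reports a state change instead of mutating
	// hc's data structures under its own locks.
	hc.updates <- healthUpdate{tablet: "zone1-100", serving: true}

	close(hc.updates) // no more updates; run exits
	<-done
	fmt.Println(hc.serving["zone1-100"]) // true
}
```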

@sougou sougou requested a review from demmer May 11, 2018 01:04
@demmer demmer left a comment

After an initial (non-exhaustive) pass on this change one high level reaction is that it seems potentially confusing to have two TabletStats structs for each downstream tablet, one in the tabletHealth data structure and another for each healthCheckConn.

I wonder if it would be possible to remove the one from healthCheckConn since that's no longer used for making routing decisions and then we can accomplish the goals of the refactor without the duplication.

	tabletStats           TabletStats
	loggedServingState    bool
	lastResponseTimestamp time.Time // timestamp of the last healthcheck response
}

// tabletHealth maintains the health status of a tablet fed by the health check connection.
type tabletHealth struct {
	cancelFunc context.CancelFunc
demmer (Member):

It would be helpful to have some comments indicating what each of these fields does. In particular I think this used to be called streamCancelFunc which made it (somewhat) clearer what it did.

Also it's not clear from an initial read what the relationship is between the tabletStats in this struct vs the one in healthCheckConn.

IMO renaming to something like latestTabletStats would make it clearer.

sougou (Contributor, author):

cancelFunc wasn't renamed. It was moved from hcc into here. The streamCancelFunc that was in hcc is now a local variable within the goroutine that runs the healthcheck.

Explained the relationship between hcc and th in the comments.

Renamed tabletStats->latestTabletStats

hcc.conn = nil
if hcc.conn != nil {
// Don't use hcc.ctx because it's already closed.
hcc.conn.Close(context.Background())
demmer (Member):

As we found out the hard way, under certain circumstances this could take a long time, so I think it's cleaner to update the stats to mark the connection as not serving before doing the connection close, which is how I did it before.

sougou (Contributor, author):

Good catch. Done.
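
For illustration, a self-contained sketch of the ordering agreed on above (slowConn, tabletStats, and finalize are made-up stand-ins, not the PR's code): the not-serving state is published before the potentially slow Close.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// slowConn stands in for a tablet connection whose Close can block for a
// long time under some failure modes.
type slowConn struct{}

func (slowConn) Close(ctx context.Context) error {
	time.Sleep(50 * time.Millisecond) // simulate a slow close
	return nil
}

type tabletStats struct{ Serving bool }

// finalize publishes "not serving" first so routing stops using the tablet
// immediately, and only then performs the possibly slow connection close.
func finalize(stats *tabletStats, publish func(tabletStats), conn *slowConn) {
	stats.Serving = false
	publish(*stats)

	if conn != nil {
		// Use a fresh context: the stream's context is already cancelled.
		conn.Close(context.Background())
	}
}

func main() {
	stats := &tabletStats{Serving: true}
	publish := func(ts tabletStats) { fmt.Println("published serving =", ts.Serving) }
	finalize(stats, publish, &slowConn{})
}
```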

	// servingStatus feeds into the serving var, which keeps track of the serving
	// status transmitted by the tablet.
	servingStatus := make(chan bool, 5)
	serving := hcc.tabletStats.Serving
demmer (Member):

I think it would be clearer if serving were declared inside the watcher goroutine and not outside.

It's correct as written, but since they're only used in the context of the watcher I think that's clearer.

sougou (Contributor, author):

Done.

if !serving {
continue
}
timedout = true
demmer (Member):

timedout is set here inside the watcher goroutine but is then read outside in the main thread.

This seems like a data race to me since there's no synchronization.

sougou (Contributor, author):

Nice catch. I've changed timedout to use atomic operations.
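
For reference, a minimal sketch of the fix: the flag is shared between the watcher goroutine and the stream loop, so access must be synchronized. The PR uses vitess's sync2.AtomicBool; this self-contained example uses the standard library's sync/atomic instead.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	// timedout is written by the watcher goroutine and read afterwards by
	// the stream loop, so the accesses must be synchronized.
	var timedout atomic.Bool

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		// Watcher: no serving update arrived within the timeout window.
		time.Sleep(10 * time.Millisecond)
		timedout.Store(true)
	}()

	wg.Wait()
	if timedout.Load() {
		fmt.Println("healthcheck stream timed out")
	}
}
```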

// We received a message. Reset the back-off.
retryDelay = hc.retryDelay
servingStatus <- shr.Serving
demmer (Member):

Although very, very unlikely in practice, it seems like bad things would happen if the watcher goroutine got stalled and the servingStatus channel actually filled up. Instead of 5 slots, would it make more sense to have just one and then select here in the unlikely event it's blocked?

sougou (Contributor, author):

Makes sense.
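
For illustration, a sketch of the suggestion: a one-slot channel plus a select with a default case, so the stream goroutine never blocks if the watcher has stalled (the real code may drain and replace the stale value instead; names here are illustrative).

```go
package main

import "fmt"

func main() {
	// One buffered slot: the watcher only needs the latest serving state.
	servingStatus := make(chan bool, 1)

	send := func(serving bool) {
		select {
		case servingStatus <- serving:
			// Normal case: the watcher will pick this up.
		default:
			// Watcher is stalled and the slot is full; drop the update
			// instead of blocking the stream goroutine.
		}
	}

	send(true)
	send(false) // dropped: the slot is still occupied

	fmt.Println(<-servingStatus) // true
}
```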

return
}
oldts := th.tabletStats
th.tabletStats = *ts
demmer (Member):

Instead of calling copy() on the way in, it seems more defensive to copy it here.

sougou (Contributor, author):

Doing the copy here will reduce some duplication, but it may not separate the concerns correctly. It will introduce a requirement that healthCheckConn should not change the contents of ts while updateHealth is executing.

But I could be convinced otherwise.

@sougou sougou left a comment

Most comments addressed, and some responses inline.

On the high level approach of tabletHealth vs healthCheckConn: The separation was needed to simplify the locking scheme. In the new world, healthCheckConn is internal to the checkConn goroutine. For each tabletHealth, a checkConn is launched, which is responsible for providing health updates. So, tabletHealth only cares about the goroutine, and knows nothing of healthCheckConn.

I've added a TODO for moving healthCheckConn and the checkConn to their own file.
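
For illustration, a simplified, self-contained sketch of that ownership split (all types and fields are stand-ins, not the PR's code): hc keeps only a tabletHealth holding the goroutine's cancel func and the latest reported stats, while healthCheckConn lives entirely inside its checkConn goroutine and reports back via updateHealth.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type tabletStats struct {
	tablet  string
	serving bool
}

// tabletHealth is all hc keeps per tablet: how to stop the checkConn
// goroutine, and the latest stats that goroutine has reported.
type tabletHealth struct {
	cancelFunc        context.CancelFunc
	latestTabletStats tabletStats
}

// healthCheckConn is owned by exactly one checkConn goroutine and is never
// shared, so it needs no locking of its own.
type healthCheckConn struct {
	tabletStats tabletStats
}

type healthCheck struct {
	mu           sync.Mutex
	addrToHealth map[string]*tabletHealth
}

func (hc *healthCheck) updateHealth(ts tabletStats) {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	if th, ok := hc.addrToHealth[ts.tablet]; ok {
		th.latestTabletStats = ts
	}
}

// checkConn simulates the health stream: on every "response" it updates its
// private state and communicates the change to hc.
func (hcc *healthCheckConn) checkConn(ctx context.Context, hc *healthCheck) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(5 * time.Millisecond):
			hcc.tabletStats.serving = true
			hc.updateHealth(hcc.tabletStats)
		}
	}
}

func (hc *healthCheck) addTablet(tablet string) {
	ctx, cancel := context.WithCancel(context.Background())
	hc.mu.Lock()
	hc.addrToHealth[tablet] = &tabletHealth{cancelFunc: cancel}
	hc.mu.Unlock()
	go (&healthCheckConn{tabletStats: tabletStats{tablet: tablet}}).checkConn(ctx, hc)
}

func main() {
	hc := &healthCheck{addrToHealth: map[string]*tabletHealth{}}
	hc.addTablet("zone1-100")
	time.Sleep(20 * time.Millisecond)

	hc.mu.Lock()
	fmt.Println(hc.addrToHealth["zone1-100"].latestTabletStats.serving) // true
	hc.addrToHealth["zone1-100"].cancelFunc()
	hc.mu.Unlock()
}
```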


sougou commented May 26, 2018

I had to move the serving assignment out of the goroutine. It introduced a data race.

Instead of looping through every stream, this change
makes every stream goroutine perform its own timeout check.

Signed-off-by: Sugu Sougoumarane <[email protected]>
This change gets rid of all locks in hcc. Instead we use the approach
of "sharing by communicating". Whenever there is a change in state,
hcc communicates the change to hc, which then performs the necessary
updates and handling.

Also, now that hc updates are trivial, the lock has been changed to
a simple Mutex, which is more efficient than RWMutex.

Signed-off-by: Sugu Sougoumarane <[email protected]>
Signed-off-by: Sugu Sougoumarane <[email protected]>
It turns out that moving the serving assignment within the
goroutine introduces a data race. I've moved it back out.

Also found another incidental data race: did you know that
functions like t.Fatalf should not be called from goroutines?

Signed-off-by: Sugu Sougoumarane <[email protected]>
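
(An aside on the t.Fatalf remark above: the testing package documents that FailNow, and therefore Fatalf, must be called from the goroutine running the test function, so goroutines spawned by a test should report failures back instead. A minimal sketch, with illustrative names:)

```go
package example

import "testing"

func TestStreamWorker(t *testing.T) {
	done := make(chan error, 1)

	go func() {
		// Do not call t.Fatalf here: it would stop this goroutine, not the
		// test. Either use t.Errorf, or hand the error back to the test
		// goroutine, as done here.
		done <- doWork()
	}()

	if err := <-done; err != nil {
		t.Fatalf("worker failed: %v", err) // safe: we're on the test goroutine
	}
}

// doWork stands in for whatever the spawned goroutine does.
func doWork() error { return nil }
```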
@demmer demmer left a comment

I made another pass through this after it had been in the queue for a long time...

In general I think this refactor is a great step in the right direction, but there are a number of suggestions to improve things even more and one possibly dangerous bug lurking in here that I think should be fixed.

(Several other review threads on go/vt/discovery/healthcheck.go were resolved and are not shown.)
if conn == nil {
var err error
conn, err = tabletconn.GetDialer()(hcc.tabletStats.Tablet, grpcclient.FailFast(true))
if hcc.conn == nil {
demmer (Member):

This was already an issue before this change, but it's hard to reason about why some methods are attached to the healthCheckConn (taking the HealthCheckImpl as a parameter), while others are attached to the impl and take the conn as a parameter.

My intuition is that this would all be a lot clearer if it was consistent -- any functionality that refers to a single connection (checkConn, finalizeConn, stream, etc) would be defined on the conn struct, while functionality related to the overall data structure (UpdateHealth, AddTablet, etc) would be defined on the containing impl struct.

That would also help when you split out healthcheck_conn into a separate file.

sougou (Contributor, author):

I just tried making this change, but I encountered a readability problem: checkConn calls hc.connsWG.Done(), which belongs to HealthCheckImpl. This would be confusing if we moved this to a different file. Also, checkConn references multiple other things in hc. So, we'll probably need to do a deeper refactor to clean up this mess.

But returns are diminishing.
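
For illustration, a tiny sketch of the receiver convention being discussed (stub types, not the actual vitess signatures): per-connection behavior hangs off the connection struct, aggregate behavior off the containing impl.

```go
package sketch

// Stub types standing in for the real ones.
type Tablet struct{}
type TabletStats struct{}
type healthCheckConn struct{}
type HealthCheckImpl struct{}

// Per-connection behavior: methods on the connection struct, taking the
// containing impl as a parameter where needed.
func (hcc *healthCheckConn) checkConn(hc *HealthCheckImpl)    {}
func (hcc *healthCheckConn) finalizeConn(hc *HealthCheckImpl) {}

// Whole-structure behavior: methods on the impl.
func (hc *HealthCheckImpl) AddTablet(tablet *Tablet)    {}
func (hc *HealthCheckImpl) updateHealth(ts TabletStats) {}
```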

	// later.
	timedout := sync2.NewAtomicBool(false)
	serving := hcc.tabletStats.Serving
	go func() {
demmer (Member):

I think this may need to bump up the overall running-threads wait group (wg) and then bump it down when the goroutine exits?

sougou (Contributor, author):

If you see below, after the stream call, we call streamCancel which will ensure this goroutine will eventually exit. There's no need for this goroutine to exit before HC is closed. We just have to make sure we don't leak them.

sougou (Contributor, author):

But I found a deeper problem: an inherent race between Close and all the other calls such as AddTablet. I've now changed Close to set addrToHealth to nil as a guard, to ensure nothing accepts more changes after HC is closed.
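
For illustration, a minimal sketch of that guard (simplified types, not the real code): Close nils out addrToHealth under the lock, and mutating calls check for nil so nothing is accepted after shutdown.

```go
package main

import (
	"fmt"
	"sync"
)

type tabletHealth struct{}

type healthCheck struct {
	mu           sync.Mutex
	addrToHealth map[string]*tabletHealth
}

// AddTablet refuses new work once Close has run.
func (hc *healthCheck) AddTablet(key string) {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	if hc.addrToHealth == nil {
		return // already closed
	}
	hc.addrToHealth[key] = &tabletHealth{}
}

// Close marks the health check as shut down by nil-ing the map, so any
// later mutating call becomes a no-op.
func (hc *healthCheck) Close() {
	hc.mu.Lock()
	defer hc.mu.Unlock()
	hc.addrToHealth = nil
}

func main() {
	hc := &healthCheck{addrToHealth: map[string]*tabletHealth{}}
	hc.AddTablet("zone1-100")
	hc.Close()
	hc.AddTablet("zone1-101")         // ignored: hc is closed
	fmt.Println(len(hc.addrToHealth)) // 0
}
```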

Also found a race between Close and other calls that modify state.
That case is also fixed.

Signed-off-by: Sugu Sougoumarane <[email protected]>
@sougou sougou merged commit 606738d into vitessio:master Nov 18, 2018
@sougou sougou deleted the healthcheck branch November 18, 2018 17:48