etcdserver: move rpc defrag notifier into backend. #16959

Open · wants to merge 1 commit into base: main
Conversation

siyuanfoundation (Contributor)

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

The problems with the current implementation are:

  1. There can be more than one RPC server (one for secure traffic, one for insecure).
  2. Each RPC server sets its serving state based only on defrag requests sent to itself, unaware of the others, so one RPC server can keep serving traffic while the other has requested a defrag.

For these reasons, I think it is necessary to move the defrag notifiers further back, into the backend, as sketched below.
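
A minimal sketch of the intended shape (the names DefragNotifier, SubscribeDefragNotifier, and defragNotifiers are illustrative assumptions, not necessarily the identifiers used in this PR): the backend owns the subscriber list and fans out start/finish notifications, so every gRPC server observes the same defrag state regardless of which server received the Defragment RPC.

	// DefragNotifier is implemented by anything that needs to react to
	// defragmentation, e.g. one gRPC health notifier per server.
	type DefragNotifier interface {
		DefragStarted()
		DefragFinished()
	}

	// SubscribeDefragNotifier registers a notifier with the backend. Both the
	// secure and the insecure gRPC server would register here, so neither keeps
	// serving while a defrag requested through the other is in progress.
	func (b *backend) SubscribeDefragNotifier(n DefragNotifier) {
		if n == nil {
			return
		}
		b.mu.Lock()
		defer b.mu.Unlock()
		b.defragNotifiers = append(b.defragNotifiers, n)
	}

	// defragStarted is called from defrag() to fan the event out to every
	// subscriber; defragFinished would mirror it.
	func (b *backend) defragStarted() {
		for _, n := range b.defragNotifiers {
			n.DefragStarted()
		}
	}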

siyuanfoundation (Contributor, Author)

cc @chaochn47 @serathius

Comment on lines 459 to 461
if notifier == nil {
return
}
Member:

Why this check? It doesn't protect against nil pointer exceptions.

It will work for

b.SubscribeDefragNotifier(nil)

But not for

var notifier *healthNotifier
b.SubscribeDefragNotifier(notifier)

Because nil != (*healthNotifier)(nil)
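
In other words, this is Go's typed-nil interface behavior: an interface value holding a (*healthNotifier)(nil) pointer is not itself nil, so a notifier == nil guard only catches the untyped-nil case. A small self-contained illustration (hypothetical types, just to show the comparison):

	package main

	import "fmt"

	type DefragNotifier interface{ DefragStarted() }

	type healthNotifier struct{}

	func (h *healthNotifier) DefragStarted() {}

	func main() {
		var p *healthNotifier    // typed nil pointer
		var a DefragNotifier = p // interface holds (type=*healthNotifier, value=nil)
		var b DefragNotifier     // untyped nil interface

		fmt.Println(a == nil) // false: the interface carries a concrete type, so it is not nil
		fmt.Println(b == nil) // true
	}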

siyuanfoundation (Contributor, Author) commented Nov 17, 2023:

This check is to protect the line

mc := adapter.MaintenanceServerToMaintenanceClient(v3rpc.NewMaintenanceServer(s, nil))

@@ -459,6 +490,9 @@ func (b *backend) defrag() error {
// lock database after lock tx to avoid deadlock.
b.mu.Lock()
defer b.mu.Unlock()
// send notifications after acquiring the lock.
b.defragStarted()
Member:
Notifying listeners under the lock is a simple and correct way to fix the race of two goroutines calling Defrag at once. However, the current implementation comes with one flaw: it introduces an external call while holding the lock.

The backend just calls and blocks on the notifiers without knowing what they actually do, so we should be really careful about introducing a blocking call here. Do you know whether the call .SetServingStatus(allGRPCServices, healthpb.HealthCheckResponse_SERVING) is blocking?

siyuanfoundation (Contributor, Author):

The SetServingStatus call is not blocking. It sends the update to a channel, but it drains any pending update from the channel first, so the send cannot block. The lock is only held for simple map operations.

// SetServingStatus is called when need to reset the serving status of a service
// or insert a new service entry into the statusMap.
func (s *Server) SetServingStatus(service string, servingStatus healthpb.HealthCheckResponse_ServingStatus) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.shutdown {
		logger.Infof("health: status changing for %s to %v is ignored because health service is shutdown", service, servingStatus)
		return
	}

	s.setServingStatusLocked(service, servingStatus)
}

func (s *Server) setServingStatusLocked(service string, servingStatus healthpb.HealthCheckResponse_ServingStatus) {
	s.statusMap[service] = servingStatus
	for _, update := range s.updates[service] {
		// Clears previous updates, that are not sent to the client, from the channel.
		// This can happen if the client is not reading and the server gets flow control limited.
		select {
		case <-update:
		default:
		}
		// Puts the most recent update to the channel.
		update <- servingStatus
	}
}

Do you think we should make the defragStarted call in a separate goroutine to remove such concerns?
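
One possible shape of that, purely as a sketch of the suggestion (the snapshot-then-dispatch split is my assumption of how it could look, not code from this PR):

	func (b *backend) defragStarted() {
		// defrag() already holds b.mu here, so taking a snapshot of the
		// subscriber list is safe; the callbacks are then dispatched in a
		// separate goroutine so a slow notifier cannot stall defragmentation
		// while the backend lock is held.
		notifiers := make([]DefragNotifier, len(b.defragNotifiers))
		copy(notifiers, b.defragNotifiers)
		go func() {
			for _, n := range notifiers {
				n.DefragStarted()
			}
		}()
	}

The trade-off is that the status update is no longer guaranteed to have reached the health server before defrag starts blocking requests, which may or may not matter for the health-check semantics here.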

ahrtr (Member) commented Nov 18, 2023:

close/reopen to re-trigger all arm64 workflows


stale bot commented Mar 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Mar 17, 2024.
tjungblu (Contributor):

/remove-lifecycle stale

k8s-ci-robot:

@siyuanfoundation: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-unit-test-amd64 50c9868 link true /test pull-etcd-unit-test-amd64
pull-etcd-unit-test-arm64 50c9868 link true /test pull-etcd-unit-test-arm64

