etcdserver: move rpc defrag notifier into backend. #16959

Open · wants to merge 1 commit into base: main
Conversation

siyuanfoundation (Contributor)

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

The problems with the current implementation are:

  1. There can be more than one RPC server (one for secure traffic, one for insecure).
  2. Each RPC server sets its serving state based only on defrag requests sent to itself, unaware of the others, so one RPC server can keep serving traffic while the other has requested a defrag.

For these reasons, I think it is necessary to move the defrag notifiers further back, into the backend, as sketched below.
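
A minimal sketch of the intended shape (the names DefragNotifier, SubscribeDefragNotifier, and defragNotifiers are illustrative assumptions, not necessarily the identifiers used in this PR): the backend owns the subscriber list and fans out start/finish notifications, so every gRPC server observes the same defrag state regardless of which server received the Defragment RPC.

	// DefragNotifier is implemented by anything that needs to react to
	// defragmentation, e.g. one gRPC health notifier per server.
	type DefragNotifier interface {
		DefragStarted()
		DefragFinished()
	}

	// SubscribeDefragNotifier registers a notifier with the backend. Both the
	// secure and the insecure gRPC server would register here, so neither keeps
	// serving while a defrag requested through the other is in progress.
	func (b *backend) SubscribeDefragNotifier(n DefragNotifier) {
		if n == nil {
			return
		}
		b.mu.Lock()
		defer b.mu.Unlock()
		b.defragNotifiers = append(b.defragNotifiers, n)
	}

	// defragStarted is called from defrag() to fan the event out to every
	// subscriber; defragFinished would mirror it.
	func (b *backend) defragStarted() {
		for _, n := range b.defragNotifiers {
			n.DefragStarted()
		}
	}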

siyuanfoundation (Contributor, Author)

cc @chaochn47 @serathius

Comment on lines 459 to 461
if notifier == nil {
return
}
Member:

Why this check? It doesn't protect against nil pointer exceptions.

It will work for

b.SubscribeDefragNotifier(nil)

But not for

var notifier *healthNotifier
b.SubscribeDefragNotifier(notifier)

Because nil != (*healthNotifier)(nil)
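
In other words, this is Go's typed-nil interface behavior: an interface value holding a (*healthNotifier)(nil) pointer is not itself nil, so a notifier == nil guard only catches the untyped-nil case. A small self-contained illustration (hypothetical types, just to show the comparison):

	package main

	import "fmt"

	type DefragNotifier interface{ DefragStarted() }

	type healthNotifier struct{}

	func (h *healthNotifier) DefragStarted() {}

	func main() {
		var p *healthNotifier    // typed nil pointer
		var a DefragNotifier = p // interface holds (type=*healthNotifier, value=nil)
		var b DefragNotifier     // untyped nil interface

		fmt.Println(a == nil) // false: the interface carries a concrete type, so it is not nil
		fmt.Println(b == nil) // true
	}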

siyuanfoundation (Contributor, Author) commented Nov 17, 2023:

This check is to protect the line

mc := adapter.MaintenanceServerToMaintenanceClient(v3rpc.NewMaintenanceServer(s, nil))

@@ -459,6 +490,9 @@ func (b *backend) defrag() error {
// lock database after lock tx to avoid deadlock.
b.mu.Lock()
defer b.mu.Unlock()
// send notifications after acquiring the lock.
b.defragStarted()
Member:
Notifying listeners under the lock is a simple and correct way to fix the race of two goroutines calling Defrag at once. However, the current implementation comes with one flaw: it introduces an external call while holding the lock.

The backend just calls and blocks on the notifiers without knowing what they actually do, so we should be really careful about introducing a blocking call here. Do you know whether the call .SetServingStatus(allGRPCServices, healthpb.HealthCheckResponse_SERVING) is blocking?

siyuanfoundation (Contributor, Author):

The SetServingStatus call is not blocking. It sends the update to a channel, but it drains any pending update from the channel first, so the send cannot block. The lock is only held for simple map operations.

// SetServingStatus is called when need to reset the serving status of a service
// or insert a new service entry into the statusMap.
func (s *Server) SetServingStatus(service string, servingStatus healthpb.HealthCheckResponse_ServingStatus) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.shutdown {
		logger.Infof("health: status changing for %s to %v is ignored because health service is shutdown", service, servingStatus)
		return
	}

	s.setServingStatusLocked(service, servingStatus)
}

func (s *Server) setServingStatusLocked(service string, servingStatus healthpb.HealthCheckResponse_ServingStatus) {
	s.statusMap[service] = servingStatus
	for _, update := range s.updates[service] {
		// Clears previous updates, that are not sent to the client, from the channel.
		// This can happen if the client is not reading and the server gets flow control limited.
		select {
		case <-update:
		default:
		}
		// Puts the most recent update to the channel.
		update <- servingStatus
	}
}

Do you think we should make the defragStarted call in a separate goroutine to remove such concerns?
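
One possible shape of that, purely as a sketch of the suggestion (the snapshot-then-dispatch split is my assumption of how it could look, not code from this PR):

	func (b *backend) defragStarted() {
		// defrag() already holds b.mu here, so taking a snapshot of the
		// subscriber list is safe; the callbacks are then dispatched in a
		// separate goroutine so a slow notifier cannot stall defragmentation
		// while the backend lock is held.
		notifiers := make([]DefragNotifier, len(b.defragNotifiers))
		copy(notifiers, b.defragNotifiers)
		go func() {
			for _, n := range notifiers {
				n.DefragStarted()
			}
		}()
	}

The trade-off is that the status update is no longer guaranteed to have reached the health server before defrag starts blocking requests, which may or may not matter for the health-check semantics here.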

ahrtr (Member) commented Nov 18, 2023:

close/reopen to re-trigger all arm64 workflows


stale bot commented Mar 17, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

The stale bot added the stale label on Mar 17, 2024.
tjungblu (Contributor):

/remove-lifecycle stale

k8s-ci-robot:

@siyuanfoundation: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-etcd-unit-test-amd64 50c9868 link true /test pull-etcd-unit-test-amd64
pull-etcd-unit-test-arm64 50c9868 link true /test pull-etcd-unit-test-arm64

