kvserver: persistent outage when liveness leaseholder deadlocks #80713
Comments
What's interesting is that we do have a disk stall roachtest, but it didn't catch this: `pkg/cmd/roachtest/tests/disk_stall.go` (lines 1 to 139 at 21a286f).
This seems closely related to #81100, so I'll pick this one up while I'm at it.
Took a first stab at this over in #81137, where DistSender will prioritize available nodes over unavailable ones (since the liveness leaseholder becomes unavailable when its disk stalls). However, this has a bunch of issues -- not least that when the liveness leaseholder goes down, all nodes will be considered unavailable, so requests can still get stuck on the old leaseholder anyway if it sorts first (e.g. by latency).

I think this issue points towards a more fundamental flaw in the whole model we use around lease acquisition: we rely on KV clients to detect failed leaseholders and to contact new candidates to trigger lease acquisitions. If the clients fail to do either of these two tasks, leases won't move. Some other random ideas:

Thoughts or ideas?
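For illustration only, here is a rough sketch of the kind of ordering #81137 describes, and of the failure mode mentioned above. The type and function names are made up for this sketch, not the actual DistSender code:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// ReplicaInfo is a hypothetical stand-in for a replica descriptor plus the
// health/latency data the client has on hand; it is not a CockroachDB type.
type ReplicaInfo struct {
	NodeID  int
	Healthy bool          // e.g. derived from node liveness / RPC health checks
	Latency time.Duration // observed RPC latency to the node
}

// sortByHealthThenLatency orders replicas so that replicas on nodes believed
// to be available are tried first, with latency as a tiebreaker. Note the
// failure mode discussed above: if every node looks unavailable (because the
// liveness leaseholder itself is stalled), this degenerates to latency-only
// ordering and can still put the stalled leaseholder first.
func sortByHealthThenLatency(replicas []ReplicaInfo) {
	sort.SliceStable(replicas, func(i, j int) bool {
		if replicas[i].Healthy != replicas[j].Healthy {
			return replicas[i].Healthy // healthy nodes sort first
		}
		return replicas[i].Latency < replicas[j].Latency
	})
}

func main() {
	replicas := []ReplicaInfo{
		{NodeID: 1, Healthy: false, Latency: 1 * time.Millisecond}, // stalled leaseholder, lowest latency
		{NodeID: 2, Healthy: true, Latency: 5 * time.Millisecond},
		{NodeID: 3, Healthy: true, Latency: 3 * time.Millisecond},
	}
	sortByHealthThenLatency(replicas)
	fmt.Println(replicas) // n3 and n2 are tried before the stalled n1
}
```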
Looking for something to play with over breather week. Maybe it's this one? Perhaps we can limit the scope by focusing on system ranges if needed. In #81100 (comment), @erikgrinaker says:
To me, this suggests a slightly different understanding of the outage than what @tbg said in this ticket:
Which is right? Was there a new available leaseholder, or was there not one because no kvclient ever poked a functional kvserver to trigger a lease acquisition, given that the NLHE was never returned? Erik's theory makes concrete sense to me, FWIW.
I like all these ideas. I have a weird related idea:
Then there'd be no need for the NLHE. At some level, is the problem that we require the kvserver to send an NLHE (when that kvserver may be non-functional), rather than that we require a kvclient to send a message to a kvserver in order to kick off a lease acquisition attempt? I sort of wonder if the cost of sending back an "I'm a leaseholder" message on every RPC to the kvserver is too high to stomach? But IDK; perf stuff like this is hard for me to judge. I thought I'd share the idea anyway.
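To make the shape of that idea concrete, here is a hypothetical sketch -- made-up types, not the real BatchResponse header or range cache -- of a per-response leaseholder assertion and the corresponding client-side cache update:

```go
package main

import "fmt"

// ResponseHeader is a hypothetical wire type sketching the "assert
// leaseholdership on every response" idea; it is not the actual proto.
type ResponseHeader struct {
	RangeID       int64
	IsLeaseholder bool  // set by the server on every RPC it serves as leaseholder
	LeaseSequence int64 // could be used to ignore stale assertions
}

// LeaseCache is a stand-in for the client-side leaseholder cache.
type LeaseCache struct {
	leaseholder map[int64]int // rangeID -> nodeID
}

// observe updates the cache from any response, so the client learns about
// lease changes without needing an NLHE from a possibly wedged node.
func (c *LeaseCache) observe(fromNodeID int, h ResponseHeader) {
	if h.IsLeaseholder {
		c.leaseholder[h.RangeID] = fromNodeID
	} else if c.leaseholder[h.RangeID] == fromNodeID {
		// The node we thought held the lease says it no longer does:
		// evict the entry so the next request probes other replicas.
		delete(c.leaseholder, h.RangeID)
	}
}

func main() {
	c := &LeaseCache{leaseholder: map[int64]int{7: 1}} // stale: we still think n1 holds the lease
	// Any response from n3 asserts that it now holds the lease for range 7,
	// so the cache moves off the (possibly wedged) n1 without an NLHE.
	c.observe(3, ResponseHeader{RangeID: 7, IsLeaseholder: true, LeaseSequence: 2})
	fmt.Println(c.leaseholder) // map[7:3]
}
```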
Most likely there was no other leaseholder, since we've found that nodes will keep retrying the previous leaseholder until they get a response. Unless someone else happened to send a request to a different replica, for some reason. Can probably find out by digging through the debug.zip, don't remember the details of the original incident. The no-leaseholder scenario will need a solution in any case, which I suspect will indirectly solve the other-leaseholder scenario.
Meh. I don't think any of them are particularly good, but given the current state this is what I could come up with for short/medium-term fixes. What I'd ideally want here is for replicas to sort out the lease even in the absence of client requests, similarly to how vanilla Raft will transfer leadership even if there are no proposals.
We sort of have this channel already, in the RPC heartbeats. But it doesn't carry lease information currently.
I think both. Ideally, we'd want replicas to sort out the lease between themselves, and for clients to efficiently discover lease changes. These are currently conflated, but they're really separate concerns.
Been a while since I looked at RPC heartbeats, but piggybacking on them might be viable. I think doing it per-batch would possibly be too expensive, and it's unclear whether it'd be better than an aggregate signal. We could e.g. send a sparse bitmap of ranges that we're actively asserting leases for, or something. This would have to integrate with quiescence somehow.
All makes sense! Good point re: aggregate signal maybe being just as good.
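As a very rough illustration of the aggregate-signal variant -- hypothetical types, not the actual heartbeat proto, and it glosses over quiescence entirely -- the heartbeat payload and a bulk cache update could look something like this:

```go
package main

import "fmt"

// HeartbeatPayload is a hypothetical aggregate "I'm asserting the lease for
// these ranges" signal. A real implementation would likely use a compressed
// or sparse bitmap rather than a plain slice, and would need to account for
// quiesced ranges.
type HeartbeatPayload struct {
	NodeID         int
	LeasedRangeIDs []int64
}

// LeaseCache maps rangeID -> nodeID of the last known leaseholder.
type LeaseCache map[int64]int

// applyHeartbeat bulk-updates the cache from one heartbeat. Entries that
// pointed at this node but are missing from the payload are evicted, so a
// node that has lost (or stopped asserting) a lease stops being contacted.
func (c LeaseCache) applyHeartbeat(hb HeartbeatPayload) {
	asserted := make(map[int64]bool, len(hb.LeasedRangeIDs))
	for _, id := range hb.LeasedRangeIDs {
		asserted[id] = true
		c[id] = hb.NodeID
	}
	for rangeID, nodeID := range c {
		if nodeID == hb.NodeID && !asserted[rangeID] {
			delete(c, rangeID)
		}
	}
}

func main() {
	cache := LeaseCache{7: 1, 9: 1, 12: 2}
	// n1's next heartbeat only asserts the lease for range 7: range 9 must
	// have moved (or n1 stopped asserting it), so the stale entry is dropped.
	cache.applyHeartbeat(HeartbeatPayload{NodeID: 1, LeasedRangeIDs: []int64{7}})
	fmt.Println(cache) // map[7:1 12:2]
}
```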
There may not have been one here, but it looks like https://github.com/cockroachlabs/support/issues/1808 is a possible other instance of the same problem. There, it looks as though only n5 didn't manage to update its cache; other nodes seem to be reaching liveness OK. But as Erik pointed out, solving one solves the other: the issue is that we can't rely on "proactive" invalidation of a cached leaseholder.
A customer ran into this again recently: https://cockroachdb.zendesk.com/agent/tickets/19526
Describe the problem
See #79648.
Quoting an internal support issue:
To Reproduce
See the PR above (note that the second commit contains a hacky fix; remove it to get a repro).
Expected behavior
Failover as usual
Additional data / screenshots
Environment:
Additional context
Persistent total cluster outage, since all nodes failed liveness heartbeats.
Resolved only when the deadlocked node was brought down.
Jira issue: CRDB-15539
gz#13737
Epic CRDB-19227
gz#19526