roachtest: restore2TB/nodes=10 failed [stuck in kv] #61396
huh, this looks suspect: Looks like it likely got stuck waiting on stuck KV requests?
huh, lots and lots of
Just at first glance, I wonder if this is related to #61541.
It does look similar, though not the same. I have some of the same questions, like how did we get DistSender RPCs to spend hundreds of seconds in a single attempt without also having a slow replication attempt.
It worries me that #61541 and this issue are popping up now, as it's likely that we have regressed in some way.
Capturing what I saw so far.
[The test history suggests that the nightly test times also vary wildly](https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_Nightlies&testNameId=741854490244253129&tab=testDetails&branch_Cockroach_Nightlies=%3Cdefault%3E). I spent a lot of time a few years ago trying to get the number of snapshots down. For a good (if a bit outdated) read on why avoiding snapshots is unusually complex, see here (internal link).
Internal link to debug artifacts for a slow run (not the failing one here): https://drive.google.com/file/d/1Zm4l6TEk6IiEWpCx4vGG9qQigerXksnD/view?usp=sharing
In a group session with @aliher1911 @erikgrinaker @angelapwen we looked at logs of a recent slow restore2TB run and found evidence of ranges being slow to apply the split trigger. This then leads the right-hand sides of the splits (on the slow nodes) to ask for a snapshot (indicated via the message "would have dropped incoming MsgApp to wait for split trigger, but allowing due to ..."). That snapshot is then blocked by the slow pre-split range on the recipient follower (since it's in all likelihood not gc'able, just slow/stuck for whatever reason). @aliher1911 will try to get a repro of this with
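For readers less familiar with the mechanism behind that message: an uninitialized right-hand-side replica deliberately drops the RHS leader's MsgApps for a while so that the local left-hand side can initialize it by applying the split trigger, instead of the leader falling back to an expensive Raft snapshot; only after a deadline does it give up and let the append through. Below is a minimal sketch of that decision, with made-up names and constants rather than the actual code in pkg/kv/kvserver/split_trigger_helper.go:

```go
// Illustrative sketch only; the real logic lives in
// pkg/kv/kvserver/split_trigger_helper.go and differs in detail.
package kvsketch

import "time"

// maxWaitForSplitTrigger is a stand-in for the deadline after which we stop
// suppressing appends and allow a Raft snapshot instead.
const maxWaitForSplitTrigger = 20 * time.Second

type rhsReplica struct {
	initialized     bool      // set once the LHS applies the split trigger locally
	firstSuppressed time.Time // when we started dropping MsgApps for this replica
}

// shouldDropMsgApp decides whether an incoming append from the RHS leader
// should be dropped so that the local left-hand side gets a chance to
// initialize this replica via the split trigger.
func shouldDropMsgApp(r *rhsReplica, now time.Time) bool {
	if r.initialized {
		// The split trigger already created this replica; let appends through.
		return false
	}
	if r.firstSuppressed.IsZero() {
		r.firstSuppressed = now
	}
	if now.Sub(r.firstSuppressed) > maxWaitForSplitTrigger {
		// This is the "would have dropped incoming MsgApp to wait for split
		// trigger, but allowing due to ..." case: the split trigger is taking
		// too long, so let the leader proceed (which typically ends in a
		// snapshot being requested).
		return false
	}
	return true
}
```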
I'm taking a look at this as well to be able to team up with @aliher1911 a bit better, working off my repro branch #63784. In the first attempt, I made this a fatal error (cockroach/pkg/kv/kvserver/split_trigger_helper.go, lines 146 to 151 at 6a7aa6a),
on the assumption that once this fires, something worth investigating has already gone wrong. I also added some extra logging around when each replica applies the split. A node crashed within a minute or so of starting the restore. Looking through, I saw
this does seem like a reasonable reduction of the problem that we can iterate on quickly. Replicas falling behind like that on their command application is an issue and we ought to figure out why. I'll keep poking.
At the same time, we're not seeing this one fire (cockroach/pkg/kv/kvserver/store_raft.go, lines 531 to 535 at 3aea858),
so no individual ready handling (which includes appending to the log as well as applying entries to the state machine and any sync required to do so) takes more than 500ms on any of the nodes. I just did a second repro and the history was very similar. Hard to tell where to go from here. I'll add some more logging about the progress during splits (to see how far behind the followers are, in log indexes, when we split). Note that the splits here weren't delayed. I think the splits just happen to exacerbate the problem; they're not a vital ingredient. They just cause snapshots, which is why we're here looking at it. I suspect followers fall behind quite a bit just on their own under restore2TB and I'd like to understand why.
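For context on what "this one" checks: the referenced code is essentially a stopwatch around a single Raft ready pass that warns when it runs long. A rough sketch of that style of instrumentation follows; the threshold and names here are placeholders, not the actual store_raft.go values:

```go
// Rough sketch of warning on slow Raft ready handling; the real check lives
// in pkg/kv/kvserver/store_raft.go with its own threshold and log format.
package kvsketch

import (
	"context"
	"log"
	"time"
)

// processReadyTimed wraps one ready-handling pass (log append, entry
// application, and any required sync) and logs if it takes suspiciously long.
func processReadyTimed(
	ctx context.Context, handle func(context.Context) error, warnAfter time.Duration,
) error {
	start := time.Now()
	err := handle(ctx)
	if elapsed := time.Since(start); elapsed >= warnAfter {
		log.Printf("handle raft ready: %.2fs (threshold %s)", elapsed.Seconds(), warnAfter)
	}
	return err
}
```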
In another run where n9 hit the fatal error (i.e. was not catching up on its splits), I looked into whether n9 is generally unable to apply splits; doesn't seem like it:
From when the split is initiated to when it is applied (per replica), time does routinely pass, and it's not limited to any single node.
In another repro, I printed the raft progress. It didn't show anything obvious. The groups are generally very well caught up.
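The kind of check described here boils down to comparing each follower's Match index against the leader's last log index. A hedged sketch, assuming etcd's raft package and access to the leader replica's raft status:

```go
// Sketch of a follower-lag check using etcd raft's Status; illustrative only.
package kvsketch

import (
	"fmt"

	"go.etcd.io/etcd/raft/v3"
)

// logFollowerLag prints how many log entries each follower is behind, given
// the leader's raft.Status and its last log index.
func logFollowerLag(st raft.Status, lastIndex uint64) {
	for id, pr := range st.Progress {
		if pr.Match < lastIndex {
			fmt.Printf("replica %d is %d entries behind (match=%d, last=%d)\n",
				id, lastIndex-pr.Match, pr.Match, lastIndex)
		}
	}
}
```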
Ok, this new logging is helpful:
Each line logged is the follower dropping communication from r709's leader because it's still expecting to be initialized locally via split triggers. The interesting bit is that it logs the local replica that it hopes will initialize it. Each generation (
The probing happens at 1s intervals, so we see here that we're executing between one and two split triggers per second (on the chain that will ultimately create r709). Ok, I think I am starting to understand what's happening here. I looked into r329, which is the earliest logged candidate for initializing the final replica for r709/4. I think what's happening is the following:

```go
for i := 1; i <= 100; i++ {
	split(i) // make range i out of range i-1
	scatter(range i)
}
```

Every now and then, it will naturally happen that a follower applies the split after the right-hand side leader is already trying to append; let's say this happens on the first split. This append to r2 is suppressed ("dropping MsgApp at index 10") to avoid the leader of r2 thinking a snapshot is necessary. However, a side effect is that the leader will append only every second going forward (as it's probing the follower). The follower quickly catches up on the split, but now there's still most of the second remaining until the leader tries again. In the meantime, restore is busy doing split/scatter/split/scatter/...; let's ignore the scatter for now and focus on the splits. For each subsequent split, since r2 still has most of the second to go, the next split is highly likely to hit the same problem: the leader creates its RHS and reaches out to the follower, but the follower's r3, r4, etc. is not there yet because we're still waiting for r2 to catch up, and each newly created RHS then has to sit out another good fraction of a second itself, and so on. It becomes a cascade where, once you hit this common race, every subsequent rapid split eats approximately a 500ms delay (consistent with seeing about a generation or two pass per second in the logging above). Mixing in the scatter both helps and hurts: it does backpressure the rate of splits (since scatter takes time). But! It also removes replicas, and followers take a moment to realize they've been removed, during which time the split catch-up chain stalls. In practice, they realize pretty quickly, so I'm not actually sure that the scatter is a real negative here. I will try without scatter to see if the problem persists.
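Back-of-the-envelope, that cascade costs about half a probe interval per split once the race has been hit, which is where the slow chains come from. The numbers below are illustrative, assuming the ~1s probing cadence described above:

```go
// Rough model of the cascade: after the race is hit once, each subsequent
// split waits on average half a probe interval for the RHS leader's next
// probe before the follower can apply the next split trigger.
package kvsketch

import "time"

const probeInterval = 1 * time.Second // leader re-probes a paused follower roughly once per second

// expectedChainDelay estimates how long the last right-hand side in a chain
// of n rapid splits takes to initialize on the affected follower:
// roughly n * probeInterval / 2, i.e. ~50s for n = 100.
func expectedChainDelay(n int) time.Duration {
	return time.Duration(n) * probeInterval / 2
}
```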
Yeah, the problem also occurs with a version of SCATTER that only ever randomizes leases, and it sure didn't take any longer (but to be fair it's pretty much instant all the time anyway). Curiously, in that last run I also saw the pending snapshots shoot up into the hundreds really early on, and I do wonder why that happened; none of the "initiating a split" messages indicated any reason for it. Will add some more logging around that to potentially track it down.
I looked into the raft snapshot queue buildup I saw in one repro and I think it's just the known problem: while we're 100 splits backed up in a chain, sometimes a root replica gets GC'ed, at which point the remainder of the chain will rightfully ask for snapshots. So this issue is fallout from the chain buildup.
Logs for a run including this diff (some boring plumbing diffs omitted) look very promising. The cluster isn't hitting the fatal error from timed-out snapshot suppression, and the health alerts cleared out quickly. There was an "underreplicated range" for a while, but I verified that this was just the known problem of a removed replica thinking of itself as the range counter (which we should fix anyway, and there might be a question of why the replica stuck around for a minute; it was probably quiesced).

```diff
@@ -397,8 +400,11 @@ func (r *Replica) adminSplitWithDescriptor(
log.Infof(ctx, "initiating a split of this range at key %s [r%d] (%s)%s",
splitKey.StringWithDirs(nil /* valDirs */, 50 /* maxLen */), rightRangeID, reason, extra)
+ var rightDesc *roachpb.RangeDescriptor
if err := r.store.DB().Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
- return splitTxnAttempt(ctx, r.store, txn, rightRangeID, splitKey, args.ExpirationTime, desc)
+ var err error
+ _, rightDesc, err = splitTxnAttempt(ctx, r.store, txn, rightRangeID, splitKey, args.ExpirationTime, desc)
+ return err
}); err != nil {
// The ConditionFailedError can occur because the descriptors acting
// as expected values in the CPuts used to update the left or right
@@ -412,6 +418,12 @@ func (r *Replica) adminSplitWithDescriptor(
}
return reply, errors.Wrapf(err, "split at key %s failed", splitKey)
}
+ if err := waitForReplicasInit(
+ ctx, r.store.cfg.NodeDialer, rightDesc.RangeID, rightDesc.Replicas().AsProto(),
+ ); err != nil {
+ log.Infof(ctx, "failed to wait for right hand side to initialize after split: %s", err)
+ _ = err // handled
+ }
return reply, nil
}
```
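The key addition is the waitForReplicasInit call: the split only returns once every right-hand-side replica reports itself as initialized, so a tight split/scatter loop can no longer outrun the split triggers. As a rough illustration of the shape of such a helper (the real one issues a per-node RPC via the NodeDialer rather than polling a callback like this):

```go
// Illustrative sketch of waiting for the right-hand side replicas to become
// initialized after a split; the real waitForReplicasInit dials each node and
// asks it directly instead of polling a callback.
package kvsketch

import (
	"context"
	"time"
)

// waitForInit blocks until isInitialized (a stand-in for a per-node check)
// reports true for every replica of the new range, or the context is done.
func waitForInit(
	ctx context.Context,
	replicaIDs []int,
	isInitialized func(ctx context.Context, replicaID int) (bool, error),
) error {
	for _, id := range replicaIDs {
		for {
			ok, err := isInitialized(ctx, id)
			if err != nil {
				return err
			}
			if ok {
				break
			}
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(50 * time.Millisecond):
			}
		}
	}
	return nil
}
```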
The spike in log commit latency correlates with a slew of these for various ranges and files
This message comes from here: cockroach/pkg/storage/pebble.go, lines 595 to 596 at 0c1c6d9.
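That check is essentially a stopwatch around filesystem writes and syncs. A generic sketch of the idea follows; this is not CockroachDB's actual vfs wrapper, and the threshold is made up:

```go
// Generic sketch of detecting slow disk syncs by timing them; CockroachDB's
// real implementation wraps the storage engine's filesystem layer.
package kvsketch

import (
	"log"
	"os"
	"time"
)

// syncWithWarning syncs f and logs when the sync exceeds threshold, which is
// the kind of signal behind the log lines quoted above.
func syncWithWarning(f *os.File, threshold time.Duration) error {
	start := time.Now()
	err := f.Sync()
	if d := time.Since(start); d > threshold {
		log.Printf("disk slowness detected: syncing %s took %s", f.Name(), d)
	}
	return err
}
```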
@cockroachdb/storage this is my first time trying to dig into disk slowness; what are the metrics that would have exposed this separately? I.e. is there a metric that could have "proven" that the slow raft ready handling was very likely to have been caused by something that is also slow at the
I believe there are 2 metrics
I'm running 10 instances of the test now and it looks as though major lock-ups during restore are common. Here's another one: This then causes the node to lose liveness. The
@sumeerbhola / @dt do we want to upgrade this test to fail in such a scenario?
The ten test runs passed: restore2TB/nodes=10 (2370.18s). This is the total duration of the roachtests, so not the job. The longest import took just north of 1h, the shortest 40 minutes. This is in contrast to the current test (without my provisional fix), which has the same best case but will often take multiple hours:
Interestingly, in my testing I got down to the same test run times and no snapshot rejections by bumping maxDelaySplitTriggerTicks to 1000 instead. That allowed splits to propagate even for cases where we try to change raft group memberships "mid-split".
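For a sense of scale: the knob is expressed in Raft ticks, so the tolerated wait is maxDelaySplitTriggerTicks times the tick interval. Assuming a ~200ms tick and a default of 100 ticks (both assumptions worth double-checking against the source), that is roughly 20s by default versus about 200s at 1000 ticks:

```go
// Converting the maxDelaySplitTriggerTicks knob into wall-clock time; the
// 200ms tick interval is an assumption for illustration.
package kvsketch

import "time"

const raftTick = 200 * time.Millisecond

// toleratedSplitTriggerDelay is how long an uninitialized RHS replica keeps
// suppressing MsgApps before giving up and allowing a snapshot.
func toleratedSplitTriggerDelay(maxDelaySplitTriggerTicks int) time.Duration {
	return time.Duration(maxDelaySplitTriggerTicks) * raftTick
}
```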
This isn't entirely surprising. If the restore bursts a given number N of splits, then the natural delay for the entire sequence of splits to fully apply is N/2 seconds. If |
(roachtest).restore2TB/nodes=10 failed on master@6de4313ec216161c79fe725fcc31fc87ef1804ea:
Artifacts: /restore2TB/nodes=10
See this test on roachdash
powered by pkg/cmd/internal/issues