storage: proposal quota leak #39022
This did not repro in 20 attempts; I'll try changing the scale and running a more targeted import.
39064: storage: prevent command reproposal with new lease index after application r=nvanbenschoten a=nvanbenschoten

Fixes #39018. Fixes #37810. May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft proposal to be applied multiple times at multiple applied indexes. The bug was possible if a Raft proposal was reproposed twice, once with the same max lease index and once with a new max lease index. Because there are two entries for the same proposal, one necessarily has to have an invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430). If these two entries happened to land in the same application batch on the leaseholder, then the first entry would be rejected and the second would apply. The replica's LeaseAppliedIndex would then be bumped all the way to the max lease index of the entry that applied.

The bug occurred when the first entry (which must have hit a proposalIllegalLeaseIndex) called into tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be equal to the Replica's LeaseAppliedIndex because it would match the index in the successful entry. We would then repropose the proposal with a larger lease applied index. This new entry could then apply and result in duplicate entry application. Luckily, rangefeed's intent reference counting was sensitive enough that it caught this duplicate entry application and panicked loudly. Other tests might also be failing because of it but might not have as obvious symptoms when they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of fixing an issue where two entries for the same command could be in the same batch and only one would be linked to its ProposalData struct and be considered locally proposed (see the change in retrieveLocalProposals). I believe that this would prevent the command from being properly acknowledged if the first entry was rejected due to its max lease index and the second entry had a larger max lease index and applied. I think the first entry would have eventually hit the check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData had a new MaxLeaseIndex, so it would have added it back to the proposal map, but then it would have had to wait for refreshProposalsLocked to refresh the proposal, at which point the refresh would have hit a lease index error and the command would be reproposed at a higher index. Not only could this cause duplicate versions of the same command to apply (described above), but I think this could even loop forever without acknowledging the client. It seems like there should be a way for this to cause #39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out these interfaces in #38954. I'm working on that, but I don't think it should hold up the next alpha (#39036). However, this commit does address a TODO to properly handle errors during tryReproposeWithNewLeaseIndex reproposals, and that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10 `cdc/ledger/rangefeed=true` roachtests after reducing their duration to 5m. Usually, at least one of these tests would hit the `negative refcount` assertion. I then incrementally added more and more logging around entry application until I had painted a full picture of which logical ops were being consumed by the rangefeed processor and why the same raft command was being applied twice (once it became clear that one was). After a few more rounds of fine-tuning the logging, the interaction with reproposals with a new max lease index became clear.

Co-authored-by: Nathan VanBenschoten <[email protected]>
The range in question is r1514. n2 is the leader and leaseholder, but the range is quiescent, and I verified via logspy that there aren't any moving parts. The range status shows that the "approximate proposal quota" is zero, that there are no outstanding proposals, and that there is one write latch. That all corroborates that one or more commands were processed successfully but never returned their quota to the pool, so we now have an idle replica that will trap any command evaluation in the proposal quota acquisition path.
What's maybe interesting is that the range has a lease applied index of four, which means it was created right before it got stuck. The raft log shows that n2 was born the leader of the range via the split, ingested one SST, and carried out another split. There are no other commands. The split itself wouldn't consume all of the quota, so either we "set the quota to zero" through some bug instead of releasing back to 1MB, or the SST leaked quota and miraculously the split commands still went through.
^- ah no, I forgot that we acquire quota in loops, so no miracle is necessary. We just leaked a small amount in any of the commands that went through. My money is on the SST, simply because that's a weird command and also it's the very first one the range is processing. Or perhaps this code (pkg/storage/replica_proposal_quota.go, lines 74 to 76 at 3315afa) failed to release the whole queue before the replica went silent. Quota is only released during Ready handling, so there's this sort of delicate dance where we rely on the fact that receiving an MsgAppResp from a follower triggers a Ready handling cycle. But if this were broken, I'd expect to see it more frequently.
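To make that "delicate dance" concrete, here is a minimal sketch of how the release step during Ready handling is assumed to work; the type and method names (`quotaPool`, `replicaQuota`, `onRaftReady`, `minFollowerIndex`) are illustrative and not the actual CockroachDB code, which lives in `updateProposalQuotaRaftMuLocked`:

```go
// quotaPool is an illustrative stand-in for the proposal quota pool.
type quotaPool struct {
	capacity  int64
	available int64
}

func (qp *quotaPool) add(n int64) { qp.available += n }

// replicaQuota is an illustrative stand-in for the per-replica bookkeeping.
type replicaQuota struct {
	pool         *quotaPool
	baseIndex    uint64  // highest log index whose quota has been handed back
	releaseQueue []int64 // quota held by entries baseIndex+1, baseIndex+2, ...
}

// onRaftReady mimics the release step during Ready handling: quota queued for
// an entry is returned to the pool only once every follower we are waiting on
// has durably appended that entry (minFollowerIndex).
func (rq *replicaQuota) onRaftReady(minFollowerIndex uint64) {
	var sum int64
	numReleases := 0
	for _, rel := range rq.releaseQueue {
		if rq.baseIndex+uint64(numReleases)+1 > minFollowerIndex {
			break // followers haven't caught up to this entry yet
		}
		sum += rel
		numReleases++
	}
	rq.baseIndex += uint64(numReleases)
	rq.releaseQueue = rq.releaseQueue[numReleases:]
	rq.pool.add(sum)
}
```

If no further Ready cycle fires (for example because the range quiesces), whatever is still sitting in the release queue stays there, which is exactly the kind of silent leak being discussed.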
Hmm, shame that we don't expose the proposal quota base index or the release queue via the range info. I will fix that.
I tried to attach to the process via dlv to
The binary has debug symbols:
so I don't know why this is happening. My next steps after that would have been
followed by a speedy detach. The memory address comes from
I tried this locally on OSX before, but it also didn't work, though I got further:
Maybe this is just broken on Darwin. Either way, this seems like a really powerful tool to have under our belt, and I'd love to find out what went wrong here. There's an issue about a panic that looks similar, but it's not the same, it's closed, and I ran off master: go-delve/delve#1572. I also ran the latest release and got the same error. Previous releases I can't even build (the import path changed).
Oh, never mind, this works locally. I just had an old dlv version and was using an invalid addr (it's hard to get the addr from dlv itself, but it's easy from a goroutine dump; probably dlv has this somewhere too).
Ok, maybe I can figure out why this doesn't work on that linux box?
Ha, look at this:
This sort of confirms what I was suspecting, which is that we failed to release quota. What I don't understand is the base index -- the last index of this range is 15; the base index shouldn't ever be able to exceed the last index. Then again, looking at this code
doesn't give too much confidence that this invariant is upheld. I think what's happening is that for some reason we're adding to the base index too much, so followers won't ever be able to catch up (after all, that base index is way ahead of the log!).
The base index is only incremented whenever an element is removed from the release queue. But then, for it to race ahead of the last index by 12 entries, we'd also have had to enqueue lots of extra entries to be released. But say we enqueue 100 completely bogus entries in the queue: those wouldn't just be released, because the highest index we even consider is CommitIndex (which is at most 15). Starting out with a base of LastIndex, and even assuming that were erroneously taken as 0, this range could only ever have released 15 entries, not 26 or anywhere close to 100. All that is to say, this mechanism seems janky and hard to understand, but I don't understand how it can go off the rails like that.

PS: storage unit tests passed with this diff, but that doesn't mean much since coverage isn't great:

diff --git a/pkg/storage/replica_proposal_quota.go b/pkg/storage/replica_proposal_quota.go
index 0f06b01226..1ffd66aeda 100644
--- a/pkg/storage/replica_proposal_quota.go
+++ b/pkg/storage/replica_proposal_quota.go
@@ -236,6 +236,9 @@ func (r *Replica) updateProposalQuotaRaftMuLocked(
sum += rel
}
r.mu.proposalQuotaBaseIndex += numReleases
+ if r.mu.lastIndex < r.mu.proposalQuotaBaseIndex {
+ panic("uhm what")
+ }
r.mu.quotaReleaseQueue = r.mu.quotaReleaseQueue[numReleases:]
r.mu.proposalQuota.add(sum)
I know that this code moved around recently: pkg/storage/replica_application.go, lines 209 to 214 at aada4fc.
I wonder if maybe this just exposed an earlier bug. Potentially we were double-freeing enough quota to always hide the bug? I give up for now. Hopefully either @ajwerner or @nvanbenschoten can look at this with fresh eyes. Poking at the internals via dlv seems like a super nice tool to have under our belt.
Filed a bug against delve about the crash: go-delve/delve#1636. The version on n2 is patched to work. Just run
I ran another 10 imports last night with logging of each acquisition and they all succeeded. Given that we've seen this at least 3 times in not all that many runs and I haven't seen it in 20 runs with logging, I speculate that the logging interferes with the bug.
39111: storage: expose quota release queue and base index r=ajwerner a=tbg

This exposes the information on `_status/range/N`. I've looked at a few ranges locally and I'm already noticing that weirdly often the last index will be significantly ahead of the quota release base index (though the release queue is always empty). This might be something that only happens on single-node instances (the release mechanism is based on Ready handling, which is different in single instances because nothing ever needs to get acked externally), but it's unexpected to me.

Motivated by #39022, where we see a proposal quota leak: quota is queued up for release, but the base index is way ahead of the last index, so it will never be released.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Were you running that very same SHA?
I'd try a few more runs with assertions instead of logging (kind of hard to believe that the logging actually makes a difference, but maybe?). The invariants are that the base index plus the length of the quota release queue shouldn't exceed the last index, and of course that we never bump the base index past the last index.
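Spelled out as a sketch (a hypothetical helper with illustrative parameter names, not the patch that was actually run), the two invariants look like this:

```go
// assertQuotaInvariants checks the two suggested invariants. Parameter names
// are illustrative; they are not CockroachDB's actual fields.
func assertQuotaInvariants(baseIndex, queueLen, lastIndex uint64) {
	// Invariant 1: the base index never passes the last index.
	if baseIndex > lastIndex {
		panic("proposal quota base index is ahead of the last index")
	}
	// Invariant 2: base index plus queued releases never accounts for more
	// entries than exist in the log.
	if baseIndex+queueLen > lastIndex {
		panic("base index plus release queue length exceeds the last index")
	}
}
```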
No, we've seen this for a week so I doubted it mattered. I've checked out a branch and added the assertions you suggested and will spin up some more runs soon.
I've got some good assertion failures. This is built off of 26edea5 with the following patch, run with
Here's another assertion failure. It's uncanny how similar these are. They both feature a new leader election leading to a reproposal and a split. They differ in whether the crashing node starts or ends as the raft leader. Now on to interpreting them.
Here's another set of logs and corresponding raft logs from the crashing node. This was built off of 9078c4e with added logging at 63722c8.
I think I've got it. As @tbg noted above, we didn't totally lose the quota we were looking for, we just left it in the release queue (pkg/storage/replica.go, lines 225 to 226 at cfdaadc).

I suspect usually this tail of quota we accumulate at the beginning of a Replica's leadership term is okay because 1) we rarely have a command that needs more quota than we've leaked (except these

I'm not sure what tickled the range to fix the thing. It seems like it was probably related to the dropped messages. I suspect there was some sort of networking hiccup that led to a raft election. The remaining question is why we are seeing it now.
Maybe because we've gotten less aggressive about re-freeing? Also, can this explain the original failure? We saw a base index of 23 and a last index of 15. Let's be generous and say that at initialization we had a last index of 15 already (impossible) and so all subsequent commands would let the base race ahead. There are only 5 commands in the log (it starts with first index 11), so that's at most 20? How did we get to 23?
Doesn't the last command in the log truncate?
It seems possible that we started below 11. |
No, our ranges always start at 11 unfortunately. You're seeing the split trigger setting that up for the new range 1515.

On Sat, Jul 27, 2019, 19:42 ajwerner wrote:

> There are only 5 commands in the log (it starts with first index 11), so that's at most 20? How did we get to 23?
>
> Doesn't the last command in the log truncate?
>
> Put: 0.000000000,0 /Local/RangeID/1515/u/RaftTruncatedState (0x0169f705eb757266747400): index:10 term:5
>
> It seems possible that we started below 11.
>
> https://gist.github.com/tbg/4e0dc388c653f81920d7bab14c190f5e#file-r1514%20raft%20log
I'm grasping at straws here. What if the last index when n2 became leader was higher, say 18, but then another term change occurred which didn't change the leader but truncated the end of the log to something shorter? I've successfully run the import 40 times with the change in #39135 and the fatal hasn't happened.
I need to spend some time catching up on the symptoms we saw here and the interactions around the quotaPool before being able to confidently review #39135. That said, I think I stumbled upon a possible culprit that we might be able to implicate in the surfacing of this stall. faff108 removed a secondary structure that tracked quota sizes for in-flight proposals. This

We'll need to think about whether this is an actual issue or whether it's simply helping to trigger the problem @ajwerner found in #39135. If we assume that only proposals performed after the term change and the quota pool initialization in #39135 avoids these issues by initializing the

That all would imply that

Since @irfansharif is back (👋), this would be a great issue for him to weigh in on!
Should we reset the

Would it make sense to try to have the
True. The API doesn't really permit adding from one pool to another.
Digging a bit more, I'm not sure we can maintain this invariant, but it'd be nice to have an invariant that we can maintain about the relationship between the
Talked to @nvanbenschoten offline. We're going to move forward with the idea of adding zero-value quota to the quota release queue.

This should allow us to add an invariant that when a range quiesces its
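To make the proposed bookkeeping concrete, here is a rough sketch under assumed names (a hypothetical helper, not the actual implementation): every applied entry appends a release to the queue, zero-valued when the local replica holds no quota for it, so the base index plus the queue length tracks the applied index exactly.

```go
// appendQuotaRelease sketches the proposed bookkeeping. localQuota is zero
// for empty commands and for commands proposed on another node, so no quota
// is conjured up, but the queue still advances one slot per applied entry.
func appendQuotaRelease(
	baseIndex uint64, releaseQueue []int64, appliedIndex uint64, localQuota int64,
) []int64 {
	releaseQueue = append(releaseQueue, localQuota)
	// The invariant this buys us: base index + queue length equals the
	// applied index after every application.
	if baseIndex+uint64(len(releaseQueue)) != appliedIndex {
		panic("proposal quota base index + release queue length != applied index")
	}
	return releaseQueue
}
```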
The main place where this happens is when we run into the
39135: storage: initialize the proposalQuotaBaseIndex from Applied r=nvanbenschoten a=ajwerner

This commit changes the initialization of `proposalQuotaBaseIndex` from `lastIndex`, which may include entries which are not yet committed, to `status.Commit`, the highest committed index. Given that the `proposalQuotaBaseIndex` should account for all committed proposals whose quota has been released, and proposals add their quota to the release queue after they have been committed, it's important that the base index not be too high, lest we leave quota in the queue.

This commit also adds an assertion that the `proposalQuotaBaseIndex` plus the length of the queue does not exceed the current committed index. See #39022 (comment) for more details.

This change did not hit this assertion in 10 runs of an import of TPC-C, whereas without it the assertion was hit roughly 30% of the time.

Fixes #39022.

Release note (bug fix): Properly initialize proposal quota tracking to prevent a quota leak which can hang imports or other AddSSTable operations.

Co-authored-by: Andrew Werner <[email protected]>
This commit changes the initialization of `proposalQuotaBaseIndex` from `lastIndex`, which may include entries which are not yet committed, to `status.Applied`, the highest applied index. Given that the `proposalQuotaBaseIndex` should account for all committed proposals whose quota has been released, and proposals add their quota to the release queue after they have been committed, it's important that the base index not be too high, lest we leave quota in the queue.

This commit also adds an assertion that the `proposalQuotaBaseIndex` plus the length of the queue is exactly equal to the applied index. In order to maintain this invariant, the commit ensures that we enqueue a zero-value release to the release queue for empty commands and commands proposed on another node. See #39022 (comment) for more details.

Fixes #39022.

Release note (bug fix): Properly initialize proposal quota tracking to prevent a quota leak which can hang imports or other AddSSTable operations.
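Read together, a minimal sketch of the fix as described in the commit message (illustrative types and names, not the real replica code): seed the base index from the applied index rather than the last index, and assert the resulting invariant after every application.

```go
// proposalQuotaState is an illustrative stand-in for the per-replica fields.
type proposalQuotaState struct {
	baseIndex    uint64  // starts at the applied index, not the last index
	releaseQueue []int64 // one (possibly zero-valued) release per applied entry
}

// initProposalQuota seeds the base index from the applied index; entries
// beyond the applied index have not queued any quota for release yet.
func initProposalQuota(appliedIndex uint64) proposalQuotaState {
	return proposalQuotaState{
		baseIndex:    appliedIndex, // previously seeded from lastIndex, which over-counted
		releaseQueue: nil,
	}
}

// checkQuotaInvariant mirrors the assertion described in the commit message.
func (s *proposalQuotaState) checkQuotaInvariant(appliedIndex uint64) {
	if s.baseIndex+uint64(len(s.releaseQueue)) != appliedIndex {
		panic("proposalQuotaBaseIndex + release queue length != applied index")
	}
}
```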
Describe the problem
We have a roachtest hung on import seemingly due to an inability to acquire quota. This build is at 1ad0ecc.
In the logs we see:
That 1MB number is the entirety of the proposal quota. Somehow we've lost some small amount. Likely this is due to either #38568 or #38343.
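For illustration, a toy quota pool (assumed API, not CockroachDB's actual implementation) shows why even a small leak wedges the range: the pool's capacity is fixed, leaked bytes are never returned, and a later acquisition that needs more than what remains blocks forever.

```go
package main

import "sync"

// quotaPool is a toy blocking quota pool used only to illustrate the hang.
type quotaPool struct {
	mu        sync.Mutex
	cond      *sync.Cond
	available int64
}

func newQuotaPool(capacity int64) *quotaPool {
	qp := &quotaPool{available: capacity}
	qp.cond = sync.NewCond(&qp.mu)
	return qp
}

// acquire blocks until n bytes of quota are available.
func (qp *quotaPool) acquire(n int64) {
	qp.mu.Lock()
	defer qp.mu.Unlock()
	for qp.available < n {
		qp.cond.Wait() // never woken if the missing quota was leaked
	}
	qp.available -= n
}

// release returns n bytes of quota to the pool and wakes waiters.
func (qp *quotaPool) release(n int64) {
	qp.mu.Lock()
	qp.available += n
	qp.mu.Unlock()
	qp.cond.Broadcast()
}

func main() {
	qp := newQuotaPool(1 << 20) // 1MB of proposal quota
	qp.acquire(1000)            // a command acquires a little quota...
	// ...and it is never released (the leak). A later proposal that needs
	// the full quota would now block forever:
	// qp.acquire(1 << 20)
	_ = qp
}
```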