
storage: proposal quota leak #39022

Closed
ajwerner opened this issue Jul 22, 2019 · 29 comments · Fixed by #39135

@ajwerner
Contributor

Describe the problem

We have a roachtest hung on import seemingly due to an inability to acquire quota. This build is at 1ad0ecc.

In the logs we see:

ip-10-12-35-136> W190722 08:52:37.179946 146966 storage/quota_pool.go:108  [n3,s3,r4402/2:/Table/54/1/226{5/724…-6/763…}] have been waiting 1m0.000109892s attempting to acquire 1.0 MiB of proposal quota; acquired 1023 KiB so far

That 1 MiB number is the entirety of the proposal quota. Somehow we've lost some small amount. This is likely due to either #38568 or #38343.
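
For concreteness, here is a toy, self-contained sketch of a blocking quota pool with hypothetical names (not the actual storage/quota_pool.go implementation): it acquires in chunks, the way the log line above suggests, and once even a small amount of quota leaks, a later request for the full 1 MiB can never complete.

package main

import (
	"fmt"
	"sync"
)

type toyQuotaPool struct {
	mu    sync.Mutex
	cond  *sync.Cond
	avail int64 // bytes currently available
}

func newToyQuotaPool(capacity int64) *toyQuotaPool {
	p := &toyQuotaPool{avail: capacity}
	p.cond = sync.NewCond(&p.mu)
	return p
}

// acquire takes whatever is available, in chunks, until it has want bytes,
// which is why a waiter can report "acquired 1023 KiB so far".
func (p *toyQuotaPool) acquire(want int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	var got int64
	for got < want {
		for p.avail == 0 {
			p.cond.Wait()
		}
		take := p.avail
		if take > want-got {
			take = want - got
		}
		p.avail -= take
		got += take
	}
}

// release returns bytes to the pool and wakes any waiters.
func (p *toyQuotaPool) release(n int64) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.avail += n
	p.cond.Broadcast()
}

func main() {
	const mib = 1 << 20
	p := newToyQuotaPool(mib)

	p.acquire(mib)        // a large proposal takes the whole pool
	p.release(mib - 1024) // ...but 1 KiB is never returned (the leak)

	fmt.Println("available:", p.avail)
	// A later proposal that needs the full 1 MiB would now wait forever:
	// p.acquire(mib)
}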

@ajwerner
Contributor Author

This did not repro in 20 attempts. I'll try changing the scale and running a more targeted import.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Jul 24, 2019
storage: prevent command reproposal with new lease index after application

Fixes cockroachdb#39018.
Fixes cockroachdb#37810.
May fix other tests.

This commit fixes a bug introduced in e4ce717 which allowed a single Raft
proposal to be applied multiple times at multiple applied indexes. The
bug was possible if a raft proposal was reproposed twice, once with the
same max lease index and once with a new max lease index. Because there
are two entries for the same proposal, one necessarily has to have an
invalid max lease applied index (see the invariant in https://github.com/cockroachdb/cockroach/blob/5cbc4b50712546465de75dba69a1c0cdd1fe2f87/pkg/storage/replica_raft.go#L1430).
If these two entries happened to land in the same application batch on
the leaseholder, then the first entry would be rejected and the second
would apply. The replica's LeaseAppliedIndex would then be bumped all the
way to the max lease index of the entry that applied. The bug occurred when
the first entry (which must have hit a proposalIllegalLeaseIndex) called into
tryReproposeWithNewLeaseIndex. The ProposalData's MaxLeaseIndex would be
equal to the Replica's LeaseAppliedIndex because it would match the index
in the successful entry. We would then repropose the proposal with a larger
lease applied index. This new entry could then apply and result in duplicate
entry application.

Luckily, rangefeed's intent reference counting was sensitive enough that it
caught this duplicate entry application and panicked loudly. Other tests might
also be failing because of it but might not have as obvious of symptoms when
they hit the bug.

In addition to this primary bug fix, this commit has a secondary effect of
fixing an issue where two entries for the same command could be in the same
batch and only one would be linked to its ProposalData struct and be considered
locally proposed (see the change in retrieveLocalProposals). I believe that this
would prevent the command from being properly acknowledged if the first entry
was rejected due to its max lease index and the second entry had a larger max
lease index and applied. I think the first entry would have eventually hit the
check in tryReproposeWithNewLeaseIndex and observed that the linked ProposalData
had a new MaxLeaseIndex so it would have added it back to the proposal map, but
then it would have had to wait for refreshProposalsLocked to refresh the
proposal, at which point this refresh would have hit a lease index error and
would be reproposed at a higher index. Not only could this cause duplicate
versions of the same command to apply (described above), but I think this could
even loop forever without acknowledging the client. It seems like there should
be a way for this to cause cockroachdb#39022, but it doesn't exactly line up.

Again, these kinds of cases will be easier to test once we properly mock out
these interfaces in cockroachdb#38954. I'm working on that, but I don't think it should
hold up the next alpha (cockroachdb#39036). However, this commit does address a TODO to
properly handle errors during tryReproposeWithNewLeaseIndex reproposals and
that is properly tested.

My debugging process to track this down was to repeatedly run a set of 10
`cdc/ledger/rangefeed=true` roachtests after reducing its duration down to
5m. Usually, at least one of these tests would hit the `negative refcount`
assertion. I then incrementally added more and more logging around entry
application until I painted a full picture of which logical ops were being
consumed by the rangefeed processor and why the same raft command was being
applied twice (once it became clear that one was). After a few more rounds
of fine-tuning the logging, the interaction with reproposals with a new max
lease index became clear.

Release note: None
craig bot pushed a commit that referenced this issue Jul 24, 2019
39064: storage: prevent command reproposal with new lease index after application r=nvanbenschoten a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <[email protected]>
@nvanbenschoten
Member

nathan-release-v1920-alpha20190729 is currently stuck in this state. We can see in the goroutines on node 2 that two requests have been stuck for two hours on range 1514 acquiring quota from the quota pool.

@tbg
Member

tbg commented Jul 26, 2019

The range in question is r1514. n2 is the leader and leaseholder, but the range is quiescent, and I verified via logspy that there aren't any moving parts.

The range status shows that the "approximate proposal quota" is zero, that there are no outstanding proposals, and that there is one write latch.

That all corroborates that one or more commands were processed successfully but never returned their quota to the pool, so we now have an idle replica that will trap any command evaluation in the proposal quota acquisition path.

@tbg
Member

tbg commented Jul 26, 2019

What's maybe interesting is that the range has a lease applied index of four, which means it was created right before it got stuck. The raft log shows that n2 was born the leader of the range via the split, ingested one SST, and carried out another split. There are no other commands. The split itself wouldn't consume all of the quota, so either we "set the quota to zero" through some bug instead of releasing back to 1 MB, or the SST leaked quota and miraculously the split commands still went through.

@tbg
Member

tbg commented Jul 26, 2019

^- ah no, I forgot that we acquire quota in loops, so no miracle is necessary. We just leaked a small amount in one of the commands that went through. My money is on the SST, simply because that's a weird command and also it's the very first one the range is processing.

Or perhaps this code

func (r *Replica) updateProposalQuotaRaftMuLocked(
ctx context.Context, lastLeaderID roachpb.ReplicaID,
) {

failed to release the whole queue before the replica went silent. Quota is only released during Ready handling, so there's this sort of delicate dance where we rely on the fact that receiving an MsgAppResp from a follower triggers a Ready handling cycle. But if this were broken, I'd expect to see it more frequently.
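
A stripped-down, hypothetical sketch of that release step (not the real updateProposalQuotaRaftMuLocked; names and structure are made up) may help illustrate the dance: quota is only handed back when a Ready cycle observes follower progress, so anything still queued when the replica goes silent simply stays queued.

package main

import "fmt"

type replicaQuotaState struct {
	proposalQuotaBaseIndex uint64  // highest log index whose quota has been released
	quotaReleaseQueue      []int64 // quota sizes for indexes baseIndex+1, baseIndex+2, ...
	available              int64   // quota currently back in the pool
}

// releaseUpTo releases quota for all queued entries up to minFollowerIndex,
// the lowest index that every tracked follower has durably caught up to.
func (r *replicaQuotaState) releaseUpTo(minFollowerIndex uint64) {
	if minFollowerIndex <= r.proposalQuotaBaseIndex {
		return // nothing new has been acked since the last release
	}
	numReleases := minFollowerIndex - r.proposalQuotaBaseIndex
	if queued := uint64(len(r.quotaReleaseQueue)); numReleases > queued {
		numReleases = queued
	}
	var sum int64
	for _, rel := range r.quotaReleaseQueue[:numReleases] {
		sum += rel
	}
	r.proposalQuotaBaseIndex += numReleases
	r.quotaReleaseQueue = r.quotaReleaseQueue[numReleases:]
	r.available += sum
}

func main() {
	r := &replicaQuotaState{proposalQuotaBaseIndex: 10, quotaReleaseQueue: []int64{512, 1024, 256}}
	r.releaseUpTo(12) // a Ready cycle sees that followers have acked up to index 12
	fmt.Println(r.proposalQuotaBaseIndex, r.quotaReleaseQueue, r.available) // 12 [256] 1536
	// If no further Ready cycle runs (e.g. the range quiesces), the remaining
	// 256 bytes stay queued forever -- the delicate dance described above.
}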

@tbg
Member

tbg commented Jul 26, 2019

Hmm, shame that we don't expose the proposal quota base index or the release queue via the range info. I will fix that.

@tbg
Member

tbg commented Jul 26, 2019

I tried to attach to the process via dlv to print the quota release queue, but alas:

$ dlv attach 24490
panic: interface conversion: interface {} is nil, not dwarf.Offset

goroutine 12 [running]:
github.com/go-delve/delve/pkg/proc.(*BinaryInfo).loadDebugInfoMapsCompileUnit(0xc000096c00, 0xc007ff6000, 0xc0000fe630, 0xc0000ac480, 0xc012da57a0)
	/home/ubuntu/go/src/github.com/go-delve/delve/pkg/proc/types.go:414 +0x1861
github.com/go-delve/delve/pkg/proc.(*BinaryInfo).loadDebugInfoMaps(0xc000096c00, 0xc0000fe630, 0xc0069fc000, 0xcb4b4b, 0xcb4b4b, 0xc0000b6cb0, 0x0)
	/home/ubuntu/go/src/github.com/go-delve/delve/pkg/proc/types.go:267 +0x591
created by github.com/go-delve/delve/pkg/proc.loadBinaryInfoElf
	/home/ubuntu/go/src/github.com/go-delve/delve/pkg/proc/bininfo.go:904 +0x3d5

The binary has debug symbols:

$ file cockroach
cockroach: ELF 64-bit LSB executable, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/l, for GNU/Linux 2.6.32, not stripped

so I don't know why this is happening.

My next steps after that would have been

print (*storage.Replica)(0xc01fed5800).mu.quotaReleaseQueue
print (*storage.Replica)(0xc01fed5800).mu.proposalQuotaBaseIndex

followed by a speedy detach. The memory address comes from

github.com/cockroachdb/cockroach/pkg/storage.(*Replica).maybeAcquireProposalQuota(0xc01fed5800, 0x4513800, 0xc00c25a240, 0x54c3f3, 0xc018592680, 0x0)

I tried this locally on OSX before, but it also didn't work, though I got further:

(dlv) print (*storage.Store)(0x53f1624).Ident.StoreID
(unreadable protocol error E08 during memory read for packet $mc085481824448b5c,4)

Maybe this is just broken on Darwin.

Either way, this seems like a really powerful tool to have under our belt, and I'd love to find out what went wrong here. There's a delve issue about a panic that looks similar (go-delve/delve#1572), but it's not the same one; that issue is closed and I ran off master. I also ran the latest release and got the same error. Previous releases I can't even build (the import path changed).

@tbg
Member

tbg commented Jul 26, 2019

Oh, never mind, this works locally. I just had an old dlv version and was using an invalid address (it's hard to get the address from dlv itself, but it's easy from a goroutine dump; dlv probably has this somewhere too).

(dlv) print (*storage.Store)(0xc000f9a000).Ident
*github.com/cockroachdb/cockroach/pkg/roachpb.StoreIdent {
	ClusterID: github.com/cockroachdb/cockroach/pkg/util/uuid.UUID [125,186,186,15,182,253,71,134,149,26,95,149,208,149,198,176],
	NodeID: 1,
	StoreID: 1,}

Ok, maybe I can figure out why this doesn't work on that linux box?

@tbg
Member

tbg commented Jul 26, 2019

Ha, look at this:

$ dlv attach 24490
Type 'help' for list of commands.
(dlv) print (*storage.Replica)(0xc01fed5800).mu.quotaReleaseQueue
[]int64 len: 1, cap: 1, [977]
(dlv) print (*storage.Replica)(0xc01fed5800).mu.proposalQuotaBaseIndex
23
(dlv) print (*storage.Replica)(0xc01fed5800).RangeID
1514
(dlv) exit
Would you like to kill the process? [Y/n] n

This sort of confirms what I was suspecting, which is that we failed to release quota. What I don't understand is the base index -- the last index of this range is 15; the base index shouldn't ever be able to exceed the last index. Then again, looking at this code

r.mu.proposalQuotaBaseIndex += numReleases

doesn't give too much confidence that this invariant is upheld. I think what's happening is that for some reason, we're adding to the base index too much, so followers won't ever be able to catch up (after all, that base index is way ahead of the log!)

@tbg
Member

tbg commented Jul 26, 2019

The base index is only incremented whenever an element is removed from the release queue. But then, for it to race ahead of the last index by 12 entries, we'd also have had to enqueue lots of extra entries to be released. But say we enqueue 100 completely bogus entries in the queue: those wouldn't just be released, because the highest index we even consider is CommitIndex (which is at most 15). Starting out with a base of LastIndex (let's assume that were erroneously taken as 0), this range could still only ever have released 15 entries, not 26 or anywhere close to 100. All that is to say, this mechanism seems janky and hard to understand, but I don't understand how it can go off the rails like that.

PS: storage unit tests passed with this diff, but that doesn't mean much since coverage isn't great.

diff --git a/pkg/storage/replica_proposal_quota.go b/pkg/storage/replica_proposal_quota.go
index 0f06b01226..1ffd66aeda 100644
--- a/pkg/storage/replica_proposal_quota.go
+++ b/pkg/storage/replica_proposal_quota.go
@@ -236,6 +236,9 @@ func (r *Replica) updateProposalQuotaRaftMuLocked(
                        sum += rel
                }
                r.mu.proposalQuotaBaseIndex += numReleases
+               if r.mu.lastIndex < r.mu.proposalQuotaBaseIndex {
+                       panic("uhm what")
+               }
                r.mu.quotaReleaseQueue = r.mu.quotaReleaseQueue[numReleases:]

                r.mu.proposalQuota.add(sum)

@tbg
Member

tbg commented Jul 26, 2019

I know that this code moved around recently:

// At this point we're not guaranteed to have proposalQuota initialized,
// the same is true for quotaReleaseQueues. Only queue the proposal's
// quota for release if the proposalQuota is initialized.
if r.mu.proposalQuota != nil {
	r.mu.quotaReleaseQueue = append(r.mu.quotaReleaseQueue, cmd.proposal.quotaSize)
}

I wonder if maybe this just exposed an earlier bug. Potentially we were double-freeing quota enough to always hide the bug?

I give up for now. Hopefully either @ajwerner or @nvanbenschoten can look at this with fresh eyes.

Poking at the internals via dlv seems like a super nice tool to have under our belt.

tbg added a commit to tbg/cockroach that referenced this issue Jul 26, 2019
This exposes the information on `_status/range/N`.

I've looked at a few ranges locally and I'm already noticing that
weirdly often the last index will be significantly ahead of the
quota release base index (though the release queue is always empty).

This might be something that only happens on single-node instances (the
release mechanism is based on Ready handling, which is different in
single instances because nothing ever needs to get acked externally),
but it's unexpected to me.

Motivated by cockroachdb#39022, where we see a proposal quota leak: quota is queued
up for release, but the base index is way ahead of the last index, so it
will never be released.

Release note: None
@tbg
Member

tbg commented Jul 26, 2019

Filed a bug against delve about the crash: go-delve/delve#1636

The dlv version on n2 is patched to work. Just run `export PATH=$HOME/go112/bin:$HOME/go/bin:$PATH` after you log in.

@ajwerner
Contributor Author

I ran another 10 imports last night with logging of each acquisition and they all succeeded. Given that we've seen this at least 3 times in not all that many runs, and I haven't seen it in 20 runs with logging, I speculate that the logging interferes with the bug.

craig bot pushed a commit that referenced this issue Jul 26, 2019
39111: storage: expose quota release queue and base index r=ajwerner a=tbg

This exposes the information on `_status/range/N`.

I've looked at a few ranges locally and I'm already noticing that
weirdly often the last index will be significantly ahead of the
quota release base index (though the release queue is always empty).

This might be something that only happens on single-node instances (the
release mechanism is based on Ready handling, which is different in
single instances because nothing ever needs to get acked externally),
but it's unexpected to me.

Motivated by #39022, where we see a proposal quota leak: quota is queued
up for release, but the base index is way ahead of the last index, so it
will never be released.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@tbg
Member

tbg commented Jul 26, 2019 via email

@tbg
Member

tbg commented Jul 26, 2019

I'd try a few more runs with assertions instead of logging (it's kind of hard to believe that the logging actually makes a difference, but maybe?). The invariants are that the base index plus the length of the quota release queue shouldn't exceed the last index, and of course that we never bump the base index past the last index.

@ajwerner
Contributor Author

Were you running that very same SHA?

No, we've seen this for a week so I doubted it mattered. I've checked out a branch, added the assertions you suggested, and will spin up some more runs soon.

@ajwerner
Contributor Author

I've got some good assertion failures:

logs.txt

This is built off of 26edea5 with the following patch, run with `--vmodule=*raft=2`:

+
+               if r.mu.proposalQuotaBaseIndex+numReleases > r.mu.lastIndex {
+                       log.Fatalf(ctx, "proposalQuotaBaseIndex (%d) + numReleases (%d) > lastIndex (%d), %v",
+                               r.mu.proposalQuotaBaseIndex, numReleases, r.mu.lastIndex, r.mu.quotaReleaseQueue)
+               } else if r.mu.proposalQuotaBaseIndex+uint64(len(r.mu.quotaReleaseQueue)) > r.mu.lastIndex {
+                       log.Fatalf(ctx, "proposalQuotaBaseIndex (%d) + len(quotaReleaseQueue) (%d) > lastIndex (%d), %v",
+                               r.mu.proposalQuotaBaseIndex, len(r.mu.quotaReleaseQueue), r.mu.lastIndex, r.mu.quotaReleaseQueue)
+               }

@ajwerner
Contributor Author

Here's another assertion failure. It's uncanny how similar these are. They both feature a new leader election leading to a reproposal and a split. They differ in whether the crashing node starts or ends as the raft leader.

logs2.txt

Now on to interpreting them.

@ajwerner
Contributor Author

Here's another set of logs and the corresponding raft log from the crashing node.

This was built off of 9078c4e with added logging at 63722c8

1564172996.run_1.r1385.log.txt

1564172996.run_1.r1385.n3.raft_log.txt

@ajwerner
Contributor Author

I think I've got it. As @tbg noted above, we didn't totally lose the quota we were looking for; we just left it in the quotaReleaseQueue. We'd do that because we messed up the proposalQuotaBaseIndex. Quota is added to the quotaReleaseQueue when entries are committed, or more specifically, when committed entries are applied. The decision to remove from the queue is based on the status of the followers relative to the commit index. The queue should represent quota associated with commands between proposalQuotaBaseIndex and the current committed index. The bug is that we initialize the proposalQuotaBaseIndex to a value which is too high.

proposalQuotaBaseIndex is initialized to r.mu.lastIndex:

r.mu.proposalQuotaBaseIndex = r.mu.lastIndex

// Last index/term persisted to the raft log (not necessarily
// committed). Note that lastTerm may be 0 (and thus invalid) even when

The comment on r.mu.lastIndex notes that it can include entries which are not yet committed. In the attached logs we can see that the proposalQuotaBaseIndex is initialized to 17, but later we propose and commit entries at indexes 17 and 18 and add them to the queue.

ip-10-12-33-119> I190726 22:59:58.911898 214 storage/replica_proposal_quota.go:130  [n3,s3,r762/2:/Table/61/1/69{3/5/2…-5/5/1…}] becoming raft leader, setting proposalQuotaBaseIndex to 17
ip-10-12-33-119> I190726 22:59:58.940379 311 storage/replica_application.go:213  [n3,s3,r762/2:/Table/61/1/69{3/5/2…-5/5/1…}] appending to quotaReleaseQueue 5726512 quota, 17.5
ip-10-12-33-119> I190726 22:59:59.797089 306 storage/replica_application.go:213  [n3,s3,r762/2:/Table/61/1/69{3/5/2…-5/5/1…}] appending to quotaReleaseQueue 94 quota, 18.7
ip-10-12-33-119> F190726 22:59:59.797389 322 storage/replica_proposal_quota.go:250  [n3,s3,r762/2:/Table/61/1/69{3/5/2…-5/5/1…}] proposalQuotaBaseIndex (17) + len(quotaReleaseQueue) (2) > lastIndex (18), [5726512 94]
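
As a sanity check on the arithmetic in those log lines, here is a tiny standalone snippet (hypothetical variable names, not the real replica code):

package main

import "fmt"

func main() {
	// Toy arithmetic mirroring the log lines above.
	lastIndex := uint64(18)                   // includes appended-but-uncommitted entries
	proposalQuotaBaseIndex := uint64(17)      // seeded from lastIndex when leadership was acquired
	quotaReleaseQueue := []int64{5726512, 94} // entries 17 and 18 then apply and enqueue their quota

	// The queue is meant to cover indexes baseIndex+1 .. baseIndex+len(queue),
	// so this sum should never exceed lastIndex; here 17+2 > 18, and the quota
	// for the extra entry can never be released.
	fmt.Println(proposalQuotaBaseIndex+uint64(len(quotaReleaseQueue)) > lastIndex) // true
}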

I suspect this tail of quota we accumulate at the beginning of a Replica's leadership term is usually okay because 1) we rarely have a command that needs more quota than we've leaked (except these AddSSTables), and 2) we double-free enough that ultimately the leaked quota gets covered over.
To cause the observed stall, we just need the next request that comes in to need all of the quota. The range never gets unquiesced because the proposal never gets its proposal quota, until eventually something leads to a leadership change or otherwise kicks raft so that something else can get through.

I'm not sure what tickled the range into fixing itself. It seems like it was probably related to the dropped messages; I suspect there was some sort of networking hiccup that led to a raft election.

The remaining question is why are we seeing it now?

1664181104_run_3_r762.logs.txt

@tbg
Member

tbg commented Jul 27, 2019

The remaining question is why are we seeing it now?

Maybe because we've gotten less aggressive about re-freeing?

Also, can this explain the original failure? We saw a base index of 23 and a last index of 15. Let's be generous and say that at initialization we had a last index of 15 already (impossible) and so all subsequent commands would let the base race ahead. There are only 5 commands in the log (it starts with first index 11), so that's at most 20? How did we get to 23?

@ajwerner
Contributor Author

There are only 5 commands in the log (it starts with first index 11), so that's at most 20? How did we get to 23?

Doesn't the last command in the log truncate?

Put: 0.000000000,0 /Local/RangeID/1515/u/RaftTruncatedState (0x0169f705eb757266747400): index:10 term:5

It seems possible that we started below 11.
https://gist.github.com/tbg/4e0dc388c653f81920d7bab14c190f5e#file-r1514%20raft%20log

@tbg
Member

tbg commented Jul 27, 2019 via email

@ajwerner
Contributor Author

I'm grasping at straws here. What if the last index when n2 became leader was higher, say 18, but then another term change occurred which didn't change the leader but truncated the end of the log to something shorter?

I've successfully run the import 40 times with the change in #39135 and the fatal hasn't happened.

@nvanbenschoten
Member

I need to spend some time catching up on the symptoms we saw here and the interactions around the quotaPool before being able to confidently review #39135.

That said, I think I stumbled upon a possible culprit that we might be able to implicate in the surfacing of this stall. faff108 removed a secondary structure that tracked quota sizes for in-flight proposals. This commandSizes map was flushed on every leadership change. This means that when a replica became a leader, it would have an empty commandSizes map and never add any committed entries from old terms to its quotaReleaseQueue. The commit removed this map in favor of using the quota recorded on the ProposalData directly. An effect of this is that we may now free quota that was acquired during an earlier leadership term through a different quotaPool (which are reset on each leadership change). I would expect this to be innocuous because it would simply result in freeing extra quota, but perhaps it can explain the excess growth in the quotaReleaseQueue that we saw.

We'll need to think about whether this is an actual issue or whether it's simply helping to trigger the problem @ajwerner found in #39135. If we assume that only proposals performed after the term change and the quota pool initialization in updateProposalQuotaRaftMuLocked could ever make it into the quotaReleaseQueue then I think initializing proposalQuotaBaseIndex to r.mu.lastIndex would be correct (or at least wouldn't cause issues). Meanwhile, adding committed entries proposed in earlier terms to the queue after initializing the base index to r.mu.lastIndex would certainly cause issues like we see here.

#39135 avoids these issues by initializing the proposalQuotaBaseIndex to an index below that of any future addition to the quotaReleaseQueue, so it avoids the over-estimation that is possible if committed entries from previous terms can be added to the queue. However, I'm not sure it's the exact fix we want. That's because it now opens us up to dramatically under-estimating the proposalQuotaBaseIndex. After the leadership change, it's very possible that we commit a series of entries from previous terms that weren't proposed by the new leader. If these entries weren't added to the quotaReleaseQueue then the proposalQuotaBaseIndex would fall behind the leader's commit index. It's unclear to me whether this is ok or whether we try to make some claim about how the proposalQuotaBaseIndex relates to the leader's commit index (assuming no slow followers). If we initialized proposalQuotaBaseIndex to r.mu.lastIndex and every proposal after r.mu.lastIndex was eventually added to the quotaReleaseQueue then we would be able to say that the two would always converge in a quiescent range where all followers are caught up. And yet, this can't be true because not all entries proposed after r.mu.lastIndex will be added to the quotaReleaseQueue (e.g. raft occasionally proposes empty entries whose quota is not tracked). I suspect that this is why we need this condition.

That all would imply that proposalQuotaBaseIndex will necessarily drift away from the raft group's commit index and there's nothing we can do about it, in which case something like #39135 would be fine. A similar solution that avoids unbounded under-estimation is to record the proposer term on ProposalData and only add to the quotaReleaseQueue if the term is still active. This would also avoid mixing quota from different quota pools together, which I don't think the new util/quotapool package will be too thrilled about.
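
A minimal sketch of what that "record the proposer term" alternative might look like, with made-up names (this is not CockroachDB's actual code, nor the fix that was ultimately adopted):

package main

import "fmt"

type proposalData struct {
	quotaSize    int64
	proposerTerm uint64 // term under which the quota was acquired
}

type quotaTracker struct {
	currentTerm       uint64
	quotaReleaseQueue []int64
}

// onApplied queues quota for release only if the proposal was made under the
// current leadership term; quota acquired under an older term is dropped,
// since each term gets a fresh quota pool.
func (t *quotaTracker) onApplied(p proposalData) {
	if p.proposerTerm != t.currentTerm {
		return
	}
	t.quotaReleaseQueue = append(t.quotaReleaseQueue, p.quotaSize)
}

func main() {
	t := &quotaTracker{currentTerm: 7}
	t.onApplied(proposalData{quotaSize: 1024, proposerTerm: 6}) // stale term: dropped
	t.onApplied(proposalData{quotaSize: 2048, proposerTerm: 7}) // current term: queued
	fmt.Println(t.quotaReleaseQueue)                            // [2048]
}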

Since @irfansharif is back (👋), this would be a great issue for him to weigh in on!

@ajwerner
Contributor Author

ajwerner commented Jul 29, 2019

A similar solution that avoids unbounded under-estimation is to record the proposer term on ProposalData and only add to the quotaReleaseQueue if the term is still active.

Should we reset the proposalQuotaBaseIndex upon term change?

Would it make sense to try to have the quotaReleaseQueue exactly mirror the commit index? We could add a zero-value entry to the quotaReleaseQueue for committed entries which weren't proposed locally (both the raft-injected empty commands and commands which were proposed by a leaseholder which wasn't the raft leader). If we did that, it feels like we could maintain an invariant that proposalQuotaBaseIndex+len(quotaReleaseQueue) == status.Commit.

This would also avoid mixing quota from different quota pools together, which I don't think the new util/quotapool package will be too thrilled about.

True. The API doesn't really permit adding from one pool to another.

@ajwerner
Contributor Author

If we did that, it feels like we could maintain an invariant that proposalQuotaBaseIndex+len(quotaReleaseQueue) == status.Commit.

Digging a bit more, I'm not sure we can maintain this invariant, but it would be nice to have some invariant we can maintain about the relationship between the proposalQuotaBaseIndex, the quotaReleaseQueue, and one of status.Commit or r.mu.lastIndex.

@ajwerner
Contributor Author

Talked to @nvanbenschoten offline. We're going to move forward with the idea of adding zero-value quota to the quotaReleaseQueue for committed entries which were either empty or proposed elsewhere. This should keep the proposalQuotaBaseIndex close to, and less than or equal to, the Commit index. It won't be exactly the Commit index in cases where raft doesn't hand us an entry in CommittedEntries but does tell us that an index is committed. I'm not sure of the exact cases in which this happens but it seems to happen.

This should allow us to add an invariant check that, when a range quiesces, its proposalQuotaBaseIndex is equal to the Commit index.
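
A minimal sketch of the agreed-upon bookkeeping, with hypothetical names (the actual change is in #39135, which differs in detail):

package main

import "fmt"

type quotaState struct {
	proposalQuotaBaseIndex uint64
	quotaReleaseQueue      []int64
}

// onEntryApplied is called once per applied entry, in index order. Entries
// that were empty or proposed elsewhere enqueue a zero-value release so that
// baseIndex+len(queue) keeps tracking the applied index exactly.
func (q *quotaState) onEntryApplied(quotaSize int64, locallyProposed bool) {
	if !locallyProposed {
		quotaSize = 0
	}
	q.quotaReleaseQueue = append(q.quotaReleaseQueue, quotaSize)
}

func main() {
	// Base index seeded from the applied index (say 10), not lastIndex.
	q := &quotaState{proposalQuotaBaseIndex: 10}
	q.onEntryApplied(0, false)   // raft-injected empty entry at index 11
	q.onEntryApplied(4096, true) // locally proposed command at index 12
	q.onEntryApplied(512, false) // command proposed on another node at index 13

	appliedIndex := uint64(13)
	// Invariant from the discussion: baseIndex + len(queue) == applied index.
	fmt.Println(q.proposalQuotaBaseIndex+uint64(len(q.quotaReleaseQueue)) == appliedIndex) // true
}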

@nvanbenschoten
Member

I'm not sure of the exact cases in which this happens but it seems to happen.

The main place where this happens is when we run into the MaxCommittedSizePerReady limit.

ajwerner added a commit to ajwerner/cockroach that referenced this issue Jul 30, 2019
This commit changes the initialization of `proposalQuotaBaseIndex` from
`lastIndex`, which may include entries which are not yet committed, to
`status.Applied`, the highest applied index. Given that the
`proposalQuotaBaseIndex` should account for all committed proposals whose quota
has been released, and that proposals add their quota to the release queue after they
have been committed, it's important that the base index not be too high,
lest we leave quota in the queue.

This commit also adds an assertion that the `proposalQuotaBaseIndex` plus the
length of the queue is exactly equal to the applied index. In order to maintain
this invariant, the commit ensures that we enqueue a zero-value release to the
release queue for empty commands and commands proposed on another node.

See cockroachdb#39022 (comment)
for more details.

Fixes cockroachdb#39022.

Release note (bug fix): Properly initialize proposal quota tracking to prevent
quota leak which can hang imports or other AddSSTable operations.
craig bot pushed a commit that referenced this issue Jul 30, 2019
39135: storage: initialize the proposalQuotaBaseIndex from Applied r=nvanbenschoten a=ajwerner

This commit changes the initialization of `proposalQuotaBaseIndex` from
`lastIndex`, which may include entries which are not yet committed, to
`status.Commit`, the highest committed index. Given that the
`proposalQuotaBaseIndex` should account for all committed proposals whose quota
has been released, and that proposals add their quota to the release queue after they
have been committed, it's important that the base index not be too high,
lest we leave quota in the queue.

This commit also adds an assertion that the `proposalQuotaBaseIndex` plus the
length of the queue does not exceed the current committed index.

See #39022 (comment)
for more details.

This change did not hit this assertion in 10 runs of an import of TPC-C whereas
without it, the assertion was hit roughly 30% of the time.

Fixes #39022.

Release note (bug fix): Properly initialize proposal quota tracking to prevent
quota leak which can hang imports or other AddSSTable operations.

Co-authored-by: Andrew Werner <[email protected]>
@craig craig bot closed this as completed in #39135 Jul 30, 2019
nvanbenschoten pushed a commit that referenced this issue Jul 30, 2019