storage: transfer all pending prerequisites during command cancellation #17939
The removal of the
Reviewed 4 of 4 files at r1.
pkg/storage/replica_test.go, line 2456 at r2 (raw file):
That's some serious test infrastructure. I'm not following how you're controlling the cancellation order described in #16266.
Reviewed 4 of 4 files at r1, 3 of 3 files at r2.
pkg/storage/replica.go, line 2039 at r2 (raw file):
So this is equivalent to the old code and the second commit is just a test with no functional change so far, right?
pkg/storage/replica_test.go, line 2474 at r1 (raw file):
s/Run/Finish/
pkg/storage/replica_test.go, line 2488 at r1 (raw file):
What if you use
pkg/storage/replica_test.go, line 2456 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Are you relying on the randomness in cancelCmds? It's not great to introduce a test that is expected to fail by becoming flaky instead of being deterministic. It would be better to be able to pass in a cancellation order (perhaps while leaving the randomized option for a fuzzy test).
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.
pkg/storage/replica_test.go, line 2434 at r2 (raw file):
EndTransactionRequest with a split trigger declares a lot more keys than a plain EndTransactionRequest. That may be necessary to trigger the bug. Specifically, with a SplitTrigger the EndTransaction both reads and writes the abort cache key, which I think would move the detection of that collision into a different scope/access pair.
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.
pkg/storage/replica.go, line 2039 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
No, this is where the bug was hiding. Previously the code added the
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.
pkg/storage/replica.go, line 2039 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
OK, got it now. This comment confused me and I had expected the solution to look more like repeating the block that previously set
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.
pkg/storage/replica.go, line 2039 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Agreed we need to figure out a reproduction. I'm testing this fix on
Yeah, I wasn't suggesting pushing this until we actually reproduced the issue in
Review status: all files reviewed at latest revision, 5 unresolved discussions, some commit checks failed.
pkg/storage/replica.go, line 2039 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Yeah like @petermattis said, the second commit fixes the bug discussed in #16266 as well. It does so by removing the
@petermattis I'm also going to make a prediction that this doesn't fix the issue because I think I found a bigger one with how cmd cancellation currently interacts with the
pkg/storage/replica_test.go, line 2474 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.
pkg/storage/replica_test.go, line 2488 at r1 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.
pkg/storage/replica_test.go, line 2456 at r2 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Good point. I'll make the order of cancellation deterministic.
998c131 to 57e29bd
I updated this with the full fix discussed in #16266 along with more testing around local keys. PTAL.
b11f9dc to 7ded8a8
Review status: 2 of 5 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.
pkg/storage/command_queue.go, line 226 at r6 (raw file):
I think this loop is an optimization and not strictly necessary. When a dependent command waits on the prereqs of
I'd prefer to leave this loop out because I'm having a hard time reasoning about the locking here and why it is safe to modify
pkg/storage/replica.go, line 2033 at r6 (raw file):
The repetition of
A few more lines of code, but I find this clearer. Feel free to ignore.
pkg/storage/command_queue.go (outdated)
// Truncate the command's prerequisite list so that it no longer includes
// the first prerequisite. Before doing so, nil out prefix of slice to allow
// GC of the first command. This prevents us from leaking large chunks of
Is this comment accurate? I.e. is
var s []*int
for i := 0; i < 1E10; i++ {
    s = append(s, new(int))
    s = s[1:]
}
any different from the version with a `s[0] = nil` added in? Seems that in both cases GC can only reclaim the old chunk of backing memory after it's had to copy into a new one.
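One way to probe the question above is a small standalone sketch. It is not the `CommandQueue` code: `node` and `dropFirst` are hypothetical names chosen for illustration. It only shows that reslicing past the first element leaves the old slot in the backing array pointing at the dropped object, whereas clearing the slot first removes that reference, so the object itself (not the backing array) can become unreachable sooner.

```go
package main

import "fmt"

// node stands in for a queued command; the name is hypothetical.
type node struct{ id int }

// dropFirst returns s without its first element. When clearSlot is true it
// first nils the slot, so the shared backing array no longer references the
// dropped node.
func dropFirst(s []*node, clearSlot bool) []*node {
	if len(s) == 0 {
		return s
	}
	if clearSlot {
		s[0] = nil
	}
	return s[1:]
}

func main() {
	// Without clearing: the backing array (still visible through kept)
	// keeps a pointer to node 1 even though rest no longer includes it.
	kept := []*node{{id: 1}, {id: 2}}
	rest := dropFirst(kept, false)
	fmt.Println(len(rest), kept[0] != nil)

	// With clearing: slot 0 of the backing array is nil after the call.
	cleared := []*node{{id: 1}, {id: 2}}
	rest = dropFirst(cleared, true)
	fmt.Println(len(rest), cleared[0] == nil)
}
```

In both variants the backing array itself stays alive until an `append` reallocates; the difference is only whether the dropped element remains reachable through it.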
7ded8a8 to de768dc
Review status: 2 of 5 files reviewed at latest revision, 8 unresolved discussions.
pkg/storage/command_queue.go, line 215 at r5 (raw file):
Previously, tschottdorf (Tobias Schottdorf) wrote…
We're not trying to allow the initial chunk of the slice to be GCed, we're trying to allow GC of the
pkg/storage/command_queue.go, line 226 at r6 (raw file):
Previously, petermattis (Peter Mattis) wrote…
So right now it is necessary because a parent
pkg/storage/replica.go, line 2033 at r6 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Done. Good call.
Review status: 2 of 5 files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.
pkg/storage/command_queue.go, line 215 at r5 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
The comment talks about leaking, but it isn't really a leak, just a reference.
pkg/storage/command_queue.go, line 222 at r6 (raw file):
The first sentence of this comment implies that dependents could have dependencies on the parent command, but that never happens because
pkg/storage/command_queue.go, line 226 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Ah, so much subtlety here. Rather than keeping the
de768dc to 01d3d51
Review status: 2 of 5 files reviewed at latest revision, 9 unresolved discussions.
pkg/storage/command_queue.go, line 215 at r5 (raw file):
pkg/storage/command_queue.go, line 222 at r6 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Couldn't it happen if the request has only 1 span, so that there are no children cmds and only a parent? I think I may be misunderstanding your question, but this comment is removed now anyway because of your other suggestion.
pkg/storage/command_queue.go, line 226 at r6 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I like that! Done.
Review status: 2 of 5 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.
pkg/storage/command_queue.go, line 222 at r6 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
If a parent doesn't have children, is it really a parent?
Review status: 2 of 5 files reviewed at latest revision, 7 unresolved discussions, all commit checks successful.
pkg/storage/command_queue.go, line 222 at r6 (raw file):
Previously, petermattis (Peter Mattis) wrote…
That's a larger philosophical question than I'm qualified to answer.
Reviewed 2 of 3 files at r7, 1 of 1 files at r8.
pkg/storage/replica_test.go, line 2514 at r8 (raw file):
Are these other scenarios derived from anything in particular?
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.
pkg/storage/replica_test.go, line 2514 at r8 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
No, I was just trying to think of interesting cases that created dependency chains across local keys and global keys and across reads and writes. Are there any other interesting scenarios you think we should add here that might create interesting dependency patterns?
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.
pkg/storage/replica_test.go, line 2514 at r8 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
EndTransaction with a SplitTrigger is interesting since it mixes local and global keys and blocks on the entire data span.
This allows us to test more about which commands are let through the command queue at a given time and make more directed assertions. This is important when testing the exact effect of command cancellation.
Previously, `cmdQCancelTest` would cancel commands in a random order. The test structure now takes the `cancelOrder` as a parameter.
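As a rough illustration of why an explicit order helps, here is a minimal sketch of a scenario type that carries its cancellation order. It does not reproduce the real `cmdQCancelTest`: `cancelScenario`, `numCmds`, and `runScenario` are hypothetical names, and each command is modelled only as a cancellable `context.Context`.

```go
package main

import (
	"context"
	"fmt"
)

// cancelScenario is a hypothetical stand-in for a test case: it names how
// many commands exist and the exact order in which to cancel them.
type cancelScenario struct {
	numCmds     int
	cancelOrder []int // indexes of commands to cancel, in order
}

// runScenario cancels the contexts in the requested order and returns the
// order in which cancellation was observed.
func runScenario(sc cancelScenario) ([]int, error) {
	ctxs := make([]context.Context, sc.numCmds)
	cancels := make([]context.CancelFunc, sc.numCmds)
	for i := range ctxs {
		ctxs[i], cancels[i] = context.WithCancel(context.Background())
	}
	// Release every context on exit, including those never cancelled.
	defer func() {
		for _, cancel := range cancels {
			cancel()
		}
	}()

	observed := make([]int, 0, len(sc.cancelOrder))
	for _, idx := range sc.cancelOrder {
		if idx < 0 || idx >= sc.numCmds {
			return nil, fmt.Errorf("cancelOrder index %d out of range [0, %d)", idx, sc.numCmds)
		}
		cancels[idx]()
		<-ctxs[idx].Done() // Done is closed once cancel has been called
		observed = append(observed, idx)
	}
	return observed, nil
}

func main() {
	observed, err := runScenario(cancelScenario{numCmds: 3, cancelOrder: []int{2, 0}})
	fmt.Println(observed, err)
}
```

Because the order is data rather than a random draw, a failing scenario reproduces on every run; a randomized order could still be generated separately and fed through the same field.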
The use of the `cancelled` flag on `cmds` in the `CommandQueue` was flawed in two ways. First, we were only setting it on the `cmd` for the SpanAccess/spanScope combination that was active during the context cancellation. This meant that `cmds` in later SpanAccess/spanScope combinations would not have the flag set even if their corresponding batch was cancelled. Second, neither the `cancelled` flag nor the `prereq` list was being set on child `cmds`, only on parent `cmds`. This meant that context cancellation would not work properly for BatchRequests that create more than one `*cmd` for any access/scope combination.

This commit fixes both of these issues by removing the `cancelled` flag and making sure that parent and child `cmds` keep `prereqs` in sync. Instead of using the `cancelled` flag to signify a cancelled `cmd`, the change now uses the `cmd.prereq` slice itself to signify `cmd` cancellation. `cmds` now remove prerequisites from their `prereq` set as the prereqs stop pending. If a `cmd` cancels early, it will leave around a non-empty `prereq` set.

The transitive dependency migration happens whenever a prereq stops pending while still holding onto prereqs itself. In this case, the dependent command will add all remaining prereqs to its set. This should be less error-prone and easier to reason about since we have to maintain less state.
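The migration described in this commit message can be sketched in a few lines. This is a simplified, single-goroutine model under stated assumptions, not the actual `CommandQueue` implementation: the `cmd` type here has only an `id` and a `prereqs` slice, `stopWaitingOnFirst` and `ids` are hypothetical helpers, and locking, channels, and de-duplication of prerequisites are all omitted.

```go
package main

import "fmt"

// cmd is a stripped-down model of a queued command (hypothetical fields).
type cmd struct {
	id      int
	prereqs []*cmd // commands this one still has to wait for
}

// stopWaitingOnFirst models c observing that its first prerequisite has
// stopped pending. If that prerequisite still holds prereqs of its own
// (it was cancelled before they finished), c adopts them.
func (c *cmd) stopWaitingOnFirst() {
	if len(c.prereqs) == 0 {
		return
	}
	pre := c.prereqs[0]
	c.prereqs[0] = nil // drop the reference held by the backing array
	c.prereqs = c.prereqs[1:]
	// A prerequisite that finished normally has an empty set, so this
	// append is a no-op; a cancelled one passes on what it left behind.
	c.prereqs = append(c.prereqs, pre.prereqs...)
}

// ids returns the ids of the given commands, for printing.
func ids(cs []*cmd) []int {
	out := make([]int, 0, len(cs))
	for _, c := range cs {
		out = append(out, c.id)
	}
	return out
}

func main() {
	a := &cmd{id: 1}                      // still running
	b := &cmd{id: 2, prereqs: []*cmd{a}} // waits on a, then is cancelled
	c := &cmd{id: 3, prereqs: []*cmd{b}} // waits on b

	c.stopWaitingOnFirst() // b was cancelled early: c inherits a
	fmt.Println(ids(c.prereqs))

	c.stopWaitingOnFirst() // a finished with no prereqs: c may proceed
	fmt.Println(ids(c.prereqs))
}
```

The point of the model is that no separate cancelled flag is needed: a non-empty `prereqs` set on a command that has stopped pending is itself the signal that its dependents must keep waiting on what remains.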
Review status: all files reviewed at latest revision, 5 unresolved discussions, all commit checks successful.
pkg/storage/replica_test.go, line 2514 at r8 (raw file):
Previously, bdarnell (Ben Darnell) wrote…
Done.
01d3d51 to 93f7fe8
I'll merge if #17815 proves successful. @petermattis has the binary running on |
I ran a binary with your PR as of this morning for much of the day. No failures detected. I've since switched to testing something else on |
Fixes #16266.
The first two commits overhaul the `cmdQCancelTest` type, which allows us to do more in-depth testing on command cancellation. This includes testing exactly where each command is with respect to the `CommandQueue` at a given time (pending, cancelled, running, or finished). We can then assert that dependencies are properly maintained during command cancellation. The test framework also adds support for handling local keys, which are tested in more depth in the fourth commit.

The third commit fixes the issue discussed in #16266.
The use of the `cancelled` flag on `cmds` in the `CommandQueue` was flawed in two ways. First, we were only setting it on the `cmd` for the SpanAccess/spanScope combination that was active during the context cancellation. This meant that `cmds` in later SpanAccess/spanScope combinations would not have the flag set even if their corresponding batch was cancelled. Second, neither the `cancelled` flag nor the `prereq` list was being set on child `cmds`, only on parent `cmds`. This meant that context cancellation would not work properly for BatchRequests that create more than one `*cmd` for any access/scope combination.

This commit fixes both of these issues by removing the `cancelled` flag and making sure that parent and child `cmds` keep `prereqs` in sync. Instead of using the `cancelled` flag to signify a cancelled `cmd`, we now use the `cmd.prereq` slice itself to signify `cmd` cancellation. `cmds` now remove prerequisites from their `prereq` set as the prereqs stop pending. If a `cmd` cancels early, it will leave around a non-empty `prereq` set.

The transitive dependency migration happens whenever a prereq stops pending while still holding onto prereqs itself. In this case, the dependent command will add all remaining prereqs to its set. This should be less error-prone and easier to reason about since we have to maintain less state.
cc @petermattis @tschottdorf