
storage: transitive prerequisites are not always transferred on command cancellation #16266

Closed · tbg opened this issue Jun 1, 2017 · 150 comments · Fixed by #17939
Labels: C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.)

tbg (Member) commented Jun 1, 2017

https://sentry.io/cockroach-labs/cockroachdb/issues/284616795/

	if len(reply.Ranges) == 0 {
		// No matching results were returned from the scan. This should
		// never happen with the above logic.
		var buf bytes.Buffer
		buf.WriteString("range lookup of meta key '")
		buf.Write(args.Key)
		buf.WriteString("' found only non-matching ranges:")
		for _, desc := range reply.PrefetchedRanges {
			buf.WriteByte('\n')
			buf.WriteString(desc.String())
		}
		log.Fatal(ctx, buf.String()) // <- boom
	}
*errors.errorString: replica_command.go:1304 string
  File "github.com/cockroachdb/cockroach/pkg/storage/replica_command.go", line 1304, in evalRangeLookup
  File "github.com/cockroachdb/cockroach/pkg/storage/replica_command.go", line 196, in evaluateCommand
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 4330, in evaluateBatch
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 2055, in executeReadOnlyBatch
  File "github.com/cockroachdb/cockroach/pkg/storage/replica.go", line 1421, in Send
...
(10 additional frame(s) were not displayed)

tbg added the C-bug label Jun 1, 2017
petermattis changed the title from "*errors.errorString: replica_command.go:1304 string" to "storage: *errors.errorString: replica_command.go:1304 string" Jun 1, 2017
petermattis added this to the 1.1 milestone Jun 1, 2017
bdarnell changed the title from "storage: *errors.errorString: replica_command.go:1304 string" to "storage: range lookup of meta key found only non-matching ranges" Jun 1, 2017
bdarnell (Contributor) commented Jun 1, 2017

Last time we saw this error was #10501

petermattis (Collaborator) commented:

I'm seeing this on sky right now while trying to pre-split a table into 10,000 parts:

F170808 17:31:44.518350 9645 storage/replica_command.go:1326  [n88,s88,r1/23:/{Min-System/}] range lookup of meta key '^C<BB><89><FD>%K<A1><A0>ԯ<BE><82>' found only non-matching ranges:

Cc @tamird

bdarnell (Contributor) commented Aug 8, 2017

Have the meta ranges split yet? It's possible we're running into #2266 on sky.

petermattis (Collaborator) commented Aug 8, 2017

@bdarnell The above happened after 2000 splits. Pretty sure the meta ranges have not split yet. Yesterday I had sky up to 400k ranges (via size-based splits).

petermattis (Collaborator) commented:
This happened again on sky last night. This time the system had ~1m ranges. The failure is pretty bad: we seem to have walked through the replicas of this meta2 range and triggered the failure on each, which then resulted in loss of quorum for a large swath of ranges:

cockroach.stderr.17:F170809 04:05:45.149824 4508002 storage/replica_command.go:1326  [n11,s11,r272822/3:/Meta2/Table/51/1/-{465…-299…}] range lookup of meta key '���ɖ��&~��' found only non-matching ranges:
cockroach.stderr.37:F170809 04:05:44.611391 12562094 storage/replica_command.go:1326  [n2,s2,r272822/2:/Meta2/Table/51/1/-{465…-299…}] range lookup of meta key '���ɖ��&~��' found only non-matching ranges:
cockroach.stderr.39:F170809 04:05:44.816689 5185165 storage/replica_command.go:1326  [n21,s20,r272822/1:/Meta2/Table/51/1/-{465…-299…}] range lookup of meta key '���ɖ��&~��' found only non-matching ranges:

The raw printing of the meta key makes debugging a bit difficult. I'm going to try and reproduce again with #17540 in place.

petermattis (Collaborator) commented:
Translating that key reveals it to be /Meta2/Table/51/1/-3920777067433132351/0. That key is within the span of r272822 (the negative numbers in the span of r272822 make determining this slightly confusing).

petermattis (Collaborator) commented:
cockroach debug range-data <dir> 272822 reveals (this data has been manually massaged for clarity):

/Meta2/Table/51/1/-3920798135581715988,03:40:51
    [/Table/51/1/-3920807719635987012, /Table/51/1/-3920798135581715988)
/Meta2/Table/51/1/-3920798135581715988,03:24:16
    [/Table/51/1/-3920889520691632397, /Table/51/1/-3920798135581715988)
/Meta2/Table/51/1/-3920798135581715988,03:18:47
    [/Table/51/1/-3920889520691632397, /Table/51/1/-3920798135581715988)
/Meta2/Table/51/1/-3920798135581715988,03:18:45
    [/Table/51/1/-3920889520691632397, /Table/51/1/-3920798135581715988)
/Meta2/Table/51/1/-3920798135581715988,03:18:41
    [/Table/51/1/-3920889520691632397, /Table/51/1/-3920798135581715988)
/Meta2/Table/51/1/-3920798135581715988,03:18:36
    [/Table/51/1/-3920889520691632397, /Table/51/1/-3920798135581715988)
/Meta2/Table/51/1/-3920798135581715988,03:09:50
    [/Table/51/1/-3920917233049129889, /Table/51/1/-3920798135581715988)

/Meta2/Table/51/1/-3920754471613951699,03:40:50
    [/Table/51/1/-3920755829099915981, /Table/51/1/-3920754471613951699)
/Meta2/Table/51/1/-3920754471613951699,03:28:20
    [/Table/51/1/-3920798135581715988, /Table/51/1/-3920754471613951699)

Note that there are only 2 keys above. If we look at the 2 most recent versions of the keys we see:

/Meta2/Table/51/1/-3920798135581715988,03:40:51
    [/Table/51/1/-3920807719635987012, /Table/51/1/-3920798135581715988)

/Meta2/Table/51/1/-3920754471613951699,03:40:50
    [/Table/51/1/-3920755829099915981, /Table/51/1/-3920754471613951699)

The key we're looking for (-3920777067433132351) lies in the gap between these 2 range descriptors. The versions immediately preceding these are:

/Meta2/Table/51/1/-3920798135581715988,03:24:16
    [/Table/51/1/-3920889520691632397, /Table/51/1/-3920798135581715988)

/Meta2/Table/51/1/-3920754471613951699,03:28:20
    [/Table/51/1/-3920798135581715988, /Table/51/1/-3920754471613951699)

Notice that the gap is gone. The end key of the first range descriptor is the start key of the second range descriptor. Still trying to piece together what is going on here. Seems like there was a split at 03:40:50. What happened to it? Did we lose a write? Are we not printing an intent in debug range-data?
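
The invariant being checked here is mechanical enough to sketch. Below is a minimal standalone toy (invented types, not CockroachDB code) that flags exactly this kind of gap, given the latest version of each meta2 entry in key order; the keys are the ones from the dump above, treated as opaque strings.

    package main

    import (
        "bytes"
        "fmt"
    )

    // desc is a toy stand-in for a range descriptor: the span [start, end) it covers.
    type desc struct {
        start, end []byte
    }

    // gaps reports every place where consecutive descriptors fail to tile the
    // keyspace, assuming descs holds the latest version of each meta2 entry in
    // key order.
    func gaps(descs []desc) []string {
        var out []string
        for i := 1; i < len(descs); i++ {
            if !bytes.Equal(descs[i-1].end, descs[i].start) {
                out = append(out, fmt.Sprintf("gap between %s and %s",
                    descs[i-1].end, descs[i].start))
            }
        }
        return out
    }

    func main() {
        // The two most recent versions from the range-data dump above: the first
        // descriptor ends at ...715988 but the second starts at ...915981, so the
        // key ...132351 falls in the gap.
        latest := []desc{
            {start: []byte("-3920807719635987012"), end: []byte("-3920798135581715988")},
            {start: []byte("-3920755829099915981"), end: []byte("-3920754471613951699")},
        }
        fmt.Println(gaps(latest))
    }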

petermattis (Collaborator) commented:
Here is the split:

cockroach.stderr.25:I170809 03:40:50.842124 2119452 storage/replica_command.go:2667  [n42,s40,r798869/1:/Table/51/1/-39207{981…-544…}] initiating a split of this range at key /Table/51/1/-3920755829099915981 [r936561]

The keys here exactly match the range-data output, except we're missing the record for the left hand side of the split.

tbg (Member, Author) commented Aug 9, 2017

If the theory is that we're simply losing the write (not unheard of in past anomalies), and you can "quickly" repro this issue, then you could add the following logging:

  1. log from evalResolveIntent{,Range} whenever an intent changes (resolved, pushed, aborted).
  2. log from evalEndTransaction when it has a splitTrigger. In particular, the list of intents, and which intents we return upwards to the intentResolver.
  3. log from the intent resolver for the EndTransaction case. If the intent resolver ended up eagerly GC'ing the transaction without really resolving the LHS descriptor intent, that intent would subsequently go AWOL as the absent txn record would count as an aborted txn.

bdarnell (Contributor) commented Aug 9, 2017

This reminds me of #9265 (comment). That led to #9399 which proposed introducing additional assertions to MVCCResolveWriteIntent, although in the end we decided not to do that.

What are the bounds of r272822? Doesn't the fact that we have meta2 ranges on a range other than r1 mean that meta2 has split?

petermattis (Collaborator) commented:
The bounds of r272822 are /Meta2/Table/51/1/-4655368821960641081 and /Meta2/Table/51/1/-2990335433313644993. I think meta2 did split (I calculate we can store 550k range descriptors per meta2 range), but the meta key in question is right in the middle of this range, not near the bounds.

petermattis (Collaborator) commented:
Seems like an assertion in MVCCResolveWriteIntent isn't feasible, but we could at least add a warning.

petermattis (Collaborator) commented:
sky encountered another of these failures:

F170809 20:26:59.279564 767502 storage/replica_command.go:1317 [n5,s5,r236328/1:/Meta2/Table/51/1/{-88…-385…}] range lookup of meta key '/Meta2/Table/51/1/3856042551658982491' found only non-matching ranges:

The span for r236328 is [/Meta2/Table/51/1/-882441077842675592, /Meta2/Table/51/1/3856096413566835965). But in this case, the last key in r236328 is before the desired meta key:

2017-08-09T20:22:33.39724672Z,0 /Meta2/Table/51/1/3856042551658982491: [/Table/51/1/3856024847892073890, /Table/51/1/3856042551658982491)
2017-08-09T20:22:28.886126068Z,0 /Meta2/Table/51/1/3856042551658982491: [/Table/51/1/3856024847892073890, /Table/51/1/3856042551658982491)
2017-08-09T20:22:28.41487805Z,0 /Meta2/Table/51/1/3856042551658982491: [/Table/51/1/3856024847892073890, /Table/51/1/3856042551658982491)
2017-08-09T20:18:57.808850961Z,0 /Meta2/Table/51/1/3856042551658982491: [/Table/51/1/3856024233464125966, /Table/51/1/3856042551658982491)

Notice that the meta key we were looking up is exactly equal to the end key of the last range descriptor in this meta2 range. I'm going to try and track down the next meta2 range to see what it contains.

petermattis (Collaborator) commented Aug 9, 2017

And another instance:

F170809 21:13:13.176847 565412 storage/replica_command.go:1317 [n59,s59,r1/4:/M{in-eta2/Table/…}] range lookup of meta key '/Meta2/Table/51/1/1225124192177909585' found only non-matching ranges:

r1 has the bounds [/Min, /Meta2/Table/51/1/1225124729451769362) which means it does contain the desired key, but right at the end of the range:

2017-08-09T21:01:22.608522522Z,11 /Meta2/Table/51/1/1225124192177909585: [/Table/51/1/1225085640252622231, /Table/51/1/1225124192177909585)

The next meta range is r118151 which has the bounds [/Meta2/Table/51/1/1225124729451769362, /System/""). The first meta key is:

2017-08-09T21:13:13.175969315Z,0 /Meta2/Table/51/1/1225124729451769362: [/Table/51/1/1225124192177909585, /Table/51/1/1225124729451769362)

Notice that the meta key aligns with the start key of r118151, but the range descriptor extends before it. It seems like our handling of lookups in meta2 ranges is broken: we need to look not only at the range containing the desired key, but also at the following range.

Cc @nvanbenschoten, who @tschottdorf says has some understanding of this.

Note that the debugging I did in this comment indicated a different issue as the desired meta key was definitely contained within the meta range, not at the end.
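
To make the failure mode concrete, here is a minimal standalone sketch (toy types and helper, not DistSender or replica code) of a range lookup confined to a single meta range, assuming descriptors are stored under their end keys and spans are end-exclusive. When the lookup key equals the last stored key in the meta range, the scan there finds nothing; the matching descriptor lives in the following meta range.

    package main

    import (
        "fmt"
        "sort"
    )

    // entry is a toy meta record: the descriptor for [start, end), stored under key end.
    type entry struct{ start, end string }

    // lookup mimics a RangeLookup confined to one meta range: find the first
    // stored entry whose key (its end key) is strictly greater than the lookup
    // key. Because spans are end-exclusive, a key equal to an end key belongs to
    // the next descriptor, which may be stored in the following meta range.
    func lookup(metaRange []entry, key string) (entry, bool) {
        i := sort.Search(len(metaRange), func(i int) bool { return metaRange[i].end > key })
        if i == len(metaRange) {
            return entry{}, false // found only non-matching ranges
        }
        return metaRange[i], true
    }

    func main() {
        // Meta range split between the stored keys "d" and "e", as in the commit
        // message further down: the first meta range holds b, c, d; the next holds e, f, g.
        metaRange1 := []entry{{"a", "b"}, {"b", "c"}, {"c", "d"}}

        // Looking up key "d": routing sends us to metaRange1, since "d" lies within
        // its bounds, but the descriptor containing "d" is [d, e), stored under "e"
        // in the following meta range.
        if _, ok := lookup(metaRange1, "d"); !ok {
            fmt.Println("range lookup of meta key 'd' found only non-matching ranges")
        }
    }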

petermattis (Collaborator) commented:
PS It seems easy to write a test which splits meta2 and verifies that we don't handle lookups correctly at the end of the first meta2 range.

nvanbenschoten (Member) commented:
@petermattis if I'm understanding your last crash correctly, it looks like the meta2 splits are misaligned with the userspace splits, which is why you suggest

We need to not only look at the range containing the desired key, but the following range as well.

I'd expect that we would make sure to keep the split boundaries between the different meta layers aligned, but a quick search doesn't lead me to believe that we do that. For instance, we wouldn't have this issue if r1 was bounded to [/Min, /Meta2/Table/51/1/1225124192177909585) and r118151 bounded to [/Meta2/Table/51/1/1225124192177909585, /System/""). To guarantee this, the only valid meta2 splitKeys would be RangeMetaKey(any existing userspace range start key). Luckily we don't support range merges yet, meaning that once a key is a range start key, it will always be a range start key. That simplifies things in a number of ways.

petermattis (Collaborator) commented:
@nvanbenschoten I'm not sure your proposed solution works since every userspace range start key is the end key of the previous range. If r1 was bounded by [/Min, /Meta2/Table/51/1/1225124192177909585), the range [/Table/51/1/1225085640252622231, /Table/51/1/1225124192177909585) would have been placed in the next meta range. I think this approach could work if we chose meta split keys as metaKey.Next(). This would ensure that the last valid key in a range is just before the range split point. This feels fragile to me and maintaining this invariant when we do range merging might be a headache.

It's interesting that a Scan or ReverseScan operation would not have this problem since at the DistSender level those requests can span multiple ranges. Changing DistSender range lookup to send a request to the following range is mildly problematic as we'd have to adjust the key we're looking for so that it falls within the bounds of the following range. This doesn't seem conceptually difficult, but I think you're the most familiar with the range cache code.

In the short term, I'll send a patch to disable splitting of the meta2 range as we clearly don't handle this correctly.

petermattis (Collaborator) commented:
I changed my mind. Changing meta split keys to metaKey.Next() is relatively straightforward. This adds future work for range merging, which I'll document in the associated issue.
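
For reference, metaKey.Next() denotes the smallest key strictly greater than metaKey, formed by appending a zero byte. A minimal standalone sketch (the next helper below is an invented stand-in for roachpb.Key.Next):

    package main

    import (
        "bytes"
        "fmt"
    )

    // next returns the smallest key strictly greater than k: k with a zero byte
    // appended. This mirrors the effect of roachpb.Key.Next in CockroachDB.
    func next(k []byte) []byte {
        return append(append([]byte(nil), k...), 0)
    }

    func main() {
        metaKey := []byte("/Meta2/d") // toy stand-in for a stored meta2 key
        splitKey := next(metaKey)     // "/Meta2/d\x00"

        // Splitting the meta range just after a stored key (rather than at an
        // arbitrary stored key) keeps that record at the very end of the
        // left-hand meta range: metaKey < splitKey and nothing sorts between them.
        fmt.Println(bytes.Compare(metaKey, splitKey) < 0) // true
        fmt.Printf("%q\n", splitKey)
    }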

petermattis added a commit to petermattis/cockroach that referenced this issue Aug 10, 2017
Force meta ranges to split at metaKey.Next(). This is necessary to allow
DistSender to direct RangeLookup requests to the correct meta
range. Consider the following scenario that can occur with the prior
split key selection for meta ranges:

  range 1 [a, e):
    b -> [a, b)
    c -> [b, c)
    d -> [c, d)
  range 2 [e, g):
    e -> [d, e)
    f -> [e, f)
    g -> [f, g)

Now consider looking up the range containing key `d`. The DistSender
routing logic would send the RangeLookup request to range 1 since `d`
lies within the bounds of that range. But notice that the range
descriptor containing `d` lies in range 2. Boom!

The root problem here is that the split key chosen for meta ranges is an
actual key in the range. What we want to do is ensure that the bounds of
range descriptors within a meta range are entirely contained within the
bounds of the meta range. We can accomplish this by splitting on
metaKey.Next(). This would transform the above example into:

  range 1 [a, d\0):
    b -> [a, b)
    c -> [b, c)
    d -> [c, d\0)
  range 2 [d\0, g\0):
    e -> [d, e)
    f -> [e, f)
    g -> [f, g)

Truncate range lookup scans to the range.

Fixes cockroachdb#2266
Fixes cockroachdb#16266
nvanbenschoten (Member) commented Aug 10, 2017

I'm not sure I follow. Why wouldn't/shouldn't the range [/Table/51/1/1225085640252622231, /Table/51/1/1225124192177909585) be placed in r1 if we chose a split key of /Table/51/1/1225124192177909585?

nvanbenschoten (Member) commented Aug 10, 2017

Aren't the range descriptors stored at the range's start key, not their end key? So in your example, we'd have:

range 1 [a, e):
  a -> [a, b)
  b -> [b, c)
  c -> [c, d)
range 2 [d, g):
  d -> [d, e)
  e -> [e, f)
  f -> [f, g)

bdarnell (Contributor) commented:
Aren't the range descriptors stored at the range's start key, not their end key?

They're stored twice: the range-local copy is stored under the start key, and the meta copy is stored under the end key.
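
A toy illustration of that layout (invented types, not the keys package): each descriptor is indexed twice, range-locally under its start key and in the meta copy under its end key.

    package main

    import "fmt"

    // span is a toy range descriptor: the keyspace [start, end) it covers.
    type span struct{ start, end string }

    func main() {
        descs := []span{{"a", "b"}, {"b", "c"}, {"c", "d"}}

        local := map[string]span{} // range-local copy, keyed by start key
        meta := map[string]span{}  // meta copy, keyed by end key

        for _, d := range descs {
            local[d.start] = d
            meta[d.end] = d
        }

        // Addressing by start key vs. end key returns different descriptors:
        fmt.Println(local["b"]) // {b c}
        fmt.Println(meta["b"])  // {a b}
    }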

nvanbenschoten (Member) commented:
Ah, you're right, the covering optimization is disabled in the local command queue. However, I don't actually think this issue requires us to use the covering optimization; I think it just requires that we treat a parent *cmd as a handle to all child cmds, which we do regardless. The coveringOptimization flag determines whether we initially insert this *cmd into the interval tree or whether we initially insert all the children, but it won't affect how beginCmds interacts with batches that create multiple cmds.

bdarnell (Contributor) commented:
Here's what I see poring through the logs:

I170825 05:53:32.120342 15224381 storage/replica.go:2028  [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] 1 PushTxn: waiting for 3 overlapping requests
  local(read-write):
    [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
    [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
command-queue(global):
command-queue(local):
  7 readonly [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  9 readonly [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  13 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  10 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  12 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  8 [/Local/Range/Table/51/1/1561260896670017006/RangeDescriptor)

The first PushTxn is assigned ids 12 and 13 in the local command queue. There are already two operations ahead of it: IDs 7 and 8 are from a write to the (new RHS) range descriptor, and 9 and 10 look like a heartbeat for this transaction. (In "waiting for 3 requests", the heartbeat counts double because its reads and writes are counted separately)

I170825 05:53:33.204211 15232755 storage/replica.go:2028  [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] 1 EndTxn: waiting for 5 overlapping requests
[snip]
command-queue(local):
  14 readonly [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  13 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  25 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  27 [/Local/RangeID/483671/r"",/Local/RangeID/483671/s"")
  28 [/Local/RangeID/483671/u"",/Local/RangeID/483671/v"")
  29 [/Local/RangeID/483671/u/RangeLastReplicaGCTimestamp)
  26 [/Local/Range/Table/51/1/1561186976538960094,/Local/Range/???)
  23 [/Local/Range/Table/51/1/1561186976538960094/RangeDescriptor)
  12 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  15 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  22 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  24 [/Local/Range/Table/51/1/1561260896670017006/RangeDescriptor)

By the time EndTxn comes around, we see a second heartbeat, with ids 14 and 15, then the EndTxn's writes get ids 22-29 (the reads are missing here but show up later). All of the things that the push was waiting on have cleared the command queue; it should be able to proceed now.

The EndTxn is waiting on the push and the second heartbeat (2x); I think that to get to 5 commands, the first heartbeat had to finish in between the prereq computation and this log message.

I170825 05:53:33.205214 15224381 storage/replica.go:2028  [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] 1 PushTxn: waiting for 2 overlapping requests
[snip]
command-queue(local):
  18 readonly [/Local/RangeID/447049/r"",/Local/RangeID/447049/s"")
  20 readonly [/Local/RangeID/447049/r/AbortCache/"00000000-0000-0000-0000-000000000000",/Local/RangeID/447049/r/AbortCache/"ffffffff-ffff-ffff-ffff-ffffffffffff")
  14 readonly [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  19 readonly [/Local/RangeID/447049/u/RangeLastReplicaGCTimestamp)
  17 readonly [/Local/Range/Table/51/1/1561186976538960094/RangeDescriptor)
  25 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  32 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  27 [/Local/RangeID/483671/r"",/Local/RangeID/483671/s"")
  28 [/Local/RangeID/483671/u"",/Local/RangeID/483671/v"")
  29 [/Local/RangeID/483671/u/RangeLastReplicaGCTimestamp)
  26 [/Local/Range/Table/51/1/1561186976538960094,/Local/Range/???)
  23 [/Local/Range/Table/51/1/1561186976538960094/RangeDescriptor)
  15 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  22 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  31 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  24 [/Local/Range/Table/51/1/1561260896670017006/RangeDescriptor)

And now we see the same goroutine entering the command queue for a PushTxn again (I'll call this second incarnation of the first push Push1.2). I believe this means that after the preceding log block, the PushTxn was allowed to proceed, but hit a retryable error somewhere before it was proposed to raft. I'm not sure what that error could be, but it appears to have done the right thing in the command queue: the first incarnation of the command has been removed (so the 14-15 heartbeat should be executing now). The new push has ids 31-32. We also see IDs 17-22 showing up with the read-only portion of the EndTransaction. They weren't there before (maybe the covering optimization at work?). The 2 commands Push1.2 is waiting on are the read and write sides of the EndTransaction.

I170825 05:53:34.787416 15237224 storage/replica.go:2028  [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] 1 PushTxn: waiting for 2 overlapping requests
[snip]
command-queue(local):
  18 readonly [/Local/RangeID/447049/r"",/Local/RangeID/447049/s"")
  20 readonly [/Local/RangeID/447049/r/AbortCache/"00000000-0000-0000-0000-000000000000",/Local/RangeID/447049/r/AbortCache/"ffffffff-ffff-ffff-ffff-ffffffffffff")
  14 readonly [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  19 readonly [/Local/RangeID/447049/u/RangeLastReplicaGCTimestamp)
  17 readonly [/Local/Range/Table/51/1/1561186976538960094/RangeDescriptor)
  25 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  32 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  35 [/Local/RangeID/447049/r/AbortCache/"1b64e796-2911-476f-b751-8c908ab4aabc")
  27 [/Local/RangeID/483671/r"",/Local/RangeID/483671/s"")
  28 [/Local/RangeID/483671/u"",/Local/RangeID/483671/v"")
  29 [/Local/RangeID/483671/u/RangeLastReplicaGCTimestamp)
  26 [/Local/Range/Table/51/1/1561186976538960094,/Local/Range/???)
  23 [/Local/Range/Table/51/1/1561186976538960094/RangeDescriptor)
  15 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  22 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  31 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  34 [/Local/Range/Table/51/1/1561186976538960094/Transaction/"1b64e796-2911-476f-b751-8c908ab4aabc")
  24 [/Local/Range/Table/51/1/1561260896670017006/RangeDescriptor)

And now we get to Push2, with IDs 34-35. It's hard to tell from this which 2 commands it's waiting on. One is definitely Push1.2. But what's the other? The candidates are the second heartbeat or the EndTransaction (each of which has both read and write halves). Based on what happens next I think it has to be the heartbeat (probably the read half due to some of the timestamp-related logic in the command queue).

W170825 05:53:35.018087 15224381 storage/replica.go:2111  [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] context canceled while in command queue: PushTxn [/Local/Range/Table/51/1/1561186976538960094/RangeDescriptor,/Min)
I170825 05:53:35.018382 15237224 storage/replica.go:3075  [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] proposing command 0ba12e830838a78e: 1 PushTxn
I170825 05:53:35.444466 15232755 storage/replica.go:3075  [n6,s6,r447049/1:/Table/51/1/1561{1869…-2690…}] proposing command 18c3bd3d47d49b9b: 1 EndTxn

And now Push1.2 is canceled, and both EndTxn and Push2 are allowed to proceed (we can infer from this that Heartbeat2 also finished).

So the full sequence of commands is put, heartbeat, push, heartbeat, commit, push, push. And I think (given the coveringOptimization hypothesis), that can be reduced to heartbeat (or anything that stalls the commit), commit, push, push.

bdarnell (Contributor) commented:
It looks like we're only setting cancelled on the parent cmd, not the children.

It's worse than that: we're only setting prereqs on the parent cmd, not the children. And when the covering optimization is disabled, the prereqs themselves are always children, so prereqs don't transfer more than one hop. So this definitely explains it.

We need to better encapsulate the cmd struct. Right now there are two distinct subtypes of cmd (parents and leaves) as used in command_queue.go, but replica.go is oblivious to this. Conversely, cmd carries the canceled and prereqs fields, which are only touched in replica.go.

This makes me nervous about our test coverage here. Should we fix this or consider reverting #9448 until post 1.1?
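
To summarize the diagnosis, here is a toy model (invented types, not command_queue.go) of the broken hop: prerequisites recorded only on a parent cmd are invisible to its children, and since other commands depend on the children, a cancellation transfers nothing and the transitive dependency is lost.

    package main

    import "fmt"

    // cmd is a toy model of a command queue entry. A batch that declares several
    // spans gets one parent cmd plus a child cmd per span; only the children sit
    // in the interval tree, so other commands record *children* as their prereqs.
    type cmd struct {
        name     string
        children []*cmd
        prereqs  []*cmd // buggy behavior: populated on the parent only
        canceled bool
    }

    // onPrereqCanceled models what a waiter does when one of its prereqs is
    // canceled instead of executing: inherit the canceled command's own prereqs
    // so that ordering is still respected, then drop the canceled entry.
    func onPrereqCanceled(waiter, canceled *cmd) {
        waiter.prereqs = append(waiter.prereqs, canceled.prereqs...)
        var remaining []*cmd
        for _, p := range waiter.prereqs {
            if !p.canceled {
                remaining = append(remaining, p)
            }
        }
        waiter.prereqs = remaining
    }

    func main() {
        commit := &cmd{name: "EndTxn"}

        // Push1.2 waits on the commit, but the dependency lives on the parent.
        push1 := &cmd{name: "Push1.2", prereqs: []*cmd{commit}}
        push1Child := &cmd{name: "Push1.2/child"}
        push1.children = []*cmd{push1Child}

        // Push2 overlaps Push1.2's span, so its recorded prereq is the *child*.
        push2 := &cmd{name: "Push2", prereqs: []*cmd{push1Child}}

        // Push1.2 is canceled while in the queue; Push2 tries to inherit its
        // prerequisites, but the child carries none, so the transitive dependency
        // on EndTxn is lost and Push2 can proceed before the commit applies.
        push1Child.canceled = true
        onPrereqCanceled(push2, push1Child)
        fmt.Printf("Push2 now waits on %d command(s)\n", len(push2.prereqs)) // 0, should be 1
    }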

nvanbenschoten (Member) commented:
we're only setting prereqs on the parent cmd, not the children

Exactly, so right now we aren't properly treating cancellation of BatchRequests that create more than one *cmd for any access/scope combination. I have this fixed and am working on improving the test coverage, which was lacking because we were only testing with DeleteRangeRequests. I don't think we'll need to revert #9448, but your judgement on this is better than mine.

nvanbenschoten (Member) commented Aug 28, 2017

And when the covering optimization is disabled, the prereqs themselves are always children, so prereqs don't transfer more than one hop

That's not quite accurate. The key is that this will affect all BatchRequests that contain more than one span within a certain access/scope combination.

bdarnell (Contributor) commented:
That's not quite accurate. The key is that this will affect all BatchRequests that contain more than one span within a certain access/scope combination.

Right, but this is true of all PushTxn and EndTxn requests so all usage patterns except single KV ops can be affected by this.

nvanbenschoten (Member) commented:
Now that it looks like we fixed the consistency issue here, the rest of this issue can be moved to 1.2, right?

bdarnell (Contributor) commented:
Yeah. Or even better, retitle this to the consistency issue that took it over, close it, and make a new issue to lift the restriction in 1.2.

nvanbenschoten changed the title from "storage: allow splitting meta2 ranges" to "storage: transitive prerequisites are not always transferred on command cancellation" Aug 30, 2017
nvanbenschoten (Member) commented Aug 30, 2017

Fixed by #17939.

Meta2 splitting issue moved to #18032.

petermattis added a commit to petermattis/cockroach that referenced this issue Dec 10, 2017
Default the .meta zone config to 1h GC TTL. The shorter GC TTL reflects
the lack of need for ever performing historical queries on these ranges
coupled with the desire to keep the meta ranges smaller.

See cockroachdb#16266
See cockroachdb#14990
petermattis added a commit to petermattis/cockroach that referenced this issue Dec 11, 2017
Default the .meta zone config to 1h GC TTL and default the .liveness
zone config to 1m GC TTL. The shorter GC TTLs reflect the lack of need
for ever performing historical queries on these ranges coupled with the
desire to keep the meta and liveness ranges smaller.

See cockroachdb#16266
See cockroachdb#14990
petermattis added a commit to petermattis/cockroach that referenced this issue Dec 13, 2017
Default the .meta zone config to 1h GC TTL and default the .liveness
zone config to 10m GC TTL. The shorter GC TTLs reflect the lack of need
for ever performing historical queries on these ranges coupled with the
desire to keep the meta and liveness ranges smaller.

See cockroachdb#16266
See cockroachdb#14990
petermattis added a commit to petermattis/cockroach that referenced this issue Dec 16, 2017
Default the .meta zone config to 1h GC TTL and default the .liveness
zone config to 10m GC TTL. The shorter GC TTLs reflect the lack of need
for ever performing historical queries on these ranges coupled with the
desire to keep the meta and liveness ranges smaller.

See cockroachdb#16266
See cockroachdb#14990
petermattis added a commit to petermattis/cockroach that referenced this issue Dec 16, 2017
Default the .meta zone config to 1h GC TTL and default the .liveness
zone config to 10m GC TTL. The shorter GC TTLs reflect the lack of need
for ever performing historical queries on these ranges coupled with the
desire to keep the meta and liveness ranges smaller.

See cockroachdb#16266
See cockroachdb#14990

Release note (general change): Clusters are now initialized with default
.meta and .liveness zones with lower GC TTL configurations.