Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: failover/system-non-liveness/blackhole-recv failed #99931

Closed
cockroach-teamcity opened this issue Mar 29, 2023 · 7 comments · Fixed by #102407
Closed

roachtest: failover/system-non-liveness/blackhole-recv failed #99931

cockroach-teamcity opened this issue Mar 29, 2023 · 7 comments · Fixed by #102407
Assignees
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Mar 29, 2023

roachtest.failover/system-non-liveness/blackhole-recv failed with artifacts on release-23.1 @ a34af06d8e7230e4052db80d55da21a06013f81b:

test artifacts and logs in: /artifacts/failover/system-non-liveness/blackhole-recv/run_1
(cluster.go:1977).Run: output in run_131047.746528933_n7_cockroach-workload-r: ./cockroach workload run kv --read-percent 50 --duration 20m --concurrency 256 --max-rate 2048 --timeout 1m --tolerate-errors --histograms=perf/stats.json {pgurl:1-3} returned: context canceled
(monitor.go:127).Wait: monitor failure: monitor command failure: unexpected node event: 6: dead (exit status 7)
(cluster.go:1977).Run: cluster.RunE: context canceled
(cluster.go:1387).FailOnInvalidDescriptors: invalid descriptors check failed: pq: replica unavailable: (n4,s4):4 unable to serve request to r9:/Table/{5-6} [(n4,s4):4, (n5,s5):2, (n6,s6):3, next=7, gen=18]: lost quorum (down: (n5,s5):2,(n6,s6):3); closed timestamp: 1680096374.884386491,0 (2023-03-29 13:26:14); raft status: {"id":"4","term":13,"vote":"0","commit":120,"lead":"0","raftState":"StatePreCandidate","applied":120,"progress":{},"leadtransferee":"0"}: have been waiting 62.00s for slow proposal RequestLease [/Table/5,/Min)

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-26147

@cockroach-teamcity cockroach-teamcity added branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 29, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 29, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 29, 2023
@andrewbaptist andrewbaptist self-assigned this Mar 29, 2023
@nvanbenschoten nvanbenschoten removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 29, 2023
@nvanbenschoten
Copy link
Member

This was an instance of #98698.

@andrewbaptist
Copy link
Collaborator

andrewbaptist commented Mar 29, 2023

This is a failure in the side transport. Looking through the logs, the following happened.
n6 crashed due a timestamp regression:

6.unredacted/cockroach.log:F230329 13:26:19.490179 80177 kv/kvserver/replica_closedts.go:263 ⋮ [n6] 2059  side-transport update saw closed timestamp regression on r26: (lai=23, ts=1680096375.380888424,0) -> (lai=22, ts=1680096376.289087959,0)

(the regression timestamps are 13:26:15.380 and 13:26:16.289; +3s is when they are likely issued by the side transport)

Node 1 had moved the replicas to 4, 5, 6 16 minutes earlier:

1.unredacted/cockroach-kv-distribution.log:I230329 13:10:00.254614 10988 13@kv/kvserver/replica_command.go:2324 ⋮ [T1,n1,replicate,s1,r26/1:‹/Table/2{4-5}›] 1313  change replicas (add [] remove [(n1,s1):1LEARNER]): existing descriptor r26:‹/Table/2{4-5}› [(n1,s1):1LEARNER, (n6,s6):2, (n5,s5):3, (n4,s4):4, next=5, gen=7]

Node 5 was cut off from the network (with an async partition) ~30 sec earlier:

 13:25:48 failover.go:574: test status: failing n5 (blackhole-recv)

It was still trying to transfer the lease over to n4 at this point - but I don't think it succeeded because of the network outage (although it thinks its for a different reason)

5.unredacted/cockroach-kv-distribution.log:I230329 
13:25:50.052077 74624 13@kv/kvserver/replicate_queue.go:2037 ⋮ [T1,n5,replicate,s5,r26/3:‹/Table/2{4-5}›] 1766  transferring lease to s4
13:26:06.416973 74624 13@kv/kvserver/replicate_queue.go:842 ⋮ [T1,n5,replicate,s5,r26/3:‹/Table/2{4-5}›] 1767  error processing replica: [n5,s5,r26/3:‹/Table/2{4-5}›]: unable to transfer lease to s4: refusing to transfer lease to (n4,s4):4 because target may need a Raft snapshot: ‹replica in StateProbe›‹›

My guess is that the lease was maybe ambiguous between n5 and n4 at this point in time. We cut off incoming connections to n5, but it still had outgoing connections so it might have sent a side transport update with a stale lease.

@nvanbenschoten
Copy link
Member

My guess is that the lease was maybe ambiguous between n5 and n4 at this point in time. We cut off incoming connections to n5, but it still had outgoing connections so it might have sent a side transport update with a stale lease.

It would be easy enough to log every incoming ctpb.Update to determine whether these messages came from the same node or from a different node. The log line is here:

log.VEventf(ctx, 4, "received side-transport update: %v", msg)

So we'd just need to update the test to run with --vmodule=receiver=4 and then hunt for a reproduction.

@andrewbaptist
Copy link
Collaborator

andrewbaptist commented Mar 29, 2023

I see a log on n5 right at the time this is happening:

W230329 13:26:19.517123 75763 kv/kvserver/raft_transport.go:605 ⋮ [T1,n5] 2208  while processing outgoing Raft queue to node 6: ‹EOF›:

and also from n4

I230329 13:26:29.592156 1197 kv/kvserver/closedts/sidetransport/sender.go:795 ⋮ [T1,n4,ctstream=6] 2822  side-transport failed to connect to n6: unable to dial n6: ‹breaker open›

So n5 likely was sending a side-transport update to n6 when n6 crashed. I can tee up trying to reproduce this with the increased logging.

@pav-kv pav-kv added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv Anything in KV that doesn't belong in a more specific category. labels Mar 30, 2023
@andrewbaptist andrewbaptist removed the branch-release-23.1 Used to mark GA and release blockers, technical advisories, and bugs for 23.1 label Apr 17, 2023
@andrewbaptist
Copy link
Collaborator

Removed the branch-release tag from this as it is not a release or GA blocker for 23.1 With the latest dialback changes this is less likely to happen on this test, but this is still a real bug in that we need to get to the bottom of. However, it has been happening for a long time.

@erikgrinaker
Copy link
Contributor

We're tracking this as a GA blocker over in #98698, @pavelkalinnikov will have a look this week.

@erikgrinaker erikgrinaker assigned pav-kv and unassigned andrewbaptist Apr 17, 2023
@erikgrinaker erikgrinaker added T-kv-replication and removed T-kv KV Team labels Apr 17, 2023
@blathers-crl
Copy link

blathers-crl bot commented Apr 17, 2023

cc @cockroachdb/replication

pav-kv added a commit to pav-kv/cockroach that referenced this issue Apr 27, 2023
The PROSCRIBED lease status, just like EXPIRED, puts a lease to a definitely
invalid state. The UNUSABLE state (when "now" is in stasis period) is less of a
clear cut: some callers may treat it as invalid, and some may thinks it's still
valid.

For example, the closed timestamp side-transport ignores the UNUSABLE state,
and takes it as valid. Because of the order in which the checks were made, this
has lead to a bug: a PROSCRIBED lease is reported as UNUSABLE during stasis
periods, the closed timestamp side-transport then considers it valid, and
updates closed timestamps when it shouldn't.

This commit fixes the bug by swapping the order of checks in the leaseStatus
method. The order now goes from "hard" checks like EXPIRED and PROSCRIBED, to
"softer" UNUSABLE, and (when the softness is put to the limit) VALID.

Fixes cockroachdb#98698
Fixes cockroachdb#99931
Fixes cockroachdb#100101
Epic: none

Release note (bug fix): a bug is fixed in closed timestamp updates within its
side-transport. Previously, during asymmetric partitions, a node that transfers
a lease away, and misses a liveness heartbeat, could then erroneously update
the closed timestamp during the stasis period of its liveness. This could lead
to closed timestamp invariant violation, and node crashes; in extreme cases
this could lead to inconsistencies in read-only queries.
pav-kv added a commit to pav-kv/cockroach that referenced this issue Apr 27, 2023
The PROSCRIBED lease status, just like EXPIRED, puts a lease to a definitely
invalid state. The UNUSABLE state (when request timestamp is in stasis period)
is less of a clear cut: some callers may treat it as invalid, and some may
think it's still valid.

For example, the closed timestamp side-transport ignores the UNUSABLE state,
and takes it as valid. Because of the order in which the checks were made, this
has lead to a bug: a PROSCRIBED lease is reported as UNUSABLE during stasis
periods, the closed timestamp side-transport then considers it valid, and
updates closed timestamps when it shouldn't.

This commit fixes the bug by swapping the order of checks in the leaseStatus
method. The order now goes from "hard" checks like EXPIRED and PROSCRIBED, to
"softer" UNUSABLE, and (when the softness is put to the limit) VALID.

Fixes cockroachdb#98698
Fixes cockroachdb#99931
Fixes cockroachdb#100101
Epic: none

Release note (bug fix): a bug is fixed in closed timestamp updates within its
side-transport. Previously, during asymmetric partitions, a node that transfers
a lease away, and misses a liveness heartbeat, could then erroneously update
the closed timestamp during the stasis period of its liveness. This could lead
to closed timestamp invariant violation, and node crashes; in extreme cases
this could lead to inconsistencies in read-only queries.
craig bot pushed a commit that referenced this issue Apr 28, 2023
100181: kv: Use strict types for common fields r=erikgrinaker a=andrewbaptist

This PR introduces 3 new typed fields in mvcc.go:
RaftTerm, RaftIndex and LeaseSequence. These fields were previously just unit64 throughout the code and this made the code harder to read and risked incorrect conversions.

Epic: none

Release note: None

102407: kvserver: check PROSCRIBED lease status over UNUSABLE r=erikgrinaker,tbg a=pavelkalinnikov

The PROSCRIBED lease status, just like EXPIRED, puts a lease to a definitely invalid state. The UNUSABLE state (when request timestamp is in stasis period) is less of a clear cut: we still own the lease but callers may use or not use it depending on context.

For example, the closed timestamp side-transport ignores the UNUSABLE state (because we still own the lease), and takes it as usable for its purposes. Because of the order in which the checks were made, this has lead to a bug: a PROSCRIBED lease is reported as UNUSABLE during stasis periods, the closed timestamp side-transport then considers it usable, and updates closed timestamps when it shouldn't.

This commit fixes the bug by swapping the order of checks in the leaseStatus method. The order now goes from "hard" checks like EXPIRED and PROSCRIBED, to "softer" UNUSABLE, and (when the softness is put to the limit) VALID.

Fixes #98698
Fixes #99931
Fixes #100101
Epic: none

Release note (bug fix): a bug is fixed in closed timestamp updates within its side-transport. Previously, during asymmetric partitions, a node that transfers a lease away, and misses a liveness heartbeat, could then erroneously update the closed timestamp during the stasis period of its liveness. This could lead to closed timestamp invariant violation, and node crashes; in extreme cases this could lead to inconsistencies in read-only queries.

102503: concurrency: do not partition locks in the lock table by span scope r=nvanbenschoten a=arulajmani

This patch is entirely a refactor and does not change any functionality. This is done in preparation for introducing `LockSpanSets` to track lock spans, which do not make a distinction between global and local keys (unlike `SpanSets`, which do).

The main changes here are in `lockTableImpl`, which actually stores locks, and `lockTableGuardImpl` which snapshots the lock table. We no longer make a distinction between locks on Local and Global keys when storing them.  The majority of this diff is composed of test file churn caused because of the printing changes to the lock table.

Informs #102008

Release note: None

102590: sql,persistedsqlstats: prevent a deadlock during shutdown r=j82w a=knz

Fixes #102574.

Prior to this change, the coordination between the stats flusher
task (an async stopper task) and the activity flusher job was
performed using a two-step process:

- the stats persistence task offered to call a callback _function_
  every time a flush would complete.
- the job would _reconfigure the callback function_ on each iteration.
- the function was writing to a channel that was subsequently
  read by the job iteration body.

This approach was defective in 3 ways:

1. if the job iteration body would exit (e.g. due to a server drain)
   *after* it installed the callback fn, but *before* the stats flusher
   would read and call the callback fn, a window of time existed
   where a deadlock could occur:
   - the stats flusher retrieves the pointer to the caller fn
     but doesn't call it yet.
   - the job loop exits. From then on it will not read from the
     channel any more.
   - the stats flusher attempts to write to the channel.
     A deadlock occurs.

   (This was seen during testing. See #102574)

   The fix here is to always jointly `select` the write to the channel
   and also a read from the drain/stopper signals, to abort the
   channel operation if a shutdown is requested.

2. the stats flusher task was holding the mutex locked while
   performing the channel write. This is generally bad code hygiene
   as it forces the code maintainer to double-check whether the lock
   and channel operations don't mutually interlock.

   The fix is to use the mutex to retrieve the channel reference, and
   then write to the channel while the mutex is not held any more.

3. the API between the two was defining a *callback function* where
   really just a notification channel was needed.

   The fix here is to simplify the API.

Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Arul Ajmani <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig craig bot closed this as completed in 401de30 Apr 28, 2023
blathers-crl bot pushed a commit that referenced this issue Apr 28, 2023
The PROSCRIBED lease status, just like EXPIRED, puts a lease to a definitely
invalid state. The UNUSABLE state (when request timestamp is in stasis period)
is less of a clear cut: we still own the lease but callers may use or not use
it depending on context.

For example, the closed timestamp side-transport ignores the UNUSABLE state
(because we still own the lease), and takes it as usable for its purposes.
Because of the order in which the checks were made, this has lead to a bug: a
PROSCRIBED lease is reported as UNUSABLE during stasis periods, the closed
timestamp side-transport then considers it usable, and updates closed
timestamps when it shouldn't.

This commit fixes the bug by swapping the order of checks in the leaseStatus
method. The order now goes from "hard" checks like EXPIRED and PROSCRIBED, to
"softer" UNUSABLE, and (when the softness is put to the limit) VALID.

Fixes #98698
Fixes #99931
Fixes #100101
Epic: none

Release note (bug fix): a bug is fixed in closed timestamp updates within its
side-transport. Previously, during asymmetric partitions, a node that transfers
a lease away, and misses a liveness heartbeat, could then erroneously update
the closed timestamp during the stasis period of its liveness. This could lead
to closed timestamp invariant violation, and node crashes; in extreme cases
this could lead to inconsistencies in read-only queries.
blathers-crl bot pushed a commit that referenced this issue Apr 28, 2023
The PROSCRIBED lease status, just like EXPIRED, puts a lease to a definitely
invalid state. The UNUSABLE state (when request timestamp is in stasis period)
is less of a clear cut: we still own the lease but callers may use or not use
it depending on context.

For example, the closed timestamp side-transport ignores the UNUSABLE state
(because we still own the lease), and takes it as usable for its purposes.
Because of the order in which the checks were made, this has lead to a bug: a
PROSCRIBED lease is reported as UNUSABLE during stasis periods, the closed
timestamp side-transport then considers it usable, and updates closed
timestamps when it shouldn't.

This commit fixes the bug by swapping the order of checks in the leaseStatus
method. The order now goes from "hard" checks like EXPIRED and PROSCRIBED, to
"softer" UNUSABLE, and (when the softness is put to the limit) VALID.

Fixes #98698
Fixes #99931
Fixes #100101
Epic: none

Release note (bug fix): a bug is fixed in closed timestamp updates within its
side-transport. Previously, during asymmetric partitions, a node that transfers
a lease away, and misses a liveness heartbeat, could then erroneously update
the closed timestamp during the stasis period of its liveness. This could lead
to closed timestamp invariant violation, and node crashes; in extreme cases
this could lead to inconsistencies in read-only queries.
blathers-crl bot pushed a commit that referenced this issue Apr 28, 2023
The PROSCRIBED lease status, just like EXPIRED, puts a lease to a definitely
invalid state. The UNUSABLE state (when request timestamp is in stasis period)
is less of a clear cut: we still own the lease but callers may use or not use
it depending on context.

For example, the closed timestamp side-transport ignores the UNUSABLE state
(because we still own the lease), and takes it as usable for its purposes.
Because of the order in which the checks were made, this has lead to a bug: a
PROSCRIBED lease is reported as UNUSABLE during stasis periods, the closed
timestamp side-transport then considers it usable, and updates closed
timestamps when it shouldn't.

This commit fixes the bug by swapping the order of checks in the leaseStatus
method. The order now goes from "hard" checks like EXPIRED and PROSCRIBED, to
"softer" UNUSABLE, and (when the softness is put to the limit) VALID.

Fixes #98698
Fixes #99931
Fixes #100101
Epic: none

Release note (bug fix): a bug is fixed in closed timestamp updates within its
side-transport. Previously, during asymmetric partitions, a node that transfers
a lease away, and misses a liveness heartbeat, could then erroneously update
the closed timestamp during the stasis period of its liveness. This could lead
to closed timestamp invariant violation, and node crashes; in extreme cases
this could lead to inconsistencies in read-only queries.
blathers-crl bot pushed a commit that referenced this issue Apr 28, 2023
The PROSCRIBED lease status, just like EXPIRED, puts a lease to a definitely
invalid state. The UNUSABLE state (when request timestamp is in stasis period)
is less of a clear cut: we still own the lease but callers may use or not use
it depending on context.

For example, the closed timestamp side-transport ignores the UNUSABLE state
(because we still own the lease), and takes it as usable for its purposes.
Because of the order in which the checks were made, this has lead to a bug: a
PROSCRIBED lease is reported as UNUSABLE during stasis periods, the closed
timestamp side-transport then considers it usable, and updates closed
timestamps when it shouldn't.

This commit fixes the bug by swapping the order of checks in the leaseStatus
method. The order now goes from "hard" checks like EXPIRED and PROSCRIBED, to
"softer" UNUSABLE, and (when the softness is put to the limit) VALID.

Fixes #98698
Fixes #99931
Fixes #100101
Epic: none

Release note (bug fix): a bug is fixed in closed timestamp updates within its
side-transport. Previously, during asymmetric partitions, a node that transfers
a lease away, and misses a liveness heartbeat, could then erroneously update
the closed timestamp during the stasis period of its liveness. This could lead
to closed timestamp invariant violation, and node crashes; in extreme cases
this could lead to inconsistencies in read-only queries.
cameronnunez pushed a commit to cameronnunez/cockroach that referenced this issue May 2, 2023
The PROSCRIBED lease status, just like EXPIRED, puts a lease to a definitely
invalid state. The UNUSABLE state (when request timestamp is in stasis period)
is less of a clear cut: we still own the lease but callers may use or not use
it depending on context.

For example, the closed timestamp side-transport ignores the UNUSABLE state
(because we still own the lease), and takes it as usable for its purposes.
Because of the order in which the checks were made, this has lead to a bug: a
PROSCRIBED lease is reported as UNUSABLE during stasis periods, the closed
timestamp side-transport then considers it usable, and updates closed
timestamps when it shouldn't.

This commit fixes the bug by swapping the order of checks in the leaseStatus
method. The order now goes from "hard" checks like EXPIRED and PROSCRIBED, to
"softer" UNUSABLE, and (when the softness is put to the limit) VALID.

Fixes cockroachdb#98698
Fixes cockroachdb#99931
Fixes cockroachdb#100101
Epic: none

Release note (bug fix): a bug is fixed in closed timestamp updates within its
side-transport. Previously, during asymmetric partitions, a node that transfers
a lease away, and misses a liveness heartbeat, could then erroneously update
the closed timestamp during the stasis period of its liveness. This could lead
to closed timestamp invariant violation, and node crashes; in extreme cases
this could lead to inconsistencies in read-only queries.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv Anything in KV that doesn't belong in a more specific category. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants