Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: "new range lease" log entry is flooding logs #119451

Closed
ajstorm opened this issue Feb 21, 2024 · 3 comments · Fixed by #119474
Closed

kv: "new range lease" log entry is flooding logs #119451

ajstorm opened this issue Feb 21, 2024 · 3 comments · Fixed by #119474
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster P-1 Issues/test failures with a fix SLA of 1 month

Comments

@ajstorm
Copy link
Collaborator

ajstorm commented Feb 21, 2024

On the DRT cluster we're seeing several log files filled up very quickly with this log entry (a drain is in progress at the time):

https://github.com/cockroachdb/cockroach/blob/master/pkg/kv/kvserver/replica_proposal.go#L377-L386

In the 15 minute period below, we're filling 130 MBs of log files with this message. This is problematic because it causes the log files to wrap faster, thus losing important information (see #119384).

-rw-r----- 1 ubuntu ubuntu 10485735 Feb 20 14:18 cockroach.cct-232-0009.ubuntu.2024-02-20T14_18_32Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485673 Feb 20 14:18 cockroach.cct-232-0009.ubuntu.2024-02-20T14_18_05Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485371 Feb 20 14:18 cockroach.cct-232-0009.ubuntu.2024-02-20T14_17_40Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485677 Feb 20 14:17 cockroach.cct-232-0009.ubuntu.2024-02-20T14_17_14Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485461 Feb 20 14:17 cockroach.cct-232-0009.ubuntu.2024-02-20T14_16_48Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485693 Feb 20 14:16 cockroach.cct-232-0009.ubuntu.2024-02-20T14_16_22Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485594 Feb 20 14:16 cockroach.cct-232-0009.ubuntu.2024-02-20T14_15_57Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485304 Feb 20 14:15 cockroach.cct-232-0009.ubuntu.2024-02-20T14_15_31Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485457 Feb 20 14:15 cockroach.cct-232-0009.ubuntu.2024-02-20T14_15_06Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485666 Feb 20 14:15 cockroach.cct-232-0009.ubuntu.2024-02-20T14_14_40Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485392 Feb 20 14:14 cockroach.cct-232-0009.ubuntu.2024-02-20T14_14_14Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485734 Feb 20 14:14 cockroach.cct-232-0009.ubuntu.2024-02-20T14_13_48Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485493 Feb 20 14:13 cockroach.cct-232-0009.ubuntu.2024-02-20T14_13_23Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485627 Feb 20 14:13 cockroach.cct-232-0009.ubuntu.2024-02-20T14_12_57Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485534 Feb 20 14:12 cockroach.cct-232-0009.ubuntu.2024-02-20T14_12_32Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485533 Feb 20 14:12 cockroach.cct-232-0009.ubuntu.2024-02-20T14_12_06Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485290 Feb 20 14:12 cockroach.cct-232-0009.ubuntu.2024-02-20T14_11_40Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485708 Feb 20 14:11 cockroach.cct-232-0009.ubuntu.2024-02-20T14_11_15Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485549 Feb 20 14:11 cockroach.cct-232-0009.ubuntu.2024-02-20T14_10_49Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485522 Feb 20 14:10 cockroach.cct-232-0009.ubuntu.2024-02-20T14_10_24Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485619 Feb 20 14:10 cockroach.cct-232-0009.ubuntu.2024-02-20T14_09_58Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485616 Feb 20 14:09 cockroach.cct-232-0009.ubuntu.2024-02-20T14_09_32Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485275 Feb 20 14:09 cockroach.cct-232-0009.ubuntu.2024-02-20T14_09_07Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485473 Feb 20 14:09 cockroach.cct-232-0009.ubuntu.2024-02-20T14_08_41Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485495 Feb 20 14:08 cockroach.cct-232-0009.ubuntu.2024-02-20T14_08_15Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485534 Feb 20 14:08 cockroach.cct-232-0009.ubuntu.2024-02-20T14_07_50Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485458 Feb 20 14:07 cockroach.cct-232-0009.ubuntu.2024-02-20T14_07_24Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485611 Feb 20 14:07 cockroach.cct-232-0009.ubuntu.2024-02-20T14_06_58Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485484 Feb 20 14:06 cockroach.cct-232-0009.ubuntu.2024-02-20T14_06_32Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485484 Feb 20 14:06 cockroach.cct-232-0009.ubuntu.2024-02-20T14_06_06Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485625 Feb 20 14:06 cockroach.cct-232-0009.ubuntu.2024-02-20T14_05_41Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485633 Feb 20 14:05 cockroach.cct-232-0009.ubuntu.2024-02-20T14_05_15Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485488 Feb 20 14:05 cockroach.cct-232-0009.ubuntu.2024-02-20T14_04_49Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485677 Feb 20 14:04 cockroach.cct-232-0009.ubuntu.2024-02-20T14_04_24Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485480 Feb 20 14:04 cockroach.cct-232-0009.ubuntu.2024-02-20T14_03_58Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485647 Feb 20 14:03 cockroach.cct-232-0009.ubuntu.2024-02-20T14_03_32Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485525 Feb 20 14:03 cockroach.cct-232-0009.ubuntu.2024-02-20T14_03_07Z.2020030.log
-rw-r----- 1 ubuntu ubuntu 10485423 Feb 20 14:03 cockroach.cct-232-0009.ubuntu.2024-02-20T14_02_41Z.2020030.log

Marking it as a P-1 for now, as it's impeding our ability to diagnose problems on the DRT cluster.

Jira issue: CRDB-36203

@ajstorm ajstorm added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster P-1 Issues/test failures with a fix SLA of 1 month labels Feb 21, 2024
Copy link

blathers-crl bot commented Feb 21, 2024

cc @cockroachdb/replication

@erikgrinaker
Copy link
Contributor

I believe this was added to diagnose otherwise hard-to-track-down decommission stalls due to draining nodes picking up spurious leases. @nvanbenschoten added this back in #87885, so he may have additional context. I suppose we could add a log.Every on this or something.

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Feb 21, 2024

The logs all look like:

./cockroach.cct-232-0009.ubuntu.2024-02-20T13_43_21Z.2020030.log:I240220 13:43:21.044230 3353 kv/kvserver/replica_proposal.go:382 ⋮ [T1,Vsystem,n9,s9,r1638520/5:‹/Tenant/3/Table/168277/1/4…›,raft] 1562878  new range lease repl=(n9,s9):5 seq=74813 start=1708423336.989804834,0 exp=1708436606.990520355,0 pro=1708436600.990520355,0 following repl=(n9,s9):5 seq=74813 start=1708423336.989804834,0 exp=1708436603.505445057,0 pro=1708436597.505445057,0
./cockroach.cct-232-0009.ubuntu.2024-02-20T13_43_21Z.2020030.log:I240220 13:43:21.044126 3383 kv/kvserver/replica_proposal.go:382 ⋮ [T1,Vsystem,n9,s9,r1695158/4:‹/Tenant/3/Table/170856/1/-62…›,raft] 1562872  new range lease repl=(n9,s9):4 seq=75618 start=1708424184.346483435,1 exp=1708436607.012891049,0 pro=1708436601.012891049,0 following repl=(n9,s9):4 seq=75618 start=1708424184.346483435,1 exp=1708436604.008157019,0 pro=1708436598.008157019,0
./cockroach.cct-232-0009.ubuntu.2024-02-20T13_43_21Z.2020030.log:I240220 13:43:21.044139 3375 kv/kvserver/replica_proposal.go:382 ⋮ [T1,Vsystem,n9,s9,r1686594/1:‹/Tenant/3/Table/170424/1/…›,raft] 1562873  new range lease repl=(n9,s9):1 seq=75427 start=1708428828.753862930,1 exp=1708436607.013107442,0 pro=1708436601.013107442,0 following repl=(n9,s9):1 seq=75427 start=1708428828.753862930,1 exp=1708436604.008380223,0 pro=1708436598.008380223,0
./cockroach.cct-232-0009.ubuntu.2024-02-20T13_43_21Z.2020030.log:I240220 13:43:21.043851 3377 kv/kvserver/replica_proposal.go:382 ⋮ [T1,Vsystem,n9,s9,r1710170/4:‹/Tenant/3/Table/171088/1/14…›,raft] 1562857  new range lease repl=(n9,s9):4 seq=75743 start=1708427132.548816779,1 exp=1708436607.007520912,0 pro=1708436601.007520912,0 following repl=(n9,s9):4 seq=75743 start=1708427132.548816779,1 exp=1708436604.003125014,0 pro=1708436598.003125014,0
./cockroach.cct-232-0009.ubuntu.2024-02-20T13_43_21Z.2020030.log:I240220 13:43:21.043710 3376 kv/kvserver/replica_proposal.go:382 ⋮ [T1,Vsystem,n9,s9,r1688614/6:‹/Tenant/3/Table/170537/1/4…›,raft] 1562851  new range lease repl=(n9,s9):6 seq=75467 start=1708427131.576933606,1 exp=1708436606.996262994,0 pro=1708436600.996262994,0 following repl=(n9,s9):6 seq=75467 start=1708427131.576933606,1 exp=1708436603.513797742,0 pro=1708436597.513797742,0

Notice that they're expiration-based leases extensions, not acquisitions. We should probably only be logging this loudly when a lease is changing hands. I'll make that change.

craig bot pushed a commit that referenced this issue Feb 21, 2024
119474: kv: don't log on lease extension while draining r=nvanbenschoten a=nvanbenschoten

Fixes #119451.

This commit updates the logging in `Replica.leasePostApplyLocked` to not log on expiration-based lease extensions while the new leaseholder's store is draining. We now only log when a draining store receives a new lease.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
@craig craig bot closed this as completed in ec8bd9b Feb 21, 2024
blathers-crl bot pushed a commit that referenced this issue Feb 21, 2024
Fixes #119451.

This commit updates the logging in `Replica.leasePostApplyLocked` to not
log on expiration-based lease extensions while the new leaseholder's store
is draining. We now only log when a draining store receives a new lease.

Release note: None
blathers-crl bot pushed a commit that referenced this issue Feb 21, 2024
Fixes #119451.

This commit updates the logging in `Replica.leasePostApplyLocked` to not
log on expiration-based lease extensions while the new leaseholder's store
is draining. We now only log when a draining store receives a new lease.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. O-testcluster Issues found or occurred on a test cluster, i.e. a long-running internal cluster P-1 Issues/test failures with a fix SLA of 1 month
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants