
kv,server: do more verbose logging of graceful drain range lease transfer, once the majority of range leases are transferred over #65659

Open
joshimhoff opened this issue May 25, 2021 · 6 comments
Labels
A-kv-observability C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) good first issue O-sre For issues SRE opened or otherwise cares about tracking. T-kv KV Team

Comments

@joshimhoff
Collaborator

joshimhoff commented May 25, 2021

Is your feature request related to a problem? Please describe.
Here are some example logs:

I210511 17:06:12.977887 1 cli/start.go:821 ⋮ initiating graceful shutdown of server
...
I210511 17:07:06.103135 18255896 server/drain.go:174 ⋮ [server drain process] drain remaining: 3
I210511 17:07:06.103174 18255896 server/drain.go:176 ⋮ [server drain process] drain details: range lease iterations: 3
I210511 17:07:07.360491 171 server/status/runtime.go:525 ⋮ [n1] runtime stats: 728 MiB RSS, 472 goroutines, 93 MiB/202 MiB/177 MiB GO alloc/idle/total, 350 MiB/452 MiB CGO alloc/total, 0.1 CGO/sec, 2.1/1.0 %(u/s)time, 0.0 %gc (0x), 358 KiB/1.0 MiB (r/w)net
I210511 17:07:07.978903 18255897 cli/start.go:831 ⋮ 26 running tasks
E210511 17:07:12.121121 18258961 server/server.go:1901 ⋮ [n1,client=‹10.0.2.129:53070›] serving SQL client conn: server is not accepting clients
I210511 17:07:12.978900 18255897 cli/start.go:831 ⋮ 26 running tasks

The logs above show a minute passing without the graceful drain finishing (that is, without all range lease transfers completing), leading to some impact during the update.

More detailed logs might help with debugging; it is not clear from the logs above what is going wrong.

Describe the solution you'd like

I expect it would help with debugging if we logged more verbose info once we get down to N=10 or so range leases still held on the node. We could log (a rough sketch follows the list):

  1. Range IDs, min & max keys, etc. of all ranges still leased by the node.
  2. Range lease transfer attempt info: log when a transfer attempt is made, log any error that is returned, and log how long we block on the transfer attempt. (Tracing would help here too.)
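
As a rough illustration of the idea (hypothetical names and threshold, not the actual drain code), the extra logging could look something like this:

```go
package main

import "log"

// rangeInfo is an illustrative stand-in for the per-range metadata
// (range ID, span bounds) that the drain process already has on hand.
type rangeInfo struct {
	RangeID  int64
	StartKey string
	EndKey   string
}

// verboseDrainThreshold is the hypothetical "N": once the node holds this
// many leases or fewer, log each remaining lease unconditionally.
const verboseDrainThreshold = 10

func logRemainingLeases(remaining []rangeInfo) {
	if len(remaining) > verboseDrainThreshold {
		return // stay quiet while the bulk of the leases is still transferring
	}
	for _, r := range remaining {
		log.Printf("drain: still holding lease for r%d [%s, %s)", r.RangeID, r.StartKey, r.EndKey)
	}
}

func main() {
	logRemainingLeases([]rangeInfo{{RangeID: 42, StartKey: "/Table/52", EndKey: "/Table/53"}})
}
```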

Describe alternatives you've considered
We could do nothing, relying on operators using vmodule to get the additional verbosity when needed.

Additional context
On CC, graceful drain doesn't always finish before SIGKILL time. We want to fix this.

Epic: CRDB-8501

Jira issue: CRDB-7706

@joshimhoff joshimhoff added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking. A-kv-observability labels May 25, 2021
@knz
Contributor

knz commented May 25, 2021

NB: the logging you're requesting is already available via the vmodule setting.
Are you suggesting we should promote these log messages to be unconditional?

@knz knz changed the title Do more verbose logging of graceful drain range lease transfer, once the majority of range leases are transferred over kv,server: do more verbose logging of graceful drain range lease transfer, once the majority of range leases are transferred over May 25, 2021
@joshimhoff
Collaborator Author

That is my suggestion, but ONLY once we get down to N=10 (not sure what N should be) or so range leases still held on the node. Do you think that makes sense? I think there is no reason not to be very verbose once you are down to a small number of ranges. What do you think about that?

@knz
Contributor

knz commented May 25, 2021

I'm not opposed to it.

@joshimhoff
Collaborator Author

Ack. I think how useful it would be depends on the scope of the issue we see on CC. I can get more info about the scope without this change; I'll post here when I have a sense of it.

@cameronnunez
Contributor

cameronnunez commented Jul 29, 2021

@tbg has suggested that we use a verbose condition instead, since draining could stall or encounter errors at a point where the number of remaining leases is still greater than N. We can set verbose to true when the number of remaining leases does not decrease (or, worse, increases) across graceful drain calls.
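
A minimal sketch of that stall-detection condition (illustrative only, with simulated lease counts; not the actual drain code):

```go
package main

import "fmt"

// shouldLogVerbosely reports whether drain progress has stalled: the number of
// remaining leases did not decrease (or even increased) since the previous
// drain iteration. prevRemaining < 0 means there is no previous observation.
func shouldLogVerbosely(prevRemaining, remaining int) bool {
	return prevRemaining >= 0 && remaining >= prevRemaining
}

func main() {
	// Simulated lease counts over successive drain iterations.
	counts := []int{120, 40, 12, 12, 13}
	prev := -1
	for i, n := range counts {
		fmt.Printf("iteration %d: %d leases remaining, verbose=%t\n", i, n, shouldLogVerbosely(prev, n))
		prev = n
	}
}
```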

craig bot pushed a commit that referenced this issue Jan 25, 2022
68488: server: add verbose logging when graceful drain range lease transfer stalls r=knz a=cameronnunez

Informs #65659.

Debugging graceful drain is challenging due to limited information. It is not clear what
is going wrong from the logs. This patch adds more detailed logging when graceful drain
range lease transfer encounters issues. Specifically, range lease transfer attempt
information (when an attempt occurs, duration blocked on transfer attempt, and the lease
that is being transferred) is logged.
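
In rough outline, the per-attempt logging described above has roughly this shape (an illustrative sketch with placeholder names, not the code merged in the PR):

```go
package main

import (
	"errors"
	"log"
	"time"
)

// transferLease stands in for the per-range lease transfer attempt made during
// drain; the real call and its error handling live in the kvserver code.
func transferLease(rangeID int64) error {
	return errors.New("lease transfer rejected") // simulate a failure
}

// attemptTransferWithLogging logs when an attempt starts, how long it blocked,
// and whether it failed, as described in the commit message above.
func attemptTransferWithLogging(rangeID int64) {
	log.Printf("drain: attempting lease transfer for r%d", rangeID)
	start := time.Now()
	if err := transferLease(rangeID); err != nil {
		log.Printf("drain: lease transfer for r%d failed after %s: %v", rangeID, time.Since(start), err)
		return
	}
	log.Printf("drain: lease transfer for r%d completed in %s", rangeID, time.Since(start))
}

func main() {
	attemptTransferWithLogging(42)
}
```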

The reason behind the failure of the transfer attempt may not be logged on the
draining node. This is because this information lies on the node that serves the draining
request. Currently, we do not have logging that aggregates all of this information
in one place. See #74158.

Release note (cli change): If graceful drain range lease transfer encounters issues,
verbose logging is automatically set to help with troubleshooting.

75271: sql,server: create toString to role option r=maryliag a=maryliag

Previously, the code was using roleOption.String() to
create a string to be used in queries. This caused
an issue with role options that contain a space,
e.g. the role option `VALIDUNTIL` was being rendered as
`VALIDUNTIL` instead of the correct `VALID UNTIL`.
This commit updates the String() function to correctly
add the space in the `VALID UNTIL` case.
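
The gist of the fix, sketched with an illustrative enum (not the actual sql/roleoption package):

```go
package main

import "fmt"

// roleOption mirrors the shape of the role options discussed above;
// the names and values here are illustrative only.
type roleOption int

const (
	createRole roleOption = iota
	validUntil
)

// String renders the option as it must appear in SQL. The fix is to
// special-case VALID UNTIL so the two keywords keep their separating space.
func (o roleOption) String() string {
	switch o {
	case validUntil:
		return "VALID UNTIL"
	case createRole:
		return "CREATEROLE"
	default:
		return "UNKNOWN"
	}
}

func main() {
	fmt.Println(validUntil) // prints "VALID UNTIL", not "VALIDUNTIL"
}
```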

Release note (bug fix): Update the String() function of
roleOption to add a space on the role `VALID UNTIL`.

75293: backupccl: reuse allocated tenant prefix in rekey r=dt a=dt

Release note: none.

75303: migrations: make sure public role gets privs on public schema r=ajwerner a=rafiss

This matches up with how the public schema gets created in a
non-migration scenario.

Release note: None

Co-authored-by: Cameron Nunez <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
