
kv,server: do more verbose logging of graceful drain range lease transfer, once the majority of range leases are transferred over #65659

Open
joshimhoff opened this issue May 25, 2021 · 6 comments
Labels
A-kv-observability C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) good first issue O-sre For issues SRE opened or otherwise cares about tracking. T-kv KV Team

Comments

@joshimhoff
Collaborator

joshimhoff commented May 25, 2021

Is your feature request related to a problem? Please describe.
Here are some example logs:

I210511 17:06:12.977887 1 cli/start.go:821 ⋮ initiating graceful shutdown of server
...
I210511 17:07:06.103135 18255896 server/drain.go:174 ⋮ [server drain process] drain remaining: 3
I210511 17:07:06.103174 18255896 server/drain.go:176 ⋮ [server drain process] drain details: range lease iterations: 3
I210511 17:07:07.360491 171 server/status/runtime.go:525 ⋮ [n1] runtime stats: 728 MiB RSS, 472 goroutines, 93 MiB/202 MiB/177 MiB GO alloc/idle/total, 350 MiB/452 MiB CGO alloc/total, 0.1 CGO/sec, 2.1/1.0 %(u/s)time, 0.0 %gc (0x), 358 KiB/1.0 MiB (r/w)net
I210511 17:07:07.978903 18255897 cli/start.go:831 ⋮ 26 running tasks
E210511 17:07:12.121121 18258961 server/server.go:1901 ⋮ [n1,client=‹10.0.2.129:53070›] serving SQL client conn: server is not accepting clients
I210511 17:07:12.978900 18255897 cli/start.go:831 ⋮ 26 running tasks

The logs above show a minute passing without the graceful drain finishing (that is, without all range lease transfers completing), leading to some impact during the update.

More detailed logs might help with debugging; it is not clear from the logs above what is going wrong.

Describe the solution you'd like

I expect it would help with debugging if we logged more verbose info once we get down to N=10 or so range leases still held on the node. We could log (a rough sketch follows the list):

  1. Range IDs, min & max keys, etc. of all ranges still leased by the node.
  2. Range lease transfer attempt info: log when a transfer attempt is made, log any error that is returned, and log how long we block on the transfer attempt. (Tracing would help here too.)
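
As a rough illustration of the idea (hypothetical names and threshold, not the actual drain code), the extra logging could look something like this:

```go
package main

import "log"

// rangeInfo is an illustrative stand-in for the per-range metadata
// (range ID, span bounds) that the drain process already has on hand.
type rangeInfo struct {
	RangeID  int64
	StartKey string
	EndKey   string
}

// verboseDrainThreshold is the hypothetical "N": once the node holds this
// many leases or fewer, log each remaining lease unconditionally.
const verboseDrainThreshold = 10

func logRemainingLeases(remaining []rangeInfo) {
	if len(remaining) > verboseDrainThreshold {
		return // stay quiet while the bulk of the leases is still transferring
	}
	for _, r := range remaining {
		log.Printf("drain: still holding lease for r%d [%s, %s)", r.RangeID, r.StartKey, r.EndKey)
	}
}

func main() {
	logRemainingLeases([]rangeInfo{{RangeID: 42, StartKey: "/Table/52", EndKey: "/Table/53"}})
}
```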

Describe alternatives you've considered
We could do nothing, relying on operators using vmodule to get the additional verbosity when needed.

Additional context
On CC, graceful drain doesn't always finish before SIGKILL time. We want to fix this.

Epic: CRDB-8501

Jira issue: CRDB-7706

@joshimhoff joshimhoff added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) O-sre For issues SRE opened or otherwise cares about tracking. A-kv-observability labels May 25, 2021
@knz
Contributor

knz commented May 25, 2021

NB: the logging you're requesting is already available via the vmodule setting.
Are you suggesting we should promote these log messages to be unconditional?

@knz knz changed the title Do more verbose logging of graceful drain range lease transfer, once the majority of range leases are transferred over kv,server: do more verbose logging of graceful drain range lease transfer, once the majority of range leases are transferred over May 25, 2021
@joshimhoff
Collaborator Author

That is my suggestion, but ONLY once we get down to N=10 (not sure what N should be) or so range leases still held on the node. Do you think that makes sense? I think there is no reason not to be very verbose once you are down to a small number of ranges. What do you think about that?

@knz
Contributor

knz commented May 25, 2021

I'm not opposed to it.

@joshimhoff
Collaborator Author

Ack. I think how useful it would be depends on the scope of the issue we see on CC. I can get more info about the scope without this change; I'll post here when I have a sense of it.

@cameronnunez
Contributor

cameronnunez commented Jul 29, 2021

@tbg has suggested that we use a verbose condition instead, since draining could stall or encounter errors at a point where the number of remaining leases is still greater than N. We can set verbose to true when the number of remaining leases does not decrease (or, worse, increases) across graceful drain calls.
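
A minimal sketch of that stall-detection condition (illustrative only, with simulated lease counts; not the actual drain code):

```go
package main

import "fmt"

// shouldLogVerbosely reports whether drain progress has stalled: the number of
// remaining leases did not decrease (or even increased) since the previous
// drain iteration. prevRemaining < 0 means there is no previous observation.
func shouldLogVerbosely(prevRemaining, remaining int) bool {
	return prevRemaining >= 0 && remaining >= prevRemaining
}

func main() {
	// Simulated lease counts over successive drain iterations.
	counts := []int{120, 40, 12, 12, 13}
	prev := -1
	for i, n := range counts {
		fmt.Printf("iteration %d: %d leases remaining, verbose=%t\n", i, n, shouldLogVerbosely(prev, n))
		prev = n
	}
}
```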

craig bot pushed a commit that referenced this issue Jan 25, 2022
68488: server: add verbose logging when graceful drain range lease transfer stalls r=knz a=cameronnunez

Informs #65659.

Debugging graceful drain is challenging due to limited information. It is not clear what
is going wrong from the logs. This patch adds more detailed logging when graceful drain
range lease transfer encounters issues. Specifically, range lease transfer attempt
information (when an attempt occurs, duration blocked on transfer attempt, and the lease
that is being transferred) is logged.
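
In rough outline, the per-attempt logging described above has roughly this shape (an illustrative sketch with placeholder names, not the code merged in the PR):

```go
package main

import (
	"errors"
	"log"
	"time"
)

// transferLease stands in for the per-range lease transfer attempt made during
// drain; the real call and its error handling live in the kvserver code.
func transferLease(rangeID int64) error {
	return errors.New("lease transfer rejected") // simulate a failure
}

// attemptTransferWithLogging logs when an attempt starts, how long it blocked,
// and whether it failed, as described in the commit message above.
func attemptTransferWithLogging(rangeID int64) {
	log.Printf("drain: attempting lease transfer for r%d", rangeID)
	start := time.Now()
	if err := transferLease(rangeID); err != nil {
		log.Printf("drain: lease transfer for r%d failed after %s: %v", rangeID, time.Since(start), err)
		return
	}
	log.Printf("drain: lease transfer for r%d completed in %s", rangeID, time.Since(start))
}

func main() {
	attemptTransferWithLogging(42)
}
```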

The reason behind the failure of the transfer attempt may not be logged on the
draining node. This is because this information lies on the node that serves the draining
request. Currently, we do not have logging that aggregates all of this information
in one place. See #74158.

Release note (cli change): If graceful drain range lease transfer encounters issues,
verbose logging is automatically set to help with troubleshooting.

75271: sql,server: create toString to role option r=maryliag a=maryliag

Previously, the code was using roleOption.String() to
create a string to be used in queries. This caused
an issue with role options that contain a space,
e.g. the role option `VALIDUNTIL` was being rendered as
`VALIDUNTIL` instead of the correct `VALID UNTIL`.
This commit updates the String() function to correctly
add the space in the `VALID UNTIL` case.
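
The gist of the fix, sketched with an illustrative enum (not the actual sql/roleoption package):

```go
package main

import "fmt"

// roleOption mirrors the shape of the role options discussed above;
// the names and values here are illustrative only.
type roleOption int

const (
	createRole roleOption = iota
	validUntil
)

// String renders the option as it must appear in SQL. The fix is to
// special-case VALID UNTIL so the two keywords keep their separating space.
func (o roleOption) String() string {
	switch o {
	case validUntil:
		return "VALID UNTIL"
	case createRole:
		return "CREATEROLE"
	default:
		return "UNKNOWN"
	}
}

func main() {
	fmt.Println(validUntil) // prints "VALID UNTIL", not "VALIDUNTIL"
}
```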

Release note (bug fix): Update the String() function of
roleOption to add a space on the role `VALID UNTIL`.

75293: backupccl: reuse allocated tenant prefix in rekey r=dt a=dt

Release note: none.

75303: migrations: make sure public role gets privs on public schema r=ajwerner a=rafiss

This matches up with how the public schema gets created in a
non-migration scenario.

Release note: None

Co-authored-by: Cameron Nunez <[email protected]>
Co-authored-by: Marylia Gutierrez <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@github-actions

We have marked this issue as stale because it has been inactive for
18 months. If this issue is still relevant, removing the stale label
or adding a comment will keep it active. Otherwise, we'll close it in
10 days to keep the issue queue tidy. Thank you for your contribution
to CockroachDB!
