*: ban the go keyword in non-test code #58164

jordanlewis · 2020-12-22T02:36:35Z

Misuse of the go keyword is problematic for a server that aims to be highly available, like CockroachDB: without custom crash handling, a panic on the resulting goroutine crashes the entire process.

There are already several wrappers for go throughout the codebase, such as Tasks and Workers creatable by the stopper. These wrappers perform cleanup and report errors to sentry, and can be extended over time. Unfortunately, they're not used consistently, and their use is not enforced.

This ticket tracks removing the go keyword from non-test code, except for the packages that implement the wrappers, and adding a linter that bans direct use of the go keyword except for in those places.
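As a rough illustration of what such a wrapper does, here is a minimal sketch in plain Go. The names `Go`, `PanicReport`, and the `report` callback are hypothetical and stand in for whatever cleanup and Sentry reporting the real wrappers perform; this is not the Stopper's API. The point it shows is that `recover` only works on the goroutine that panicked, so the handling has to be installed by whoever starts the goroutine.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"sync"
)

// PanicReport describes a panic recovered from a goroutine started by Go.
// The type and its fields are hypothetical, chosen for this sketch only.
type PanicReport struct {
	Name  string      // label given to the goroutine by the caller
	Value interface{} // value that was passed to panic
	Stack []byte      // stack captured on the panicking goroutine
}

// Go runs f on a new goroutine. A panic in f is recovered on that same
// goroutine and handed to report instead of terminating the process.
func Go(name string, report func(PanicReport), f func()) {
	go func() {
		defer func() {
			if r := recover(); r != nil {
				report(PanicReport{Name: name, Value: r, Stack: debug.Stack()})
			}
		}()
		f()
	}()
}

func main() {
	var wg sync.WaitGroup
	wg.Add(1)
	Go("demo", func(p PanicReport) {
		defer wg.Done()
		fmt.Printf("goroutine %q panicked: %v\n", p.Name, p.Value)
	}, func() {
		panic("boom")
	})
	wg.Wait()
	fmt.Println("process still alive")
}
```

Recovering and carrying on is not automatically safe (locks held by the panicking code stay held, for example), so a real wrapper would have to decide per call site whether to continue or to shut down in an orderly way after reporting.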

Jira issue: CRDB-3425

jordanlewis · 2020-12-22T04:41:37Z

Here is the current list of users of go in the source tree. It's possible that not all of these would be appropriately replaced by stopper.RunAsyncTask or whatever the case may be. I would advocate adding a simpler utils.Go() method that performs cleanup, dumps goroutines and currently-active queries to a file, and reports to Sentry, but perhaps some of these ought to be migrated to a stopper. I'm not sure.

rg -tgo -g "!{*_test,fuzz,*.pb.gw}.go" -g"!pkg/{workload,cmd,cli,acceptance,testutils}" -g"!pkg/util/stop"  "\Wgo func" pkg | pbcopy

pkg/sql/virtual_table.go:	go func() {
pkg/sql/planhook.go:	go func() {
pkg/sql/colexec/parallel_unordered_synchronizer.go:		go func(input SynchronizerInput, inputIdx int) {
pkg/sql/copy_file_upload.go:	go func() {
pkg/sql/pgwire/conn.go:	go func() {
pkg/sql/pgwire/server.go:	go func() {
pkg/sql/create_stats.go:	go func() {
pkg/sql/stats/stats_cache.go:	go func() {
pkg/sql/execinfra/base.go:			go func(input RowSource) {
pkg/sql/flowinfra/inbound.go:	go func() {
pkg/sql/flowinfra/flow_scheduler.go:	go func() {
pkg/sql/flowinfra/flow_registry.go:				go func(r InboundStreamHandler) {
pkg/sql/flowinfra/flow_registry.go:	go func() {
pkg/sql/flowinfra/flow.go:		go func(i int) {
pkg/sql/flowinfra/flow.go:	go func() {
pkg/sql/flowinfra/flow.go:		go func(receiver InboundStreamHandler) {
pkg/sql/rowflow/routers.go:		go func(ctx context.Context, rb *routerBase, ro *routerOutput, wg *sync.WaitGroup) {
pkg/sql/internal.go:	go func() {
pkg/sql/colflow/colrpc/outbox.go:	go func() {
pkg/sql/colflow/vectorized_flow.go:			go func() {
pkg/rpc/context.go:	go func() {
pkg/rpc/context.go:	go func() {
pkg/jobs/deprecated.go:		go func() {
pkg/jobs/deprecated.go:		go func() {
pkg/jobs/adopt.go:		go func(id int64) {
pkg/jobs/adopt.go:	go func() {
pkg/jobs/adopt.go:	go func() {
pkg/server/status.go:		go func() {
pkg/server/drain.go:	go func() {
pkg/ccl/sqlproxyccl/server.go:	go func() {
pkg/ccl/sqlproxyccl/server.go:		go func() {
pkg/ccl/changefeedccl/changefeed_processors.go:	go func() {
pkg/ccl/sqlproxyccl/proxy.go:			go func() {
pkg/ccl/sqlproxyccl/proxy.go:	go func() {
pkg/ccl/sqlproxyccl/proxy.go:	go func() {
pkg/ccl/importccl/import_processor.go:	go func() {
pkg/ccl/backupccl/backup_processor.go:	go func() {
pkg/ccl/backupccl/split_and_scatter_processor.go:	go func() {
pkg/util/netutil/net.go:		go func() {
pkg/util/log/clog.go:		go func() {
pkg/util/hlc/hlc.go:	go func() {
pkg/security/certificate_manager.go:	go func() {
pkg/util/tracing/grpc_interceptor.go:	go func() {
pkg/util/sdnotify/sdnotify_unix.go:	go func() {
pkg/util/sdnotify/sdnotify_unix.go:	go func() {

knz · 2020-12-22T18:57:58Z

I have looked at a few. I believe we need a mix of both. A large number can+should use a stopper and a context.Context, which will also ensure they are properly shut down. A few of them would need something smaller, so I'd be in favor of adding utils.Go() as well.

knz · 2021-01-18T11:38:26Z

@tbg this connects well to our discussion earlier last week to promote the use of the stopper to ensure that logging contexts get properly propagated.

tbg · 2021-01-18T11:48:17Z

Thanks! I wasn't aware of this issue, glad it already exists.

First, I wanted to add that we'd also have to do something about errgroup, ctxgroup, etc. These generally expose a .Go method, so it shouldn't be so hard to lint against them as well (and they're also straightforward enough to wrap).

My (unordered) running list of benefits for reworking our Stopper to be the go replacement primitive is the following (makes no attempt to avoid duplicating arguments mentioned above):

- can recover/report panics automatically (caveat: more work is needed to do so without causing further issues, such as tagging mutex critical sections)
- integration with background tracing
- possible path towards Span memory reuse (since we can prevent spans from crossing goroutine boundaries)
- can automatically assign goroutine labels
  - this enables a better implementation of trace-based deterministic goroutine testing as spearheaded by @nvanbenschoten here and possibly allows us to write many more tests in this fashion
- removing the log singleton (one log instance per stopper, handed to each task via ctx)
- avoid test failures where the stopper stops but an RPC handler still fires and runs into, e.g., panic("pebble: closed").

tbg · 2021-01-29T10:19:14Z

I am looking at our existing use of profiler labels (https://rakyll.org/profiler-labels/) right now and what it would take to make them usable for observability - say you're looking at a node that hovers at 95% CPU, what operation causes the CPU to be consumed? I think that in the short term, profiler labels are our best bet at developing a solution here.

Profiler labels are interesting because they transition into child goroutines, meaning that if we do a good job at labeling goroutines "where the work starts" and making sure that these labels hop across RPCs, we should be in a fairly good place. Right now, it's good enough so that if some SQL operation is burning CPU, you will be able to pick out which one it is. But what isn't yet possible (as far as I can tell) is that if you start with an overloaded CPU, you can tell from the profiles conclusively that a statement is responsible for it. Sure, you see some CPU attributed to that statement, but not, like, the lion's share. It seems that most of the work done on behalf of the statement is elsewhere. I still have to understand why that is, but it's possible that it ties in with this issue. One issue is definitely that when you delegate work to a thread pool (as happens for raft operations), the association breaks. But at least for read-heavy workloads, I'm not aware of such a switching point. Possibly DistSQL has one, though.

We are likely going to invest more in the stopper-conferred observability in the near future as part of initiatives such as cockroachdb#58164, but the task tracking that has been a part of the stopper since near its conception has not proven to be useful in practice, while at the same time raising concern about stopper use in hot paths.

When shutting down a running server, we don't particularly care about leaking goroutines (as the process will end anyway). In tests, we want to ensure goroutine hygiene, but if a test hangs during `Stop`, it is easier to look at the stacks to find out why than to consult the task map.

Together, this left little reason to do anything more complicated than what's left after this commit: we keep track of the running number of tasks, and wait until this drops to zero.

With this change in, we should feel comfortable using the stopper extensively and, for example, ensuring that any CRDB goroutine is anchored in a Stopper task; this is the right approach for test flakes such as in cockroachdb#51544 and makes sense for all of the reasons mentioned in issue cockroachdb#58164 as well.

In a future change, we should make the Stopper more configurable and, through this configurability, we could in principle bring a version of the task map back (in debug builds) without baking it into the stopper, though I don't anticipate that we'll want to.

Closes cockroachdb#52894.

Release note: None
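The bookkeeping this commit message describes (count the running tasks, wait for the count to reach zero) can be sketched roughly as follows. `CountingStopper`, `RunAsync`, and `ErrStopped` are hypothetical names for this example and are simpler than the real `stop.Stopper`.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrStopped is returned when a task is submitted after Stop has begun.
var ErrStopped = errors.New("stopper is stopping")

// CountingStopper tracks only how many tasks are running, not which ones.
// (Hypothetical type for illustration only.)
type CountingStopper struct {
	mu       sync.Mutex
	cond     *sync.Cond
	numTasks int
	stopping bool
}

func NewCountingStopper() *CountingStopper {
	s := &CountingStopper{}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// RunAsync starts f on a new goroutine unless the stopper is stopping.
func (s *CountingStopper) RunAsync(f func()) error {
	s.mu.Lock()
	if s.stopping {
		s.mu.Unlock()
		return ErrStopped
	}
	s.numTasks++
	s.mu.Unlock()
	go func() {
		defer func() {
			s.mu.Lock()
			s.numTasks--
			s.cond.Broadcast()
			s.mu.Unlock()
		}()
		f()
	}()
	return nil
}

// Stop refuses new tasks and blocks until the task count drops to zero.
func (s *CountingStopper) Stop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.stopping = true
	for s.numTasks > 0 {
		s.cond.Wait()
	}
}

// NumTasks reports the number of tasks currently running.
func (s *CountingStopper) NumTasks() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.numTasks
}

func main() {
	s := NewCountingStopper()
	results := make(chan int, 3)
	for i := 0; i < 3; i++ {
		i := i
		_ = s.RunAsync(func() { results <- i * i })
	}
	s.Stop()
	close(results)
	sum := 0
	for v := range results {
		sum += v
	}
	fmt.Println("sum of squares:", sum, "tasks left:", s.NumTasks())
	fmt.Println("submit after Stop:", s.RunAsync(func() {}))
}
```

With only a counter, a hang in `Stop` says that some task is still running but not which one; as the message notes, the goroutine stacks answer that question.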

59647: stop: rip out expensive task tracking r=knz a=tbg

First commit was put up for PR separately, ignore it here.

59732: backupccl: add an owner column behind the WITH PRIVILEGES option r=pbardea a=Elliebababa

Previously, when users perform RESTORE, they are ignorant of the original owner. This PR gives ownership data as a column behind privileges. Resolves: #57906. Release note: None.

59746: opt: switch checks to use CrdbTestBuild instead of RaceEnabled r=RaduBerinde a=RaduBerinde

The RaceEnabled flag is not very useful for checks; e.g. apparently execbuilder tests aren't run routinely in race mode. These checks are now "live" in any test build, using the crdb_test build tag. Release note: None

59747: tree: correct StatementTag of ALTER TABLE ... LOCALITY r=ajstorm a=otan

Release note: None

Co-authored-by: Tobias Grieger
Co-authored-by: elliebababa
Co-authored-by: Radu Berinde
Co-authored-by: Oliver Tan

This helps move the needle on cockroachdb#58164 by introducing linters that both force the use of a `ctxgroup` over an `errgroup` and prevent direct use of the `go` keyword. It's part of `make lint`, but can be invoked stand-alone as well:

```
go vet -vettool ./bin/roachvet -errgroupgo ./pkg/somewhere
go vet -vettool ./bin/roachvet -nakedgo ./pkg/somewhere
```

The lint currently fails with over 200 [infractions], but I suppose a little elbow grease is all that's needed here.

[infractions]: https://gist.github.com/tbg/7c1af445007650387841359de6f8fbbc

Release note: None

…nd `go`

This helps move the needle on cockroachdb#58164 by introducing linters that both force the use of a `ctxgroup` over an `errgroup` and prevent direct use of the `go` keyword. They are disabled by default because we need to fix the issues they find first. They can be enabled (for development) in `forbiddenmethod.Analyzers`. They can then also be invoked in a targeted fashion:

```
go vet -vettool ./bin/roachvet -errgroupgo ./pkg/somewhere
go vet -vettool ./bin/roachvet -nakedgo ./pkg/somewhere
```

Release note: None
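To give a feel for what a check against naked `go` statements involves, here is a minimal stand-alone sketch built on the standard `go/parser` and `go/ast` packages. It is not the `forbiddenmethod` implementation; the function `findGoStmts` and the sample source are invented for this example, and a real lint would also need an allowlist for the packages that implement the wrappers.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// findGoStmts parses src and returns the line number of every go statement
// it contains, in source order. (Hypothetical helper for illustration.)
func findGoStmts(filename, src string) ([]int, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, 0)
	if err != nil {
		return nil, err
	}
	var lines []int
	ast.Inspect(file, func(n ast.Node) bool {
		if g, ok := n.(*ast.GoStmt); ok {
			lines = append(lines, fset.Position(g.Go).Line)
		}
		return true // keep walking; go statements can be nested
	})
	return lines, nil
}

// sample is a made-up input file containing two go statements.
const sample = `package p

func f() {
	go func() {}()
	work := func() {}
	go work()
}
`

func main() {
	lines, err := findGoStmts("sample.go", sample)
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		fmt.Printf("sample.go:%d: naked go statement\n", l)
	}
}
```

Because it matches the syntax node, this catches `go work()` as well as `go func`, which a text search such as the `rg` pattern earlier in the thread would miss.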

@taroface

62243: forbiddenmethod: add (disabled) lint against `(*errgroup.Group).Go` and `go` r=knz,erikgrinaker,jordanlewis a=tbg

62629: docs, server: update auto-generated logging docs r=knz a=taroface

The auto-generated docs at https://github.com/cockroachdb/cockroach/tree/master/docs/generated will be used as includes on the public docs site. This PR makes copyedits and some corrections to the text.

- In the log format docs, I added formatting that I think will add code ticks to field names. In some cases (e.g., ‹...›) I may not have done this correctly.
- The structured event description for `node_decommissioning` is not appearing correctly in eventlog.md. The generated sentence for this event does not match the corresponding line in the other event descriptions. I fixed a typo where this was written as "NodeDecommissioned" but I'm not sure if that fixes the issue.
- In a few cases I pasted lines that were much longer than what was originally in the commented-out blocks. I hope this doesn't break the generation!

Release note: none

63841: pgwire,logpb,eventpb: various structured logging doc updates r=taroface a=knz

First commit from #62629. Fixes #63764. Fixes #63762. See individual commits for details. cc @taroface

64706: pkg/cli: new flag `--log-config-file` to simplify log configuration r=rauchenstein a=knz

Fixes #64349. cc @thtruo

Release note (cli change): The new parameter `--log-config-file` simplifies the process of loading the logging configuration from a YAML file. Instead of passing the content of the file via e.g. `--log=$(cat file.yaml)`, it is now possible to pass the path to the file `--log-config-file=file.yaml`. Note: each occurrence of `--log` and `--log-config-file` on the command line overrides the configuration set from previous occurrences.

64776: workload: mark tpcc idle-conns flag as runtime-only r=dt a=dt

Closes #64678. Release note: none.

64777: ccl/backupccl: skip TestBackupWorkerFailure r=stevendanna a=adityamaru

Refs: #64773

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

Co-authored-by: Tobias Grieger
Co-authored-by: taroface
Co-authored-by: Raphael 'kena' Poss
Co-authored-by: David Taylor
Co-authored-by: Aditya Maru

72492: server: fix span use after Finish() r=andreimatei a=andreimatei

The closure serving a pgwire connection was capturing a context with a long-lived span, and so all connections were logging to that span. That was bad, but what was even worse is that the long-lived span can end before the closures, on server shutdown. That was use-after-Finish(), which is currently reluctantly tolerated, but won't be tolerated much longer. This patch fixes it by giving each connection its own span.

Touches #58164

Release note: None

Co-authored-by: Andrei Matei

The backup processor was spawning a naked goroutine. We don't like that very much - see cockroachdb#58164. This patch puts that goroutine under the Stopper. One benefit is that the goroutine gets its own span, so it's resilient to the parent span being Finish()ed from under it (which was a bug until the prior commit). Release note: None

The splitAndScatter processor was spawning a naked goroutine. We don't like that very much - see cockroachdb#58164. This patch puts that goroutine under the Stopper. One benefit is that the goroutine gets its own span, so it's resilient to the parent span being Finish()ed from under it (which was a bug until the prior commit). Release note: None

The backup processor was spawning a naked goroutine. We don't like that very much - see cockroachdb#58164. This patch puts that goroutine under the Stopper. One benefit is that the goroutine gets its own span, so it's resilient to the parent span being Finish()ed from under it (which was a bug until the prior commit). Release note: None Release justification (bug fix): goroutines spawned by the backup processor were mutating BoundAccounts used for memory monitoring, after the BoundAccount was closed by the processor on shutdown. To prevent this, we teach the processor to wait for all goroutines to terminate before running its cleanup on shutdown.

ajwerner · 2023-03-13T17:05:53Z

#98269 is another example of code where a panic shouldn't bring down the node. Distsql flows should recover panics.

jordanlewis added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Dec 22, 2020

knz added the S-3-productivity Severe issues that impede the productivity of CockroachDB developers. label Jan 18, 2021

tbg mentioned this issue Feb 1, 2021

stop: rip out expensive task tracking #59647

Merged

tbg mentioned this issue Feb 22, 2021

*: conclusively attribute CPU usage to SQL queries and sessions #60508

Closed

tbg mentioned this issue Mar 19, 2021

forbiddenmethod: add (disabled) lint against (*errgroup.Group).Go and go #62243

Merged

jlinder added T-sql-queries SQL Queries Team T-server-and-security DB Server & Security labels Jun 16, 2021

This was referenced Feb 17, 2022

schemachanger,backupccl: support for backup and restore mid-change #76715

Merged

ctxgroup,*: catch panics when using workers in jobs #76734

Open

irfansharif mentioned this issue Feb 25, 2022

tenantrate: use measured on-cpu time for rate limiting #77041

Open

jeffswenson mentioned this issue Mar 11, 2022

SQL Proxy Misuses Stopper's Context #77689

Open

irfansharif mentioned this issue Jun 30, 2022

span stats collector powering the key visualizer #83131

Closed

knz mentioned this issue Nov 11, 2022

log: panics during server startup don't make it to the console #91700

Closed

ajwerner mentioned this issue Mar 13, 2023

Ability to crash crdb on DDL using cluster_logical_timestamp() #98269

Closed

mgartner added this to SQL Queries Jul 24, 2023

mgartner moved this to 23.2 Release in SQL Queries Jul 24, 2023

mgartner added the quality-friday A good issue to work on on Quality Friday label Aug 3, 2023

mgartner moved this from 23.2 Release to Bugs to Fix in SQL Queries Aug 3, 2023

renatolabs mentioned this issue Jan 23, 2024

roachtest: split Monitor functionality in two components #118214

Open

renatolabs mentioned this issue Sep 17, 2024

roachtest: remove all panics from perturbation/* tests #130866

Merged
