
cli: make --global flag for demo public #62435

Merged (4 commits) on Apr 27, 2021

Conversation

@otan (Contributor) commented Mar 23, 2021

See individual commits for details.

Refs: #62025

@cockroach-teamcity (Member)

This change is Reviewable

@otan otan marked this pull request as ready for review March 23, 2021 11:44
@otan otan requested a review from a team as a code owner March 23, 2021 11:44
@otan otan force-pushed the demo_test branch 4 times, most recently from b3a9fb5 to 95a1ae4 Compare March 23, 2021 23:02
@otan (Contributor, Author) commented Mar 24, 2021

not sure how to debug that test failure, can't repro it locally or on roachprod :\

@otan (Contributor, Author) commented Mar 24, 2021

ah got one! @knz would you know what this is about:

=== CONT  TestTransientClusterSimulateLatencies
    demo_test.go:217: Leaked goroutine: goroutine 6942 [chan receive]:
        net/http.(*persistConn).addTLS(0xc004752a20, 0xc00472d8e0, 0x18, 0x0, 0xc00472d8f9, 0x3)
        	/usr/local/go/src/net/http/transport.go:1515 +0x1a5
        net/http.(*Transport).dialConn(0xc000cec000, 0x53ffbe0, 0xc00453b200, 0x0, 0xc00128b0e0, 0x5, 0xc00472d8e0, 0x1c, 0x0, 0xc004752a20, ...)
        	/usr/local/go/src/net/http/transport.go:1585 +0x1d25
        net/http.(*Transport).dialConnFor(0xc000cec000, 0xc000e75d90)
        	/usr/local/go/src/net/http/transport.go:1421 +0xc6
        created by net/http.(*Transport).queueForDial
        	/usr/local/go/src/net/http/transport.go:1390 +0x40f
        Leaked goroutine: goroutine 7106 [IO wait]:
        internal/poll.runtime_pollWait(0x7f2cf9c6cc28, 0x72, 0x5372300)
        	/usr/local/go/src/runtime/netpoll.go:222 +0x55
        internal/poll.(*pollDesc).wait(0xc009b6a098, 0x72, 0x5372300, 0x7488f98, 0x0)
        	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
        internal/poll.(*pollDesc).waitRead(...)
        	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
        internal/poll.(*FD).Read(0xc009b6a080, 0xc00178d680, 0x205, 0x205, 0x0, 0x0, 0x0)
        	/usr/local/go/src/internal/poll/fd_unix.go:159 +0x1a5
        net.(*netFD).Read(0xc009b6a080, 0xc00178d680, 0x205, 0x205, 0x203002, 0x58, 0xc000a00000)
        	/usr/local/go/src/net/fd_posix.go:55 +0x4f
        net.(*conn).Read(0xc00ca528a8, 0xc00178d680, 0x205, 0x205, 0x0, 0x0, 0x0)
        	/usr/local/go/src/net/net.go:182 +0x8e
        crypto/tls.(*atLeastReader).Read(0xc009426b80, 0xc00178d680, 0x205, 0x205, 0xc00178d680, 0x0, 0xc001574688)
        	/usr/local/go/src/crypto/tls/conn.go:779 +0x62
        bytes.(*Buffer).ReadFrom(0xc006e7b080, 0x53660c0, 0xc009426b80, 0x43c1a5, 0x40ba480, 0x45dde60)
        	/usr/local/go/src/bytes/buffer.go:204 +0xb1
        crypto/tls.(*Conn).readFromUntil(0xc006e7ae00, 0x536e880, 0xc00ca528a8, 0x5, 0xc00ca528a8, 0x100000451d360)
        	/usr/local/go/src/crypto/tls/conn.go:801 +0xf3
        crypto/tls.(*Conn).readRecordOrCCS(0xc006e7ae00, 0xc003a10000, 0x114, 0x120)
        	/usr/local/go/src/crypto/tls/conn.go:608 +0x115
        crypto/tls.(*Conn).readRecord(...)
        	/usr/local/go/src/crypto/tls/conn.go:576
        crypto/tls.(*Conn).readHandshake(0xc006e7ae00, 0xc008a7f216, 0xc008a7f200, 0x10f, 0x180)
        	/usr/local/go/src/crypto/tls/conn.go:992 +0x6d
        crypto/tls.(*Conn).clientHandshake(0xc006e7ae00, 0x0, 0x0)
        	/usr/local/go/src/crypto/tls/handshake_client.go:170 +0x2a7
        crypto/tls.(*Conn).Handshake(0xc006e7ae00, 0x0, 0x0)
        	/usr/local/go/src/crypto/tls/conn.go:1362 +0xc9
        net/http.(*persistConn).addTLS.func2(0x0, 0xc006e7ae00, 0x0, 0xc000a791a0)
        	/usr/local/go/src/net/http/transport.go:1509 +0x45
        created by net/http.(*persistConn).addTLS
        	/usr/local/go/src/net/http/transport.go:1505 +0x177

i'll try to look at this more tomorrow, but i'm curious to get pointers as to where these goroutines get created

@otan otan force-pushed the demo_test branch 2 times, most recently from 0d78d29 to d08ef61 Compare March 25, 2021 03:04
@otan (Contributor, Author) commented Mar 25, 2021

here are the goroutines during the leak:

goroutines:

goroutine 213 [running]:
github.com/cockroachdb/cockroach/pkg/util/leaktest.AfterTest.func1()
	/go/src/github.com/cockroachdb/cockroach/pkg/util/leaktest/leaktest.go:141 +0x1e9
github.com/cockroachdb/cockroach/pkg/cli.TestTransientClusterSimulateLatencies(0xc000701e00)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/demo_test.go:211 +0x662
testing.tRunner(0xc000701e00, 0x4daae18)
	/usr/local/go/src/testing/testing.go:1123 +0xef
created by testing.(*T).Run
	/usr/local/go/src/testing/testing.go:1168 +0x2b3

goroutine 1 [chan receive]:
testing.(*T).Run(0xc000701e00, 0x474bbc3, 0x25, 0x4daae18, 0x4c9e01)
	/usr/local/go/src/testing/testing.go:1169 +0x2da
testing.runTests.func1(0xc000701c80)
	/usr/local/go/src/testing/testing.go:1439 +0x78
testing.tRunner(0xc000701c80, 0xc001827d58)
	/usr/local/go/src/testing/testing.go:1123 +0xef
testing.runTests(0xc001292e40, 0x74854e0, 0x43, 0x43, 0xc00f1f2f5679042b, 0xe1697bfcc, 0x77afbc0, 0x203000)
	/usr/local/go/src/testing/testing.go:1437 +0x2fe
testing.(*M).Run(0xc001182180, 0x0)
	/usr/local/go/src/testing/testing.go:1345 +0x1eb
github.com/cockroachdb/cockroach/pkg/cli_test.TestMain(0xc001182180)
	/go/src/github.com/cockroachdb/cockroach/pkg/cli/main_test.go:34 +0x8e
main.main()
	_testmain.go:225 +0x165

goroutine 8 [chan receive]:
github.com/cockroachdb/cockroach/pkg/util/log.flushDaemon()
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:75 +0x73
created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:41 +0x35

goroutine 9 [chan receive]:
github.com/cockroachdb/cockroach/pkg/util/log.signalFlusher()
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:98 +0x12c
created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:42 +0x4d

goroutine 11 [syscall]:
os/signal.signal_recv(0x0)
	/usr/local/go/src/runtime/sigqueue.go:147 +0x9d
os/signal.loop()
	/usr/local/go/src/os/signal/signal_unix.go:23 +0x25
created by os/signal.Notify.func1.1
	/usr/local/go/src/os/signal/signal.go:150 +0x45

goroutine 21 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc000b823c0)
	/go/src/github.com/cockroachdb/cockroach/vendor/go.opencensus.io/stats/view/worker.go:154 +0x105
created by go.opencensus.io/stats/view.init.0
	/go/src/github.com/cockroachdb/cockroach/vendor/go.opencensus.io/stats/view/worker.go:32 +0x57

goroutine 9877 [IO wait]:
internal/poll.runtime_pollWait(0x7fab07b72c18, 0x72, 0x537e520)
	/usr/local/go/src/runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc00b0a5f98, 0x72, 0x537e500, 0x7499f98, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc00b0a5f80, 0xc007db0480, 0x205, 0x205, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:159 +0x1a5
net.(*netFD).Read(0xc00b0a5f80, 0xc007db0480, 0x205, 0x205, 0x203003, 0x58, 0xc00078a000)
	/usr/local/go/src/net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc00b624498, 0xc007db0480, 0x205, 0x205, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:182 +0x8e
crypto/tls.(*atLeastReader).Read(0xc00d45c120, 0xc007db0480, 0x205, 0x205, 0xc007db0480, 0x0, 0xc00e6b4688)
	/usr/local/go/src/crypto/tls/conn.go:779 +0x62
bytes.(*Buffer).ReadFrom(0xc00d443780, 0x53722c0, 0xc00d45c120, 0x43c1a5, 0x40c3860, 0x45e7a20)
	/usr/local/go/src/bytes/buffer.go:204 +0xb1
crypto/tls.(*Conn).readFromUntil(0xc00d443500, 0x537aaa0, 0xc00b624498, 0x5, 0xc00b624498, 0x1000004526ec0)
	/usr/local/go/src/crypto/tls/conn.go:801 +0xf3
crypto/tls.(*Conn).readRecordOrCCS(0xc00d443500, 0xc0075ab800, 0x114, 0x120)
	/usr/local/go/src/crypto/tls/conn.go:608 +0x115
crypto/tls.(*Conn).readRecord(...)
	/usr/local/go/src/crypto/tls/conn.go:576
crypto/tls.(*Conn).readHandshake(0xc00d443500, 0xc000e9d016, 0xc000e9d080, 0x10f, 0x180)
	/usr/local/go/src/crypto/tls/conn.go:992 +0x6d
crypto/tls.(*Conn).clientHandshake(0xc00d443500, 0x0, 0x0)
	/usr/local/go/src/crypto/tls/handshake_client.go:170 +0x2a7
crypto/tls.(*Conn).Handshake(0xc00d443500, 0x0, 0x0)
	/usr/local/go/src/crypto/tls/conn.go:1362 +0xc9
net/http.(*persistConn).addTLS.func2(0x0, 0xc00d443500, 0x0, 0xc007000e40)
	/usr/local/go/src/net/http/transport.go:1509 +0x45
created by net/http.(*persistConn).addTLS
	/usr/local/go/src/net/http/transport.go:1505 +0x177

goroutine 9927 [chan receive]:
net/http.(*persistConn).addTLS(0xc00d2c3b00, 0xc00d2cf280, 0x18, 0x0, 0xc00d2cf299, 0x3)
	/usr/local/go/src/net/http/transport.go:1515 +0x1a5
net/http.(*Transport).dialConn(0xc000d7c640, 0x540c040, 0xc000923200, 0x0, 0xc008f7f200, 0x5, 0xc00d2cf280, 0x1c, 0x0, 0xc00d2c3b00, ...)
	/usr/local/go/src/net/http/transport.go:1585 +0x1d25
net/http.(*Transport).dialConnFor(0xc000d7c640, 0xc002957340)
	/usr/local/go/src/net/http/transport.go:1421 +0xc6
created by net/http.(*Transport).queueForDial
	/usr/local/go/src/net/http/transport.go:1390 +0x40f
=== CONT  TestTransientClusterSimulateLatencies
    demo_test.go:211: Leaked goroutine: goroutine 9877 [IO wait]:
        internal/poll.runtime_pollWait(0x7fab07b72c18, 0x72, 0x537e520)
        	/usr/local/go/src/runtime/netpoll.go:222 +0x55
        internal/poll.(*pollDesc).wait(0xc00b0a5f98, 0x72, 0x537e500, 0x7499f98, 0x0)
        	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
        internal/poll.(*pollDesc).waitRead(...)
        	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
        internal/poll.(*FD).Read(0xc00b0a5f80, 0xc007db0480, 0x205, 0x205, 0x0, 0x0, 0x0)
        	/usr/local/go/src/internal/poll/fd_unix.go:159 +0x1a5
        net.(*netFD).Read(0xc00b0a5f80, 0xc007db0480, 0x205, 0x205, 0x203003, 0x58, 0xc00078a000)
        	/usr/local/go/src/net/fd_posix.go:55 +0x4f
        net.(*conn).Read(0xc00b624498, 0xc007db0480, 0x205, 0x205, 0x0, 0x0, 0x0)
        	/usr/local/go/src/net/net.go:182 +0x8e
        crypto/tls.(*atLeastReader).Read(0xc00d45c120, 0xc007db0480, 0x205, 0x205, 0xc007db0480, 0x0, 0xc00e6b4688)
        	/usr/local/go/src/crypto/tls/conn.go:779 +0x62
        bytes.(*Buffer).ReadFrom(0xc00d443780, 0x53722c0, 0xc00d45c120, 0x43c1a5, 0x40c3860, 0x45e7a20)
        	/usr/local/go/src/bytes/buffer.go:204 +0xb1
        crypto/tls.(*Conn).readFromUntil(0xc00d443500, 0x537aaa0, 0xc00b624498, 0x5, 0xc00b624498, 0x1000004526ec0)
        	/usr/local/go/src/crypto/tls/conn.go:801 +0xf3
        crypto/tls.(*Conn).readRecordOrCCS(0xc00d443500, 0xc0075ab800, 0x114, 0x120)
        	/usr/local/go/src/crypto/tls/conn.go:608 +0x115
        crypto/tls.(*Conn).readRecord(...)
        	/usr/local/go/src/crypto/tls/conn.go:576
        crypto/tls.(*Conn).readHandshake(0xc00d443500, 0xc000e9d016, 0xc000e9d080, 0x10f, 0x180)
        	/usr/local/go/src/crypto/tls/conn.go:992 +0x6d
        crypto/tls.(*Conn).clientHandshake(0xc00d443500, 0x0, 0x0)
        	/usr/local/go/src/crypto/tls/handshake_client.go:170 +0x2a7
        crypto/tls.(*Conn).Handshake(0xc00d443500, 0x0, 0x0)
        	/usr/local/go/src/crypto/tls/conn.go:1362 +0xc9
        net/http.(*persistConn).addTLS.func2(0x0, 0xc00d443500, 0x0, 0xc007000e40)
        	/usr/local/go/src/net/http/transport.go:1509 +0x45
        created by net/http.(*persistConn).addTLS
        	/usr/local/go/src/net/http/transport.go:1505 +0x177
        Leaked goroutine: goroutine 9927 [chan receive]:
        net/http.(*persistConn).addTLS(0xc00d2c3b00, 0xc00d2cf280, 0x18, 0x0, 0xc00d2cf299, 0x3)
        	/usr/local/go/src/net/http/transport.go:1515 +0x1a5
        net/http.(*Transport).dialConn(0xc000d7c640, 0x540c040, 0xc000923200, 0x0, 0xc008f7f200, 0x5, 0xc00d2cf280, 0x1c, 0x0, 0xc00d2c3b00, ...)
        	/usr/local/go/src/net/http/transport.go:1585 +0x1d25
        net/http.(*Transport).dialConnFor(0xc000d7c640, 0xc002957340)
        	/usr/local/go/src/net/http/transport.go:1421 +0xc6
        created by net/http.(*Transport).queueForDial
        	/usr/local/go/src/net/http/transport.go:1390 +0x40f
--- FAIL: TestTransientClusterSimulateLatencies (13.55s)
    --- PASS: TestTransientClusterSimulateLatencies/from_us-east1 (0.59s)
    --- PASS: TestTransientClusterSimulateLatencies/from_us-west1 (0.41s)
    --- PASS: TestTransientClusterSimulateLatencies/from_europe-west1 (0.53s)
FAIL

is it...


goroutine 21 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc000b823c0)
	/go/src/github.com/cockroachdb/cockroach/vendor/go.opencensus.io/stats/view/worker.go:154 +0x105
created by go.opencensus.io/stats/view.init.0
	/go/src/github.com/cockroachdb/cockroach/vendor/go.opencensus.io/stats/view/worker.go:32 +0x57

?

@otan (Contributor, Author) commented Mar 25, 2021

I wonder if this is because the leak detector only allows 5s for leaks to resolve, and with the RPCs being artificially slowed down, that's not enough time. in which case, we should ignore the leak detector.

do you reckon this is feasible or nah?

i'm still curious why this only reproduces under make roachprod-stress but not locally for me....

@otan (Contributor, Author) commented Mar 25, 2021

changing the leaktest timeout to 60s seems to unblock some things, but introduces a panic during the stop phase: https://gist.github.com/otan/12e1e6d7aab9236e3590660ef570aee8

@otan (Contributor, Author) commented Mar 25, 2021

Yeah spent too long on this. Maybe one day...

@otan otan marked this pull request as draft March 25, 2021 06:42
@otan otan changed the title cli: make --global flag for demo public [WIP] cli: make --global flag for demo public Mar 25, 2021
@knz (Contributor) left a comment

Reviewed 1 of 1 files at r1, 2 of 2 files at r2, 3 of 3 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 2 of 4 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @otan)


pkg/cli/demo_cluster.go, line 129 at r1 (raw file):

	latencyMapWaitCh := make(chan struct{})

	// serverReadyCh is used .

nit: unfinished sentence in this comment


pkg/cli/demo_cluster.go, line 190 at r1 (raw file):

		// We force a wait for all servers until they are ready.
		nodeReadyCh := make(chan struct{}, 1)

the channel handling throughout this function is super confusing (and probably confused)

I would encourage you to do the following. For each channel:

  1. document, at the point the channel is created, who the readers and writers are, and the exact conditions under which the channel is read and written

  2. where it makes sense, use close(chan) on the writer side to ensure that the number of readers doesn't matter afterwards.
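A minimal sketch of the close-to-broadcast pattern in the second suggestion (generic names, not the actual demo_cluster.go channels):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// waitAll starts n goroutines that all block on one ready channel and
// returns how many of them unblocked. Documented per the review:
// `ready` is closed exactly once by the coordinator (the only writer)
// and received from by every worker (the readers); because close is a
// broadcast, the writer does not need to know how many readers exist.
func waitAll(n int) int {
	ready := make(chan struct{})

	var proceeded int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-ready // unblocks when the channel is closed
			atomic.AddInt64(&proceeded, 1)
		}()
	}

	close(ready) // broadcast: all n readers unblock at once
	wg.Wait()
	return int(atomic.LoadInt64(&proceeded))
}

func main() {
	fmt.Println(waitAll(3))
}
```

With a single send instead of close, only one reader would ever wake, and adding a reader later would silently deadlock it.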


pkg/cli/demo_cluster.go, line 202 at r1 (raw file):

				err := serv.Start()
				nodeErrCh <- err
				errCh <- err

this channel handling is messed up. when err != nil, the write to errCh will block and this goroutine will remain running.
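The difference between a send that blocks and one absorbed by buffer capacity can be sketched generically (illustrative code, not the PR's):

```go
package main

import (
	"errors"
	"fmt"
)

// trySend reports whether a send on ch completes immediately. On an
// unbuffered channel with no waiting receiver, the send would block
// (the scenario flagged above); a buffered channel with spare
// capacity absorbs the value and lets the sender move on.
func trySend(ch chan error, err error) bool {
	select {
	case ch <- err:
		return true
	default:
		return false
	}
}

func main() {
	unbuffered := make(chan error)
	buffered := make(chan error, 1)
	fmt.Println(trySend(unbuffered, errors.New("boom"))) // no receiver: would block
	fmt.Println(trySend(buffered, errors.New("boom")))   // capacity 1: accepted
}
```

Whether the PR's goroutine actually leaks therefore hinges on the channels' capacities, which is exactly the point debated in the reply further down.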


pkg/cli/demo_cluster.go, line 407 at r3 (raw file):

func (c *transientCluster) DrainAndShutdown(nodeID roachpb.NodeID) error {
	if demoCtx.simulateLatency {
		return errors.Errorf("shutting down nodes is not supported in --global configurations")

here and throughout the code:

  1. it's super confusing to have the field in the ctx struct named differently than the flag. I may encourage you to find a way to make them match.

  2. please factor the check and the definition of the error message into a single function and call it from the multiple places that need it

  3. the proper way to construct an error message is errors.Errorf("... --%s ...", cliflags.YourFlag.Name) (or errors.Newf)
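A sketch of point 3, using the standard library's fmt.Errorf in place of the cockroachdb/errors helpers so it stays dependency-free, with a hypothetical GlobalFlag value standing in for the real cliflags entry:

```go
package main

import "fmt"

// flagInfo mirrors the shape of a cliflags entry; GlobalFlag is a
// hypothetical stand-in for the real flag definition, not the actual
// CockroachDB code.
type flagInfo struct{ Name string }

var GlobalFlag = flagInfo{Name: "global"}

// errShutdownUnsupported is the single factored-out constructor the
// review asks for: every call site produces an identical message, and
// renaming the flag updates the message automatically.
func errShutdownUnsupported() error {
	return fmt.Errorf("shutting down nodes is not supported in --%s configurations", GlobalFlag.Name)
}

func main() {
	fmt.Println(errShutdownUnsupported())
}
```

Hard-coding "--global" at each call site would instead leave stale messages behind after any rename.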


pkg/cli/demo_test.go, line 127 at r6 (raw file):

	defer log.Scope(t).Close(t)

	// Set the leak test timeout to be 60s, as due to the delay

That should not be necessary, if all the close / cleanup calls are performed properly. If you think you need this, that means that a close function is not being called, or a goroutine is left to hang for longer than necessary.

@otan (Contributor, Author) commented Mar 25, 2021

As I'm done with this for now, here are some hints for the next person.

Spent one more day working on tests for demo --global. I'm done with it - as in - I don't think I can get it in. There are two remaining issues with the bespoke test setup:

  • It panics during shutdown as it does not receive the magic bytes header. I don't know where it comes from, but I know how to reproduce it quickly (\demo shutdown X during --global). I've disabled \demo shutdown X in one of the commits here, but the error still pops up when you try to clean up the cluster.
  • Sometimes there is a leaked goroutine from net/http's addTLS. The goroutine trace doesn't show anything obvious, so I have no idea what is wrong here. I can't find the spawning goroutine (it's not obvious from the stack traces), and this code is all in library internals.

@otan (Contributor, Author) commented Mar 25, 2021


pkg/cli/demo_cluster.go, line 202 at r1 (raw file):

Previously, knz (kena) wrote…

this channel handling is messed up. when err != nil, the write to errCh will block and this goroutine will remain running.

both channels are buffered, so i don't think we ever block.

@knz (Contributor) commented Mar 26, 2021

Ok thank you for all your work and your investigative steps.
Let's pick this back up when the roadmap planning work has settled a bit.

@otan (Contributor, Author) commented Mar 29, 2021

It panics during shutdown as it does not receive the magic bytes header.

this is now resolved. just the leaking goroutine left.

@knz (Contributor) commented Apr 2, 2021

There are multiple goroutines leaked, but this is clearly the smoking gun:

    demo_test.go:213: Leaked goroutine: goroutine 1111 [select]:
        github.com/cockroachdb/cockroach/pkg/util/retry.(*Retry).Next(0xc001277d38, 0xc0035964e0)
                /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:128 +0x151
        github.com/cockroachdb/cockroach/pkg/sqlmigrations.(*Manager).EnsureMigrations(0xc0013a5980, 0x1f8f770, 0xc0035964e0, 0x200000014, 0x3000000000, 0x0, 0x0)
                /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/sqlmigrations/migrations.go:578 +0x3d6
        github.com/cockroachdb/cockroach/pkg/server.(*SQLServer).preStart(0xc004b03400, 0x1f8f770, 0xc0035964e0, 0xc002a9a600, 0x1f07280, 0xc0014bdba0, 0x0, 0x0, 0x0, 0x0, ...)
                /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/server/server_sql.go:839 +0x738
        github.com/cockroachdb/cockroach/pkg/server.(*Server).PreStart(0xc00458e800, 0x1f8f700, 0xc000074188, 0x0, 0x0)
                /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1822 +0x3408
        github.com/cockroachdb/cockroach/pkg/server.(*Server).Start(0xc00458e800, 0x1f8f700, 0xc000074188, 0x0, 0x0)
                /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/server/server.go:1108 +0x45
        github.com/cockroachdb/cockroach/pkg/server.(*TestServer).Start(...)
                /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/server/testserver.go:465
        github.com/cockroachdb/cockroach/pkg/cli.(*transientCluster).start.func2(0xc0042cd680, 0xc004e30e40, 0xc0017b92c0, 0xc002d3b8c0, 0xc00004aae0)
                /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/cli/demo_cluster.go:203 +0x4f
        created by github.com/cockroachdb/cockroach/pkg/cli.(*transientCluster).start
                /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/cli/demo_cluster.go:202 +0x67d

In other words, there's this goroutine launched under if demoCtx.simulateLatencies in (*transientCluster).start() which does not properly stop even after the test completes.

I already mentioned that the pattern of interlocking channels was problematic in my previous review.

I will now investigate this further, exactly as I suggested in my previous review: by analyzing exactly where these channels get written and read, discovering which synchronization point is missing, and reporting my findings.

@jordanlewis (Member)

@knz --insecure is deprecated in demo, so I don't understand why you are insisting we should try to fix the global latencies for it, since global latencies are a feature only for demo.

@jordanlewis (Member)

Oops, I will re-create this comment on the issue you created.

@knz (Contributor) commented Apr 22, 2021

See this comment: #63033 (comment)

We can now make progress on this, thanks to #63853 being merged.

@otan (Contributor, Author) commented Apr 22, 2021

thanks for the heads up!

do we still want to fix SHOW ALL CLUSTER QUERIES in --insecure before making it public? FWIW, from experimentation it's just that one that doesn't respect latency but i'm not sure why :\

@knz (Contributor) commented Apr 23, 2021

do we still want to fix SHOW ALL CLUSTER QUERIES in --insecure before making it public?

I think we can make --global public but it still needs to be marked as experimental (at least in the docstring that shows up in --help and in docs)

otan added 3 commits April 27, 2021 06:57
We previously had no tests simulating the latency in demo using the
--global flag.

Release note: None
In addition to the release note, this rewords the add-node error (which already
errored beforehand) and adds a clause in RestartNode.

Note there is no code path that allows a restart, since shutdown is
required before a node restart.

Release note (cli change): Previously, `\demo shutdown <node_idx>` would
error if `--global` was set. This will now error gracefully as an
unsupported behavior.
Release note (cli change): The --global flag for cockroach demo is now
advertised. This flag simulates latencies in multi-node demo clusters
when the nodes are set in different regions to simulate real-life global
latencies.
@otan otan changed the title [WIP] cli: make --global flag for demo public cli: make --global flag for demo public Apr 26, 2021
@otan otan marked this pull request as ready for review April 26, 2021 20:57
@otan (Contributor, Author) commented Apr 26, 2021

ok, i've pulled your commit and added one on top @knz!

@otan otan requested a review from knz April 26, 2021 21:03
@knz (Contributor) left a comment

Ok so this is now on the finish line.

Let's discuss some specifics:

  • I'm not too fond of the last commit. Even if we accept the command-line flag's name for the sake of PR and marketing, we still want our code to be descriptive. A boolean field should indicate what the boolean is about. I don't believe that the word "global" is descriptive of anything. In comparison, "simulate latencies" is better. I'd say "simulateGeoLatencies" would even be better.

  • I'm curious if there is a way to announce the fact the cluster is not "local" when the session starts. Maybe via the prompt? or a banner when the shell starts?

Reviewed 15 of 15 files at r16.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @otan)

@otan (Contributor, Author) commented Apr 27, 2021

i'd be happy to rename it if we also changed the flag as appropriate, but that would ideally involve some PM input. (the original complaint i was addressing was that the flag name and variable name did not match)
i can also just take out the last commit for now and we can table that for another time.

@knz (Contributor) commented Apr 27, 2021

the original complaint i was addressing was the flag name and variable name did not match

I may have missed that - where was this complaint voiced? (my apologies if I was the person who voiced it, I likely have forgotten)

We have this situation in a couple of other places, and I agree we probably want to do something about it, but maybe we can do this in a principled manner all at once.

i can also just take out the last commit for now and we can table that for another time.

if there's an issue filed for it, yes that would make sense.

Release note (cli change): There will now be a message upon start-up of
cockroach demo --global indicating that latencies between nodes will simulate
real-world latencies.
@otan (Contributor, Author) commented Apr 27, 2021

I may have missed that - where was this complaint voiced? (my apologies if I was the person who voiced it, I likely have forgotten)

would be from #62435 (review)

here and throughout the code:

it's super confusing to have the field in the ctx struct named differently than the flag. I may encourage you to find a way to make them match.

if there's an issue filed for it, yes that would make sense.

alrighty. filed #64269, removed the commit.

I'm curious if there is a way to announce the fact the cluster is not "local" when the session starts. Maybe via the prompt? or a banner when the shell starts?

I've added some help text.

@knz (Contributor) left a comment

Reviewed 5 of 5 files at r17.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @otan)

@otan (Contributor, Author) commented Apr 27, 2021

thanks for the guidance and all the help!

bors r=knz

@craig (bot) commented Apr 27, 2021

Build succeeded.
