Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ccl/logictestccl: TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup timed out #60773

Closed
cockroach-teamcity opened this issue Feb 19, 2021 · 22 comments · Fixed by #62954
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-robot Originated from a bot. skipped-test

Comments

@cockroach-teamcity
Copy link
Member

(ccl/logictestccl).TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup failed on master@3e4d80ddb4b913d8af5d66a364b3870e1f7a49fa:

Slow failing tests:
TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup - 773.62s

Slow passing tests:
TestLogic - 711.40s
TestRestoreMidSchemaChange - 105.72s
TestImportIntoCSV - 87.04s
TestBTreeDeleteInsertCloneOnce - 68.79s
TestImportCSVStmt - 68.45s
TestAlterTableLocalityRegionalByRowError - 66.53s
TestExecBuild - 62.52s
TestRemoveDeadReplicas - 60.02s
TestBTreeDeleteInsertCloneEachTime - 58.75s
Example_demo - 52.98s
TestFullClusterBackup - 52.65s
TestImportData - 46.23s
TestBTree - 38.73s
TestImportDefault - 38.33s
TestAllRegisteredSetup - 37.16s
TestBTree - 36.52s
TestMergeQueueSeesNonVoters - 35.28s
TestBTreeDeleteInsertCloneEachTime - 34.62s
TestScatterRandomizeLeases - 32.49s
TestStreamIngestionJobWithRandomClient - 31.27s

More

Parameters:

  • GOFLAGS=-json
make stressrace TESTS=TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup PKG=./pkg/ccl/logictestccl TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Feb 19, 2021
@cockroach-teamcity
Copy link
Member Author

(ccl/logictestccl).TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup failed on master@83e70ce84b740e27e721c3b73c38a4b8b515094a:

Slow failing tests:
TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup - 0.00s

Slow passing tests:
<none>

More

Parameters:

  • TAGS=
  • GOFLAGS=-parallel=4
make stressrace TESTS=TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup PKG=./pkg/ccl/logictestccl TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

See this test on roachdash
powered by pkg/cmd/internal/issues

@otan
Copy link
Contributor

otan commented Feb 19, 2021

may be a storage/kv/logging/server shutdown issue.

[06:55:05] :	 [Step 2/2] panic: test timed out after 40m0s
[06:55:05] :	 [Step 2/2] 
[06:55:05] :	 [Step 2/2] goroutine 3435966 [running]:
[06:55:05] :	 [Step 2/2] testing.(*M).startAlarm.func1()
[06:55:05] :	 [Step 2/2] 	/usr/local/go/src/testing/testing.go:1618 +0xe5
[06:55:05] :	 [Step 2/2] created by time.goFunc
[06:55:05] :	 [Step 2/2] 	/usr/local/go/src/time/sleep.go:167 +0x45
[06:55:05] :	 [Step 2/2] 
[06:55:05] :	 [Step 2/2] goroutine 1 [chan receive, 40 minutes]:
[06:55:05] :	 [Step 2/2] testing.(*T).Run(0xc000582c00, 0x4b0fe09, 0xc, 0x51e39c0, 0x4df086)
[06:55:05] :	 [Step 2/2] 	/usr/local/go/src/testing/testing.go:1169 +0x2da
[06:55:05] :	 [Step 2/2] testing.runTests.func1(0xc000582a80)
[06:55:05] :	 [Step 2/2] 	/usr/local/go/src/testing/testing.go:1439 +0x78
[06:55:05] :	 [Step 2/2] testing.tRunner(0xc000582a80, 0xc00169fd60)
[06:55:05] :	 [Step 2/2] 	/usr/local/go/src/testing/testing.go:1123 +0xef
[06:55:05] :	 [Step 2/2] testing.runTests(0xc000f3b260, 0x78b96a0, 0x3, 0x3, 0xc003f6123167e221, 0x22ed00276c2, 0x7c49300, 0xc00104d470)
[06:55:05] :	 [Step 2/2] 	/usr/local/go/src/testing/testing.go:1437 +0x2fe
[06:55:05] :	 [Step 2/2] testing.(*M).Run(0xc0005f3800, 0x0)
[06:55:05] :	 [Step 2/2] 	/usr/local/go/src/testing/testing.go:1345 +0x1eb
[06:55:05] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/ccl/logictestccl.TestMain(0xc0005f3800)
[06:55:05] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/logictestccl/main_test.go:32 +0xde
[06:55:05] :	 [Step 2/2] main.main()
[06:55:05] :	 [Step 2/2] 	_testmain.go:49 +0x165
[06:55:05] :	 [Step 2/2] 
[06:55:05] :	 [Step 2/2] goroutine 6 [chan receive]:
[06:55:05] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/util/log.flushDaemon()
[06:55:05] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:75 +0x73
[06:55:05] :	 [Step 2/2] created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
[06:55:05] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:41 +0x35
[06:55:05] :	 [Step 2/2] 
[06:55:05] :	 [Step 2/2] goroutine 7 [chan receive, 40 minutes]:
[06:55:05] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/util/log.signalFlusher()
[06:55:05] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:98 +0x12c
[06:55:05] :	 [Step 2/2] created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
[06:55:05] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:42 +0x4d
[06:55:05] :	 [Step 2/2] 
[06:55:05] :	 [Step 2/2] goroutine 19 [syscall, 40 minutes]:
[06:55:05] :	 [Step 2/2] os/signal.signal_recv(0x0)
[06:55:05] :	 [Step 2/2] 	/usr/local/go/src/runtime/sigqueue.go:147 +0x9d
[06:55:05] :	 [Step 2/2] os/signal.loop()
[06:55:05] :	 [Step 2/2] 	/usr/local/go/src/os/signal/signal_unix.go:23 +0x25
[06:55:05] :	 [Step 2/2] created by os/signal.Notify.func1.1
[06:55:05] :	 [Step 2/2] 	/usr/local/go/src/os/signal/signal.go:150 +0x45
[06:55:05] :	 [Step 2/2] 
[06:55:05] :	 [Step 2/2] goroutine 35 [select]:
[06:55:05] :	 [Step 2/2] go.opencensus.io/stats/view.(*worker).start(0xc00053b0c0)
[06:55:05] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/vendor/go.opencensus.io/stats/view/worker.go:152 +0xdd
[06:55:05] :	 [Step 2/2] created by go.opencensus.io/stats/view.init.0
[06:55:05] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/vendor/go.opencensus.io/stats/view/worker.go:29 +0x57
[06:55:05] :	 [Step 2/2] 
[06:55:05] :	 [Step 2/2] goroutine 15 [chan receive, 39 minutes]:
[06:55:05] :	 [Step 2/2] testing.(*T).Run(0xc0145c4600, 0x4b61379, 0x1e, 0xc012740540, 0x1)
[06:55:05] :	 [Step 2/2] 	/usr/local/go/src/testing/testing.go:1169 +0x2da
[06:55:05] :	 [Step 2/2] 

...

@ajstorm
Copy link
Collaborator

ajstorm commented Feb 19, 2021

@otan As a complete noob at debugging these types of issues, what about the above snippet indicates kv/storage?

@otan
Copy link
Contributor

otan commented Feb 19, 2021

my guess is


[06:55:05] :	 [Step 2/2] goroutine 6 [chan receive]:
[06:55:05] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/util/log.flushDaemon()
[06:55:05] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:75 +0x73
[06:55:05] :	 [Step 2/2] created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
[06:55:05] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:41 +0x35
[06:55:05] :	 [Step 2/2] 
[06:55:05] :	 [Step 2/2] goroutine 7 [chan receive, 40 minutes]:
[06:55:05] :	 [Step 2/2] github.com/cockroachdb/cockroach/pkg/util/log.signalFlusher()
[06:55:05] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:98 +0x12c
[06:55:05] :	 [Step 2/2] created by github.com/cockroachdb/cockroach/pkg/util/log.init.5
[06:55:05] :	 [Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log_flush.go:42 +0x4d

relates to logs not cleanly exiting, but also a lot of kv / storage related goroutines further down. but hard to read, and it's saturday ;)

@ajstorm
Copy link
Collaborator

ajstorm commented Feb 19, 2021

Thanks for the help on a Saturday. Will ping #engineering on Monday if I can't figure this out over the weekend.

@thoszhang
Copy link
Contributor

For what it's worth, I spent a bit of time trying to look into this, and it seems like the test has gotten stuck here multiple times when I stressed it:

RESTORE DATABASE "mr-backup-1", "mr-backup-2" FROM 'nodelocal://1/mr-backup-combined/'

In the logs I see the job (or at least what I think is that job) start, and then just stop emitting logs. No idea what is getting stuck specifically.

@thoszhang
Copy link
Contributor

I should clarify that this was on #60804 (specifically 0c5bed9, https://teamcity.cockroachdb.com/viewLog.html?buildId=2695724&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Test), which does make changes to jobs (but I wouldn't expect this RESTORE statement to be affected, since there aren't any tables/types undergoing schema changes involved). So it's possible I was seeing something different, but it looks like the same thing.

@cockroach-teamcity
Copy link
Member Author

(ccl/logictestccl).TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup failed on master@8b6f3c84cc256debeeb4d4055c5f0d5c9a481213:

Slow failing tests:
TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup - 762.75s

Slow passing tests:
TestLogic - 739.04s
TestRestoreMidSchemaChange - 92.34s
TestImportIntoCSV - 81.75s
TestBTreeDeleteInsertCloneOnce - 70.41s
TestRaceWithBackfill - 67.71s
TestBTreeDeleteInsertCloneEachTime - 63.76s
Example_demo - 63.13s
TestImportCSVStmt - 61.90s
TestExecBuild - 61.00s
TestBTreeDeleteInsertCloneEachTime - 60.08s
TestAlterTableLocalityRegionalByRowError - 57.98s
TestRingBuffer - 53.15s
TestRemoveDeadReplicas - 51.43s
TestBTree - 40.78s
TestImportData - 39.46s
TestBTree - 39.35s
TestScatterRandomizeLeases - 39.27s
TestFullClusterBackup - 39.04s
TestImportDefault - 38.05s
TestAllRegisteredSetup - 33.34s

More

Parameters:

  • GOFLAGS=-json
make stressrace TESTS=TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup PKG=./pkg/ccl/logictestccl TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(ccl/logictestccl).TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup failed on master@8b6f3c84cc256debeeb4d4055c5f0d5c9a481213:

Slow failing tests:
TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup - 0.00s

Slow passing tests:
<none>

More

Parameters:

  • TAGS=
  • GOFLAGS=-parallel=4
make stressrace TESTS=TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup PKG=./pkg/ccl/logictestccl TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

See this test on roachdash
powered by pkg/cmd/internal/issues

@cockroach-teamcity
Copy link
Member Author

(ccl/logictestccl).TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup failed on master@64c4aef909f4382523cd9248341ca9f4448d841a:

Slow failing tests:
TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup - 0.00s

Slow passing tests:
<none>

More

Parameters:

  • TAGS=
  • GOFLAGS=-parallel=4
make stressrace TESTS=TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup PKG=./pkg/ccl/logictestccl TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

See this test on roachdash
powered by pkg/cmd/internal/issues

@otan
Copy link
Contributor

otan commented Feb 21, 2021

brief curiousity glance

seems stuck at executing a statement:

[06:55:14][Step 2/2] goroutine 87174 [IO wait, 39 minutes]:
[06:55:14][Step 2/2] internal/poll.runtime_pollWait(0x7f7a8a764fd8, 0x72, 0x58b06c0)
[06:55:14][Step 2/2] 	/usr/local/go/src/runtime/netpoll.go:222 +0x55
[06:55:14][Step 2/2] internal/poll.(*pollDesc).wait(0xc009298418, 0x72, 0x58b0600, 0x7946d10, 0x0)
[06:55:14][Step 2/2] 	/usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
[06:55:14][Step 2/2] internal/poll.(*pollDesc).waitRead(...)
[06:55:14][Step 2/2] 	/usr/local/go/src/internal/poll/fd_poll_runtime.go:92
[06:55:14][Step 2/2] internal/poll.(*FD).Read(0xc009298400, 0xc01ce35300, 0x18bf, 0x18bf, 0x0, 0x0, 0x0)
[06:55:14][Step 2/2] 	/usr/local/go/src/internal/poll/fd_unix.go:159 +0x1a5
[06:55:14][Step 2/2] net.(*netFD).Read(0xc009298400, 0xc01ce35300, 0x18bf, 0x18bf, 0x203009, 0xc025bf3e30, 0xc0118b9e58)
[06:55:14][Step 2/2] 	/usr/local/go/src/net/fd_posix.go:55 +0x4f
[06:55:14][Step 2/2] net.(*conn).Read(0xc014daee78, 0xc01ce35300, 0x18bf, 0x18bf, 0x0, 0x0, 0x0)
[06:55:14][Step 2/2] 	/usr/local/go/src/net/net.go:182 +0x8e
[06:55:14][Step 2/2] crypto/tls.(*atLeastReader).Read(0xc025c8ca60, 0xc01ce35300, 0x18bf, 0x18bf, 0xc0118b9e98, 0x44ff5f, 0xc0118b9e98)
[06:55:14][Step 2/2] 	/usr/local/go/src/crypto/tls/conn.go:779 +0x62
[06:55:14][Step 2/2] bytes.(*Buffer).ReadFrom(0xc022ff3080, 0x58a3900, 0xc025c8ca60, 0x451265, 0x44bd960, 0x4a30720)
[06:55:14][Step 2/2] 	/usr/local/go/src/bytes/buffer.go:204 +0xb1
[06:55:14][Step 2/2] crypto/tls.(*Conn).readFromUntil(0xc022ff2e00, 0x58ac860, 0xc014daee78, 0x5, 0xc014daee78, 0x58a97c0)
[06:55:14][Step 2/2] 	/usr/local/go/src/crypto/tls/conn.go:801 +0xf3
[06:55:14][Step 2/2] crypto/tls.(*Conn).readRecordOrCCS(0xc022ff2e00, 0x0, 0x0, 0x898525)
[06:55:14][Step 2/2] 	/usr/local/go/src/crypto/tls/conn.go:608 +0x115
[06:55:14][Step 2/2] crypto/tls.(*Conn).readRecord(...)
[06:55:14][Step 2/2] 	/usr/local/go/src/crypto/tls/conn.go:576
[06:55:14][Step 2/2] crypto/tls.(*Conn).Read(0xc022ff2e00, 0xc023b97000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
[06:55:14][Step 2/2] 	/usr/local/go/src/crypto/tls/conn.go:1252 +0x15f
[06:55:14][Step 2/2] bufio.(*Reader).Read(0xc0234b7f80, 0xc023700de0, 0x5, 0x200, 0x0, 0x0, 0x0)
[06:55:14][Step 2/2] 	/usr/local/go/src/bufio/bufio.go:227 +0x222
[06:55:14][Step 2/2] io.ReadAtLeast(0x58a36c0, 0xc0234b7f80, 0xc023700de0, 0x5, 0x200, 0x5, 0x0, 0xc0118ba530, 0x4b6927)
[06:55:14][Step 2/2] 	/usr/local/go/src/io/io.go:314 +0x87
[06:55:14][Step 2/2] io.ReadFull(...)
[06:55:14][Step 2/2] 	/usr/local/go/src/io/io.go:333
[06:55:14][Step 2/2] github.com/lib/pq.(*conn).recvMessage(0xc023700dc0, 0xc00bc185c0, 0x46, 0x200, 0x46)
[06:55:14][Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:941 +0x105
[06:55:14][Step 2/2] github.com/lib/pq.(*conn).recv1Buf(0xc023700dc0, 0xc00bc185c0, 0x40)
[06:55:14][Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:996 +0x39
[06:55:14][Step 2/2] github.com/lib/pq.(*conn).recv1(...)
[06:55:14][Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:1023
[06:55:14][Step 2/2] github.com/lib/pq.(*conn).simpleExec(0xc023700dc0, 0xc0011ab1d0, 0x40, 0x7f7a90711a60, 0x6, 0x7f7a8a7d8888, 0x1229035, 0xc000074110, 0x0)
[06:55:14][Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:615 +0x1be
[06:55:14][Step 2/2] github.com/lib/pq.(*conn).Exec(0xc023700dc0, 0xc0011ab1d0, 0x40, 0x7cdf6f8, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
[06:55:14][Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:869 +0x28c
[06:55:14][Step 2/2] github.com/lib/pq.(*conn).ExecContext(0xc023700dc0, 0x5945f60, 0xc000074110, 0xc0011ab1d0, 0x40, 0x7cdf6f8, 0x0, 0x0, 0x0, 0x0, ...)
[06:55:14][Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn_go18.go:42 +0x1c9
[06:55:14][Step 2/2] database/sql.ctxDriverExec(0x5945f60, 0xc000074110, 0x7f7a90711a40, 0xc023700dc0, 0x0, 0x0, 0xc0011ab1d0, 0x40, 0x7cdf6f8, 0x0, ...)
[06:55:14][Step 2/2] 	/usr/local/go/src/database/sql/ctxutil.go:31 +0x227
[06:55:14][Step 2/2] database/sql.(*DB).execDC.func2()
[06:55:14][Step 2/2] 	/usr/local/go/src/database/sql/sql.go:1570 +0x1de
[06:55:14][Step 2/2] database/sql.withLock(0x58edae0, 0xc023e9afc0, 0xc0118baa40)
[06:55:14][Step 2/2] 	/usr/local/go/src/database/sql/sql.go:3284 +0x69
[06:55:14][Step 2/2] database/sql.(*DB).execDC(0xc001be0d00, 0x5945f60, 0xc000074110, 0xc023e9afc0, 0xc0118bab68, 0xc0011ab1d0, 0x40, 0x0, 0x0, 0x0, ...)
[06:55:14][Step 2/2] 	/usr/local/go/src/database/sql/sql.go:1565 +0x495
[06:55:14][Step 2/2] database/sql.(*DB).exec(0xc001be0d00, 0x5945f60, 0xc000074110, 0xc0011ab1d0, 0x40, 0x0, 0x0, 0x0, 0xc0011ab101, 0x40, ...)
[06:55:14][Step 2/2] 	/usr/local/go/src/database/sql/sql.go:1550 +0x139
[06:55:14][Step 2/2] database/sql.(*DB).ExecContext(0xc001be0d00, 0x5945f60, 0xc000074110, 0xc0011ab1d0, 0x40, 0x0, 0x0, 0x0, 0xc0011ab100, 0x39, ...)
[06:55:14][Step 2/2] 	/usr/local/go/src/database/sql/sql.go:1528 +0xe5
[06:55:14][Step 2/2] database/sql.(*DB).Exec(...)
[06:55:14][Step 2/2] 	/usr/local/go/src/database/sql/sql.go:1542
[06:55:14][Step 2/2] github.com/cockroachdb/cockroach/pkg/sql/logictest.(*logicTest).execStatement(0xc012efea00, 0xc01bfdfce0, 0x2c, 0xc0011ab1d0, 0x40, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
[06:55:14][Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic.go:2574 +0x131
[06:55:14][Step 2/2] github.com/cockroachdb/cockroach/pkg/sql/logictest.(*logicTest).processSubtest(0xc012efea00, 0x0, 0x0, 0xc0168977d0, 0x0, 0xc000ada0f0, 0x27, 0x4b813fe, 0x1e, 0x9, ...)
[06:55:14][Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic.go:2000 +0x45b8
[06:55:14][Step 2/2] github.com/cockroachdb/cockroach/pkg/sql/logictest.(*logicTest).processTestFile(0xc012efea00, 0xc000ada0f0, 0x27, 0x4b813fe, 0x1e, 0x9, 0x0, 0x0, 0x0, 0x0, ...)
[06:55:14][Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic.go:1825 +0x818
[06:55:14][Step 2/2] github.com/cockroachdb/cockroach/pkg/sql/logictest.(*logicTest).runFile(0xc012efea00, 0xc000ada0f0, 0x27, 0x4b813fe, 0x1e, 0x9, 0x0, 0x0, 0x0, 0x0, ...)
[06:55:14][Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic.go:3117 +0xe5
[06:55:14][Step 2/2] github.com/cockroachdb/cockroach/pkg/sql/logictest.RunLogicTestWithDefaultConfig.func1.1(0xc0027c8600)
[06:55:14][Step 2/2] 	/go/src/github.com/cockroachdb/cockroach/pkg/sql/logictest/logic.go:3316 +0x318
[06:55:14][Step 2/2] testing.tRunner(0xc0027c8600, 0xc012aae800)
[06:55:14][Step 2/2] 	/usr/local/go/src/testing/testing.go:1123 +0xef
[06:55:14][Step 2/2] created by testing.(*T).Run
[06:55:14][Step 2/2] 	/usr/local/go/src/testing/testing.go:1168 +0x2b3

@otan
Copy link
Contributor

otan commented Feb 21, 2021

logs have a lot of


I210221 06:55:13.539622 120200 kv/kvserver/replica_raft.go:277 ⋮ [n4,s4,r62/2:‹/{Table/73-Max}›] 370312  proposing ENTER_JOINT(r7073 l7073 v7074) ‹[(n3,s3):7074VOTER_INCOMING]›, ‹[(n7,s7):7073VOTER_DEMOTING]›: after=‹[(n7,s7):7073VOTER_DEMOTING (n4,s4):2 (n5,s5):3 (n9,s9):4NON_VOTER (n8,s8):5NON_VOTER (n3,s3):7074VOTER_INCOMING]› next=7075
I210221 06:55:13.553018 120200 kv/kvserver/replica_command.go:1798 ⋮ [n4,s4,r62/2:‹/{Table/73-Max}›] 370313  change replicas (add ‹[]› remove ‹[]›): existing descriptor r62:‹/{Table/73-Max}› [(n7,s7):7073VOTER_DEMOTING, (n4,s4):2, (n5,s5):3, (n9,s9):4NON_VOTER, (n8,s8):5NON_VOTER, (n3,s3):7074VOTER_INCOMING, next=7075, gen=28305, sticky=1613891734.810409302,0]
I210221 06:55:13.592568 120200 kv/kvserver/replica_raft.go:277 ⋮ [n4,s4,r62/2:‹/{Table/73-Max}›] 370314  proposing LEAVE_JOINT: after=‹[(n7,s7):7073LEARNER (n4,s4):2 (n5,s5):3 (n9,s9):4NON_VOTER (n8,s8):5NON_VOTER (n3,s3):7074]› next=7075
I210221 06:55:13.604509 120200 kv/kvserver/replica_command.go:1798 ⋮ [n4,s4,r62/2:‹/{Table/73-Max}›] 370315  change replicas (add ‹[]› remove ‹[(n7,s7):7073LEARNER]›): existing descriptor r62:‹/{Table/73-Max}› [(n7,s7):7073LEARNER, (n4,s4):2, (n5,s5):3, (n9,s9):4NON_VOTER, (n8,s8):5NON_VOTER, (n3,s3):7074, next=7075, gen=28306, sticky=1613891734.810409302,0]
I210221 06:55:13.643008 120200 kv/kvserver/replica_raft.go:277 ⋮ [n4,s4,r62/2:‹/{Table/73-Max}›] 370316  proposing SIMPLE(r7073) ‹[(n7,s7):7073LEARNER]›: after=‹[(n3,s3):7074 (n4,s4):2 (n5,s5):3 (n9,s9):4NON_VOTER (n8,s8):5NON_VOTER]› next=7075
I210221 06:55:13.644258 99397 kv/kvserver/store_remove_replica.go:127 ⋮ [n7,s7,r62/7073:‹/{Table/73-Max}›] 370317  removing replica r62/7073
I210221 06:55:13.656195 120200 kv/kvserver/replica_command.go:1798 ⋮ [n4,s4,r62/2:‹/{Table/73-Max}›] 370318  change replicas (add ‹[(n7,s7):7075LEARNER]› remove ‹[]›): existing descriptor r62:‹/{Table/73-Max}› [(n3,s3):7074, (n4,s4):2, (n5,s5):3, (n9,s9):4NON_VOTER, (n8,s8):5NON_VOTER, next=7075, gen=28307, sticky=1613891734.810409302,0]
I210221 06:55:13.696423 120200 kv/kvserver/replica_raft.go:277 ⋮ [n4,s4,r62/2:‹/{Table/73-Max}›] 370319  proposing SIMPLE(l7075) ‹[(n7,s7):7075LEARNER]›: after=‹[(n3,s3):7074 (n4,s4):2 (n5,s5):3 (n9,s9):4NON_VOTER (n8,s8):5NON_VOTER (n7,s7):7075LEARNER]› next=7076
I210221 06:55:13.710684 3299507 kv/kvserver/raft_snapshot_queue.go:126 ⋮ [n4,raftsnapshot,s4,r62/2:‹/{Table/73-Max}›] 370320  skipping snapshot; replica is likely a learner in the process of being added: (n7,s7):7075LEARNER

@otan
Copy link
Contributor

otan commented Feb 21, 2021

it seems it is stuck because the restore is never finishing. i don't think it's multi-region code that's causing the regression - my theory is it's multi-node related, as the test starts up a cluster of 9 nodes and i'm not sure other restore ccl logic tests do the same, so may be something new that has been unearthed. it seems to be stuck at a kvserver rate limit?

you can see the plan node waiting


goroutine 106696 [select, 39 minutes]:
github.com/cockroachdb/cockroach/pkg/sql.(*hookFnNode).Next(0xc025c8a380, 0x5945f20, 0xc025bbb740, 0xc0209c18c0, 0xc023faa4b8, 0x7f7a8b23fb20, 0x120, 0x120)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/planhook.go:157 +0x116
github.com/cockroachdb/cockroach/pkg/sql.(*planNodeToRowSource).Next(0xc024969800, 0x1, 0x6, 0xc025cba6c0, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/plan_node_to_row_source.go:179 +0x447
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Columnarizer).Next(0xc023d07600, 0x5965a20, 0xc025bf13e8, 0x5947ce0, 0xc025c8a380)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/columnarizer.go:175 +0xf2
github.com/cockroachdb/cockroach/pkg/sql/colexec.invariantsChecker.Next(0x59491a0, 0xc023d07600, 0x5965a20, 0xc025bf13e8, 0xc0209c18c0, 0xc023faa4b8)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/invariants_checker.go:44 +0x63
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Materializer).next(0xc024969200, 0xc025bbb740, 0xc0265e7a98, 0x30c5902)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:254 +0x163
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Materializer).nextAdapter(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:279
github.com/cockroachdb/cockroach/pkg/sql/colexecbase/colexecerror.CatchVectorizedRuntimeError(0xc0265e7af0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexecbase/colexecerror/error.go:91 +0x57
github.com/cockroachdb/cockroach/pkg/sql/colexec.(*Materializer).Next(0xc024969200, 0x5965a20, 0xc025bf13e8, 0x31efca0, 0x594a660)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/colexec/materializer.go:285 +0x5b
github.com/cockroachdb/cockroach/pkg/sql/execinfra.Run(0x5965a20, 0xc025bf13e8, 0x596c5e0, 0xc024969200, 0x590e360, 0xc01d3d5c00)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfra/base.go:171 +0x35
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBase).Run(0xc024969200, 0x5945f20, 0xc025bbb740)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:765 +0x92
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).Run(0xc025cba480, 0x5945f20, 0xc025bbb740, 0x520d350, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:382 +0x2e2
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run(0xc0060124d0, 0xc025bfc8c0, 0xc0254a30e0, 0xc025c8a500, 0xc01d3d5c00, 0xc023faa6d0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:367 +0x471
github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).PlanAndRun(0xc0060124d0, 0x5945fe0, 0xc025cae540, 0xc023faa6d0, 0xc025bfc8c0, 0xc0254a30e0, 0x5947ce0, 0xc025c8a380, 0x0, 0xc01d3d5c00, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:981 +0x1d2
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithDistSQLEngine(0xc023faa000, 0x5945fe0, 0xc025cae540, 0xc023faa4b8, 0x3, 0x7f7a8de4aff0, 0xc023ecef80, 0xc0265e8300, 0xc025c88898, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1018 +0x428
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine(0xc023faa000, 0x5945fe0, 0xc025cae540, 0xc023faa4b8, 0x7f7a8de4aff0, 0xc023ecef80, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:890 +0x706
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState(0xc023faa000, 0x5945fe0, 0xc025cae540, 0x59646a0, 0xc025bfbd40, 0xc023fb0abc, 0x40, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:652 +0xf86
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt(0xc023faa000, 0x5965a20, 0xc025bf1028, 0x59646a0, 0xc025bfbd40, 0xc023fb0abc, 0x40, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:120 +0xb14
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func1(0x59646a0, 0xc025bfbd40, 0xc023fb0abc, 0x40, 0x0, 0x0, 0x22ef3423, 0xed7c3f186, 0x0, 0x22ef3de5, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1475 +0x225
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd(0xc023faa000, 0x5945fe0, 0xc025cae510, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1477 +0x1413
github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run(0xc023faa000, 0x5945f20, 0xc023073ac0, 0xc014530fa0, 0x5400, 0x15000, 0xc014531040, 0xc023ead0d0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1402 +0x1cd
github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn(0xc002dc0000, 0x5945f20, 0xc023073ac0, 0xc023faa000, 0x5400, 0x15000, 0xc014531040, 0xc023ead0d0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:490 +0xce
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1(0xc023e99cb0, 0xc023ee5920, 0x5945f20, 0xc023073ac0, 0xc023ead0d0, 0xc002dc0000, 0xc023ece900, 0x59abce0, 0xc002b10b00, 0xc023ec9020, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:625 +0x3ea
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:553 +0x185

seems to be a restore:


goroutine 119496 [select, 39 minutes]:
github.com/cockroachdb/cockroach/pkg/jobs.(*StartableJob).AwaitCompletion(0xc026145d40, 0x5965a20, 0xc025bf1528, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/jobs/jobs.go:886 +0xbb
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.doRestorePlan(0x5965a20, 0xc025bf1528, 0xc025bfbd40, 0x5aa1500, 0xc023faa4b8, 0xc025c8d0a0, 0x1, 0x1, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/restore_planning.go:1788 +0x1ce5
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.restorePlanHook.func2(0x5965a20, 0xc025bf1528, 0x0, 0x0, 0x0, 0xc02494db60, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/restore_planning.go:1461 +0x67d
github.com/cockroachdb/cockroach/pkg/sql.(*hookFnNode).startExec.func1(0xc025c8a380, 0x5945f20, 0xc025bbb740, 0xc0209c18c0, 0xc023faa4b8)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/planhook.go:145 +0x7d
created by github.com/cockroachdb/cockroach/pkg/sql.(*hookFnNode).startExec
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/planhook.go:144 +0xcc

and

goroutine 120034 [chan receive, 39 minutes]:
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*RowChannel).Next(0xc027862db0, 0x4b52e50, 0x14, 0x0, 0xc0227c8a40)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfra/base.go:482 +0x4a
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*restoreDataProcessor).Next(0xc027d11b00, 0xc022ccd100, 0x4b52e50, 0x14, 0x5965a20)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/restore_data_processor.go:95 +0x62
github.com/cockroachdb/cockroach/pkg/sql/execinfra.Run(0x5965a20, 0xc022ca12a8, 0x596b6a0, 0xc027d11b00, 0x5911fe0, 0xc026a3e380)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfra/base.go:171 +0x35
github.com/cockroachdb/cockroach/pkg/sql/execinfra.(*ProcessorBase).Run(0xc027d11b00, 0x5945f20, 0xc022ccd100)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfra/processorsbase.go:765 +0x92
github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).startInternal.func1(0xc02258f900, 0x2, 0x2, 0x5945f20, 0xc022ccd100, 0xc005faa6c0, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:327 +0x5c
created by github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).startInternal
	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:326 +0x2e8

spawned by a split and scatter:


goroutine 120199 [semacquire, 39 minutes]:
sync.runtime_Semacquire(0xc027ad55d0)
	/usr/local/go/src/runtime/sema.go:56 +0x45
sync.(*WaitGroup).Wait(0xc027ad55c8)
	/usr/local/go/src/sync/waitgroup.go:130 +0x65
golang.org/x/sync/errgroup.(*Group).Wait(0xc027ad55c0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:40 +0x31
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.Wait(0xc027ad55c0, 0x5945f20, 0xc02294b240, 0xc02294b300, 0xc02294b240)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:142 +0x57
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*splitAndScatterProcessor).runSplitAndScatter(0xc00ca21800, 0x5945f20, 0xc02294afc0, 0xc005f645a0, 0xc00ca21d58, 0x58ad960, 0x7cdf6f8, 0x52111a0, 0x5965a20)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/split_and_scatter_processor.go:340 +0x310
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*splitAndScatterProcessor).Start.func1(0x5945f20, 0xc0225c19c0, 0xc00ca21800)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/split_and_scatter_processor.go:214 +0xd3
created by github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*splitAndScatterProcessor).Start
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/split_and_scatter_processor.go:206 +0x57

seems like the split and scatter is blocked waiting for kv: something is rate limited?!


goroutine 120200 [select]:
golang.org/x/time/rate.(*Limiter).WaitN(0xc0050fcdc0, 0x5945fe0, 0xc02461fb90, 0x1, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/time/rate/rate.go:253 +0x2f2
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*kvBatchSnapshotStrategy).sendBatch(0xc0276a20c0, 0x5945fe0, 0xc02461fb90, 0x7f7a88f41408, 0xc013e835b0, 0x5a9a600, 0xc010fc2c00, 0x7f7a5e806dd0, 0x4e)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_snapshot.go:499 +0x50
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*kvBatchSnapshotStrategy).Send(0xc0276a20c0, 0x5945fe0, 0xc02461fb90, 0x7f7a88f41408, 0xc013e835b0, 0xf143, 0xe514, 0xc017f36d20, 0xc0050fc2d0, 0xc0135c6a80, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_snapshot.go:368 +0x339
github.com/cockroachdb/cockroach/pkg/kv/kvserver.sendSnapshot(0x5945fe0, 0xc02461fb90, 0xc00323ca80, 0x7f7a88f41408, 0xc013e835b0, 0x58a5620, 0xc003afe240, 0xf143, 0xe514, 0xc017f36d20, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_snapshot.go:1142 +0xbf1
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).SendSnapshot(0xc00a60b500, 0x5945fe0, 0xc02461fb90, 0xc003afe240, 0xf143, 0xe514, 0xc017f36d20, 0xc0050fc2d0, 0xc0135c6a80, 0xc0135c68a0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:671 +0x288
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendSnapshot(0xc023e40900, 0x5945fe0, 0xc02461fb90, 0x700000007, 0xc000001ba3, 0xc011ed2558, 0x200000001, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:2165 +0x605
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).atomicReplicationChange(0xc023e40900, 0x5945fe0, 0xc02461fb90, 0xc01c038a80, 0x2, 0x4b26de6, 0x9, 0xc021bd6b40, 0x9f, 0xc014cb1100, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:1467 +0x4b0
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).changeReplicasImpl(0xc023e40900, 0x5945fe0, 0xc02461fb90, 0xc01c0385b0, 0x2, 0x4b26de6, 0x9, 0xc021bd6b40, 0x9f, 0xc014cb1100, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:1059 +0x532
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicateQueue).changeReplicas(0xc016b36770, 0x5945fe0, 0xc02461fb90, 0xc023e40900, 0xc014cb1100, 0x2, 0x2, 0xc01c0385b0, 0x2, 0x4b26de6, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replicate_queue.go:1110 +0xdd
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicateQueue).considerRebalance(0xc016b36770, 0x5945fe0, 0xc02461fb90, 0xc023e40900, 0xc024c86730, 0x3, 0x3, 0xc00fa8f230, 0x0, 0x1, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replicate_queue.go:970 +0x735
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicateQueue).processOneChange(0xc016b36770, 0x5945fe0, 0xc02461fb90, 0xc023e40900, 0xc00fa8f230, 0x0, 0x3fc3333333333301, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replicate_queue.go:409 +0x1032
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).adminScatter(0xc023e40900, 0x5945fe0, 0xc02461fb90, 0xc023424c18, 0x1, 0x8, 0xc023424c18, 0x2, 0x8, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_command.go:2795 +0x1fa
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).executeAdminBatch(0xc023e40900, 0x5945fe0, 0xc02461fb90, 0xc023975cc0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:590 +0x13eb
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendWithRangeID(0xc023e40900, 0x5945fe0, 0xc02461fb60, 0x3e, 0xc023975cc0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:98 +0x705
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).Send(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_send.go:34
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).Send(0xc013e26a00, 0x5965a20, 0xc0255403a8, 0x1665aeb56c2f7466, 0x0, 0x400000004, 0x2, 0x0, 0x3e, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_send.go:195 +0x5cc
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Stores).Send(0xc000e12e00, 0x5965a20, 0xc0255403a8, 0x0, 0x0, 0x400000004, 0x2, 0x0, 0x3e, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/stores.go:177 +0x118
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal.func1(0x5945fe0, 0xc02461fb30, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:875 +0x225
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunTaskWithErr(0xc00970b480, 0x5945fe0, 0xc02461fb30, 0x4b3fdb1, 0x10, 0xc015070a08, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:330 +0xb2
github.com/cockroachdb/cockroach/pkg/server.(*Node).batchInternal(0xc001eec100, 0x5945fe0, 0xc02461fb30, 0xc023975c20, 0xc02461fb30, 0x98, 0x489e0c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:863 +0x165
github.com/cockroachdb/cockroach/pkg/server.(*Node).Batch(0xc001eec100, 0x5945fe0, 0xc02461fb00, 0xc023975c20, 0xc025540268, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/server/node.go:901 +0xa5
github.com/cockroachdb/cockroach/pkg/rpc.internalClientAdapter.Batch(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/context.go:465
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).sendBatch(0xc00da59380, 0x5945fe0, 0xc02461fb00, 0x4, 0x5992c60, 0xc0159f5610, 0x0, 0x0, 0x400000004, 0x2, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:193 +0x176
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*grpcTransport).SendNext(0xc00da59380, 0x5965a20, 0xc025540268, 0x0, 0x0, 0x400000004, 0x2, 0x0, 0x3e, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/transport.go:175 +0x198
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendToReplicas(0xc0114ec400, 0x5965a20, 0xc025540268, 0x0, 0x0, 0x0, 0x0, 0x0, 0x3e, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1886 +0xa75
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(0xc0114ec400, 0x5965a20, 0xc025540268, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1507 +0x3a7
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(0xc0114ec400, 0x5965a20, 0xc025540268, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1145 +0x1885
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(0xc0114ec400, 0x5945f20, 0xc02294b240, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:784 +0x59b
github.com/cockroachdb/cockroach/pkg/kv.(*CrossRangeTxnWrapperSender).Send(0xc000e12608, 0x5945f20, 0xc02294b240, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/db.go:219 +0x9f
github.com/cockroachdb/cockroach/pkg/kv.SendWrappedWith(0x5945f20, 0xc02294b240, 0x58a53c0, 0xc000e12608, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/sender.go:421 +0x155
github.com/cockroachdb/cockroach/pkg/kv.SendWrapped(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/sender.go:438
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.dbSplitAndScatterer.splitAndScatterKey(0x5945f20, 0xc02294b240, 0xc0069a3c40, 0xc0069a3c40, 0x0, 0xc000e125b0, 0xc02294b200, 0xc022bfc8d0, 0x2, 0x8, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/split_and_scatter_processor.go:98 +0x58f
github.com/cockroachdb/cockroach/pkg/ccl/backupccl.(*splitAndScatterProcessor).runSplitAndScatter.func1(0x5945f20, 0xc02294b240, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/ccl/backupccl/split_and_scatter_processor.go:296 +0x1cb
github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1(0xc00ca21d58, 0x58ad960)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:166 +0x3a
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc027ad55c0, 0xc027ad56b0)
	/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
	/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x66

otan added a commit to otan-cockroach/cockroach that referenced this issue Feb 21, 2021
craig bot pushed a commit that referenced this issue Feb 21, 2021
60855: logictestccl: split rename column into separate test r=arulajmani a=otan

This hasn't flaked on a master build for a while, but people have
reported it locally. Going to try split the rename test into a separate
logic test with it's own cluster and see if we see similar symptoms.

Release note: None

60902: logictestccl: skip multi_region_backup r=ajstorm a=otan

Refs: #60773

Release note: None

Co-authored-by: Oliver Tan <[email protected]>
@cockroach-teamcity
Copy link
Member Author

(ccl/logictestccl).TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup failed on master@682582fd65a512d90f187a3f7c8e368a43bd89e9:

Slow failing tests:
TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup - 757.29s
TestStrictGCEnforcement - 1.00s

Slow passing tests:
TestLogic - 713.96s
TestAllRegisteredImportFixture - 224.52s
TestAlterTableLocalityRegionalByRowError - 101.11s
TestRestoreMidSchemaChange - 91.27s
TestBTreeDeleteInsertCloneOnce - 71.79s
TestRaceWithBackfill - 68.33s
TestBTreeDeleteInsertCloneEachTime - 59.28s
TestExecBuild - 58.04s
TestRemoveDeadReplicas - 54.39s
Example_demo - 51.99s
TestImportCSVStmt - 51.24s
TestAllRegisteredSetup - 44.61s
TestFullClusterBackup - 38.35s
TestBTree - 37.67s
TestBTree - 36.20s
TestScatterRandomizeLeases - 35.80s
TestImportIntoCSV - 35.14s
TestImportData - 34.54s
TestMergeQueueSeesNonVoters - 34.14s
TestBTreeDeleteInsertCloneEachTime - 28.87s

More

Parameters:

  • GOFLAGS=-json
make stressrace TESTS=TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup PKG=./pkg/ccl/logictestccl TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

See this test on roachdash
powered by pkg/cmd/internal/issues

@yuzefovich
Copy link
Member

yuzefovich commented Feb 22, 2021

I think #60642 might be to blame.

@yuzefovich yuzefovich assigned yuzefovich and unassigned ajstorm Feb 22, 2021
@yuzefovich
Copy link
Member

Never mind, I can easily reproduce stuckness with make stress when #60642 is reverted. Assigning it back to @ajstorm for further triage.

@yuzefovich yuzefovich removed their assignment Feb 22, 2021
@cockroach-teamcity
Copy link
Member Author

(ccl/logictestccl).TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup failed on master@2572200f7612c6508a52735a6a18767cfb7cc09d:

Slow failing tests:
TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup - 0.00s

Slow passing tests:
<none>

More

Parameters:

  • TAGS=
  • GOFLAGS=-parallel=4
make stressrace TESTS=TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup PKG=./pkg/ccl/logictestccl TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

See this test on roachdash
powered by pkg/cmd/internal/issues

@ajstorm
Copy link
Collaborator

ajstorm commented Mar 18, 2021

Can't seem to repro this anymore, and I'm hoping it means that the underlying hang has been resolved. Trying to reenable 🤞.

> make stressrace TESTS=multi_region_backup PKG=github.com/cockroachdb/cockroach/pkg/ccl/logictestccl TESTTIMEOUT=5m STRESSFLAGS='-p 4' 2>&1 | tee stress.out
Running make with -j16
GOPATH set to /Users/storm/go
mkdir -p lib
rm -f lib/lib{geos,geos_c}.dylib
cp -L /Users/storm/go/native/x86_64-apple-darwin19.6.0/geos/lib/lib{geos,geos_c}.dylib lib
GOFLAGS= go test  -race -mod=vendor -exec 'stress -p 4' -tags 'make x86_64_apple_darwin19.6.0 crdb_test' -ldflags '-X github.com/cockroachdb/cockroach/pkg/build.typ=development -extldflags "" -X "github.com/cockroachdb/cockroach/pkg/build.tag=v21.1.0-alpha.3-1561-gba8f03eeea-dirty" -X "github.com/cockroachdb/cockroach/pkg/build.rev=ba8f03eeea14a155c6f6317865b43a64effd2831" -X "github.com/cockroachdb/cockroach/pkg/build.cgoTargetTriple=x86_64-apple-darwin19.6.0"  ' -run "multi_region_backup" -timeout 0 github.com/cockroachdb/cockroach/pkg/ccl/logictestccl  -v -args -test.timeout 5m
24 runs so far, 0 failures, over 5s
80 runs so far, 0 failures, over 10s
136 runs so far, 0 failures, over 15s
188 runs so far, 0 failures, over 20s
244 runs so far, 0 failures, over 25s
296 runs so far, 0 failures, over 30s
352 runs so far, 0 failures, over 35s
408 runs so far, 0 failures, over 40s
464 runs so far, 0 failures, over 45s
521 runs so far, 0 failures, over 50s
580 runs so far, 0 failures, over 55s
636 runs so far, 0 failures, over 1m0s
snip
44811 runs so far, 0 failures, over 1h17m20s
44859 runs so far, 0 failures, over 1h17m25s
44910 runs so far, 0 failures, over 1h17m30s
44955 runs so far, 0 failures, over 1h17m35s
44999 runs so far, 0 failures, over 1h17m40s
45039 runs so far, 0 failures, over 1h17m45s
45086 runs so far, 0 failures, over 1h17m50s

@otan
Copy link
Contributor

otan commented Mar 18, 2021

did you ensure multi_region_backup have sufficiently beefy machines?

@ajstorm
Copy link
Collaborator

ajstorm commented Mar 18, 2021 via email

@ajstorm ajstorm added GA-blocker branch-release-21.1 release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 22, 2021
@ajstorm
Copy link
Collaborator

ajstorm commented Mar 22, 2021

Re-enabling this test exposed the fact that restore is currently broken for multi-region databases. Will be addressed with #62215.

@ajstorm
Copy link
Collaborator

ajstorm commented Mar 24, 2021

Needs #60835 to merge before we can re-enable this test case.

@nvanbenschoten nvanbenschoten removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 31, 2021
craig bot pushed a commit that referenced this issue Apr 1, 2021
62954: sql: Re-enable multi_region_backup test r=arulajmani a=ajstorm

With #60835 merged, this test no longer flakes. I've stressed it on my
GCE worker now for a while an it's all good.

Resolves #60773.

Release note: None

62959: sql: lease acquisition of OFFLINE descs may starve bulk operations r=ajwerner a=fqazi

Fixes: #61798

Previously, offline descriptors would never have their leases
cached and they would be released once the reference count hit zero.
This was inadequate because when attempting to online these tables
again the lease acquisition could be pushed back by other operations,
leading to starvation / live locks. To address this, this patch will
allow the leases of offline descriptors to be cached.

Release note (bug fix): Lease acquisitions of descriptor in a
offline state may starve out bulk operations (backup / restore)

Co-authored-by: Adam Storm <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
@craig craig bot closed this as completed in 82df3a1 Apr 1, 2021
ajstorm added a commit to ajstorm/cockroach that referenced this issue Apr 5, 2021
With cockroachdb#60835 merged, this test no longer flakes. I've stressed it on my
GCE worker now for a while an it's all good.

Resolves cockroachdb#60773.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). GA-blocker O-robot Originated from a bot. skipped-test
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants