
roachtest: disk-full failed #78337

Closed
cockroach-teamcity opened this issue Mar 23, 2022 · 4 comments
Labels
branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team

Comments


cockroach-teamcity commented Mar 23, 2022

roachtest.disk-full failed with artifacts on release-22.1 @ 26c05e15d3752f47a91592b5b870360c30c57dd5:

The test failed on branch=release-22.1, cloud=gce:
test artifacts and logs in: /artifacts/disk-full/run_1
	disk_full.go:71,monitor.go:105,errgroup.go:57: dial tcp 34.139.206.51:26257: connect: connection refused

	monitor.go:127,disk_full.go:150,test_runner.go:866: monitor failure: monitor task failed: t.Fatal() was called
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerDiskFull.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/disk_full.go:150
		  | main.(*testRunner).runTest.func2
		  | 	main/pkg/cmd/roachtest/test_runner.go:866
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func2
		  | 	main/pkg/cmd/roachtest/monitor.go:171
		Wraps: (4) monitor task failed
		Wraps: (5) attached stack trace
		  -- stack trace:
		  | main.init
		  | 	main/pkg/cmd/roachtest/monitor.go:80
		  | runtime.doInit
		  | 	GOROOT/src/runtime/proc.go:6498
		  | runtime.main
		  | 	GOROOT/src/runtime/proc.go:238
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1581
		Wraps: (6) t.Fatal() was called
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.leafError

	test_runner.go:997,test_runner.go:896: test timed out (0s)
Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/storage


Jira issue: CRDB-14075

@cockroach-teamcity cockroach-teamcity added branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 23, 2022
@blathers-crl blathers-crl bot added the T-storage Storage Team label Mar 23, 2022

nicktrav commented Mar 23, 2022

This looks to be the same as #78270. The test is blocked on a goroutine issuing a query that never returns.

goroutine 4003382 [IO wait, 600 minutes]:
internal/poll.runtime_pollWait(0x7f1369903bf0, 0x72)
	/usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc000bcb500, 0xc0011ad000, 0x0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc000bcb500, {0xc0011ad000, 0x1000, 0x1000})
	/usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc000bcb500, {0xc0011ad000, 0x14f, 0x47})
	/usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc000e66370, {0xc0011ad000, 0xc0045085a0, 0x0})
	/usr/local/go/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc0020bfb00, {0xc0045085a0, 0x5, 0xc001a45438})
	/usr/local/go/src/bufio/bufio.go:227 +0x1b4
io.ReadAtLeast({0x5d83700, 0xc0020bfb00}, {0xc0045085a0, 0x5, 0x200}, 0x5)
	/usr/local/go/src/io/io.go:328 +0x9a
io.ReadFull(...)
	/usr/local/go/src/io/io.go:347
github.com/lib/pq.(*conn).recvMessage(0xc004508580, 0xc001a455a0)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:983 +0xca
github.com/lib/pq.(*conn).recv1Buf(0xc004508580, 0x0)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:1038 +0x2e
github.com/lib/pq.(*conn).recv1(...)
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:1065
github.com/lib/pq.(*conn).simpleQuery(0xc004508580, {0x493a840, 0x41})
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:675 +0x231
github.com/lib/pq.(*conn).query(0xc004508580, {0x493a840, 0xc0000780a8}, {0x8988558, 0x0, 0x0})
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:868 +0x426
github.com/lib/pq.(*conn).QueryContext(0x8, {0x5e0d390, 0xc0000780a8}, {0x493a840, 0x41}, {0x8988558, 0x0, 0x8})
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn_go18.go:21 +0xd7
database/sql.ctxDriverQuery({0x5e0d390, 0xc0000780a8}, {0x7f136acb4d40, 0xc004508580}, {0x0, 0x0}, {0x493a840, 0xc0010847e0}, {0x8988558, 0x0, ...})
	/usr/local/go/src/database/sql/ctxutil.go:48 +0x17d
database/sql.(*DB).queryDC.func1()
	/usr/local/go/src/database/sql/sql.go:1722 +0x175
database/sql.withLock({0x5da4f38, 0xc0010847e0}, 0xc001a45ae0)
	/usr/local/go/src/database/sql/sql.go:3396 +0x8c
database/sql.(*DB).queryDC(0xc001a45b01, {0x5e0d390, 0xc0000780a8}, {0x0, 0x0}, 0xc0010847e0, 0xc0056502c0, {0x493a840, 0x41}, {0x0, ...})
	/usr/local/go/src/database/sql/sql.go:1717 +0x211
database/sql.(*DB).query(0x532fe5, {0x5e0d390, 0xc0000780a8}, {0x493a840, 0x41}, {0x0, 0x0, 0x0}, 0x58)
	/usr/local/go/src/database/sql/sql.go:1700 +0xfd
database/sql.(*DB).QueryContext(0x8, {0x5e0d390, 0xc0000780a8}, {0x493a840, 0x41}, {0x0, 0x0, 0x0})
	/usr/local/go/src/database/sql/sql.go:1674 +0xdf
database/sql.(*DB).QueryRowContext(...)
	/usr/local/go/src/database/sql/sql.go:1778
database/sql.(*DB).QueryRow(0xc000dbb540, {0x493a840, 0xc001579640}, {0x0, 0x4821cdb, 0x300000002})
	/usr/local/go/src/database/sql/sql.go:1792 +0x4a
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerDiskFull.func1.2({0x5e0d358, 0xc001579640})
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/disk_full.go:68 +0x205
main.(*monitorImpl).Go.func1()
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:106 +0xb0
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x67
created by golang.org/x/sync/errgroup.(*Group).Go
	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x92

We probably want to set a short context timeout on these queries and then retry them. Additionally, it would be good to try other nodes, not just node 2.

nicktrav added a commit to nicktrav/cockroach that referenced this issue Mar 24, 2022
Currently, when querying for node liveness in `disk-full`, a query may
hang indefinitely, failing the test only after it has timed out
(currently after 10 hours).

Add retry logic for the node liveness query. Each attempt is cancelled
after one second, and attempts are retried with exponential backoff for
up to approximately one minute.

Release note: None.

Touches cockroachdb#78337, cockroachdb#78349.
@nicktrav

@jbowens - moving the conversation about the node_liveness query failing over here (from PR #78380). While not reproducible on demand, this issue and #78349 are similar in that they exhibit the same blocked goroutine in the test runner (it happens on 21.2 and 22.2).

I looked into what was executing on node 2, and it looks like the node liveness query was received and was still in the pending state when the debug zip was taken (~10 hours after the query was executed!).

From crdb_internal.cluster_queries.txt on node 2:

query_id	txn_id	node_id	session_id	user_name	start	query	client_address	application_name	distributed	phase
16deed3c4e32fdd40000000000000002	c3474258-f7b5-44b0-a925-886413d2e77b	2	16deed3c4e28d1260000000000000002	root	2022-03-23 06:02:36.929749	SELECT is_live FROM crdb_internal.gossip_nodes WHERE node_id = 1	34.138.211.10:35408		NULL	preparing

I was able to use the client IP address (presumably the roachtest worker IP) to look in the logs. It looks as if the execution of this query had to reach out to node 1 (which by this time was dead, as its disk had been filled up by the ballast file).

I see the following in the logs on node 2:

W220323 06:03:37.730052 1499 kv/kvclient/kvcoord/dist_sender.go:1615 ⋮ [n2,client=34.138.211.10:35408,user=root] 324  slow range RPC: have been waiting 60.80s (56 attempts) for RPC Get [‹/NamespaceTable/30/1/0/0/"defaultdb"/4/1›,‹/Min›), [txn: 5d8c1a9e] to r26:‹/NamespaceTable/{30-Max}› [(n1,s1):1, next=2, gen=0]; resp: ‹failed to send RPC: sending to all replicas failed; last error: unable to dial n1: breaker open›

I then see the following repeating for the next 10 hours on node 2:

W220323 06:04:37.525777 1499 2@rpc/nodedialer/nodedialer.go:199 ⋮ [n2,client=34.138.211.10:35408,user=root] 407  unable to connect to n1: failed to connect to n1 at ‹10.142.0.27:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.142.0.27:26257: connect: connection refused"›

Looking at goroutine 1499 on node 2, it does indeed look like it's stuck in some kind of retry loop while processing a query that came in:

goroutine 1499 [select]:
github.com/cockroachdb/cockroach/pkg/util/retry.(*Retry).Next(0xc0023f2bc0)
	github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:127 +0x14a
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1568 +0x778
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1210 +0x426
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:831 +0x725
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:82 +0x269
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:46 +0x118
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnCommitter).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:129 +0x725
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:242 +0x262
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:177 +0x316
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:290 +0x286
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:105 +0x9d
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:232 +0x4e8
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:532 +0x585
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...}, ...)
	github.com/cockroachdb/cockroach/pkg/kv/db.go:984 +0x14d
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
	github.com/cockroachdb/cockroach/pkg/kv/txn.go:1111 +0x21d
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill({0x6226ab8, 0xc009646330}, 0xc0003ae918, 0xc001cad600)
	github.com/cockroachdb/cockroach/pkg/kv/db.go:846 +0xfc
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Run(0xc0030c8dc0, {0x6226ab8, 0xc009646330}, 0x4fb15da)
	github.com/cockroachdb/cockroach/pkg/kv/txn.go:688 +0x74
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv.catalogQuerier.query({0x0, {0x0, 0x0}, {{0xc0003ae918}, {0xc0003ae918}, 0x0}}, {0x6226ab8, 0xc009646330}, 0x183ed86, 0xc001d83c50)
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv/catalog_query.go:112 +0xdd
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv.lookupIDs({0x6226ab8, 0xc009646330}, 0x13ce29f, {0x0, {0x0, 0x0}, {{0xc0003ae918}, {0xc0003ae918}, 0x0}}, {0xc0023f5ce0, ...})
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv/catalog_query.go:75 +0x117
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv.LookupIDs(...)
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv/namespace.go:30
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv.LookupID({0x6226ab8, 0xc009646330}, 0x8, {{0xc0003ae918}, {0xc0003ae918}, 0x0}, 0x1cad550, 0xc0, {0x4dbde5b, 0x9})
	github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv/namespace.go:43 +0xfa
github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.(*Manager).resolveName.func1({0x6226ab8, 0xc009646330}, 0xc009646300)
	github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:888 +0x12f
github.com/cockroachdb/cockroach/pkg/kv.runTxn.func1({0x6226ab8, 0xc009646330}, 0x62a1df8)
	github.com/cockroachdb/cockroach/pkg/kv/db.go:948 +0x27
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec(0xc0030c8dc0, {0x6226ab8, 0xc009646330}, 0xc0023f6138)
	github.com/cockroachdb/cockroach/pkg/kv/txn.go:980 +0xae
github.com/cockroachdb/cockroach/pkg/kv.runTxn({0x6226ab8, 0xc009646330}, 0xc000a32c60, 0x0)
	github.com/cockroachdb/cockroach/pkg/kv/db.go:947 +0x5a
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn(0x1, {0x6226ab8, 0xc009646330}, 0x0)
	github.com/cockroachdb/cockroach/pkg/kv/db.go:910 +0x85
...
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1()
	github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:724 +0x3b8
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync
	github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:643 +0x273

I'm not certain of the exact sequencing, but it seems like the range r26:/NamespaceTable/{30-Max} [(n1,s1):1, next=2, gen=0] was in the request path. This effectively deadlocks the test: we need this query to complete, but node 1 is never coming back, because restarting it is blocked on that same query returning.

This feels more like an issue with the test, which #78380 should solve, but I'm curious whether this request to node 1 could have been served by another node.

nicktrav added a commit to nicktrav/cockroach that referenced this issue Mar 24, 2022
Currently, the `disk-full` roachtest creates a cluster and immediately
places a ballast file on one node, which causes it to crash. If this
node is the only replica for a range containing a system table, when the
node crashes due to a full disk certain system queries may not complete.
This results in the test being unable to make forward progress, as the
one dead node prevents a system query from completing, and this query
prevents the node from being restarted.

Wait for all ranges to have at least two replicas before placing the
ballast file on the one node.

Touches cockroachdb#78337, cockroachdb#78270.

Release note: None.
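The replication wait described above could be sketched roughly as follows. The query shape against `crdb_internal.ranges_no_leases` is an assumption for illustration (the real change is in #78456); the replica-count check is split into a pure helper so it can be tested on its own.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"time"
)

// allReplicated reports whether every range has at least two replicas.
func allReplicated(replicaCounts []int) bool {
	for _, c := range replicaCounts {
		if c < 2 {
			return false
		}
	}
	return true
}

// waitForReplication polls the cluster until no range is down to a single
// replica. The crdb_internal query is illustrative and may not match the
// actual test code.
func waitForReplication(ctx context.Context, db *sql.DB) error {
	for {
		rows, err := db.QueryContext(ctx,
			`SELECT array_length(replicas, 1) FROM crdb_internal.ranges_no_leases`)
		if err != nil {
			return err
		}
		var counts []int
		for rows.Next() {
			var c int
			if err := rows.Scan(&c); err != nil {
				rows.Close()
				return err
			}
			counts = append(counts, c)
		}
		rows.Close()
		if allReplicated(counts) {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}

func main() {
	fmt.Println(allReplicated([]int{3, 3, 3}), allReplicated([]int{1, 3})) // Prints: true false
}
```

With this in place, the ballast file is only written once no range would lose its last replica when the target node crashes.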
@nicktrav nicktrav removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 24, 2022
@nicktrav nicktrav self-assigned this Mar 25, 2022
nicktrav added a commit to nicktrav/cockroach that referenced this issue Mar 25, 2022
@nicktrav

#78456 should fix this, once backported.

craig bot pushed a commit that referenced this issue Mar 25, 2022
78014: streamingccl: make producer job exit smoothly after ingestion cutover r=gh-casper a=gh-casper

Previously producer job will time out and fail automatically after
ingestion cutover as consumer stops sending heartbeats.
This is not a good UX experience since stream replication is successful
but showed up failed.

This PR adds a new crdb builtin "crdb_internal.complete_replication_stream"
to let consumer send signal to source cluster that ingestion happens.

Closes: #76954
Release justification: Cat 4.
Release note: none.

78302: sql: fix migration with new system.table_statistics column r=rharding6373 a=rharding6373

Before this change, the new `system.table_statistics` column `avgSize`
introduced in version 22.1.12 was appended to the end of the table
during migration, but the system schema had the new column in a
different order. The column was also not added to the existing column
family containing all table columns during migration.

This change fixes both the system schema and the migration commands so
that the column ordering is the same and the new column is added to the
existing column family. Unfortunately, this means that the existing
column family name cannot be updated to include the column.

Fixes: #77979

Release justification: Fixes a schema migration bug in the
table_statistics table.

Release note: None

78410: changefeedccl: remove tenant timestamp protection gates r=samiskin a=samiskin

Now that protected timestamps function in tenants in 22.1, the pts gates
in changefeeds can be removed.

Resolves #76936

Release justification: low risk change turning off now-unneeded gates
Release note (enterprise change): changefeeds can now protect targets
from gc on user tenants

78445: colexec: use Bytes.Copy instead of Get and Set in most places r=yuzefovich a=yuzefovich

**coldata: fix the usage of Bytes.Copy in CopyWithReorderedSource**

This was the intention but wasn't working because the call happens
inside a separate template.

Release note: None

**colexec: use Bytes.Copy instead of Get and Set in most places**

This commit audits our code for the usage of `Bytes.Get` followed by
`Bytes.Set` pattern and replaces those with `Bytes.Copy` (which is
faster for inlined values) in non-test code.

Release note: None

78456: roachtest: wait for ranges to replicate before filling disk r=tbg a=nicktrav

Currently, the `disk-full` roachtest creates a cluster and immediately
places a ballast file on one node, which causes it to crash. If this
node is the only replica for a range containing a system table, when the
node crashes due to a full disk certain system queries may not complete.
This results in the test being unable to make forward progress, as the
one dead node prevents a system query from completing, and this query
prevents the node from being restarted.

Wait for all ranges to have at least two replicas before placing the
ballast file on the one node.

Touches #78337, #78270.

Release note: None.

78468: sql: return an error when partition spans has no healthy instances r=rharding6373 a=rharding6373

If there are no SQL instances available for planning,
partitionSpansTenant in the DistSQL planner will panic. This PR fixes
the issue so that it instead returns an error if there are no instances
available.

Fixes: #77590

Release justification: Fixes a bug in DistSQL that can cause a panic for
non-system tenants.

Release note: None

Co-authored-by: Casper <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Shiranka Miskin <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Mar 25, 2022
blathers-crl bot pushed a commit that referenced this issue Mar 25, 2022
nicktrav added a commit to nicktrav/cockroach that referenced this issue Mar 28, 2022
nicktrav added a commit that referenced this issue Mar 28, 2022
@nicktrav

Closed by #78538.

nicktrav added a commit that referenced this issue Mar 29, 2022