roachtest: disk-full failed #78337
This looks to be the same as #78270. Blocked on a goroutine issuing a query that never returns.
We probably want to set a short context on these queries and then retry them. Additionally, it would be good to try other nodes, not just node 2.
Currently, when querying for node liveness in `disk-full`, a query may hang indefinitely, failing the test only after it has timed out (currently after 10 hours). Add retry logic for the node liveness query. The query is cancelled after one second and retried with exponential backoff for up to (approximately) one minute. Release note: None. Touches cockroachdb#78337, cockroachdb#78349.
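For illustration, here is roughly what that retry scheme could look like in Go: each attempt runs under a one-second context, and failed attempts back off exponentially until about a minute has elapsed. This is a minimal sketch under those assumptions, not the actual patch; the helper names and the liveness query text are hypothetical.

```go
package disktest

import (
	"context"
	"database/sql"
	"time"
)

// queryNodeLiveness runs one liveness check. The query text is illustrative,
// not necessarily what the roachtest issues.
func queryNodeLiveness(ctx context.Context, db *sql.DB) (int, error) {
	var live int
	err := db.QueryRowContext(ctx,
		`SELECT count(*) FROM crdb_internal.gossip_liveness WHERE draining = false`,
	).Scan(&live)
	return live, err
}

// queryLivenessWithRetry bounds each attempt with a one-second context so a
// hung query cannot block the test indefinitely, retrying with exponential
// backoff for roughly one minute overall.
func queryLivenessWithRetry(ctx context.Context, db *sql.DB) (int, error) {
	deadline := time.Now().Add(time.Minute)
	backoff := 100 * time.Millisecond
	var lastErr error
	for time.Now().Before(deadline) {
		attemptCtx, cancel := context.WithTimeout(ctx, time.Second)
		live, err := queryNodeLiveness(attemptCtx, db)
		cancel()
		if err == nil {
			return live, nil
		}
		lastErr = err
		select {
		case <-ctx.Done():
			return 0, ctx.Err()
		case <-time.After(backoff):
		}
		backoff *= 2 // exponential backoff between attempts
	}
	return 0, lastErr
}
```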
@jbowens - moving the conversation about the stuck node liveness query over here. I looked into what was executing on node 2, and it looks like the node liveness query was received and was still executing.
I was able to use the client IP address (presumably the roachtest worker IP) to look in the logs. It looks as if the execution of this query had to reach out to node 1 (which by this time was dead, as its disk had been filled up by the ballast file). I see the following in the logs on node 2:
I then see the following repeating for the next 10 hours on node 2:
Looking at goroutine 1499 on node 2, it indeed looks like it's stuck in some kind of retry loop processing a query that came in:

goroutine 1499 [select]:
github.com/cockroachdb/cockroach/pkg/util/retry.(*Retry).Next(0xc0023f2bc0)
github.com/cockroachdb/cockroach/pkg/util/retry/retry.go:127 +0x14a
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).sendPartialBatch(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...}, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1568 +0x778
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).divideAndSendBatchToRanges(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...}, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:1210 +0x426
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*DistSender).Send(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go:831 +0x725
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnLockGatekeeper).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_lock_gatekeeper.go:82 +0x269
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnMetricRecorder).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_metric_recorder.go:46 +0x118
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnCommitter).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_committer.go:129 +0x725
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).sendLockedWithRefreshAttempts(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...}, ...)
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:242 +0x262
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSpanRefresher).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go:177 +0x316
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnPipeliner).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:290 +0x286
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnSeqNumAllocator).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_seq_num_allocator.go:105 +0x9d
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*txnHeartbeater).SendLocked(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_heartbeater.go:232 +0x4e8
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord.(*TxnCoordSender).Send(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord/txn_coord_sender.go:532 +0x585
github.com/cockroachdb/cockroach/pkg/kv.(*DB).sendUsingSender(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...}, ...)
github.com/cockroachdb/cockroach/pkg/kv/db.go:984 +0x14d
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Send(_, {_, _}, {{{0x0, 0x0, 0x0}, 0x0, {0x0, 0x0, 0x0, ...}, ...}, ...})
github.com/cockroachdb/cockroach/pkg/kv/txn.go:1111 +0x21d
github.com/cockroachdb/cockroach/pkg/kv.sendAndFill({0x6226ab8, 0xc009646330}, 0xc0003ae918, 0xc001cad600)
github.com/cockroachdb/cockroach/pkg/kv/db.go:846 +0xfc
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).Run(0xc0030c8dc0, {0x6226ab8, 0xc009646330}, 0x4fb15da)
github.com/cockroachdb/cockroach/pkg/kv/txn.go:688 +0x74
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv.catalogQuerier.query({0x0, {0x0, 0x0}, {{0xc0003ae918}, {0xc0003ae918}, 0x0}}, {0x6226ab8, 0xc009646330}, 0x183ed86, 0xc001d83c50)
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv/catalog_query.go:112 +0xdd
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv.lookupIDs({0x6226ab8, 0xc009646330}, 0x13ce29f, {0x0, {0x0, 0x0}, {{0xc0003ae918}, {0xc0003ae918}, 0x0}}, {0xc0023f5ce0, ...})
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv/catalog_query.go:75 +0x117
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv.LookupIDs(...)
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv/namespace.go:30
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv.LookupID({0x6226ab8, 0xc009646330}, 0x8, {{0xc0003ae918}, {0xc0003ae918}, 0x0}, 0x1cad550, 0xc0, {0x4dbde5b, 0x9})
github.com/cockroachdb/cockroach/pkg/sql/catalog/internal/catkv/namespace.go:43 +0xfa
github.com/cockroachdb/cockroach/pkg/sql/catalog/lease.(*Manager).resolveName.func1({0x6226ab8, 0xc009646330}, 0xc009646300)
github.com/cockroachdb/cockroach/pkg/sql/catalog/lease/lease.go:888 +0x12f
github.com/cockroachdb/cockroach/pkg/kv.runTxn.func1({0x6226ab8, 0xc009646330}, 0x62a1df8)
github.com/cockroachdb/cockroach/pkg/kv/db.go:948 +0x27
github.com/cockroachdb/cockroach/pkg/kv.(*Txn).exec(0xc0030c8dc0, {0x6226ab8, 0xc009646330}, 0xc0023f6138)
github.com/cockroachdb/cockroach/pkg/kv/txn.go:980 +0xae
github.com/cockroachdb/cockroach/pkg/kv.runTxn({0x6226ab8, 0xc009646330}, 0xc000a32c60, 0x0)
github.com/cockroachdb/cockroach/pkg/kv/db.go:947 +0x5a
github.com/cockroachdb/cockroach/pkg/kv.(*DB).Txn(0x1, {0x6226ab8, 0xc009646330}, 0x0)
github.com/cockroachdb/cockroach/pkg/kv/db.go:910 +0x85
...
github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1()
github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:724 +0x3b8
created by github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync
github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:643 +0x273

I'm not full bottle on the sequencing, but it seems like the range in question was only replicated on node 1, which was dead. This feels more like an issue with the test, which #78380 should solve, but I'm curious as to whether this request to node 1 could have been serviced by another node?
Currently, the `disk-full` roachtest creates a cluster and immediately places a ballast file on one node, which causes it to crash. If this node holds the only replica of a range containing a system table, certain system queries may not complete once the node crashes due to a full disk. This leaves the test unable to make forward progress: the one dead node prevents a system query from completing, and that query prevents the node from being restarted. Wait for all ranges to have at least two replicas before placing the ballast file on the one node. Touches cockroachdb#78337, cockroachdb#78270. Release note: None.
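A rough sketch of what that wait could look like, polling until every range reports at least two replicas before the ballast file is placed. The `crdb_internal.ranges` query and the helper name are assumptions based on the commit message, not the PR's verbatim code:

```go
package disktest

import (
	"context"
	"database/sql"
	"time"
)

// waitForMinReplication polls until every range reports at least two
// replicas, so that filling one node's disk cannot strand the only copy of a
// system range.
func waitForMinReplication(ctx context.Context, db *sql.DB) error {
	for {
		var minReplicas int
		if err := db.QueryRowContext(ctx,
			`SELECT min(array_length(replicas, 1)) FROM crdb_internal.ranges`,
		).Scan(&minReplicas); err != nil {
			return err
		}
		if minReplicas >= 2 {
			return nil // safe to place the ballast file now
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}
```

The test would call something like this right after cluster startup, before writing the ballast file on the target node.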
#78456 should fix this, once backported.
78014: streamingccl: make producer job exit smoothly after ingestion cutover r=gh-casper a=gh-casper
Previously, the producer job would time out and fail automatically after ingestion cutover, as the consumer stops sending heartbeats. This is a poor user experience, since the stream replication succeeded but shows up as failed. This PR adds a new crdb builtin, "crdb_internal.complete_replication_stream", to let the consumer signal the source cluster that cutover has happened. Closes: #76954. Release justification: Cat 4. Release note: none.

78302: sql: fix migration with new system.table_statistics column r=rharding6373 a=rharding6373
Before this change, the new `system.table_statistics` column `avgSize`, introduced in version 22.1.12, was appended to the end of the table during migration, but the system schema had the new column in a different order. The column was also not added to the existing column family containing all table columns during migration. This change fixes both the system schema and the migration commands so that the column ordering is the same and the new column is added to the existing column family. Unfortunately, this means that the existing column family name cannot be updated to include the column. Fixes: #77979. Release justification: Fixes a schema migration bug in the table_statistics table. Release note: None.

78410: changefeedccl: remove tenant timestamp protection gates r=samiskin a=samiskin
Now that protected timestamps function in tenants in 22.1, the protected-timestamp gates in changefeeds can be removed. Resolves #76936. Release justification: low-risk change turning off now-unneeded gates. Release note (enterprise change): changefeeds can now protect targets from GC on user tenants.

78445: colexec: use Bytes.Copy instead of Get and Set in most places r=yuzefovich a=yuzefovich
**coldata: fix the usage of Bytes.Copy in CopyWithReorderedSource** This was the intention but wasn't working because the call happens inside a separate template. Release note: None.
**colexec: use Bytes.Copy instead of Get and Set in most places** This commit audits our code for the usage of `Bytes.Get` followed by `Bytes.Set` and replaces that pattern with `Bytes.Copy` (which is faster for inlined values) in non-test code. Release note: None.

78456: roachtest: wait for ranges to replicate before filling disk r=tbg a=nicktrav
Currently, the `disk-full` roachtest creates a cluster and immediately places a ballast file on one node, which causes it to crash. If this node holds the only replica of a range containing a system table, certain system queries may not complete once the node crashes due to a full disk. This leaves the test unable to make forward progress: the one dead node prevents a system query from completing, and that query prevents the node from being restarted. Wait for all ranges to have at least two replicas before placing the ballast file on the one node. Touches #78337, #78270. Release note: None.

78468: sql: return an error when partition spans has no healthy instances r=rharding6373 a=rharding6373
If there are no SQL instances available for planning, partitionSpansTenant in the DistSQL planner will panic. This PR fixes the issue so that it instead returns an error if there are no instances available. Fixes: #77590. Release justification: Fixes a bug in DistSQL that can cause a panic for non-system tenants. Release note: None.

Co-authored-by: Casper <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Shiranka Miskin <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
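As a side note on the 78468 fix quoted above, the change is the usual panic-to-error conversion. A schematic sketch of that pattern, with all names hypothetical (the real partitionSpansTenant lives in the DistSQL planner and looks quite different):

```go
package disktest

import "errors"

// instanceInfo stands in for the planner's SQL instance metadata.
type instanceInfo struct {
	id int
}

// pickInstance sketches the shape of the fix: rather than indexing into a
// possibly empty instance list (which would panic), surface an error the
// caller can handle.
func pickInstance(instances []instanceInfo) (instanceInfo, error) {
	if len(instances) == 0 {
		return instanceInfo{}, errors.New("no healthy SQL instances available for planning")
	}
	return instances[0], nil
}
```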
Closed by #78538.
roachtest.disk-full failed with artifacts on release-22.1 @ 26c05e15d3752f47a91592b5b870360c30c57dd5:
Jira issue: CRDB-14075