roachtest: disk-full failed #78270

Closed
cockroach-teamcity opened this issue Mar 22, 2022 · 3 comments
Assignees
nicktrav
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team

cockroach-teamcity commented Mar 22, 2022

roachtest.disk-full failed with artifacts on release-21.2 @ a5a88a4db163f0915ae65649aa0264c7a913fbfb:

The test failed on branch=release-21.2, cloud=gce:
test timed out (see artifacts for details)
Reproduce

See: roachtest README

/cc @cockroachdb/storage

This test on roachdash

Jira issue: CRDB-14040

@cockroach-teamcity cockroach-teamcity added branch-release-21.2 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Mar 22, 2022
@blathers-crl blathers-crl bot added the T-storage Storage Team label Mar 22, 2022
@nicktrav nicktrav self-assigned this Mar 22, 2022
@nicktrav

Looks like this branch already has a backport of the fix for the recent issue, so it's probably not that: https://github.com/cockroachdb/cockroach/blob/release-21.2/pkg/cmd/roachtest/tests/disk_full.go

@nicktrav

It looks like the monitor goroutine got stuck waiting for a query response; the wait timed out after 600 minutes, failing the test:

err := db.QueryRow(`SELECT is_live FROM crdb_internal.gossip_nodes WHERE node_id = 1;`).Scan(&isLive)

goroutine 4050163 [IO wait, 600 minutes]:
internal/poll.runtime_pollWait(0x7ff475f65738, 0x72)
        /usr/local/go/src/runtime/netpoll.go:234 +0x89
internal/poll.(*pollDesc).wait(0xc00059ff00, 0xc000eba000, 0x0)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:84 +0x32
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc00059ff00, {0xc000eba000, 0x1000, 0x1000})
        /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc00059ff00, {0xc000eba000, 0x15a, 0x47})
        /usr/local/go/src/net/fd_posix.go:56 +0x29
net.(*conn).Read(0xc000f10820, {0xc000eba000, 0xc0045c7620, 0x0})
        /usr/local/go/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc004d3ea20, {0xc0045c7620, 0x5, 0xc0019b7438})
        /usr/local/go/src/bufio/bufio.go:227 +0x1b4
io.ReadAtLeast({0x5d836e0, 0xc004d3ea20}, {0xc0045c7620, 0x5, 0x200}, 0x5)
        /usr/local/go/src/io/io.go:328 +0x9a
io.ReadFull(...)
        /usr/local/go/src/io/io.go:347
github.com/lib/pq.(*conn).recvMessage(0xc0045c7600, 0xc0019b75a0)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:983 +0xca
github.com/lib/pq.(*conn).recv1Buf(0xc0045c7600, 0x0)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:1038 +0x2e
github.com/lib/pq.(*conn).recv1(...)
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:1065
github.com/lib/pq.(*conn).simpleQuery(0xc0045c7600, {0x493a840, 0x41})
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:675 +0x231
github.com/lib/pq.(*conn).query(0xc0045c7600, {0x493a840, 0xc000126000}, {0x8988558, 0x0, 0x0})
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn.go:868 +0x426
github.com/lib/pq.(*conn).QueryContext(0x8, {0x5e0d370, 0xc000126000}, {0x493a840, 0x41}, {0x8988558, 0x0, 0x8})
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/lib/pq/conn_go18.go:21 +0xd7
database/sql.ctxDriverQuery({0x5e0d370, 0xc000126000}, {0x7ff48b21af48, 0xc0045c7600}, {0x0, 0x0}, {0x493a840, 0xc001c7c990}, {0x8988558, 0x0, ...})
        /usr/local/go/src/database/sql/ctxutil.go:48 +0x17d
database/sql.(*DB).queryDC.func1()
        /usr/local/go/src/database/sql/sql.go:1722 +0x175
database/sql.withLock({0x5da4f18, 0xc001c7c990}, 0xc0019b7ae0)
        /usr/local/go/src/database/sql/sql.go:3396 +0x8c
database/sql.(*DB).queryDC(0xc0019b7b01, {0x5e0d370, 0xc000126000}, {0x0, 0x0}, 0xc001c7c990, 0xc001323470, {0x493a840, 0x41}, {0x0, ...})
        /usr/local/go/src/database/sql/sql.go:1717 +0x211
database/sql.(*DB).query(0x532fe5, {0x5e0d370, 0xc000126000}, {0x493a840, 0x41}, {0x0, 0x0, 0x0}, 0x58)
        /usr/local/go/src/database/sql/sql.go:1700 +0xfd
database/sql.(*DB).QueryContext(0x8, {0x5e0d370, 0xc000126000}, {0x493a840, 0x41}, {0x0, 0x0, 0x0})
        /usr/local/go/src/database/sql/sql.go:1674 +0xdf
database/sql.(*DB).QueryRowContext(...)
        /usr/local/go/src/database/sql/sql.go:1778
database/sql.(*DB).QueryRow(0xc0010023c0, {0x493a840, 0xc001cab600}, {0x0, 0x4821cdb, 0x300000002})
        /usr/local/go/src/database/sql/sql.go:1792 +0x4a
github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerDiskFull.func1.2({0x5e0d338, 0xc001cab600})
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/disk_full.go:68 +0x205
main.(*monitorImpl).Go.func1()
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/monitor.go:106 +0xb0
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x67
created by golang.org/x/sync/errgroup.(*Group).Go
        /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x92

I think we can chalk this one up to an infrastructure flake. Closing.
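For reference, a minimal sketch of how a monitor query like this can be bounded with a context deadline, so a dead connection surfaces as an error instead of blocking the goroutine for hours. The package, function name, and one-minute budget are illustrative assumptions, not the test's actual code:

package monitor

import (
	"context"
	"database/sql"
	"time"
)

// queryIsLive issues the same liveness check as the test, but under a
// deadline: if the connection is dead, QueryRowContext returns an error
// once the context expires rather than waiting on the socket forever.
func queryIsLive(db *sql.DB) (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	var isLive bool
	err := db.QueryRowContext(ctx,
		`SELECT is_live FROM crdb_internal.gossip_nodes WHERE node_id = 1`,
	).Scan(&isLive)
	return isLive, err
}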

nicktrav added a commit to nicktrav/cockroach that referenced this issue Mar 24, 2022
Currently, the `disk-full` roachtest creates a cluster and immediately
places a ballast file on one node, which causes it to crash. If this
node is the only replica for a range containing a system table, when the
node crashes due to a full disk, certain system queries may not complete.
This results in the test being unable to make forward progress, as the
one dead node prevents a system query from completing, and this query
prevents the node from being restarted.

Wait for all ranges to have at least two replicas before placing the
ballast file on the one node.

Touches cockroachdb#78337, cockroachdb#78270.

Release note: None.
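A minimal sketch of what "wait for at least two replicas" can look like, polling crdb_internal.ranges until no range is under-replicated. The exact query, the one-second cadence, and the function name are assumptions; the actual change may differ:

package diskfull

import (
	"context"
	"database/sql"
	"time"
)

// waitForMinReplication blocks until every range reports at least two
// replicas, or until the caller's context is done.
func waitForMinReplication(ctx context.Context, db *sql.DB) error {
	const q = `SELECT count(*) FROM crdb_internal.ranges
	           WHERE array_length(replicas, 1) < 2`
	for {
		var underReplicated int
		if err := db.QueryRowContext(ctx, q).Scan(&underReplicated); err != nil {
			return err
		}
		if underReplicated == 0 {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Second):
		}
	}
}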
@nicktrav nicktrav reopened this Mar 24, 2022
@nicktrav nicktrav removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Mar 24, 2022
nicktrav added a commit to nicktrav/cockroach that referenced this issue Mar 25, 2022
roachtest: wait for ranges to replicate before filling disk
craig bot pushed a commit that referenced this issue Mar 25, 2022
78014: streamingccl: make producer job exit smoothly after ingestion cutover r=gh-casper a=gh-casper

Previously, the producer job would time out and fail automatically after
ingestion cutover, as the consumer stops sending heartbeats. This is a poor
user experience, since the stream replication succeeded but shows up as
failed.

This PR adds a new crdb builtin, `crdb_internal.complete_replication_stream`,
to let the consumer signal the source cluster that ingestion has cut over.

Closes: #76954
Release justification: Cat 4.
Release note: none.
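A hedged sketch of how the new builtin might be invoked from the destination cluster after cutover; taking a single stream-ID argument is an assumption about its signature, and the surrounding names are illustrative:

package streaming

import (
	"context"
	"database/sql"
)

// completeStream tells the source cluster that ingestion has cut over, so
// the producer job can finish cleanly instead of timing out.
func completeStream(ctx context.Context, db *sql.DB, streamID int64) error {
	_, err := db.ExecContext(ctx,
		`SELECT crdb_internal.complete_replication_stream($1)`, streamID)
	return err
}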

78302: sql: fix migration with new system.table_statistics column r=rharding6373 a=rharding6373

Before this change, the new `system.table_statistics` column `avgSize`
introduced in version 22.1.12 was appended to the end of the table
during migration, but the system schema had the new column in a
different order. The column was also not added to the existing column
family containing all table columns during migration.

This change fixes both the system schema and the migration commands so
that the column ordering is the same and the new column is added to the
existing column family. Unfortunately, this means that the existing
column family name cannot be updated to include the new column.

Fixes: #77979

Release justification: Fixes a schema migration bug in the
table_statistics table.

Release note: None

78410: changefeedccl: remove tenant timestamp protection gates r=samiskin a=samiskin

Now that protected timestamps function in tenants in 22.1, the pts gates in
changefeeds can be removed.

Resolves #76936

Release justification: low risk change turning off now-unneeded gates
Release note (enterprise change): changefeeds can now protect targets
from gc on user tenants

78445: colexec: use Bytes.Copy instead of Get and Set in most places r=yuzefovich a=yuzefovich

**coldata: fix the usage of Bytes.Copy in CopyWithReorderedSource**

This was the intention but wasn't working because the call happens
inside a separate template.

Release note: None

**colexec: use Bytes.Copy instead of Get and Set in most places**

This commit audits our code for the usage of `Bytes.Get` followed by
`Bytes.Set` pattern and replaces those with `Bytes.Copy` (which is
faster for inlined values) in non-test code.

Release note: None
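To illustrate why Copy beats Get-then-Set for inlined values, here is a toy model of the idea (not the real coldata.Bytes layout or API, whose details live in pkg/col/coldata): short values sit directly in a fixed-size element header, so copying one is a plain struct assignment, while Get followed by Set materializes and re-encodes an intermediate []byte.

package main

import "fmt"

const inlineSize = 3 // toy threshold; the real implementation uses a larger one

// element is a toy per-value header: values up to inlineSize bytes are
// stored inline; longer values spill into a shared buffer.
type element struct {
	inline  [inlineSize]byte
	n       int  // value length
	spilled bool // true if the value lives in Bytes.buf
	off     int  // offset into buf when spilled
}

// Bytes is a simplified stand-in for coldata.Bytes.
type Bytes struct {
	elems []element
	buf   []byte
}

func NewBytes(n int) *Bytes { return &Bytes{elems: make([]element, n)} }

func (b *Bytes) Get(i int) []byte {
	e := &b.elems[i]
	if e.spilled {
		return b.buf[e.off : e.off+e.n]
	}
	return e.inline[:e.n]
}

func (b *Bytes) Set(i int, v []byte) {
	e := &b.elems[i]
	e.n = len(v)
	if len(v) <= inlineSize {
		e.spilled = false
		copy(e.inline[:], v)
		return
	}
	e.spilled = true
	e.off = len(b.buf)
	b.buf = append(b.buf, v...)
}

// Copy moves src's element srcIdx into b's element destIdx. For inlined
// values this is a single struct assignment, with no intermediate slice.
func (b *Bytes) Copy(src *Bytes, destIdx, srcIdx int) {
	e := src.elems[srcIdx]
	if !e.spilled {
		b.elems[destIdx] = e
		return
	}
	b.Set(destIdx, src.Get(srcIdx)) // spilled values still copy bytes
}

func main() {
	a, c := NewBytes(1), NewBytes(1)
	a.Set(0, []byte("hi"))
	c.Copy(a, 0, 0)
	fmt.Printf("%s\n", c.Get(0)) // hi
}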

78456: roachtest: wait for ranges to replicate before filling disk r=tbg a=nicktrav


78468: sql: return an error when partition spans has no healthy instances r=rharding6373 a=rharding6373

If there are no SQL instances available for planning,
partitionSpansTenant in the DistSQL planner will panic. This PR fixes
the issue so that it instead returns an error if there are no instances
available.

Fixes: #77590

Release justification: Fixes a bug in DistSQL that can cause a panic for
non-system tenants.

Release note: None
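The shape of that fix is a guard that degrades to an error instead of a panic; a hypothetical sketch (the type and function names are illustrative, not the actual DistSQL planner code):

package distsql

import "errors"

// instance is a hypothetical stand-in for the planner's per-instance record.
type instance struct {
	id   int
	addr string
}

// pickInstance returns an error, rather than panicking, when no healthy
// SQL instances are available for planning.
func pickInstance(instances []instance) (instance, error) {
	if len(instances) == 0 {
		return instance{}, errors.New("no healthy SQL instances available for planning")
	}
	return instances[0], nil
}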

Co-authored-by: Casper <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: Shiranka Miskin <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Nick Travers <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Mar 25, 2022
roachtest: wait for ranges to replicate before filling disk
blathers-crl bot pushed a commit that referenced this issue Mar 25, 2022
roachtest: wait for ranges to replicate before filling disk
nicktrav added a commit to nicktrav/cockroach that referenced this issue Mar 28, 2022
roachtest: wait for ranges to replicate before filling disk
nicktrav added a commit that referenced this issue Mar 28, 2022
roachtest: wait for ranges to replicate before filling disk
nicktrav added a commit that referenced this issue Mar 29, 2022
roachtest: wait for ranges to replicate before filling disk
@nicktrav

Closed by #78537.
