ccl/streamingccl/streamproducer: TestStreamPartition failed #102286

Closed
cockroach-teamcity opened this issue Apr 25, 2023 · 8 comments · Fixed by #102876
Assignees: msbutler
Labels: branch-release-23.1, C-test-failure, O-robot, skipped-test, T-disaster-recovery
Milestone: 23.1

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Apr 25, 2023

ccl/streamingccl/streamproducer.TestStreamPartition failed with artifacts on release-23.1 @ ec0b5087392d49980a2141315aae62bb2563348f:

Fatal error:

panic: test timed out after 14m55s

Stack:

goroutine 58890 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2036 +0x88
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x38
Log preceding fatal error

=== RUN   TestStreamPartition
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/ec6a3f034611d027670bad7acfd7b3dd/logTestStreamPartition337479638
    test_log_scope.go:79: use -show-logs to present logs inline
=== RUN   TestStreamPartition/stream-table-cursor-error

Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-27382

@cockroach-teamcity added the branch-release-23.1, C-test-failure, and O-robot labels Apr 25, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Apr 25, 2023
@msbutler msbutler self-assigned this Apr 26, 2023
@cockroach-teamcity
Member Author

ccl/streamingccl/streamproducer.TestStreamPartition failed with artifacts on release-23.1 @ 0d8f02757d44e05435eb8de80a311fcfe625ef1b:

Fatal error:

panic: test timed out after 14m55s

Stack:

goroutine 60946 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2036 +0x88
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x38
Log preceding fatal error

=== RUN   TestStreamPartition
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/ec6a3f034611d027670bad7acfd7b3dd/logTestStreamPartition489391464
    test_log_scope.go:79: use -show-logs to present logs inline
=== RUN   TestStreamPartition/stream-table-cursor-error

Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

ccl/streamingccl/streamproducer.TestStreamPartition failed with artifacts on release-23.1 @ 86a86b15c13afe0a580deed35ce7dafea7adad78:

Fatal error:

panic: test timed out after 14m55s

Stack:

goroutine 60650 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2036 +0x88
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x38
Log preceding fatal error

=== RUN   TestStreamPartition
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/ec6a3f034611d027670bad7acfd7b3dd/logTestStreamPartition2455180566
    test_log_scope.go:79: use -show-logs to present logs inline
=== RUN   TestStreamPartition/stream-table-cursor-error

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

msbutler added a commit to msbutler/cockroach that referenced this issue Apr 28, 2023
msbutler added a commit to msbutler/cockroach that referenced this issue Apr 28, 2023
craig bot pushed a commit that referenced this issue Apr 28, 2023
102593: c2c: skip and deflake a few unit tests r=benbardin a=msbutler

c2c: skip a few unit tests
Informs #102523
Informs #102286
Informs #86206

Release note: none

c2c: close pgconn with correct context in test infra
When a user opens a pgx connection, they pass it a context that must be used to
close the connection; otherwise pgx's contextWatcher will run indefinitely on a
separate goroutine. Previously, the `startReplication()` test helper
initialized its own context to open the pgx connection and did not pass this
context back to the caller; consequently, the caller would close the
connection with the wrong context, leading to a leaked goroutine.

This patch lets the caller pass a context into `startReplication()`,
preventing the caller from closing the connection with the wrong context.

Fixes #102523

Release note: None

Co-authored-by: Michael Butler <[email protected]>
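
For context, here is a minimal sketch of the context-ownership pattern that commit describes, using pgx/v4. The helper name startReplication comes from the commit message; its signature, body, and the connection string are illustrative assumptions, not the actual test infra code.

package main

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v4"
)

// startReplication is a stand-in for the test helper named above. The key
// shape of the fix: the caller supplies the context used to open the
// connection, so the same context can later be used to close it.
func startReplication(ctx context.Context, connString string) (*pgx.Conn, error) {
	return pgx.Connect(ctx, connString)
}

func run(ctx context.Context, connString string) error {
	conn, err := startReplication(ctx, connString)
	if err != nil {
		return err
	}
	// Closing with the same context that opened the connection lets pgx's
	// contextWatcher goroutine exit instead of leaking.
	defer conn.Close(ctx)
	// ... drive the replication stream here ...
	return nil
}

func main() {
	// Placeholder connection string for illustration only.
	if err := run(context.Background(), "postgresql://root@localhost:26257?sslmode=disable"); err != nil {
		fmt.Println(err)
	}
}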
blathers-crl bot pushed a commit that referenced this issue Apr 28, 2023
@msbutler
Collaborator

msbutler commented Apr 30, 2023

I can reproduce this via:

./dev test pkg/ccl/streamingccl/streamproducer -f TestStreamPartition/stream-table-cursor-error --stress --race --vmodule "rangefeed=1"

and can confirm that the rangefeed does surface this error via this log line several minutes before the test times out.

I suspect there's a real bug in event_stream.go which causes SELECT * FROM crdb_internal.stream_partition($1, $2) to hang. If this query returned early, the test infra would be able to exit and end the test.

It seems like we saw a similar bug a while back, where an internal error caused everything to hang: #85867

To get to the bottom of this, I'll need to read a bit more code.

@cockroach-teamcity
Member Author

ccl/streamingccl/streamproducer.TestStreamPartition failed with artifacts on release-23.1 @ fa8d237846052ae72cc44b3eaebe1f38c89834f8:

Fatal error:

panic: test timed out after 14m55s

Stack:

goroutine 57967 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2036 +0x88
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x38
Log preceding fatal error

=== RUN   TestStreamPartition
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/ec6a3f034611d027670bad7acfd7b3dd/logTestStreamPartition2388411531
    test_log_scope.go:79: use -show-logs to present logs inline
=== RUN   TestStreamPartition/stream-table-cursor-error

Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@msbutler
Collaborator

msbutler commented May 1, 2023

On the client side, it seems that feed.Next() is hanging after we send the clear range request:

internal/poll.runtime_pollWait(0x7f1a4d571db0, 0x72)
	GOROOT/src/runtime/netpoll.go:305 +0x89
internal/poll.(*pollDesc).wait(0xc0055eab98, 0xc00097ca00?, 0x0)
	GOROOT/src/internal/poll/fd_poll_runtime.go:84 +0xbd
internal/poll.(*pollDesc).waitRead(...)
	GOROOT/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0055eab80, {0xc00097ca00, 0x700, 0x700})
	GOROOT/src/internal/poll/fd_unix.go:167 +0x415
net.(*netFD).Read(0xc0055eab80, {0xc00097ca00, 0x700, 0x700})
	GOROOT/src/net/fd_posix.go:55 +0x51
net.(*conn).Read(0xc001a12718, {0xc00097ca00, 0x700, 0x700})
	GOROOT/src/net/net.go:183 +0xb1
crypto/tls.(*atLeastReader).Read(0xc005716e88, {0xc00097ca00, 0x700, 0x700})
	GOROOT/src/crypto/tls/conn.go:787 +0x86
bytes.(*Buffer).ReadFrom(0xc0007bc5f8, {0xce213e0, 0xc005716e88})
	GOROOT/src/bytes/buffer.go:202 +0x113
crypto/tls.(*Conn).readFromUntil(0xc0007bc380, {0xce2ff60?, 0xc001a12718}, 0x5)
	GOROOT/src/crypto/tls/conn.go:809 +0x1f3
crypto/tls.(*Conn).readRecordOrCCS(0xc0007bc380, 0x0)
	GOROOT/src/crypto/tls/conn.go:616 +0x417
crypto/tls.(*Conn).readRecord(...)
	GOROOT/src/crypto/tls/conn.go:582
crypto/tls.(*Conn).Read(0xc0007bc380, {0xc0052de314, 0x1cec, 0xc005662618?})
	GOROOT/src/crypto/tls/conn.go:1287 +0x29d
io.ReadAtLeast({0x7f1a49843708, 0xc0007bc380}, {0xc0052de314, 0x1cec, 0x1cec}, 0x5)
	GOROOT/src/io/io.go:332 +0xde
github.com/jackc/chunkreader/v2.(*ChunkReader).appendAtLeast(...)
	github.com/jackc/chunkreader/v2/external/com_github_jackc_chunkreader_v2/chunkreader.go:88
github.com/jackc/chunkreader/v2.(*ChunkReader).Next(0xc005a7e640, 0x5)
	github.com/jackc/chunkreader/v2/external/com_github_jackc_chunkreader_v2/chunkreader.go:78 +0x691
github.com/jackc/pgproto3/v2.(*Frontend).Receive(0xc0015dd000)
	github.com/jackc/pgproto3/v2/external/com_github_jackc_pgproto3_v2/frontend.go:74 +0x93
github.com/jackc/pgconn.(*PgConn).peekMessage(0xc004b31600)
	github.com/jackc/pgconn/external/com_github_jackc_pgconn/pgconn.go:531 +0x1ed
github.com/jackc/pgconn.(*PgConn).receiveMessage(0xc004b31600)
	github.com/jackc/pgconn/external/com_github_jackc_pgconn/pgconn.go:551 +0x48
github.com/jackc/pgconn.(*ResultReader).receiveMessage(0xc004b31698)
	github.com/jackc/pgconn/external/com_github_jackc_pgconn/pgconn.go:1581 +0xa5
github.com/jackc/pgconn.(*ResultReader).NextRow(0xc004b31698)
	github.com/jackc/pgconn/external/com_github_jackc_pgconn/pgconn.go:1494 +0x4e
github.com/jackc/pgx/v4.(*connRows).Next(0xc0030fd1d0)
	github.com/jackc/pgx/v4/external/com_github_jackc_pgx_v4/rows.go:186 +0x76
github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamproducer_test.(*pgConnReplicationFeedSource).Next(0xc0084b5310)
	github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamproducer_test/pkg/ccl/streamingccl/streamproducer/replication_stream_test.go:141 +0x115
github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/replicationtestutils.(*ReplicationFeed).consumeUntil.func1({0xce6e530?, 0xc0000c6098?})
	github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/replicationtestutils/replication_helpers.go:143 +0xe7
github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout({0xce6e530, 0xc0000c6098}, {0x901afdf, 0x7}, 0x1bf08eb000, 0xc0040f7b70)
	github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:91 +0x143
github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/replicationtestutils.(*ReplicationFeed).consumeUntil(0xc00538f4d0, {0xce6e530, 0xc0000c6098}, 0x93d3e30, 0x93d3e28)

On the server side, I have confirmed that the rangefeed surfaces the correct error and returns, by observing this log line in the failed test's server-side logs.

Now, I know the eventStream (i.e. the producer SQL processor) never surfaces this error, because Next() never gets called for some mysterious reason. I can confirm this because these added log lines never surface in the failed test logs. We also never Close() the eventStream, as those added log lines never surface either.

On the client side, Next() should never hang. I'll need to figure out why this occurs.

This seems related to a bug fixed by #85867. But there, eventStream.Close() was called. Here, it is not.

@msbutler
Collaborator

msbutler commented May 3, 2023

Running ./dev test pkg/ccl/streamingccl/streamproducer -f TestStreamPartition/stream-table-cursor-error --stress --race --vmodule "rangefeed=1" --deadlock (i.e. with the deadlock detector) causes cluster startup to fail, probably because all these test infra watchers are too heavy-handed.

@cockroach-teamcity
Member Author

ccl/streamingccl/streamproducer.TestStreamPartition failed with artifacts on release-23.1 @ 4226a83871bbce776bc9389fca5cf084b4bb7632:

=== RUN   TestStreamPartition
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/9ce24496530642208197dda818d65c9e/logTestStreamPartition2047022242
    test_log_scope.go:79: use -show-logs to present logs inline
=== CONT  TestStreamPartition
    replication_stream_test.go:478: -- test log scope end --
    replication_stream_test.go:478: Leaked goroutine: goroutine 4772422 [IO wait]:
        internal/poll.runtime_pollWait(0x7f25d2c85e28, 0x72)
        	GOROOT/src/runtime/netpoll.go:305 +0x89
        internal/poll.(*pollDesc).wait(0xc011148d80?, 0xc0076a5200?, 0x0)
        	GOROOT/src/internal/poll/fd_poll_runtime.go:84 +0x32
        internal/poll.(*pollDesc).waitRead(...)
        	GOROOT/src/internal/poll/fd_poll_runtime.go:89
        internal/poll.(*FD).Read(0xc011148d80, {0xc0076a5200, 0x10, 0x10})
        	GOROOT/src/internal/poll/fd_unix.go:167 +0x25a
        net.(*netFD).Read(0xc011148d80, {0xc0076a5200?, 0xc01244de48?, 0x4a7a92?})
        	GOROOT/src/net/fd_posix.go:55 +0x29
        net.(*conn).Read(0xc012adbc18, {0xc0076a5200?, 0xc0159f8d20?, 0xc0159f8d20?})
        	GOROOT/src/net/net.go:183 +0x45
        github.com/jackc/pgconn.(*PgConn).CancelRequest(0xc00a862580, {0x70c0c50?, 0xc0159f8d20})
        	github.com/jackc/pgconn/external/com_github_jackc_pgconn/pgconn.go:934 +0x3db
        github.com/jackc/pgconn.(*PgConn).asyncClose.func1()
        	github.com/jackc/pgconn/external/com_github_jackc_pgconn/pgconn.go:667 +0x10f
        created by github.com/jackc/pgconn.(*PgConn).asyncClose
        	github.com/jackc/pgconn/external/com_github_jackc_pgconn/pgconn.go:658 +0x65
        Leaked goroutine: goroutine 4772457 [select]:
        github.com/jackc/pgconn/internal/ctxwatch.(*ContextWatcher).Watch.func1()
        	github.com/jackc/pgconn/internal/ctxwatch/external/com_github_jackc_pgconn/internal/ctxwatch/context_watcher.go:47 +0x6c
        created by github.com/jackc/pgconn/internal/ctxwatch.(*ContextWatcher).Watch
        	github.com/jackc/pgconn/internal/ctxwatch/external/com_github_jackc_pgconn/internal/ctxwatch/context_watcher.go:46 +0x12c
--- FAIL: TestStreamPartition (22.33s)

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

This test on roachdash | Improve this report!

craig bot pushed a commit that referenced this issue May 10, 2023
102876: c2c: fix rangefeed error propagation race r=stevendanna a=msbutler

In the producer dist sql processor, rangefeed errors only propagate to the user if the error can be sent on the channel without blocking. Because the channel was previously unbuffered, the dist sql processor would inadvertently swallow the error if the receiver was not actively waiting on the channel. This would then cause the sql processor to hang, as the underlying rangefeed would close after the ignored error message.

This patch buffers the errCh, guaranteeing that the first rangefeed error will be processed by the sql processor. If the rangefeed surfaces several errors while the buffered channel is full, these errors will be swallowed, which is fine, as the first error will always shut down the sql processor.

Fixes #102286

Release note: None

Co-authored-by: Michael Butler <[email protected]>
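
To make the race concrete, here is a minimal, self-contained sketch of the channel behavior the fix describes. The names and shapes are illustrative, not the actual event_stream.go code.

package main

import (
	"errors"
	"fmt"
)

// sendErr mirrors the non-blocking send described above: if the channel
// cannot accept the error immediately, the error is silently dropped.
func sendErr(errCh chan error, err error) {
	select {
	case errCh <- err:
	default:
	}
}

func main() {
	rangefeedErr := errors.New("rangefeed failed")

	// Buggy shape: unbuffered channel. The non-blocking send succeeds only
	// if the receiver is already blocked on the channel, so the error is
	// lost whenever the receiver is busy elsewhere.
	unbuffered := make(chan error)
	sendErr(unbuffered, rangefeedErr)
	select {
	case err := <-unbuffered:
		fmt.Println("received:", err)
	default:
		fmt.Println("error swallowed; the processor would hang")
	}

	// Fixed shape: buffer of one. The first error is retained even if the
	// receiver only checks the channel later; subsequent errors may still
	// be dropped, which is fine because the first one shuts the processor down.
	buffered := make(chan error, 1)
	sendErr(buffered, rangefeedErr)
	sendErr(buffered, errors.New("second error, dropped"))
	fmt.Println("received:", <-buffered)
}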
@craig craig bot closed this as completed in 70a9246 May 10, 2023
fqazi pushed a commit to fqazi/cockroach that referenced this issue May 11, 2023
msbutler added a commit to msbutler/cockroach that referenced this issue May 11, 2023