ccl/streamingccl/streamproducer: TestStreamPartition failed #102286

Closed
cockroach-teamcity opened this issue Apr 25, 2023 · 8 comments · Fixed by #102876
Assignees: msbutler
Labels: branch-release-23.1, C-test-failure, O-robot, skipped-test, T-disaster-recovery
Milestone: 23.1

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented Apr 25, 2023

ccl/streamingccl/streamproducer.TestStreamPartition failed with artifacts on release-23.1 @ ec0b5087392d49980a2141315aae62bb2563348f:

Fatal error:

panic: test timed out after 14m55s

Stack:

goroutine 58890 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2036 +0x88
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x38
Log preceding fatal error

=== RUN   TestStreamPartition
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/ec6a3f034611d027670bad7acfd7b3dd/logTestStreamPartition337479638
    test_log_scope.go:79: use -show-logs to present logs inline
=== RUN   TestStreamPartition/stream-table-cursor-error

Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-27382

@cockroach-teamcity added the branch-release-23.1, C-test-failure, and O-robot labels Apr 25, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Apr 25, 2023
@msbutler msbutler self-assigned this Apr 26, 2023
@cockroach-teamcity
Member Author

ccl/streamingccl/streamproducer.TestStreamPartition failed with artifacts on release-23.1 @ 0d8f02757d44e05435eb8de80a311fcfe625ef1b:

Fatal error:

panic: test timed out after 14m55s

Stack:

goroutine 60946 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2036 +0x88
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x38
Log preceding fatal error

=== RUN   TestStreamPartition
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/ec6a3f034611d027670bad7acfd7b3dd/logTestStreamPartition489391464
    test_log_scope.go:79: use -show-logs to present logs inline
=== RUN   TestStreamPartition/stream-table-cursor-error

Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@cockroach-teamcity
Member Author

ccl/streamingccl/streamproducer.TestStreamPartition failed with artifacts on release-23.1 @ 86a86b15c13afe0a580deed35ce7dafea7adad78:

Fatal error:

panic: test timed out after 14m55s

Stack:

goroutine 60650 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2036 +0x88
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x38
Log preceding fatal error

=== RUN   TestStreamPartition
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/ec6a3f034611d027670bad7acfd7b3dd/logTestStreamPartition2455180566
    test_log_scope.go:79: use -show-logs to present logs inline
=== RUN   TestStreamPartition/stream-table-cursor-error

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

msbutler added a commit to msbutler/cockroach that referenced this issue Apr 28, 2023
msbutler added a commit to msbutler/cockroach that referenced this issue Apr 28, 2023
craig bot pushed a commit that referenced this issue Apr 28, 2023
102593: c2c: skip and deflake a few unit tests r=benbardin a=msbutler

c2c: skip a few unit tests
Informs #102523
Informs #102286
Informs #86206

Release note: none

c2c: close pgconn with correct context in test infra
When a user opens a pgx connection, they pass it a context that must be used to
close the connection; otherwise pgx's contextWatcher will run indefinitely on a
separate goroutine. Previously, the `startReplication()` test helper
initialized its own context to open the pgx connection and did not pass this
context back to the caller; consequently, the caller would close the
connection with the wrong context, leading to a leaked goroutine.

This patch lets the caller pass a context into `startReplication()`,
preventing the caller from closing the connection with the wrong context.

Fixes #102523

Release note: None

Co-authored-by: Michael Butler <[email protected]>
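
For context, here is a minimal sketch of the context-ownership pattern that commit describes, using pgx/v4. The helper name startReplication comes from the commit message; its signature, body, and the connection string are illustrative assumptions, not the actual test infra code.

package main

import (
	"context"
	"fmt"

	"github.com/jackc/pgx/v4"
)

// startReplication is a stand-in for the test helper named above. The key
// shape of the fix: the caller supplies the context used to open the
// connection, so the same context can later be used to close it.
func startReplication(ctx context.Context, connString string) (*pgx.Conn, error) {
	return pgx.Connect(ctx, connString)
}

func run(ctx context.Context, connString string) error {
	conn, err := startReplication(ctx, connString)
	if err != nil {
		return err
	}
	// Closing with the same context that opened the connection lets pgx's
	// contextWatcher goroutine exit instead of leaking.
	defer conn.Close(ctx)
	// ... drive the replication stream here ...
	return nil
}

func main() {
	// Placeholder connection string for illustration only.
	if err := run(context.Background(), "postgresql://root@localhost:26257?sslmode=disable"); err != nil {
		fmt.Println(err)
	}
}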
blathers-crl bot pushed a commit that referenced this issue Apr 28, 2023
@msbutler
Collaborator

msbutler commented Apr 30, 2023

I can reproduce this via:

./dev test pkg/ccl/streamingccl/streamproducer -f TestStreamPartition/stream-table-cursor-error --stress --race --vmodule "rangefeed=1"

and can confirm that the rangefeed does surface this error via this log line several minutes before the test times out.

I suspect there's a real bug in event_stream.go which causes SELECT * FROM crdb_internal.stream_partition($1, $2) to hang. If this query returned early, the test infra would be able to exit and end the test.

It seems like we saw a similar bug a while back, where an internal error caused everything to hang: #85867

To get to the bottom of this, I'll need to read a bit more code.

@cockroach-teamcity
Member Author

ccl/streamingccl/streamproducer.TestStreamPartition failed with artifacts on release-23.1 @ fa8d237846052ae72cc44b3eaebe1f38c89834f8:

Fatal error:

panic: test timed out after 14m55s

Stack:

goroutine 57967 [running]:
testing.(*M).startAlarm.func1()
	GOROOT/src/testing/testing.go:2036 +0x88
created by time.goFunc
	GOROOT/src/time/sleep.go:176 +0x38
Log preceding fatal error

=== RUN   TestStreamPartition
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/ec6a3f034611d027670bad7acfd7b3dd/logTestStreamPartition2388411531
    test_log_scope.go:79: use -show-logs to present logs inline
=== RUN   TestStreamPartition/stream-table-cursor-error

Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

This test on roachdash | Improve this report!

@msbutler
Collaborator

msbutler commented May 1, 2023

On the client side, it seems that feed.Next() is hanging after we send the clear range request:

internal/poll.runtime_pollWait(0x7f1a4d571db0, 0x72)
	GOROOT/src/runtime/netpoll.go:305 +0x89
internal/poll.(*pollDesc).wait(0xc0055eab98, 0xc00097ca00?, 0x0)
	GOROOT/src/internal/poll/fd_poll_runtime.go:84 +0xbd
internal/poll.(*pollDesc).waitRead(...)
	GOROOT/src/internal/poll/fd_poll_runtime.go:89
internal/poll.(*FD).Read(0xc0055eab80, {0xc00097ca00, 0x700, 0x700})
	GOROOT/src/internal/poll/fd_unix.go:167 +0x415
net.(*netFD).Read(0xc0055eab80, {0xc00097ca00, 0x700, 0x700})
	GOROOT/src/net/fd_posix.go:55 +0x51
net.(*conn).Read(0xc001a12718, {0xc00097ca00, 0x700, 0x700})
	GOROOT/src/net/net.go:183 +0xb1
crypto/tls.(*atLeastReader).Read(0xc005716e88, {0xc00097ca00, 0x700, 0x700})
	GOROOT/src/crypto/tls/conn.go:787 +0x86
bytes.(*Buffer).ReadFrom(0xc0007bc5f8, {0xce213e0, 0xc005716e88})
	GOROOT/src/bytes/buffer.go:202 +0x113
crypto/tls.(*Conn).readFromUntil(0xc0007bc380, {0xce2ff60?, 0xc001a12718}, 0x5)
	GOROOT/src/crypto/tls/conn.go:809 +0x1f3
crypto/tls.(*Conn).readRecordOrCCS(0xc0007bc380, 0x0)
	GOROOT/src/crypto/tls/conn.go:616 +0x417
crypto/tls.(*Conn).readRecord(...)
	GOROOT/src/crypto/tls/conn.go:582
crypto/tls.(*Conn).Read(0xc0007bc380, {0xc0052de314, 0x1cec, 0xc005662618?})
	GOROOT/src/crypto/tls/conn.go:1287 +0x29d
io.ReadAtLeast({0x7f1a49843708, 0xc0007bc380}, {0xc0052de314, 0x1cec, 0x1cec}, 0x5)
	GOROOT/src/io/io.go:332 +0xde
github.com/jackc/chunkreader/v2.(*ChunkReader).appendAtLeast(...)
	github.com/jackc/chunkreader/v2/external/com_github_jackc_chunkreader_v2/chunkreader.go:88
github.com/jackc/chunkreader/v2.(*ChunkReader).Next(0xc005a7e640, 0x5)
	github.com/jackc/chunkreader/v2/external/com_github_jackc_chunkreader_v2/chunkreader.go:78 +0x691
github.com/jackc/pgproto3/v2.(*Frontend).Receive(0xc0015dd000)
	github.com/jackc/pgproto3/v2/external/com_github_jackc_pgproto3_v2/frontend.go:74 +0x93
github.com/jackc/pgconn.(*PgConn).peekMessage(0xc004b31600)
	github.com/jackc/pgconn/external/com_github_jackc_pgconn/pgconn.go:531 +0x1ed
github.com/jackc/pgconn.(*PgConn).receiveMessage(0xc004b31600)
	github.com/jackc/pgconn/external/com_github_jackc_pgconn/pgconn.go:551 +0x48
github.com/jackc/pgconn.(*ResultReader).receiveMessage(0xc004b31698)
	github.com/jackc/pgconn/external/com_github_jackc_pgconn/pgconn.go:1581 +0xa5
github.com/jackc/pgconn.(*ResultReader).NextRow(0xc004b31698)
	github.com/jackc/pgconn/external/com_github_jackc_pgconn/pgconn.go:1494 +0x4e
github.com/jackc/pgx/v4.(*connRows).Next(0xc0030fd1d0)
	github.com/jackc/pgx/v4/external/com_github_jackc_pgx_v4/rows.go:186 +0x76
github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamproducer_test.(*pgConnReplicationFeedSource).Next(0xc0084b5310)
	github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/streamproducer_test/pkg/ccl/streamingccl/streamproducer/replication_stream_test.go:141 +0x115
github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/replicationtestutils.(*ReplicationFeed).consumeUntil.func1({0xce6e530?, 0xc0000c6098?})
	github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/replicationtestutils/replication_helpers.go:143 +0xe7
github.com/cockroachdb/cockroach/pkg/util/contextutil.RunWithTimeout({0xce6e530, 0xc0000c6098}, {0x901afdf, 0x7}, 0x1bf08eb000, 0xc0040f7b70)
	github.com/cockroachdb/cockroach/pkg/util/contextutil/context.go:91 +0x143
github.com/cockroachdb/cockroach/pkg/ccl/streamingccl/replicationtestutils.(*ReplicationFeed).consumeUntil(0xc00538f4d0, {0xce6e530, 0xc0000c6098}, 0x93d3e30, 0x93d3e28)

On the server side, I have confirmed that the rangefeed surfaces the correct error and returns, by observing this log line in the failed test's server-side logs.

Now, I know the eventStream (i.e. the producer SQL processor) never surfaces this error, because Next() never gets called for some mysterious reason. I can confirm this because these added log lines never surface in the failed test logs. We also never Close() the eventStream, as those added log lines never surface either.

On the client side, Next() should never hang. I'll need to figure out why this occurs.

This seems related to a bug fixed by #85867. But there, eventStream.Close() was called. Here, it is not.

@msbutler
Collaborator

msbutler commented May 3, 2023

Running ./dev test pkg/ccl/streamingccl/streamproducer -f TestStreamPartition/stream-table-cursor-error --stress --race --vmodule "rangefeed=1" --deadlock (i.e. with the deadlock detector) causes cluster startup to fail, probably because all these test infra watchers are too heavy-handed.

@cockroach-teamcity
Member Author

ccl/streamingccl/streamproducer.TestStreamPartition failed with artifacts on release-23.1 @ 4226a83871bbce776bc9389fca5cf084b4bb7632:

=== RUN   TestStreamPartition
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/9ce24496530642208197dda818d65c9e/logTestStreamPartition2047022242
    test_log_scope.go:79: use -show-logs to present logs inline
=== CONT  TestStreamPartition
    replication_stream_test.go:478: -- test log scope end --
    replication_stream_test.go:478: Leaked goroutine: goroutine 4772422 [IO wait]:
        internal/poll.runtime_pollWait(0x7f25d2c85e28, 0x72)
        	GOROOT/src/runtime/netpoll.go:305 +0x89
        internal/poll.(*pollDesc).wait(0xc011148d80?, 0xc0076a5200?, 0x0)
        	GOROOT/src/internal/poll/fd_poll_runtime.go:84 +0x32
        internal/poll.(*pollDesc).waitRead(...)
        	GOROOT/src/internal/poll/fd_poll_runtime.go:89
        internal/poll.(*FD).Read(0xc011148d80, {0xc0076a5200, 0x10, 0x10})
        	GOROOT/src/internal/poll/fd_unix.go:167 +0x25a
        net.(*netFD).Read(0xc011148d80, {0xc0076a5200?, 0xc01244de48?, 0x4a7a92?})
        	GOROOT/src/net/fd_posix.go:55 +0x29
        net.(*conn).Read(0xc012adbc18, {0xc0076a5200?, 0xc0159f8d20?, 0xc0159f8d20?})
        	GOROOT/src/net/net.go:183 +0x45
        github.com/jackc/pgconn.(*PgConn).CancelRequest(0xc00a862580, {0x70c0c50?, 0xc0159f8d20})
        	github.com/jackc/pgconn/external/com_github_jackc_pgconn/pgconn.go:934 +0x3db
        github.com/jackc/pgconn.(*PgConn).asyncClose.func1()
        	github.com/jackc/pgconn/external/com_github_jackc_pgconn/pgconn.go:667 +0x10f
        created by github.com/jackc/pgconn.(*PgConn).asyncClose
        	github.com/jackc/pgconn/external/com_github_jackc_pgconn/pgconn.go:658 +0x65
        Leaked goroutine: goroutine 4772457 [select]:
        github.com/jackc/pgconn/internal/ctxwatch.(*ContextWatcher).Watch.func1()
        	github.com/jackc/pgconn/internal/ctxwatch/external/com_github_jackc_pgconn/internal/ctxwatch/context_watcher.go:47 +0x6c
        created by github.com/jackc/pgconn/internal/ctxwatch.(*ContextWatcher).Watch
        	github.com/jackc/pgconn/internal/ctxwatch/external/com_github_jackc_pgconn/internal/ctxwatch/context_watcher.go:46 +0x12c
--- FAIL: TestStreamPartition (22.33s)

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

Same failure on other branches

This test on roachdash | Improve this report!

craig bot pushed a commit that referenced this issue May 10, 2023
102876: c2c: fix rangefeed error propagation race r=stevendanna a=msbutler

In the producer dist sql processor, rangefeed errors only propagate to the user if the error can be sent on the channel without blocking. Because the channel was previously unbuffered, the dist sql processor would inadvertently swallow the error if the receiver was not actively waiting on the channel. This would then cause the sql processor to hang, as the underlying rangefeed would close after the ignored error message.

This patch buffers the errCh, guaranteeing that the first rangefeed error will be processed by the sql processor. If the rangefeed surfaces several errors while the buffered channel is full, these errors will be swallowed, which is fine, as the first error will always shut down the sql processor.

Fixes #102286

Release note: None

Co-authored-by: Michael Butler <[email protected]>
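
To make the race concrete, here is a minimal, self-contained sketch of the channel behavior the fix describes. The names and shapes are illustrative, not the actual event_stream.go code.

package main

import (
	"errors"
	"fmt"
)

// sendErr mirrors the non-blocking send described above: if the channel
// cannot accept the error immediately, the error is silently dropped.
func sendErr(errCh chan error, err error) {
	select {
	case errCh <- err:
	default:
	}
}

func main() {
	rangefeedErr := errors.New("rangefeed failed")

	// Buggy shape: unbuffered channel. The non-blocking send succeeds only
	// if the receiver is already blocked on the channel, so the error is
	// lost whenever the receiver is busy elsewhere.
	unbuffered := make(chan error)
	sendErr(unbuffered, rangefeedErr)
	select {
	case err := <-unbuffered:
		fmt.Println("received:", err)
	default:
		fmt.Println("error swallowed; the processor would hang")
	}

	// Fixed shape: buffer of one. The first error is retained even if the
	// receiver only checks the channel later; subsequent errors may still
	// be dropped, which is fine because the first one shuts the processor down.
	buffered := make(chan error, 1)
	sendErr(buffered, rangefeedErr)
	sendErr(buffered, errors.New("second error, dropped"))
	fmt.Println("received:", <-buffered)
}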
@craig craig bot closed this as completed in 70a9246 May 10, 2023
fqazi pushed a commit to fqazi/cockroach that referenced this issue May 11, 2023
msbutler added a commit to msbutler/cockroach that referenced this issue May 11, 2023