Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ccl/streamingccl/streamingest: TestTenantStreamingUnavailableStreamAddress failed #112917

Closed
cockroach-teamcity opened this issue Oct 23, 2023 · 7 comments · Fixed by #113038
Closed
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-remote-execution An issue found only using Bazel Remote Execution function. O-robot Originated from a bot. T-disaster-recovery
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Oct 23, 2023

ccl/streamingccl/streamingest.TestTenantStreamingUnavailableStreamAddress failed on master @ bf9a22dc85683966ee65e84d3eeadf2b44185127:

=== RUN   TestTenantStreamingUnavailableStreamAddress
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestTenantStreamingUnavailableStreamAddress3498977487
    test_log_scope.go:81: use -show-logs to present logs inline

pkg/ccl/streamingccl/replicationtestutils/testutils.go:149: ((*TenantStreamingClusters).init)
	NOTICE: .WaitForTenantCapabilities() called via implicit interface TenantControlInterface;
HINT: consider using .TenantController().WaitForTenantCapabilities() instead.
    testutils.go:488: condition failed to evaluate within 45s: from testutils.go:496: stream ingestion has not recorded any progress yet, waiting to advance pos 1698091303.051538873,0
    panic.go:522: -- test log scope end --
test logs left over in: outputs.zip/logTestTenantStreamingUnavailableStreamAddress3498977487
--- FAIL: TestTenantStreamingUnavailableStreamAddress (315.21s)

Parameters: attempt=1 , run=21 , shard=9

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-32679

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-remote-execution An issue found only using Bazel Remote Execution function. O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-disaster-recovery labels Oct 23, 2023
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Oct 23, 2023
@rickystewart
Copy link
Collaborator

FYI: This test failure was before #112827, so this may be just a late-coming flake.

@cockroach-teamcity
Copy link
Member Author

ccl/streamingccl/streamingest.TestTenantStreamingUnavailableStreamAddress failed on master @ f3c7285d89afd63b2f7cae762638b86f237c1d72:

=== RUN   TestTenantStreamingUnavailableStreamAddress
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestTenantStreamingUnavailableStreamAddress2260474056
    test_log_scope.go:81: use -show-logs to present logs inline

pkg/ccl/streamingccl/replicationtestutils/testutils.go:149: ((*TenantStreamingClusters).init)
	NOTICE: .WaitForTenantCapabilities() called via implicit interface TenantControlInterface;
HINT: consider using .TenantController().WaitForTenantCapabilities() instead.
    testutils.go:488: condition failed to evaluate within 45s: from testutils.go:496: stream ingestion has not recorded any progress yet, waiting to advance pos 1698134525.556270989,0
    panic.go:522: -- test log scope end --
test logs left over in: outputs.zip/logTestTenantStreamingUnavailableStreamAddress2260474056
--- FAIL: TestTenantStreamingUnavailableStreamAddress (94.85s)

Parameters: attempt=1 , run=3 , shard=9

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@stevendanna stevendanna self-assigned this Oct 24, 2023
@msbutler
Copy link
Collaborator

@rickystewart do we expect a test, like this one, with a skip.UnderStress() incantation to have several runs in the remote execution stress job?

skip.UnderStress(t, "multi node test times out under stress")

@msbutler msbutler assigned msbutler and unassigned stevendanna Oct 24, 2023
@rickystewart
Copy link
Collaborator

rickystewart commented Oct 24, 2023

do we expect a test, like this one, with a skip.UnderStress() incantation to have several runs in the remote execution stress job?

Yes, it will run multiple times. skip.UnderStress solves problems when one machine is "stressing" the same test repeatedly and multiple times concurrently in a way that starves resources or prevents the test from completing. In the remote execution cluster, it's impossible for one test to do this to such an extent that it brings down the cluster, tests are fairly well-isolated.

Looking at the code:

skip.UnderStress(t, "multi node test times out under stress") 

You can see the failure under EngFlow is not a timeout, i.e., the problem you're claiming to solve with skip.UnderStress() is not actually a problem. The test is just flaky. Note that skip.UnderStress() may have just been hiding the flakiness of this test rather than solving any actual problems.

@msbutler msbutler removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Oct 24, 2023
@rickystewart
Copy link
Collaborator

If you want this test to be set up with a new skip function that will skip the test in this scenario, we can set that up I suppose. You can imagine that resolving the flakiness is far and away my preference, rather than just "hiding" the failures in this scenario.

BTW: This issue is tagged O-remote-execution and the issue is not under the purview of the CY. That will change in all likelihood when the EngFlow nightly is productionized, in the next week or two.

@msbutler
Copy link
Collaborator

msbutler commented Oct 24, 2023

Thanks for the explanation! This test has been plagued by flakiness, most likely because cpu contention leads us to timeout on a condition that we expect to achieve in 45 seconds. I'll huddle with my team on next steps here.

I think there's a general sentiment that multinode tests just flake -- and this one is particularly bad because it has two virtual clusters (8 sql servers, and 4 kv servers).

@cockroach-teamcity
Copy link
Member Author

ccl/streamingccl/streamingest.TestTenantStreamingUnavailableStreamAddress failed on master @ 8315a4bc997fb8b8d679079e14d6d7ca94d53bc6:

=== RUN   TestTenantStreamingUnavailableStreamAddress
    test_log_scope.go:170: test logs captured to: outputs.zip/logTestTenantStreamingUnavailableStreamAddress2121284687
    test_log_scope.go:81: use -show-logs to present logs inline

pkg/ccl/streamingccl/replicationtestutils/testutils.go:149: ((*TenantStreamingClusters).init)
	NOTICE: .WaitForTenantCapabilities() called via implicit interface TenantControlInterface;
HINT: consider using .TenantController().WaitForTenantCapabilities() instead.
    testutils.go:488: condition failed to evaluate within 45s: from testutils.go:496: stream ingestion has not recorded any progress yet, waiting to advance pos 1698218368.621699832,0
    panic.go:523: -- test log scope end --
test logs left over in: outputs.zip/logTestTenantStreamingUnavailableStreamAddress2121284687
--- FAIL: TestTenantStreamingUnavailableStreamAddress (118.42s)

Parameters: attempt=1 , run=13 , shard=9

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

craig bot pushed a commit that referenced this issue Oct 25, 2023
112992: kvclient: draining not started SQL r=JeffSwenson a=andrewbaptist

Previously a drain would assume the SQL instance had been started on a node prior to draining. This would result in a failure if the node attempting to drain had never started SQL.

Epic: none

Release note: None

112995: backupccl: deflake TestCleanupIntentsDuringBackupPerformanceRegression r=msbutler a=miraradeva

The test was counting batches of pushes for the entire duration of the test, not just the back itself. This patch resets the counters to do a more targeted assertion.

Fixes: #112812

Release note: None

113038: c2c: remove TestTenantStreamingUnavailableAddress r=stevendanna a=msbutler

All this test does is flake and essentially tests that dsp.PartitionsSpans() excludes the shutdown node. We already have coverage for this in our roachtest -- if dsp.PartitionsSpans() did include a shutdown node, our shutdown tests would never complete.

Fixes #112917

Release note: none

Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Mira Radeva <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
@craig craig bot closed this as completed in a8333a5 Oct 25, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-remote-execution An issue found only using Bazel Remote Execution function. O-robot Originated from a bot. T-disaster-recovery
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

4 participants