ccl/streamingccl/streamingest: TestStreamingRegionalConstraint failed #111541

Closed
cockroach-teamcity opened this issue Sep 30, 2023 · 15 comments · Fixed by #112543
Labels: branch-master, C-test-failure, O-robot, skipped-test, T-disaster-recovery

cockroach-teamcity commented Sep 30, 2023

ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ fad649d89721ddb3e9f3dcab1ad5d14f74c91bc9:

=== RUN   TestStreamingRegionalConstraint
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint954244461
    test_log_scope.go:81: use -show-logs to present logs inline

pkg/ccl/streamingccl/replicationtestutils/testutils.go:333: (startC2CTestCluster)
	NOTICE: .Clock() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().Clock() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization

pkg/ccl/streamingccl/replicationtestutils/testutils.go:128: ((*TenantStreamingClusters).setupSrcTenant)
	NOTICE: .StartSharedProcessTenant() called via implicit interface TenantControlInterface;
HINT: consider using .TenantController().StartSharedProcessTenant() instead.
    replication_stream_e2e_test.go:1217: condition failed to evaluate within 45s: from replication_stream_e2e_test.go:1203: running iterate callback: found table data located on another node 2
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint954244461
--- FAIL: TestStreamingRegionalConstraint (108.11s)

Parameters: TAGS=bazel,gss,deadlock , stress=true

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-31957

@cockroach-teamcity cockroach-teamcity added the branch-master, C-test-failure, O-robot, release-blocker, and T-disaster-recovery labels Sep 30, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Sep 30, 2023
@cockroach-teamcity (Member Author)

ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ f6f355b50e0dbf28633e25ddd05f2775141af31e:

=== RUN   TestStreamingRegionalConstraint
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/74f3eeb4519148f71482aeda8a540383/logTestStreamingRegionalConstraint2541785297
    test_log_scope.go:81: use -show-logs to present logs inline

pkg/ccl/streamingccl/replicationtestutils/testutils.go:333: (startC2CTestCluster)
	NOTICE: .Clock() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().Clock() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization

pkg/ccl/streamingccl/replicationtestutils/testutils.go:128: ((*TenantStreamingClusters).setupSrcTenant)
	NOTICE: .StartSharedProcessTenant() called via implicit interface TenantControlInterface;
HINT: consider using .TenantController().StartSharedProcessTenant() instead.

pkg/ccl/streamingccl/streamingest/replication_stream_e2e_test.go:1221: (TestStreamingRegionalConstraint)
	NOTICE: .DB() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().DB() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization
    replication_stream_e2e_test.go:1220: condition failed to evaluate within 45s: from replication_stream_e2e_test.go:1203: running iterate callback: found table data located on another node 3
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/74f3eeb4519148f71482aeda8a540383/logTestStreamingRegionalConstraint2541785297
--- FAIL: TestStreamingRegionalConstraint (64.20s)
Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity (Member Author)

ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ f6f355b50e0dbf28633e25ddd05f2775141af31e:

=== RUN   TestStreamingRegionalConstraint
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint2950345968
    test_log_scope.go:81: use -show-logs to present logs inline

pkg/ccl/streamingccl/replicationtestutils/testutils.go:333: (startC2CTestCluster)
	NOTICE: .Clock() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().Clock() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization

pkg/ccl/streamingccl/replicationtestutils/testutils.go:128: ((*TenantStreamingClusters).setupSrcTenant)
	NOTICE: .StartSharedProcessTenant() called via implicit interface TenantControlInterface;
HINT: consider using .TenantController().StartSharedProcessTenant() instead.

pkg/ccl/streamingccl/streamingest/replication_stream_e2e_test.go:1221: (TestStreamingRegionalConstraint)
	NOTICE: .DB() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().DB() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization
    replication_stream_e2e_test.go:1220: condition failed to evaluate within 45s: from replication_stream_e2e_test.go:1203: running iterate callback: found table data located on another node 3
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint2950345968
--- FAIL: TestStreamingRegionalConstraint (123.16s)

Parameters: TAGS=bazel,gss,deadlock , stress=true

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@cockroach-teamcity (Member Author)

ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ f6f355b50e0dbf28633e25ddd05f2775141af31e:

=== RUN   TestStreamingRegionalConstraint
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint48417678
    test_log_scope.go:81: use -show-logs to present logs inline

pkg/ccl/streamingccl/replicationtestutils/testutils.go:333: (startC2CTestCluster)
	NOTICE: .Clock() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().Clock() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization

pkg/ccl/streamingccl/replicationtestutils/testutils.go:128: ((*TenantStreamingClusters).setupSrcTenant)
	NOTICE: .StartSharedProcessTenant() called via implicit interface TenantControlInterface;
HINT: consider using .TenantController().StartSharedProcessTenant() instead.

pkg/ccl/streamingccl/streamingest/replication_stream_e2e_test.go:1221: (TestStreamingRegionalConstraint)
	NOTICE: .DB() called via implicit interface ApplicationLayerInterface;
HINT: consider using .ApplicationLayer().DB() instead.
TIP: consider replacing the test server initialization from:
    ts, ... := serverutils.StartServer(t, ...)
    defer ts.Stopper().Stop(...)
to:
    srv, ... := serverutils.StartServer(t, ...)
    defer srv.Stopper().Stop(...)
    ts := srv.ApplicationLayer()

See also: https://go.crdb.dev/p/testserver-and-cluster-virtualization
    replication_stream_e2e_test.go:1220: condition failed to evaluate within 45s: from replication_stream_e2e_test.go:1203: running iterate callback: found table data located on another node 2
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/90c1b75d835f45b8488807abb5b1092d/logTestStreamingRegionalConstraint48417678
--- FAIL: TestStreamingRegionalConstraint (52.66s)

Parameters: TAGS=bazel,gss , stress=true

Help

See also: How To Investigate a Go Test Failure (internal)

This test on roachdash | Improve this report!

@msbutler msbutler self-assigned this Oct 2, 2023
craig bot pushed a commit that referenced this issue Oct 2, 2023
111574: c2c: skip TestStreamingRegionalConstraint r=msbutler a=msbutler

Informs #111541

Release note: none

Co-authored-by: Michael Butler <[email protected]>
craig bot pushed a commit that referenced this issue Oct 2, 2023
111571: tests: silence some warnings r=yuzefovich a=knz

This will improve investigations for failures like #111541.

Epic: CRDB-18499.

111590: github-pull-request-make: longer overall timeout for `stressrace` r=jlinder a=rickystewart

Multiple people have seen this timeout for `race`. Let's bump this timeout only for `race`.

Epic: none
Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>

msbutler commented Oct 3, 2023

@stevendanna a git bisect revealed that #111178 causes

./dev test pkg/ccl/streamingccl/streamingest -f TestStreamingRegionalConstraint --stress --cpus=8 --count=20

to consistently fail on my GCE worker. If I run the same command on the previous commit, the stress test passes with no problem. Do you have any intuition on why the splits PR could affect our ability to replicate span configs?


msbutler commented Oct 3, 2023

The most benign explanation of this regression: the initial splits cause the occasional test run to suffer more severe hardware starvation. When I modify the test to relax the time allowed to observe the replicated span configs by 5x, the test passes 100 stress runs on my GCE worker:

On 4 parallel CPUs:

//pkg/ccl/streamingccl/streamingest:streamingest_test                    PASSED in 67.0s
  Stats over 100 runs: max = 67.0s, min = 7.2s, avg = 8.8s, dev = 5.9s

On 8 parallel CPUs:

//pkg/ccl/streamingccl/streamingest:streamingest_test                    PASSED in 202.9s
  Stats over 100 runs: max = 202.9s, min = 9.2s, avg = 28.8s, dev = 39.7s

But with 12 parallel CPUs, a test run will fail after 3m45s. So all of this suggests that hardware exhaustion correlates with this test timeout. We still ought to figure out how to speed up this test.
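For reference, a minimal sketch of what the 5x relaxation could look like. It assumes the destination-side check is driven by testutils.SucceedsSoon (whose default budget is the 45 seconds seen in the "failed to evaluate within 45s" messages above); the helper name and file placement are hypothetical, not the actual change:

    package streamingest

    import (
        "testing"
        "time"

        "github.com/cockroachdb/cockroach/pkg/testutils"
    )

    // waitForRegionalConstraint is a hypothetical helper: it runs the same
    // check the test already performs, but with a 5x budget instead of the
    // 45-second SucceedsSoon default, giving a starved stress run more time
    // for the allocator to satisfy the constraint.
    func waitForRegionalConstraint(t *testing.T, check func() error) {
        testutils.SucceedsWithin(t, check, 5*45*time.Second)
    }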


msbutler commented Oct 4, 2023

Just noticed something fishy after bumping the vmodule on ingest_span_configs.go. Before I explain that, recall the timeline of this test:

  1. Init a single MR host cluster with a source app tenant and destination app tenant
  2. Start the replication stream
  3. Create a database with a regional constraint on the source app tenant
  4. Wait for the initial scan to complete
  5. Check that the source side has observed the regional constraint
  6. Check that the destination side has replicated and observed the regional constraint -- we fail here (sketched below)
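To make step 6 concrete, here is an illustrative sketch of such a check. The real helper in replication_stream_e2e_test.go iterates range descriptors via a callback (hence the "running iterate callback" failure message); the SQL form, table name, and column name below are assumptions for illustration only:

    import (
        "strings"
        "testing"

        "github.com/cockroachdb/cockroach/pkg/testutils"
        "github.com/cockroachdb/cockroach/pkg/testutils/sqlutils"
        "github.com/cockroachdb/errors"
    )

    // checkRegionalConstraint is illustrative only: it polls until every range
    // of the constrained table reports replica localities in the pinned region.
    // The table name (test.x) and the SHOW RANGES column are assumptions, not
    // the test's actual query.
    func checkRegionalConstraint(t *testing.T, sqlDB *sqlutils.SQLRunner) {
        testutils.SucceedsSoon(t, func() error {
            rows := sqlDB.QueryStr(t,
                "SELECT replica_localities FROM [SHOW RANGES FROM TABLE test.x]")
            for _, row := range rows {
                if !strings.Contains(row[0], "region=mars") {
                    return errors.Newf("range not yet constrained to region=mars: %s", row[0])
                }
            }
            return nil
        })
    }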

On a green run, we observe the flush of an incremental span config update, which is presumably the region config update:

❯ grep "ingest_span" out.txt
I231004 18:12:19.029687 21135 ccl/streamingccl/streamingest/ingest_span_configs.go:108  [T1,Vsystem,n1,job=REPLICATION STREAM INGESTION id=905618190521303043] 704  initialized span config ingestor
I231004 18:12:19.042198 21283 ccl/streamingccl/streamingest/ingest_span_configs.go:255  [T1,Vsystem,n1,job=REPLICATION STREAM INGESTION id=905618190521303043] 708  flushing span config 45 updates and 0 deletes
I231004 18:12:19.042242 21283 ccl/streamingccl/streamingest/ingest_span_configs.go:307  [T1,Vsystem,n1,job=REPLICATION STREAM INGESTION id=905618190521303043] 709  flushing full span configuration state (45 records)
I231004 18:12:20.193548 21283 ccl/streamingccl/streamingest/ingest_span_configs.go:255  [T1,Vsystem,n1,job=REPLICATION STREAM INGESTION id=905618190521303043] 718  flushing span config 1 updates and 0 deletes

On a failed run, we don't observe the incremental update on the destination side. But we do see the initial flush of the span config state:

❯ grep "ingest_span" streamingesttest.log
I231004 18:07:10.480362 55 util/log/file_sync_buffer.go:238 ⋮ [config]   arguments: [‹/home/michaelbutler/.cache/bazel/_bazel_michaelbutler/338fb3aeb5cca80cb3648387a412fd28/sandbox/linux-sandbox/3802/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/streamingccl/streamingest/streamingest_test_/streamingest_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/streamingccl/streamingest/streamingest_test_/streamingest_test› ‹-test.timeout=895s› ‹-vmodule=ingest_span_configs=2›]
I231004 18:07:21.285026 23789 ccl/streamingccl/streamingest/ingest_span_configs.go:108 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905617212680601603›] 698  initialized span config ingestor
I231004 18:07:21.319160 24050 ccl/streamingccl/streamingest/ingest_span_configs.go:255 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905617212680601603›] 702  flushing span config 46 updates and 0 deletes
I231004 18:07:21.319200 24050 ccl/streamingccl/streamingest/ingest_span_configs.go:307 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905617212680601603›] 703  flushing full span configuration state (46 records)

So, for whatever reason, the span config event stream has not sent over either the update or a subsequent checkpoint that would induce a destination-side flush.


msbutler commented Oct 4, 2023

Alright, I've confirmed with a bit more logging that the regional constraint was replicated and flushed to the destination-side span config table (see below). The remaining mystery, which lies outside of c2c land, is why it takes so long to observe the regional constraint on the destination side.

I231004 19:16:43.593958 66 util/log/file_sync_buffer.go:238 ⋮ [config]   arguments: [‹/home/michaelbutler/.cache/bazel/_bazel_michaelbutler/338fb3aeb5cca80cb3648387a412fd28/sandbox/linux-sandbox/5529/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/streamingccl/streamingest/streamingest_test_/streamingest_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/streamingccl/streamingest/streamingest_test_/streamingest_test› ‹-test.timeout=895s› ‹-vmodule=ingest_span_configs=2,span_config_event_stream=2›]
I231004 19:16:54.714223 24378 ccl/streamingccl/streamingest/ingest_span_configs.go:108 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905630888904556547›] 716  initialized span config ingestor
I231004 19:16:54.741312 24567 ccl/streamingccl/streamproducer/span_config_event_stream.go:186 ⋮ [T1,Vsystem,n1,client=127.0.0.1:41202,hostssl,user=root] 718  observed complete scan
I231004 19:16:54.741400 24565 ccl/streamingccl/streamproducer/span_config_event_stream.go:267 ⋮ [T1,Vsystem,n1,client=127.0.0.1:41202,hostssl,user=root] 719  observed constraint update [‹+region=mars›]
I231004 19:16:54.742470 24565 ccl/streamingccl/streamproducer/span_config_event_stream.go:274 ⋮ [T1,Vsystem,n1,client=127.0.0.1:41202,hostssl,user=root] 720  checkpointing span config stream at 2023-10-04 19:16:54.740267269 +0000 UTC
I231004 19:16:54.742510 24565 ccl/streamingccl/streamproducer/span_config_event_stream.go:276 ⋮ [T1,Vsystem,n1,client=127.0.0.1:41202,hostssl,user=root] 721  sending 49 span config events
I231004 19:16:54.758117 24466 ccl/streamingccl/streamingest/ingest_span_configs.go:255 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905630888904556547›] 723  flushing span config 46 updates and 0 deletes
I231004 19:16:54.758150 24466 ccl/streamingccl/streamingest/ingest_span_configs.go:307 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905630888904556547›] 724  flushing full span configuration state (46 records)


msbutler commented Oct 4, 2023

I have one theory for why the split PR could be causing this test to slow down: according to the test logs, we issue one initial split at the tenant start key:

❯ cat out.txt | grep "splitting and scattering"
I231004 19:16:54.651275 22316 ccl/streamingccl/streamingest/stream_ingestion_dist.go:362 ⋮ [T1,Vsystem,n1,job=‹REPLICATION STREAM INGESTION id=905630888904556547›] 712  splitting and scattering at /Tenant/2

I would expect this split to no-op, since I believe there already exists a split at the tenant start key. More interestingly, we also call scatter on this tenant's keyspace, which will induce some allocator work. If some of that allocator work is asynchronous, then the span config replication stream will begin while the allocator is still handling the scatter. The work to apply the regional constraint would then be queued in the allocator behind this scatter request.

Another thought, with no evidence: perhaps admission control is throttling the allocator more after it handled the admin scatter request.


msbutler commented Oct 4, 2023

Okie dokie, my theory above has some weight: commenting out the scatter request in stream_ingestion_dist.go erases the regression.

msbutler added a commit to msbutler/cockroach that referenced this issue Oct 5, 2023
It was helpful while looking into cockroachdb#111541

Release note: None
craig bot pushed a commit that referenced this issue Oct 5, 2023
111803: streamproducer: add verbose logging to span config event stream r=stevendanna a=msbutler

It was helpful while looking into #111541

Epic: none

Release Note: none

Co-authored-by: Michael Butler <[email protected]>

msbutler commented Oct 5, 2023

Per @kvoli's advice, I took a look at the KV distribution logs of a failed run. I noticed that after the scatter request and the span config update, the allocator attempts to transfer the tenant's range to s1 (perhaps to obey the regional constraint), but the attempt gets rejected, and then there's nothing in the logs for 20 seconds.

I231005 15:42:37.155598 24698 13@kv/kvserver/allocator/plan/replicate.go:1046 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 1180  transferring lease to s1
I231005 15:42:37.155635 24698 13@kv/kvserver/replicate_queue.go:1075 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 1181  transferring lease to s1
I231005 15:42:37.158345 25416 13@kv/kvserver/replicate_queue.go:785 ⋮ [T1,Vsystem,n1,replicate,s1,r86/1:‹/Tenant/{2/Table/…-3}›] 1182  error processing replica: no removable replicas from range that needs a removal: ‹[1*:19, 3:0]›
I231005 15:42:56.761708 49682 13@kv/kvserver/replicate_queue.go:1075 ⋮ [T1,Vsystem,n1,replicate,s1,r34/1:‹/{NamespaceTab…-Table/32}›] 1183  transferring lease to s3

Here's a zip of the latest logs:

testlogs.zip


msbutler commented Oct 5, 2023

Two more things to note about the KV distribution log:

  • I see no mention of a scatter request, even though cockroach.log indicates that it occurred successfully.
  • The above error message is the last log line in the distribution log for the relevant tenant's keyspace.


msbutler commented Oct 5, 2023

I added more verbose logging (replicate=6), which revealed that the allocator tried for about 10 seconds to rebalance the range to obey the regional constraint, but I don't see any logs that indicate success. I'm not sure how to dig further into these logs.

For example:

❯ cat streamingesttest-kv-distribution.log| grep "mars" | { head -n 5; tail -n 5; }
I231005 18:11:50.502240 23678 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2975  computing range action desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:50.502598 24272 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2979  planning range change desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:50.502745 24402 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2981  computing range action desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:50.503042 24272 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2984  computing range action desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:50.503276 24219 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n1,replicate,s1,r70/1:‹/Tenant/1{0/Table/…-1}›] 2986  planning range change desc=r70:‹/Tenant/1{0/Table/110-1}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=8] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:51.332612 25644 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3088  computing range action desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=4, gen=15] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:51.332873 25874 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3090  planning range change desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=4, gen=15] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:11:51.415052 25874 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3101  planning range change desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=4, gen=15] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:12:01.823578 37698 13@kv/kvserver/allocator/plan/replicate.go:152 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3127  computing range action desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=5, gen=17] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›
I231005 18:12:01.824062 37609 13@kv/kvserver/allocator/plan/replicate.go:255 ⋮ [T1,Vsystem,n2,replicate,s2,r86/3:‹/Tenant/{2/Table/…-3}›] 3129  planning range change desc=r86:‹/Tenant/{2/Table/110-3}› [(n2,s2):3, next=5, gen=17] config=‹range_min_bytes:134217728 range_max_bytes:536870912 gc_policy:<ttl_seconds:14400 protection_policies:<protected_timestamp:<wall_time:1696529509986143509 > > > num_replicas:1 constraints:<constraints:<key:"region" value:"mars" > > ›

To repro, run the following command on this branch:

 ./dev test pkg/ccl/streamingccl/streamingest -f TestStreamingRegionalConstraint --vmodule ingest_span_configs=2,span_config_event_stream=2,replicate=6 --stress --count=100 --cpus=8

Here are the test logs with verbose logging:
testlogs.zip


msbutler commented Oct 5, 2023

Handing this over to @stevendanna while I'm out.

@msbutler (Collaborator)

I'm removing the release blocker on this, as the failure is due to the test environment constraints outlined in #112541.

@msbutler msbutler removed the release-blocker label Oct 17, 2023
craig bot pushed a commit that referenced this issue Oct 17, 2023
112470: upgrades: make stmt diag upgrade idempotent r=yuzefovich a=yuzefovich

All upgrades are expected to be idempotent, but the stmt diag upgrade (needed for plan-gist batched matching) wasn't - we needed to add `IF EXISTS` clause to the `DROP INDEX` stmt (which doesn't have a meaningful `schemaExistsFn`). Additionally, we can combine two stmts that add a single column into one that adds two.

Epic: None

Release note: None

112496: roachtest: use tpch workload in import-cancellation r=yuzefovich a=yuzefovich

Previously, we were using `querybench` to run TPCH queries after the import succeeded. The comments around the code suggest that we wanted to assert that the correct results were obtained, meaning that there was no data corruption during cancelled imports. However, `querybench` doesn't do any kind of verification, so this commit switches to using `tpch` workload with `--enable-checks=true` which does the desired verification.

Noticed this when looking at #111985.

Epic: None

Release note: None

112532: upgrade: Increase timeout for TestTenantAutoUpgrade under stress r=stevendanna a=ajstorm

Test times out under stress. With updated timeout it now passes on a local repro.

Fixes: #112158

Release note: None

112543: streamingccl: unskip TestStreamingRegionalConstraint r=kvoli a=msbutler

This patch unskips TestStreamingRegionalConstraint under a non-stress build.

Fixes #111541

Release note: none

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Adam Storm <[email protected]>
Co-authored-by: Michael Butler <[email protected]>
@craig craig bot closed this as completed in 7ace7d5 Oct 17, 2023
blathers-crl bot pushed a commit that referenced this issue Oct 17, 2023
This patch unskips TestStreamingRegionalConstraint under a non-stress build.

Fixes #111541

Release note: none
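For reference, a stress-only skip of this kind is typically expressed with the testutils/skip package; a hypothetical sketch follows (the actual diff is whatever landed in #112543 and may differ):

    import (
        "testing"

        "github.com/cockroachdb/cockroach/pkg/testutils/skip"
    )

    func TestStreamingRegionalConstraint(t *testing.T) {
        // Hypothetical sketch: skip only under stress builds, linking back to
        // this issue; non-stress builds run the test normally.
        skip.UnderStressWithIssue(t, 111541, "test is sensitive to hardware starvation under stress")
        // ... test body unchanged ...
    }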
msbutler added a commit to msbutler/cockroach that referenced this issue Oct 18, 2023
craig bot pushed a commit that referenced this issue Oct 18, 2023
112597: streamingst: increase timeout on TestStreamingRegionalConstraint r=msbutler a=msbutler

Informs #111541

Release note: none

Co-authored-by: Michael Butler <[email protected]>
blathers-crl bot pushed a commit that referenced this issue Oct 23, 2023
msbutler added a commit that referenced this issue Oct 23, 2023