ccl/streamingccl/streamingest: TestStreamingRegionalConstraint failed #111541
ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ f6f355b50e0dbf28633e25ddd05f2775141af31e:
111574: c2c: skip TestStreamingRegionalConstraint r=msbutler a=msbutler Informs #111541 Release note: none Co-authored-by: Michael Butler <[email protected]>
111571: tests: silence some warnings r=yuzefovich a=knz This will improve investigations for failures like #111541. Epic: CRDB-18499.
111590: github-pull-request-make: longer overall timeout for `stressrace` r=jlinder a=rickystewart Multiple people have seen this timeout for `race`. Let's bump this timeout only for `race`. Epic: none Release note: None
Co-authored-by: Raphael 'kena' Poss <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>
@stevendanna a git bisect revealed that #111178 causes this test to consistently fail on my gce worker. If I run the same command on the previous commit, the stress test passes with no problem. Do you have any intuition on why the splits PR could affect our ability to replicate spanconfigs?
The most benign explanation of this regression: the initial splits cause the occasional test run to suffer more severe hardware starvation. When I modify the test to relax the time requirement for observing the replicated span configs by 5x, the test passes under stress for 100 runs on my gce worker:
On 4 parallel cpus:
On 8 parallel cpus:
But with 12 parallel cpus, a test fails after 3m45s. So all of this suggests that hardware exhaustion correlates with this test timeout. We still ought to figure out how to speed up this test.
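For illustration, here is a minimal sketch of the kind of relaxed observation loop described above. It is not the actual test code; the helper name, the base timeout, and the stress-detection env var are all assumptions made for the example.

```go
package ingesttest // sketch: this would live in a _test.go helper next to the test

import (
	"os"
	"testing"
	"time"
)

// waitForReplicatedConstraint polls until the replicated span config
// constraint is observed on the destination or a deadline passes. Under
// stress the window is widened 5x, mirroring the relaxation described above.
// The helper name, the 45s base timeout, and the COCKROACH_INTERNAL_STRESS
// env var are illustrative assumptions, not names from the real test.
func waitForReplicatedConstraint(t *testing.T, observe func() error) {
	timeout := 45 * time.Second
	if os.Getenv("COCKROACH_INTERNAL_STRESS") != "" {
		timeout *= 5
	}
	deadline := time.Now().Add(timeout)
	for {
		err := observe()
		if err == nil {
			return
		}
		if time.Now().After(deadline) {
			t.Fatalf("regional constraint not observed within %s: %v", timeout, err)
		}
		time.Sleep(time.Second)
	}
}
```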
Just noticed something fishy after bumping the vmodule.
On a green run, we observe the flushing of an incremental span config update, which presumably is the region config update:
On a failed run, we don't observe the incremental update on the destination side, but we do see the initial flush of the span config state:
So for whatever reason, either the span config event stream has not sent over the update, or it has not sent a subsequent checkpoint that would induce a destination-side flush.
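For context on what bumping the vmodule buys here, the sketch below shows the general pattern of verbosity-gated logging using only the Go standard library (it is not CockroachDB's util/log package, and the message text is made up): the incremental-flush line only appears once verbosity is raised, which is why its absence on a failed run is meaningful.

```go
package main

import (
	"flag"
	"log"
)

// verbosity imitates a --vmodule-style knob: messages logged at a level above
// the configured verbosity are suppressed. With -verbosity=2 the incremental
// flush line below shows up in the logs; at the default level it does not.
var verbosity = flag.Int("verbosity", 0, "suppress vlogf messages above this level")

func vlogf(level int, format string, args ...interface{}) {
	if *verbosity >= level {
		log.Printf(format, args...)
	}
}

func main() {
	flag.Parse()
	vlogf(1, "flushed initial span config state (%d records)", 42)
	vlogf(2, "flushed incremental span config update (%d records)", 1)
}
```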
Alright, I've confirmed with a bit more logging that the regional constraint was replicated and flushed to the destination side span config table (see below). The remaining mystery, which is outside of c2c land: why does it take so long to observe the regional constraints on the destination side?
I have one theory for why the split PR could be causing this test to slow down. According to the test logs, we issue one initial split at the tenant start key:
I would expect this split to be a no-op, since I believe there already exists a split at the tenant start key. More interestingly, we also call scatter on this tenant's key space, which will induce some allocator work. If some of that allocator work is asynchronous, then the span config replication stream will begin while the allocator is still handling the scatter, and the work to apply the regional constraint would get put in the allocator queue behind this scatter request (see the toy sketch below). Another thought, with no evidence: perhaps admission control is throttling the allocator more after it handled the admin scatter request.
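As a toy illustration of the queueing theory above (this is not the allocator's actual implementation, and the processing costs are made up), a single FIFO queue in which scatter-induced work sits ahead of the constraint change delays when the constraint can be acted on:

```go
package main

import (
	"fmt"
	"time"
)

// A toy FIFO work queue: if the scatter-induced rebalances are enqueued before
// the regional-constraint change, the constraint is only acted on after the
// scatter work drains, delaying when the destination observes conformance.
func main() {
	queue := []struct {
		name string
		cost time.Duration // hypothetical processing costs
	}{
		{"scatter-induced rebalances", 30 * time.Second},
		{"apply regional constraint", 5 * time.Second},
	}
	var elapsed time.Duration
	for _, item := range queue {
		elapsed += item.cost
		fmt.Printf("%-28s finishes at t=%s\n", item.name, elapsed)
	}
}
```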
Okie dokie, my theory above has some weight: commenting out the scatter request in
It was helpful while looking into cockroachdb#111541 Release note: None
111803: streamproducer: add verbose logging to span config event stream r=stevendanna a=msbutler It was helpful while looking into #111541 Epic: none Release Note: none Co-authored-by: Michael Butler <[email protected]>
Per @kvoli's advice, I took a look at the kv distribution logs of a failed run. I noticed that after the scatter request and the span config update, the allocator attempts to transfer the tenant's range to s1 (perhaps to obey the regional constraint), but the attempt gets rejected, and then there's nothing in the logs for 20 seconds.
Here's a zip of the latest logs:
Two more things to note about the kv distribution log:
I added more verbose logging. For example:
To repro, run the following command on this branch:
Here are the test logs with verbose logging:
Handing this over to @stevendanna while I'm out.
I'm removing the release blocker on this, as the failure is due to test environment constraints outlined in #112541.
112470: upgrades: make stmt diag upgrade idempotent r=yuzefovich a=yuzefovich All upgrades are expected to be idempotent, but the stmt diag upgrade (needed for plan-gist batched matching) wasn't - we needed to add `IF EXISTS` clause to the `DROP INDEX` stmt (which doesn't have a meaningful `schemaExistsFn`). Additionally, we can combine two stmts that add a single column into one that adds two. Epic: None Release note: None
112496: roachtest: use tpch workload in import-cancellation r=yuzefovich a=yuzefovich Previously, we were using `querybench` to run TPCH queries after the import succeeded. The comments around the code suggest that we wanted to assert that the correct results were obtained, meaning that there was no data corruption during cancelled imports. However, `querybench` doesn't do any kind of verification, so this commit switches to using the `tpch` workload with `--enable-checks=true`, which does the desired verification. Noticed this when looking at #111985. Epic: None Release note: None
112532: upgrade: Increase timeout for TestTenantAutoUpgrade under stress r=stevendanna a=ajstorm Test times out under stress. With updated timeout it now passes on a local repro. Fixes: #112158 Release note: None
112543: streamingccl: unskip TestStreamingRegionalConstraint r=kvoli a=msbutler This patch unskips TestStreamingRegionalConstraint under a non-stress build. Fixes #111541 Release note: none
Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Adam Storm <[email protected]> Co-authored-by: Michael Butler <[email protected]>
This patch unskips TestStreamingRegionalConstraint under a non-stress build. Fixes #111541 Release note: none
Informs cockroachdb#111541 Release note: none
112597: streamingst: increase timeout on TestStreamingRegionalConstraint r=msbutler a=msbutler Informs #111541 Release note: none Co-authored-by: Michael Butler <[email protected]>
Informs #111541 Release note: none
ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ fad649d89721ddb3e9f3dcab1ad5d14f74c91bc9:
Parameters: TAGS=bazel,gss,deadlock, stress=true
Jira issue: CRDB-31957