ccl/streamingccl/streamingest: TestStreamingRegionalConstraint failed #111541
ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ f6f355b50e0dbf28633e25ddd05f2775141af31e:
111574: c2c: skip TestStreamingRegionalConstraint r=msbutler a=msbutler Informs #111541 Release note: none Co-authored-by: Michael Butler <[email protected]>
111571: tests: silence some warnings r=yuzefovich a=knz This will improve investigations for failures like #111541. Epic: CRDB-18499.
111590: github-pull-request-make: longer overall timeout for `stressrace` r=jlinder a=rickystewart Multiple people have seen this timeout for `race`. Let's bump this timeout only for `race`. Epic: none Release note: None
Co-authored-by: Raphael 'kena' Poss <[email protected]> Co-authored-by: Ricky Stewart <[email protected]>
@stevendanna a git bisect revealed that #111178 causes this test to consistently fail on my gce worker. If I run the same command on the previous commit, the stress test passes with no problem. Do you have any intuition on why the splits PR could affect our ability to replicate spanconfigs?
The most benign explanation of this regression: the initial splits cause the occasional test run to suffer more severe hardware starvation. When I modify the test to relax the time requirement for observing the replicated span configs by 5x, the test passes under stress for 100 runs on my gce worker:
On 4 parallel cpus:
On 8 parallel cpus:
But with 12 parallel cpus, a test fails after 3m45s. So all of this suggests that hardware exhaustion correlates with this test timeout. We still ought to figure out how to speed up this test.
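For illustration, here is a minimal sketch of the kind of relaxed observation loop described above. It is not the actual test code; the helper name, the base timeout, and the stress-detection env var are all assumptions made for the example.

```go
package ingesttest // sketch: this would live in a _test.go helper next to the test

import (
	"os"
	"testing"
	"time"
)

// waitForReplicatedConstraint polls until the replicated span config
// constraint is observed on the destination or a deadline passes. Under
// stress the window is widened 5x, mirroring the relaxation described above.
// The helper name, the 45s base timeout, and the COCKROACH_INTERNAL_STRESS
// env var are illustrative assumptions, not names from the real test.
func waitForReplicatedConstraint(t *testing.T, observe func() error) {
	timeout := 45 * time.Second
	if os.Getenv("COCKROACH_INTERNAL_STRESS") != "" {
		timeout *= 5
	}
	deadline := time.Now().Add(timeout)
	for {
		err := observe()
		if err == nil {
			return
		}
		if time.Now().After(deadline) {
			t.Fatalf("regional constraint not observed within %s: %v", timeout, err)
		}
		time.Sleep(time.Second)
	}
}
```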
Just noticed something fishy after bumping the vmodule.
On a green run, we observe the flushing of an incremental span config update, which presumably is the region config update:
On a failed run, we don't observe the incremental update on the destination side, but we do see the initial flush of the span config state:
So for whatever reason, either the span config event stream has not sent over the update, or it has not sent a subsequent checkpoint that would induce a destination-side flush.
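For context on what bumping the vmodule buys here, the sketch below shows the general pattern of verbosity-gated logging using only the Go standard library (it is not CockroachDB's util/log package, and the message text is made up): the incremental-flush line only appears once verbosity is raised, which is why its absence on a failed run is meaningful.

```go
package main

import (
	"flag"
	"log"
)

// verbosity imitates a --vmodule-style knob: messages logged at a level above
// the configured verbosity are suppressed. With -verbosity=2 the incremental
// flush line below shows up in the logs; at the default level it does not.
var verbosity = flag.Int("verbosity", 0, "suppress vlogf messages above this level")

func vlogf(level int, format string, args ...interface{}) {
	if *verbosity >= level {
		log.Printf(format, args...)
	}
}

func main() {
	flag.Parse()
	vlogf(1, "flushed initial span config state (%d records)", 42)
	vlogf(2, "flushed incremental span config update (%d records)", 1)
}
```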
Alright, I've confirmed with a bit more logging that the regional constraint was replicated and flushed to the destination side span config table (see below). The remaining mystery, which is outside of c2c land: why does it take so long to observe the regional constraints on the destination side?
I have one theory for why the split PR could be causing this test to slow down. According to the test logs, we issue one initial split at the tenant start key:
I would expect this split to be a no-op, since I believe there already exists a split at the tenant start key. More interestingly, we also call scatter on this tenant's key space, which will induce some allocator work. If some of that allocator work is asynchronous, then the span config replication stream will begin while the allocator is still handling the scatter, and the work to apply the regional constraint would get put in the allocator queue behind this scatter request (see the toy sketch below). Another thought, with no evidence: perhaps admission control is throttling the allocator more after it handled the admin scatter request.
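As a toy illustration of the queueing theory above (this is not the allocator's actual implementation, and the processing costs are made up), a single FIFO queue in which scatter-induced work sits ahead of the constraint change delays when the constraint can be acted on:

```go
package main

import (
	"fmt"
	"time"
)

// A toy FIFO work queue: if the scatter-induced rebalances are enqueued before
// the regional-constraint change, the constraint is only acted on after the
// scatter work drains, delaying when the destination observes conformance.
func main() {
	queue := []struct {
		name string
		cost time.Duration // hypothetical processing costs
	}{
		{"scatter-induced rebalances", 30 * time.Second},
		{"apply regional constraint", 5 * time.Second},
	}
	var elapsed time.Duration
	for _, item := range queue {
		elapsed += item.cost
		fmt.Printf("%-28s finishes at t=%s\n", item.name, elapsed)
	}
}
```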
Okie dokie, my theory above has some weight: commenting out the scatter request in
It was helpful while looking into cockroachdb#111541 Release note: None
111803: streamproducer: add verbose logging to span config event stream r=stevendanna a=msbutler It was helpful while looking into #111541 Epic: none Release Note: none Co-authored-by: Michael Butler <[email protected]>
Per @kvoli's advice, I took a look at the kv distribution logs of a failed run. I noticed that after the scatter request and the span config update, the allocator attempts to transfer the tenant's range to s1 (perhaps to obey the regional constraint), but the attempt gets rejected, and then there's nothing in the logs for 20 seconds.
Here's a zip of the latest logs:
Two more things to note about the kv distribution log:
I added more verbose logging. For example:
To repro, run the following command on this branch:
Here are the test logs with verbose logging:
Handing this over to @stevendanna while I'm out.
I'm removing the release blocker on this, as the failure is due to test environment constraints outlined in #112541.
112470: upgrades: make stmt diag upgrade idempotent r=yuzefovich a=yuzefovich All upgrades are expected to be idempotent, but the stmt diag upgrade (needed for plan-gist batched matching) wasn't - we needed to add `IF EXISTS` clause to the `DROP INDEX` stmt (which doesn't have a meaningful `schemaExistsFn`). Additionally, we can combine two stmts that add a single column into one that adds two. Epic: None Release note: None
112496: roachtest: use tpch workload in import-cancellation r=yuzefovich a=yuzefovich Previously, we were using `querybench` to run TPCH queries after the import succeeded. The comments around the code suggest that we wanted to assert that the correct results were obtained, meaning that there was no data corruption during cancelled imports. However, `querybench` doesn't do any kind of verification, so this commit switches to using the `tpch` workload with `--enable-checks=true`, which does the desired verification. Noticed this when looking at #111985. Epic: None Release note: None
112532: upgrade: Increase timeout for TestTenantAutoUpgrade under stress r=stevendanna a=ajstorm Test times out under stress. With updated timeout it now passes on a local repro. Fixes: #112158 Release note: None
112543: streamingccl: unskip TestStreamingRegionalConstraint r=kvoli a=msbutler This patch unskips TestStreamingRegionalConstraint under a non-stress build. Fixes #111541 Release note: none
Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Adam Storm <[email protected]> Co-authored-by: Michael Butler <[email protected]>
This patch unskips TestStreamingRegionalConstraint under a non-stress build. Fixes #111541 Release note: none
Informs cockroachdb#111541 Release note: none
112597: streamingst: increase timeout on TestStreamingRegionalConstraint r=msbutler a=msbutler Informs #111541 Release note: none Co-authored-by: Michael Butler <[email protected]>
Informs #111541 Release note: none
ccl/streamingccl/streamingest.TestStreamingRegionalConstraint failed with artifacts on master @ fad649d89721ddb3e9f3dcab1ad5d14f74c91bc9:
Parameters: TAGS=bazel,gss,deadlock, stress=true
Jira issue: CRDB-31957