kvserver: TestCheckConsistencyInconsistent failed, data race in TestingSetRedactable #81819
This test will be disabled under race in #82042. However, the underlying race should still be addressed.
82042: kvserver: skip `TestCheckConsistencyInconsistent` under race r=tbg a=erikgrinaker

The test is rather slow and also timing-dependent, which can result in false positives. Furthermore, the call to `TestingSetRedactable` triggers the race detector. This patch therefore disables the test under the race detector.

Touches #81819.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
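For reference, a minimal sketch of what such a skip looks like in this codebase, assuming the `pkg/testutils/skip` package; the real test body is elided:

```go
package kvserver_test

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils/skip"
)

func TestCheckConsistencyInconsistent(t *testing.T) {
	// Skip under the race detector: the test is slow, timing-dependent,
	// and its use of TestingSetRedactable trips the race detector.
	skip.UnderRace(t, "slow and timing-dependent; TestingSetRedactable races")
	// ... test body elided ...
}
```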
We have marked this test failure issue as stale because it has been …
Hi @shralex, please add branch-* labels to identify which branch(es) this release-blocker affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
I unskipped the test under race and stressed it:
This did not hit the data race mentioned above. However, after about 10m of stress, it failed with the following, more concerning error:
I've since been stressing without race and have not hit the same failure. EDIT: I quickly hit this again with …
I've been able to hit this failure as far back as 4f904eb. @nicktrav could we have someone from Replication take a look at this test failure and determine what to do about it? Maybe @pavelkalinnikov, who worked with this test last October. There's an argument that what we're seeing here should be promoted to a …
cc @cockroachdb/replication
Reproduces fairly quickly on my end too. Investigating.
I was bisecting and hit the original race after a while (got it at 4eb5451, but it's probably a long-existing one):
Another error:
Here is what happened during one of the test failures:
Node …

Shortly after, all 3 nodes got a second-phase request: …

Nodes 2 and 3 quickly succeeded in creating the checkpoint @ log index 41: …

5+eps seconds later, the leaseholder ended the second phase of the consistency check: …

Notably, only nodes 2 and 3 returned a checksum. Node 1 timed out at 5 seconds and returned an error. Only 10 more seconds later, node 1 finally reported that it created a checksum: …

And exactly 5 seconds later it gave up on keeping the result of the checksum computation, because the result collection request never showed up (we know that it timed out above): …

So what's probably happening is: …

The problem is that on node 1 the checkpoint directory isn't complete yet. Even the …
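To make that sequence concrete, here is a small, self-contained Go model of the suspected interleaving; the names and durations are illustrative assumptions, not the actual kvserver code. The leaseholder waits at most 5 seconds per replica for a checksum, while each replica populates its checkpoint in the background, so a slow replica can leave behind a directory that exists but is only partially populated.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// collectChecksum models the leaseholder side of the second phase:
// wait for a replica's checksum, but give up after the timeout.
func collectChecksum(done <-chan string, timeout time.Duration) (string, error) {
	select {
	case sum := <-done:
		return sum, nil
	case <-time.After(timeout):
		return "", errors.New("checksum collection timed out")
	}
}

func main() {
	done := make(chan string, 1)
	go func() {
		// Model of node 1: the checkpoint directory is created early,
		// but populating it takes longer than the collection timeout.
		time.Sleep(8 * time.Second) // exaggerated for illustration
		done <- "checksum-of-range"
	}()
	if _, err := collectChecksum(done, 5*time.Second); err != nil {
		// At this point the test sees node 1's checkpoint directory,
		// but it is still incomplete; this is the race described above.
		fmt.Println(err)
	}
}
```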
This commit makes it so that the consistency checker checkpoints are first created in a "_pending" folder, and only after completion are atomically renamed to the intended directory name. This is to help distinguish valid checkpoints from ones that weren't finalized (for example, when the node crashed in the meantime).

Part of cockroachdb#81819
Epic: none

Release note (ops change): checkpoint directories (which can be created in the rare event of range inconsistency) are now clearly indicated as pending until they are fully populated. This helps operators distinguish valid checkpoints from corrupted ones.
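A hedged sketch of the "_pending" pattern the commit describes (function and file names here are assumptions, not the actual kvserver code): populate the checkpoint under a pending name, then atomically rename it, so a directory with the final name is always complete.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

// createCheckpoint writes a checkpoint into "<name>_pending" and only
// renames it to "<name>" once populate has finished. rename(2) is atomic
// on POSIX filesystems, so readers see either no final directory or a
// fully populated one, never a partial checkpoint.
func createCheckpoint(auxDir, name string, populate func(dir string) error) error {
	pending := filepath.Join(auxDir, name+"_pending")
	if err := os.MkdirAll(pending, 0755); err != nil {
		return err
	}
	if err := populate(pending); err != nil {
		// Leave the _pending directory behind: its name tells operators
		// (and tests) that the checkpoint was never finalized.
		return err
	}
	return os.Rename(pending, filepath.Join(auxDir, name))
}

func main() {
	auxDir, err := os.MkdirTemp("", "auxiliary")
	if err != nil {
		log.Fatal(err)
	}
	defer os.RemoveAll(auxDir)

	err = createCheckpoint(auxDir, "checkpoint-0041", func(dir string) error {
		// Stand-in for copying SSTs etc. into the checkpoint.
		return os.WriteFile(filepath.Join(dir, "MANIFEST"), []byte("..."), 0644)
	})
	fmt.Println("checkpoint created, err =", err)
}
```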
98527: sql: deprecate gossip-based physical planning r=dt a=dt

Previously, system tenants and secondary tenants used different physical planning implementations, with the system tenant, and only the system tenant, using NodeIDs while other tenants used the instance table. This unifies those implementations such that all tenants use NodeIDs if running in mixed mode and use the instance table if not.

Release note: none.
Epic: CRDB-16910

98879: sql: support array-flatten subqueries within UDFs r=mgartner a=mgartner

Array-flatten subqueries (e.g., `ARRAY(SELECT a FROM t)`) are now supported within UDFs. They are now converted to a normal subquery with a ScalarGroupBy if they exist within a UDF, even if they are uncorrelated. This allows them to be executed without any changes to the execbuilder or the evaluation logic of `tree.Routine`.

Fixes #98738

Release note: None

98988: ui: add badges for filter elements r=maryliag a=maryliag

Adds badges for each of the selected filters on the SQL Activity and Insights pages.

Part of #98891

https://www.loom.com/share/e7417209fc704d71b2733f58609fb4de

Release note (ui change): Adds badges for each selected filter on the SQL Activity and Insights pages.

99042: sql: added indexes to system.statement_statistics, system.transaction_statistics r=ericharmeling a=ericharmeling

Fixes #98624. This commit adds indexes on new computed columns to the system.statement_statistics and system.transaction_statistics tables.

Epic: none
Release note: None

99054: pkg/ccl: Unskip TestTenantStatusAPI/tenant_ranges/pagination r=dhartunian a=abarganier

Fixes: #92979

Previously, in #97386, we skipped test_tenant_ranges_pagination because it was marked as flaky. The test makes a request for a single range and expects an offset of `1` back. It then uses this offset to request a second range, and expects an offset of `2`. This means that the test requires at least 3 ranges to exist on the tenant. The test was flaking on the assertion that the offset returned by the second request came back as `2`. Instead, it was flaking when the offset came back as `0`, which signifies that there are no more ranges to process. We learned that the tenant creation process asynchronously splits ranges, which is what would lead to this sporadic scenario where not enough ranges existed (yet) for the test to succeed. This patch updates the test with a `testutils.SucceedsSoon` clause that first checks that `crdb_internal.ranges` contains at least 3 ranges, prior to making the second request. This should give the range split queue sufficient time to be processed and eliminate the vast majority of these test flakes.

Release note: none

99119: kvserver: mark in-flight storage checkpoints r=tbg a=pavelkalinnikov

This commit makes it so that the consistency checker checkpoints are first created in a "_pending" folder, and only after completion are atomically renamed to the intended directory name. This is to help distinguish valid checkpoints from ones that weren't finalized (for example, when the node crashed in the meantime).

Part of #81819
Epic: none

Release note (ops change): checkpoint directories (which can be created in the rare event of range inconsistency) are now clearly indicated as pending until they are fully populated. This helps operators distinguish valid checkpoints from corrupted ones.
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: maryliag <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Alex Barganier <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
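As an aside, the `testutils.SucceedsSoon` retry pattern from #99054 looks roughly like this; the helper name and plumbing are assumptions, and only the query and the threshold come from the PR description:

```go
package kvserver_test

import (
	"database/sql"
	"fmt"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

// waitForRanges polls until the tenant reports at least min ranges,
// so the paginated request has something to page through.
func waitForRanges(t *testing.T, db *sql.DB, min int) {
	testutils.SucceedsSoon(t, func() error {
		var n int
		if err := db.QueryRow(
			`SELECT count(*) FROM crdb_internal.ranges`,
		).Scan(&n); err != nil {
			return err
		}
		if n < min {
			return fmt.Errorf("have %d ranges, want at least %d", n, min)
		}
		return nil // enough ranges; the retry loop stops here
	})
}
```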
This commit fixes a possible race in TestCheckConsistencyInconsistent:

- Node 2 is corrupted.
- The second phase of the consistency check times out on node 1, and returns early when only nodes 2 and 3 have created the storage checkpoint.
- Node 1 hasn't created the checkpoint, but has started doing so.
- Node 2 "fatals" (mocked out in the test) shortly after the check is complete.
- Node 1 is still creating its checkpoint, but has probably created the directory by now.
- Hence, the test assumes that the checkpoint has been created, and proceeds to open it and compute the checksum of the range.

This commit makes the test wait for the moment when all the checkpoints are known to be fully populated.

Part of #81819
Epic: none
Release note: none
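A hedged sketch of what such a wait could look like, assuming the "_pending" naming from #99119 and the `testutils.SucceedsSoon` helper; the helper below is illustrative, not the actual test code:

```go
package kvserver_test

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
	"testing"

	"github.com/cockroachdb/cockroach/pkg/testutils"
)

// waitForCheckpoints blocks until every store has at least one
// checkpoint and none of them is still marked "_pending", so the test
// only ever opens fully populated checkpoints.
func waitForCheckpoints(t *testing.T, checkpointDirs []string) {
	testutils.SucceedsSoon(t, func() error {
		for _, dir := range checkpointDirs {
			entries, err := os.ReadDir(dir)
			if err != nil {
				return err // checkpoint parent directory not created yet
			}
			if len(entries) == 0 {
				return fmt.Errorf("no checkpoint in %s yet", dir)
			}
			for _, e := range entries {
				if strings.HasSuffix(e.Name(), "_pending") {
					return fmt.Errorf("checkpoint %s still pending",
						filepath.Join(dir, e.Name()))
				}
			}
		}
		return nil
	})
}
```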
Jira issue: CRDB-16183