roachtest: schemachange/during/kv failed #85449
Comments
It feels like adding robustness to transient failures to split inside of
This is surprising because we do have a retry loop in I think the key is that this is coming from "change replicas", so it must be returned by
I think I understand why this is failing to be retried.
cockroach/pkg/kv/kvserver/replica_command.go Line 551 in f4b491f
To do that, it uses However, when we construct this error, we use
cockroach/pkg/kv/kvserver/replica_command.go Lines 2427 to 2432 in f4b491f
This is not checked by I don't yet understand the comment here, but this would be fixed by #85405.
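The failure mode discussed in the comments above (a retry loop whose retriability check does not recognize one particular error) can be sketched in Go. This is only an illustration of the general pattern, not the CockroachDB code: `errRetriableChange`, `errLeaseTransferRejected`, `isRetriableNarrow`, `isRetriableWide`, `retry`, and `flakySplit` are all hypothetical names invented for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// errRetriableChange is a hypothetical marker for replication-change errors
// that the retry loop already knows how to retry.
var errRetriableChange = errors.New("retriable replication change error")

// errLeaseTransferRejected is a hypothetical stand-in for the
// LeaseTransferRejectedBecauseTargetMayNeedSnapshot status.
var errLeaseTransferRejected = errors.New("lease transfer rejected: target may need snapshot")

// isRetriableNarrow models a predicate that only recognizes the marker error.
func isRetriableNarrow(err error) bool {
	return errors.Is(err, errRetriableChange)
}

// isRetriableWide additionally treats the lease-transfer rejection as retriable.
func isRetriableWide(err error) bool {
	return errors.Is(err, errRetriableChange) || errors.Is(err, errLeaseTransferRejected)
}

// retry runs op up to maxAttempts times and retries only when isRetriable
// approves the error. It returns the number of attempts made and the last error.
func retry(maxAttempts int, isRetriable func(error) bool, op func(attempt int) error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = op(attempt); err == nil || !isRetriable(err) {
			return attempt, err
		}
	}
	return maxAttempts, err
}

// flakySplit fails twice with the lease-transfer rejection, then succeeds.
func flakySplit(attempt int) error {
	if attempt < 3 {
		return fmt.Errorf("change replicas: %w", errLeaseTransferRejected)
	}
	return nil
}

func main() {
	n, err := retry(5, isRetriableNarrow, flakySplit)
	fmt.Println("narrow predicate:", n, err)

	n, err = retry(5, isRetriableWide, flakySplit)
	fmt.Println("wide predicate:", n, err)
}
```

With the narrow predicate the first transient failure is surfaced to the caller; with the wider predicate the same operation is retried until it succeeds.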
85405: kv: retry AdminSplit on LeaseTransferRejectedBecauseTargetMayNeedSnapshot r=shralex a=nvanbenschoten

Informs #84635.
Informs #84162.
Fixes #85449.
Fixes #83174.

This commit considers the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot` status to be a form of a retriable replication change error. It then hooks `Replica.executeAdminCommandWithDescriptor` up to consult this status in its retry loop.

This avoids spurious errors when a split gets blocked behind a lateral replica move like we see in the following situation:

1. issue AdminSplit
2. range in joint config, first needs to leave (maybeLeaveAtomicChangeReplicas)
3. to leave, needs to transfer lease from voter_outgoing to voter_incoming
4. can’t transfer lease because doing so is deemed to be potentially unsafe

Release note: None

Release justification: Low risk, resolves flaky test.

87137: storage: default to TableFormatPebblev1 in backups r=itsbilal,dt a=jbowens

If the v22.2 upgrade has not yet been finalized, so we're not permitted to use the new TableFormatPebblev2 sstable format, default to TableFormatPebblev1 which is the format used by v22.1 internally.

This change is intended to allow us to remove code for understanding the old RocksDB table format version sooner (eg, v23.1).

Release justification: low-risk updates to existing functionality

Release note: None

87152: sql: encode either 0 or 1 spans in scan gists r=mgartner a=mgartner

#### dev: add rewritable paths for pkg/sql/opt/exec/explain tests

This commit adds fixtures in `pkg/sql/opt/testutils/opttester/testfixtures` as rewritable paths for tests in `pkg/sql/opt/exec/explain`. This prevents `dev test pkg/sql/opt/exec/explain` from erring when the `--rewrite` flag is used.

Release justification: This is a test-only change.

Release note: None

#### sql: encode either 0 or 1 spans in scan gists

In plan gists, we no longer encode the exact number of spans for scans so that two queries with the same plan but a different number of spans have the same gist.

In addition, plan gists are now decoded with the `OnlyShape` flag which prints any non-zero number of spans as "1+ spans" and removes attributes like "missing stats" from scans.

Fixes #87138

Release justification: This is a minor change that makes plan gist instrumentation more scalable.

Release note (bug fix): The Explain Tab inside the Statement Details page now groups plans that have the same shape but a different number of spans in corresponding scans.

87154: roachtest: stop cockroach gracefully when upgrading nodes r=yuzefovich a=yuzefovich

This commit makes it so that we stop cockroach nodes gracefully when upgrading them. Previous abrupt behavior of stopping the nodes during the upgrade could lead to test flakes because the nodes were not being properly drained.

Here is one scenario for how one of the flakes (`pq: version mismatch in flow request: 65; this node accepts 69 through 69`, which means that a gateway running an older version asks another node running a newer version to do DistSQL computation, but the versions are not DistSQL compatible):

- we are in a state when node 1 is running a newer version when node 2 is running an older version. Importantly, node 1 was upgraded "abruptly" meaning that it wasn't properly drained; in particular, it didn't send DistSQL draining notification through gossip.
- newer node has already been started but its DistSQL server hasn't been started yet (although it already can accept incoming RPCs - see comments on `distsql.ServerImpl.Start` for more details). This means that newer node has **not** sent through gossip an update about its DistSQL version.
- node 2 acts as the gateway for a query that reads some data that node 1 is the leaseholder for. During the physical planning, older node 2 checks whether newer node 1 is "healthy and compatible", and node 1 is deemed both healthy (because it can accept incoming RPCs) and is compatible (because node 2 hasn't received updated DistSQL version of node 1 since it hasn't been sent yet). As a result, node 2 plans a read on node 1.
- when node 1 receives that request, it errors out with "version mismatch" error.

This whole problem is solved if we stop nodes gracefully when upgrading them. In particular, this will mean that node 1 would first dissipate its draining notification across the cluster, so during the physical planning it will only be considered IFF node 1 has already communicated its updated DistSQL version, and then it would be deemed DistSQL-incompatible.

I verified that this scenario is possible (with manual adjustments of the version upgrade test and cockroach binary to insert a delay) and that it's fixed by this commit. I believe it is likely that other flake types have the same root cause, but I haven't verified it.

Fixes: #87104.

Release justification: test-only change.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
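The version-mismatch scenario described for 87154 above can be modeled with a small Go sketch. It is a simplified illustration under stated assumptions, not CockroachDB's planner: `versionRange`, `checkFlowRequest`, and `gatewayPlansOn` are hypothetical names, and the stale gossiped range of 65 through 65 for node 1 is an assumed value (the source only gives the request version 65 and the accepted range 69 through 69).

```go
package main

import "fmt"

// versionRange is a hypothetical model of the DistSQL versions a node accepts.
type versionRange struct {
	minAccepted int
	current     int
}

// accepts reports whether a flow request at version v falls inside the range.
func (r versionRange) accepts(v int) bool {
	return v >= r.minAccepted && v <= r.current
}

// checkFlowRequest mimics the receiving node's check and returns an error
// shaped like the one quoted in the commit message.
func checkFlowRequest(node versionRange, requestVersion int) error {
	if !node.accepts(requestVersion) {
		return fmt.Errorf("version mismatch in flow request: %d; this node accepts %d through %d",
			requestVersion, node.minAccepted, node.current)
	}
	return nil
}

// gatewayPlansOn models the gateway's planning decision. It only sees the
// last gossiped range and draining flag for the target, which may be stale.
func gatewayPlansOn(gatewayVersion int, gossiped versionRange, draining bool) bool {
	return !draining && gossiped.accepts(gatewayVersion)
}

func main() {
	const gatewayVersion = 65
	stale := versionRange{minAccepted: 65, current: 65}  // assumed pre-upgrade gossip for node 1
	actual := versionRange{minAccepted: 69, current: 69} // node 1 after the upgrade

	// Abrupt stop: no draining notification, gossip still stale.
	if gatewayPlansOn(gatewayVersion, stale, false) {
		fmt.Println("abrupt:", checkFlowRequest(actual, gatewayVersion))
	}

	// Graceful stop: the draining notification keeps node 1 out of the plan,
	// and once fresh gossip arrives the node is deemed incompatible.
	fmt.Println("graceful, draining:", gatewayPlansOn(gatewayVersion, stale, true))
	fmt.Println("graceful, fresh gossip:", gatewayPlansOn(gatewayVersion, actual, false))
}
```

The point of the sketch is that the gateway's decision depends on gossiped state, so the draining notification is what closes the window in which stale version information makes an upgraded node look compatible.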
roachtest.schemachange/during/kv failed with artifacts on master @ 748b89a3fbef294f4b0f930c9dbdf88294b3deeb:
Parameters: `ROACHTEST_cloud=gce`, `ROACHTEST_cpu=4`, `ROACHTEST_ssd=0`
Help
See: roachtest README
See: How To Investigate (internal)
Jira issue: CRDB-18251
Epic CRDB-19172