roachtest: acceptance/version-upgrade failed [not using applied state] #62267
This looks like an instance of #58378.
@tbg and @irfansharif what do we want to do here? Should we mark #58378 as a GA-blocker?
Looks like a somewhat different issue:
There is no snapshot application racing with the cluster version bump here. Instead, it seems as though there's some invalid interleaving of replicaGC and (lease) request evaluation.
@tbg, I have cycles, I'll take a look here.
Well, it's not exactly that, but I think it's possible for replica GC to purge a replica and then, if that replica receives a raft message, for an uninitialized replica to be re-instated in the store map. I'm not sure yet though; this is pretty difficult to reproduce. I'm going to try and see what reverting #60429 tickles, since I suspect it's the same issue (where we're dealing with uninitialized replicas, and thus panicking). Hopefully that makes it easier to repro.
Hm, I'm not having much luck with this one. It's pretty difficult to repro (read: I couldn't) and none of my theories have panned out. My best guess as to what's happening is that the replicaGC process GCs an old replica, writing the range tombstone key and removing it from the store's replicas map. Perhaps concurrently, we're processing an old raft request, calling into store.tryGetOrCreateReplica. I don't see how (we appropriately check for the tombstone and the store map), but if that GC-ed replica were missing from the map, we might try to initialize a second incarnation of it here:
In doing so it'll read from the engine to find its replica state, but won't find any if it was GC-ed away, so it's as if it was "unset": cockroach/pkg/kv/kvserver/replica_init.go, lines 171 to 173, at 8b137b4
cockroach/pkg/kv/kvserver/replica_proposal.go, line 831, at 8b137b4
But again, I'm not actually seeing evidence of any of that. I tried to work backwards from the replica state possibly being unset, and this seems to be the only possible narrative for it, but I don't see how. A third thing that would need to happen is a lease request from this node, which would need to evaluate on the follower node. In the failure above, r21/3 on n3 is being GC-ed, and is also evaluating a proposal. Looking at the descriptor for r21, the replicas + leaseholders are elsewhere:
This "unset" replica state, coming about from an uninitialized replica being installed in the stores map (again, not even sure if that's what's happening), would also fit into what #58378 is seeing. I'm going to look at other things for now, maybe I'll think of something else while I do. I'd be very curious to see if anyone else has better luck (+cc @tbg). |
(roachtest).acceptance/version-upgrade failed on master@d891594d3c998f153b88f631e3c89ac7d12c2a6e:
Artifacts: /acceptance/version-upgrade
See this test on roachdash
The same kind of GC race as before, except it happened during teardown after the test had already successfully completed.
Another instance of this failure happening to me during merge: https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_Roachtest/2824498
I'm still unable to repro; I've had about 50 runs with nothing. I'm still trying to tease out how it's possible for an uninitialized range to evaluate a proposal. My only guess is that if we're somehow not able to read the range tombstone the replicaGC process writes out, we could misconstrue that and instantiate a replica without the right replica state. It's also not clear why r21 (and r84) is even being GC-ed. Somewhat surprisingly, it's those very two replicas that are GC-ed in both failures, but there aren't any indications as to why (no merges/splits/change triggers). Actually, as I type that out, maybe we aren't logging things the same way we do in 21.1. I'll double check; I was able to see the same GC behavior for r21 and r84 on successful runs, so at least that should be easier to track down (and maybe hints at what's happening here).
So at least that's how the evaluation is happening. |
(roachtest).acceptance/version-upgrade failed on master@0dd303e2234c5efe21ba242aff74a5edfd5f5c40:
Artifacts: /acceptance/version-upgrade
See this test on roachdash
Looks like the same thing.
Hmm, latest (unverified) shower thought on how we could be running into this bug.
The close occurrence of replicaGC logs is explainable: r42/3 should be GC-ed. We'd only trigger this assertion if we evaluated anything against that replica, like a lease request, which we also know is happening (#62267 (comment)). I'll try writing tests for it, and for #58378. I think they're the same issue.
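For reference, here's a rough sketch of the kind of check that would trip here; this is an analogue with made-up names, not the real kvserver assertion: the finalized version gate says every replica must be using the applied state key, but the lease request evaluates on a replica whose in-memory state predates the migration.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical versions of the pieces involved: a cluster-version gate that
// has been finalized, and a replica whose in-memory state was never migrated.
type clusterSettings struct{ truncatedAndRangeAppliedStateActive bool }

type replica struct {
	usingAppliedStateKey bool
}

// evaluateLease stands in for request evaluation: once the version gate is
// active, evaluating on a replica that isn't using the applied state key is
// the inconsistency the assertion in this failure complains about.
func evaluateLease(st clusterSettings, r *replica) error {
	if st.truncatedAndRangeAppliedStateActive && !r.usingAppliedStateKey {
		return errors.New("not using applied state key in v21.1") // assertion analogue
	}
	return nil
}

func main() {
	st := clusterSettings{truncatedAndRangeAppliedStateActive: true}
	staleReplica := &replica{usingAppliedStateKey: false} // e.g. the un-GC-ed r42/3
	fmt.Println(evaluateLease(st, staleReplica))
}
```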
Fixes cockroachdb#58378. Fixes cockroachdb#62267. Previously it was possible for us to have replicas in-memory, with pre-migrated state, even after a migration was finalized. This led to the kind of badness we were observing in cockroachdb#62267, where it appeared that a replica was not using the applied state key despite us having migrated into it (see TruncatedAndRangeAppliedState, introduced in cockroachdb#58088).

To see how, consider the following set of events:
- Say r42 starts off on n1, n2, and n3.
- n3 flaps, and so we place a replica for r42 on n4.
- n3's replica, r42/3, is now GC-able, but still un-GC-ed.
- We run the applied state migration, first migrating all ranges into it and then purging outdated replicas.
- We should want to purge r42/3, because it's un-migrated and evaluating anything on it (say a lease request) is unsound: we've bumped version gates that tell the kvserver to always expect post-migration state.
- What happens when we try to purge r42/3? Prior to this PR, if it didn't have a replica version, we'd skip over it (!).
- Was it possible for r42/3 to not have a replica version? Shouldn't it have been accounted for when we migrated all ranges? No, and that's precisely why the migration infrastructure purges outdated replicas. The migrate request only returns once it's applied on all followers; in our example that wouldn't include r42/3 since it was no longer one.
- The stop-gap in cockroachdb#60429 made it so that we didn't GC r42/3, when we should've been doing the opposite. When iterating over a store's replicas for purging purposes, an empty replica version is fine and expected; we should interpret it as a signal that we're dealing with a replica that was obviously never migrated (to even start using replica versions in the first place). Because it didn't have a valid replica version installed, we can infer that it's soon to be GC-ed (else we wouldn't have been able to finalize the applied state + replica version migration).
- The conditions above made it possible for us to evaluate requests on replicas with migration state out of date relative to the store's version.
- Boom.

Release note: None
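A minimal sketch of the purging behavior described above, with illustrative types rather than the real kvserver API: a missing replica version is treated as purgeable instead of being skipped.

```go
package main

import "fmt"

// roachpbVersion is an illustrative stand-in for a cluster/replica version.
type roachpbVersion struct{ Major, Minor int32 }

type replica struct {
	rangeID int
	version *roachpbVersion // nil: the replica never had a version installed
}

// shouldPurge models the fixed iteration logic: a versionless replica can only
// be a stale, soon-to-be-GC-ed copy, so it must be purged rather than skipped.
func shouldPurge(r *replica, migrationVersion roachpbVersion) bool {
	if r.version == nil {
		return true // previously this case was skipped
	}
	return r.version.Major < migrationVersion.Major ||
		(r.version.Major == migrationVersion.Major && r.version.Minor < migrationVersion.Minor)
}

func main() {
	target := roachpbVersion{Major: 21, Minor: 1}
	replicas := []*replica{
		{rangeID: 42, version: nil}, // the r42/3 case from the commit message
		{rangeID: 7, version: &roachpbVersion{Major: 20, Minor: 2}},
		{rangeID: 9, version: &roachpbVersion{Major: 21, Minor: 1}},
	}
	for _, r := range replicas {
		fmt.Printf("r%d purge=%v\n", r.rangeID, shouldPurge(r, target))
	}
}
```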
60835: kv: bump timestamp cache to Pushee.MinTimestamp on PUSH_ABORT r=nvanbenschoten a=nvanbenschoten

Fixes #60779. Fixes #60580. We were only checking that the batch header timestamp was equal to or greater than this pushee's min timestamp, so this is as far as we can bump the timestamp cache.

62832: geo: minor performance improvement for looping over edges r=otan a=andyyang890

This patch slightly improves the performance of many spatial builtins by storing the number of edges used in for-loop conditions in a variable. We discovered this was taking a lot of time when profiling the point-in-polygon optimization.

Release note: None

62838: kvserver: purge gc-able, unmigrated replicas during migrations r=irfansharif a=irfansharif

Fixes #58378. Fixes #62267. (The full commit message is quoted in the PR comment above.)

62839: zonepb: make subzone DiffWithZone more accurate r=ajstorm a=otan

* Subzones may be defined in a different order. We did not take this into account, which can cause bugs when e.g. ADD REGION adds a subzone at the end rather than in the old "expected" location in the subzones array. This has been fixed by comparing subzones using an unordered map.
* The ApplyZoneConfig we previously did overwrote subzone fields on the original subzone array element, meaning that if there was a mismatch it would not be reported through validation. This is now fixed by applying the expected zone config to *zonepb.NewZoneConfig() instead.
* Added logic to only check zone config matches for subzones from active subzone IDs.
* Improved the error messaging when a subzone config is mismatched: namely, added index and partitioning information and differentiated between missing fields and missing/extraneous zone configs.

Resolves #62790.

Release note (bug fix): Fixed validation bugs during ALTER TABLE ... SET LOCALITY / crdb_internal.validate_multi_region_zone_config where validation errors could occur when the database of a REGIONAL BY ROW table has a new region added. Also fixed a validation bug where partition zone config mismatches were not caught.

62872: build: use -json for RandomSyntax test r=otan a=rafiss

I'm hoping this will help out with an issue where the test failures seem to be missing helpful logs.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Andy Yang <[email protected]>
Co-authored-by: irfan sharif <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
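As an aside on the #62832 change quoted above, the shape of that optimization is just hoisting a repeated edge-count call out of the loop condition. Here's a hedged sketch with a stand-in `shape` interface and `polyline` type (not the actual geo package API):

```go
package main

import "fmt"

// shape is a stand-in for an S2-style geometry with a potentially non-trivial
// edge-count accessor; the real types in the geo package differ.
type shape interface {
	NumEdges() int
	Edge(i int) [2]float64
}

// polyline is a toy implementation used only to exercise the two loops below.
type polyline [][2]float64

func (p polyline) NumEdges() int         { return len(p) - 1 }
func (p polyline) Edge(i int) [2]float64 { return p[i] }

// Before: the loop condition calls s.NumEdges() on every iteration.
func visitEdgesSlow(s shape) (n int) {
	for i := 0; i < s.NumEdges(); i++ {
		_ = s.Edge(i)
		n++
	}
	return n
}

// After: the edge count is computed once and reused, which is the gist of the
// change described in the commit message above.
func visitEdgesFast(s shape) (n int) {
	numEdges := s.NumEdges()
	for i := 0; i < numEdges; i++ {
		_ = s.Edge(i)
		n++
	}
	return n
}

func main() {
	p := polyline{{0, 0}, {1, 0}, {1, 1}, {0, 1}}
	fmt.Println(visitEdgesSlow(p), visitEdgesFast(p))
}
```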
(roachtest).acceptance/version-upgrade failed on master@24e76d399047857a3230cd0c3dd2dcea42c4292a:
Artifacts: /acceptance/version-upgrade
Related:
- #58307 roachtest: acceptance/version-upgrade failed: invalid attempted write of database descriptor (A-schema-descriptors C-bug C-test-failure O-roachtest branch-release-20.2)
- #53812 roachtest: acceptance/version-upgrade failed (C-test-failure O-roachtest O-robot branch-release-19.2)
See this test on roachdash