storage: lazily load storage.Replica objects #31663
base: master
Conversation
The performance at startup is markedly better. With 700,000 ranges, before: 189.3s, 5GiB RSS. If we can do something more efficient with loading the MVCC stats, I think we can substantially reduce the time further.
Can we parallelize loading the replicas? Is there some blocker there that I'm not seeing?
I tried to parallelize by NumProc (or whatever the runtime variable is) and it didn't seem to have much of an impact. That was a while ago, though, and I didn't pursue it, so I think the experiment should be retried. If there were any way to read all of the MVCCStats in a linear fashion and then match them to the range descriptors post-hoc in one pass, it would be much faster. One problem is that the MVCCStats have been combined into an aggregate state and there's an inline migration which chooses which way to load them. Again, I haven't started down that path in earnest, so this is just what I vaguely recall from a few months ago when I was first experimenting with this. I think, though, that there is room for improvement, and this is already a big improvement. The push here is to allow the replicas to be made non-resident again after being resident so we can keep memory used for replicas at a relatively small size.
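As a rough illustration of the kind of bounded parallel load being discussed, here is a minimal Go sketch. All names here (rangeDesc, loadReplica, loadReplicasParallel) are hypothetical stand-ins, not the storage package API; the point is only the GOMAXPROCS-bounded worker fan-out.

```go
package loadsketch

import (
	"runtime"
	"sync"
)

// rangeDesc and replica are stand-ins for the real range descriptor and
// storage.Replica types; loadReplica represents the per-range work
// (reading stats, building in-memory state) done at startup.
type rangeDesc struct{ RangeID int64 }

type replica struct{ desc rangeDesc }

func loadReplica(d rangeDesc) *replica { return &replica{desc: d} }

// loadReplicasParallel fans range descriptors out to GOMAXPROCS workers
// and collects the initialized replicas. Result ordering is not preserved.
func loadReplicasParallel(descs []rangeDesc) []*replica {
	workers := runtime.GOMAXPROCS(0)
	in := make(chan rangeDesc, workers)
	out := make(chan *replica, workers)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for d := range in {
				out <- loadReplica(d)
			}
		}()
	}
	go func() {
		for _, d := range descs {
			in <- d
		}
		close(in)
		wg.Wait()
		close(out)
	}()

	replicas := make([]*replica, 0, len(descs))
	for r := range out {
		replicas = append(replicas, r)
	}
	return replicas
}
```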
Force-pushed 2dbc1b9 to 5654898.
I made a second commit which for some reason is not showing up here even though it definitely pushed to this branch. Will attempt to solve this strange GitHub issue tomorrow. But the good news is that I was able to reduce the
Github outage, see https://blog.github.com/2018-10-21-october21-incident-report/

I wasn't able to glean this from the commit message (perhaps augment it a bit), but when are we making replicas non-resident (i.e., when do you kick them out of memory)? Is it when they quiesce, or is there some more active approach that limits the number of replicas that can be active simultaneously? I'm asking because in #31125 we see that ~500k is definitely too many replicas to have in memory, unquiesced. We also see in #31580 that even with quiescence being an option, a hiccup on node liveness can lock the system in a state in which all replicas are unquiesced, so running with lots of replicas can be a fragile state. (PS: there is probably still some low-hanging performance fruit there that may make 500k ranges actually "manageable".)
The third commit further optimizes to 14.9s by parallelizing the two scans.
@tschottdorf Nothing in this commit kicks replicas out of memory. That's another PR, which must refcount active replicas in order to safely make them non-resident. The way it works is to only kick a replica out when its QPS is zero; correctly dealing with stats otherwise would be difficult or impossible. I'll need to look at the cited issues more closely to determine what, if anything, this work can do to prevent the badness you reference when many replicas aren't quiescable.
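For concreteness, a minimal sketch of the follow-up idea described here (refcount active replicas and only evict at zero QPS). The shim type, fields, and methods below are assumptions made for the sketch, not the code in that other PR.

```go
package residency

import "sync"

// shim is an illustrative stand-in: it tracks how many callers currently
// hold the in-memory Replica and a recent QPS measurement.
type shim struct {
	mu       sync.Mutex
	refs     int     // callers currently using the resident Replica
	qps      float64 // recent queries-per-second for this range
	resident bool    // whether the heavyweight Replica is in memory
}

// acquire marks the replica as in use; release undoes it.
func (s *shim) acquire() { s.mu.Lock(); s.refs++; s.mu.Unlock() }
func (s *shim) release() { s.mu.Lock(); s.refs--; s.mu.Unlock() }

// maybeEvict drops the in-memory state only when nothing references the
// replica and it has seen no traffic, mirroring the "only kick a replica
// out when its QPS is zero" rule described above.
func (s *shim) maybeEvict() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.resident && s.refs == 0 && s.qps == 0 {
		s.resident = false // the Replica pointer would be cleared here
		return true
	}
	return false
}
```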
Force-pushed 178ed53 to 2ff33fe.
Reviewed 14 of 18 files at r1, 3 of 4 files at r2, 6 of 6 files at r3, 1 of 1 files at r4.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/client_status_test.go, line 82 at r2 (raw file):
// Subset of ranges including first. {"a", "c", 1, 3}, // middle subset of ranges.
nit: s/m/M/
pkg/storage/replica.go, line 4866 at r2 (raw file):
continue } // In common operation, we only quiesce when all followers are
I think this is skewing with #31112.
pkg/storage/replica_shim.go, line 30 at r2 (raw file):
) // ReplicaShim holds either a RangeDescriptor or an initialized
How does this interact with ReplicaPlaceholder objects? How does it interact with uninitialized replicas? Is there a way we can merge all of these concepts, or at least make sure their relationship to one another is easy to understand?
pkg/storage/replica_shim.go, line 37 at r2 (raw file):
replica *Replica // non-nil if resident // Fields below not valid if replica is non-nil.
I think it's worth pulling this into a separate struct so that we can easily zero it out when we pull a replica into memory.
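To show what pulling the non-resident fields into a separate struct might look like, here is a hedged Go sketch. The field and type names are illustrative, not the actual ReplicaShim definition; the benefit is that residency transitions clear everything with one assignment.

```go
package shimsketch

import "sync"

// Illustrative stand-ins; these are not the real storage types.
type Replica struct{}
type RangeDescriptor struct{ RangeID int64 }
type MVCCStats struct{ KeyBytes, ValBytes int64 }
type ZoneConfig struct{ NumReplicas int32 }

// nonResidentState groups the fields that are only meaningful while the
// Replica is not in memory, so they can be dropped in one assignment.
type nonResidentState struct {
	desc  RangeDescriptor
	stats MVCCStats
	zone  *ZoneConfig
}

// ReplicaShim as sketched here holds either a resident Replica or the
// non-resident state, never both.
type ReplicaShim struct {
	mu          sync.Mutex
	replica     *Replica          // non-nil iff resident
	nonResident *nonResidentState // non-nil iff not resident
}

// setReplica makes the shim resident and zeroes the non-resident fields
// in a single step, which is the benefit of the nested struct.
func (rs *ReplicaShim) setReplica(r *Replica) {
	rs.mu.Lock()
	defer rs.mu.Unlock()
	rs.replica = r
	rs.nonResident = nil
}
```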
pkg/storage/replica_shim.go, line 199 at r2 (raw file):
Leaseholder: holder, LeaseType: roachpb.LeaseEpoch, Quiescent: true,
// non-resident replicas are always quiescent
pkg/storage/replica_shim.go, line 217 at r2 (raw file):
} func (rs *ReplicaShim) startKey() roachpb.RKey {
Mention rangeKeyItem here.
pkg/storage/store.go, line 367 at r2 (raw file):
} else { var err error repl, err = rs.store.GetReplica(shim.RangeID())
Shouldn't we avoid this lookup for resident replicas?
pkg/storage/store.go, line 2036 at r2 (raw file):
} rep.SetZoneConfig(shim.zone) shim.replica = rep
This seems like it should be a method on ReplicaShim, along with clearing the other fields.
pkg/storage/store.go, line 2044 at r2 (raw file):
} // mustGetReplica fetches a replica by Range ID. Any error
I would expect a method called "mustXXX" to always fatal if it failed. This has a strange contract where it only fatals for some errors.
pkg/storage/store.go, line 1319 at r4 (raw file):
// creating replica shims. statsMap := map[roachpb.RangeID]enginepb.MVCCStats{} if _, err = engine.MVCCIterate(ctx, s.engine,
A few nits here that should help speed things up:
- MVCCIterate is more heavy-weight than you need. You can avoid a lot of memory copying by using an engine.Iterator directly and decoding keys and protos straight from unsafe buffers.
- Pull the protos out of the loop to avoid allocations there.
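A hedged sketch of the second nit (reuse the decoded proto across the loop and copy keys out of the iterator's reused buffer). The iterator interface and mvccStats type here are stand-ins, not the real engine.Iterator or enginepb.MVCCStats API.

```go
package iterscan

// iterator is a stand-in for a low-level engine iterator that exposes
// unsafe (reused) key/value buffers.
type iterator interface {
	Valid() bool
	Next()
	UnsafeKey() []byte   // buffer is only valid until the next call to Next
	UnsafeValue() []byte // same caveat
}

// mvccStats stands in for the stats proto, with a trivial decoder.
type mvccStats struct{ KeyBytes, ValBytes int64 }

func (s *mvccStats) Reset()                   { *s = mvccStats{} }
func (s *mvccStats) Unmarshal(b []byte) error { return nil } // placeholder decode

// collectStats decodes one stats value per key while allocating the proto
// only once, outside the loop, and copying keys out of the iterator's
// reused buffer before storing them.
func collectStats(it iterator) (map[string]mvccStats, error) {
	out := map[string]mvccStats{}
	var stats mvccStats // reused across iterations: no per-key allocation
	for ; it.Valid(); it.Next() {
		stats.Reset()
		if err := stats.Unmarshal(it.UnsafeValue()); err != nil {
			return nil, err
		}
		out[string(it.UnsafeKey())] = stats // string() copies the key bytes
	}
	return out, nil
}
```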
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 4866 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think this is skewing with #31112.
Yikes, thanks for pointing that out. This was a bad rebase outcome.
pkg/storage/replica_shim.go, line 30 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
How does this interact with ReplicaPlaceholder objects? How does it interact with uninitialized replicas? Is there a way we can merge all of these concepts, or at least make sure their relationship to one another is easy to understand?
It works with ReplicaPlaceholder objects in exactly the same way that Replica objects did – they both exist together in the replicaShims map. Uninitialized replicas are also embedded in ReplicaShims. There's a 1:1 correspondence between how Replicas were used/referenced by a Store and how ReplicaShims are now used/referenced. In other words, ReplicaShim has replaced Replica in all of the Store's top-level data structures.
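A small sketch of that relationship, under stated assumptions: the store-level map holds shims rather than replicas, and a Replica is only reachable through its shim. The field and type names are illustrative, not the actual Store definition.

```go
package storeshims

import "sync"

type RangeID int64

type Replica struct{} // full in-memory replica state in the real code

type ReplicaShim struct {
	rangeID RangeID
	replica *Replica // non-nil iff resident
}

type Store struct {
	mu struct {
		sync.RWMutex
		// replicaShims replaces what used to be a map of Replicas: every
		// replica, resident or not, initialized or not, has a shim here.
		replicaShims map[RangeID]*ReplicaShim
	}
}

// getReplicaShim is the only way to reach a Replica through this sketch's
// Store: callers go shim-first, then check residency.
func (s *Store) getReplicaShim(id RangeID) *ReplicaShim {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.mu.replicaShims[id]
}
```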
pkg/storage/replica_shim.go, line 37 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I think it's worth pulling this into a separate struct so that we can easily zero it out when we pull a replica into memory.
Done.
pkg/storage/replica_shim.go, line 199 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
// non-resident replicas are always quiescent
Done.
pkg/storage/replica_shim.go, line 217 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Mention rangeKeyItem here.
Done.
pkg/storage/store.go, line 367 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Shouldn't we avoid this lookup for resident replicas?
Easier to just call GetReplica than to have another if/else block here. It's cheap if the replica is already resident.
pkg/storage/store.go, line 2036 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This seems like it should be a method on ReplicaShim, along with clearing the other fields.
Done.
pkg/storage/store.go, line 2044 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I would expect a method called "mustXXX" to always fatal if it failed. This has a strange contract where it only fatals for some errors.
The problem here is the weirdness of the GetReplica contract, which returns an error if the replica isn't found, where the majority of callers actually just want to ignore that error and continue. Since RangeNotFoundError isn't considered a fatal error for most callers who want the mustGetReplica semantics, I've changed things around a bit here to just return a *Replica, which might be nil if not found. I don't know if this addresses your concern, but the point of this method is to avoid checking the error for RangeNotFoundError across 10 call sites.
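A minimal sketch of the contract being described, assuming illustrative names (getReplica, rangeNotFoundError) rather than the real store.go code: "not found" yields nil, and only unexpected errors are fatal.

```go
package storesketch

import "log"

type Replica struct{ RangeID int64 }

type rangeNotFoundError struct{ id int64 }

func (e *rangeNotFoundError) Error() string { return "range not found" }

// getReplica plays the role of Store.GetReplica. In the real store this
// would look the replica up; here it always reports "not found" to keep
// the sketch self-contained.
func getReplica(id int64) (*Replica, error) {
	return nil, &rangeNotFoundError{id: id}
}

// mustGetReplica returns the replica, or nil if it doesn't exist; any
// other error is treated as fatal, so callers don't have to special-case
// the not-found error at every call site.
func mustGetReplica(id int64) *Replica {
	repl, err := getReplica(id)
	if err != nil {
		if _, ok := err.(*rangeNotFoundError); ok {
			return nil
		}
		log.Fatalf("unexpected error fetching replica %d: %v", id, err)
	}
	return repl
}
```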
pkg/storage/store.go, line 1319 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
A few nits here that should help speed things up:
- MVCCIterate is more heavy-weight than you need. You can avoid a lot of memory copying by using an engine.Iterator directly and decoding keys and protos straight from unsafe buffers.
- Pull the protos out of the loop to avoid allocations there.
I'm going to punt on the changes to the iteration, but I've made the suggested changes around avoiding allocations and it seemed to have a slight positive impact.
Force-pushed 2ff33fe to ad0135d.
Reviewed 3 of 5 files at r5, 4 of 4 files at r7.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/replica_shim.go, line 30 at r2 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
It works with ReplicaPlaceholder objects in exactly the same way that Replica objects did – they both exist together in the replicaShims map. Uninitialized replicas are also embedded in ReplicaShims. There's a 1:1 correspondence between how Replicas were used/referenced by a Store and how ReplicaShims are now used/referenced. In other words, ReplicaShim has replaced Replica in all of the Store's top-level data structures.
So to visualize this, we now have the following states that a Replica can be in:
                    replicasByKey
                       (oneof)
                      /       \
             ReplicaShim     ReplicaPlaceholder
                (oneof)
               /       \
          Replica     nonResidentReplica
           (oneof)
          /       \
Replica.initialized   !Replica.initialized
I'm worried about this ever-growing complexity because it makes the code harder to reason about. Can any of these states be collapsed into the same level or even better, combined? For instance, is there a reason to keep the non-initialized Replica state if we also have a non-resident Replica state? It doesn't need to hold up this PR, but I think it would benefit us to think about this.
@benesch you've worked with this code most closely over the past few months. I'm curious if you have any suggestions here or strong feelings about what should and shouldn't be done.
pkg/storage/store.go, line 2044 at r2 (raw file):
Previously, spencerkimball (Spencer Kimball) wrote…
The problem here is the weirdness of the GetReplica contract, which returns an error if the replica isn't found, where the majority of callers actually just want to ignore that error and continue. Since RangeNotFoundError isn't considered a fatal error for most callers who want the mustGetReplica semantics, I've changed things around a bit here to just return a *Replica, which might be nil if not found. I don't know if this addresses your concern, but the point of this method is to avoid checking the error for RangeNotFoundError across 10 call sites.
Yeah, that's better.
I haven't given this a close review, but from the looks of it, it's the type of change that could introduce a number of issues. @benesch, could you work with @spencerkimball to vet this before it merges (once it's in, there's no way to "turn it off")? I'm thinking a roachtest run and a nightly stress run. For the roachtest run, unfortunately things are already pretty busted right now, but I hope that #31914 can improve the situation.
The speedup here is impressive, though I echo @tschottdorf's concern that this could be destabilizing. Let's hold off on merging this until we've got the roachtest stability issues under control. I think that is close, though it is hard to put a precise time bound on when that stability will be achieved.
Reviewable status: complete! 1 of 0 LGTMs obtained
Force-pushed ad0135d to d07351b.
Startup time is now down to 9.7s after reducing allocations and getting rid of migrations which had already run but were nevertheless being re-run anew at every startup.
I'll work with Nikhil to do roachtest and nightly stress runs.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/replica_shim.go, line 30 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
So to visualize this, we now have the following states that a Replica can be in:
                    replicasByKey
                       (oneof)
                      /       \
             ReplicaShim     ReplicaPlaceholder
                (oneof)
               /       \
          Replica     nonResidentReplica
           (oneof)
          /       \
Replica.initialized   !Replica.initialized
I'm worried about this ever-growing complexity because it makes the code harder to reason about. Can any of these states be collapsed into the same level or even better, combined? For instance, is there a reason to keep the non-initialized Replica state if we also have a non-resident Replica state? It doesn't need to hold up this PR, but I think it would benefit us to think about this.
@benesch you've worked with this code most closely over the past few months. I'm curious if you have any suggestions here or strong feelings about what should and shouldn't be done.
Yes, this graph is correct, except uninitialized replicas are not stored in replicasByKey; they're in the replicaShims map. The uninitialized replica map actually just provides a boolean for any given replica indicating whether the replica has been added yet to replicasByKey after being initialized. The reason is that once the replica's range descriptor has been set, we lose the context of whether the replica was previously uninitialized and hasn't yet been added to replicasByKey.
pkg/storage/store.go, line 2044 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yeah, that's better.
Done.
@tschottdorf @petermattis: both nightly stress and roachtest seem to be fine with this change. I'm relying on @benesch to interpret the non-culpability around pre-existing roachtest issues: https://teamcity.cockroachdb.com/viewLog.html?buildId=988969&tab=queuedBuildOverviewTab
@petermattis, @tbg should this PR still be in limbo? It's not getting any younger.
Force-pushed d07351b to dd78004.
Tobi and I are concerned about the potential stability impact of this change, especially since we're still handling the fallout from 2.1 changes like the contention queue and async writes. Since we're not currently planning to push for higher scalability in this dimension for 2.2, I'd rather put this PR on hold even if that means rewriting it when we're ready to tackle the issue. When that day comes, I'd like to see a larger refactoring that unifies replica shims with initialized and uninitialized replicas and replica placeholders instead of bolting a new not-quite-replica type on top of the multiple existing ones.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
storage: lazily load storage.Replica objects

More than several tens of thousands of Replica objects can take substantial time at server startup to initialize. This change introduces a ReplicaShim object which contains a pointer to the Replica, which is non-nil if resident. Otherwise, the shim holds a set of other values, including a ref to the appropriate zone config, MVCCStats, and a handful of other properties necessary to accurately compute stats even when a replica is not resident.

Further improve performance when initializing ReplicaShim objects at store startup by parallelizing reading the MVCC stats and the range descriptors.

Improve replica shim load performance by loading all MVCCStats values during a simple scan over all local range ID keys and adding the stats values to a map. The stats map is consulted when loading the range descriptors to initialize each ReplicaShim.

Release note (performance improvement): faster startup when a node contains many replicas.

Release note: None
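To illustrate the load strategy the commit message describes, here is a hedged Go sketch: one scan collects stats into a map keyed by range ID, another reads the descriptors, the two run concurrently, and each descriptor is then joined with its stats to build a shim. All types and the scan callbacks are illustrative stand-ins, not the actual storage.Store code.

```go
package startsketch

import "sync"

type rangeID int64

type mvccStats struct{ KeyBytes, ValBytes int64 }

type rangeDescriptor struct{ RangeID rangeID }

type replicaShim struct {
	desc  rangeDescriptor
	stats mvccStats
}

// loadShims runs the stats scan and the descriptor scan concurrently,
// then joins each descriptor with its stats from the map to build a shim.
func loadShims(
	scanStats func() map[rangeID]mvccStats,
	scanDescriptors func() []rangeDescriptor,
) []*replicaShim {
	var statsMap map[rangeID]mvccStats
	var descs []rangeDescriptor

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); statsMap = scanStats() }()
	go func() { defer wg.Done(); descs = scanDescriptors() }()
	wg.Wait()

	shims := make([]*replicaShim, 0, len(descs))
	for _, d := range descs {
		shims = append(shims, &replicaShim{desc: d, stats: statsMap[d.RangeID]})
	}
	return shims
}
```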
Force-pushed dd78004 to b60f881.
The comment below goes into more detail, but here's the TL;DR:

Problems:
1. Right now, answering "is this Replica object unused?" is impossible.
2. Right now, replicaIDs change on existing replicas, which is very complex to reason about.
3. Right now, it'll be difficult to split replicas into different states because that requires answering 1 during state transitions.

Proposed:
1. Use a clean API to refcount Replica usage, and
2. allow initiating teardown "basically whenever" without blocking (think merge trigger),
3. so that the replica clears out quickly,
4. which in turn solves 1.
5. Then we can solve 2 because we'll be able to replace the Replica object in the Store whenever the replicaID would previously change in-place (this will not be trivial, but I hope it can be done),
6. and we should also be able to do 3 (again, not trivial, but a lot harder right now).

I expect the replication code to benefit from 6) as the Raft instance on a Replica would never change.

This PR is a toy experiment for 1. It certainly wouldn't survive contact with the real code, but it's sufficient to discuss this project and iterate on the provisional Guard interface.

----

GuardedReplica is the external interface through which callers interact with a Replica. By acquiring references to the underlying Replica object while it is being used, it allows safe removal of Replica objects and/or their underlying data. This is an important fundamental for five reasons:

Today, we use no such mechanism, though this is largely due to failing in the past to establish one[1]. The status quo "works" by maintaining a destroyStatus inside of Replica.mu, which is checked in a few places such as before proposing a command or serving a read. Since these checks are only "point in time", and nothing prevents the write to the status from occurring just a moment after the check, there is a high cognitive overhead to reasoning about the possible outcomes. In fact, in the case in which things could go bad most spectacularly, namely removing a replica including data, we hold essentially all of the locks available to us and rely on the readOnlyCmdMu (which we would rather get rid of). This, then, is the first reason for proposing this change: make the replica lifetime easier to reason about and establish confidence that the Replica can't just disappear out from under us.

The second motivator is ReplicaID transitions, which can be extremely complicated. For example, a Replica may

1. start off as an uninitialized Replica with ReplicaID 12 (i.e. no data),
2. receive a preemptive snapshot, which confusingly results in an initialized Replica with ReplicaID 12 (though preemptive snapshots nominally should result in a preemptive replica -- one with ReplicaID zero),
3. update its ReplicaID to 18 (i.e. being added back to the Raft group),
4. get ReplicaGC'ed because
5. it blocks a preemptive snapshot, which now recreates it.

In my point of view, changing the ReplicaID for a live Replica is a bad idea and incurs too much complexity. An architecture in which Replica objects have a single ReplicaID throughout their lifetime is conceptually much simpler, but it is also much more straightforward to maintain since it does away with a whole class of concurrency that needs to be tamed in today's code, and which may have performance repercussions. On the other hand, replicaID changes are not frequent and only need to be moderately fast. The alternative is to instantiate a new incarnation of the Replica whenever the ReplicaID changes.
The difficult part about this is destroying the old Replica; since Replica provides proper serialization, we mustn't have commands in flight in two instances for the same data (and generally we want to avoid even having to think about concurrent use of old incarnations). This is explored here. The above history would read something like this:

1. start off as an uninitialized Replica R with ReplicaID 12 (no data)
2. preemptive snapshot is received: tear down R, instantiate R'
3. ReplicaID is updated: tear down R', instantiate R''
4. R'' is marked for ReplicaGC: replace with placeholder R''' in Store, tear down R'', wait for references to drain, remove the data, remove R'''.
5. instantiate R''' (no change from before).

A third upshot is almost visible in the above description. Once we can re-instantiate cleanly on ReplicaID-based state changes, we might as well go ahead and pull apart various types of Replica:

- preemptive snapshots (though we may replace those by learner replicas in the future[2])
- uninitialized Replicas (has replicaID, but no data)
- initialized Replicas (has replicaID and data)
- placeholders (have no replicaID and no data)

To simplify replicaGC and to remove the preemptive snapshot state even before we stop using preemptive snapshots, we may allow placeholders to hold data (but with no means to run client requests against it). Once there, reducing the memory footprint for "idle" replicas by only having a "shim" in memory (replacing the "full" version) until the next request comes in becomes a natural proposition by introducing a new replica state that, upon access, gets promoted to a full replica (which loads the full in-memory state from disk), picking up past attempts at this which failed due to the technical debt present at the moment[3].

[1]: cockroachdb#8630
[2]: cockroachdb#34058
[3]: cockroachdb#31663

Release note: None
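As a toy illustration of the refcounting idea in the comment above: callers acquire a guard before using the Replica and release it when done, and teardown can start at any time, blocking only until in-flight references drain. This is not the provisional GuardedReplica interface from that experiment; every name below is an assumption made for the sketch.

```go
package guardsketch

import (
	"errors"
	"sync"
)

var errDestroyed = errors.New("replica destroyed")

type guardedReplica struct {
	mu        sync.Mutex
	drained   *sync.Cond
	refs      int
	destroyed bool
}

func newGuardedReplica() *guardedReplica {
	g := &guardedReplica{}
	g.drained = sync.NewCond(&g.mu)
	return g
}

// acquire registers a reference; it fails once teardown has begun.
func (g *guardedReplica) acquire() error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.destroyed {
		return errDestroyed
	}
	g.refs++
	return nil
}

// release drops a reference and wakes the destroyer when the count drains.
func (g *guardedReplica) release() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.refs--
	if g.refs == 0 {
		g.drained.Broadcast()
	}
}

// destroy can be initiated at any time; new acquisitions fail immediately
// and the caller blocks only until in-flight references drain.
func (g *guardedReplica) destroy() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.destroyed = true
	for g.refs > 0 {
		g.drained.Wait()
	}
	// At this point it would be safe to remove the underlying data.
}
```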
More than several tens of thousands of Replica objects can take substantial time at server startup to initialize. This change introduces a ReplicaShim object which contains a pointer to the Replica, which is non-nil if resident. Otherwise, the shim holds a set of other values, including a ref to the appropriate zone config, MVCCStats, and a handful of other properties necessary to accurately compute stats even when a replica is not resident.

Release note (performance improvement): faster startup when a node contains many replicas.