
storage: lazily load storage.Replica objects #31663

Open
spencerkimball wants to merge 1 commit into master from replica-shims

Conversation

spencerkimball
Member

Tens of thousands of Replica objects or more can take substantial time to initialize at server startup. This change introduces a ReplicaShim object which contains a pointer to the Replica that is non-nil if the replica is resident. Otherwise, the shim holds a set of other values, including a reference to the appropriate zone config, the MVCCStats, and a handful of other properties necessary to accurately compute stats even when a replica is not resident.

Release note (performance improvement): faster startup when a node contains
many replicas.
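
For orientation, a minimal sketch of the shim described above (the desc/zone/stats field names are illustrative; only the resident/non-resident split is taken from the description):

// ReplicaShim wraps either a resident *Replica or just enough state to
// answer stats queries for a non-resident range. Sketch only; assumed to
// live in pkg/storage alongside Replica.
type ReplicaShim struct {
	replica *Replica // non-nil if the replica is resident in memory

	// The fields below are only consulted when replica == nil. They carry
	// enough state (descriptor, zone config, MVCC stats) to compute
	// store-level metrics without instantiating the full Replica.
	desc  *roachpb.RangeDescriptor
	zone  *config.ZoneConfig
	stats enginepb.MVCCStats
}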

@spencerkimball spencerkimball requested review from nvanbenschoten and a team October 20, 2018 01:32

@spencerkimball
Member Author

spencerkimball commented Oct 20, 2018

The performance at startup is markedly better. With 700,000 ranges:

Before: 189.3s, 5GiB RSS
After: 45.8s, 638MiB RSS

If we can do something more efficient with loading the MVCC stats, I think we can substantially reduce the time further.

@benesch
Contributor

benesch commented Oct 20, 2018

The performance at startup is markedly better. With 700,000 ranges:

Before: 189.3s, 5GiB RSS
After: 45.8s, 638MiB RSS

If we can do something more efficient with loading the MVCC stats, I think we can substantially reduce the time further.

Can we parallelize loading the replicas? Is there some blocker there that I'm not seeing?

@spencerkimball
Member Author

I tried to parallelize by NumProc (or whatever the runtime variable is) and it didn't seem to have much of an impact. That was a while ago, though, and I didn't pursue it, so I think the experiment should be retried. If there were any way to read all of the MVCCStats in a linear fashion and then match them to the range descriptors post hoc in one pass, it would be much faster. One problem is that the MVCCStats have been combined into an aggregate state and there's an inline migration which chooses which way to load them. Again, I haven't started down that path in earnest, so this is just what I vaguely recall from a few months ago when I was first experimenting with this. I think there is still room for improvement, but this is already a big improvement.

The push here is to allow the replicas to be made non-resident again after being resident so we can keep memory used for replicas at a relatively small size.

@spencerkimball spencerkimball force-pushed the replica-shims branch 3 times, most recently from 2dbc1b9 to 5654898 Compare October 21, 2018 23:35
@spencerkimball
Member Author

I made a second commit which for some reason is not showing up here even though it definitely pushed to this branch. Will attempt to solve this strange GitHub issue tomorrow. But the good news is that I was able to reduce the 45.8s to 18.8s by scanning all of the range local keys in an initial pass to fetch MVCCStats values instead of doing a point lookup for stats while scanning the range descriptors.
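
Roughly, the shape of that optimization is the following sketch (the iteration helpers and addReplicaShim are hypothetical stand-ins, not the commit's actual code):

// loadReplicaShims sketches the two-pass load: one linear scan over the
// range-ID-local keys to collect MVCCStats into a map, then a scan over the
// range descriptors that consults the map instead of doing a per-range
// point lookup for stats.
func (s *Store) loadReplicaShims(ctx context.Context) error {
	statsMap := make(map[roachpb.RangeID]enginepb.MVCCStats)

	// Pass 1: collect the stats for every range in a single scan.
	if err := iterateRangeIDLocalStats(ctx, s.engine, func(id roachpb.RangeID, ms enginepb.MVCCStats) {
		statsMap[id] = ms
	}); err != nil {
		return err
	}

	// Pass 2: scan the range descriptors, building one shim per range and
	// attaching the prefetched stats.
	return iterateRangeDescriptors(ctx, s.engine, func(desc roachpb.RangeDescriptor) error {
		s.addReplicaShim(&ReplicaShim{desc: &desc, stats: statsMap[desc.RangeID]})
		return nil
	})
}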

@tbg
Member

tbg commented Oct 22, 2018

I made a second commit which for some reason is not showing up here even though it definitely pushed to this branch.

Github outage, see https://blog.github.com/2018-10-21-october21-incident-report/

I wasn't able to glean this from the commit message (perhaps augment it a bit), but when are you making replicas non-resident (i.e., when do you kick them out of memory)? Is it when they quiesce, or is there some more active approach that limits the number of replicas that can be active simultaneously? I'm asking because in #31125 we see that ~500k is definitely too many replicas to have in memory, unquiesced. We also see in #31580 that even with quiescence being an option, a hiccup on node liveness can lock the system in a state in which all replicas are unquiesced, so running with lots of replicas can be a fragile state. (PS: there is probably still some low-hanging performance fruit there that might make 500k ranges actually "manageable".)

@spencerkimball
Member Author

The third commit further optimizes to 14.9s by parallelizing the two scans.
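
For illustration, the concurrent version looks something like this sketch (errgroup from golang.org/x/sync/errgroup is used for brevity, and the scan helpers are hypothetical; the commit itself may use the repo's own concurrency utilities):

g, gCtx := errgroup.WithContext(ctx)

var statsMap map[roachpb.RangeID]enginepb.MVCCStats
g.Go(func() error {
	var err error
	statsMap, err = scanAllRangeStats(gCtx, s.engine) // hypothetical helper
	return err
})

var descs []roachpb.RangeDescriptor
g.Go(func() error {
	var err error
	descs, err = scanAllRangeDescriptors(gCtx, s.engine) // hypothetical helper
	return err
})

if err := g.Wait(); err != nil {
	return err
}
// Join the two results: each descriptor becomes a shim carrying its stats.
for i := range descs {
	s.addReplicaShim(&ReplicaShim{desc: &descs[i], stats: statsMap[descs[i].RangeID]})
}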

@spencerkimball
Member Author

I made a second commit which for some reason is not showing up here even though it definitely pushed to this branch.

Github outage, see https://blog.github.com/2018-10-21-october21-incident-report/

I wasn't able to glean this from the commit message (perhaps augment it a bit), but when are you making replicas non-resident (i.e., when do you kick them out of memory)? Is it when they quiesce, or is there some more active approach that limits the number of replicas that can be active simultaneously? I'm asking because in #31125 we see that ~500k is definitely too many replicas to have in memory, unquiesced. We also see in #31580 that even with quiescence being an option, a hiccup on node liveness can lock the system in a state in which all replicas are unquiesced, so running with lots of replicas can be a fragile state. (PS: there is probably still some low-hanging performance fruit there that might make 500k ranges actually "manageable".)

@tschottdorf Nothing in this commit kicks replicas out of memory. That's another PR, which must ref count active replicas in order to safely make them non-resident. The way it works is to only kick a replica out when its qps is zero. Correctly dealing with stats otherwise would be difficult or impossible. I'll need to look at the cited issues more closely to determine what, if anything, this work can do to prevent the badness you reference when many replicas aren't quiescable.
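
To make the intended eviction condition concrete, here is a loose sketch (the refCount, qps, mu, and snapshotReplicaState names are hypothetical; the actual ref-counting lives in that separate PR):

// maybeMakeNonResident drops the in-memory Replica once nothing references
// it and it is serving no traffic, leaving the shim's cached desc/zone/stats
// in place so store-level stats stay accurate.
func (rs *ReplicaShim) maybeMakeNonResident() {
	rs.mu.Lock()
	defer rs.mu.Unlock()
	if rs.replica == nil {
		return // already non-resident
	}
	// Only evict when no callers hold a reference and the replica is idle;
	// otherwise keeping the shim's stats accurate would be hard.
	if rs.refCount > 0 || rs.qps > 0 {
		return
	}
	rs.snapshotReplicaState() // hypothetical: copy desc/zone/stats out of the Replica
	rs.replica = nil
}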

@spencerkimball spencerkimball force-pushed the replica-shims branch 2 times, most recently from 178ed53 to 2ff33fe Compare October 22, 2018 13:44
Member

@nvanbenschoten nvanbenschoten left a comment


Reviewed 14 of 18 files at r1, 3 of 4 files at r2, 6 of 6 files at r3, 1 of 1 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/client_status_test.go, line 82 at r2 (raw file):

		// Subset of ranges including first.
		{"a", "c", 1, 3},
		// middle subset of ranges.

nit: s/m/M/


pkg/storage/replica.go, line 4866 at r2 (raw file):

			continue
		}
		// In common operation, we only quiesce when all followers are

I think this is skewing with #31112.


pkg/storage/replica_shim.go, line 30 at r2 (raw file):

)

// ReplicaShim holds either a RangeDescriptor or an initialized

How does this interact with ReplicaPlaceholder objects? How does it interact with uninitialized replicas? Is there a way we can merge all of these concepts, or at least make sure their relationship to one-another is easy to understand?


pkg/storage/replica_shim.go, line 37 at r2 (raw file):

	replica *Replica // non-nil if resident

	// Fields below not valid if replica is non-nil.

I think it's worth pulling this into a separate struct so that we can easily zero it out when we pull a replica into memory.


pkg/storage/replica_shim.go, line 199 at r2 (raw file):

		Leaseholder:     holder,
		LeaseType:       roachpb.LeaseEpoch,
		Quiescent:       true,

// non-resident replicas are always quiescent


pkg/storage/replica_shim.go, line 217 at r2 (raw file):

}

func (rs *ReplicaShim) startKey() roachpb.RKey {

mention rangeKeyItem here.


pkg/storage/store.go, line 367 at r2 (raw file):

		} else {
			var err error
			repl, err = rs.store.GetReplica(shim.RangeID())

Shouldn't we avoid this lookup for resident replicas?


pkg/storage/store.go, line 2036 at r2 (raw file):

		}
		rep.SetZoneConfig(shim.zone)
		shim.replica = rep

This seems like it should be a method on ReplicaShim, along with clearing the other fields.


pkg/storage/store.go, line 2044 at r2 (raw file):

}

// mustGetReplica fetches a replica by Range ID. Any error

I would expect a method called "mustXXX" to always fatal if it failed. This has a strange contract where it only fatals for some errors.


pkg/storage/store.go, line 1319 at r4 (raw file):

	// creating replica shims.
	statsMap := map[roachpb.RangeID]enginepb.MVCCStats{}
	if _, err = engine.MVCCIterate(ctx, s.engine,

A few nits here that should help speed things up:

  1. MVCCIterate is more heavy-weight than you need. You can avoid a lot of memory copying by using an engine.Iterator directly and decoding keys and protos straight from unsafe buffers
  2. Pull the protos out of the loop to avoid allocations there
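
For (2), the generic shape is simply the following (the key/value plumbing is a stand-in, not the actual iterator code):

// Reuse a single MVCCStats value across iterations instead of allocating a
// fresh proto per key; the map assignment copies the value, so reuse is safe.
var stats enginepb.MVCCStats
for _, kv := range kvs { // kvs: hypothetical decoded key/value pairs
	stats.Reset()
	if err := protoutil.Unmarshal(kv.value, &stats); err != nil {
		return err
	}
	statsMap[kv.rangeID] = stats
}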

Member Author

@spencerkimball spencerkimball left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 4866 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think this is skewing with #31112.

Yikes, thanks for pointing that out. This was a bad rebase outcome.


pkg/storage/replica_shim.go, line 30 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How does this interact with ReplicaPlaceholder objects? How does it interact with uninitialized replicas? Is there a way we can merge all of these concepts, or at least make sure their relationship to one-another is easy to understand?

It works with ReplicaPlaceholder objects in exactly the same way that Replica objects did – they both exist together in the replicaShims map. Uninitialized replicas also are embedded in ReplicaShims. There's a 1:1 correspondence between how Replicas were used / referenced by a Store and how ReplicaShims are now used / referenced. In other words, ReplicaShim has replaced Replica in all of the Store's top-level data structures.


pkg/storage/replica_shim.go, line 37 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think it's worth pulling this into a separate struct so that we can easily zero it out when we pull a replica into memory.

Done.


pkg/storage/replica_shim.go, line 199 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

// non-resident replicas are always quiescent

Done.


pkg/storage/replica_shim.go, line 217 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

mention rangeKeyItem here.

Done.


pkg/storage/store.go, line 367 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Shouldn't we avoid this lookup for resident replicas?

Easier to just call GetReplica than to have another if/else block here. It's cheap if the replica is already resident.


pkg/storage/store.go, line 2036 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This seems like it should be a method on ReplicaShim, along with clearing the other fields.

Done.


pkg/storage/store.go, line 2044 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I would expect a method called "mustXXX" to always fatal if it failed. This has a strange contract where it only fatals for some errors.

The problem here is the weirdness of the GetReplica contract, which returns an error if the replica isn't found, even though the majority of callers actually just want to ignore that error and continue. Since RangeNotFoundError isn't considered a fatal error for most callers who want the mustGetReplica semantics, I've changed things around a bit here to just return a *Replica, which might be nil if not found. Don't know if this addresses your concern, but the point of this method is to avoid checking the error for RangeNotFoundError across 10 call sites.
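
Concretely, something like this sketch of the contract (not the exact code):

// mustGetReplica returns the Replica for the given range, or nil if the
// range is simply not found; any other error is treated as fatal.
func (s *Store) mustGetReplica(rangeID roachpb.RangeID) *Replica {
	repl, err := s.GetReplica(rangeID)
	if err != nil {
		if _, ok := err.(*roachpb.RangeNotFoundError); ok {
			return nil // most callers just skip missing ranges
		}
		log.Fatalf(context.TODO(), "%v", err)
	}
	return repl
}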


pkg/storage/store.go, line 1319 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

A few nits here that should help speed things up:

  1. MVCCIterate is more heavy-weight than you need. You can avoid a lot of memory copying by using an engine.Iterator directly and decoding keys and protos straight from unsafe buffers
  2. Pull the protos out of the loop to avoid allocations there

I'm going to punt on the changes to the iteration, but I've made the suggested changes around avoiding allocations and it seemed to have a slight positive impact.

Member

@nvanbenschoten nvanbenschoten left a comment


:lgtm:

Reviewed 3 of 5 files at r5, 4 of 4 files at r7.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_shim.go, line 30 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

It works with ReplicaPlaceholder objects in exactly the same way that Replica objects did – they both exist together in the replicaShims map. Uninitialized replicas also are embedded in ReplicaShims. There's a 1:1 correspondence between how Replicas were used / referenced by a Store and how ReplicaShims are now used / referenced. In other words, ReplicaShim has replaced Replica in all of the Store's top-level data structures.

So to visualize this, we now have the following states that a Replica can be in:

                                    replicasByKey
                                       (oneof)
                                      /      \
                                    /          \
                                  /              \
                       ReplicaShim                ReplicaPlaceholder
                          (oneof)
                         /      \
                       /          \
                     /              \
                 Replica        nonResidentReplica
                 (oneof)
                /      \
              /          \
            /              \
Replica.initialized  !Replica.initialized

I'm worried about this ever-growing complexity because it makes the code harder to reason about. Can any of these states be collapsed into the same level or even better, combined? For instance, is there a reason to keep the non-initialized Replica state if we also have a non-resident Replica state? It doesn't need to hold up this PR, but I think it would benefit us to think about this.

@benesch you've worked with this code most closely over the past few months. I'm curious if you have any suggestions here or strong feelings about what should and shouldn't be done.


pkg/storage/store.go, line 2044 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

The problem here is the weirdness of the GetReplica contract, which returns an error if the replica isn't found, where the majority of callers actually just want to ignore that error and continue. Since RangeNotFoundError isn't considered a fatal error for most callers who want the mustGetReplica semantics, I've changed things around a bit here to just return a *Replica, which might be nil if not found. Don't know if this addresses your concern, but the point of this method is to avoid checking the error for RangeNotFoundError across 10 call sites.

Yeah, that's better.

@tbg
Member

tbg commented Oct 26, 2018

I haven't given this a close review, but from the looks of it, it's the type of change that could introduce a number of issues. @benesch could you work with @spencerkimball to vet this before it merges (once it's in, there's no way to "turn it off")? I'm thinking a roachtest run and a nightly stress run. For the roachtest run, unfortunately things are already pretty busted right now, but I hope that #31914 can improve the situation.

Collaborator

@petermattis petermattis left a comment


The speedup here is impressive, though I echo @tschottdorf's concern that this could be destabilizing. Let's hold off on merging this until we've got the roachtest stability issues under control. I think that is close, though it is hard to put a precise time bound on when that stability will be achieved.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

Member Author

@spencerkimball spencerkimball left a comment


Startup time is now down to 9.7s after reducing allocations and getting rid of migrations which had already run but were nevertheless re-running anew at every startup.

I'll work with Nikhil to do roachtest and nightly stress runs.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica_shim.go, line 30 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

So to visualize this, we now have the following states that a Replica can be in:

                                    replicasByKey
                                       (oneof)
                                      /      \
                                    /          \
                                  /              \
                       ReplicaShim                ReplicaPlaceholder
                          (oneof)
                         /      \
                       /          \
                     /              \
                 Replica        nonResidentReplica
                 (oneof)
                /      \
              /          \
            /              \
Replica.initialized  !Replica.initialized

I'm worried about this ever-growing complexity because it makes the code harder to reason about. Can any of these states be collapsed into the same level or even better, combined? For instance, is there a reason to keep the non-initialized Replica state if we also have a non-resident Replica state? It doesn't need to hold up this PR, but I think it would benefit us to think about this.

@benesch you've worked with this code most closely over the past few months. I'm curious if you have any suggestions here or strong feelings about what should and shouldn't be done.

Yes, this graph is correct, except uninitialized replicas are not stored in replicasByKey; they're in the replicaShims map. The uninitialized-replica map actually just provides a boolean indicating, for any given replica, whether it has been added to replicasByKey yet after being initialized. The reason is that once the replica's range descriptor has been set, we lose the context of whether the replica was previously uninitialized and hasn't yet been added to replicasByKey.


pkg/storage/store.go, line 2044 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yeah, that's better.

Done.

@spencerkimball
Member Author

@tschottdorf @petermattis: both nightly stress and roachtest seem to be fine with this change. I'm relying on @benesch to interpret the non-culpability around pre-existing roachtest issues: https://teamcity.cockroachdb.com/viewLog.html?buildId=988969&tab=queuedBuildOverviewTab

@spencerkimball
Member Author

@petermattis, @tbg should this PR still be in limbo? It's not getting any younger.

Contributor

@bdarnell bdarnell left a comment


Tobi and I are concerned about the potential stability impact of this change, especially since we're still handling the fallout from 2.1 changes like the contention queue and async writes. Since we're not currently planning to push for higher scalability in this dimension for 2.2, I'd rather put this PR on hold even if that means rewriting it when we're ready to tackle the issue. When that day comes, I'd like to see a larger refactoring that unifies replica shims with initialized and uninitialized replicas and replica placeholders instead of bolting a new not-quite-replica type on top of the multiple existing ones.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

    storage: lazily load storage.Replica objects

    Tens of thousands of `Replica` objects or more can take substantial time
    to initialize at server startup. This change introduces a `ReplicaShim`
    object which contains a pointer to the `Replica` that is non-nil if the
    replica is resident. Otherwise, the shim holds a set of other values,
    including a reference to the appropriate zone config, `MVCCStats`, and a
    handful of other properties necessary to accurately compute stats even
    when a replica is not resident.

    Further improve performance when initializing `ReplicaShim` objects at
    store startup by reading the MVCC stats and the range descriptors in
    parallel.

    Improve replica shim load performance by loading all MVCCStats values
    during a simple scan over all local range ID keys and adding stats
    values to a map. The stats map is consulted when loading the range
    descriptors to initialize each `ReplicaShim`.

    Release note (performance improvement): faster startup when a node contains
    many replicas.

Release note: None
tbg added a commit to tbg/cockroach that referenced this pull request Jan 16, 2019
The comment below goes into more detail, but here's the TL;DR:

Problems:

1. right now answering "is this Replica object unused?" is impossible
2. right now replicaIDs change on existing replicas, which is
   very complex to reason about
3. right now it'll be difficult to split replicas into different
   states because that requires answering 1 during state transitions

Proposed:

1. use a clean API to refcount Replica usage, and
2. allow initiating teardown "basically whenever" without blocking
   (think merge trigger),
3. so that the replica clears out quickly,
4. which in turn solves 1.
5. Then we can solve 2 because we'll be able to
   replace the Replica object in the Store whenever
   the replicaID would previously change in-place
   (this will not be trivial, but I hope it can be done),
6. and we should also be able to do 3 (again, not trivial,
   but a lot harder right now).

I expect the replication code to benefit from 6) as the Raft instance on
a Replica would never change.

This PR is a toy experiment for 1. It certainly wouldn't survive contact
with the real code, but it's sufficient to discuss this project and
iterate on the provisional Guard interface.

----

GuardedReplica is the external interface through which callers interact with
a Replica. By acquiring references to the underlying Replica object while it
is being used, it allows safe removal of Replica objects and/or their under-
lying data. This is an important fundamental for five reasons:

Today, we use no such mechanism, though this is largely due to failing in the
past to establish one[1]. The status quo "works" by maintaining a
destroyStatus inside of Replica.mu, which is checked in a few places such as
before proposing a command or serving a read. Since these checks are only
"point in time", and nothing prevents the write to the status from occurring
just a moment after the check, there is a high cognitive overhead to
reasoning about the possible outcomes. In fact, in the case in which things
could go bad most spectacularly, namely removing a replica including data, we
hold essentially all of the locks available to us and rely on the
readOnlyCmdMu (which we would rather get rid of). This then is the first
reason for proposing this change: make the replica lifetime easier to reason
about and establish confidence that the Replica can't just disappear out from
under us.

The second motivator are ReplicaID transitions, which can be extremely
complicated. For example, a Replica may

1. start off as an uninitialized Replica with ReplicaID 12 (i.e. no data)
2. receive a preemptive snapshot, which confusingly results in an initialized
   Replica with ReplicaID 12 (though preemptive snapshots nominally should
   result in a preemptive replica -- one with ReplicaID zero).
3. update its ReplicaID to 18 (i.e. being added back to the Raft group).
4. get ReplicaGC'ed because
5. it blocks a preemptive snapshot, which now recreates it.

In my point of view, changing the ReplicaID for a live Replica is a bad idea
and incurs too much complexity. An architecture in which Replica objects have
a single ReplicaID throughout their lifetime is conceptually much simpler,
but it is also much more straightforward to maintain since it does away with
a whole class of concurrency that needs to be tamed in today's code, and which
may have performance repercussions. On the other hand, replicaID changes are
not frequent, and only need to be moderately fast.

The alternative is to instantiate a new incarnation of the Replica whenever
the ReplicaID changes. The difficult part about this is destroying the old
Replica; since Replica provides proper serialization, we mustn't have commands
in-flight in two instances for the same data (and generally we want to avoid
even having to think about concurrent use of old incarnations). This is
explored here. The above history would read something like this:

1. start off as an uninitialized Replica R with Replica ID 12 (no data)
2. preemptive snapshot is received: tear down R, instantiate R'
3. ReplicaID is updated: tear down R', instantiate R''
4. R'' is marked for ReplicaGC: replace with placeholder R''' in Store, tear
   down R'', wait for references to drain, remove the data, remove R'''.
5. instantiate R''' (no change from before).

A third upshot is almost visible in the above description. Once we can re-
instantiate cleanly on ReplicaID-based state changes, we might as well go
ahead and pull apart various types of Replica:

- preemptive snapshots (though we may replace those by learner replicas in
  the future[2])
- uninitialized Replicas (has replicaID, but no data)
- initialized Replicas (has replicaID and data)
- placeholders (have no replicaID and no data)

To simplify replicaGC and to remove the preemptive snapshot state even before
we stop using preemptive snapshots, we may allow placeholders to hold data
(but with no means to run client requests against it).

Once there, reducing the memory footprint for "idle" replicas by only having
a "shim" in memory (replacing the "full" version) until the next request
comes in becomes a natural proposition by introducing a new replica state
that upon access gets promoted to a full replica (which loads the full
in-memory state from disk), picking up past attempts at this which failed due
to the technical debt present at the moment[3].

[1]: cockroachdb#8630
[2]: cockroachdb#34058
[3]: cockroachdb#31663

Release note: None
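
As a rough illustration of the ref-counting guard idea sketched in that commit message (names and signatures here are illustrative, not the experiment's actual API):

// GuardedReplica hands out references to an underlying Replica and refuses
// new ones once teardown has begun, so removal can wait for references to
// drain instead of racing point-in-time destroyStatus checks.
type GuardedReplica struct {
	mu struct {
		sync.Mutex
		refs      int
		destroyed bool
		repl      *Replica
	}
}

// Acquire returns the Replica plus a release func, or ok=false if the
// Replica is already being torn down. Callers must invoke release when done.
func (g *GuardedReplica) Acquire() (repl *Replica, release func(), ok bool) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.mu.destroyed {
		return nil, nil, false
	}
	g.mu.refs++
	return g.mu.repl, g.release, true
}

func (g *GuardedReplica) release() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.mu.refs--
	// Teardown initiated elsewhere proceeds once refs has drained to zero.
}
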
@tbg tbg added the X-noremind label (Bots won't notify about PRs with X-noremind) Jun 19, 2019
@dhartunian dhartunian removed the request for review from a team June 12, 2020 17:45