Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: make RaftTruncatedState unreplicated #34660

Merged
merged 5 commits into from
Feb 11, 2019
Merged

Conversation

tbg
Copy link
Member

@tbg tbg commented Feb 6, 2019

This isn't 100% polished yet, but generally ready for review.


See #34287.

Today, Raft (or preemptive) snapshots include the past Raft log, that is,
log entries which are already reflected in the state of the snapshot.
Fundamentally, this is because we have historically used a replicated
TruncatedState.

TruncatedState essentially tells us what the first index in the log is
(though it also includes a Term). If the TruncatedState cannot diverge
across replicas, we must send the whole log in snapshots, as the first
log index must match what the TruncatedState claims it is.

The Raft log is typically, but not necessarily small. Log truncations are
driven by a queue and use a complex decision process. That decision process
can be faulty and even if it isn't, the queue could be held up. Besides,
even when the Raft log contains only very few entries, these entries may be
quite large (see SSTable ingestion during RESTORE).

All this motivates that we don't want to (be forced to) send the Raft log
as part of snapshots, and in turn we need the TruncatedState to be
unreplicated.

This change migrates the TruncatedState into unreplicated keyspace. It does
not yet allow snapshots to avoid sending the past Raft log, but that is a
relatively straightforward follow-up change.

VersionUnreplicatedRaftTruncatedState, when active, moves the truncated
state into unreplicated keyspace on log truncations.

The migration works as follows:

  1. at any log position, the replicas of a Range either use the new
    (unreplicated) key or the old one, and exactly one of them exists.

  2. When a log truncation evaluates under the new cluster version, it
    initiates the migration by deleting the old key. Under the old cluster
    version, it behaves like today, updating the replicated truncated state.

  3. The deletion signals new code downstream of Raft and triggers a write to
    the new, unreplicated, key (atomic with the deletion of the old key).

  4. Future log truncations don't write any replicated data any more, but
    (like before) send along the TruncatedState which is written downstream of
    Raft atomically with the deletion of the log entries. This actually uses
    the same code as 3. What's new is that the truncated state needs to be
    verified before replacing a previous one. If replicas disagree about their
    truncated state, it's possible for replica X at FirstIndex=100 to apply a
    truncated state update that sets FirstIndex to, say, 50 (proposed by a
    replica with a "longer" historical log). In that case, the truncated state
    update must be ignored (this is straightforward downstream-of-Raft code).

  5. When a split trigger evaluates, it seeds the RHS with the legacy key iff
    the LHS uses the legacy key, and the unreplicated key otherwise. This makes
    sure that the invariant that all replicas agree on the state of the
    migration is upheld.

  6. When a snapshot is applied, the receiver is told whether the snapshot
    contains a legacy key. If not, it writes the truncated state (which is part
    of the snapshot metadata) in its unreplicated version. Otherwise it doesn't
    have to do anything (the range will migrate later).

The following diagram visualizes the above. Note that it abuses sequence
diagrams to get a nice layout; the vertical lines belonging to NewState and
OldState don't imply any particular ordering of operations.

┌────────┐                            ┌────────┐
│OldState│                            │NewState│
└───┬────┘                            └───┬────┘
   │                        Bootstrap under old version
   │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
─ ─
   │                                     │
   │                                     │     Bootstrap under new version
   │                                     │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
─ ─
   │                                     │
   │─ ─ ┐
   │    | Log truncation under old version
   │< ─ ┘
   │                                     │
   │─ ─ ┐                                │
   │    | Snapshot                       │
   │< ─ ┘                                │
   │                                     │
   │                                     │─ ─ ┐
   │                                     │    | Snapshot
   │                                     │< ─ ┘
   │                                     │
   │   Log truncation under new version  │
   │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─>│
   │                                     │
   │                                     │─ ─ ┐
   │                                     │    | Log truncation under new version
   │                                     │< ─ ┘
   │                                     │
   │                                     │─ ─ ┐
   │                                     │    | Log truncation under old version
   │                                     │< ─ ┘ (necessarily running new binary)

Release note: None

@tbg tbg requested a review from a team as a code owner February 6, 2019 13:04
@tbg tbg requested review from a team February 6, 2019 13:04
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg requested a review from bdarnell February 6, 2019 13:04
Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good aside from the unfinished test.

Reviewed 11 of 11 files at r1, 1 of 1 files at r2, 2 of 2 files at r3, 1 of 1 files at r4, 29 of 30 files at r5.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @tbg)


pkg/server/version_cluster_test.go, line 207 at r5 (raw file):

	oldVersionS := oldVersion.String()
	newVersionS := cluster.VersionByKey(cluster.VersionUnreplicatedRaftTruncatedState).String()
	_ = newVersionS

Obsolete?


pkg/server/version_cluster_test.go, line 215 at r5 (raw file):

		{oldVersionS, newVersionS},
		{oldVersionS, newVersionS},

Remove this blank line. This array doesn't seem to match the comment (unless I'm confused about how this test harness works).


pkg/server/version_cluster_test.go, line 241 at r5 (raw file):

	scatter := func() {
		return // HACK

Don't forget this. I'll stop reviewing this test here since it seems unfinished.


pkg/settings/cluster/cockroach_versions.go, line 74 at r5 (raw file):

	VersionLazyTxnRecord
	VersionSequencedReads
	// VersionUnreplicatedRaftTruncatedState, when active, moves the truncated

I think this list should just be the list of constant declarations; it's not the place for comments about migrations. I'd probably put this in the storage package on handleTruncatedStateBelowRaft.


pkg/storage/replica_raftstorage.go, line 498 at r5 (raw file):

	//
	// See the comment on VersionUnreplicatedRaftTruncatedState for details.
	UsesLegacyTruncatedState bool

It's confusing that you use "legacy" for this bool but "unreplicated" for the bool in the proto. That creates a link in my mind that unreplicated is the legacy mode when it's the reverse. Since we need default=false for the proto, I'd stick with "UsesUnreplicatedTruncatedState" here.


pkg/storage/batcheval/cmd_truncate_log.go, line 88 at r5 (raw file):

	//
	// TODO(tbg): think about synthesizing a valid term. Can we use the next
	// existing entry's term?

The term and index must match; raft won't be happy if you use index i but the term of index i+1. You might be able to get clever (I think the first entry after a term change is always empty, so if i+1 is non-empty it's safe to project its term backwards one entry), but that's probably not a great idea. I'm fine for now with just skipping truncations in this case unless we make further changes to truncation that make it more common.

Or if the leader has already truncated to index i, maybe we just truncate the whole cluster at that point (regardless of the index in the request)? I haven't paged in where this index is coming from; why would the leader be processing a truncation that is behind its own index?


pkg/storage/batcheval/cmd_truncate_log.go, line 108 at r5 (raw file):

	// still using the legacy truncated state key). On followers, this can in
	// principle be off either way, though in practice we expect followers to
	// have shorter logs than the leaseholder (see #34287).

s/shorter logs/logs that are no longer than/

The common case remains that everything is equal.


pkg/storage/stateloader/initial.go, line 58 at r5 (raw file):

	txnSpanGCThreshold hlc.Timestamp,
	activeVersion roachpb.Version,
	useLegacyTruncatedState bool, // see VersionUnreplicatedRaftTruncatedState

Same comment as before regarding legacy vs unreplicated.

Maybe make an enum instead of a bool so you don't have to comment at all the call sites?


pkg/storage/stateloader/stateloader.go, line 132 at r5 (raw file):

	eng engine.ReadWriter,
	state storagepb.ReplicaState,
	useLegacyTruncatedState bool, // see VersionUnreplicatedRaftTruncatedState

Same comment as in initial.go.

Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR, PTAL!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @tbg)


pkg/server/version_cluster_test.go, line 207 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Obsolete?

Done.


pkg/server/version_cluster_test.go, line 215 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Remove this blank line. This array doesn't seem to match the comment (unless I'm confused about how this test harness works).

Done. The comments were out of sync with what the test is now doing.


pkg/server/version_cluster_test.go, line 241 at r5 (raw file):
Done. Amazingly this didn't turn up a new failure mode (that test helped me find a bunch of other errors before PR though).

1082 runs so far, 0 failures, over 10m40s


pkg/settings/cluster/cockroach_versions.go, line 74 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think this list should just be the list of constant declarations; it's not the place for comments about migrations. I'd probably put this in the storage package on handleTruncatedStateBelowRaft.

Are you sure that's better? I personally would love a high level explanation with all of these migrations, especially since this one is linked to from so many places.


pkg/storage/replica_raftstorage.go, line 498 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It's confusing that you use "legacy" for this bool but "unreplicated" for the bool in the proto. That creates a link in my mind that unreplicated is the legacy mode when it's the reverse. Since we need default=false for the proto, I'd stick with "UsesUnreplicatedTruncatedState" here.

Done.


pkg/storage/batcheval/cmd_truncate_log.go, line 108 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

s/shorter logs/logs that are no longer than/

The common case remains that everything is equal.

Done.


pkg/storage/stateloader/initial.go, line 58 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Same comment as before regarding legacy vs unreplicated.

Maybe make an enum instead of a bool so you don't have to comment at all the call sites?

Good call on the enum, done.


pkg/storage/stateloader/stateloader.go, line 132 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Same comment as in initial.go.

Done.

Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)


pkg/storage/batcheval/cmd_truncate_log.go, line 88 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

The term and index must match; raft won't be happy if you use index i but the term of index i+1. You might be able to get clever (I think the first entry after a term change is always empty, so if i+1 is non-empty it's safe to project its term backwards one entry), but that's probably not a great idea. I'm fine for now with just skipping truncations in this case unless we make further changes to truncation that make it more common.

Or if the leader has already truncated to index i, maybe we just truncate the whole cluster at that point (regardless of the index in the request)? I haven't paged in where this index is coming from; why would the leader be processing a truncation that is behind its own index?

The index is coming from the raft log queue, i.e. almost always from the leaseholder's own Raft instance. I agree that this is mostly a non-issue, we're not planning any changes that make this a common occurrence (or an occurrence at all, really). Typically with what we're planning followers after having received a snapshot will have a shorter log than the leaseholder, but this check doesn't care.

Copy link
Contributor

@bdarnell bdarnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 30 of 30 files at r10.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell and @tbg)


pkg/settings/cluster/cockroach_versions.go, line 74 at r5 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Are you sure that's better? I personally would love a high level explanation with all of these migrations, especially since this one is linked to from so many places.

This list is where I look for "which release included VersionBitArrayColumns", so I like having it compact enough that I can see that the constant is between Version2_0 and Version2_1. I agree it might be nice to use something in this package for comments like this. Maybe put it on the entry in versionsSingleton? (which already has a one-line comment with a link for each migration).


pkg/storage/stateloader/stateloader.go, line 123 at r10 (raw file):

const (
	// UseReplicatedTruncatedState means use the legacy (replicated) key.
	UseReplicatedTruncatedState TruncatedStateType = iota

Nit: TruncatedStateReplicated and TruncatedStateUnreplicated? Maybe even TruncatedStateLegacyReplicated to make it clear which one is older.

tbg added 5 commits February 11, 2019 21:36
No functional changes, just preparing to introduce the shiny new
unreplicated raft truncated state.

Release note: None
When replicas can have divergent truncated states, we want to carry
out truncations even if they seem pointless to the leaseholder, since
the leaseholder might have a shorter log than other replicas.

Release note: None
The encoded range-local keys were mostly incorrect in that they were
missing the replicated/unreplicated infix. Rather than trying to keep
this comment up to date, readers should be directed to TestPrettyPrint
which now conveniently logs all types of keys and their encoding.

Release note: None
See cockroachdb#34287.

Today, Raft (or preemptive) snapshots include the past Raft log, that
is, log entries which are already reflected in the state of the
snapshot. Fundamentally, this is because we have historically used
a replicated TruncatedState.

TruncatedState essentially tells us what the first index in the log is
(though it also includes a Term).
If the TruncatedState cannot diverge across replicas, we *must* send the
whole log in snapshots, as the first log index must match what the
TruncatedState claims it is.

The Raft log is typically, but not necessarily small. Log truncations
are driven by a queue and use a complex decision process. That decision
process can be faulty and even if it isn't, the queue could be held up.
Besides, even when the Raft log contains only very few entries, these
entries may be quite large (see SSTable ingestion during RESTORE).

All this motivates that we don't want to (be forced to) send the Raft
log as part of snapshots, and in turn we need the TruncatedState to
be unreplicated.

This change migrates the TruncatedState into unreplicated keyspace.
It does not yet allow snapshots to avoid sending the past Raft log,
but that is a relatively straightforward follow-up change.

VersionUnreplicatedRaftTruncatedState, when active, moves the truncated
state into unreplicated keyspace on log truncations.

The migration works as follows:

1. at any log position, the replicas of a Range either use the new
(unreplicated) key or the old one, and exactly one of them exists.

2. When a log truncation evaluates under the new cluster version,
it initiates the migration by deleting the old key. Under the old cluster
version, it behaves like today, updating the replicated truncated state.

3. The deletion signals new code downstream of Raft and triggers a write
to the new, unreplicated, key (atomic with the deletion of the old key).

4. Future log truncations don't write any replicated data any more, but
(like before) send along the TruncatedState which is written downstream
of Raft atomically with the deletion of the log entries. This actually
uses the same code as 3.
What's new is that the truncated state needs to be verified before
replacing a previous one. If replicas disagree about their truncated
state, it's possible for replica X at FirstIndex=100 to apply a
truncated state update that sets FirstIndex to, say, 50 (proposed by a
replica with a "longer" historical log). In that case, the truncated
state update must be ignored (this is straightforward downstream-of-Raft
code).

5. When a split trigger evaluates, it seeds the RHS with the legacy
key iff the LHS uses the legacy key, and the unreplicated key otherwise.
This makes sure that the invariant that all replicas agree on the
state of the migration is upheld.

6. When a snapshot is applied, the receiver is told whether the snapshot
contains a legacy key. If not, it writes the truncated state (which is
part of the snapshot metadata) in its unreplicated version. Otherwise
it doesn't have to do anything (the range will migrate later).

The following diagram visualizes the above. Note that it abuses sequence
diagrams to get a nice layout; the vertical lines belonging to NewState
and OldState don't imply any particular ordering of operations.

```
┌────────┐                            ┌────────┐
│OldState│                            │NewState│
└───┬────┘                            └───┬────┘
    │                        Bootstrap under old version
    │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
    │                                     │
    │                                     │     Bootstrap under new version
    │                                     │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
    │                                     │
    │─ ─ ┐
    │    | Log truncation under old version
    │< ─ ┘
    │                                     │
    │─ ─ ┐                                │
    │    | Snapshot                       │
    │< ─ ┘                                │
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Snapshot
    │                                     │< ─ ┘
    │                                     │
    │   Log truncation under new version  │
    │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─>│
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Log truncation under new version
    │                                     │< ─ ┘
    │                                     │
    │                                     │─ ─ ┐
    │                                     │    | Log truncation under old version
    │                                     │< ─ ┘ (necessarily running new binary)
```

Source: http://www.plantuml.com/plantuml/uml/ and the following input:

@startuml
scale 600 width

OldState <--] : Bootstrap under old version
NewState <--] : Bootstrap under new version
OldState --> OldState : Log truncation under old version
OldState --> OldState : Snapshot
NewState --> NewState : Snapshot
OldState --> NewState : Log truncation under new version
NewState --> NewState : Log truncation under new version
NewState --> NewState : Log truncation under old version\n(necessarily running new binary)
@enduml

Release note: None
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR, comments addressed!

Thanks to megacheck/staticcheck I found a bug in the test -- it wasn't actually verifying that the old truncated state disappeared up but simply exited with a nil error. In fact the old truncated state didn't disappear reliably because that needed log truncations. Changed the test to force a log truncation instead. This has surived 5 minutes under stress without a failure, so I'm going to go ahead and merge.

bors r=bdarnell

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)


pkg/settings/cluster/cockroach_versions.go, line 74 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This list is where I look for "which release included VersionBitArrayColumns", so I like having it compact enough that I can see that the constant is between Version2_0 and Version2_1. I agree it might be nice to use something in this package for comments like this. Maybe put it on the entry in versionsSingleton? (which already has a one-line comment with a link for each migration).

Good idea, done.

Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r-

Branch shuffling wasn't done yet.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)

@craig
Copy link
Contributor

craig bot commented Feb 11, 2019

Canceled

Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r=bdarnell

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @bdarnell)

craig bot pushed a commit that referenced this pull request Feb 11, 2019
34296: storage: improve message on slow Raft proposal r=petermattis a=tbg

Touches #33007.

Release note: None

34589: importccl: fix flaky test TestImportCSVStmt r=rytaft a=rytaft

`TestImportCSVStmt` tests that `IMPORT` jobs appear in a certain order
in the `system.jobs` table. Automatic statistics were causing this
test to be flaky since `CreateStats` jobs were present in the jobs
table as well, in an unpredictable order. This commit fixes the problem
by only selecting `IMPORT` jobs from the jobs table.

Fixes #34568

Release note: None

34660: storage: make RaftTruncatedState unreplicated r=bdarnell a=tbg

This isn't 100% polished yet, but generally ready for review.

----

See #34287.

Today, Raft (or preemptive) snapshots include the past Raft log, that is,
log entries which are already reflected in the state of the snapshot.
Fundamentally, this is because we have historically used a replicated
TruncatedState.

TruncatedState essentially tells us what the first index in the log is
(though it also includes a Term). If the TruncatedState cannot diverge
across replicas, we *must* send the whole log in snapshots, as the first
log index must match what the TruncatedState claims it is.

The Raft log is typically, but not necessarily small. Log truncations are
driven by a queue and use a complex decision process. That decision process
can be faulty and even if it isn't, the queue could be held up. Besides,
even when the Raft log contains only very few entries, these entries may be
quite large (see SSTable ingestion during RESTORE).

All this motivates that we don't want to (be forced to) send the Raft log
as part of snapshots, and in turn we need the TruncatedState to be
unreplicated.

This change migrates the TruncatedState into unreplicated keyspace. It does
not yet allow snapshots to avoid sending the past Raft log, but that is a
relatively straightforward follow-up change.

VersionUnreplicatedRaftTruncatedState, when active, moves the truncated
state into unreplicated keyspace on log truncations.

The migration works as follows:

1. at any log position, the replicas of a Range either use the new
(unreplicated) key or the old one, and exactly one of them exists.

2. When a log truncation evaluates under the new cluster version, it
initiates the migration by deleting the old key. Under the old cluster
version, it behaves like today, updating the replicated truncated state.

3. The deletion signals new code downstream of Raft and triggers a write to
the new, unreplicated, key (atomic with the deletion of the old key).

4. Future log truncations don't write any replicated data any more, but
(like before) send along the TruncatedState which is written downstream of
Raft atomically with the deletion of the log entries. This actually uses
the same code as 3. What's new is that the truncated state needs to be
verified before replacing a previous one. If replicas disagree about their
truncated state, it's possible for replica X at FirstIndex=100 to apply a
truncated state update that sets FirstIndex to, say, 50 (proposed by a
replica with a "longer" historical log). In that case, the truncated state
update must be ignored (this is straightforward downstream-of-Raft code).

5. When a split trigger evaluates, it seeds the RHS with the legacy key iff
the LHS uses the legacy key, and the unreplicated key otherwise. This makes
sure that the invariant that all replicas agree on the state of the
migration is upheld.

6. When a snapshot is applied, the receiver is told whether the snapshot
contains a legacy key. If not, it writes the truncated state (which is part
of the snapshot metadata) in its unreplicated version. Otherwise it doesn't
have to do anything (the range will migrate later).

The following diagram visualizes the above. Note that it abuses sequence
diagrams to get a nice layout; the vertical lines belonging to NewState and
OldState don't imply any particular ordering of operations.

```
┌────────┐                            ┌────────┐
│OldState│                            │NewState│
└───┬────┘                            └───┬────┘
   │                        Bootstrap under old version
   │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
─ ─
   │                                     │
   │                                     │     Bootstrap under new version
   │                                     │ <─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
─ ─
   │                                     │
   │─ ─ ┐
   │    | Log truncation under old version
   │< ─ ┘
   │                                     │
   │─ ─ ┐                                │
   │    | Snapshot                       │
   │< ─ ┘                                │
   │                                     │
   │                                     │─ ─ ┐
   │                                     │    | Snapshot
   │                                     │< ─ ┘
   │                                     │
   │   Log truncation under new version  │
   │ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─>│
   │                                     │
   │                                     │─ ─ ┐
   │                                     │    | Log truncation under new version
   │                                     │< ─ ┘
   │                                     │
   │                                     │─ ─ ┐
   │                                     │    | Log truncation under old version
   │                                     │< ─ ┘ (necessarily running new binary)
```

Release note: None

34762: distsqlplan: fix error in union planning r=jordanlewis a=jordanlewis

Previously, if 2 inputs to a UNION ALL had identical post processing
except for renders, further post processing on top of that union all
could invalidate the plan and cause errors or crashes.

Fixes #34437.

Release note (bug fix): fix a planning crash during UNION ALL operations
that had projections, filters or renders directly on top of the UNION
ALL in some cases.

34767: sql: reuse already allocated memory for the cache in a row container r=yuzefovich a=yuzefovich

Previously, we would always allocate new memory for every row that
we put in the cache of DiskBackedIndexedRowContainer and simply
discard the memory underlying the row that we remove from the cache.
Now, we're reusing that memory.

Release note: None

34779: opt: add stats to tpch xform test r=justinj a=justinj

Since we have stats by default now, this should be the default testing
mechanism. I left in tpch-no-stats since we also have that for tpcc,
just for safety.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Co-authored-by: Rebecca Taft <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Justin Jaffray <[email protected]>
@craig
Copy link
Contributor

craig bot commented Feb 11, 2019

Build succeeded

@craig craig bot merged commit d0aa09e into cockroachdb:master Feb 11, 2019
tbg added a commit to tbg/cockroach that referenced this pull request Mar 13, 2019
Assume a leaseholder wants to send a (Raft or preemptive) snapshot to a follower.
Say the leaseholder's log ranges from 100 to 200, and we assume that the size
(in bytes) of this log is 200mb. All of the log is successfully
committed and applied, and is thus reflected in the snapshot data.

Prior to this change, we would still send the 200mb of log entries along
with the snapshot, even though the snapshot itself already reflected
them.

After this change, we won't send any log entries along with the
snapshot, as sanity would suggest we would.

We were unable to make this change because up until recently, the Raft
truncated state (which dictates the first log index) was replicated and
consistency checked; this was changed in cockroachdb#34660. The migration
introduced there makes it straightforward to omit a prefix of the
log in snapshots, as done in this commit.

Somewhere down the road (19.2?) we should localize all the log
truncation decisions and simplify all this further. I suspect
that in doing so we can avoid tracking the size of the Raft log
in the first place; all we really need for this is some mechanism
that makes sure that an "idle" replica truncates its logs. With
unreplicated truncation, this becomes cheap enough to "just do".

Release note (bug fix): Remove historical log entries from Raft snapshots.
These log entries could lead to failed snapshots with a message such as:

    snapshot failed: aborting snapshot because raft log is too large
    (25911051 bytes after processing 7 of 37 entries)
tbg added a commit to tbg/cockroach that referenced this pull request Mar 14, 2019
Assume a leaseholder wants to send a (Raft or preemptive) snapshot to a follower.
Say the leaseholder's log ranges from 100 to 200, and we assume that the size
(in bytes) of this log is 200mb. All of the log is successfully
committed and applied, and is thus reflected in the snapshot data.

Prior to this change, we would still send the 200mb of log entries along
with the snapshot, even though the snapshot itself already reflected
them.

After this change, we won't send any log entries along with the
snapshot, as sanity would suggest we would.

We were unable to make this change because up until recently, the Raft
truncated state (which dictates the first log index) was replicated and
consistency checked; this was changed in cockroachdb#34660. The migration
introduced there makes it straightforward to omit a prefix of the
log in snapshots, as done in this commit.

Somewhere down the road (19.2?) we should localize all the log
truncation decisions and simplify all this further. I suspect
that in doing so we can avoid tracking the size of the Raft log
in the first place; all we really need for this is some mechanism
that makes sure that an "idle" replica truncates its logs. With
unreplicated truncation, this becomes cheap enough to "just do".

Release note (bug fix): Remove historical log entries from Raft snapshots.
These log entries could lead to failed snapshots with a message such as:

    snapshot failed: aborting snapshot because raft log is too large
    (25911051 bytes after processing 7 of 37 entries)
tbg added a commit to tbg/cockroach that referenced this pull request Mar 18, 2019
Assume a leaseholder wants to send a (Raft or preemptive) snapshot to a follower.
Say the leaseholder's log ranges from 100 to 200, and we assume that the size
(in bytes) of this log is 200mb. All of the log is successfully
committed and applied, and is thus reflected in the snapshot data.

Prior to this change, we would still send the 200mb of log entries along
with the snapshot, even though the snapshot itself already reflected
them.

After this change, we won't send any log entries along with the
snapshot, as sanity would suggest we would.

We were unable to make this change because up until recently, the Raft
truncated state (which dictates the first log index) was replicated and
consistency checked; this was changed in cockroachdb#34660. The migration
introduced there makes it straightforward to omit a prefix of the
log in snapshots, as done in this commit.

Somewhere down the road (19.2?) we should localize all the log
truncation decisions and simplify all this further. I suspect
that in doing so we can avoid tracking the size of the Raft log
in the first place; all we really need for this is some mechanism
that makes sure that an "idle" replica truncates its logs. With
unreplicated truncation, this becomes cheap enough to "just do".

Release note (bug fix): Remove historical log entries from Raft snapshots.
These log entries could lead to failed snapshots with a message such as:

    snapshot failed: aborting snapshot because raft log is too large
    (25911051 bytes after processing 7 of 37 entries)
craig bot pushed a commit that referenced this pull request Mar 18, 2019
35701: storage: don't send historical Raft log with snapshots r=andreimatei,bdarnell a=tbg

Assume a leaseholder wants to send a (Raft or preemptive) snapshot to a
follower. Say the leaseholder's log ranges from 100 to 200, and we assume
that the size
(in bytes) of this log is 200mb. All of the log is successfully committed
and applied, and is thus reflected in the snapshot data.

Prior to this change, we would still send the 200mb of log entries along
with the snapshot, even though the snapshot itself already reflected them.

After this change, we won't send any log entries along with the snapshot,
as sanity would suggest we would.

We were unable to make this change because up until recently, the Raft
truncated state (which dictates the first log index) was replicated and
consistency checked; this was changed in #34660.

Somewhere down the road (19.2?) we should localize all the log truncation
decisions and simplify all this further. I suspect that in doing so we can
avoid tracking the size of the Raft log in the first place; all we really
need for this is some mechanism that makes sure that an "idle" replica
truncates its logs. With unreplicated truncation, this becomes cheap enough
to "just do".

Release note (bug fix): Remove historical log entries from Raft snapshots.
These log entries could lead to failed snapshots with a message such as:

    snapshot failed: aborting snapshot because raft log is too large
    (25911051 bytes after processing 7 of 37 entries)

Co-authored-by: Tobias Schottdorf <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants