Skip to content

Commit

Permalink
storage: don't send log truncations through Raft
Browse files Browse the repository at this point in the history
Sending log truncations through Raft is inefficient: the Raft log is not itself
part of the replicated state. Instead, we only replicate the TruncatedState and,
as a side effect, ClearRange() the affected key range.

This is an individual performance optimization whose impact we should measure;
anecdotally it always looked like we were doing a lot of work for truncations
during a write-heavy workload; this should alleviate this somewhat). As
explained above, the change isn't made for performance at this point, though.

It also removes one migration concern for cockroachdb#16809, see
cockroachdb#16809 (comment).

We'll need to migrate this. It's straightforward with the in-flight PR cockroachdb#16977.

- we're moving logic downstream of Raft. However, we can easily migrate it
  upstream again, without a real migration, though I don't think that's going
  to happen.
- the big upshot is hopefully a large reduction in complexity for
  @irfansharif's PR: log truncation is one of the odd cases that requires
  a RaftWriteBatch. cockroachdb#16749 is the only other one, and there the (correct)
  solution also involves going downstream of Raft for a Raft-related write. So,
  after solving both of those, I think RaftWriteBatch can go? cc @irfansharif
- as @petermattis pointed out, after @irfansharif's change, we should be able
  to not sync the base engine on truncation changes but do it only as we
  actually clear the log entries (which can be delayed as we see fit). So for
  1000 log truncations across many ranges, we'll only have to sync once if
  that's how we set it up.
  • Loading branch information
tbg committed Jul 17, 2017
1 parent 81badfb commit 47d325e
Show file tree
Hide file tree
Showing 3 changed files with 76 additions and 18 deletions.
45 changes: 34 additions & 11 deletions pkg/storage/replica_command.go
Original file line number Diff line number Diff line change
Expand Up @@ -45,6 +45,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/storage/engine"
"github.com/cockroachdb/cockroach/pkg/storage/engine/enginepb"
"github.com/cockroachdb/cockroach/pkg/storage/storagebase"
"github.com/cockroachdb/cockroach/pkg/util"
"github.com/cockroachdb/cockroach/pkg/util/hlc"
"github.com/cockroachdb/cockroach/pkg/util/log"
"github.com/cockroachdb/cockroach/pkg/util/protoutil"
Expand Down Expand Up @@ -1884,16 +1885,38 @@ func evalTruncateLog(
if err != nil {
return EvalResult{}, err
}
start := keys.RaftLogKey(cArgs.EvalCtx.RangeID(), 0)
end := keys.RaftLogKey(cArgs.EvalCtx.RangeID(), args.Index)
var diff enginepb.MVCCStats
// Passing zero timestamp to MVCCDeleteRange is equivalent to a ranged clear
// but it also computes stats. Note that any sideloaded payloads that may be
// removed by this truncation don't matter; they're not tracked in the raft
// log delta.
if _, _, _, err := engine.MVCCDeleteRange(ctx, batch, &diff, start, end, math.MaxInt64, /* max */
hlc.Timestamp{}, nil /* txn */, false /* returnKeys */); err != nil {
return EvalResult{}, err

// We start at index zero because it's always possible that a previous
// truncation did not clean up entries made obsolete by the previous
// truncation.
start := engine.MakeMVCCMetadataKey(keys.RaftLogKey(cArgs.EvalCtx.RangeID(), 0))
end := engine.MakeMVCCMetadataKey(keys.RaftLogKey(cArgs.EvalCtx.RangeID(), args.Index))

var ms enginepb.MVCCStats
if util.IsMigrated() {
// Compute the stats delta that were to occur should the log entries be
// purged. We do this as a side effect of seeing a new TruncatedState,
// downstream of Raft. A follower may not run the side effect in the event
// of an ill-timed crash, but that's OK since the next truncation will get
// everything.
//
// Note that any sideloaded payloads that may be removed by this truncation
// don't matter; they're not tracked in the raft log delta.
iter := batch.NewIterator(false /* !prefix */)
defer iter.Close()
// We can pass zero as nowNanos because we're only interested in SysBytes.
var err error
ms, err = iter.ComputeStats(start, end, 0 /* nowNanos */)
if err != nil {
return EvalResult{}, errors.Wrap(err, "while computing stats of Raft log freed by truncation")
}
ms.SysBytes = -ms.SysBytes // simulate the deletion

} else {
if _, _, _, err := engine.MVCCDeleteRange(ctx, batch, &ms, start.Key, end.Key, math.MaxInt64, /* max */
hlc.Timestamp{}, nil /* txn */, false /* returnKeys */); err != nil {
return EvalResult{}, err
}
}

tState := &roachpb.RaftTruncatedState{
Expand All @@ -1903,7 +1926,7 @@ func evalTruncateLog(

var pd EvalResult
pd.Replicated.State.TruncatedState = tState
pd.Replicated.RaftLogDelta = &diff.SysBytes
pd.Replicated.RaftLogDelta = &ms.SysBytes

return pd, makeReplicaStateLoader(cArgs.EvalCtx.RangeID()).setTruncatedState(ctx, batch, cArgs.Stats, tState)
}
Expand Down
47 changes: 42 additions & 5 deletions pkg/storage/replica_proposal.go
Original file line number Diff line number Diff line change
Expand Up @@ -737,17 +737,54 @@ func (r *Replica) handleReplicatedEvalResult(

if newTruncState := rResult.State.TruncatedState; newTruncState != nil {
rResult.State.TruncatedState = nil // for assertion

r.mu.Lock()
r.mu.state.TruncatedState = newTruncState
r.mu.Unlock()

// TODO(tschottdorf): everything below doesn't need to be on this
// goroutine. Worth moving out -- truncations are frequent and missing
// one of the side effects below doesn't matter. Need to be careful
// about the interaction with `evalTruncateLog` though, which computes
// some stats based on the log entries it sees. Also, sideloaded storage
// needs to hold the raft mu. Perhaps it should just get its own mutex
// (which is usually held together with raftMu, except when accessing
// the storage for a truncation). Or, even better, make use of the fact
// that all we need to synchronize is disk i/o, and there is no overlap
// between files *removed* during truncation and those active in Raft.

if util.IsMigrated() {
// Truncate the Raft log.
start := engine.MakeMVCCMetadataKey(keys.RaftLogKey(r.RangeID, 0))
end := engine.MakeMVCCMetadataKey(
keys.RaftLogKey(r.RangeID, newTruncState.Index).PrefixEnd(),
)
iter := r.store.engine.NewIterator(false /* !prefix */)
// Clear the log entries. Intentionally don't use range deletion
// tombstones (ClearRange()) due to performance concerns connected
// to having many range deletion tombstones. There is a chance that
// ClearRange will perform well here because the tombstones could be
// "collapsed", but it is hardly worth the risk at this point.
err := r.store.engine.ClearIterRange(iter, start, end)
iter.Close()

if err != nil {
log.Errorf(ctx, "unable to clear truncated Raft entries for %+v: %s", newTruncState, err)
}
}

// Clear any entries in the Raft log entry cache for this range up
// to and including the most recently truncated index.
r.store.raftEntryCache.clearTo(r.RangeID, newTruncState.Index+1)
log.Eventf(ctx, "truncating sideloaded storage up to (and including) index %d", newTruncState.Index)
if err := r.raftMu.sideloaded.TruncateTo(ctx, newTruncState.Index+1); err != nil {
// We don't *have* to remove these entries for correctness. Log a
// loud error, but keep humming along.
log.Errorf(ctx, "while removing sideloaded files during log truncation: %s", err)

// Truncate the sideloaded storage.
{
log.Eventf(ctx, "truncating sideloaded storage up to (and including) index %d", newTruncState.Index)
if err := r.raftMu.sideloaded.TruncateTo(ctx, newTruncState.Index+1); err != nil {
// We don't *have* to remove these entries for correctness. Log a
// loud error, but keep humming along.
log.Errorf(ctx, "while removing sideloaded files during log truncation: %s", err)
}
}
}

Expand Down
2 changes: 0 additions & 2 deletions pkg/util/migration_stub.go
Original file line number Diff line number Diff line change
Expand Up @@ -16,8 +16,6 @@ package util

import "github.com/cockroachdb/cockroach/pkg/util/envutil"

var _ = IsMigrated()

// IsMigrated gives a crude way of landing changes that need a migration until
// #16977 is implemented. When IsMigrated() returns false (the default),
// mixed-cluster compatibility with 1.0 is required.
Expand Down

0 comments on commit 47d325e

Please sign in to comment.