
storage: queue replicas for raft log truncation after write ops #7125

Merged

Conversation

petermattis
Collaborator

@petermattis commented Jun 9, 2016

Fixes #6012.


This change is Reviewable

@petermattis
Collaborator Author

@bdarnell I wanted to get feedback on this approach before I push it any further. In manual testing it seems to work as expected, keeping the raft log proactively truncated. Before merging I would want to add some specific testing of this functionality.

@BramGruneir
Member

BramGruneir commented Jun 9, 2016

My only concern is that under heavy load we might be checking the raft log state too often, specifically by calling getTruncatableIndexes(), which locks first the store and then the replica. Is there a way to only call MaybeAdd after a specific number of writes, perhaps every RaftLogQueueStaleThreshold writes, instead of after each one?

@petermattis
Collaborator Author

@BramGruneir getTruncatableIndexes only locks the store due to issues with TestGetTruncatableIndexes. Once you fix #7056 we'll only be locking the replica, and there is plenty of replica locking/unlocking going on already. I doubt this additional locking will be significant.

That said, only checking whether the replica should be queued after every Nth write would be very easy to add.

@petermattis force-pushed the pmattis/truncate-log-on-write-cmd branch from 7b0b4e1 to 986b2c5 on June 9, 2016 at 21:38
@petermattis
Collaborator Author

@BramGruneir I tweaked maybeAddToRaftLogQueue to only try adding the replica on every 20th write operation. Manual testing indicates this works well. I still need to add unit tests for this change.
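
As a concrete illustration of the gating being discussed, here is a minimal, self-contained sketch of the every-Nth-write check; the threshold value of 100 and the stripped-down replica type are assumptions for illustration, not the actual storage package code.

package main

import "fmt"

// RaftLogQueueStaleThreshold mirrors the constant named in the review; the
// value of 100 is assumed here for illustration.
const RaftLogQueueStaleThreshold = 100

// raftLogCheckFrequency gates how often a write triggers a queue check, so
// that the check (and its locking) runs on every 20th write, not every write.
const raftLogCheckFrequency = RaftLogQueueStaleThreshold / 5

type replica struct {
	queueChecks int // stand-in for calls to r.store.raftLogQueue.MaybeAdd
}

// maybeAddToRaftLogQueue sketches the idea described above: only every
// raftLogCheckFrequency-th applied index considers queueing the replica.
func (r *replica) maybeAddToRaftLogQueue(appliedIndex uint64) {
	if appliedIndex%raftLogCheckFrequency == 0 {
		r.queueChecks++
	}
}

func main() {
	r := &replica{}
	for i := uint64(1); i <= 1000; i++ {
		r.maybeAddToRaftLogQueue(i)
	}
	fmt.Println("queue checks for 1000 writes:", r.queueChecks) // prints 50
}

Under this scheme a burst of writes pays the locking cost of getTruncatableIndexes at most once per 20 applied commands.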

@tamird
Contributor

tamird commented Jun 9, 2016

Reviewed 1 of 6 files at r1, 5 of 5 files at r2.
Review status: all files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


storage/client_metrics_test.go, line 152 [r2] (raw file):

      t.Fatal(err)
  }
  mtc.stores[0].DisableRaftLogQueue(true)

does this want to be a loop? does it need a helpful comment?


storage/queue.go, line 262 [r2] (raw file):

func (bq *baseQueue) MaybeAdd(repl *Replica, now roachpb.Timestamp) {
  if bq.gossip == nil {
      // The queue is disabled due to being part of the bootstrap store.

can this move up the stack to avoid creating the queue?


storage/replica_test.go, line 866 [r2] (raw file):

  defer tc.Stop()
  tc.clock.SetMaxOffset(maxClockOffset)
  tc.store.DisableRaftLogQueue(true)

does this need a comment?


Comments from Reviewable

@petermattis
Collaborator Author

Review status: 3 of 6 files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


storage/client_metrics_test.go, line 152 [r2] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

does this want to be a loop? does it need a helpful comment?

Switched to a loop and added a comment. The short summary is that raft log truncation confuses this test.
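
For reference, the loop form asked for here might look roughly like the following; multiTestContext and DisableRaftLogQueue are stand-ins modeled on the identifiers visible in the diff hunk above, not the real test harness.

package main

import "fmt"

type store struct{ raftLogQueueDisabled bool }

// DisableRaftLogQueue is a stand-in for the store method referenced above.
func (s *store) DisableRaftLogQueue(disabled bool) { s.raftLogQueueDisabled = disabled }

type multiTestContext struct{ stores []*store }

func main() {
	mtc := multiTestContext{stores: []*store{{}, {}, {}}}
	// Raft log truncation would confuse the counts this test inspects, so
	// disable the queue on every store rather than only stores[0].
	for _, s := range mtc.stores {
		s.DisableRaftLogQueue(true)
	}
	fmt.Println("queue disabled on", len(mtc.stores), "stores")
}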

storage/queue.go, line 262 [r2] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

can this move up the stack to avoid creating the queue?

I actually investigated that but ran into a number of headaches. I think doing so is the right long term approach. I'll file an issue as it is separate from this PR.

storage/replica_test.go, line 866 [r2] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

does this need a comment?

Yes. Similar to my other response, raft log truncation confuses this test.

Comments from Reviewable

@petermattis
Collaborator Author

Review status: 3 of 6 files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


storage/replica.go, line 2354 [r3] (raw file):

  if appliedIndex%raftLogCheckFrequency == 0 {
      r.store.raftLogQueue.MaybeAdd(r, r.store.Clock().Now())
  }

I'm not seeing an easy way to test this. If I disable the raftLogQueue then this code doesn't have any effect. If I enable the raftLogQueue then there is a race between the scanner and this code and I'm not sure which one won. I could jump through hoops and stop the replica scanner loop. Thoughts?

Curious how we're testing the maybeAddToSplitQueue code. I couldn't find the test which exercises that code path, but it is very possible I'm just missing it.


Comments from Reviewable

@petermattis force-pushed the pmattis/truncate-log-on-write-cmd branch from 4cba101 to 8c79418 on June 10, 2016 at 14:50
@tamird
Contributor

tamird commented Jun 10, 2016

Reviewed 3 of 3 files at r3, 2 of 2 files at r4.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


storage/queue.go, line 262 [r2] (raw file):

Previously, petermattis (Peter Mattis) wrote…

I actually investigated that but ran into a number of headaches. I think doing so is the right long term approach. I'll file an issue as it is separate from this PR.

Cool. I think a TODO here with the issue number would be helpful.

Comments from Reviewable

@petermattis
Collaborator Author

Review status: all files reviewed at latest revision, 2 unresolved discussions, all commit checks successful.


storage/queue.go, line 262 [r2] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

Cool. I think a TODO here with the issue number would be helpful.

I just didn't push through the headaches enough. Looks like it was actually quite straightforward in the end: #7157

Comments from Reviewable

@bdarnell
Contributor

LGTM modulo tests.

Previously, petermattis (Peter Mattis) wrote…

@BramGruneir I tweaked maybeAddToRaftLogQueue to only try adding the replica on every 20th write operation. Manual testing indicates this works well. I still need to add unit tests for this change.


Review status: all files reviewed at latest revision, 3 unresolved discussions, all commit checks successful.


storage/replica.go, line 2354 [r3] (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm not seeing an easy way to test this. If I disable the raftLogQueue then this code doesn't have any effect. If I enable the raftLogQueue then there is a race between the scanner and this code and I'm not sure which one won. I could jump through hoops and stop the replica scanner loop. Thoughts?

Curious how we're testing the maybeAddToSplitQueue code. I couldn't find the test which exercises that code path, but it is very possible I'm just missing it.

We probably aren't testing `maybeAddToSplitQueue`. I think that bit of code predates our infrastructure for effectively testing splits.

Most tests set the ScanInterval to 10 minutes, which is enough to make this race unlikely. You could set the interval even higher if you want to make sure (or add a Disable method to the scanner and not just the individual queues).


storage/replica.go, line 2408 [r4] (raw file):

// truncation. If yes, the range is added to the raft log queue.
func (r *Replica) maybeAddToRaftLogQueue(appliedIndex uint64) {
  const raftLogCheckFrequency = RaftLogQueueStaleThreshold / 5

/ 4 (or >> 2) may be a little faster.

If RaftLogQueueStaleThreshold is too low, this rounds down to zero. We should make a note that if this becomes configurable, we need to either enforce that it falls in an appropriate range or round up to 1 here.
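
To make the rounding hazard concrete, here is a small sketch assuming a hypothetical low threshold value; the `+ 1` guard shown is the form the review settles on later.

package main

import "fmt"

func main() {
	// Hypothetical, assumed value: a threshold smaller than the divisor.
	const smallThreshold = 3

	// Plain integer division rounds down to zero, which would turn the
	// modulo check in maybeAddToRaftLogQueue into a division by zero.
	fmt.Println(smallThreshold / 4) // prints 0

	// Rounding up keeps the frequency at least 1, i.e. check every write.
	fmt.Println(smallThreshold/4 + 1) // prints 1
}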


Comments from Reviewable

@petermattis force-pushed the pmattis/truncate-log-on-write-cmd branch from 8c79418 to 0806632 on June 14, 2016 at 01:59
@tamird
Contributor

tamird commented Jun 14, 2016

Reviewed 3 of 3 files at r5.
Review status: all files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


Comments from Reviewable

@petermattis force-pushed the pmattis/truncate-log-on-write-cmd branch from 0806632 to 5f8992b on June 14, 2016 at 12:46
@petermattis changed the title from "[DO NOT MERGE] storage: queue replicas for raft log truncation after write ops" to "storage: queue replicas for raft log truncation after write ops" on Jun 14, 2016
@petermattis
Collaborator Author

I added the ability to disable a replicaScanner and a new test which verifies that proactive raft log truncation is being performed. PTAL.
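
A rough sketch of what a scanner disable toggle along these lines could look like; the replicaScanner type and the atomic-flag storage here are assumptions for illustration rather than the actual implementation.

package main

import (
	"fmt"
	"sync/atomic"
)

type replicaScanner struct {
	disabled int32 // 1 while replica processing is turned off
}

// SetDisabled turns replica scanning off or on as directed (removals would
// still be processed while disabled, per the comment in the PR).
func (rs *replicaScanner) SetDisabled(disabled bool) {
	var v int32
	if disabled {
		v = 1
	}
	atomic.StoreInt32(&rs.disabled, v)
}

func (rs *replicaScanner) isDisabled() bool {
	return atomic.LoadInt32(&rs.disabled) == 1
}

func main() {
	rs := &replicaScanner{}
	rs.SetDisabled(true)
	fmt.Println("disabled:", rs.isDisabled()) // true
	rs.SetDisabled(false)
	fmt.Println("disabled:", rs.isDisabled()) // false
}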

@petermattis force-pushed the pmattis/truncate-log-on-write-cmd branch 2 times, most recently from 1ce7952 to 53bf860 on June 14, 2016 at 12:53
@petermattis
Collaborator Author

Review status: 3 of 6 files reviewed at latest revision, 2 unresolved discussions, some commit checks pending.


storage/replica.go, line 2354 [r3] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

We probably aren't testing maybeAddToSplitQueue. I think that bit of code predates our infrastructure for effectively testing splits.

Most tests set the ScanInterval to 10 minutes, which is enough to make this race unlikely. You could set the interval even higher if you want to make sure (or add a Disable method to the scanner and not just the individual queues).

I've added `replicaScanner.SetDisabled` so that processing of replicas can be disabled in tests.

storage/replica.go, line 2408 [r4] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

/ 4 (or >> 2) may be a little faster.

If RaftLogQueueStaleThreshold is too low, this rounds down to zero. We should make a note that if this becomes configurable, we need to either enforce that it falls in an appropriate range or that we round up to 1 here.

Can't imagine the speed matters here, but I've changed this to `/ 4`. I've added a comment about `RaftLogQueueStaleThreshold` becoming too low.

Comments from Reviewable

@tamird
Contributor

tamird commented Jun 14, 2016

:lgtm:


Reviewed 2 of 2 files at r6, 1 of 1 files at r7.
Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


storage/raft_log_queue_test.go, line 180 [r7] (raw file):

  r, err := store.GetReplica(1)
  if err != nil {
      t.Error(err)

should be Fatal


storage/raft_log_queue_test.go, line 187 [r7] (raw file):

  r.mu.Unlock()
  if err != nil {
      t.Error(err)

should be Fatal
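
Both notes ask for Fatal because t.Error records the failure but lets the test keep running, so a nil replica would be dereferenced a line later; t.Fatal stops the test immediately. A minimal stand-alone illustration (getReplica is a stand-in for store.GetReplica):

package main

import "testing"

type replica struct{ id int }

// getReplica is a stand-in for store.GetReplica; imagine it can fail and
// return a nil replica along with the error.
func getReplica(id int) (*replica, error) {
	return &replica{id: id}, nil
}

func TestFatalStopsBeforeNilDeref(t *testing.T) {
	r, err := getReplica(1)
	if err != nil {
		// With t.Error the test would continue and panic on r.id below;
		// t.Fatal ends this test function right here.
		t.Fatal(err)
	}
	if r.id != 1 {
		t.Fatalf("unexpected replica id: %d", r.id)
	}
}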


storage/replica.go, line 2597 [r7] (raw file):

// truncation. If yes, the range is added to the raft log queue.
func (r *Replica) maybeAddToRaftLogQueue(appliedIndex uint64) {
  // If RaftLogQueueStaleThreshold is too small, raftLogCheckFrequency may be 0

any reason not to `const raftLogCheckFrequency = RaftLogQueueStaleThreshold / 4 + 1`?


storage/scanner.go, line 121 [r7] (raw file):

// SetDisabled turns replica scanning off or on as directed. Note that while
// disabled, removals are still processed
func (rs *replicaScanner) SetDisabled(disabled bool) {

this function should be in the file helpers_test.go


Comments from Reviewable

@petermattis force-pushed the pmattis/truncate-log-on-write-cmd branch from 53bf860 to d390791 on June 14, 2016 at 16:15
@petermattis
Collaborator Author

Review status: all files reviewed at latest revision, 6 unresolved discussions, all commit checks successful.


storage/raft_log_queue_test.go, line 180 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

should be Fatal

Copy&paste from above. I've fixed both tests.

storage/raft_log_queue_test.go, line 187 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

should be Fatal

Done.

storage/replica.go, line 2597 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

any reason not to const raftLogCheckFrequency = RaftLogQueueStaleThreshold / 4 + 1?

Done.

storage/scanner.go, line 121 [r7] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this function should be in the file helpers_test.go

Done.

Comments from Reviewable

@tamird
Contributor

tamird commented Jun 14, 2016

Reviewed 4 of 4 files at r8.
Review status: all files reviewed at latest revision, 4 unresolved discussions, some commit checks pending.


storage/helpers_test.go, line 122 [r8] (raw file):

// SetDisabled turns replica scanning off or on as directed. Note that while
// disabled, removals are still processed

missing period


storage/raft_log_queue_test.go, line 79 [r8] (raw file):

  store, _, stopper := createTestStore(t)
  defer stopper.Stop()
  if _, err := store.GetReplica(0); err == nil {

this is pretty weird - why is it even here?


Comments from Reviewable

@petermattis
Collaborator Author

Review status: all files reviewed at latest revision, 4 unresolved discussions, all commit checks successful.


storage/helpers_test.go, line 122 [r8] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

missing period

Done.

storage/raft_log_queue_test.go, line 79 [r8] (raw file):

Previously, tamird (Tamir Duberstein) wrote…

this is pretty weird - why is it even here?

@BramGruneir?

Comments from Reviewable

@petermattis force-pushed the pmattis/truncate-log-on-write-cmd branch from d390791 to a4b7a46 on June 14, 2016 at 17:02
@petermattis merged commit f10003e into cockroachdb:master on Jun 14, 2016
@petermattis deleted the pmattis/truncate-log-on-write-cmd branch on June 14, 2016 at 18:47
@bdarnell
Contributor

:lgtm:


Review status: 6 of 7 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


storage/raft_log_queue_test.go, line 201 [r9] (raw file):

  // Wait for tasks to finish, in case the processLoop grabbed the event
  // before ForceRaftLogScanAndProcess but is still working on it.

Comment looks stale (or else it's not clear where ForceRaftLogScanAndProcess is getting called)


Comments from Reviewable

@petermattis
Collaborator Author

Review status: 6 of 7 files reviewed at latest revision, 1 unresolved discussion, all commit checks successful.


storage/raft_log_queue_test.go, line 201 [r9] (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Comment looks stale (or else it's not clear where ForceRaftLogScanAndProcess is getting called)

Doh, already merged. Sending a follow-on PR to fix.

Comments from Reviewable
