Improve how M3DB handles data durability during topology changes #1183
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master    #1183    +/-   ##
========================================
+ Coverage    70.8%    71.1%   +0.2%
========================================
  Files         748      737     -11
  Lines       63055    61944   -1111
========================================
- Hits        44653    44043    -610
+ Misses      15527    15042    -485
+ Partials     2875     2859     -16
========================================
Continue to review full report at Codecov.
// is going on are both handled correctly. In addition, this will ensure that we hold onto both
// sets of data durably after topology changes and that the node can be properly bootstrapped
// from just the filesystem and commitlog in a later portion of the test.
seriesToWriteDuringPeerStreaming := []string{
Should we just add this as a separate test? Seems like we're repurposing this one and changing it quite substantially?
There are two separate tests, they just call into this shared codepath. I think it's fine; if I separated them out completely they'd be almost complete copy-pasta of each other, and the existing test benefits from this additional check as well (even if you don't verify the commitlog behavior, if you're doing a node add you probably want to make sure the joining node keeps track of the data it receives from its peer as well as all the data it's receiving while actually joining).
I see, yeah makes sense.
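(For reference, a rough sketch of the shared-codepath structure being discussed; the test and helper names below are hypothetical, not the actual functions in this PR.)

type topologyChangeTestOptions struct {
    verifyCommitlogCanBootstrap bool
}

// Two thin entry points exercise the same helper so the logic isn't duplicated.
func TestPeersBootstrapNodeAdd(t *testing.T) {
    testPeersBootstrapTopologyChange(t, topologyChangeTestOptions{verifyCommitlogCanBootstrap: false})
}

func TestPeersBootstrapNodeAddCommitlogDurability(t *testing.T) {
    testPeersBootstrapTopologyChange(t, topologyChangeTestOptions{verifyCommitlogCanBootstrap: true})
}

func testPeersBootstrapTopologyChange(t *testing.T, opts topologyChangeTestOptions) {
    // Shared setup: start the cluster, begin the node add, and write
    // seriesToWriteDuringPeerStreaming while shards are being streamed in.
    // When opts.verifyCommitlogCanBootstrap is set, additionally restart the
    // joining node and assert it recovers from the filesystem and commitlog alone.
}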
// We expect consistency errors because we're only running with
// R.F = 2 and one node is leaving and one node is joining for
// each of the shards that is changing hands.
if !client.IsConsistencyResultError(err) {
Hm maybe it's better to make it just RF=3?
I honestly just didn't do it because it would probably take a few hours to rewrite all the sharding logic and fix any little issues that crop up, and it doesn't really make the test any better. I can change it if you feel strongly.
Np, that's fine.
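(For reference, a rough sketch of the tolerance for expected consistency errors discussed in this thread; writeSeries is a hypothetical helper standing in for the test's actual write call.)

// With RF=2 and one node leaving plus one node joining per migrating shard,
// some writes legitimately cannot meet the consistency level, so only
// unexpected errors should fail the test.
for _, series := range seriesToWriteDuringPeerStreaming {
    if err := writeSeries(session, series); err != nil && !client.IsConsistencyResultError(err) {
        t.Fatalf("unexpected write error for %s: %v", series, err)
    }
}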
src/dbnode/storage/database.go
    d.queueBootstrapWithLock()
}

func (d *db) hasReceivedNewShards(incoming sharding.ShardSet) bool {
nit: Rename to hasReceivedNewShardsWithLock?
src/dbnode/storage/database.go
func (d *db) hasReceivedNewShards(incoming sharding.ShardSet) bool {
    var (
        existing    = d.shardSet
        existingSet = map[uint32]struct{}{}
nit: existingSet = make(map[uint32]struct{}, len(existing.AllIDs()))
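(Applying both nits, the method could look roughly like the sketch below; this assumes sharding.ShardSet exposes AllIDs() for the owned shard IDs, as in the diff above.)

func (d *db) hasReceivedNewShardsWithLock(incoming sharding.ShardSet) bool {
    var (
        existing    = d.shardSet
        existingSet = make(map[uint32]struct{}, len(existing.AllIDs()))
    )
    for _, shard := range existing.AllIDs() {
        existingSet[shard] = struct{}{}
    }
    // Any incoming shard not already owned means new shards were received.
    for _, shard := range incoming.AllIDs() {
        if _, ok := existingSet[shard]; !ok {
            return true
        }
    }
    return false
}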
LGTM other than some nits and the comment about only setting the last bootstrapped time if multiErr.FinalError() == nil for the bootstrap.
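(A minimal sketch of that suggestion; lastBootstrapCompletionTime and nowFn are hypothetical names for whatever the bootstrap manager actually records.)

// Only record a successful bootstrap time when the whole bootstrap succeeded,
// so durability checks never treat a partially failed bootstrap as complete.
if err := multiErr.FinalError(); err == nil {
    m.lastBootstrapCompletionTime = m.nowFn()
}
return multiErr.FinalError()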
Previously, M3DB implemented durability during topology changes by forcing the peers bootstrapper to write out a snapshot file for any mutable data that was streamed in from a peer. This ensured that if the node crashed after the topology change, the commitlog bootstrapper could restore all the data the node was expected to have.
We're currently trying to move M3DB to a model where every set of snapshot files indicates that all the data that had been received by the node up until the beginning of the snapshot could be recovered from a combination of the data files, snapshot files, and commitlog files. The existing implementation clashes with that idea because it was writing out individual snapshot files that were not tied to any large snapshot process.
As a result, this PR modifies M3DB to eschew writing out snapshot files for mutable data in the peers bootstrapper, and instead prevents the clustered database from marking shards as available until all bootstrapping for the topology change has completed and the most recent successful snapshot began after that bootstrap finished. This is a much simpler model that is robust against future changes to the database.
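(A rough sketch of that availability check; the function and parameter names are hypothetical stand-ins for the actual clustered database logic.)

// Shards taken on during a topology change are only marked available once all
// bootstrapping for the change has completed AND a snapshot that started after
// that bootstrap finished has completed successfully, guaranteeing the
// peer-streamed data is durable on disk.
func canMarkShardsAvailable(bootstrapping bool, lastBootstrapCompleted, lastSuccessfulSnapshotStart time.Time) bool {
    if bootstrapping {
        return false
    }
    return lastSuccessfulSnapshotStart.After(lastBootstrapCompleted)
}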