
[dbnode] Shards assignment improvements during cluster topology changes #3425

Merged
merged 24 commits from linasn/assign-new-shards-fix into master on Apr 22, 2021

Conversation

soundvibe
Collaborator

@soundvibe soundvibe commented Apr 16, 2021

What this PR does / why we need it:

Currently, during cluster topology changes, the db node assigns new shards without taking already-running background processes into account. This behaviour is problematic because some db node background processes (e.g. cold flush, warm flush, snapshotting) rely heavily on retrieving owned shards from namespaces. These background processes usually call db.IsBootstrapped() to make sure that the retrieved shards are bootstrapped and consistent. That check was not fully sufficient because, in the current implementation, new shards might be assigned after the db.IsBootstrapped() check returns true, making background processes unpredictable and inconsistent.
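
As a rough illustration of that guard (the function below is hypothetical; only db.IsBootstrapped() comes from the description above, and the storage.Database parameter type is an assumption):

func runBackgroundFileOp(db storage.Database) {
  if !db.IsBootstrapped() {
    // Not every owned shard is guaranteed to be bootstrapped yet,
    // so skip this run and try again on the next tick.
    return
  }
  // ... retrieve owned shards from each namespace and run the file op
  // (cold flush, warm flush, snapshot) against them ...
}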

This PR solves the issue by enqueueing the new shardSet update and executing it only when it is safe, i.e. when no other background processes are running. We also wait until bootstrap has actually started when new shards are received. This ensures that new, not-yet-bootstrapped shards start bootstrapping (and the whole db enters the Bootstrapping state) before background processes resume, so the next db.IsBootstrapped() check in a background process returns false and its execution is skipped until the node is fully bootstrapped.
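
A minimal sketch of the enqueue-when-safe flow described above, using the mediator's EnqueueMutuallyExclusiveFn discussed later in this review (the surrounding method shape and error handling are an approximation, not the final implementation):

func (d *db) AssignShardSet(shardSet sharding.ShardSet) {
  // Defer the assignment so it runs only when no file ops
  // (cold/warm flush, snapshotting) are in flight.
  if err := d.mediator.EnqueueMutuallyExclusiveFn(func() {
    d.assignShardSet(shardSet)
  }); err != nil {
    if errors.Is(err, errMediatorNotOpen) {
      // Mediator not open yet: this is the initial assignment.
      d.assignShardSet(shardSet)
      return
    }
    // Unexpected error: log it.
  }
}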
Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:


Does this PR require updating code package or user-facing documentation?:


…ise warm and cold flushes might fail because some shards might still be not bootstrapped.
* master:
  Start (#3396)
  [query] Graphite fix exponentialMovingAverage seeded constant when num steps used for window size (#3395)
  [query] Fix Graphite exponentialMovingAverage to use correct EMA const and return single series per input series (#3391)
  [DOCS] Add contribution guide for documentation (#3365)
  [dbnode] Skip bootstrapping shards from aggregation (#3394)
…ssigned.

check for bootstrapped shards when doing cold flush cleanups.
* master:
  [DOCS] Configuration and component section overhaul (#3324)
  Pass a context to the query worker pool (#3350)
…it is better to make a check before calling cleanup.
* master:
  [dbnode] Decoder: fix handling of values requiring 64bit precision (#3406)
  Revert "Revert "[dbnode] Improve m3tsz decoding performance (#3358)" (#3403)" (#3405)
  [dbnode] TestReaderIteratorDecodingRegression (#3404)
  Revert "[dbnode] Improve m3tsz decoding performance (#3358)" (#3403)
  Update generated file (#3402)
…shards get assigned when file ops are not running.
…ely so that `IsBootstrappedAndDurable()` won't return true when db was previously bootstrapped and new bootstrap is enqueued.
@codecov

codecov bot commented Apr 16, 2021

Codecov Report

Merging #3425 (4f035bd) into master (4f035bd) will not change coverage.
The diff coverage is n/a.

❗ Current head 4f035bd differs from the pull request's most recent head 003ff06. Consider uploading reports for commit 003ff06 to get more accurate results.


@@           Coverage Diff           @@
##           master    #3425   +/-   ##
=======================================
  Coverage    72.1%    72.1%           
=======================================
  Files        1100     1100           
  Lines      103600   103600           
=======================================
  Hits        74780    74780           
  Misses      23663    23663           
  Partials     5157     5157           
Flag         Coverage Δ
aggregator   76.8% <0.0%> (ø)
cluster      84.9% <0.0%> (ø)
collector    84.3% <0.0%> (ø)
dbnode       78.3% <0.0%> (ø)
m3em         74.4% <0.0%> (ø)
m3ninx       73.5% <0.0%> (ø)
metrics      19.7% <0.0%> (ø)
msg          74.5% <0.0%> (ø)
query        66.9% <0.0%> (ø)
x            80.3% <0.0%> (ø)

Flags with carried forward coverage won't be shown.


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4f035bd...003ff06.

* master:
  [dbnode] Set default values for BootstrapPeersConfiguration (#3420)
  [integration-tests] Use explicit version for quay.io/m3db/prometheus_remote_client_golang (#3422)
  [dtest] Fix dtest docker compose config: env => environment (#3421)
  Fix broken links to edit pages (#3419)
  [dbnode] Fix races in source_data_test.go (#3418)
  [coordinator] add more information to processed count metric (#3415)
  [dbnode] Avoid use of grow on demand worker pool for fetchTagged and aggregate (#3416)
  [docs] Fix m3aggregagtor typo (#3417)
  [x/log] Bump zap version and add logging encoder configuration (#3377)
  Do not use buffer channels if growOnDemand is true (#3414)
  [dbnode] Fix TestSeriesWriteReadParallel datapoints too far in past with -race flag (#3413)
  [docs] Update m3db operator docs with v0.13.0 features (#3397)
  [aggregator] Fix followerFlushManager metrics (#3411)
  [query] Restore optimization to skip initial fetch for timeShift and unary fns (#3408)
@notbdu
Contributor

notbdu commented Apr 20, 2021

Approach and core logic LGTM.

* master:
  [query] Add Graphite find limit config integration test (#3428)
  [docs] Fix typo in consistency level description (#3431)
  Wrap errors from the m3 remote storage (#3427)
  [query] Add Graphite find and render limit option overrides (#3426)
  [coordinator] support for augmenting prom quantiles (#3424)
Comment on lines 158 to 161
asyncResult.bootstrapStarted.Done()

// Keep performing bootstraps until none pending and no error returned.
var result BootstrapResult
Collaborator


Perhaps we can use a defer here? Just in case another code path returns early.

Suggested change

Before:

  asyncResult.bootstrapStarted.Done()

  // Keep performing bootstraps until none pending and no error returned.
  var result BootstrapResult

After:

  // Keep performing bootstraps until none pending and no error returned.
  var result BootstrapResult
  asyncResult.bootstrapStarted.Done()
  defer func() {
    asyncResult.bootstrapResult = result
    asyncResult.bootstrapCompleted.Done()
  }()

Collaborator


(Would need to remove the bootstrapResult = result assignment and the bootstrapCompleted.Done() call at the bottom of the method, obviously.)

@@ -409,13 +407,11 @@ func (d *clusterDB) analyzeAndReportShardStates() {
count := d.bootstrapCount[id]
if count != len(namespaces) {
// Should never happen if bootstrapped and durable.
Collaborator


nit: Update this comment to say that this can temporarily occur due to a race condition?

Comment on lines 465 to 468
if errors.Is(err, errMediatorNotOpen) {
// initial assignment.
d.assignShardSet(shardSet)
} else {
Collaborator


Can we special case this rather than return an error and proceed anyway?

i.e.

if !d.mediator.Open() {
  // Initial assignment
  d.assignShardSet(shardSet)
  return
}

if err := d.mediator.EnqueueMutuallyExclusiveFn(func() {
  d.assignShardSet(shardSet)
}); err != nil {
  // log invariant error
}

Comment on lines 495 to 498
for len(b.externalFnCh) > 0 {
externalFn := <-b.externalFnCh
externalFn()
}
Collaborator


This is racy if there are any other readers on the channel. It is better (and more idiomatic/common) to simply loop and break when we can't read anymore:

drainLoop:
for {
  select {
  case fn := <-b.externalFnCh:
    fn()
  default:
    break drainLoop // Break from loop, no more to read (a plain break would only exit the select).
  }
}

* master:
  [dbnode] Adaptive WriteBatch allocations (#3429)
Collaborator

@robskillington robskillington left a comment


LGTM

@soundvibe soundvibe merged commit a4cee97 into master Apr 22, 2021
@soundvibe soundvibe deleted the linasn/assign-new-shards-fix branch April 22, 2021 13:59