Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dbnode] Get rid of excessive locking when adding new namespaces #3765

Merged
merged 29 commits into from
Oct 6, 2021

Conversation

soundvibe
Copy link
Collaborator

@soundvibe soundvibe commented Sep 22, 2021

What this PR does / why we need it:

Currently, when the new namespace is being added to m3db node, the database level lock is held throughout the whole update namespaces process. This might potentially take quite a lot of time because it needs to enqueue bootstrap and wait for it to actually start. We are talking about waiting for potentially a several minutes here. During this time dbnode is unable to handle new requests so goroutine count and memory usage begins to increase, eventually leading to OOM.

This PR ensures that namespaces are updated when all background file ops are disabled and completed and also reduces database level locking (lock is held only when namespaces are being updated in the internal map).

Before:
image
After:
image

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:


Does this PR require updating code package or user-facing documentation?:


@soundvibe soundvibe changed the title [dbnode] Remove excessive locking when adding new namespaces [dbnode] Get rid of excessive locking when adding new namespaces Sep 22, 2021
@codecov
Copy link

codecov bot commented Sep 22, 2021

Codecov Report

Merging #3765 (52b554e) into master (a6bf60d) will decrease coverage by 0.3%.
The diff coverage is 72.7%.

❗ Current head 52b554e differs from pull request most recent head 15fa0cd. Consider uploading reports for the commit 15fa0cd to get more accurate results

Impacted file tree graph

@@           Coverage Diff            @@
##           master   #3765     +/-   ##
========================================
- Coverage    57.1%   56.8%   -0.4%     
========================================
  Files         552     552             
  Lines       63468   63100    -368     
========================================
- Hits        36303   35874    -429     
- Misses      23961   24020     +59     
- Partials     3204    3206      +2     
Flag Coverage Δ
aggregator 63.3% <ø> (ø)
cluster ∅ <ø> (∅)
collector 58.4% <ø> (ø)
dbnode 60.4% <72.7%> (-0.5%) ⬇️
m3em 46.4% <ø> (ø)
metrics 19.7% <ø> (ø)
msg 74.3% <ø> (-0.1%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8dd5622...15fa0cd. Read the comment docs.

src/dbnode/storage/database.go Outdated Show resolved Hide resolved
src/dbnode/storage/database.go Outdated Show resolved Hide resolved
src/dbnode/storage/database.go Show resolved Hide resolved
Linas Naginionis added 2 commits September 22, 2021 17:37
…`onCompleteFn` might be invoked for the second bootstrap while another bootstrap is still in progress. This is bad because bootstrap should work when fileOps are disabled and `onCompleteFn` might be enabling fileOps.
@linasm
Copy link
Collaborator

linasm commented Sep 23, 2021

@soundvibe
Copy link
Collaborator Author

soundvibe commented Sep 23, 2021

Looks like there is a related test failure @soundvibe: https://buildkite.com/uberopensource/m3-monorepo-public/builds/10113#be0da0a5-486d-413a-ae61-0a6bdd257285/773-953

Yes, already working on a fix for assignShardSets().
Updated assignShardSets().

Linas Naginionis and others added 8 commits September 23, 2021 15:54
* master:
  [dbnode,query] Allow blocking operations on main thread to be interrupted (#3764)
  [coordinator] Add support for prometheus remote write compatible backend (#3742)
  [tests] Add dbnode capable of being started in-process. (#3759)
If db was not yet bootstrapped and new shardSet is assigned, do this immediately (no need to enqueue).
New test for add new namespace using enqueue.
* master:
  [aggregator] Include resendEnabled in clone() (#3769)
  [dbnode] Add support for dynamic (stdlib) pools in m3db client (#3775)
…he callback functions (#3810)

* Simplify callback logic by having the mediator own the lifecycle of the callback functions

* Remove no longer required BootstrapAsyncResult

* Remove unused require.NoError

* Fix lint
Linas Naginionis and others added 2 commits October 5, 2021 11:41
Fixed possible race in `context.StartSampledTraceSpan()`.
Use zap error logging where needed.
Make sure fileOps are disabled/enabled only when really needed (bootstraps > 0 and mediator is opened), otherwise just use simpler logics.
Copy link
Collaborator

@robskillington robskillington left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few minor suggestions before merging, simplifications and more use of lock/defer unlock since enqueueing bootstraps and turning back on file ops can be done while holding a lock (since both async/very fast):
#3818

Copy link
Collaborator

@robskillington robskillington left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM logic wise, I would want to merge the simplifications into this branch from the suggested PR I opened targeting this branch before we land this PR however: #3818

Stamping the PR so can land asynchronously.

@soundvibe
Copy link
Collaborator Author

LGTM logic wise, I would want to merge the simplifications into this branch from the suggested PR I opened targeting this branch before we land this PR however: #3818

Stamping the PR so can land asynchronously.

Merged the simplifications into this branch.

@soundvibe soundvibe merged commit 5f351fb into master Oct 6, 2021
@soundvibe soundvibe deleted the linasn/bootstrap-after-fileops-wait branch October 6, 2021 10:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants