
Remove nested read lock to prevent deadlock #2128

Merged: 5 commits into master from juchan/deadlock-fix on Jan 31, 2020

Conversation

justinjc (Collaborator)

What this PR does / why we need it:
Fixes #2127

Special notes for your reviewer:

Does this PR introduce a user-facing and/or backwards incompatible change?:

NONE

Does this PR require updating code package or user-facing documentation?:

NONE

justinjc requested a review from prateek on January 30, 2020 at 16:42
prateek (Collaborator) commented on Jan 30, 2020:

notes from our chat:

  • rewrite the fix to not require multiple acquisitions of the shard read lock
  • TestShardTickWriteRace to use a property for tickBatchSize <- (0,10)
  • TestShardTickWriteRace to have tick wait interval be millisecond
  • TestShardTickWriteRace run the internal test method 100 times and mark it big (to ensure race detector)
  • can finally close [WIP] Shard Tick/Write race #506
  • file an issue to create a debug/test lock which checks for recursive gets and fails; and/or a linter that doesn't suck at this (https://github.com/gnieto/mulint ain't it)
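
For context on the class of bug being fixed, here is a standalone Go sketch (illustrative only, not M3DB code) of why a nested read lock deadlocks: sync.RWMutex stops admitting new readers once a writer is waiting, so an outer RLock, a pending Lock, and a nested RLock on the same goroutine all block each other.

```go
// Standalone sketch of the nested-RLock deadlock (illustrative, not M3DB code).
package main

import (
	"sync"
	"time"
)

func main() {
	var mu sync.RWMutex

	mu.RLock() // outer read lock, e.g. held across a shard tick

	go func() {
		mu.Lock() // a writer arrives and waits for the outer read lock
		mu.Unlock()
	}()

	time.Sleep(10 * time.Millisecond) // give the writer time to start waiting

	// Nested read lock on the same goroutine: it queues behind the pending
	// writer, and the writer is queued behind the outer read lock, so neither
	// can make progress. Running this trips Go's runtime deadlock detector.
	mu.RLock()

	mu.RUnlock()
	mu.RUnlock()
}
```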

@@ -385,8 +385,12 @@ func (s *dbShard) RetrievableBlockColdVersion(blockStart time.Time) (int, error)

// BlockStatesSnapshot implements series.QueryableBlockRetriever
func (s *dbShard) BlockStatesSnapshot() series.ShardBlockStateSnapshot {
	s.RLock()
	defer s.RUnlock()
	return s.blockStatesSnapshotWithRLock()
prateek (Collaborator) commented on this diff:
is it safe to acquire the flushState RLock while holding the shard RLock?

justinjc (Collaborator, Author) replied:
Yep, I checked usages of the flushState RLock and they don't conflict with the shard lock anywhere else in the code, so this should be good.
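
To make the ordering argument concrete, here is a minimal sketch (assumed names, not the actual dbShard code) of the pattern being relied on: the flushState read lock is only ever taken while already holding the shard lock, never the other way around, so holding both at once cannot deadlock against another goroutine.

```go
// Minimal lock-ordering sketch with assumed names (not the actual dbShard code).
package shardsketch

import "sync"

type flushStates struct {
	sync.RWMutex
	statusByTime map[int64]int // illustrative payload
}

type shard struct {
	sync.RWMutex // shard-level lock: always acquired first
	flushState   flushStates
}

// blockStatesSnapshot copies the flush states while holding the shard read
// lock. Nesting the flushState read lock inside it is safe because no code
// path acquires the shard lock while already holding the flushState lock.
func (s *shard) blockStatesSnapshot() map[int64]int {
	s.RLock()
	defer s.RUnlock()

	s.flushState.RLock()
	defer s.flushState.RUnlock()

	out := make(map[int64]int, len(s.flushState.statusByTime))
	for t, v := range s.flushState.statusByTime {
		out[t] = v
	}
	return out
}
```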

codecov bot commented Jan 30, 2020

Codecov Report

Merging #2128 into master will decrease coverage by 18.7%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #2128      +/-   ##
=========================================
- Coverage    69.9%   51.2%   -18.8%     
=========================================
  Files        1001     828     -173     
  Lines       86476   75886   -10590     
=========================================
- Hits        60478   38863   -21615     
- Misses      21715   33702   +11987     
+ Partials     4283    3321     -962
| Flag | Coverage Δ |
|------|------------|
| #aggregator | 68% <ø> (+4.7%) ⬆️ |
| #cluster | 77.3% <ø> (+1.8%) ⬆️ |
| #collector | 48.8% <ø> (-7.2%) ⬇️ |
| #dbnode | 64.5% <100%> (+0.2%) ⬆️ |
| #m3em | 44.2% <ø> (-8.3%) ⬇️ |
| #m3ninx | 56.8% <ø> (-4.9%) ⬇️ |
| #m3nsch | 100% <ø> (+71.5%) ⬆️ |
| #metrics | 17.6% <ø> (-82.4%) ⬇️ |
| #msg | 72.9% <ø> (-1.5%) ⬇️ |
| #query | 26.5% <ø> (-17.5%) ⬇️ |
| #x | ? |

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 217cfe8...7f78cd3.

prateek (Collaborator) left a comment:
LGTM, good stuff

justinjc merged commit af17524 into master on Jan 31, 2020.
justinjc deleted the juchan/deadlock-fix branch on January 31, 2020 at 00:20.
Merging this pull request closes: Rare goroutine deadlock in M3DB (#2127)