Fetcher: Fix data races #4663
Conversation
Amazing job, just one nit!
@@ -723,7 +739,10 @@ func (r *ReplicaLabelRemover) Modify(_ context.Context, metas map[ulid.ULID]*met
 		level.Warn(r.logger).Log("msg", "block has no labels left, creating one", r.replicaLabels[0], "deduped")
 		l[r.replicaLabels[0]] = "deduped"
 	}
-	metas[u].Thanos.Labels = l
+
+	nm := *meta
deep copy! 🤗
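For readers following along, here is a minimal, self-contained sketch of the copy-before-modify pattern that the `nm := *meta` line applies. It is not the actual Thanos code: `Meta` is a simplified stand-in for `metadata.Meta` and `removeReplicaLabels` is a hypothetical helper. The point is that the shared `*Meta` is never mutated in place; its struct is shallow-copied, given a freshly built labels map, and the copy is stored back.

```go
package main

import "fmt"

// Meta is a simplified stand-in for metadata.Meta.
type Meta struct {
	Labels map[string]string
}

// removeReplicaLabels is a hypothetical helper illustrating the pattern.
func removeReplicaLabels(metas map[string]*Meta, replicaLabels []string) {
	for id, meta := range metas {
		// Copy the labels into a fresh map so the shared meta's map is never mutated.
		l := make(map[string]string, len(meta.Labels))
		for k, v := range meta.Labels {
			l[k] = v
		}
		for _, rl := range replicaLabels {
			delete(l, rl)
		}

		// Shallow-copy the struct, point the copy at the new map, and store the
		// copy back, so concurrent readers of the original *Meta are unaffected.
		nm := *meta
		nm.Labels = l
		metas[id] = &nm
	}
}

func main() {
	orig := &Meta{Labels: map[string]string{"replica": "a", "cluster": "eu1"}}
	metas := map[string]*Meta{"01ABC": orig}

	removeReplicaLabels(metas, []string{"replica"})

	fmt.Println(orig.Labels)           // original meta untouched: map[cluster:eu1 replica:a]
	fmt.Println(metas["01ABC"].Labels) // modified copy: map[cluster:eu1]
}
```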
pkg/block/fetcher.go
Outdated
}

// Filter filters out blocks that are marked for deletion after a given delay.
// It also returns the blocks that can be deleted since they were uploaded delay duration before current time.
func (f *IgnoreDeletionMarkFilter) Filter(ctx context.Context, metas map[ulid.ULID]*metadata.Meta, synced *extprom.TxGaugeVec) error {
	f.mtx.Lock()
	f.deletionMarkMap = make(map[ulid.ULID]*metadata.DeletionMark)
Why is that needed? Do we access this map somewhere else?
If that's the case, I think we have to build this map locally and only then swap it in while holding the lock, right?
Otherwise we will have "partial" results
It's being accessed in DeletionMarkBlocks. I'm wondering what the ramifications of having only 'partial' results in the map are; shouldn't we pick up any leftovers during the next run of the cleaner func? (Isn't that how it works now, since the current version also potentially allows accessing the map with 'partial' results?)
Otherwise you're right, we can also build it separately and swap it in afterwards while holding the lock.
I am just assuming there is a reader somewhere else accessing the deletion map concurrently. For a short period, instead of 20000 deleted items it would have zero elements, and then back to 20000. I would need to check what exactly can go wrong with this, but it is just asking for problems here, no? (: Maybe things would eventually heal now, but not tomorrow. No one would assume that the internal map is flaky unless it's called flakyDeletionMarkMap across instances, with 20 lines of comments on why, and even then that might be surprising for the next writer/reader (:
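For context, here is a minimal sketch of the "build locally, then swap under the lock" approach being asked for. It is not the actual fetcher code: `Filter` below is a simplified stand-in for `IgnoreDeletionMarkFilter`, keyed by plain strings instead of ULIDs.

```go
package main

import (
	"fmt"
	"sync"
)

// Filter is a simplified stand-in for IgnoreDeletionMarkFilter.
type Filter struct {
	mtx             sync.Mutex
	deletionMarkMap map[string]struct{}
}

// DeletionMarkBlocks returns the current snapshot. Because Filter swaps in a
// fully built map, callers never observe it half-populated.
func (f *Filter) DeletionMarkBlocks() map[string]struct{} {
	f.mtx.Lock()
	defer f.mtx.Unlock()
	return f.deletionMarkMap
}

// Filter rebuilds the map locally and swaps it in under the lock.
func (f *Filter) Filter(ids []string) {
	// Build the new map without holding the lock.
	m := make(map[string]struct{}, len(ids))
	for _, id := range ids {
		m[id] = struct{}{}
	}

	// Swap in a single short critical section: readers see either the
	// complete old map or the complete new one, never a partial state.
	f.mtx.Lock()
	f.deletionMarkMap = m
	f.mtx.Unlock()
}

func main() {
	f := &Filter{}
	f.Filter([]string{"01A", "01B"})
	fmt.Println(len(f.DeletionMarkBlocks())) // 2
}
```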
I would fix this before moving on
Thanks for clarifying, I see the point now 👍 I adjusted it, please let me know if it looks good now!
Responded (:
pkg/block/fetcher.go
Outdated
	f.mtx.Lock()
	defer f.mtx.Unlock()

	deletionMarkMap := make(map[ulid.ULID]*metadata.DeletionMark)
👍🏽
-	deletionMarkMap := make(map[ulid.ULID]*metadata.DeletionMark)
+	deletionMarkMap := make(map[ulid.ULID]*metadata.DeletionMark, len(f.deletionMarkMap))
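As background on this suggestion, the second argument to `make` for a map is a capacity hint: the runtime pre-allocates room for roughly that many entries, so the map does not have to grow repeatedly while it is being refilled. A small standalone illustration, assuming the map size stays roughly stable between runs (the values are made up, not from the PR):

```go
package main

import "fmt"

func main() {
	// Previous snapshot; its length is a reasonable guess for the next one.
	old := map[string]bool{"01A": true, "01B": true, "01C": true}

	// Pre-size the fresh map so it is not grown incrementally while refilling.
	fresh := make(map[string]bool, len(old)) // capacity hint, not a hard limit
	for id := range old {
		fresh[id] = true
	}
	fmt.Println(len(fresh)) // 3
}
```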
pkg/block/fetcher.go
Outdated
}

// Filter filters out blocks that are marked for deletion after a given delay.
// It also returns the blocks that can be deleted since they were uploaded delay duration before current time.
func (f *IgnoreDeletionMarkFilter) Filter(ctx context.Context, metas map[ulid.ULID]*metadata.Meta, synced *extprom.TxGaugeVec) error {
	f.mtx.Lock()
	f.deletionMarkMap = make(map[ulid.ULID]*metadata.DeletionMark)
I would fix this before moving on
Thank You! LGTM! 💪🏽
@@ -871,6 +900,10 @@ func (f *IgnoreDeletionMarkFilter) Filter(ctx context.Context, metas map[ulid.UL
 		return errors.Wrap(err, "filter blocks marked for deletion")
 	}

+	f.mtx.Lock()
+	f.deletionMarkMap = deletionMarkMap
👍🏽
Changes
While refactoring the E2E tests, it was discovered that some compactor tests occasionally fail due to a data race (#4579 (comment)).
After building the Docker image with the race detector enabled and running the compactor E2E tests, a number of data races caused by concurrent map reads and writes were found.
This PR attempts to address all of the discovered races.
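For readers unfamiliar with the failure mode, here is a tiny, self-contained illustration (not taken from the Thanos code) of the kind of concurrent map read/write the race detector reports. Running it with the `-race` flag (e.g. `go run -race`) produces a data race report; without synchronization the runtime may even abort with a fatal concurrent map access error. Guarding both sides with a `sync.Mutex`, or swapping in a freshly built map under a lock as this PR does, resolves it.

```go
package main

import (
	"strconv"
	"time"
)

var marks = map[string]bool{}

func main() {
	// Writer goroutine: keeps repopulating the map, like a periodic sync loop.
	go func() {
		for i := 0; ; i++ {
			marks[strconv.Itoa(i)] = true
		}
	}()

	// Reader: concurrently looks up entries without any synchronization.
	for i := 0; i < 100; i++ {
		_ = marks["0"]
		time.Sleep(time.Millisecond)
	}
}
```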
Verification
Ran compactor tests with race detector enabled on the Thanos binary - ✔️