-
Notifications
You must be signed in to change notification settings - Fork 773
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[sdk-metrics] Fix race condition for MemoryPoint Reclaim #5546
[sdk-metrics] Fix race condition for MemoryPoint Reclaim #5546
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #5546 +/- ##
==========================================
+ Coverage 83.38% 85.58% +2.19%
==========================================
Files 297 289 -8
Lines 12531 12493 -38
==========================================
+ Hits 10449 10692 +243
+ Misses 2082 1801 -281
Flags with carried forward coverage won't be shown. Click here to find out more.
|
@Yun-Ting Would you mind spinning up an issue to make sure we have coyote test coverage over this delta metric point reclaim feature? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Changes
On the main branch:
An update thread does the following:
MetricPoint
arrayReferenceCount
for theMetricPoint
at that indexMetricPoint
is valid for useMetricPointStatus
toCollectPending
ReferenceCount
for theMetricPoint
The collect thread does the following:
MetricPointStatus
for a givenMetricPoint
isNoCollectPending
, then check if it can be marked invalidMetricPoint
invalid (this happens by setting theReferenceCount
toint.MinValue
when no one is using it)MetricPoint
Where's the race condition?
Consider this sequence of steps where both the Update thread and the Collect thread are working on the same MetricPoint
MetricPoint
indexMetricPoints
MetricPointStatus
for theMetricPoint
MetricPointStatus
isNoCollectPending
ReferenceCount
to1
MetricPoint
is valid for useMetricPointStatus
toCollectPending
(This is the update that we would miss)ReferenceCount
to0
ReferenceCount
toint.MinValue
as it finds theReferenceCount
to be0
MetricPoint
and misses the update that happened at T7Fix
I'm introducing a double-checked locking type construct to recheck if the
MetricPointStatus
was changed toCollectPending
before the Collect thread could mark the MetricPoint invalid for use. When that happens, we would now callSnapshot
for thatMetricPoint
and mark it to be reclaimed in the next Collect cycle. Note that theMetricPoint
would remain invalid for use until the next Collect cycle reclaims it.