Fix delta metric storage concurrency bug #5932
@@ -26,6 +26,7 @@
 import java.util.Queue;
 import java.util.concurrent.ConcurrentHashMap;
 import java.util.concurrent.ConcurrentLinkedQueue;
+import java.util.concurrent.locks.StampedLock;
 import java.util.logging.Level;
 import java.util.logging.Logger;

@@ -46,7 +47,8 @@ public final class DefaultSynchronousMetricStorage<T extends PointData, U extend
   private final MetricDescriptor metricDescriptor;
   private final AggregationTemporality aggregationTemporality;
   private final Aggregator<T, U> aggregator;
-  private final ConcurrentHashMap<Attributes, AggregatorHandle<T, U>> aggregatorHandles =
+  private final StampedLock sl = new StampedLock();
+  private ConcurrentHashMap<Attributes, AggregatorHandle<T, U>> aggregatorHandles =
       new ConcurrentHashMap<>();
   private final AttributesProcessor attributesProcessor;

@@ -83,8 +85,13 @@ Queue<AggregatorHandle<T, U>> getAggregatorHandlePool() {
   @Override
   public void recordLong(long value, Attributes attributes, Context context) {
-    AggregatorHandle<T, U> handle = getAggregatorHandle(attributes, context);
-    handle.recordLong(value, attributes, context);
+    long stamp = sl.readLock();
+    try {
+      AggregatorHandle<T, U> handle = getAggregatorHandle(attributes, context);
+      handle.recordLong(value, attributes, context);
+    } finally {
+      sl.unlockRead(stamp);
+    }
   }

   @Override

@@ -99,8 +106,13 @@ public void recordDouble(double value, Attributes attributes, Context context) {
               + ". Dropping measurement.");
       return;
     }
-    AggregatorHandle<T, U> handle = getAggregatorHandle(attributes, context);
-    handle.recordDouble(value, attributes, context);
+    long stamp = sl.readLock();
+    try {
+      AggregatorHandle<T, U> handle = getAggregatorHandle(attributes, context);
+      handle.recordDouble(value, attributes, context);
+    } finally {
+      sl.unlockRead(stamp);
+    }
   }

   private AggregatorHandle<T, U> getAggregatorHandle(Attributes attributes, Context context) {

@@ -146,13 +158,25 @@ public MetricData collect(
             ? registeredReader.getLastCollectEpochNanos()
             : startEpochNanos;

+    ConcurrentHashMap<Attributes, AggregatorHandle<T, U>> aggregatorHandles;
+    if (reset) {
+      long stamp = sl.writeLock();
+      try {
+        aggregatorHandles = this.aggregatorHandles;
+        this.aggregatorHandles = new ConcurrentHashMap<>();
+      } finally {
+        sl.unlockWrite(stamp);
+      }
+    } else {
+      aggregatorHandles = this.aggregatorHandles;
+    }
+
     // Grab aggregated points.
     List<T> points = new ArrayList<>(aggregatorHandles.size());
     aggregatorHandles.forEach(
         (attributes, handle) -> {
           T point = handle.aggregateThenMaybeReset(start, epochNanos, attributes, reset);
           if (reset) {
             aggregatorHandles.remove(attributes, handle);
Review comment on the line above: Here's the bug: […] The solution is to guard the […]
             // Return the aggregator to the pool.
             aggregatorHandlePool.offer(handle);
           }
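To make the change easier to follow as a whole, here is a condensed, hypothetical rendering of the locking scheme the diff introduces, with the SDK-specific types replaced by simple stand-ins (a String key instead of Attributes, a LongAdder instead of AggregatorHandle). This is a sketch, not the actual PR code:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.StampedLock;

// Condensed view of the locking scheme in the diff above, with SDK types replaced by
// simple stand-ins. Not the actual PR code.
final class GuardedDeltaStorage {
  private final StampedLock sl = new StampedLock();
  private ConcurrentHashMap<String, LongAdder> handles = new ConcurrentHashMap<>();

  void record(String key, long value) {
    long stamp = sl.readLock(); // blocks only while collect() holds the write lock
    try {
      handles.computeIfAbsent(key, k -> new LongAdder()).add(value);
    } finally {
      sl.unlockRead(stamp);
    }
  }

  ConcurrentHashMap<String, LongAdder> collect() {
    ConcurrentHashMap<String, LongAdder> collected;
    long stamp = sl.writeLock(); // waits for in-flight record() calls, then swaps the map
    try {
      collected = handles;
      handles = new ConcurrentHashMap<>();
    } finally {
      sl.unlockWrite(stamp);
    }
    return collected; // safe to aggregate: no recorder can still reach this map
  }
}
```

The point to notice is that record() can block, but only for the brief window in which collect() holds the write lock to swap the map; that trade-off is what the discussion below is about.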
I have an important note and a question.

Important note: we must never block upon recording, as that deteriorates performance (recording would have to wait for the write lock to be released).

I suggest we create a data structure called `AggregatorHandles`: we should have an `activeAggregatorHandles` and a `standbyAggregatorHandles` of that type.

When we record, we call `readLock().tryLock()`. If we get `false` it means the write lock has been taken and we need to refresh the value from `activeAggregatorHandles` (explained below why), hence upon `false` we re-read the value at `activeAggregatorHandles` and call `tryLock()` again - this should never fail; if it does, fail.

Upon collecting:

1. Switch `activeAggregatorHandles` and `standbyAggregatorHandles`, saving the value that was in active as the `AggregatorHandles` we will work on.
2. Call `writeLock().lock()` on it. This will cause us to block until all readers which took a handle finish recording. Since that is not user-dependent, it should be near-immediate and guaranteed to happen. It's ok to block in `collect()` as it's not as latency-sensitive as `record()`.
3. From now on, recordings use the new `AggregatorHandles` and obtain and use another lock (the newly active lock). The only "left-overs" we have are recordings which retrieved the previous `AggregatorHandles` and haven't yet managed to call `readLock().tryLock()`. There are two options for them:
   a. `writeLock()` was already called, hence they will get `false` - this is a signal for them that `activeAggregatorHandles` was switched; they need to re-read its value and obtain a read lock. We can wrap this in a while loop. I tried thinking about it a lot and I can't see it spinning forever - it doesn't seem like a realistic option.
   b. `readLock().tryLock()` returns `true` and they continue to use the old map, and `writeLock()` waits for them (see the sketch after this comment).

Question: I wonder why use `StampedLock` vs `ReentrantReadWriteLock`, which seems much easier to reason about and doesn't require persisting a stamp in memory. It's harder to reason about the code when you see the stamp returned from the lock without understanding why the stamp is needed.
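A minimal sketch of the active/standby idea described above, using hypothetical names (`HandleMap`, `SwapBasedStorage`) and simple stand-ins (String keys, LongAdder accumulators) rather than the SDK's real types:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical pairing of a handle map with its own read/write lock.
final class HandleMap {
  final ConcurrentHashMap<String, LongAdder> handles = new ConcurrentHashMap<>();
  final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
}

final class SwapBasedStorage {
  // record() always reads this volatile reference; collect() swaps it.
  private volatile HandleMap active = new HandleMap();
  private HandleMap standby = new HandleMap();

  void record(String key, long value) {
    while (true) {
      HandleMap current = active;
      // tryLock() only fails while collect() holds this map's write lock, which also
      // means 'active' has already been swapped - so re-read it and try again.
      if (current.lock.readLock().tryLock()) {
        try {
          current.handles.computeIfAbsent(key, k -> new LongAdder()).add(value);
          return;
        } finally {
          current.lock.readLock().unlock();
        }
      }
    }
  }

  Map<String, Long> collect() {
    // 1. Switch active and standby; new recordings immediately land in the other map.
    HandleMap toCollect = active;
    active = standby;
    standby = toCollect;
    // 2. Take the write lock on the old map: this waits only for recorders that are
    //    already inside record(), so the wait is short and bounded. Blocking here is
    //    acceptable because collect() is far less latency-sensitive than record().
    toCollect.lock.writeLock().lock();
    try {
      Map<String, Long> points = new HashMap<>();
      toCollect.handles.forEach((key, adder) -> points.put(key, adder.sumThenReset()));
      toCollect.handles.clear();
      return points;
    } finally {
      toCollect.lock.writeLock().unlock();
    }
  }
}
```

With this scheme record() never blocks: if tryLock() fails it simply re-reads the active reference and retries, and a very late recorder that still locks the old map just parks its value there until that map becomes active and is collected again.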
I'm switching to `ReentrantReadWriteLock` for reasons described here.

Sure. The approach you outline is optimistic reads. There are a number of different ways to do this, and they appear to be simpler / higher performance with `StampedLock`. But in either case, I'll rework it so recording measurements doesn't have to block.
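For context on the "optimistic reads" remark, the idiom `StampedLock` enables looks roughly like this (a generic illustration of the pattern, not code from this PR):

```java
import java.util.concurrent.locks.StampedLock;

// Generic StampedLock optimistic-read idiom: readers first try a lock-free optimistic
// read and only fall back to a real read lock if a writer invalidated the stamp.
final class Counter {
  private final StampedLock sl = new StampedLock();
  private long value;

  long read() {
    long stamp = sl.tryOptimisticRead(); // no blocking on the hot path
    long current = value;
    if (!sl.validate(stamp)) {           // a write happened while we were reading
      stamp = sl.readLock();             // fall back to a pessimistic read lock
      try {
        current = value;
      } finally {
        sl.unlockRead(stamp);
      }
    }
    return current;
  }

  void add(long delta) {
    long stamp = sl.writeLock();
    try {
      value += delta;
    } finally {
      sl.unlockWrite(stamp);
    }
  }
}
```

tryOptimisticRead() costs no blocking and no lock handoff on the happy path; readers only pay for a real lock when a writer invalidated the stamp mid-read.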
We're still blocking :) Once a collection starts you grab a writeLock, which blocks all readLocks until the collection has finished. Hence I suggested the different approach outlined above.
The logic you suggest does check out and does not block. I'm running the performance tests now to evaluate if reading a volatile variable every time we record on the hot path degrades performance in a serious way. It could be the case that reading a non-volatile variable and only blocking for an extremely short amount of time once per collection is better than reading a volatile variable every record but never blocking.
OK, so here are the JMH results for the plain read-write-lock solution, which blocks for a narrow window when collecting, vs. @asafm's proposal to always read from a volatile but never block. Source code for the volatile, never-blocking solution is here.

The solution that always reads from a volatile and never blocks reduces performance on the record path by a modest ~4% versus the read-write-lock approach, which blocks briefly during collection. It's also worth noting that, as implemented, the volatile never-block approach impacts cumulative performance as well, which isn't strictly necessary since this concurrency bug only affects delta. Cumulative should be able to read a non-volatile variable safely since it never needs to change it.
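For illustration only, a JMH benchmark along these lines could compare the record path of the two variants; the class and method names reuse the hypothetical sketches from earlier in this thread, not the SDK classes the PR actually benchmarks, and the real benchmark and numbers are the ones linked in the comment above:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;

// Hypothetical benchmark shape: measures average record() latency under contention
// for the two sketch implementations defined earlier in this thread.
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class RecordPathBenchmark {

  private GuardedDeltaStorage lockBased;   // blocks briefly while collect() swaps the map
  private SwapBasedStorage neverBlocking;  // volatile re-read on every record(), never blocks

  @Setup
  public void setup() {
    lockBased = new GuardedDeltaStorage();
    neverBlocking = new SwapBasedStorage();
  }

  @Benchmark
  @Threads(4)
  public void recordWithReadWriteLock() {
    lockBased.record("http.route=/users", 1);
  }

  @Benchmark
  @Threads(4)
  public void recordWithVolatileSwap() {
    neverBlocking.record("http.route=/users", 1);
  }
}
```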
The solution to always read from a volatile and never block reduces performance on the record path by a modest ~4% versus the read write lock approach which blocks briefly during collection. Its also worth noting that as implemented, the volatile never block approach impacts the cumulative performance as well, which isn't strictly necessary since this concurrency bug only affects delta. Cumulative should be able to read a non-volatile variable safely since it never needs to change it.