[LWMeta] Add metadata metrics to TimeLock #6665
Conversation
 public AsyncTimelockServiceImpl(
-        AsyncLockService lockService, ManagedTimestampService timestampService, LockLog lockLog) {
+        AsyncLockService lockService,
+        ManagedTimestampService timestampService,
+        LockLog lockLog,
+        MetadataMetrics metadataMetrics) {
+    this.metadataMetrics = metadataMetrics;
I checked in Sourcegraph to ensure that no one is using this constructor outside of TimeLock.
namespaces:
  requestMetadata:
    docs: Metrics tracking metadata presence in lock requests sent to TimeLock
    metrics:
      numChangeMetadata:
        docs: Number of change metadata objects contained in a lock request
        type: histogram
Not sure about the best way to name these. I wanted to highlight that we are only collecting metrics on how much/what kind of metadata we receive as part of requests. If we do other metrics in the future (e.g., how much metadata is currently persisted in TimeLock), I'd add them as a separate namespace.
In fact, that's exactly what I did
import java.util.Optional;
import java.util.Set;
import java.util.UUID;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class LockEventLogImpl implements LockEventLog {

    @VisibleForTesting
    static final int SLIDING_WINDOW_SIZE = 1000;
just so we do not have to adjust the tests if we ever decide to change this
very nice
                .orElse(0);
        int numPresentMetadataDiff = metadata.map(_unused -> 1).orElse(0)
                - replacedMetadata.map(_unused -> 1).orElse(0);
        metadataMetrics.numChangeMetadata().inc(changeMetadataSizeDiff);
Calling inc() with a negative value works as expected (verified in a test).
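For reference, a minimal standalone sketch of that behaviour, assuming the metric behind metadataMetrics.numChangeMetadata() is a standard Dropwizard-style Counter (the registry and metric name below are made up for illustration):

import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;

public final class NegativeIncrementDemo {
    public static void main(String[] args) {
        // Stand-in for the change metadata counter used in the diff above.
        Counter changeMetadata = new MetricRegistry().counter("changeMetadata");
        changeMetadata.inc(5);  // five change metadata objects added to the window
        changeMetadata.inc(-2); // two evicted; a negative delta simply decreases the count
        System.out.println(changeMetadata.getCount()); // prints 3
    }
}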
    /**
     * Returns the {@link LockWatchEvent} that was replaced if the buffer is already full.
     */
    Optional<LockWatchEvent> add(LockWatchEvent.Builder eventBuilder) {
Most calls to this now drop the return value
Haven't looked at tests yet, but so far looks good!
@@ -0,0 +1,20 @@
options:
  javaPackage: 'com.palantir.atlasdb.timelock.metrics'
Slight semantic change I'd suggest: I'd go for com.palantir.atlasdb.timelock.lockwatches, and then have the namespaces be request and current (stored implies persistence).
You can then just have changeMetadata rather than numChangeMetadata. So using the metric would look like: com.palantir.atlasdb.timelock.lockwatches.request.changeMetadata.p99. Or even com.palantir.atlasdb.timelock.lockwatches.request.changeMetadataCount.p99.
We should also think about how we can track the number of bytes.
Number of bytes is a bit trickier since we potentially have to iterate over all metadata objects. For SKAPC, we know an upper bound for the number of bytes per ChangeMetadata object (maximum key size). Do we want the exact number anyway?
Yeah, I was thinking of just tracking it when we do an add. It would be a meter metric, so it's more of the rate than the total size.
We could also do a counter, where we decrement once we remove something from the sliding window.
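A rough sketch of the meter idea, assuming Dropwizard metrics; the class name, metric name, and the per-add byte size computation are illustrative placeholders rather than the actual TimeLock wiring:

import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

final class MetadataByteTracker {
    private final Meter changeMetadataBytes;

    MetadataByteTracker(MetricRegistry registry) {
        this.changeMetadataBytes = registry.meter("lockwatches.request.changeMetadataBytes");
    }

    // Called once per add: marking the meter captures the rate of metadata bytes
    // flowing in, rather than the total currently stored in the window.
    void recordAdd(long metadataSizeInBytes) {
        changeMetadataBytes.mark(metadataSizeInBytes);
    }
}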
@@ -106,6 +112,12 @@ public ListenableFuture<LockResponseV2> lock(IdentifiedLockRequest request) {
                 request.getLockDescriptors(),
                 TimeLimit.of(request.getAcquireTimeoutMs()),
                 request.getMetadata());
+        metadataMetrics
This will construct a new histogram object for every request; we should create the histogram metadataMetrics::numChangeMetadata in the constructor.
Also fixed this for the counters
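To illustrate the fix, a minimal sketch with plain Dropwizard types; the class and metric names are placeholders, not the actual generated MetadataMetrics wiring:

import com.codahale.metrics.Histogram;
import com.codahale.metrics.MetricRegistry;

final class RequestMetadataRecorder {
    private final Histogram changeMetadataPerRequest;

    RequestMetadataRecorder(MetricRegistry registry) {
        // Resolve the histogram once at construction time...
        this.changeMetadataPerRequest = registry.histogram("lockwatches.request.changeMetadata");
    }

    void recordLockRequest(int numChangeMetadataInRequest) {
        // ...so the per-request hot path only updates the cached instance instead of
        // building or looking up a histogram on every call.
        changeMetadataPerRequest.update(numChangeMetadataInRequest);
    }
}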
    /**
     * Returns the {@link LockWatchEvent} that was replaced if the buffer is already full.
     */
    Optional<LockWatchEvent> add(LockWatchEvent.Builder eventBuilder) {
I'm a little hesitant to keep this named as add, maybe addOrReplace or something similar?
I moved the metrics to ArrayLockEventSlidingWindow, so this, once again, does not return anything. Should we still rename it?
nah
@@ -64,7 +78,22 @@ public synchronized <T> ValueAndLockWatchStateUpdate<T> runTask(
     @Override
     public synchronized void logLock(
             Set<LockDescriptor> locksTakenOut, LockToken lockToken, Optional<LockRequestMetadata> metadata) {
-        slidingWindow.add(LockEvent.builder(locksTakenOut, lockToken, metadata));
+        Optional<LockWatchEvent> replacedEvent =
Open question: do we think we should have the metric recorded here, or in ArrayLockEventSlidingWindow? What's sort of frustrating about the previous implementation is that we have the storage of lock events, and then the processing. It feels a bit weird to me that we track the storage of lock events at the processing layer, although at the same time it sort of makes sense, as we avoid wiring in multiple places.
That said, maybe we can use the delegate pattern to help us?
I see the point. Also, moving the metrics to the storage layer makes sense for testing since we have to be aware of the buffer filling up (which the processing layer should not need to be aware of).
I don't mind doing more wiring to track metrics at the very low level
I think delegating the metric tracking can be worthwhile if we do track a lot, but for 2-3 metrics I don't think it's worth it yet.
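For completeness, a rough sketch of what the delegate/decorator approach could look like; the EventWindow interface and class names are invented for illustration and generic over the event type (it would be LockWatchEvent in the real code):

import com.codahale.metrics.Counter;

// Hypothetical narrow interface over the sliding window's add path.
interface EventWindow<E> {
    void add(E event);
}

// Decorator that records metrics around the real window, so neither the storage
// class nor the processing layer has to mix in metrics wiring directly.
final class MetricRecordingEventWindow<E> implements EventWindow<E> {
    private final EventWindow<E> delegate;
    private final Counter storedChangeMetadata;

    MetricRecordingEventWindow(EventWindow<E> delegate, Counter storedChangeMetadata) {
        this.delegate = delegate;
        this.storedChangeMetadata = storedChangeMetadata;
    }

    @Override
    public void add(E event) {
        delegate.add(event);
        storedChangeMetadata.inc(); // and dec() when the window evicts an event
    }
}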
Commits: "For now, just count the size of metadata for each incoming request"; "…Lock as part of the sliding window". Force-pushed from c13a5f5 to f0d976b.
    @Value.Immutable
    abstract static class FakeLockWatchEvent implements LockWatchEvent {
        @Override
        @Value.Parameter
        public abstract long sequence();

        @Override
        @Value.Parameter
        public abstract int size();

        @Override
        public <T> T accept(LockWatchEvent.Visitor<T> visitor) {
            // do nothing
            return null;
        }
Had to remove this because accept returning null here will cause the add to fail, since an Optional<LockRequestMetadata> is now null. Also, we should never return null anyway, not even in test code!
    @Test
    public void maintainsCorrectMetadataCountWhenOverwritingBuffer() {
        assertThat(WINDOW_SIZE).as("This test does not work with small windows").isGreaterThanOrEqualTo(5);
nit: explain why it doesn't work for small windows!
lgtm! one small nit
General
Before this PR:
After this PR:
We have metrics to observe how much metadata is being sent to TimeLock and how much metadata we are storing in the lock watch event sliding window. This will be useful for deciding whether we need a limit on the amount of metadata and, if so, what the limit should be.
==COMMIT_MSG==
[LWMeta] Add metadata metrics to TimeLock
==COMMIT_MSG==
Priority:
Concerns / possible downsides (what feedback would you like?):
I noticed that we do not have a lot of metrics in TimeLock in general (basically just the Witchcraft ones). Is this deliberate, for performance reasons? If so, then this PR does not make a lot of sense.
Is documentation needed?:
Compatibility
I have had to adjust the constructor interfaces of AsyncTimelockServiceImpl and LockEventLogImpl and some usages of those, but I don't believe anyone is using them outside of TimeLock.
Testing and Correctness
What was existing testing like? What have you done to improve it?:
Added tests to verify that metrics are set correctly (for stored + request metadata)
Execution
How would I tell this PR works in production? (Metrics, logs, etc.):
We have new metrics for metadata: timelock.requestMetadata and timelock.storedMetadata
Has the safety of all log arguments been decided correctly?:
N/A
Will this change significantly affect our spending on metrics or logs?:
Not really, given that we already emit metrics for every request in TimeLock and this metric is only for lock requests.
How would I tell that this PR does not work in production? (monitors, etc.):
Metrics are not visible OR the metrics are 0 even if clients are sending metadata-enriched requests.
If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?:
A fix should be implemented
If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC):
N/A
Scale
Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.:
No
Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?:
No
Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?:
If we get rid of metrics in TimeLock for performance, then we should also look at this.
Development Process
Where should we start reviewing?:
If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?: