[LWMeta] Add metadata metrics to TimeLock #6665
Conversation
 public AsyncTimelockServiceImpl(
-        AsyncLockService lockService, ManagedTimestampService timestampService, LockLog lockLog) {
+        AsyncLockService lockService,
+        ManagedTimestampService timestampService,
+        LockLog lockLog,
+        MetadataMetrics metadataMetrics) {
+    this.metadataMetrics = metadataMetrics;
I checked in Sourcegraph to ensure that no one is using this constructor outside of TimeLock.
namespaces:
  requestMetadata:
    docs: Metrics tracking metadata presence in lock requests sent to TimeLock
    metrics:
      numChangeMetadata:
        docs: Number of change metadata objects contained in a lock request
        type: histogram
Not sure about the best way to name these. I wanted to highlight that we are only collecting metrics on how much/what kind of metadata we receive as part of requests. If we do other metrics in the future (e.g., how much metadata is currently persisted in TimeLock), I'd add them as a separate namespace.
In fact, that's exactly what I did
import java.util.Optional;
import java.util.Set;
import java.util.UUID;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class LockEventLogImpl implements LockEventLog {

    @VisibleForTesting
    static final int SLIDING_WINDOW_SIZE = 1000;
just so we do not have to adjust the tests if we ever decide to change this
very nice
                .orElse(0);
        int numPresentMetadataDiff = metadata.map(_unused -> 1).orElse(0)
                - replacedMetadata.map(_unused -> 1).orElse(0);
        metadataMetrics.numChangeMetadata().inc(changeMetadataSizeDiff);
Calling inc() with a negative value works as expected (verified in a test).
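For reference, a minimal standalone sketch of that behaviour, assuming the metric behind metadataMetrics.numChangeMetadata() is a standard Dropwizard-style Counter (the registry and metric name below are made up for illustration):

import com.codahale.metrics.Counter;
import com.codahale.metrics.MetricRegistry;

public final class NegativeIncrementDemo {
    public static void main(String[] args) {
        // Stand-in for the change metadata counter used in the diff above.
        Counter changeMetadata = new MetricRegistry().counter("changeMetadata");
        changeMetadata.inc(5);  // five change metadata objects added to the window
        changeMetadata.inc(-2); // two evicted; a negative delta simply decreases the count
        System.out.println(changeMetadata.getCount()); // prints 3
    }
}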
    /**
     * Returns the {@link LockWatchEvent} that was replaced if the buffer is already full.
     */
    Optional<LockWatchEvent> add(LockWatchEvent.Builder eventBuilder) {
Most calls to this now drop the return value
Haven't looked at tests yet, but so far looks good!
@@ -0,0 +1,20 @@
options:
  javaPackage: 'com.palantir.atlasdb.timelock.metrics'
Slight semantic change I'd suggest: I'd go for com.palantir.atlasdb.timelock.lockwatches, and then have the namespaces be request and current (stored implies persistence).
You can then just have changeMetadata rather than numChangeMetadata. So using the metric would look like: com.palantir.atlasdb.timelock.lockwatches.request.changeMetadata.p99. Or even com.palantir.atlasdb.timelock.lockwatches.request.changeMetadataCount.p99.
We should also think about how we can track the number of bytes.
Number of bytes is a bit trickier since we potentially have to iterate over all metadata objects. For SKAPC, we know an upper bound for the number of bytes per ChangeMetadata object (maximum key size). Do we want the exact number anyway?
Yeah, I was thinking of just tracking it when we do an add. It would be a meter metric, so it's more of the rate than the total size.
We could also do a counter, where we decrement once we remove something from the sliding window.
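A rough sketch of the meter idea, assuming Dropwizard metrics; the class name, metric name, and the per-add byte size computation are illustrative placeholders rather than the actual TimeLock wiring:

import com.codahale.metrics.Meter;
import com.codahale.metrics.MetricRegistry;

final class MetadataByteTracker {
    private final Meter changeMetadataBytes;

    MetadataByteTracker(MetricRegistry registry) {
        this.changeMetadataBytes = registry.meter("lockwatches.request.changeMetadataBytes");
    }

    // Called once per add: marking the meter captures the rate of metadata bytes
    // flowing in, rather than the total currently stored in the window.
    void recordAdd(long metadataSizeInBytes) {
        changeMetadataBytes.mark(metadataSizeInBytes);
    }
}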
@@ -106,6 +112,12 @@ public ListenableFuture<LockResponseV2> lock(IdentifiedLockRequest request) {
                 request.getLockDescriptors(),
                 TimeLimit.of(request.getAcquireTimeoutMs()),
                 request.getMetadata());
+        metadataMetrics
This will construct a new histogram object for every request; we should create the histogram metadataMetrics::numChangeMetadata in the constructor.
Also fixed this for the counters
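To illustrate the fix, a minimal sketch with plain Dropwizard types; the class and metric names are placeholders, not the actual generated MetadataMetrics wiring:

import com.codahale.metrics.Histogram;
import com.codahale.metrics.MetricRegistry;

final class RequestMetadataRecorder {
    private final Histogram changeMetadataPerRequest;

    RequestMetadataRecorder(MetricRegistry registry) {
        // Resolve the histogram once at construction time...
        this.changeMetadataPerRequest = registry.histogram("lockwatches.request.changeMetadata");
    }

    void recordLockRequest(int numChangeMetadataInRequest) {
        // ...so the per-request hot path only updates the cached instance instead of
        // building or looking up a histogram on every call.
        changeMetadataPerRequest.update(numChangeMetadataInRequest);
    }
}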
    /**
     * Returns the {@link LockWatchEvent} that was replaced if the buffer is already full.
     */
    Optional<LockWatchEvent> add(LockWatchEvent.Builder eventBuilder) {
I'm a little hesitant to keep this named as add, maybe addOrReplace or something similar?
I moved the metrics to ArrayLockEventSlidingWindow, so this, once again, does not return anything. Should we still rename it?
nah
@@ -64,7 +78,22 @@ public synchronized <T> ValueAndLockWatchStateUpdate<T> runTask(
     @Override
     public synchronized void logLock(
             Set<LockDescriptor> locksTakenOut, LockToken lockToken, Optional<LockRequestMetadata> metadata) {
-        slidingWindow.add(LockEvent.builder(locksTakenOut, lockToken, metadata));
+        Optional<LockWatchEvent> replacedEvent =
Open question: do we think we should have the metric recorded here, or in ArrayLockEventSlidingWindow? What's sort of frustrating about the previous implementation is that we have the storage of lock events, and then the processing. It feels a bit weird to me that we track the storage of lock events at the processing layer, although at the same time it sort of makes sense, as we avoid wiring in multiple places.
That said, maybe we can use the delegate pattern to help us?
I see the point. Also, moving the metrics to the storage layer makes sense for testing since we have to be aware of the buffer filling up (which the processing layer should not need to be aware of).
I don't mind doing more wiring to track metrics at the very low level
I think delegating the metric tracking can be worthwhile if we do track a lot, but for 2-3 metrics I don't think it's worth it yet.
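For completeness, a rough sketch of what the delegate/decorator approach could look like; the EventWindow interface and class names are invented for illustration and generic over the event type (it would be LockWatchEvent in the real code):

import com.codahale.metrics.Counter;

// Hypothetical narrow interface over the sliding window's add path.
interface EventWindow<E> {
    void add(E event);
}

// Decorator that records metrics around the real window, so neither the storage
// class nor the processing layer has to mix in metrics wiring directly.
final class MetricRecordingEventWindow<E> implements EventWindow<E> {
    private final EventWindow<E> delegate;
    private final Counter storedChangeMetadata;

    MetricRecordingEventWindow(EventWindow<E> delegate, Counter storedChangeMetadata) {
        this.delegate = delegate;
        this.storedChangeMetadata = storedChangeMetadata;
    }

    @Override
    public void add(E event) {
        delegate.add(event);
        storedChangeMetadata.inc(); // and dec() when the window evicts an event
    }
}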
Commits: "For now, just count the size of metadata for each incoming request"; "…Lock as part of the sliding window". Force-pushed from c13a5f5 to f0d976b.
    @Value.Immutable
    abstract static class FakeLockWatchEvent implements LockWatchEvent {
        @Override
        @Value.Parameter
        public abstract long sequence();

        @Override
        @Value.Parameter
        public abstract int size();

        @Override
        public <T> T accept(LockWatchEvent.Visitor<T> visitor) {
            // do nothing
            return null;
        }
Had to remove this because accept returning null here will cause the add to fail, since an Optional<LockRequestMetadata> is now null. Also, we should never return null anyway, not even in test code!
    @Test
    public void maintainsCorrectMetadataCountWhenOverwritingBuffer() {
        assertThat(WINDOW_SIZE).as("This test does not work with small windows").isGreaterThanOrEqualTo(5);
nit: explain why it doesn't work for small windows!
lgtm! one small nit
General
Before this PR:
After this PR:
We have metrics to observe how much metadata is being sent to TimeLock and how much metadata we are storing in the lock watch event sliding window. This will be useful for deciding whether we need a limit on the amount of metadata and, if so, what the limit should be.
==COMMIT_MSG==
[LWMeta] Add metadata metrics to TimeLock
==COMMIT_MSG==
Priority:
Concerns / possible downsides (what feedback would you like?):
I noticed that we do not have a lot of metrics in TimeLock in general (basically just the Witchcraft ones). Is this deliberate, for performance reasons? If so, then this PR does not make a lot of sense.
Is documentation needed?:
Compatibility
I have had to adjust the constructor interfaces of AsyncTimelockServiceImpl and LockEventLogImpl and some usages of those, but I don't believe anyone is using them outside of TimeLock.
Testing and Correctness
What was existing testing like? What have you done to improve it?:
Added tests to verify that metrics are set correctly (for stored + request metadata)
Execution
How would I tell this PR works in production? (Metrics, logs, etc.):
We have new metrics for metadata: timelock.requestMetadata and timelock.storedMetadata
Has the safety of all log arguments been decided correctly?:
N/A
Will this change significantly affect our spending on metrics or logs?:
Not really, given that we already emit metrics for every request in TimeLock and this metric is only for lock requests.
How would I tell that this PR does not work in production? (monitors, etc.):
Metrics are not visible OR the metrics are 0 even if clients are sending metadata-enriched requests.
If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?:
A fix should be implemented
If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC):
N/A
Scale
Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.:
No
Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?:
No
Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?:
If we get rid of metrics in TimeLock for performance, then we should also look at this.
Development Process
Where should we start reviewing?:
If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?: