Skip to content
This repository has been archived by the owner on Nov 14, 2024. It is now read-only.

[LWMeta] Add metadata metrics to TimeLock #6665

Merged
merged 15 commits into from
Jul 21, 2023

Conversation

Toemmsche
Copy link
Contributor

@Toemmsche Toemmsche commented Jul 13, 2023

General

Before this PR:

After this PR:
We have metrics to observe how much metadata is being sent to TimeLock and how much we metadata are storing in the lock watch event sliding window. This will be useful to decide if we need a limit on the amount of metadata and if so, what the limit should be.

==COMMIT_MSG==
[LWMeta] Add metadata metrics to TimeLock
==COMMIT_MSG==

Priority:

Concerns / possible downsides (what feedback would you like?):
I noticed that we do not have a lot of metrics in TimeLock in general (basically just the witchcraft ones). Is this deliberate for performance reasons? If yes, then this PR does not make a lot of sense.

Is documentation needed?:

Compatibility

I have had to adjust the constructor interface of AsyncTimeLockServiceImpl and LockEventLogImpl and some usages of those, but i don't believe anyone is using then outside of TimeLock.

Testing and Correctness

What was existing testing like? What have you done to improve it?:
Added tests to verify that metrics are set correctly (for stored + request metadata)

Execution

How would I tell this PR works in production? (Metrics, logs, etc.):
We have new metrics for metadata: timelock.requestMetadata and timelock.storedMetadata
Has the safety of all log arguments been decided correctly?:
N/A
Will this change significantly affect our spending on metrics or logs?:
Not really given that we already emit metrics for every request in TimeLock and this metric is only for lock requests.
How would I tell that this PR does not work in production? (monitors, etc.):
Metrics are not visible OR the metrics are 0 even if clients are sending metadata-enriched requests.
If this PR does not work as expected, how do I fix that state? Would rollback be straightforward?:
A fix should be implemented
If the above plan is more complex than “recall and rollback”, please tag the support PoC here (if it is the end of the week, tag both the current and next PoC):
N/A

Scale

Would this PR be expected to pose a risk at scale? Think of the shopping product at our largest stack.:
No
Would this PR be expected to perform a large number of database calls, and/or expensive database calls (e.g., row range scans, concurrent CAS)?:
No
Would this PR ever, with time and scale, become the wrong thing to do - and if so, how would we know that we need to do something differently?:
If we get rid of metrics in TimeLock for performance, than we should also look at this.

Development Process

Where should we start reviewing?:

If this PR is in excess of 500 lines excluding versions lock-files, why does it not make sense to split it?:

@changelog-app
Copy link

changelog-app bot commented Jul 13, 2023

Generate changelog in changelog/@unreleased

What do the change types mean?
  • feature: A new feature of the service.
  • improvement: An incremental improvement in the functionality or operation of the service.
  • fix: Remedies the incorrect behaviour of a component of the service in a backwards-compatible way.
  • break: Has the potential to break consumers of this service's API, inclusive of both Palantir services
    and external consumers of the service's API (e.g. customer-written software or integrations).
  • deprecation: Advertises the intention to remove service functionality without any change to the
    operation of the service itself.
  • manualTask: Requires the possibility of manual intervention (running a script, eyeballing configuration,
    performing database surgery, ...) at the time of upgrade for it to succeed.
  • migration: A fully automatic upgrade migration task with no engineer input required.

Note: only one type should be chosen.

How are new versions calculated?
  • ❗The break and manual task changelog types will result in a major release!
  • 🐛 The fix changelog type will result in a minor release in most cases, and a patch release version for patch branches. This behaviour is configurable in autorelease.
  • ✨ All others will result in a minor version release.

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

[LWMeta] Add metadata metrics to TimeLock

Check the box to generate changelog(s)

  • Generate changelog entry

Comment on lines 68 to 73
public AsyncTimelockServiceImpl(
AsyncLockService lockService, ManagedTimestampService timestampService, LockLog lockLog) {
AsyncLockService lockService,
ManagedTimestampService timestampService,
LockLog lockLog,
MetadataMetrics metadataMetrics) {
this.metadataMetrics = metadataMetrics;
Copy link
Contributor Author

@Toemmsche Toemmsche Jul 14, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked in sourcegraph to ensure that no one is using this constructor outside of timelock

@Toemmsche Toemmsche marked this pull request as ready for review July 14, 2023 13:43
@Toemmsche Toemmsche requested a review from Sam-Kramer July 14, 2023 13:43
Comment on lines 4 to 10
namespaces:
requestMetadata:
docs: Metrics tracking metadata presence in lock requests sent to TimeLock
metrics:
numChangeMetadata:
docs: Number of change metadata objects contained in a lock request
type: histogram
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about the best way to name these. I wanted to highlight that we are only collecting metrics on how much/what kind of metadata we receive as part of requests. If we do other metrics in the future (e.g., how much metadata is currently persisted in TimeLock), I'd add them as a separate namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, that's exactly what I did

import java.util.Optional;
import java.util.Set;
import java.util.UUID;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class LockEventLogImpl implements LockEventLog {

@VisibleForTesting
static final int SLIDING_WINDOW_SIZE = 1000;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just so we do not have to adjust the tests if we ever decide to change this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice

.orElse(0);
int numPresentMetadataDiff = metadata.map(_unused -> 1).orElse(0)
- replacedMetadata.map(_unused -> 1).orElse(0);
metadataMetrics.numChangeMetadata().inc(changeMetadataSizeDiff);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

calling inc() with negative value works as expected (verified in a test)

/**
* Returns the {@link LockWatchEvent} that was replaced if the buffer is already full.
*/
Optional<LockWatchEvent> add(LockWatchEvent.Builder eventBuilder) {
Copy link
Contributor Author

@Toemmsche Toemmsche Jul 17, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most calls to this now drop the return value

Copy link
Contributor

@Sam-Kramer Sam-Kramer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't looked at tests yet, but so far looks good!

@@ -0,0 +1,20 @@
options:
javaPackage: 'com.palantir.atlasdb.timelock.metrics'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

slight semantic change I'd suggest, I'd go for com.palantir.atlasdb.timelock.lockwatches, and then have the namespaces be request and current (stored implies persistence).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can then just have changeMetadata rather than numChangeMetadata. So using the metric would look like: com.palantir.atlasdb.timelock.lockwatches.request.changeMetadata.p99. Or even com.palantir.atlasdb.timelock.lockwatches.request.changeMetadataCount.p99.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also think about how we can track # of bytes as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Number of bytes is a bit trickier since we potentially have to iterate over all metadata objects. For SKAPC, we know an upper bound for the number of bytes per ChangeMetadata object (maximum key size). Do we want the exact number anyways?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I was thinking of just tracking it when we do an add. It would be a meter metric, so it's more of the rate than total size.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could also do a counter, where we minus once we remove something from the sliding window

@@ -106,6 +112,12 @@ public ListenableFuture<LockResponseV2> lock(IdentifiedLockRequest request) {
request.getLockDescriptors(),
TimeLimit.of(request.getAcquireTimeoutMs()),
request.getMetadata());
metadataMetrics
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will construct a new histogram object for every request; we should create the histogram metadataMetrics::numChangeMetadata in the constructor

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also fixed this for the counters

/**
* Returns the {@link LockWatchEvent} that was replaced if the buffer is already full.
*/
Optional<LockWatchEvent> add(LockWatchEvent.Builder eventBuilder) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little hesitant to keep this named as add, maybe addOrReplace or something similar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the metrics to ArrayLockEventSlidingWindow, so this, once again, does not return anything. Should we still rename it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nah

import java.util.Optional;
import java.util.Set;
import java.util.UUID;
import java.util.function.Supplier;
import java.util.stream.Collectors;

public class LockEventLogImpl implements LockEventLog {

@VisibleForTesting
static final int SLIDING_WINDOW_SIZE = 1000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice

@@ -64,7 +78,22 @@ public synchronized <T> ValueAndLockWatchStateUpdate<T> runTask(
@Override
public synchronized void logLock(
Set<LockDescriptor> locksTakenOut, LockToken lockToken, Optional<LockRequestMetadata> metadata) {
slidingWindow.add(LockEvent.builder(locksTakenOut, lockToken, metadata));
Optional<LockWatchEvent> replacedEvent =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open question: Do we think we should have the metric recorded here, or in ArrayLockEventSlidingWindow? What's sort of frustrating by the previous implementation, is that we have the storage of lock events, and then the processing. Feels a bit weird to me that we track the storage of lock events at the processing layer, although the same times it sort of makes sense, as we avoid wiring in multiple places.

That said, maybe we can use the delegate pattern to help us?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see the point. Also, moving the metrics to the storage layer makes sense for testing since we have to be aware of the buffer filling up (which the processing layer should not need to be aware of).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind doing more wiring to track metrics at the very low level

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think delegating the metric tracking can be worthwhile if we do track a lot but for 2-3 metrics I don't think it's worth it yet

@Toemmsche Toemmsche requested a review from Sam-Kramer July 20, 2023 10:28
@Toemmsche Toemmsche force-pushed the tpapke/metadata/timelock-metadata-metrics branch from c13a5f5 to f0d976b Compare July 20, 2023 12:44
Comment on lines -108 to -122
@Value.Immutable
abstract static class FakeLockWatchEvent implements LockWatchEvent {
@Override
@Value.Parameter
public abstract long sequence();

@Override
@Value.Parameter
public abstract int size();

@Override
public <T> T accept(LockWatchEvent.Visitor<T> visitor) {
// do nothing
return null;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to remove this because the accept returning null here will cause the add to fail since an Optional<LockRequestMetadata> is now null Also, we should never return null anyways, no even in test code!.


@Test
public void maintainsCorrectMetadataCountWhenOverwritingBuffer() {
assertThat(WINDOW_SIZE).as("This test does not work with small windows").isGreaterThanOrEqualTo(5);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: explain why it doesn't work for small windows!

Copy link
Contributor

@Sam-Kramer Sam-Kramer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! one small nit

@bulldozer-bot bulldozer-bot bot merged commit 0995b73 into develop Jul 21, 2023
@bulldozer-bot bulldozer-bot bot deleted the tpapke/metadata/timelock-metadata-metrics branch July 21, 2023 14:06
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants