Skip to content
This repository has been archived by the owner on Nov 14, 2024. It is now read-only.

add metrics to scrubber #4398

Merged
merged 2 commits into from
Nov 20, 2019
Merged

add metrics to scrubber #4398

merged 2 commits into from
Nov 20, 2019

Conversation

clockfort
Copy link
Contributor

Add metrics to scrubbing, for the purposes of debugging a stack that may not be keeping up with its own scrub queue.

(Aside) I believe the eventual goal is to get rid of 90% of scrubbing and trust in targeted sweep, but in the interim this needs metrics.

@changelog-app
Copy link

changelog-app bot commented Nov 8, 2019

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

The scrubber has been instrumented with metrics.

Check the box to generate changelog(s)

  • Generate changelog entry

@clockfort clockfort force-pushed the scrub_metrics branch 3 times, most recently from fe65c5c to bf5dfea Compare November 8, 2019 22:39
@@ -172,6 +184,13 @@ private Scrubber(KeyValueService keyValueService,
this.threadCount = threadCount;
this.readThreadCount = readThreadCount;
this.followers = followers;
this.metricsManager = metricsManager;

this.enqueuedCells = metricsManager.registerOrGetMeter(Scrubber.class, AtlasDbMetricNames.ENQUEUED_CELLS);
Copy link
Contributor

@jkozlowski jkozlowski Nov 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make those metrics be lazily registered? I am only mildly familiar with the Scrubber, but it looks like it's only used for hard delete?

 private void scrubForAggressiveHardDelete(SnapshotTransaction tx) {
        if ((tx.getTransactionType() == TransactionType.AGGRESSIVE_HARD_DELETE) && !tx.isAborted()) {
            // t.getCellsToScrubImmediately() checks that t has been committed
            cleaner.scrubImmediately(this,
                    tx.getCellsToScrubImmediately(),
                    tx.getTimestamp(),
                    tx.getCommitTimestamp());
        }
    }
if (getTransactionType() == TransactionType.AGGRESSIVE_HARD_DELETE
                || getTransactionType() == TransactionType.HARD_DELETE) {
            cleaner.queueCellsForScrubbing(getCellsToQueueForScrubbing(), getStartTimestamp());
        }

So we wouldn't want to add all of those always, as not all products do hard delete.

Have a look at SweepOutcomeMetrics or LegacySweepMetrics.

Copy link
Contributor

@felixdesouza felixdesouza Nov 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add logic to this in MetricsManager/TaggedMetricRegistry or port it upstream to tritium. Some sort of proxy which wraps around the metric and only instantiates/registers it when it's first used. If any additional metrics have to do this themselves, it becomes unwieldy.

Copy link
Contributor

@jeremyk-91 jeremyk-91 Nov 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good call, as Atlas-level metrics cost $$ in many deployments and we're trying to keep that down.

Though as a heads up: this is confusingly named, but any services that use Cleanup Tasks (and thus Stream Stores) will invoke hard delete transactions even if no user-level hard deletes have taken place, and thus register the metrics. So the savings might be less than we expect as things like internal shopping or compute deployer products will still use them.

Copy link
Contributor

@jkozlowski jkozlowski Nov 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internal product creates the stream store, but does not use it in most deployments, so that wouldn't actually run any hard delete transactions? In which case we need the laziness here and also making sure the scrubber doesn't put, say, zeros for those metrics (and thus register them) when it had nothing to do.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@felixdesouza +1 for tracking something to make this easier, don't think it needs to block this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jkozlowski Internal product uses the stream store for storing media, most all deployments would be using it.

I switched the new scrub metrics over to being lazily registered, though better upstream metrics library support transparent to the user would be better and surely result in additional cost savings for other less-used instrumented classes.

Copy link
Contributor

@jkozlowski jkozlowski Nov 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clockfort haha which internal product? the one I'm talking about only actually uses the stream store when installed on-prem, and the cost of those metrics we do not care about. Changes look good from my perspective, @jeremyk-91 can review the actual scrubber code. +1 for upstreaming something like this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(@jkozlowski was probably referring to the fjord product, @clockfort the large internal product!)

Copy link
Contributor Author

@clockfort clockfort Nov 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep! yep

Copy link
Contributor

@jeremyk-91 jeremyk-91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified the marking of the meters is done at an appropriate time, and they aren't registered unless the relevant codepaths are hit.

@jeremyk-91 jeremyk-91 merged commit 970313d into develop Nov 20, 2019
@delete-merged-branch delete-merged-branch bot deleted the scrub_metrics branch November 20, 2019 18:39
@svc-autorelease
Copy link
Collaborator

Released 0.173.10

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants