Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: CountingGauge #862

Open
carl-mastrangelo opened this issue Dec 14, 2020 · 5 comments
Open

Feature Request: CountingGauge #862

carl-mastrangelo opened this issue Dec 14, 2020 · 5 comments
Assignees
Milestone

Comments

@carl-mastrangelo
Copy link

carl-mastrangelo commented Dec 14, 2020

For Gauges that are logically counters, Spectator doesn't have an API for expressing these. I would like to request either a) a new type that expresses this use, or b) extend the existing Gauge API to support it. Several alternatives were proposed to work around this, but they all have their own faults, which is why I believe the correct place to fix it is in Spectator itself.

A counter is a type where the value starts at 0, and goes up by integer amounts. A gauge is a type where the value starts at NaN, and can be set to any floating point amount. What I need is partly a counter, partly a gauge. The use case is keeping track of the number of active connections on a machine, which is an integral quantity, starts at 0, and can go up or down. On the surface, this seems trivial to implement with a PolledMeter. It gets much more complicated when tags are associated, and the Id is dynamic.

Alternatives Considered

Use PolledMeter and keep track of the counts myself.

This approach says to use my own AtomicLong to keep track of counts, and use PolledMeter to track the values. To implement this correctly, I would need to maintain my own ConcurrentMap of (Registry, Id) to AtomicLong, and increment them accordingly. The problem here is that Registries do not have a well defined equality , so it is risky to use this as a key value in the map. Registries do not claim they use instance identity, or that the hashCode is constant. This makes it unsuitable as a map key. Using just the the Id instead is almost workable, but it means keeping the registry around, rather than being able to pass it in as a param. Lastly, considering that Registry is already maintaining this map, it doesn't make sense to make users do it too.

PolledMeter has a second weakness in that it lazily checks the value of the meter. In a unit test, it is reasonable to get the meter from the registry, and check that value has been incremented. When using PolledMeter, the update is delayed, forcing the test to either: a) block at the end until it notices the counter has been updated b) implement a custom Executor for the test that updates values immediately. c) Skip checking the registry and look at the Id->AtomicLong map. All three are less than optimal and push burden onto the user.

Use Gauge API, and use set(value() + 1).

The idea here is to use the existing API, and call set() with the previous value plus the difference. This solution also works, but it forces additional burden on the caller. When creating a gauge for the first time, the initial value is NaN, which means it cannot be incremented. The initial value needs to be set to 0 if the previous value is NaN. Secondly, writing the value after reading it is racy, which means that synchronization needs to be involved to ensure that the gauge is updated safely. In a multithreaded program, including these locks increases the risk of blocking, and adds additional boilerplate to the code. The operation is logically a compareAndSwap (or getAndAdd), which means that using locks is overkill for the operation.

Implement a custom Gauge, rather than a whole new Type in Spectator.

A third way would be to implement the functionality I need, implementing the Gauge interface. The problem here is that registering my custom type with the registry is unnecessarily difficult. Registry provides a nice property that asking for the same Id will always return the same Meter. When creating my own type, this is no longer true, because there is no deduplication of the Ids to Meters. The user has to create a new gauge every time, attempt to call register(), and hope that the meter doesn't overwrite the previous one. (and from reading the code, it's not even clear this would work, since the meter state is the same, but the Meter itself is different.

An alternative to this would be managing the meter state directly from the registry using the state() function, but this again is unnecessarily verbose. It's a ton of boilerplate for what amounts to an getAndAdd.

Use DistributionSummary.

DistributionSummary (DS) is also proposed as an alternative, based on the claim that the values can be made lower. This has multiple issues as well. First, the value being measured, (e.g. active connections), is not a distribution. It wouldn't make sense to record this with a PercentileDistribution for instance. There is a fixed, known quantity at any given point in time, unlike a distribution. Second, DS only allows positive values, meaning that the meter cannot be decremented when a connection closes (see DefaultDistributionSummary.record). Instead, it is the responsibility of the caller to keep track of the actual counts, which is now the same as the polled meter case. Third, even if this was a workable solution, distribution adds tags to recorded meter called count. This name is confusing, because the meter name is already a count (like conn.active.count). The tag is not a count of the connections, but the count of how many times it has been called. This is needlessly confusing, and a user looking at the metrics in a UI would need to remember this peculiarity.

Proposed Alternative

Any non-negative integer gauge would be able to benefit from a proper CountingGauge type. Things like number of active requests, number of active threads, number of connections, number of elements in a queue. All of these would benefit from a type that can express their semantics. It would also be possible to add a compareAndSwap method to Gauge; the issue here is that the default value is difficult to set. (CAS(NaN, newValue) ?) Also, the existing gauge never claims what the default value is; this is something a new datatype would be able to express.

@carl-mastrangelo
Copy link
Author

A sample of the code that would use this: Netflix/zuul#955

@carl-mastrangelo
Copy link
Author

A followup to this: I had tried to work around this issuer by synchronizing on the gauge returned by spectator, but this doesn't work. The returned Gauge object is a wrapper, rather than the true object. This means locking on it provides no threadsafety. Any threadsafe actions need to be implement in Spectator itself.

@brharrington brharrington modified the milestones: 0.123.0, 0.124.0 Jan 12, 2021
@carl-mastrangelo
Copy link
Author

A followup ping on this. My work around for this has been using a synchronized block to enforce invariants about the gauge, but this breaks down when the gauge is a NoopGauge. It discards all writes, and always returns NaN, which breaks my checking. Since NoopGauge is package private, it is unnecessarily difficult to work around it in tests.

See: Netflix/zuul#989

@brharrington
Copy link
Contributor

It has been busy recently so haven't had much time to look at Spectator stuff. Reading through the original description here are my thoughts:

For Gauges that are logically counters

Spectator has a counter type that is used for tracking a rate of change and is different than what you are describing. To avoid overloading the term I'm going to refer to the notion you are describing as a tally gauge for now.

Use PolledMeter and keep track of the counts myself.

This would be the recommended way currently if you really wanted it to be a gauge. Not sure I follow the desire to keep the registry as a key, though I would agree it is not a good candidate for use a map key. It seems like you could just create a simple class something like:

import com.netflix.spectator.api.Id;
import com.netflix.spectator.api.Registry;

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public final class TallyGauges {

  private final Registry registry;
  private final ConcurrentHashMap<Id, AtomicLong> gauges;

  public TallyGauges(Registry registry) {
    this.registry = registry;
    this.gauges = new ConcurrentHashMap<>();
  }

  public AtomicLong get(Id id) {
    return gauges.computeIfAbsent(id, this::createAndRegister);
  }

  private AtomicLong createAndRegister(Id id) {
    return PolledMeter.using(registry)
        .withId(id)
        .monitorValue(new AtomicLong());
  }

  public void remove(Id id) {
    gauges.remove(id);
    PolledMeter.remove(registry, id);
  }

  public void increment(Id id) {
    get(id).incrementAndGet();
  }

  public void decrement(Id id) {
    get(id).decrementAndGet();
  }
}

Regarding testing, you can use PolledMeter.update(registry) to force an update in tests. Something like:

import com.netflix.spectator.api.DefaultRegistry;
import com.netflix.spectator.api.Id;
import com.netflix.spectator.api.Registry;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.BeforeEach;
import org.junit.jupiter.api.Test;

public class TallyGaugesSuite {

  private Registry registry;
  private TallyGauges gauges;

  @BeforeEach
  public void before() {
    registry = new DefaultRegistry();
    gauges = new TallyGauges(registry);
  }

  private void checkGaugeValue(Id id, double expected) {
    PolledMeter.update(registry); // force update of polled gauges
    Assertions.assertEquals(expected, registry.gauge(id).value(), 1e-12);
  }

  @Test
  public void incrementCreatesGauge() {
    Id id = registry.createId("test");
    checkGaugeValue(id, Double.NaN); // doesn't exist yet
    gauges.increment(id);
    checkGaugeValue(id, 1.0);
  }

  @Test
  public void incrementIncreasesTheValue() {
    Id id = registry.createId("test");
    gauges.increment(id);
    gauges.increment(id);
    gauges.increment(id);
    checkGaugeValue(id, 3.0);
  }

  @Test
  public void decrementReducesTheValue() {
    Id id = registry.createId("test");
    gauges.increment(id);
    gauges.increment(id);
    gauges.increment(id);

    gauges.decrement(id);
    checkGaugeValue(id, 2.0);
  }
}

This also gives you full flexibility to use whatever operations on AtomicLong or other concurrent types you want. Similarly you could put in restrictions that make sense for your usage like preventing it from being negative. However, you will need to manage the lifecycle carefully if there is churn on these values and ensure they get removed so it is not a memory leak in your application.

Use Gauge API, and use set(value() + 1).

I wouldn't recommend this approach and as far as I know we don't suggest it anywhere. Gauge is intended to be set to a known value.

Implement a custom Gauge, rather than a whole new Type in Spectator.

Generally we don't recommend implementing custom meters. Just use the basic types (counter, gauge, distribution summary, and timer) as building blocks for other patterns. PolledMeter is an example that we provide.

Use DistributionSummary.

For the use-cases like a busy queue or possibly active connections, this is often more useful than a gauge would be and I would disagree with your assertion that tracking them as distributions is inappropriate. If the measurement will vary frequently (many times per minute), then using a distribution summary allows you to see variance within the sampling interval. For example, lets say that within the step interval we update a distribution summary after inserting into a queue and it has the following updates:

1, 2, 3, 1, 2, 3, 4, 5, 6, 6, 7, 8, 9, 10, 5, 2

A gauge will get sampled and some value between 0 and 10 would be reported depending on when that happened and what the value was at that time. The distribution summary would allow you to see the number of inserts per second, average size of the queue during the step interval, and max size of the queue during the step interval. It does report a statistic=count, but in most cases users do not need to look at statistic. The UI will handle that for you when choosing an aggregation to use. Given distribution summaries are standard types recognized by the UI, I don't see that as a big source of confusion for users.

The main down sides I'm aware of for using distribution summaries are:

  1. They only reflect the activity within a given interval and reset on interval boundaries. In the queue example where frequent updates are expected that is fine, but if no updates were made for a while then the distribution summary would report 0 in intervals where no update was recorded. That doesn't mean the queue is empty, but that it hasn't been updated.
  2. It requires being able to tap into each update or change which isn't always possible. In some cases polling something to get the size at given instant is the only access that is available.

Proposed Alternative

We could potentially have some sort of TallyGauge pattern, but I'm not fully convinced it should be a core type. I'm pretty busy right now with other commitments so not sure when I'll have the time to devote to it. Hopefully something like the TallyGauges example above will work for now. In my opinion that is a better option than the gauge locking stuff you are doing right now.

@brharrington brharrington modified the milestones: 0.124.0, backlog Feb 3, 2021
@carl-mastrangelo
Copy link
Author

Thanks for the response. Replies inline:

This would be the recommended way currently if you really wanted it to be a gauge.

I feel like PolledMeter is used as the kitchen sink of gauges. It's used for anything that doesn't fit into the pre defined types. You are right that it is probably the best way now to accomplish what I want, but that isn't what this issue is about. (Phrased differently: PolledMeter is always the recommended way to do anything that encounters an API deficiency). Instead, I am asking about slicing out a new pattern from the PolledMeter catchall.

Regarding testing, you can use PolledMeter.update(registry) to force an update in tests. Something like:

Thanks for this, I didn't realize this method existed.

Generally we don't recommend implementing custom meters.

This is not at all obvious from the Javadoc. I think you should call this out explicitly, and in the place most readily visible to users.

It seems like you could just create a simple class something like:

I called out why this is not possible. There may be multiple instances of TallyGauges, which would confuse the counts in the map.

I would disagree with your assertion that tracking them as distributions is inappropriate. If the measurement will vary frequently (many times per minute), then using a distribution summary allows you to see variance within the sampling interval.

Normally, but in my specific case, we already have a counter for connections, independent of what is being asked here. Distribution summaries add no extra value, because it was already being tracked via another means. The thing I am trying to measure is a "steady state" like value. The variance is already discernible via the connection establishment and teardown counters.

We could potentially have some sort of TallyGauge pattern, but I'm not fully convinced it should be a core type.

From reading the response, I feel like the issue wasn't really addressed. In a short, one liner, the request is: "there needs to be a meter which can be incremented and decremented, not just set." Any metric, which is logically an absolute number, but which may be concurrently modified, and is seldomly read, would be able to fit into this pattern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants