Skip to content

Commit

Permalink
[fix][broker] Remove timestamp from broker metrics (#17419)
Browse files Browse the repository at this point in the history
### Motivation

When a Pulsar topic is unloaded from a broker, certain metrics related to that topic will appear to remain active for the broker for 5 minutes. This is confusing for troubleshooting because it makes the topic appear to be owned by multiple brokers for a short period of time. See below for a way to reproduce this behavior.

In order to solve this "zombie" metric problem, I propose we remove the timestamps that get exported with each Prometheus metric served by the broker.

### Analysis

Since we introduced Prometheus metrics in #294, we have exported a timestamp along with most metrics. This is an optional, valid part of the spec defined [here](https://prometheus.io/docs/instrumenting/exposition_formats/#comments-help-text-and-type-information). However, after our adoption of Prometheus metrics, the Prometheus project released version 2.0 with a significant improvement to its concept of staleness. In short, before 2.0, a metric that was in the last scrape but not the next one (this often happens for topics that are unloaded) will essentially inherit the most recent value for the last 5 minute window. If there isn't one in the past 5 minutes, the metric becomes "stale" and isn't reported. Starting in 2.0, there was new logic to consider a value stale the very first time that it is not reported in a scrape. Importantly, this new behavior is only available if you do not export timestamps with metrics, as documented here: https://prometheus.io/docs/prometheus/latest/querying/basics/#staleness. We want to use the new behavior because it gives better insight into all topic metrics, which are subject to move between brokers at any time.

This presentation https://www.youtube.com/watch?v=GcTzd2CLH7I and slide deck https://promcon.io/2017-munich/slides/staleness-in-prometheus-2-0.pdf document the feature in detail. This blog post was also helpful: https://www.robustperception.io/staleness-and-promql/.

Additional motivation comes from mailing list threads like this one https://groups.google.com/g/prometheus-users/c/8OFAwp1OEcY. It says:

> Note, however, that adding timestamps is an extremely niche use
case. Most of the users who think the need it should actually not do
it.
>
> The main usecases within that tiny niche are federation and mirroring
the data from another monitoring system.

The Prometheus Go client also indicates a similar motivation: https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#NewMetricWithTimestamp.

The OpenMetrics project also recommends against exporting timestamps: https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#exposing-timestamps.

As such, I think we are not a niche use case, and we should not add timestamps to our metrics.

### Reproducing the problem

1. Run any 2.x version of Prometheus (I used 2.31.0) along with the following scrape config:
```yaml
  - job_name: broker
    honor_timestamps: true
    scrape_interval: 30s
    scrape_timeout: 10s
    metrics_path: /metrics
    scheme: http
    follow_redirects: true
    static_configs:
      - targets: ["localhost:8080"]
```
2. Start pulsar standalone on the same machine. I used a recently compiled version of master.
3. Publish messages to a topic.
4. Observe `pulsar_in_messages_total` metric for the topic in the prometheus UI (localhost:9090)
5. Stop the producer.
6. Unload the topic from the broker.
7. Optionally, `curl` the metrics endpoint to verify that the topic’s `pulsar_in_messages_total` metric is no longer reported.
8. Watch the metrics get reported in prometheus for 5 additional minutes.

When you set `honor_timestamps: false`, the metric stops getting reported right after the topic is unloaded, which is the desired behavior.

### Modifications

* Remove all timestamps from metrics
* Fix affected tests and test files (some of those tests were in the proxy and the function worker, but no code was changed for those modules)

### Verifying this change

This change is accompanied by updated tests.

### Does this pull request potentially affect one of the following parts:

This is technically a breaking change to the metrics, though I would consider it a bug fix at this point. I will discuss it on the mailing list to ensure it gets proper visibility.

Given how frequently Pulsar changes which metrics are exposed between each scrape, I think this is an important fix that should be cherry picked to older release branches. Technically, we can avoid cherry picking this change if we advise users to set `honor_timestamps: false`. However, I think it is better to just remove them.

### Documentation
- [x] `doc-not-needed`
  • Loading branch information
michaeljmarshall authored Sep 7, 2022
1 parent 6c02186 commit 0bbc4e1
Show file tree
Hide file tree
Showing 6 changed files with 380 additions and 381 deletions.
Original file line number Diff line number Diff line change
Expand Up @@ -46,7 +46,7 @@ void writeSample(String metricName, Number value, String... labelsAndValuesArray
stream.write(',');
}
}
stream.write("} ").write(value).write(' ').write(System.currentTimeMillis()).write('\n');
stream.write("} ").write(value).write('\n');
}

/**
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -304,8 +304,7 @@ private static void parseMetricsToPrometheusMetrics(Collection<Metrics> metrics,
appendedQuantile = true;
}
}
stream.write("} ").write(String.valueOf(entry.getValue()))
.write(' ').write(System.currentTimeMillis()).write("\n");
stream.write("} ").write(String.valueOf(entry.getValue())).write("\n");
}
}
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -766,10 +766,10 @@ public void testPerConsumerStats() throws Exception {
This can happen when including topic metrics, since the same metric is reported multiple times with different labels. For example:
# TYPE pulsar_subscriptions_count gauge
pulsar_subscriptions_count{cluster="standalone"} 0 1556372982118
pulsar_subscriptions_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata"} 1.0 1556372982118
pulsar_subscriptions_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate"} 1.0 1556372982118
pulsar_subscriptions_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments"} 1.0 1556372982118
pulsar_subscriptions_count{cluster="standalone"} 0
pulsar_subscriptions_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/metadata"} 1.0
pulsar_subscriptions_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/coordinate"} 1.0
pulsar_subscriptions_count{cluster="standalone",namespace="public/functions",topic="persistent://public/functions/assignments"} 1.0
**/
// Running the test twice to make sure types are present when generated multiple times
Expand Down Expand Up @@ -1598,8 +1598,8 @@ public static Multimap<String, Metric> parseMetrics(String metrics) {
// jvm_threads_current{cluster="standalone",} 203.0
// or
// pulsar_subscriptions_count{cluster="standalone", namespace="public/default",
// topic="persistent://public/default/test-2"} 0.0 1517945780897
Pattern pattern = Pattern.compile("^(\\w+)\\{([^\\}]+)\\}\\s([+-]?[\\d\\w\\.-]+)(\\s(\\d+))?$");
// topic="persistent://public/default/test-2"} 0.0
Pattern pattern = Pattern.compile("^(\\w+)\\{([^\\}]+)\\}\\s([+-]?[\\d\\w\\.-]+)$");
Pattern tagsPattern = Pattern.compile("(\\w+)=\"([^\"]+)\"(,\\s?)?");

Splitter.on("\n").split(metrics).forEach(line -> {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -74,8 +74,8 @@ public static Map<String, Metric> parseMetrics(String metrics) {
// jvm_threads_current{cluster="standalone",} 203.0
// or
// pulsar_subscriptions_count{cluster="standalone", namespace="public/default",
// topic="persistent://public/default/test-2"} 0.0 1517945780897
Pattern pattern = Pattern.compile("^(\\w+)(\\{[^\\}]+\\})?\\s([+-]?[\\d\\w\\.-]+)(\\s(\\d+))?$");
// topic="persistent://public/default/test-2"} 0.0
Pattern pattern = Pattern.compile("^(\\w+)(\\{[^\\}]+\\})?\\s([+-]?[\\d\\w\\.-]+)$");
Pattern tagsPattern = Pattern.compile("(\\w+)=\"([^\"]+)\"(,\\s?)?");
Arrays.asList(metrics.split("\n")).forEach(line -> {
if (line.isEmpty() || line.startsWith("#")) {
Expand Down
Loading

0 comments on commit 0bbc4e1

Please sign in to comment.