Metrics aggregation for same metric with 2 or more nodes #433

Closed
ecourreges-orange opened this issue Nov 25, 2019 · 9 comments · Fixed by #2393
@ecourreges-orange

Bug or feature?
The demo in example/demo runs fine.
I launched a second metrics-load-generator, hoping it would double the traffic and the metrics on http://localhost:8889/metrics. Instead the metrics reset, and the first metrics-load-generator's metrics stop being taken into account, as if the last node to arrive wins a metric.
Then I thought, "OK, it looks like the node name needs to be unique", so I used a main2.go with a different service name and the same metrics, running concurrently, and got the same "last one launched wins" result: the Prometheus graph and /metrics reset to 0 when the new metrics-load-generator is launched.

Is this by design? If so, why?

I am trying to find the best way to expose Prometheus metrics from a C++ Apache module. Since Apache is multi-process, I was hoping the OpenTelemetry agent would let me aggregate the metrics of the different processes. I saw that PHP has an intermediate unix-socket aggregator, but I don't think an equivalent exporter exists for C++.
The opencensus-cpp metrics exporter for ocagent doesn't exist either, but since the trace one exists, it's probably simple enough to code.

Thank you for your time.
Regards.

@ecourreges-orange
Author

I guess I should have looked at this issue; it seems to be by design:
census-instrumentation/opencensus-service#542

So I set the client label of my main2.go to "cli2", and now I get alternating metrics for one and the other:

http://localhost:8889/metrics

# HELP promexample_opdemo_latency The various latencies of the methods
# TYPE promexample_opdemo_latency histogram
promexample_opdemo_latency_bucket{client="cli",label1="value1",method="repl",le="10"} 5
promexample_opdemo_latency_bucket{client="cli",label1="value1",method="repl",le="50"} 23
promexample_opdemo_latency_bucket{client="cli",label1="value1",method="repl",le="100"} 50

Scraping http://localhost:8889/metrics again:

# HELP promexample_opdemo_latency The various latencies of the methods
# TYPE promexample_opdemo_latency histogram
promexample_opdemo_latency_bucket{client="cli2",label1="value1",method="repl",le="10"} 17
promexample_opdemo_latency_bucket{client="cli2",label1="value1",method="repl",le="50"} 116
promexample_opdemo_latency_bucket{client="cli2",label1="value1",method="repl",le="100"} 190
promexample_opdemo_latency_bucket{client="cli2",label1="value1",method="repl",le="200"} 193

So this is a bug, right?

@firefixmaarten

Hi, it seems we also run into this issue. Both values are kept, but they do not seem to be retained/exposed correctly in the collector.

@firefixmaarten

Note that in my case it is even the same application; it just uses different labels, yet at all times only one value appears in the collector output.

@jmacd
Contributor

jmacd commented Jul 1, 2020

I understand that the OTel collector fills a role where you can imagine it aggregating host metrics together, but that requires a substantial redesign: it would turn the collector from a manager of pipelines into a manager of transient state. There is a proposal to add metrics transformation (open-telemetry/opentelemetry-collector-contrib#332), but it would not fill this function, because managing state is hard.

I think it should be harder to run into the kind of trouble you're seeing, where metrics are overwritten because they appear to be identical. It suggests that either (1) there are no host-specific resources, or (2) the Prometheus exporter is ignoring resources.

In the OTel-Go export pipeline, I handled this problem by merging instead of overwriting, but in that situation we have state for every metric over an interval of time, which the collector does not. (There, synchronous measurements are combined by merging and asynchronous measurements overwrite, as the instrument semantics prescribe: open-telemetry/opentelemetry-go#840.)
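
To make that merge-versus-overwrite distinction concrete, here is a hypothetical sketch (plain Go, not the collector's or the SDK's actual code) of the two strategies: synchronous, counter-style records accumulate into the existing value, while asynchronous, observer-style records simply replace it.

package main

import "fmt"

// aggregate is keyed by label set; values are the exported number.
type aggregate map[string]float64

// mergeSync models synchronous (counter-style) semantics: new contributions
// with the same label set are added into the existing value.
func mergeSync(state aggregate, labels string, value float64) {
	state[labels] += value
}

// mergeAsync models asynchronous (observer-style) semantics: the latest
// observation for a label set replaces the previous one.
func mergeAsync(state aggregate, labels string, value float64) {
	state[labels] = value
}

func main() {
	sync := aggregate{}
	mergeSync(sync, `client="cli"`, 5)
	mergeSync(sync, `client="cli"`, 3)
	fmt.Println(sync) // map[client="cli":8] (contributions merge)

	async := aggregate{}
	mergeAsync(async, `client="cli"`, 5)
	mergeAsync(async, `client="cli"`, 3)
	fmt.Println(async) // map[client="cli":3] (last observation wins)
}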

I would like to propose that every OTel client automatically generate a resource label that distinguishes itself from similar processes. Then the collector will work as intended. Maybe there is already a suitable semantic convention, and we should specify that it MUST be included in the resource.
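
As a rough illustration of that proposal (the exact API differs across SDK versions, and the github.com/google/uuid dependency here is just an assumption for generating the id): with a recent otel-go SDK, a process could attach a unique service.instance.id resource attribute, which is the semantic convention that plays this role, so two otherwise identical processes no longer produce colliding metric streams.

package main

import (
	"context"
	"log"

	"github.com/google/uuid"
	"go.opentelemetry.io/otel/attribute"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
)

func main() {
	// Resource attributes that make this process distinct from otherwise
	// identical processes: a fixed service.name plus a per-process instance id.
	res, err := resource.Merge(
		resource.Default(),
		resource.NewSchemaless(
			attribute.String("service.name", "metrics-load-generator"),
			attribute.String("service.instance.id", uuid.NewString()),
		),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Any reader/exporter configured on this provider reports the instance id
	// alongside every metric, so identical counters from two copies of the
	// same binary stay distinguishable downstream.
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithResource(res))
	defer provider.Shutdown(context.Background())

	meter := provider.Meter("example/demo")
	_ = meter // create instruments and record as usual
}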

@firefixmaarten

firefixmaarten commented Jul 2, 2020

I am currently observing this with the go.opentelemetry.io/otel 0.7.0 package, using:


package main

import (
	"context"
	"fmt"
	"time"

	"github.com/firefixmaarten/test-otel/metrics"
	"go.opentelemetry.io/otel/api/global"
	"go.opentelemetry.io/otel/api/kv"
	"go.opentelemetry.io/otel/api/metric"
	"go.opentelemetry.io/otel/exporters/otlp"
	"go.opentelemetry.io/otel/sdk/metric/controller/push"
	"go.opentelemetry.io/otel/sdk/metric/selector/simple"
)

func installNewPipeline(options ...push.Option) (*push.Controller, error) {
	controller, err := newExportPipeline(options...)
	if err != nil {
		return controller, err
	}
	global.SetMeterProvider(controller.Provider())
	return controller, err
}

func newExportPipeline(options ...push.Option) (*push.Controller, error) {
	exp, err := otlp.NewExporter(
		otlp.WithInsecure(),
		otlp.WithAddress("127.0.0.1:55681"),
	)
	if err != nil {
		return nil, err
	}

	pusher := push.New(
		simple.NewWithExactDistribution(),
		exp,
		append([]push.Option{push.WithPeriod(time.Duration(3) * time.Second)}, options...)...,
	)
	pusher.Start()

	return pusher, nil
}
func main() {
	pipeline, err := installNewPipeline()
	if err != nil {
		fmt.Printf("%s", err)
	}
	defer pipeline.Stop()

	metricCreatedBagsTotal := metric.Must(global.Meter("esoptra/pluglet-golang-zaza")).NewFloat64Counter("created_bags_total", metric.WithDescription("The number of bags created per country"))
	country := "unknown"
	labels := []kv.KeyValue{}
	labels = append(labels, metrics.DefaultLabels...)
	labels = append(labels, kv.String("country", country))
	metricCreatedBagsTotal.Add(context.Background(), 1, labels...)

	time.Sleep(time.Second * time.Duration(3))

	metricCreatedBagsTotal = metric.Must(global.Meter("esoptra/pluglet-golang-zaza")).NewFloat64Counter("created_bags_total", metric.WithDescription("The number of bags created per country"))
	country = "unknown2"
	labels = []kv.KeyValue{}
	labels = append(labels, metrics.DefaultLabels...)
	labels = append(labels, kv.String("country", country))
	metricCreatedBagsTotal.Add(context.Background(), 3, labels...)

	time.Sleep(time.Second * time.Duration(3))

	metricCreatedBagsTotal = metric.Must(global.Meter("esoptra/pluglet-golang-zaza")).NewFloat64Counter("created_bags_total", metric.WithDescription("The number of bags created per country"))
	country = "unknown2"
	labels = []kv.KeyValue{}
	labels = append(labels, metrics.DefaultLabels...)
	labels = append(labels, kv.String("country", country))
	metricCreatedBagsTotal.Add(context.Background(), 2, labels...)

	time.Sleep(time.Second * time.Duration(3))

	metricCreatedBagsTotal = metric.Must(global.Meter("esoptra/pluglet-golang-zaza")).NewFloat64Counter("created_bags_total", metric.WithDescription("The number of bags created per country"))
	country = "unknown2"
	labels = []kv.KeyValue{}
	labels = append(labels, metrics.DefaultLabels...)
	labels = append(labels, kv.String("country", country))
	metricCreatedBagsTotal.Add(context.Background(), 2, labels...)

	time.Sleep(time.Second * time.Duration(60))

}

The result is that the new metric values are sent only once. If I set it to sleep 0 seconds between them, it seems to push the correct accumulated values and both metrics (with different label sets):

# TYPE created_bags_total gauge
created_bags_total{application="zaza",cluster="local",country="unknown",minion_ip="127.0.0.1",pluglet_name="test-otel",pluglet_share="1486468616",pluglet_tenant=""} 1
created_bags_total{application="zaza",cluster="local",country="unknown2",minion_ip="127.0.0.1",pluglet_name="test-otel",pluglet_share="1486468616",pluglet_tenant=""} 7

It seems only the last metric sent shows up in the Prometheus exporter.
Shouldn't the Prometheus exporter in the collector retain the previous metrics that came in?
Prometheus expects metrics not to disappear, so the Prometheus exporter should not clear its cache on every incoming metric.
Or would I need to tell the client somehow to always send ALL metrics (which seems overkill)?
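
For what it's worth, the "always send ALL metrics" model can be sketched in a few lines (a hypothetical toy, not any real SDK or collector code): if the client keeps cumulative totals per label set and re-exports its whole state on every interval, then even a stateless exporter on the receiving side sees every series every time.

package main

import "fmt"

// series holds the cumulative total for one label set.
type series struct {
	labels string
	total  float64
}

// client is a hypothetical metrics client that keeps cumulative state.
type client struct {
	state map[string]*series
}

func (c *client) add(labels string, v float64) {
	s, ok := c.state[labels]
	if !ok {
		s = &series{labels: labels}
		c.state[labels] = s
	}
	s.total += v
}

// exportAll sends the full cumulative state every interval, not only the
// series that changed since the last push.
func (c *client) exportAll() {
	for _, s := range c.state {
		fmt.Printf("created_bags_total{%s} %g\n", s.labels, s.total)
	}
}

func main() {
	c := &client{state: map[string]*series{}}

	c.add(`country="unknown"`, 1)
	c.exportAll() // interval 1: only country="unknown"

	c.add(`country="unknown2"`, 3)
	c.add(`country="unknown2"`, 2)
	c.add(`country="unknown2"`, 2)
	c.exportAll() // interval 2: both series, cumulative totals (1 and 7)
}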

@jmacd
Contributor

jmacd commented Jul 2, 2020

@firefixmaarten This sounds like an OTel-Go issue already under investigation: open-telemetry/opentelemetry-go#875

Note that part of the problem being investigated there is a collector issue: #1255

@firefixmaarten

That does indeed seem to be the case; I will follow up on those issues. Thanks.

@tigrannajaryan
Member

@bogdandrutu assigning this to you for triaging.

@bogdandrutu bogdandrutu added this to the Backlog milestone Aug 4, 2020
@ambition-consulting

> I understand that the OTel collector fills a role where you can imagine it aggregating host metrics together, but that requires a substantial redesign: it would turn the collector from a manager of pipelines into a manager of transient state. [...]

Wouldn't it be nice if the collector could also receive metrics from browser clients, e.g. Angular applications, and merge them across all clients? That way we could better understand what's going on (or going wrong) for our users.

I understand this would require a shift in thinking, but for client/server architectures it would be worth the work.
