Metrics aggregation for same metric with 2 or more nodes #433

Closed
ecourreges-orange opened this issue Nov 25, 2019 · 9 comments · Fixed by #2393
@ecourreges-orange

Bug or feature?
The demo in example/demo runs fine.
I launched a second metrics-load-generator, hoping it would double the traffic and the metrics on http://localhost:8889/metrics. Instead the metrics reset, and the first metrics-load-generator's metrics stop being taken into account, as if the last node to arrive wins a metric.
Then I thought, "OK, it looks like the node name needs to be unique", so I used a main2.go with a different service name and the same metrics, running concurrently, and got the same "last one launched wins" result: the Prometheus graph and /metrics reset to 0 when the new metrics-load-generator is launched.

Is this by design? If so, why?

I am trying to find the best way to expose Prometheus metrics from a C++ Apache module. Since Apache is multi-process, I was hoping the OpenTelemetry agent would let me aggregate the metrics of the different processes. I saw that PHP has an intermediate unix-socket aggregator, but I don't think an equivalent exporter exists for C++.
The opencensus-cpp metrics exporter for ocagent doesn't exist either, but since the trace one exists, it's probably simple enough to code.

Thank you for your time.
Regards.

@ecourreges-orange
Author

I guess I should have looked at this issue; it seems to be by design:
census-instrumentation/opencensus-service#542

So I set the client label of my main2.go to "cli2", and now I get alternating metrics for one and the other:

http://localhost:8889/metrics

# HELP promexample_opdemo_latency The various latencies of the methods
# TYPE promexample_opdemo_latency histogram
promexample_opdemo_latency_bucket{client="cli",label1="value1",method="repl",le="10"} 5
promexample_opdemo_latency_bucket{client="cli",label1="value1",method="repl",le="50"} 23
promexample_opdemo_latency_bucket{client="cli",label1="value1",method="repl",le="100"} 50

Scraping http://localhost:8889/metrics again:

# HELP promexample_opdemo_latency The various latencies of the methods
# TYPE promexample_opdemo_latency histogram
promexample_opdemo_latency_bucket{client="cli2",label1="value1",method="repl",le="10"} 17
promexample_opdemo_latency_bucket{client="cli2",label1="value1",method="repl",le="50"} 116
promexample_opdemo_latency_bucket{client="cli2",label1="value1",method="repl",le="100"} 190
promexample_opdemo_latency_bucket{client="cli2",label1="value1",method="repl",le="200"} 193

So this is a bug, right?

@firefixmaarten

Hi, it seems we also run into this issue. Both values are kept, but they do not seem to be retained/exposed correctly in the collector.

@firefixmaarten

Note that in my case it is even the same application; it just uses different labels, yet at all times only one value appears in the collector output.

@jmacd
Contributor

jmacd commented Jul 1, 2020

I understand that the OTel collector fills a role where you can imagine it aggregating host metrics together, but that requires a substantial redesign: it would turn the collector from a manager of pipelines into a manager of transient state. There is a proposal to add metrics transformation (open-telemetry/opentelemetry-collector-contrib#332), but it would not fill this function, because managing state is hard.

I think it should be harder to run into the kind of trouble you're seeing, where metrics are overwritten because they appear to be identical. It suggests that either (1) there are no host-specific resources, or (2) the Prometheus exporter is ignoring resources.

In the OTel-Go export pipeline, I handled this problem by merging instead of overwriting, but in that situation we have state for every metric over an interval of time, which the collector does not. (There, synchronous measurements are combined by merging and asynchronous measurements overwrite, as the instrument semantics prescribe: open-telemetry/opentelemetry-go#840.)
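
To make that merge-versus-overwrite distinction concrete, here is a hypothetical sketch (plain Go, not the collector's or the SDK's actual code) of the two strategies: synchronous, counter-style records accumulate into the existing value, while asynchronous, observer-style records simply replace it.

package main

import "fmt"

// aggregate is keyed by label set; values are the exported number.
type aggregate map[string]float64

// mergeSync models synchronous (counter-style) semantics: new contributions
// with the same label set are added into the existing value.
func mergeSync(state aggregate, labels string, value float64) {
	state[labels] += value
}

// mergeAsync models asynchronous (observer-style) semantics: the latest
// observation for a label set replaces the previous one.
func mergeAsync(state aggregate, labels string, value float64) {
	state[labels] = value
}

func main() {
	sync := aggregate{}
	mergeSync(sync, `client="cli"`, 5)
	mergeSync(sync, `client="cli"`, 3)
	fmt.Println(sync) // map[client="cli":8] (contributions merge)

	async := aggregate{}
	mergeAsync(async, `client="cli"`, 5)
	mergeAsync(async, `client="cli"`, 3)
	fmt.Println(async) // map[client="cli":3] (last observation wins)
}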

I would like to propose that every OTel client automatically generate a resource label that distinguishes itself from similar processes. Then the collector will work as intended. Maybe there is already a suitable semantic convention, and we should specify that it MUST be included in the resource.
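
As a rough illustration of that proposal (the exact API differs across SDK versions, and the github.com/google/uuid dependency here is just an assumption for generating the id): with a recent otel-go SDK, a process could attach a unique service.instance.id resource attribute, which is the semantic convention that plays this role, so two otherwise identical processes no longer produce colliding metric streams.

package main

import (
	"context"
	"log"

	"github.com/google/uuid"
	"go.opentelemetry.io/otel/attribute"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
	"go.opentelemetry.io/otel/sdk/resource"
)

func main() {
	// Resource attributes that make this process distinct from otherwise
	// identical processes: a fixed service.name plus a per-process instance id.
	res, err := resource.Merge(
		resource.Default(),
		resource.NewSchemaless(
			attribute.String("service.name", "metrics-load-generator"),
			attribute.String("service.instance.id", uuid.NewString()),
		),
	)
	if err != nil {
		log.Fatal(err)
	}

	// Any reader/exporter configured on this provider reports the instance id
	// alongside every metric, so identical counters from two copies of the
	// same binary stay distinguishable downstream.
	provider := sdkmetric.NewMeterProvider(sdkmetric.WithResource(res))
	defer provider.Shutdown(context.Background())

	meter := provider.Meter("example/demo")
	_ = meter // create instruments and record as usual
}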

@firefixmaarten

firefixmaarten commented Jul 2, 2020

I am currently observing this with the go.opentelemetry.io/otel 0.7.0 package, using:


package main

import (
	"context"
	"fmt"
	"time"

	"github.com/firefixmaarten/test-otel/metrics"
	"go.opentelemetry.io/otel/api/global"
	"go.opentelemetry.io/otel/api/kv"
	"go.opentelemetry.io/otel/api/metric"
	"go.opentelemetry.io/otel/exporters/otlp"
	"go.opentelemetry.io/otel/sdk/metric/controller/push"
	"go.opentelemetry.io/otel/sdk/metric/selector/simple"
)

func installNewPipeline(options ...push.Option) (*push.Controller, error) {
	controller, err := newExportPipeline(options...)
	if err != nil {
		return controller, err
	}
	global.SetMeterProvider(controller.Provider())
	return controller, err
}

func newExportPipeline(options ...push.Option) (*push.Controller, error) {
	exp, err := otlp.NewExporter(
		otlp.WithInsecure(),
		otlp.WithAddress("127.0.0.1:55681"),
	)
	if err != nil {
		return nil, err
	}

	pusher := push.New(
		simple.NewWithExactDistribution(),
		exp,
		append([]push.Option{push.WithPeriod(time.Duration(3) * time.Second)}, options...)...,
	)
	pusher.Start()

	return pusher, nil
}
func main() {
	pipeline, err := installNewPipeline()
	if err != nil {
		fmt.Printf("%s", err)
	}
	defer pipeline.Stop()

	metricCreatedBagsTotal := metric.Must(global.Meter("esoptra/pluglet-golang-zaza")).NewFloat64Counter("created_bags_total", metric.WithDescription("The number of bags created per country"))
	country := "unknown"
	labels := []kv.KeyValue{}
	labels = append(labels, metrics.DefaultLabels...)
	labels = append(labels, kv.String("country", country))
	metricCreatedBagsTotal.Add(context.Background(), 1, labels...)

	time.Sleep(time.Second * time.Duration(3))

	metricCreatedBagsTotal = metric.Must(global.Meter("esoptra/pluglet-golang-zaza")).NewFloat64Counter("created_bags_total", metric.WithDescription("The number of bags created per country"))
	country = "unknown2"
	labels = []kv.KeyValue{}
	labels = append(labels, metrics.DefaultLabels...)
	labels = append(labels, kv.String("country", country))
	metricCreatedBagsTotal.Add(context.Background(), 3, labels...)

	time.Sleep(time.Second * time.Duration(3))

	metricCreatedBagsTotal = metric.Must(global.Meter("esoptra/pluglet-golang-zaza")).NewFloat64Counter("created_bags_total", metric.WithDescription("The number of bags created per country"))
	country = "unknown2"
	labels = []kv.KeyValue{}
	labels = append(labels, metrics.DefaultLabels...)
	labels = append(labels, kv.String("country", country))
	metricCreatedBagsTotal.Add(context.Background(), 2, labels...)

	time.Sleep(time.Second * time.Duration(3))

	metricCreatedBagsTotal = metric.Must(global.Meter("esoptra/pluglet-golang-zaza")).NewFloat64Counter("created_bags_total", metric.WithDescription("The number of bags created per country"))
	country = "unknown2"
	labels = []kv.KeyValue{}
	labels = append(labels, metrics.DefaultLabels...)
	labels = append(labels, kv.String("country", country))
	metricCreatedBagsTotal.Add(context.Background(), 2, labels...)

	time.Sleep(time.Second * time.Duration(60))

}

The result is that the new metric values are sent only once. If I set it to sleep 0 seconds between them, it seems to push the correct accumulated values and both metrics (with different label sets):

# TYPE created_bags_total gauge
created_bags_total{application="zaza",cluster="local",country="unknown",minion_ip="127.0.0.1",pluglet_name="test-otel",pluglet_share="1486468616",pluglet_tenant=""} 1
created_bags_total{application="zaza",cluster="local",country="unknown2",minion_ip="127.0.0.1",pluglet_name="test-otel",pluglet_share="1486468616",pluglet_tenant=""} 7

It seems only the last metric sent shows up in the Prometheus exporter.
Shouldn't the Prometheus exporter in the collector retain the previous metrics that came in?
Prometheus expects metrics not to disappear, so the Prometheus exporter should not clear its cache on every incoming metric.
Or would I need to tell the client somehow to always send ALL metrics (which seems overkill)?
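
For what it's worth, the "always send ALL metrics" model can be sketched in a few lines (a hypothetical toy, not any real SDK or collector code): if the client keeps cumulative totals per label set and re-exports its whole state on every interval, then even a stateless exporter on the receiving side sees every series every time.

package main

import "fmt"

// series holds the cumulative total for one label set.
type series struct {
	labels string
	total  float64
}

// client is a hypothetical metrics client that keeps cumulative state.
type client struct {
	state map[string]*series
}

func (c *client) add(labels string, v float64) {
	s, ok := c.state[labels]
	if !ok {
		s = &series{labels: labels}
		c.state[labels] = s
	}
	s.total += v
}

// exportAll sends the full cumulative state every interval, not only the
// series that changed since the last push.
func (c *client) exportAll() {
	for _, s := range c.state {
		fmt.Printf("created_bags_total{%s} %g\n", s.labels, s.total)
	}
}

func main() {
	c := &client{state: map[string]*series{}}

	c.add(`country="unknown"`, 1)
	c.exportAll() // interval 1: only country="unknown"

	c.add(`country="unknown2"`, 3)
	c.add(`country="unknown2"`, 2)
	c.add(`country="unknown2"`, 2)
	c.exportAll() // interval 2: both series, cumulative totals (1 and 7)
}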

@jmacd
Contributor

jmacd commented Jul 2, 2020

@firefixmaarten This sounds like an OTel-Go issue already under investigation: open-telemetry/opentelemetry-go#875

Note that part of the problem being investigated there is a collector issue: #1255

@firefixmaarten

That does indeed seem to be the case; I will follow up on those issues. Thanks.

@tigrannajaryan
Member

@bogdandrutu assigning this to you for triaging.

@bogdandrutu bogdandrutu added this to the Backlog milestone Aug 4, 2020
@ambition-consulting

> I understand that the OTel collector fills a role where you can imagine it aggregating host metrics together, but that requires a substantial redesign: it would turn the collector from a manager of pipelines into a manager of transient state. [...]

Wouldn't it be nice if the collector could also receive metrics from browser clients, e.g. Angular applications, and merge them across all clients? That way we could better understand what's going on (or going wrong) for our users.

I understand this would require a shift in thinking, but for client/server architectures it would be worth the work.
