
Explicit sign of dropped spans #5557

Open
amanakin opened this issue Jun 27, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@amanakin
Contributor

amanakin commented Jun 27, 2024

Currently, the only way to tell that the processor dropped spans is a Debug log:
https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/trace/batch_span_processor.go#L276
That level choice is itself questionable; for comparison, the count of dropped log records is written at Warn level:
https://github.com/open-telemetry/opentelemetry-go/blob/main/sdk/log/batch.go#L154

Proposed Solution

In any case, I would like to be able to observe the loss of spans/logs not only in logs but also, for example, in metrics.
We could export the global number of dropped spans/logs. That way, users could use this number for their alert triggers.

Alternatives

Should logs be used for such alerts instead?
At the very least, we could write the number of dropped spans at Warn level.

@amanakin amanakin added the enhancement New feature or request label Jun 27, 2024
@dmathieu
Member

dmathieu commented Jul 1, 2024

Why logs? This feels like it should be a metric.

@amanakin
Contributor Author

amanakin commented Jul 1, 2024

@dmathieu
Yes, I meant that maybe my understanding is wrong (that a metric of dropped spans/logs is the right approach) and that I should use logs for this instead.

@OrangeFlag

OrangeFlag commented Aug 30, 2024

@dmathieu Could you give a good example of how to do this?
I would like to take this issue on.

P.S.
Would something like this be fine?

var meter = otel.Meter("go.opentelemetry.io/otel/sdk/trace")

func NewBatchSpanProcessor(exporter SpanExporter, options ...BatchSpanProcessorOption) SpanProcessor {
	//...
	dropCounter, err := meter.Int64Counter(
		"drop.counter",
		metric.WithDescription("Number of dropped spans."),
	)
	if err != nil {
		panic(err) // what to do with error? And what error could be here?
	}
	bsp := &batchSpanProcessor{
		//...,
		dropCounter,
	}
	//...
}

func (bsp *batchSpanProcessor) enqueueDrop(ctx context.Context, sd ReadOnlySpan) bool {
	if !sd.SpanContext().IsSampled() {
		return false
	}

	select {
	case bsp.queue <- sd:
		return true
	default:
		atomic.AddUint32(&bsp.dropped, 1)
		bsp.dropCounter.Add(ctx, 1)
	}
	return false
}

@amanakin
Contributor Author

@OrangeFlag FYI

I asked this question in the otel-go Slack, and these changes should follow the semantic conventions for internal SDK metrics, because this is a fairly significant change. For example, this is an attempt to add them.

@dmathieu
Member

dmathieu commented Sep 2, 2024

I agree, the big thing here is to have a semantic convention to emit the metric under. Setting up a meter provider and emitting a metric is trivial and is already done in multiple other places, for example in many contrib instrumentations.

@OrangeFlag

That sounds like a long process; how could we approach this in the meantime?

Perhaps it is worth adding a public method for getting the number of dropped spans? Then users would at least be able to build their own metrics while the conventions are being negotiated.
It would not be added to the interface, only implemented on batchSpanProcessor, for example.
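A sketch of what that hypothetical accessor could look like. `DroppedSpans` is not part of the real SDK, and the struct is reduced here to the one relevant field; the real batchSpanProcessor already maintains an atomic `dropped` counter, which the method would simply expose read-only:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// batchSpanProcessor is reduced to the field relevant to this proposal.
type batchSpanProcessor struct {
	dropped uint32
}

// drop records one dropped span, mirroring the SDK's existing atomic add.
func (bsp *batchSpanProcessor) drop() {
	atomic.AddUint32(&bsp.dropped, 1)
}

// DroppedSpans is the proposed (hypothetical) accessor: it returns the
// number of spans dropped by this processor so far.
func (bsp *batchSpanProcessor) DroppedSpans() uint32 {
	return atomic.LoadUint32(&bsp.dropped)
}

func main() {
	bsp := &batchSpanProcessor{}
	bsp.drop()
	bsp.drop()
	fmt.Println(bsp.DroppedSpans()) // prints 2
}
```

Users could then poll this value from their own observable instrument without the SDK committing to a metric name.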

@dmathieu
Member

dmathieu commented Sep 3, 2024

I don't know if exposing the number of dropped spans is a good idea. That exposes an implementation detail through the struct.
There is a somewhat hacky way to retrieve the number of dropped spans that wouldn't require a change to a public, stable struct.
We emit a log event whenever a span is dropped:

global.Debug("exporting spans", "count", len(bsp.batch), "total_dropped", atomic.LoadUint32(&bsp.dropped))

So anything listening to the logs can emit a metric whenever this line is logged.
That could be an OTel Collector, Fluent Bit, or any other log processor, but a custom implementation of logr.Logger could also do it in-process.
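A minimal in-process sketch of that last option. Assumptions: the logr interface is reduced here to a single `Info` method so the example stays dependency-free, and the key/value shape matches the `global.Debug` line quoted above; a real implementation would satisfy `logr.LogSink` and feed an actual metric instrument instead of a field:

```go
package main

import (
	"fmt"
	"strings"
)

// dropWatcher intercepts the SDK's debug log line and extracts the
// running total of dropped spans from its key/value pairs.
type dropWatcher struct {
	lastDropped uint32
}

// Info mimics the logr.Logger.Info shape: a message plus alternating keys
// and values. Only the "exporting spans" message is inspected.
func (w *dropWatcher) Info(msg string, kv ...any) {
	if !strings.Contains(msg, "exporting spans") {
		return
	}
	for i := 0; i+1 < len(kv); i += 2 {
		if kv[i] == "total_dropped" {
			if n, ok := kv[i+1].(uint32); ok {
				// A real implementation would record n to a gauge/counter here.
				w.lastDropped = n
			}
		}
	}
}

func main() {
	w := &dropWatcher{}
	// Simulate the SDK emitting its debug line.
	w.Info("exporting spans", "count", 512, "total_dropped", uint32(7))
	fmt.Println(w.lastDropped) // prints 7
}
```

The watcher would be installed via `otel.SetLogger` with a verbosity that includes debug output, which is the main drawback OrangeFlag raises below.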

@OrangeFlag

In production, we can't enable debug logs for all our applications, which makes it nearly impossible to tell when and which applications are silently dropping spans, especially since those applications are developed by other teams.
