Supportability: make the queued_retry observable #2434

Closed
jpkrohling opened this issue Feb 2, 2021 · 6 comments

Labels: area:exporter, help wanted (Good issue for contributors to OpenTelemetry Service to pick up), priority:p3 (Lowest), release:allowed-for-ga, Stale

Comments
@jpkrohling (Member)

As part of #2430, it became clear that the queued_retry helper, used by default by core exporters such as Jaeger and OTLP, isn't easily observable. When a Jaeger backend becomes unavailable, the following is seen:

2021-02-02T13:40:44.298+0100	INFO	jaegerexporter/exporter.go:190	State of the connection with the Jaeger Collector backend	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "state": "TRANSIENT_FAILURE"}
2021-02-02T13:40:49.424+0100	INFO	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "error": "failed to push trace data via Jaeger exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:14250: connect: connection refused\"", "interval": "5.52330144s"}
2021-02-02T13:41:05.753+0100	INFO	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "error": "failed to push trace data via Jaeger exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:14250: connect: connection refused\"", "interval": "13.101300599s"}
2021-02-02T13:41:18.856+0100	INFO	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "error": "failed to push trace data via Jaeger exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:14250: connect: connection refused\"", "interval": "15.823926909s"}
2021-02-02T13:41:34.681+0100	INFO	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "error": "failed to push trace data via Jaeger exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:14250: connect: connection refused\"", "interval": "23.404886645s"}
2021-02-02T13:41:58.087+0100	INFO	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "error": "failed to push trace data via Jaeger exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:14250: connect: connection refused\"", "interval": "35.604692186s"}
2021-02-02T13:42:33.693+0100	INFO	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "error": "failed to push trace data via Jaeger exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:14250: connect: connection refused\"", "interval": "16.969110576s"}
2021-02-02T13:42:50.664+0100	INFO	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "error": "failed to push trace data via Jaeger exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:14250: connect: connection refused\"", "interval": "19.695577642s"}
2021-02-02T13:43:10.361+0100	INFO	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "error": "failed to push trace data via Jaeger exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:14250: connect: connection refused\"", "interval": "17.909085567s"}
2021-02-02T13:43:28.271+0100	INFO	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "error": "failed to push trace data via Jaeger exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:14250: connect: connection refused\"", "interval": "24.027355817s"}
2021-02-02T13:43:52.299+0100	INFO	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "error": "failed to push trace data via Jaeger exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:14250: connect: connection refused\"", "interval": "30.456378855s"}
2021-02-02T13:44:22.756+0100	INFO	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "error": "failed to push trace data via Jaeger exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:14250: connect: connection refused\"", "interval": "39.40919883s"}
2021-02-02T13:45:02.167+0100	INFO	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "error": "failed to push trace data via Jaeger exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:14250: connect: connection refused\"", "interval": "21.427916177s"}
2021-02-02T13:45:23.596+0100	INFO	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "error": "failed to push trace data via Jaeger exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:14250: connect: connection refused\"", "interval": "26.419715679s"}
2021-02-02T13:45:50.016+0100	ERROR	exporterhelper/queued_retry.go:257	Exporting failed. No more retries left. Dropping data.	{"component_kind": "exporter", "component_type": "jaeger", "component_name": "jaeger", "error": "max elapsed time expired failed to push trace data via Jaeger exporter: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp [::1]:14250: connect: connection refused\"", "dropped_items": 2}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
	/home/jpkroehling/Projects/src/github.com/open-telemetry/opentelemetry-collector/exporter/exporterhelper/queued_retry.go:257
go.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send
	/home/jpkroehling/Projects/src/github.com/open-telemetry/opentelemetry-collector/exporter/exporterhelper/tracehelper.go:114
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
	/home/jpkroehling/Projects/src/github.com/open-telemetry/opentelemetry-collector/exporter/exporterhelper/queued_retry.go:133
github.com/jaegertracing/jaeger/pkg/queue.(*BoundedQueue).StartConsumers.func1
	/home/jpkroehling/go/pkg/mod/github.com/jaegertracing/[email protected]/pkg/queue/bounded_queue.go:77

Before the permanent failure, there are no metrics indicating that batches are being retried. Only when they finally fail is the following metric updated:

otelcol_exporter_send_failed_spans{exporter="jaeger",service_instance_id="..."} 2

Suggestions:

  • Add a gauge with the current size of the retry queue
  • Add a counter with the number of attempts that failed (a sketch of both metrics follows this list)
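
A minimal sketch of what these two metrics could look like, assuming the OpenCensus stats package the collector currently uses for its internal telemetry; the measure names, the exporter tag key, and the recordFailedAttempt helper are illustrative, not part of the existing code:

package exporterhelper // hypothetical placement

import (
    "context"

    "go.opencensus.io/stats"
    "go.opencensus.io/stats/view"
    "go.opencensus.io/tag"
)

var (
    // Tag to break the metrics down per exporter, e.g. exporter="jaeger".
    exporterKey, _ = tag.NewKey("exporter")

    // Illustrative measure names.
    queueSize      = stats.Int64("exporter/queue_size", "Current size of the retry queue", stats.UnitDimensionless)
    failedAttempts = stats.Int64("exporter/failed_send_attempts", "Number of send attempts that failed and will be retried", stats.UnitDimensionless)
)

// metricViews returns the views that would have to be registered via
// view.Register for the measures above to be exported.
func metricViews() []*view.View {
    return []*view.View{
        {
            Name:        queueSize.Name(),
            Description: queueSize.Description(),
            Measure:     queueSize,
            TagKeys:     []tag.Key{exporterKey},
            Aggregation: view.LastValue(), // gauge semantics: last observed queue size
        },
        {
            Name:        failedAttempts.Name(),
            Description: failedAttempts.Description(),
            Measure:     failedAttempts,
            TagKeys:     []tag.Key{exporterKey},
            Aggregation: view.Sum(), // counter semantics: total failed attempts
        },
    }
}

// recordFailedAttempt is a hypothetical helper the retry sender could call
// each time a send fails and is scheduled for retry.
func recordFailedAttempt(ctx context.Context, exporterName string) {
    ctx, _ = tag.New(ctx, tag.Upsert(exporterKey, exporterName))
    stats.Record(ctx, failedAttempts.M(1))
}

The LastValue aggregation gives the queue-size measure gauge semantics, while Sum turns the failed-attempt measure into a monotonically increasing counter.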

Additionally, the logs are polluted with information that isn't very useful. Under heavy load, this would put even more pressure on the collector. While the failure is still transient, we should log it only at DEBUG level. When the failure becomes permanent, we should emit an INFO-level statement, capped at one message every 5s.
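
As an illustration of the capping idea only: zap, which the collector already uses for logging, can wrap a core in a sampler so that at most one occurrence of a given message is emitted per 5-second window. The newThrottledLogger name is made up for this sketch:

package main

import (
    "time"

    "go.uber.org/zap"
    "go.uber.org/zap/zapcore"
)

// newThrottledLogger is a hypothetical helper: it wraps an existing logger so
// that, within every 5-second tick, only the first occurrence of a given
// message is emitted and subsequent duplicates are dropped.
func newThrottledLogger(base *zap.Logger) *zap.Logger {
    return base.WithOptions(zap.WrapCore(func(core zapcore.Core) zapcore.Core {
        // first=1, thereafter=0: keep the first entry per tick, drop the rest.
        return zapcore.NewSamplerWithOptions(core, 5*time.Second, 1, 0)
    }))
}

func main() {
    logger, _ := zap.NewProduction()
    throttled := newThrottledLogger(logger)

    // Even under a burst of identical failures, at most one line per 5s is logged.
    for i := 0; i < 50; i++ {
        throttled.Info("Exporting failed. No more retries left. Dropping data.")
        time.Sleep(200 * time.Millisecond)
    }
}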

@jpkrohling (Member, Author)

cc @pkositsyn, as this last item might be what you mentioned in another issue.

@albertteoh (Contributor) commented Feb 3, 2021

@jpkrohling it seems queued_retry was deprecated and removed in v0.19.0 Beta (I don't know why).

Maintainers: the monitoring documentation still states "The queued_retry processor is recommended as the retry mechanism for the Collector and as such should be used in any production deployment."
With queued_retry gone, what is the recommended retry mechanism, or is this somehow the default behaviour?

@jpkrohling (Member, Author)

The processor was deprecated, but this issue is about the exporter helper that replaced it.

@albertteoh (Contributor)

I see, thanks!

@jpkrohling (Member, Author)

@bogdandrutu, @tigrannajaryan: given the discussion about refraining from using OpenCensus instrumentation libraries in favor of OpenTelemetry's, should this item be postponed?

@bogdandrutu added this to the Phase2-GA-Roadmap milestone Mar 10, 2021
@tigrannajaryan added the help wanted label (Good issue for contributors to OpenTelemetry Service to pick up) Mar 16, 2021
dmitryax added a commit to dmitryax/opentelemetry-collector that referenced this issue Mar 30, 2021
This commit adds observability to the queued_retry exporter helper. It adds the first metric, "queue_length", which indicates the current size of the queue per exporter. The metric is updated every second.

This is the first commit to address the issue open-telemetry#2434
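
A rough sketch of the "updated every second" approach described in the commit message, under the same OpenCensus assumptions as in the sketch above; the function and measure names here are illustrative, not the code that was merged:

package exporterhelper // hypothetical placement

import (
    "context"
    "time"

    "go.opencensus.io/stats"
    "go.opencensus.io/tag"
)

var (
    exporterKey, _ = tag.NewKey("exporter")
    queueLength    = stats.Int64("exporter/queue_length", "Current length of the sending queue", stats.UnitDimensionless)
)

// startQueueLengthReporter samples the queue length once per second and
// records it for the given exporter until the context is cancelled.
// In practice queueLen would likely be backed by the bounded queue's size.
func startQueueLengthReporter(ctx context.Context, exporterName string, queueLen func() int64) {
    ctx, _ = tag.New(ctx, tag.Upsert(exporterKey, exporterName))
    go func() {
        ticker := time.NewTicker(time.Second)
        defer ticker.Stop()
        for {
            select {
            case <-ctx.Done():
                return
            case <-ticker.C:
                stats.Record(ctx, queueLength.M(queueLen()))
            }
        }
    }()
}
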
bogdandrutu pushed a commit that referenced this issue Apr 2, 2021
pjanotti pushed a commit to pjanotti/opentelemetry-service that referenced this issue Apr 6, 2021 (open-telemetry#2858)
@alolita (Member) commented May 12, 2021

@bogdandrutu can this be moved to a Phase 3 backlog, to be done post-GA?

@bogdandrutu removed this from the Phase2-GA-Roadmap milestone May 14, 2021