PoC: Assess the impact of direct Prometheus queries on reconciliation duration #526

Closed
skhalash opened this issue Nov 8, 2023 · 2 comments
Assignees
Labels
area/metrics MetricPipeline area/traces TracePipeline kind/decision Marks a decision document
Milestone

Comments

@skhalash
Collaborator

skhalash commented Nov 8, 2023

Description

This effort is a follow-up to the work done in #516. We need to understand the impact of direct Prometheus queries on reconciliation duration.

Acceptance Criteria

  • Build a PoC that sets up a simple Prometheus instance alongside the Telemetry Manager
  • Develop a set of PromQL queries based on OTel Collector metrics to detect specific scenarios, such as backend connectivity issues, backpressure/throttling due to backend overload, and reaching the ingestion limit of a pipeline.
  • Check the different patterns we identified in the ADR for their impact on reconciliation, especially the duration of a single reconciliation
  • Document the decisions in the ADR as part of the story

Related Issues
#425

@skhalash skhalash changed the title PoC: Assess the Impact of direct Prometheus queries on reconciliation duration PoC: Assess the impact of direct Prometheus queries on reconciliation duration Nov 8, 2023
@a-thaler a-thaler added the area/metrics MetricPipeline label Jan 8, 2024
@skhalash skhalash self-assigned this Jan 8, 2024
@skhalash
Collaborator Author

skhalash commented Jan 10, 2024

Selected Queries:

rate(otelcol_exporter_send_failed_metric_points[5m]) > 0
rate(otelcol_exporter_enqueue_failed_metric_points[5m]) > 0
(otelcol_exporter_queue_size/otelcol_exporter_queue_capacity)*100 > 90

rate(otelcol_processor_dropped_metric_points[5m]) > 0
rate(otelcol_processor_refused_metric_points[5m]) > 0

rate(otelcol_receiver_refused_metric_points[5m]) > 0
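
As an illustration of the queue query above, the following is a minimal sketch of the same arithmetic in Python. The function names and the sample values are hypothetical and not part of the PoC; only the formula and the 90 percent threshold come from the query itself.

```python
# Mirrors (otelcol_exporter_queue_size/otelcol_exporter_queue_capacity)*100 > 90.
# Function names and sample values are illustrative assumptions.

def queue_usage_percent(queue_size: float, queue_capacity: float) -> float:
    """Return the exporter queue fill level as a percentage."""
    if queue_capacity <= 0:
        raise ValueError("queue_capacity must be positive")
    return (queue_size / queue_capacity) * 100


def queue_nearly_full(queue_size: float, queue_capacity: float,
                      threshold_percent: float = 90.0) -> bool:
    """True when the fill level exceeds the threshold used in the query."""
    return queue_usage_percent(queue_size, queue_capacity) > threshold_percent
```

In PromQL the `> 90` comparison acts as a filter, so the query returns only the series whose fill level is above the threshold; an empty result means that no exporter queue is close to capacity.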

Prometheus Query Execution Time

Testing individual Prometheus queries showed that the execution time on k3d is negligible, typically in the range of tens of milliseconds. In a real cluster environment, the execution time might be somewhat higher, but it would still be influenced primarily by network latency.
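
A measurement of this kind could be reproduced with a small script like the sketch below. It is not the tooling used in the PoC. It assumes a Prometheus instance reachable at a base URL of your choice and uses the standard instant-query endpoint `/api/v1/query` of the Prometheus HTTP API. The helper names and the injectable `fetch` parameter are hypothetical and exist only to keep the sketch testable without a running server.

```python
import json
import time
import urllib.parse
import urllib.request

# The queries selected in this issue.
QUERIES = [
    "rate(otelcol_exporter_send_failed_metric_points[5m]) > 0",
    "rate(otelcol_exporter_enqueue_failed_metric_points[5m]) > 0",
    "(otelcol_exporter_queue_size/otelcol_exporter_queue_capacity)*100 > 90",
    "rate(otelcol_processor_dropped_metric_points[5m]) > 0",
    "rate(otelcol_processor_refused_metric_points[5m]) > 0",
    "rate(otelcol_receiver_refused_metric_points[5m]) > 0",
]


def build_query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def time_query(base_url, promql, fetch=None, timeout=5.0):
    """Run one query and return (elapsed_seconds, number_of_result_series).

    `fetch` is a hypothetical hook: a callable that takes a URL and returns
    the response body. It defaults to a plain HTTP GET.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()

    start = time.perf_counter()
    body = fetch(build_query_url(base_url, promql))
    elapsed = time.perf_counter() - start

    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return elapsed, len(payload["data"]["result"])


def measure_all(base_url, fetch=None):
    """Time every selected query once and return {query: elapsed_seconds}."""
    return {q: time_query(base_url, q, fetch=fetch)[0] for q in QUERIES}
```

Calling `measure_all("http://localhost:9090")` against a port-forwarded Prometheus (the address is an assumption) returns one wall-clock duration per query. The durations include the HTTP round trip, which is why network latency would dominate in a real cluster.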

@skhalash skhalash added this to the 1.7.0 milestone Jan 11, 2024
@skhalash skhalash added area/traces TracePipeline kind/decision Marks a decision document labels Jan 11, 2024
@skhalash
Collaborator Author

To wrap it up, the consensus was to implement the integration of Prometheus and Telemetry Manager using the Alerting approach.


2 participants