spanmetricsconnector produces a lot of exemplars #23872

Closed
povilasv opened this issue Jun 30, 2023 · 8 comments
Labels
bug Something isn't working connector/spanmetrics

Comments

@povilasv
Contributor

povilasv commented Jun 30, 2023

Component(s)

connector/spanmetrics

What happened?

Description

Every single span sent to span2metrics adds an exemplar to the histogram metric, which can add up to a lot of data, since each exemplar records the span ID, trace ID, latency, and timestamp.

General question: doesn't this defeat the purpose of the histogram? With histograms I mainly expect aggregated data, not every single recorded event.

I added a bunch of debugging output to the span2metrics connector, and I can see the exemplars:

  (v1.Exemplar) time_unix_nano:1688126657105088872 as_double:0.125755 span_id:[31 19 133 30 111 26 52 140] trace_id:[95 19 233 90 229 29 22 134 83 84 4 117 25 77 5 180] ,
  (v1.Exemplar) time_unix_nano:1688126657105088872 as_double:0.125691 span_id:[252 153 209 139 224 166 71 240] trace_id:[173 169 237 196 254 215 126 201 50 55 147 42 111 33 255 52] ,
  (v1.Exemplar) time_unix_nano:1688126657105088872 as_double:0.127795 span_id:[167 212 174 43 46 170 241 103] trace_id:[94 109 78 4 69 159 178 164 62 111 190 199 101 49 196 217] ,
  (v1.Exemplar) time_unix_nano:1688126657105088872 as_double:0.125864 span_id:[110 140 57 72 41 141 239 176] trace_id:[49 16 131 177 44 198 106 255 13 100 16 18 198 125 139 66]
 })

I couldn't find another way to print exemplars to the console, but my test shows that if the histogram's sample count is 248819, the exemplar count is also 248819.
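To make the reported behaviour concrete, here is a minimal Go sketch of an explicit-bucket histogram that appends an exemplar on every observation, which is why the exemplar count tracks the sample count one-for-one. The names (`histogram`, `observe`, `exemplar`) are hypothetical simplifications, not the connector's actual code, and the bucket bounds are the issue's config values assumed to be in milliseconds.

```go
package main

import (
	"fmt"
	"sort"
)

// exemplar mirrors the fields visible in the debug output above:
// span ID, trace ID, the recorded value and a timestamp.
type exemplar struct {
	traceID [16]byte
	spanID  [8]byte
	value   float64
	timeNs  uint64
}

// histogram is a hypothetical, simplified explicit-bucket histogram.
type histogram struct {
	bounds       []float64 // bucket upper bounds, sorted ascending
	bucketCounts []uint64  // len(bounds)+1 buckets; the last one is +Inf
	count        uint64
	sum          float64
	exemplars    []exemplar
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, bucketCounts: make([]uint64, len(bounds)+1)}
}

// observe records one span duration and, like the behaviour reported in
// the issue, unconditionally appends an exemplar for it.
func (h *histogram) observe(v float64, e exemplar) {
	// Bucket i covers (bounds[i-1], bounds[i]]: the first bound >= v.
	idx := sort.SearchFloat64s(h.bounds, v)
	h.bucketCounts[idx]++
	h.count++
	h.sum += v
	h.exemplars = append(h.exemplars, e) // one exemplar per sample
}

func main() {
	// Bucket bounds from the issue's config, in milliseconds (assumed unit).
	h := newHistogram([]float64{0.1, 1, 2, 6, 10, 100, 250})
	for i := 0; i < 1000; i++ {
		v := 0.125 + float64(i%10)*0.001
		h.observe(v, exemplar{value: v, timeNs: uint64(i)})
	}
	fmt.Println("samples:", h.count, "exemplars:", len(h.exemplars))
}
```

Exemplar storage in this model grows with span volume rather than with the number of buckets, which is the cost the issue is pointing at.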

Steps to Reproduce

  • add some println calls where the aggregated histogram is sent
  • run otel collector
  • ./bin/telemetrygen_linux_amd64 traces --otlp-insecure --otlp-endpoint localhost:4317 --duration 30s

Expected Result

Actual Result

Collector version

v0.80.0

Environment information

Environment

OS: (e.g., "Ubuntu 20.04")
Compiler(if manually compiled): (e.g., "go 14.2")

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
      http:


connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100us, 1ms, 2ms, 6ms, 10ms, 100ms, 250ms]
    dimensions:
      - name: http.method
        default: GET
      - name: http.status_code
    dimensions_cache_size: 1000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"    
    metrics_flush_interval: 15s 

exporters:
  logging:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]
    metrics:
      receivers: [otlp, spanmetrics]
      exporters: [logging]

Log output

No response

Additional context

No response

@povilasv povilasv added bug Something isn't working needs triage New item requiring triage labels Jun 30, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@povilasv
Contributor Author

One solution could be adding an option to disable exemplars in spanmetricsconnector.

I can work on this if you folks agree.

@Frapschen
Contributor

I think we can simply add an option to disable exemplars. BTW, could we also have strategies to decide whether a given metric event needs an exemplar? The first strategy that comes to mind is a latency threshold, e.g.:

exemplar:
  enabled: true
  strategy:
    latencyThreshold: 500ms
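A threshold strategy amounts to a single comparison per span. Below is a minimal sketch, assuming a hypothetical `exemplarSampler` interface consulted before an exemplar is attached; the type names and the `durationThreshold` naming (which follows a later suggestion in this thread) are illustrative, not connector API.

```go
package main

import (
	"fmt"
	"time"
)

// defaultDurationThreshold matches the 500ms example in the proposal.
const defaultDurationThreshold = 500 * time.Millisecond

// exemplarSampler decides whether one span observation should also be kept
// as an exemplar. Hypothetical interface, not part of the connector.
type exemplarSampler interface {
	shouldSample(d time.Duration) bool
}

// durationThresholdSampler keeps exemplars only for spans whose duration is
// at or above the threshold, so fast spans add no exemplars.
type durationThresholdSampler struct {
	threshold time.Duration
}

func (s durationThresholdSampler) shouldSample(d time.Duration) bool {
	return d >= s.threshold
}

func main() {
	var s exemplarSampler = durationThresholdSampler{threshold: defaultDurationThreshold}
	for _, d := range []time.Duration{120 * time.Millisecond, 500 * time.Millisecond, 2 * time.Second} {
		fmt.Printf("%v -> keep exemplar: %v\n", d, s.shouldSample(d))
	}
}
```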

@albertteoh
Contributor

albertteoh commented Jul 7, 2023

I like @Frapschen's idea of making exemplars configurable. My only suggestion is to use the naming convention of duration (used in this spanmetrics connector) rather than latency (used in the deprecated spanmetrics processor).

I also prefer @Frapschen's "enabled" option over a "disabled" one (so exemplars are disabled by default). It does, however, mean a breaking change. I'm okay with that as long as it's documented in the changelog; it's better to get this right now, while the connector is still in alpha, than later when it's considered stable.

Thoughts? @povilasv @kovrus
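The preference for an "enabled" flag has a convenient consequence in Go: a bool field's zero value is false, so leaving the setting out keeps exemplars off, whereas a "disabled" flag would leave them on by default. A minimal sketch, assuming defaults come from the struct's zero value (a real component may set defaults elsewhere); `exemplarsConfig`, `connectorConfig`, and `attachExemplar` are hypothetical names, not the connector's config types.

```go
package main

import "fmt"

// exemplarsConfig is a hypothetical stand-in for the connector's exemplar
// settings; the names are illustrative only.
type exemplarsConfig struct {
	// Enabled defaults to Go's zero value, false, so exemplars stay off
	// unless the user opts in with "enabled: true".
	Enabled bool
}

type connectorConfig struct {
	Exemplars exemplarsConfig
}

// attachExemplar reports whether an exemplar should be recorded for a span.
func attachExemplar(cfg connectorConfig) bool {
	return cfg.Exemplars.Enabled
}

func main() {
	var unset connectorConfig // no exemplar section in the YAML
	fmt.Println("default:", attachExemplar(unset))
	optIn := connectorConfig{Exemplars: exemplarsConfig{Enabled: true}}
	fmt.Println("enabled: true ->", attachExemplar(optIn))
}
```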

@povilasv
Contributor Author

povilasv commented Jul 9, 2023

I want to do this step by step, so I pushed a PR that disables exemplars by default: #24048

I like the concept of strategies. I was thinking we could also have a "OnePerHistogramBucket" strategy, which would collect at most one exemplar per latency bucket of the histogram.

In the agent deployment model, many different services might send spans, and X ms can be slow for one service but fast for another, so it's hard to pick a single threshold :)

So the config would look like this:

exemplar:
  enabled: true
  strategy:
    # either durationThreshold or onePerHistogramBucket should be set
    durationThreshold: 500ms
    onePerHistogramBucket: true
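A minimal sketch of the proposed "OnePerHistogramBucket" strategy, assuming per-bucket state that is cleared at each metrics flush; `onePerBucketSampler`, `shouldSample`, and `reset` are hypothetical names. It caps exemplars at len(bounds)+1 per histogram series per flush interval, independent of span volume, and needs no per-service threshold. The bounds below are the issue's buckets assumed to be in milliseconds.

```go
package main

import (
	"fmt"
	"sort"
)

// onePerBucketSampler keeps at most one exemplar per histogram bucket until
// reset, e.g. at every metrics flush. Hypothetical sketch, not connector code.
type onePerBucketSampler struct {
	bounds []float64 // explicit bucket upper bounds, ascending
	filled []bool    // filled[i] is true once bucket i already has an exemplar
}

func newOnePerBucketSampler(bounds []float64) *onePerBucketSampler {
	return &onePerBucketSampler{bounds: bounds, filled: make([]bool, len(bounds)+1)}
}

// shouldSample reports whether v is the first observation to land in its
// bucket since the last reset, and marks that bucket as filled.
func (s *onePerBucketSampler) shouldSample(v float64) bool {
	i := sort.SearchFloat64s(s.bounds, v) // index len(bounds) is the +Inf bucket
	if s.filled[i] {
		return false
	}
	s.filled[i] = true
	return true
}

// reset clears the per-bucket state, e.g. after exemplars are flushed.
func (s *onePerBucketSampler) reset() {
	for i := range s.filled {
		s.filled[i] = false
	}
}

func main() {
	s := newOnePerBucketSampler([]float64{0.1, 1, 2, 6, 10, 100, 250})
	// The first four values come from the debug output in the issue.
	values := []float64{0.125755, 0.125691, 0.127795, 0.125864, 5, 300}
	kept := 0
	for _, v := range values {
		if s.shouldSample(v) {
			kept++
		}
	}
	fmt.Printf("exemplars kept: %d of %d\n", kept, len(values))
}
```

The four values from the debug output all fall into the same (0.1, 1] bucket, so only the first of them would be kept.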

Thoughts?

Also, should we add the existing "collect all" behaviour as a strategy?
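If "collect all" became a strategy alongside the two proposed ones, the "set only one" rule from the config comment could be enforced at validation time. A hypothetical sketch: the struct and field names mirror the proposed YAML but are assumptions, and rejecting an empty strategy set is one design choice among several (falling back to a default strategy would be another).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// strategyConfig mirrors the proposed YAML; all names are hypothetical.
type strategyConfig struct {
	DurationThreshold     time.Duration // 0 means unset
	OnePerHistogramBucket bool
	CollectAll            bool // the existing "every span" behaviour
}

// validate enforces "exactly one strategy" from the proposal's comment.
func (c strategyConfig) validate() error {
	set := 0
	if c.DurationThreshold > 0 {
		set++
	}
	if c.OnePerHistogramBucket {
		set++
	}
	if c.CollectAll {
		set++
	}
	switch set {
	case 0:
		return errors.New("no exemplar strategy configured")
	case 1:
		return nil
	default:
		return fmt.Errorf("%d exemplar strategies configured, expected exactly one", set)
	}
}

func main() {
	bad := strategyConfig{DurationThreshold: 500 * time.Millisecond, OnePerHistogramBucket: true}
	fmt.Println(bad.validate())
	good := strategyConfig{OnePerHistogramBucket: true}
	fmt.Println(good.validate())
}
```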

@JaredTan95 JaredTan95 removed the needs triage New item requiring triage label Jul 10, 2023
mx-psi added a commit that referenced this issue Jul 13, 2023
**Description:**

Breaking change! Allows enabling / disabling Exemplars.

**Link to tracking Issue:**

#23872
**Testing:**

- Added unit test
**Documentation:**
- Added docs

---------

Co-authored-by: Albert <[email protected]>
Co-authored-by: Pablo Baeyens <[email protected]>
@povilasv
Contributor Author

povilasv commented Jul 13, 2023

FYI, we just merged the change that disables exemplars by default. What should we do about exemplar "sampling" strategies? Which one should we implement? Let me know; I can find some time to work on it :)

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Sep 12, 2023
@mx-psi mx-psi removed the Stale label Sep 12, 2023
@povilasv
Copy link
Contributor Author

Given that the original issue is fixed, I'm closing this. We can add or discuss strategies in separate issues.
