[spanmetricsprocessor] panic when using spanmetric with v0.56 #12644

Closed · amoscatelli opened this issue Jul 22, 2022 · 10 comments · Fixed by #12661
Labels: bug (Something isn't working), priority:p1 (High), processor/spanmetrics (Span Metrics processor)

Comments

amoscatelli commented Jul 22, 2022

Describe the bug
Using v0.56 with the spanmetrics processor causes a panic that crashes the OpenTelemetry Collector Contrib; the collector then becomes completely unusable and unresponsive.

Steps to reproduce
I'm not sure of the exact trigger; probably just using the spanmetrics processor.
For me it usually happens within a few minutes of starting the server.

What did you expect to see?
no crash, no panic, no errors

What did you see instead?

panic: runtime error: index out of range [0] with length 0

goroutine 5236 [running]:
go.opentelemetry.io/collector/pdata/internal.Map.Get(...)
        go.opentelemetry.io/collector/[email protected]/internal/common.go:636
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.getDimensionValue({{0x40001570e4?, 0x40022e562c?}, 0x0?}, {0x1?}, {0x4002af2d98?})
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:493 +0x1c4
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.buildKey({0x40016f29f0?, 0x4001697500?}, {0x40024e0ae0?}, {0x4000a882c0, 0x7, 0x52?}, {0x3f?})
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:475 +0x130
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).aggregateMetricsForSpan(0x40005712c0, {0x40016f29f0, 0x11}, {0x4?}, {0x4000be6ed8?})
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:386 +0xa4
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).aggregateMetricsForServiceSpans(0x45e4a40?, {0x468bc60?}, {0x40016f29f0, 0x11})
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:375 +0x94
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).aggregateMetrics(0x40017d0f60?, {0x5b170e8?})
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:364 +0xdc
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces.func1()
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:239 +0xd0
created by github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:231 +0xc0

What version did you use?
OpenTelemetry Collector Contrib Docker image v0.56

What config did you use?

receivers:
    otlp:
        protocols:
            http:
                include_metadata: true
                cors:
                    allowed_origins: [
                        #OMITTED#
                    ]                    
    # Dummy receiver that's never used, because a pipeline is required to have one.
    otlp/spanmetrics:
        protocols:
            grpc:
                endpoint: "localhost:12345"

    prometheus:
      config:
        scrape_configs:
          - job_name: 'ratelimiter'
            scrape_interval: ${RECEIVERS_PROMETHEUS_CONFIG_SCRAPE_INTERVAL}
            static_configs:
              - targets: ['${HOSTNAME}:8889']

processors:
    attributes/traces:
        actions:
          - key: http.client_latitude
            from_context: CloudFront-Viewer-Latitude
            action: upsert
          - key: http.client_longitude
            from_context: CloudFront-Viewer-Longitude
            action: upsert
    memory_limiter:
        check_interval: 5s
        limit_mib: 448
        spike_limit_mib: 64
    batch:
        send_batch_size: 48
        send_batch_max_size: 48
        timeout: 15s
    transform:
        logs:
            queries:
                - set(attributes["severity_text"], severity_text)
    filter:
        logs:
            exclude:
                match_type: regexp
                record_attributes:
                    - key: severity_text
                      value: "(TRACE|DEBUG)"
    spanmetrics:
        metrics_exporter: prometheus
        dimensions:
            - name: enduser.id
            - name: deployment.environment
            - name: http.status_code
            - name: http.client_ip
            - name: http.client_latitude
            - name: http.client_longitude
            - name: http.method

exporters:
    logging:
        loglevel: ${EXPORTERS_LOGGING_LOGLEVEL}
    otlp:
        endpoint: tempo-eu-west-0.grafana.net:443
        headers:
            authorization: ${EXPORTERS_OTLP_HEADERS_AUTHORIZATION}
            
    prometheusremotewrite:
        endpoint: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
        headers:
            authorization: ${EXPORTERS_PROMETHEUSREMOTEWRITE_HEADERS_AUTHORIZATION}
            
    loki:
        endpoint: https://logs-prod-eu-west-0.grafana.net/loki/api/v1/push
        headers:
            authorization: ${EXPORTERS_LOKI_HEADERS_AUTHORIZATION}
        format: json
        labels:
            attributes:
                container_name: ""
                source: ""
            resource:
                host.name: "hostname"
                
    prometheus:
        endpoint: "${HOSTNAME}:8889"
        metric_expiration: ${EXPORTERS_PROMETHEUS_METRIC_EXPIRATION}

extensions:
    health_check:
service:
    telemetry:
        logs:
            level: ${SERVICE_TELEMETRY_LOGS_LEVEL}
    extensions: [health_check]
    pipelines:
        traces:
            receivers: [otlp]
            processors: [attributes/traces, spanmetrics, memory_limiter, batch]
            exporters: [otlp, logging]
        metrics/spanmetrics:
            # This receiver is just a dummy and never used.
            # Added to pass validation requiring at least one receiver in a pipeline.
            receivers: [otlp/spanmetrics]
            # The metrics_exporter must be present in this list.
            exporters: [prometheus, logging]
        metrics:
            receivers: [otlp]
            processors: [memory_limiter, batch]
            exporters: [prometheus, logging]
        metrics/prometheus:
            receivers: [prometheus]
            processors: [memory_limiter, batch]
            exporters: [prometheusremotewrite, logging]
        logs:
            receivers: [otlp]
            processors: [transform, filter, memory_limiter, batch]
            exporters: [loki, logging]

Environment
AWS Elastic Beanstalk Linux v2

Additional context
This makes the OTel Collector completely unusable.

amoscatelli added the bug (Something isn't working) label on Jul 22, 2022
mx-psi added the processor/spanmetrics (Span Metrics processor) label on Jul 22, 2022
@github-actions (Contributor)

Pinging code owners: @albertteoh

@TylerHelmuth (Member)

@amoscatelli are you able to reproduce this error in 0.55.0 or before?

@amoscatelli (Author)

@TylerHelmuth I can confirm 100% that this issue is reproducible only with 0.56.0.
I rolled back to 0.55.0 and it works as before.
Maybe this issue was caused by the fix for #11149.

@albertteoh (Contributor)

It appears to be a race condition in Map.Get:

	for i := range *m.orig {
		akv := &(*m.orig)[i]

where the error index out of range [0] with length 0 suggests that one element was detected in the slice, the loop was entered and, while attempting to access the first element, the slice was found to be empty.

I suspect this is because the metrics are computed in a separate goroutine from the trace "stream":

func (p *processorImp) ConsumeTraces(ctx context.Context, traces ptrace.Traces) error {
	go func() {
...
		p.lock.Lock()
		defer p.lock.Unlock()

		p.aggregateMetrics(traces)
...
}()
	// Forward trace data unmodified.
	return p.nextConsumer.ConsumeTraces(ctx, traces) // <-- this could modify the span attributes downstream
	                                                 //     while the goroutine above is trying to read them.
}

I'll see if I can reproduce the panic locally first to prove this hunch.
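
To illustrate the suspected mechanism in isolation, here is a minimal, self-contained Go sketch. This is not the collector's code; the keyValue type and the get helper are made up for illustration. One goroutine ranges over a shared slice the way Map.Get does, while the main goroutine empties it, the way a downstream consumer clearing span attributes would. Run with go run -race it should report a data race, and depending on timing it can also panic with the same index out of range [0] with length 0 error:

package main

import (
	"sync"
	"time"
)

type keyValue struct {
	Key   string
	Value string
}

// get mimics the shape of Map.Get: range observes the slice length once,
// but the element access re-reads the slice, which may have shrunk meanwhile.
func get(m *[]keyValue, key string) (string, bool) {
	for i := range *m {
		time.Sleep(time.Millisecond) // widen the race window for the demo
		kv := &(*m)[i]               // may be out of range if the slice was emptied
		if kv.Key == key {
			return kv.Value, true
		}
	}
	return "", false
}

func main() {
	attrs := []keyValue{{Key: "http.method", Value: "GET"}}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // plays the role of the async metrics aggregation goroutine
		defer wg.Done()
		get(&attrs, "http.method")
	}()

	attrs = attrs[:0] // plays the role of a downstream consumer clearing attributes
	wg.Wait()
}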

TylerHelmuth commented Jul 23, 2022

Would making a deep copy of the trace data to use in aggregateMetrics solve this issue?


albertteoh commented Jul 23, 2022

Would making a deep copy of the trace data to use in aggregateMetrics solve this issue?

I think so 😄

The following seems to reproduce the race condition via unit tests, with some hacks:

  1. Simulate mutation of the span's attributes in downstream trace consumers by modifying spanmetricsprocessor.ConsumeTraces to sleep for a short period of time (1 ms) after the goroutine, then clearing every span's attributes. Why the short sleep? Because we want the spanmetrics processor to think that nothing has changed right up until it enters the loop in Map.Get (see next step):
func (p *processorImp) ConsumeTraces(ctx context.Context, traces ptrace.Traces) error {
  go func() {
  ...
  }()
  time.Sleep(time.Millisecond)
  for i := 0; i < traces.ResourceSpans().Len(); i++ {
  	attr := traces.ResourceSpans().At(i).Resource().Attributes()
  	attr.Clear()
  }
  return p.nextConsumer.ConsumeTraces(ctx, traces)
}
  2. Modify Map.Get to sleep slightly longer than step 1 (2 ms) within the for loop, allowing step 1 to finish emptying all the attributes so that by the time we invoke akv := &(*m.orig)[i], m.orig will be empty:
func (m Map) Get(key string) (Value, bool) {
  for i := range *m.orig {
  	time.Sleep(2*time.Millisecond)
  	akv := &(*m.orig)[i]
  	if akv.Key == key {
  		return Value{&akv.Value}, true
  	}
  }
  return Value{nil}, false
}
  3. This successfully triggers the panic:
=== RUN   TestProcessorConsumeTraces/Test_single_consumption,_three_spans_(Cumulative).
panic: runtime error: index out of range [0] with length 0

goroutine 35 [running]:
go.opentelemetry.io/collector/pdata/internal.Map.Get({0xb4c57012b70?}, {0x1bca27f, 0xc})
	/Users/albertteoh/go/src/github.com/albertteoh/opentelemetry-collector-contrib/processor/spanmetricsprocessor/vendor/go.opentelemetry.io/collector/pdata/internal/common.go:639 +0xfc
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).aggregateMetrics(0xc0001b87b0?, {0x1118102?})
	/Users/albertteoh/go/src/github.com/albertteoh/opentelemetry-collector-contrib/processor/spanmetricsprocessor/processor.go:365 +0x69
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces.func1()
	/Users/albertteoh/go/src/github.com/albertteoh/opentelemetry-collector-contrib/processor/spanmetricsprocessor/processor.go:240 +0xa5
created by github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces
	/Users/albertteoh/go/src/github.com/albertteoh/opentelemetry-collector-contrib/processor/spanmetricsprocessor/processor.go:232 +0xb5

Applying the following fix passes the unit test:

  tracesClone := traces.Clone()
  go func() {
  	...
  	p.aggregateMetrics(tracesClone)
  	...
  }()
  ...
  return p.nextConsumer.ConsumeTraces(ctx, traces)
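
Assembled from the fragments above, the fixed ConsumeTraces might look roughly like this (a sketch only: the lock and nextConsumer fields come from the earlier snippet and the stack trace, other work done inside the goroutine is omitted, and the actual change merged in #12661 may differ):

func (p *processorImp) ConsumeTraces(ctx context.Context, traces ptrace.Traces) error {
	// Clone before handing the data to the aggregation goroutine so that
	// downstream consumers mutating `traces` cannot race with the reads below.
	tracesClone := traces.Clone()
	go func() {
		p.lock.Lock()
		defer p.lock.Unlock()
		p.aggregateMetrics(tracesClone)
	}()
	// Forward the original trace data unmodified.
	return p.nextConsumer.ConsumeTraces(ctx, traces)
}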


mx-psi commented Jul 25, 2022

Do we know of any workaround that would prevent this?

@albertteoh (Contributor)

Do we know of any workaround that would prevent this?

A possible workaround (I have not tried this) is to not configure additional spanmetricsprocessor dimensions on top of the defaults, if it's acceptable to the use case. This is because the panic comes from trying to fetch a span's attributes to populate the additional custom dimensions. For example, in this issue, the spanmetrics config contains:

    spanmetrics:
        metrics_exporter: prometheus
        dimensions:
            - name: enduser.id
            - name: deployment.environment
            - name: http.status_code
            - name: http.client_ip
            - name: http.client_latitude
            - name: http.client_longitude
            - name: http.method

So the possible workaround (again, if acceptable to the use case) is:

    spanmetrics:
        metrics_exporter: prometheus

Otherwise, a code change is the only alternative I can think of to address this issue.


amoscatelli commented Jul 26, 2022

Thank you but, sadly, it's not a workaround for us.


spurge commented Aug 1, 2022

Do we know of any workaround that would prevent this?

A possible workaround (I have not tried this) is to not configure additional spanmetricsprocessor dimensions on top of the defaults, if it's acceptable to the use case. This is because the panic comes from trying to fetch a span's attributes to populate the additional custom dimensions.

We tried it and it did work out for us, as this is still useful without any additional dimensions 👍🏻 ⭐
