[spanmetricsprocessor] panic when using spanmetric with v0.56 #12644

Closed · amoscatelli opened this issue Jul 22, 2022 · 10 comments · Fixed by #12661
Labels: bug (Something isn't working), priority:p1 (High), processor/spanmetrics (Span Metrics processor)

Comments

amoscatelli commented Jul 22, 2022

Describe the bug
Using v0.56 with the spanmetrics processor causes a panic that crashes the OpenTelemetry Collector Contrib; the collector then becomes completely unusable and unresponsive.

Steps to reproduce
I'm not sure of the exact trigger; probably just using the spanmetrics processor.
For me it usually happens within a few minutes of starting the server.

What did you expect to see?
no crash, no panic, no errors

What did you see instead?

panic: runtime error: index out of range [0] with length 0

goroutine 5236 [running]:
go.opentelemetry.io/collector/pdata/internal.Map.Get(...)
        go.opentelemetry.io/collector/[email protected]/internal/common.go:636
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.getDimensionValue({{0x40001570e4?, 0x40022e562c?}, 0x0?}, {0x1?}, {0x4002af2d98?})
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:493 +0x1c4
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.buildKey({0x40016f29f0?, 0x4001697500?}, {0x40024e0ae0?}, {0x4000a882c0, 0x7, 0x52?}, {0x3f?})
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:475 +0x130
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).aggregateMetricsForSpan(0x40005712c0, {0x40016f29f0, 0x11}, {0x4?}, {0x4000be6ed8?})
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:386 +0xa4
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).aggregateMetricsForServiceSpans(0x45e4a40?, {0x468bc60?}, {0x40016f29f0, 0x11})
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:375 +0x94
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).aggregateMetrics(0x40017d0f60?, {0x5b170e8?})
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:364 +0xdc
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces.func1()
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:239 +0xd0
created by github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces
        github.com/open-telemetry/opentelemetry-collector-contrib/processor/[email protected]/processor.go:231 +0xc0

What version did you use?
OpenTelemetry Collector Contrib Docker image v0.56

What config did you use?

receivers:
    otlp:
        protocols:
            http:
                include_metadata: true
                cors:
                    allowed_origins: [
                        #OMITTED#
                    ]                    
    # Dummy receiver that's never used, because a pipeline is required to have one.
    otlp/spanmetrics:
        protocols:
            grpc:
                endpoint: "localhost:12345"

    prometheus:
      config:
        scrape_configs:
          - job_name: 'ratelimiter'
            scrape_interval: ${RECEIVERS_PROMETHEUS_CONFIG_SCRAPE_INTERVAL}
            static_configs:
              - targets: ['${HOSTNAME}:8889']

processors:
    attributes/traces:
        actions:
          - key: http.client_latitude
            from_context: CloudFront-Viewer-Latitude
            action: upsert
          - key: http.client_longitude
            from_context: CloudFront-Viewer-Longitude
            action: upsert
    memory_limiter:
        check_interval: 5s
        limit_mib: 448
        spike_limit_mib: 64
    batch:
        send_batch_size: 48
        send_batch_max_size: 48
        timeout: 15s
    transform:
        logs:
            queries:
                - set(attributes["severity_text"], severity_text)
    filter:
        logs:
            exclude:
                match_type: regexp
                record_attributes:
                    - key: severity_text
                      value: "(TRACE|DEBUG)"
    spanmetrics:
        metrics_exporter: prometheus
        dimensions:
            - name: enduser.id
            - name: deployment.environment
            - name: http.status_code
            - name: http.client_ip
            - name: http.client_latitude
            - name: http.client_longitude
            - name: http.method

exporters:
    logging:
        loglevel: ${EXPORTERS_LOGGING_LOGLEVEL}
    otlp:
        endpoint: tempo-eu-west-0.grafana.net:443
        headers:
            authorization: ${EXPORTERS_OTLP_HEADERS_AUTHORIZATION}
            
    prometheusremotewrite:
        endpoint: https://prometheus-prod-01-eu-west-0.grafana.net/api/prom/push
        headers:
            authorization: ${EXPORTERS_PROMETHEUSREMOTEWRITE_HEADERS_AUTHORIZATION}
            
    loki:
        endpoint: https://logs-prod-eu-west-0.grafana.net/loki/api/v1/push
        headers:
            authorization: ${EXPORTERS_LOKI_HEADERS_AUTHORIZATION}
        format: json
        labels:
            attributes:
                container_name: ""
                source: ""
            resource:
                host.name: "hostname"
                
    prometheus:
        endpoint: "${HOSTNAME}:8889"
        metric_expiration: ${EXPORTERS_PROMETHEUS_METRIC_EXPIRATION}

extensions:
    health_check:
service:
    telemetry:
        logs:
            level: ${SERVICE_TELEMETRY_LOGS_LEVEL}
    extensions: [health_check]
    pipelines:
        traces:
            receivers: [otlp]
            processors: [attributes/traces, spanmetrics, memory_limiter, batch]
            exporters: [otlp, logging]
        metrics/spanmetrics:
            # This receiver is just a dummy and never used.
            # Added to pass validation requiring at least one receiver in a pipeline.
            receivers: [otlp/spanmetrics]
            # The metrics_exporter must be present in this list.
            exporters: [prometheus, logging]
        metrics:
            receivers: [otlp]
            processors: [memory_limiter, batch]
            exporters: [prometheus, logging]
        metrics/prometheus:
            receivers: [prometheus]
            processors: [memory_limiter, batch]
            exporters: [prometheusremotewrite, logging]
        logs:
            receivers: [otlp]
            processors: [transform, filter, memory_limiter, batch]
            exporters: [loki, logging]

Environment
AWS Elastic Beanstalk Linux v2

Additional context
This makes the OTel Collector completely unusable.

amoscatelli added the bug (Something isn't working) label on Jul 22, 2022
mx-psi added the processor/spanmetrics (Span Metrics processor) label on Jul 22, 2022
@github-actions (Contributor)

Pinging code owners: @albertteoh

@TylerHelmuth (Member)

@amoscatelli are you able to reproduce this error in 0.55.0 or before?

@amoscatelli (Author)

@TylerHelmuth I can confirm 100% that this issue is reproducible only with 0.56.0.
I rolled back to 0.55.0 and it works as before.
Maybe this issue was caused by the fix for #11149.

@albertteoh (Contributor)

It appears to be a race condition in Map.Get:

	for i := range *m.orig {
		akv := &(*m.orig)[i]

where the error index out of range [0] with length 0 suggests that one element was detected in the slice, the loop was entered and, while attempting to access the first element, the slice was found to be empty.

I suspect this is because the metrics are computed in a separate goroutine from the trace "stream":

func (p *processorImp) ConsumeTraces(ctx context.Context, traces ptrace.Traces) error {
	go func() {
...
		p.lock.Lock()
		defer p.lock.Unlock()

		p.aggregateMetrics(traces)
...
}()
	// Forward trace data unmodified.
	return p.nextConsumer.ConsumeTraces(ctx, traces) // <-- this could modify the span attributes downstream
	                                                 //     while the goroutine above is trying to read them.
}

I'll see if I can reproduce the panic locally first to prove this hunch.
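
To illustrate the suspected mechanism in isolation, here is a minimal, self-contained Go sketch. This is not the collector's code; the keyValue type and the get helper are made up for illustration. One goroutine ranges over a shared slice the way Map.Get does, while the main goroutine empties it, the way a downstream consumer clearing span attributes would. Run with go run -race it should report a data race, and depending on timing it can also panic with the same index out of range [0] with length 0 error:

package main

import (
	"sync"
	"time"
)

type keyValue struct {
	Key   string
	Value string
}

// get mimics the shape of Map.Get: range observes the slice length once,
// but the element access re-reads the slice, which may have shrunk meanwhile.
func get(m *[]keyValue, key string) (string, bool) {
	for i := range *m {
		time.Sleep(time.Millisecond) // widen the race window for the demo
		kv := &(*m)[i]               // may be out of range if the slice was emptied
		if kv.Key == key {
			return kv.Value, true
		}
	}
	return "", false
}

func main() {
	attrs := []keyValue{{Key: "http.method", Value: "GET"}}

	var wg sync.WaitGroup
	wg.Add(1)
	go func() { // plays the role of the async metrics aggregation goroutine
		defer wg.Done()
		get(&attrs, "http.method")
	}()

	attrs = attrs[:0] // plays the role of a downstream consumer clearing attributes
	wg.Wait()
}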

TylerHelmuth commented Jul 23, 2022

Would making a deep copy of the trace data to use in aggregateMetrics solve this issue?


albertteoh commented Jul 23, 2022

Would making a deep copy of the trace data to use in aggregateMetrics solve this issue?

I think so 😄

The following seems to reproduce the race condition via unit tests, with some hacks:

  1. Simulate mutation of the span's attributes in downstream trace consumers by modifying spanmetricsprocessor.ConsumeTraces to sleep for a short period of time (1 ms) after the goroutine, then clearing every span's attributes. Why the short sleep? Because we want the spanmetrics processor to think that nothing has changed right up until it enters the loop in Map.Get (see next step):
func (p *processorImp) ConsumeTraces(ctx context.Context, traces ptrace.Traces) error {
  go func() {
  ...
  }()
  time.Sleep(time.Millisecond)
  for i := 0; i < traces.ResourceSpans().Len(); i++ {
  	attr := traces.ResourceSpans().At(i).Resource().Attributes()
  	attr.Clear()
  }
  return p.nextConsumer.ConsumeTraces(ctx, traces)
}
  2. Modify Map.Get to sleep slightly longer than step 1 (2 ms) within the for loop, allowing step 1 to finish emptying all the attributes so that by the time we invoke akv := &(*m.orig)[i], m.orig will be empty:
func (m Map) Get(key string) (Value, bool) {
  for i := range *m.orig {
  	time.Sleep(2*time.Millisecond)
  	akv := &(*m.orig)[i]
  	if akv.Key == key {
  		return Value{&akv.Value}, true
  	}
  }
  return Value{nil}, false
}
  3. This successfully triggers the panic:
=== RUN   TestProcessorConsumeTraces/Test_single_consumption,_three_spans_(Cumulative).
panic: runtime error: index out of range [0] with length 0

goroutine 35 [running]:
go.opentelemetry.io/collector/pdata/internal.Map.Get({0xb4c57012b70?}, {0x1bca27f, 0xc})
	/Users/albertteoh/go/src/github.com/albertteoh/opentelemetry-collector-contrib/processor/spanmetricsprocessor/vendor/go.opentelemetry.io/collector/pdata/internal/common.go:639 +0xfc
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).aggregateMetrics(0xc0001b87b0?, {0x1118102?})
	/Users/albertteoh/go/src/github.com/albertteoh/opentelemetry-collector-contrib/processor/spanmetricsprocessor/processor.go:365 +0x69
github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces.func1()
	/Users/albertteoh/go/src/github.com/albertteoh/opentelemetry-collector-contrib/processor/spanmetricsprocessor/processor.go:240 +0xa5
created by github.com/open-telemetry/opentelemetry-collector-contrib/processor/spanmetricsprocessor.(*processorImp).ConsumeTraces
	/Users/albertteoh/go/src/github.com/albertteoh/opentelemetry-collector-contrib/processor/spanmetricsprocessor/processor.go:232 +0xb5

Applying the following fix passes the unit test:

  tracesClone := traces.Clone()
  go func() {
  	...
  	p.aggregateMetrics(tracesClone)
  	...
  }()
  ...
  return p.nextConsumer.ConsumeTraces(ctx, traces)
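
Assembled from the fragments above, the fixed ConsumeTraces might look roughly like this (a sketch only: the lock and nextConsumer fields come from the earlier snippet and the stack trace, other work done inside the goroutine is omitted, and the actual change merged in #12661 may differ):

func (p *processorImp) ConsumeTraces(ctx context.Context, traces ptrace.Traces) error {
	// Clone before handing the data to the aggregation goroutine so that
	// downstream consumers mutating `traces` cannot race with the reads below.
	tracesClone := traces.Clone()
	go func() {
		p.lock.Lock()
		defer p.lock.Unlock()
		p.aggregateMetrics(tracesClone)
	}()
	// Forward the original trace data unmodified.
	return p.nextConsumer.ConsumeTraces(ctx, traces)
}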


mx-psi commented Jul 25, 2022

Do we know of any workaround that would prevent this?

@albertteoh (Contributor)

Do we know of any workaround that would prevent this?

A possible workaround (I have not tried this) is to not configure additional spanmetricsprocessor dimensions on top of the defaults, if it's acceptable to the use case. This is because the panic comes from trying to fetch a span's attributes to populate the additional custom dimensions. For example, in this issue, the spanmetrics config contains:

    spanmetrics:
        metrics_exporter: prometheus
        dimensions:
            - name: enduser.id
            - name: deployment.environment
            - name: http.status_code
            - name: http.client_ip
            - name: http.client_latitude
            - name: http.client_longitude
            - name: http.method

So the possible workaround (again, if acceptable to the use case) is:

    spanmetrics:
        metrics_exporter: prometheus

Otherwise, a code change is the only alternative I can think of to address this issue.


amoscatelli commented Jul 26, 2022

Thank you but, sadly, it's not a workaround for us.


spurge commented Aug 1, 2022

Do we know of any workaround that would prevent this?

A possible workaround (I have not tried this) is to not configure additional spanmetricsprocessor dimensions on top of the defaults, if it's acceptable to the use case. This is because the panic comes from trying to fetch a span's attributes to populate the additional custom dimensions.

We tried it and it did work out for us, as this is still useful without any additional dimensions 👍🏻 ⭐
