Fix panic caused by race condition when accessing span attributes #12661
Conversation
Signed-off-by: albertteoh <[email protected]>
@@ -293,6 +299,8 @@ func TestProcessorConsumeTraces(t *testing.T) {
	err := p.ConsumeTraces(ctx, traces)

	// Verify
	<-done // Wait till ConsumeMetrics verification is done.
This fixes a "test bug" I discovered while trying to reproduce the panic. As we now compute metrics in a separate goroutine, the execution of p.ConsumeTraces(ctx, traces)
completes immediately, not allowing the metrics aggregation goroutine to complete and terminating it prematurely, that could lead to false positives.
This change ensures that the verification will only be executed once metrics aggregation is complete (i.e. when the "next.ConsumeMetrics" is invoked).
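In other words, the test relies on roughly this synchronization pattern (a simplified sketch, not the actual test code; newTestProcessor and the mock wiring are hypothetical stand-ins, and ctx, traces, and t come from the surrounding test):

// The mocked metrics consumer closes `done` once it has received the
// aggregated metrics, so the test cannot return before the async
// aggregation goroutine has actually run.
done := make(chan struct{})

mockConsumeMetrics := func(_ context.Context, md pmetric.Metrics) error {
	// ... verify the aggregated metrics here ...
	close(done)
	return nil
}

p := newTestProcessor(t, mockConsumeMetrics) // hypothetical helper wiring the mock metrics exporter

err := p.ConsumeTraces(ctx, traces)
assert.NoError(t, err)

<-done // wait till the ConsumeMetrics-side verification is done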
Would the test in its current form cause the panic when the main code does not have the fix?
No, the test in its current form won't cause the panic.
Signed-off-by: albertteoh <[email protected]>
@@ -228,6 +228,7 @@ func (p *processorImp) ConsumeTraces(ctx context.Context, traces ptrace.Traces)
	// that should not interfere with the flow of trace data because
	// it is an orthogonal concern to the trace flow (it should not impact
	// upstream or downstream pipeline trace components).
	tracesClone := traces.Clone()
This sounds like the correct fix in this design, but it smells to me like a bad design that we have to do this at all. The "pipelines" are correctly constructed and passed the right clone of the data by the collector (using capabilities, etc.); somehow this breaks that assumption and adds considerably more CPU compared to simply executing the code on the critical path.
Thanks for the feedback @bogdandrutu. I'm definitely open to ideas to improve the design. 😄
The original intent with the design change was that we're effectively creating a new stream of metrics data coming out of traces and, in my mind, the propagation of trace data is an orthogonal concern to metrics. That is, the trace pipeline should not depend on, or be impacted by, other components of a downstream metrics pipeline, nor any errors and latency introduced by them. In short, I feel that spanmetricsprocessor should just be a passthrough operation for traces.
Are there existing components that translate from one telemetry type to another, whose design pattern we could follow for this processor?
The "pipelines" are correctly constructed and passed the right clone of the data by the collector (using capabilities, etc.), and somehow this breaks that assumption and considerably adds more CPU than simply executing the code on the critical path.
I'm not very familiar with some concepts around the collector. For my learning, could you please elaborate more around what is meant by the "right clone" and "capabilities" and how this design adds considerably more CPU?
This also raises an existential question of whether this component should exist at all if it breaks some fundamental assumptions of the collector.
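For context, "capabilities" here presumably refers to the consumer.Capabilities a processor reports (from go.opentelemetry.io/collector/consumer); the collector uses it when wiring pipelines to decide whether a component can be handed shared data or needs its own copy. A minimal sketch under that assumption:

// If a processor declares that it mutates data, the collector's fan-out gives
// it a clone rather than the instance shared with other consumers.
func (p *processorImp) Capabilities() consumer.Capabilities {
	return consumer.Capabilities{MutatesData: true}
}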
I think this is the only processor housed in contrib that generates telemetry of a different signal type than the signal of the pipeline the processor runs in, but it is not the only processor that "short-circuits" data immediately to an exporter (the routingprocessor also does that). The need to branch between signal types has been brought up before.
Since this processor has already been accepted and is widely used, I don't think now is the time to discuss whether or not it should've been rejected due to the way it interacts with multiple signals. Feels like a good discussion for the SIG meeting if it even needs to happen at all.
For this issue, I think cloning the traces payload is a reasonable solution to allow downstream trace processing to continue while this processor generates and forwards new metrics from its trace payload.
If cloning the traces payload has a substantial performance impact, then it could be noted in this processor's README as a Warning, similar to the warning sections of the cumulativetodelta processor or transform processor. It would probably be appropriate to add a "Performance Impact" entry to the Collector's standard warnings.
I think I am proposing actually executing the logic on the critical path, instead of cloning and executing the metrics logic asynchronously. I am not suggesting removing the processor.
Okay, I can do this.
As for error handling, I'm proposing to return errors caused by any logic in this processor, but only log errors from the metrics pipeline (i.e. ConsumeMetrics). That is:
func (p *processorImp) ConsumeTraces(ctx context.Context, traces ptrace.Traces) error {
	if err := p.tracesToMetrics(ctx, traces); err != nil {
		return err
	}
	// Forward trace data unmodified.
	return p.nextConsumer.ConsumeTraces(ctx, traces)
}

func (p *processorImp) tracesToMetrics(ctx context.Context, traces ptrace.Traces) error {
	p.lock.Lock()
	defer p.lock.Unlock()

	p.aggregateMetrics(traces)
	m, err := p.buildMetrics()
	if err != nil {
		return err
	}

	if err = p.metricsExporter.ConsumeMetrics(ctx, *m); err != nil {
		// Avoid penalising trace processing caused by metrics processing.
		p.logger.Error(err.Error())
	}

	p.resetExemplarData()
	return nil
}
Thoughts?
I would do multierr.Combine(p.tracesToMetrics(ctx, traces), p.nextConsumer.ConsumeTraces(ctx, traces)), since this processor fans out to two components; I would call both independently.
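Concretely, that suggestion would look something like this sketch (reusing the tracesToMetrics helper from the snippet above; requires go.uber.org/multierr):

func (p *processorImp) ConsumeTraces(ctx context.Context, traces ptrace.Traces) error {
	// Fan out to two independent components: generate metrics and forward
	// traces. Both calls are made regardless of whether the other fails,
	// and any errors are combined into one.
	return multierr.Combine(
		p.tracesToMetrics(ctx, traces),
		p.nextConsumer.ConsumeTraces(ctx, traces),
	)
}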
Good suggestion. Updated in: 9858d30
Can this be merged please? The issue is blocking my environment. Thank you in advance.
Signed-off-by: albertteoh <[email protected]>
Signed-off-by: albertteoh <[email protected]>
Signed-off-by: albertteoh <[email protected]>
Hi, this will get merged once it gets all the approvals. From what I can see, it's currently pending on @bogdandrutu's review.
Left one nit, otherwise LGTM.
@@ -228,28 +229,33 @@ func (p *processorImp) ConsumeTraces(ctx context.Context, traces ptrace.Traces)
	// that should not interfere with the flow of trace data because
	// it is an orthogonal concern to the trace flow (it should not impact
	// upstream or downstream pipeline trace components).
Parts of this comment are no longer accurate. Can you update it to reflect the new process?
Thanks, updated comment in: af3dc8f.
It's mostly deleting since I don't think that line of code is too controversial.
Signed-off-by: albertteoh <[email protected]>
Signed-off-by: albertteoh [email protected]
Description:
There is a race condition in Map.Get, exposed by spanmetricsprocessor, that can trigger an index out of range [0] with length 0 error: one element is detected in the slice and the loop is entered, but while attempting to access the first element, the slice is found to be empty. This is because the metrics are computed in a separate goroutine from the trace "stream".
The fix is to Clone() the traces and use this clone within the traces->metrics aggregation goroutine (a rough sketch follows).
Link to tracking Issue: Fixes #12644
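A sketch of the shape of that fix, borrowing the helper names used in the review thread above (the discussion later moved the metrics work onto the critical path, so this only illustrates the clone idea, not the final code):

func (p *processorImp) ConsumeTraces(ctx context.Context, traces ptrace.Traces) error {
	// Deep-copy the payload so the aggregation goroutine never shares
	// pdata structures with downstream trace components.
	tracesClone := traces.Clone()

	go func() {
		p.lock.Lock()
		defer p.lock.Unlock()
		// Aggregate and export metrics from the clone only.
		p.aggregateMetrics(tracesClone)
		// ... build metrics and hand them to the metrics exporter ...
	}()

	// Forward the original trace data unmodified.
	return p.nextConsumer.ConsumeTraces(ctx, traces)
}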
Testing:
The race condition was reproduced via unit tests, with some hacks:
1. Hack spanmetricsprocessor.ConsumeTraces to sleep for a short period of time (1 ms) after starting the goroutine, then clear every span's attributes. Why the short sleep? Because we want the spanmetrics processor to think that nothing has changed right up until it enters the loop in Map.Get (see next step).
2. Hack Map.Get to sleep slightly longer than step 1 (2 ms) within the for loop, to allow step 1 to finish emptying all the attributes, so that by the time we invoke akv := &(*m.orig)[i], m.orig will be empty. This produces the panic.
Applying the fix described above (cloning the traces) passes the unit test. A standalone illustration of the same race is sketched below.
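Since the hack patches library code, it cannot be checked in as-is; purely as an illustration, a minimal standalone program exhibiting the same kind of unsynchronized access (which go run -race flags) could look like this:

// A deliberately racy toy program: one goroutine reads a pcommon.Map while
// another empties it, mirroring the aggregation goroutine racing against
// downstream mutation of span attributes.
package main

import (
	"sync"
	"time"

	"go.opentelemetry.io/collector/pdata/pcommon"
)

func main() {
	attrs := pcommon.NewMap()
	attrs.PutStr("http.method", "GET")

	var wg sync.WaitGroup
	wg.Add(2)

	// Reader: stands in for the metrics goroutine calling Map.Get.
	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			attrs.Get("http.method")
		}
	}()

	// Writer: stands in for a downstream component clearing attributes.
	go func() {
		defer wg.Done()
		time.Sleep(time.Millisecond) // rough analogue of the 1 ms sleep above
		attrs.RemoveIf(func(string, pcommon.Value) bool { return true })
	}()

	wg.Wait()
}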
Unfortunately, there does not appear to be a feasible means to add a unit test to guard against this race condition because the problem is triggered within an external library.
At the very least, it should be possible to assert that the metrics are aggregated on a copy of traces and that the nextConsumer's ConsumeTraces invocation uses the original. As such, I propose to do this in another PR, because it would involve a fairly large code refactor and I would not want this to: 1. block the fix from being released, or 2. create too much noise in this PR.
Open to suggestions!