
Add Decouple and Batch Processors to Collector #959

Merged
merged 9 commits into open-telemetry:main on Nov 2, 2023

Conversation

adcharre
Contributor

Problem

The introduction of a forced flush before a Lambda function finishes causes an increase in response times, because the function cannot return until the collector pipelines have completed.

Solution

By adding a new processor that decouples the receiver and exporter sides of the pipeline and is aware of Lambda lifecycle events, the response time of the function is no longer affected by the time it takes to export OpenTelemetry data.

In addition, adding the Batch processor to the list of available processors can reduce the cost of Lambda function invocations, since OpenTelemetry data no longer needs to be sent on every run. This comes at the expense of data being delayed.

Add a new processor that decouples the receiver and exporter sides of the pipeline and is aware of Lambda lifecycle events.

Also add the Batch processor to the list of available processors to reduce the cost of Lambda function invocations, at the expense of data being delayed.
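For illustration, a collector configuration along these lines might look as follows. This is a sketch only: the component names (`batch`, `decouple`, `otlp`, `otlphttp`) and the endpoint are assumptions for the example, not taken from the final docs of this PR.

```yaml
# Illustrative sketch - component names and defaults are assumptions, not this PR's final docs.
receivers:
  otlp:
    protocols:
      grpc:

processors:
  batch:     # batches telemetry so it is not exported on every invocation
  decouple:  # lets the function return before the export completes

exporters:
  otlphttp:
    endpoint: "https://collector.example.com:4318"  # placeholder endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, decouple]  # batch placed before decouple; see the review discussion below
      exporters: [otlphttp]
```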
adcharre requested a review from a team - October 20, 2023 11:05
@linux-foundation-easycla

linux-foundation-easycla bot commented Oct 20, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@tylerbenson
Member

@adcharre please go through the CLA process.

@adcharre
Contributor Author

> @adcharre please go through the CLA process.

Trying to organise that at the moment - unfortunately my company hasn't yet signed the CLA so it's taking a bit of probing internally to get this done.

Contributor

@codeboten left a comment

Thanks for the PR @adcharre! I'm curious if the batch processor alone would work if the collector is able to respond to shutdown events from the environment.

I remember a few years ago, the batch processor was not useful, since batches would get stuck and never be sent when Lambdas were frozen.

This processor decouples the receiver and exporter ends of the pipeline, allowing the Lambda function to finish before traces/metrics/logs are exported by the collector. The processor is aware of the Lambda lifecycle and will prevent the environment from being frozen or shut down until any pending traces/metrics/logs have been exported.
In this way the response time of the Lambda function is not impacted by the need to export data; however, the billed duration will include the time taken to export data as well as the runtime of the Lambda function.

When combined with the batch processor, the number of exports required can be significantly reduced, and with it the cost of running the Lambda. The trade-off is that the data will not be available at your chosen endpoint until some time after the invocation, up to a maximum of 5 minutes (the timeout after which the environment is shut down when no further invocations are received).
Contributor

I wonder if the decouple processor would be better written as a lifecycle-aware batch processor. I worry that without this, data would likely end up sitting in the existing batch processor whether it was placed before or after the decouple processor.

Another question that comes to mind is whether the batch processor alone would work here since it would in theory be able to send all the data left in any existing batches through the normal shutdown process.

Contributor Author

> I wonder if the decouple processor would be better written as a lifecycle-aware batch processor. I worry that without this, data would likely end up sitting in the existing batch processor whether it was placed before or after the decouple processor.

Actually I was going to update the README to indicate that the batch processor must come before the decouple processor, otherwise you run into problems (as mentioned below). It would be possible to rewrite this as you say, however the simplest approach was to do one thing, and one thing well, and re-use the existing batch processor in front of the decouple processor.
Also, it's reasonable to just use the decouple processor on its own to reduce the response time of the Lambda while not delaying the delivery of OTel data.

> Another question that comes to mind is whether the batch processor alone would work here since it would in theory be able to send all the data left in any existing batches through the normal shutdown process.

Not on its own, no. I did also consider it. The problem is that the collector is only shut down when the environment is about to be destroyed; however, it is frozen after each function invocation.
Without the ability to prevent the environment from being frozen until the exports have successfully completed, you will end up with data occasionally being lost due to network interruptions.

Member

@adcharre I'm not very familiar with extensions (or the collector)... could you highlight for me how the code you added handles the frozen scenario? (How does it prevent the environment from being frozen?) Perhaps some additional comments in your code would be helpful here.

Contributor Author

```go
err = lm.listener.Wait(ctx, res.RequestID)
if err != nil {
	lm.logger.Error("problem waiting for platform.runtimeDone event", zap.Error(err), zap.String("requestID", res.RequestID))
}
```

// Check other components are ready before allowing the freezing of the environment.
Contributor Author

@tylerbenson - here's a comment to explain that we notify other components that the function has finished and that they therefore need to be ready to freeze.

The new code I've added hooks into the existing lifecycle code.

Member

I don't have a good understanding of the extension lifecycle. If the lambda function returns and starts waiting for the next event, are you sure it waits for events to finish processing before freezing? Is this documented somewhere? How did you verify this behavior?

Member

> Lifecycle docs say it waits until the extension also calls Next.

But does the lambda function get the next event only after all extensions have called Next? If that were the case then wouldn't it still be waiting until the extension finishes, unless the extension calls Next before doing the export, but then it might be frozen before actually exporting.

Contributor Author

@tylerbenson according to the lifecycle documentation (https://docs.aws.amazon.com/lambda/latest/dg/runtimes-extensions-api.html#runtimes-extensions-api-lifecycle)

> Each phase starts with an event from Lambda to the runtime and to all registered extensions. The runtime and each extension signal completion by sending a Next API request. Lambda freezes the execution environment when each process has completed and there are no pending events.

@tsloughter the lambda runtime will only get the next invocation once all extensions have also signalled they are ready.
The sequence of events is:

  1. Lambda and Extensions receive Next event
  2. Lambda function completes and signals it's ready for the next event.
  3. Collector receives notification from Telemetry API that the function has finished.
  4. Decouple processor finishes forwarding data.
  5. Collector signals it's ready for the next event.
  6. Back to 1.

Also see the "Invoke phase" in the same link:

> After receiving the function response from the runtime, Lambda returns the response to the client, even if extensions are still running.

and

> The Invoke phase ends after the runtime and all extensions signal that they are done by sending a Next API request.

As to the comment about verifying this behaviour - I checked the CloudWatch logs to confirm what's documented.
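To make that ordering concrete, here is a rough Go sketch of the invoke-phase loop described above. The helper functions and types are placeholders invented for illustration; they are not the Extensions API client or the code in this PR.

```go
package main

import "fmt"

// Placeholder types and helpers - illustration only, not the real Extensions API client.
type invokeEvent struct{ RequestID string }

func nextEvent() invokeEvent              { return invokeEvent{RequestID: "req-1"} } // blocks until Lambda delivers the next invoke event
func waitForRuntimeDone(requestID string) {}                                         // blocks until the Telemetry API reports platform.runtimeDone
func drainPendingTelemetry()              {}                                         // decouple processor forwards any queued traces/metrics/logs

func main() {
	for i := 0; i < 1; i++ { // a real extension loops forever; bounded here so the sketch terminates
		ev := nextEvent()                // 1. extension receives the invoke event
		waitForRuntimeDone(ev.RequestID) // 2-3. the function has returned and runtimeDone was observed
		drainPendingTelemetry()          // 4. flush pending data through the exporters
		fmt.Println("ready")             // 5. requesting the next event signals Lambda that it may freeze the sandbox
	}
}
```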

Member

Thanks for the detailed response.

Member

Ok, cool, this is great.


```go
type Listener interface {
	FunctionInvoked()
	FunctionFinished()
```
Contributor Author

Probably this is where some additional comments are required; I'll look at adding them tomorrow.
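As a rough illustration of what such comments might describe, the sketch below shows one way a queue-draining component could implement the two callbacks quoted above. It assumes only the two methods shown; the actual implementation in this PR may differ.

```go
package lifecycle

import "sync"

// Listener mirrors the interface quoted above (assumed to contain just these two methods).
type Listener interface {
	FunctionInvoked()
	FunctionFinished()
}

// drainGate is an illustrative implementation: it tracks whether an invocation
// is in flight and, when the function finishes, flushes a queue before the
// extension reports that the environment may be frozen.
type drainGate struct {
	mu       sync.Mutex
	inFlight bool
	flush    func() // e.g. drain the decouple processor's internal queue
}

func (g *drainGate) FunctionInvoked() {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.inFlight = true
}

func (g *drainGate) FunctionFinished() {
	g.mu.Lock()
	g.inFlight = false
	g.mu.Unlock()
	if g.flush != nil {
		g.flush() // export pending data before allowing the freeze
	}
}
```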

@adcharre
Contributor Author

@codeboten & @tylerbenson - finally through the CLA, and I've updated the readme and code with some more comments around the lifecycle.

Contributor

@codeboten left a comment

Thanks for the submission @adcharre! Would you consider submitting the decouple processor to the collector contrib repo? A minor change I would suggest is calling it the queueingprocessor.

Specifically, I think it would be beneficial to have a processor that can decouple the export pipeline from the batch processor into queues. Many collector exporters do this today via the queueing supported in the exporter helper; a processor would take that responsibility out of the exporter altogether.
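To make the queueing idea concrete: the heart of such a processor is a bounded queue between the receive path and the export path, drained on a background goroutine. The sketch below shows the generic pattern with invented names; it is not the collector's processor API or the code in this PR.

```go
package main

import (
	"context"
	"fmt"
)

// decoupler accepts items on the caller's goroutine and exports them on a
// background goroutine, so the caller is never blocked on a slow export.
type decoupler[T any] struct {
	queue  chan T
	export func(context.Context, T) error
	done   chan struct{}
}

func newDecoupler[T any](size int, export func(context.Context, T) error) *decoupler[T] {
	d := &decoupler[T]{queue: make(chan T, size), export: export, done: make(chan struct{})}
	go d.run()
	return d
}

func (d *decoupler[T]) run() {
	for item := range d.queue {
		if err := d.export(context.Background(), item); err != nil {
			fmt.Println("export failed:", err) // real code would retry and report via the collector's logger
		}
	}
	close(d.done)
}

// Consume enqueues an item without waiting for it to be exported.
func (d *decoupler[T]) Consume(item T) { d.queue <- item }

// Shutdown stops accepting data and blocks until the queue has drained -
// the property a Lambda lifecycle hook would rely on before freezing.
func (d *decoupler[T]) Shutdown() {
	close(d.queue)
	<-d.done
}

func main() {
	d := newDecoupler(16, func(_ context.Context, batch string) error {
		fmt.Println("exporting", batch)
		return nil
	})
	d.Consume("batch-1")
	d.Consume("batch-2")
	d.Shutdown() // returns only after both batches have been exported
}
```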

@codeboten
Contributor

Please take a look at the failing test:

--- FAIL: TestLifecycle (0.40s)
    --- FAIL: TestLifecycle/full_lifecycle_with_data_from_function (0.20s)
        testing.go:1465: race detected during execution of test
    --- FAIL: TestLifecycle/full_lifecycle_with_data_before_shutdown (0.20s)
        testing.go:1465: race detected during execution of test
    testing.go:1465: race detected during execution of test

@adcharre
Contributor Author

adcharre commented Nov 1, 2023

> Please take a look at the failing test:
>
> --- FAIL: TestLifecycle (0.40s)
>     --- FAIL: TestLifecycle/full_lifecycle_with_data_from_function (0.20s)
>         testing.go:1465: race detected during execution of test
>     --- FAIL: TestLifecycle/full_lifecycle_with_data_before_shutdown (0.20s)
>         testing.go:1465: race detected during execution of test
>     testing.go:1465: race detected during execution of test

I've reworked the two tests and, testing locally, there are no more errors from the race detector.

@adcharre
Contributor Author

adcharre commented Nov 1, 2023

> Thanks for the submission @adcharre! Would you consider submitting the decouple processor to the collector contrib repo? A minor change I would suggest is calling it the queueingprocessor.
>
> Specifically, I think it would be beneficial to have a processor that can decouple the export pipeline from the batch processor into queues. Many collector exporters do this today via the queueing supported in the exporter helper; a processor would take that responsibility out of the exporter altogether.

I'll have a go if you think it'll be useful. I had assumed there would be no need, as the batch processor can do the job of breaking the receive -> export pipeline. It should just be a matter of removing some code and renaming a few things...

@tylerbenson
Member

@adcharre If you'd like to apply @codeboten's suggested naming change of queueingprocessor to this PR, we can move forward with merging. This way in the future when that processor is in the collector repo we can migrate without breaking users.

Thanks for your effort here. This is great! I really appreciate the explanations you've provided.

@adcharre
Contributor Author

adcharre commented Nov 2, 2023

> @adcharre If you'd like to apply @codeboten's suggested naming change of queueingprocessor to this PR, we can move forward with merging. This way in the future when that processor is in the collector repo we can migrate without breaking users.
>
> Thanks for your effort here. This is great! I really appreciate the explanations you've provided.

@tylerbenson I don't think that makes sense as there will need to be 2 separate plugins!
This plugin is lambda lifecycle aware and makes use of packages that are not available in the main collector. The main collector plugin would just be the simple queuing mechanism to break the link between receiver and exporter.

Stepping back a bit, there could be two separate implementations of the queueingprocessor: one here with the Lambda lifecycle awareness, and a separate implementation in opentelemetry-collector-contrib which doesn't include the Lambda-specific parts?

@codeboten
Contributor

Right, the proposal to submit the queueing portion of the processor (queueingprocessor) as its own processor was specifically about giving the processing pipeline the ability to decouple the exporting from the batch processor (as I mentioned earlier, this is largely accomplished via queueing in the exporter helper today).

If in the future there was a facility for registering the Lambda lifecycle mechanism in the queueingprocessor (which could be doable if it were done generically enough; not sure if there's a use-case beyond Lambda), then the decoupleprocessor could be deprecated in favour of this queueingprocessor.

In the short term though, I think it's fine for them to be separate processors.

@tylerbenson
Member

Ok, sorry for the confusion. Please resolve the conflicts from main then I can merge.

@codeboten
Contributor

@adcharre please ignore my comment suggesting opening an issue/pr to the main collector repository. I was thankfully reminded that this used to exist in the core repository as the queuedprocessor and was deprecated in November 2021 😄 open-telemetry/opentelemetry-collector@e820370

@adcharre
Contributor Author

adcharre commented Nov 2, 2023

@tylerbenson conflicts should now be resolved and I've updated the decoupleprocessor's dependencies to be in line with the other processors.

@tylerbenson tylerbenson merged commit 2027890 into open-telemetry:main Nov 2, 2023
11 checks passed
@tylerbenson
Member

@adcharre one thing I forgot to add here... Please update the readme with updated guidelines for using this new feature.

@adcharre
Contributor Author

adcharre commented Dec 7, 2023

> @adcharre one thing I forgot to add here... Please update the readme with updated guidelines for using this new feature.

@tylerbenson - Will do! I'll try and get round to it next week.
