fix: OTEL context lost in subthreads of parallel bulk calls #2616
Conversation
💚 CLA has been signed |
A documentation preview will be available soon. Request a new doc build by commenting
If your PR continues to fail for an unknown reason, the doc build pipeline may be broken. Elastic employees can check the pipeline status here. |
buildkite test this please |
Thank you! Excellent find. I can take a proper look next week, but in the meantime: |
@pquentin I've fixed the lint issues; the failures seemed wrong because I was running the linter on Python 3.8, so now I believe the lint CI should pass. However, I'm not able to run the integration tests properly, as they're throwing a bunch of SSL errors such as […]. Another error that happens in quite a lot of the tests is […]. The ES cluster is empty, configured with the basic recommended dev configs, and running in a Docker container, as is the library environment (I've put them under the same network and changed all the […]). 259 tests are succeeding, but another 177 are failing with the errors above |
buildkite test this please |
buildkite test this please |
Sorry for the delay, I was on vacation last week. OK, so now that the CI is fixed, I took the time to look at the code in more detail. I'm not familiar with OpenTelemetry contexts and propagation: I just read the documentation today! So please tell me if what I'm saying does not make sense.
My main concern is that the propagation will only behave correctly if there's a context to recover from, but that is only true if one span was created already. If the first function you're calling is parallel_bulk, there won't be a span or context. Thankfully, this can be fixed by having the parallel_bulk call itself, and all callers of _process_bulk_chunk, create a span, which is something we want anyway. (Bonus points for covering elasticsearch/_async/helpers.py too.)
In other words, all bulk helpers should start with something like this:
with client._otel.span("parallel_bulk", inject_context=True):
...
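For illustration, one rough shape such a helper could take (a sketch only: the OtelWrapperSketch class, the saved_context attribute, and the inject_context flag are assumptions for this example, not existing client API):

import contextlib

from opentelemetry import context, trace


class OtelWrapperSketch:
    # Illustrative stand-in for the client's OTel wrapper, not the real class.

    def __init__(self):
        self.tracer = trace.get_tracer("elasticsearch-helpers-sketch")
        self.saved_context = None  # hypothetical storage used for propagation

    @contextlib.contextmanager
    def span(self, name, inject_context=False):
        # Start a span for the helper call; optionally remember the resulting
        # context so threads spawned later can re-attach it.
        with self.tracer.start_as_current_span(name) as otel_span:
            if inject_context:
                self.saved_context = context.get_current()
            yield otel_span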
As explained inline, I believe my comments will make more sense if you take a look at my top-level review: #2616 (review) |
Hey @claudinoac, did you get the chance to look at my latest comments? |
For now, I'll be submitting the changes we agreed to make so far, and in the meantime we can discuss the optional saving of context on the carrier. |
Ok @pquentin so I've made the latest changes based on your suggestions, though I wasn't sure what to put into |
Thanks. I will be able to take a proper look at this next week |
Sorry for the delay. This looks pretty good, but now we need to add a span to all helper functions that end up calling _process_bulk_chunk. I'll work on getting this finished if that's OK with you. |
@pquentin I'm not sure if that's really necessary as OTEL itself keeps them in the span stack, below their callers. |
The helper spans are not regular DB spans.
buildkite test this please |
buildkite test this please |
The current version is in line with my vision. I now need to add more robust tests and possibly do the same thing for async helpers. |
It's global and not practical to use with multiple tests.
buildkite test this please |
buildkite test this please |
buildkite test this please |
Instead of maintaining a global context, we simply pass the otel span down and use it manually.
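A rough sketch of what "passing the span down" can look like with the public OpenTelemetry API (the function and span names below are illustrative, not the library's actual helper code): the bulk helper creates one span and hands the span object itself to each worker thread, which makes it current locally instead of relying on any global carrier.

from concurrent.futures import ThreadPoolExecutor

from opentelemetry import trace

tracer = trace.get_tracer("parallel-bulk-sketch")


def process_chunk(chunk, otel_span):
    # Span objects can be shared across threads; use_span makes this one the
    # current span for the duration of the chunk, with no global state involved.
    with trace.use_span(otel_span, end_on_exit=False):
        pass  # send the chunk to Elasticsearch here


with tracer.start_as_current_span("helpers.parallel_bulk") as span:
    chunks = [["doc-1"], ["doc-2"]]
    with ThreadPoolExecutor(max_workers=2) as pool:
        for chunk in chunks:
            pool.submit(process_chunk, chunk, span)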
buildkite test this please |
@@ -322,6 +324,7 @@ def _process_bulk_chunk(
            Tuple[_TYPE_BULK_ACTION_HEADER, _TYPE_BULK_ACTION_BODY],
        ]
    ],
    otel_span: OpenTelemetrySpan,
On first look it seemed to me the otel_span attribute would be optional, to handle the case of apps not using OTel. But I guess when OTel is disabled this is just a fake span object that doesn't do anything?
Yes, exactly, see https://github.com/elastic/elasticsearch-py/pull/2616/files#diff-78ec49b32fc64dc78b569e87b525b22c8d412e606e476c4861d8b108a9289e04R93. Having a fake object avoids the need for ifs, Optional/None handling, etc.
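A minimal sketch of such a fake span, just to illustrate the idea (illustrative only, not the client's actual class; the attribute name is also made up for the example):

class NoopSpan:
    # Stand-in used when OTel is disabled: it accepts the same calls as a real
    # span and silently does nothing, so callers never need None checks.
    def set_attribute(self, key, value):
        pass

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False


def process_chunk(chunk, otel_span=NoopSpan()):
    # Works identically whether otel_span is a real span or the no-op stand-in.
    otel_span.set_attribute("chunk.size", len(chunk))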
LGTM
@claudinoac Sorry for the delay! Yesterday, Miguel made me realize that we don't need context propagation at all here, since spans are serializable across threads. But using context propagation was also harmful: if two threads were calling parallel_bulk at the same time, they could overwrite each other's saved context. |
Co-authored-by: Quentin Pradet <[email protected]> (cherry picked from commit d4df09f)
@pquentin Yeah, I'm aware of them being serializable; in fact, that's the way I handle them across threads, processes, and even across different applications. Though I'm not sure if your approach here is going to fix the bug. I'm going to run a few live tests just to make sure it addresses the right problem. And for the record, if context propagation is indeed a problem in race-condition scenarios, then the OTEL class shouldn't be a singleton at all. |
I can wait for your tests before releasing 8.15.1 if you prefer. The OTel class isn't a singleton, but there's only one per client. It's simply not designed to carry state about spans. |
Co-authored-by: Quentin Pradet <[email protected]> (cherry picked from commit d4df09f) Co-authored-by: Alisson Claudino <[email protected]>
@pquentin I've run the live tests and everything looks good. Thanks for getting this merged, you can proceed with the release :) |
When using OpenTelemetry to trace ES requests, all the parallel bulk chunks fall out of the scope of the main span. This happens because the chunks lose the parent context when they are dispatched to new threads (which is how parallel bulk runs its chunks), resulting in what we see below:
To keep the same context flowing through all subthreads, I'm using the context propagation concept (https://opentelemetry.io/docs/languages/python/propagation/) to re-establish the parent context in every subthread spawned by parallel bulk, so these subthreads stay within the flow of the parent context along with all the other calls (search/delete/index/etc.).
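For reference, the general shape of that propagation pattern looks roughly like this (a minimal sketch using the public OpenTelemetry API, not the exact code in this PR): the parent thread injects its current context into a carrier dict, and each bulk subthread extracts and attaches it before creating spans.

from concurrent.futures import ThreadPoolExecutor

from opentelemetry import context, propagate, trace

tracer = trace.get_tracer("parallel-bulk-propagation-sketch")


def send_chunk(carrier):
    # Re-establish the parent context in this thread before creating spans.
    token = context.attach(propagate.extract(carrier))
    try:
        with tracer.start_as_current_span("bulk_chunk"):
            pass  # send one bulk chunk to Elasticsearch here
    finally:
        context.detach(token)


with tracer.start_as_current_span("parallel_bulk"):
    carrier = {}
    propagate.inject(carrier)  # capture the current context in the parent thread
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(4):
            pool.submit(send_chunk, carrier)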
Here's an example of the same batch job after applying this fix:
I tried to keep the changes scoped to the OTEL class itself and to limit the test to the level of the call that recovers the context, though I'm aware that might not be the ideal approach.
I look forward to reviews and suggestions :)