fix: OTEL context lost in subthreads of parallel bulk calls #2616
Conversation
💚 CLA has been signed |
A documentation preview will be available soon. Request a new doc build by commenting
If your PR continues to fail for an unknown reason, the doc build pipeline may be broken. Elastic employees can check the pipeline status here. |
buildkite test this please |
Thank you! Excellent find. I can take a proper look next week, but in the meantime: |
@pquentin I've fixed the lint issues; the failures seemed wrong because I was running the linter on Python 3.8, so now I believe the lint CI should pass. However, I'm not able to run the integration tests properly, as they're throwing a bunch of SSL errors such as […]. Another error that happens in quite a lot of the tests is […]. The ES cluster is empty, configured with the basic recommended dev configs, and running in a Docker container, as is the library environment (I've put them under the same network and changed all the […]). 259 tests are succeeding, but another 177 are failing with the errors above |
buildkite test this please |
buildkite test this please |
Sorry for the delay, I was on vacation last week. OK, so now that the CI is fixed, I took the time to look at the code in more detail. I'm not familiar with OpenTelemetry contexts and propagation: I just read the documentation today! So please tell me if what I'm saying does not make sense.
My main concern is that the propagation will only behave correctly if there's a context to recover from, but that is only true if one span was created already. If the first function you're calling is parallel_bulk, there won't be a span or context. Thankfully, this can be fixed by having the parallel_bulk call itself, and all callers of _process_bulk_chunk, create a span, which is something we want anyway. (Bonus points for covering elasticsearch/_async/helpers.py too.)
In other words, all bulk helpers should start with something like this:
with client._otel.span("parallel_bulk", inject_context=True):
...
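For illustration, one rough shape such a helper could take (a sketch only: the OtelWrapperSketch class, the saved_context attribute, and the inject_context flag are assumptions for this example, not existing client API):

import contextlib

from opentelemetry import context, trace


class OtelWrapperSketch:
    # Illustrative stand-in for the client's OTel wrapper, not the real class.

    def __init__(self):
        self.tracer = trace.get_tracer("elasticsearch-helpers-sketch")
        self.saved_context = None  # hypothetical storage used for propagation

    @contextlib.contextmanager
    def span(self, name, inject_context=False):
        # Start a span for the helper call; optionally remember the resulting
        # context so threads spawned later can re-attach it.
        with self.tracer.start_as_current_span(name) as otel_span:
            if inject_context:
                self.saved_context = context.get_current()
            yield otel_span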
As explained inline, I believe my comments will make more sense if you take a look at my top-level review: #2616 (review) |
Hey @claudinoac, did you get the chance to look at my latest comments? |
For now, I'll be submitting the changes we agreed to make so far, and in the meantime we can discuss the optional saving of context on the carrier. |
Ok @pquentin so I've made the latest changes based on your suggestions, though I wasn't sure what to put into |
Thanks. I will be able to take a proper look at this next week |
Sorry for the delay. This looks pretty good, but now we need to add a span to all helper functions that end up calling _process_bulk_chunk. I'll work on getting this finished if that's OK with you. |
@pquentin I'm not sure if that's really necessary as OTEL itself keeps them in the span stack, below their callers. |
The helper spans are not regular DB spans.
buildkite test this please |
buildkite test this please |
The current version is in line with my vision. I now need to add more robust tests and possibly do the same thing for async helpers. |
It's global and not practical to use with multiple tests.
buildkite test this please |
buildkite test this please |
buildkite test this please |
Instead of maintaining a global context, we simply pass the otel span down and use it manually.
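A rough sketch of what "passing the span down" can look like with the public OpenTelemetry API (the function and span names below are illustrative, not the library's actual helper code): the bulk helper creates one span and hands the span object itself to each worker thread, which makes it current locally instead of relying on any global carrier.

from concurrent.futures import ThreadPoolExecutor

from opentelemetry import trace

tracer = trace.get_tracer("parallel-bulk-sketch")


def process_chunk(chunk, otel_span):
    # Span objects can be shared across threads; use_span makes this one the
    # current span for the duration of the chunk, with no global state involved.
    with trace.use_span(otel_span, end_on_exit=False):
        pass  # send the chunk to Elasticsearch here


with tracer.start_as_current_span("helpers.parallel_bulk") as span:
    chunks = [["doc-1"], ["doc-2"]]
    with ThreadPoolExecutor(max_workers=2) as pool:
        for chunk in chunks:
            pool.submit(process_chunk, chunk, span)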
buildkite test this please |
@@ -322,6 +324,7 @@ def _process_bulk_chunk(
            Tuple[_TYPE_BULK_ACTION_HEADER, _TYPE_BULK_ACTION_BODY],
        ]
    ],
    otel_span: OpenTelemetrySpan,
On first look it seemed to me the otel_span attribute would be optional, to handle the case of apps not using OTel. But I guess when OTel is disabled this is just a fake span object that doesn't do anything?
Yes, exactly, see https://github.com/elastic/elasticsearch-py/pull/2616/files#diff-78ec49b32fc64dc78b569e87b525b22c8d412e606e476c4861d8b108a9289e04R93. Having a fake object avoids the need for ifs, Optional/None handling, etc.
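A minimal sketch of such a fake span, just to illustrate the idea (illustrative only, not the client's actual class; the attribute name is also made up for the example):

class NoopSpan:
    # Stand-in used when OTel is disabled: it accepts the same calls as a real
    # span and silently does nothing, so callers never need None checks.
    def set_attribute(self, key, value):
        pass

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        return False


def process_chunk(chunk, otel_span=NoopSpan()):
    # Works identically whether otel_span is a real span or the no-op stand-in.
    otel_span.set_attribute("chunk.size", len(chunk))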
LGTM
@claudinoac Sorry for the delay! Yesterday, Miguel made me realize that we don't need context propagation at all here, since spans are serializable across threads. But using context propagation was also harmful: if two threads were calling parallel_bulk at the same time, they could overwrite each other's saved context. |
Co-authored-by: Quentin Pradet <[email protected]> (cherry picked from commit d4df09f)
@pquentin Yeah, I'm aware of them being serializable; in fact, that's the way I handle them across threads, processes, and even across different applications. Though I'm not sure if your approach here is going to fix the bug. I'm going to run a few live tests just to make sure it addresses the right problem. And for the record, if context propagation is indeed a problem in race-condition scenarios, then the OTEL class shouldn't be a singleton at all. |
I can wait for your tests before releasing 8.15.1 if you prefer. The OTel class isn't a singleton, but there's only one per client. It's simply not designed to carry state about spans. |
Co-authored-by: Quentin Pradet <[email protected]> (cherry picked from commit d4df09f) Co-authored-by: Alisson Claudino <[email protected]>
@pquentin I've run the live tests and everything looks good. Thanks for getting this merged, you can proceed with the release :) |
When using OpenTelemetry to trace ES requests, all the parallel bulk chunks fall out of the scope of the main span. This happens because the chunks lose the parent context when they are dispatched to new threads (which is how parallel bulk runs its chunks), resulting in what we see below:
To keep the same context flowing through all subthreads, I'm using the context propagation concept (https://opentelemetry.io/docs/languages/python/propagation/) to re-establish the parent context in every subthread spawned by parallel bulk, so these subthreads stay within the flow of the parent context along with all the other calls (search/delete/index/etc.).
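For reference, the general shape of that propagation pattern looks roughly like this (a minimal sketch using the public OpenTelemetry API, not the exact code in this PR): the parent thread injects its current context into a carrier dict, and each bulk subthread extracts and attaches it before creating spans.

from concurrent.futures import ThreadPoolExecutor

from opentelemetry import context, propagate, trace

tracer = trace.get_tracer("parallel-bulk-propagation-sketch")


def send_chunk(carrier):
    # Re-establish the parent context in this thread before creating spans.
    token = context.attach(propagate.extract(carrier))
    try:
        with tracer.start_as_current_span("bulk_chunk"):
            pass  # send one bulk chunk to Elasticsearch here
    finally:
        context.detach(token)


with tracer.start_as_current_span("parallel_bulk"):
    carrier = {}
    propagate.inject(carrier)  # capture the current context in the parent thread
    with ThreadPoolExecutor(max_workers=4) as pool:
        for _ in range(4):
            pool.submit(send_chunk, carrier)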
Here's an example of the same batch job after applying this fix:
I tried to keep the changes scoped to the OTEL class itself and to limit the test to the level of the call that recovers the context, though I'm aware that might not be the ideal approach.
I look forward to reviews and suggestions :)