OpenTelemetry Extension not reliably adding user id and roles to span attributes #39563
Comments
/cc @brunobat (opentelemetry), @radcortez (opentelemetry)
Thanks @vonatzigenc. This is a very complete report.
Hey,
In general, it is. The problem is that it only helps if the identity is already set there, which means the span must be created after Quarkus REST and similar have started processing. I tried to look up all the cases where a span can be created, and I don't think that holds for every scenario. Bruno can prove me wrong...
Please note there have been changes around context activation in the last releases, and I can't tell off the top of my head in which micro release it was fixed. But IMHO the current implementation activates the CDI request context and accesses the bean inside an async method, which will sometimes run after deactivation. It is wrong to expect the context to be active there.
I think your implementation is better. I don't like that it is not bound to the request (impossible to solve inside a SpanProcessor, though), which means that, in addition to the resource cost, it will occasionally run after the span has ended? Also, I don't believe it fixes all the issues.
The feature can't be written reliably like this, because it basically says that whenever a Span is created we will try to authenticate. But I can't tell how to solve this without digging more into how and when Spans can be created (all cases). Maybe a quick fix could be:
A proper fix should be to determine - @vonatzigenc, if you want to dig into this, please go ahead. The only thing is that I can't provide you more help without actually getting more familiar with the Quarkus OpenTelemetry. Hence my suggestion can't be reliable until I actually work on it.
Otherwise I can have a look in the next weeks.
FYI I'd expect https://quarkus.io/version/main/guides/opentelemetry#quarkus-security-events to be very reliable, if that helps your use case.
Thanks for the quick feedback and analysis.
I would also have to dig much deeper. (In routes and otel)
The variant via the SecurityEvents looks promising. The following code has already worked in a first test. (But I haven't looked at it more deeply yet)
Would the idea be that this would be adapted in the extension? (i.e. as a replacement for the current |
you can't rely on the
Not alone.
No. It needs to be more complex. I can't say how without working on this, and I can't list all the caveats. I can do a review if you eventually create a PR. One thing I can say is that the current test coverage does not cover important scenarios: with/without RBAC, with/without HTTP permissions, proactive authentication enabled/disabled, and combinations of these.
Looking at what we currently do for
The way I see it, the proper solution is the following: the span attributes should only be set in a synchronous fashion, and that in turn means that these security attributes will only be set when the security information is present and can be determined in a synchronous fashion.
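To make the "only when it can be determined synchronously" idea concrete, here is a minimal plain-Java sketch. It is not the Quarkus implementation: `Identity`, `attributesAtSpanStart` and the use of `CompletableFuture` as a stand-in for a deferred identity are all assumptions made for illustration, and the comma-joined role format is likewise assumed. The attribute keys `enduser.id` and `enduser.role` follow the OpenTelemetry semantic convention names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.CompletableFuture;

public class SyncIdentityAttributes {

    /** Hypothetical, simplified identity; not the Quarkus SecurityIdentity API. */
    public static final class Identity {
        final String userId;
        final Set<String> roles;

        public Identity(String userId, Set<String> roles) {
            this.userId = userId;
            this.roles = roles;
        }
    }

    /**
     * Returns the end-user attributes that can be set at span start without
     * blocking or scheduling work. If the identity is not resolved yet, the
     * result is empty and nothing is added later.
     */
    public static Map<String, String> attributesAtSpanStart(
            CompletableFuture<Identity> deferredIdentity) {
        Map<String, String> attributes = new LinkedHashMap<>();
        // isDone() is also true for failed or cancelled futures, so exclude those.
        if (!deferredIdentity.isDone() || deferredIdentity.isCompletedExceptionally()) {
            return attributes;
        }
        Identity identity = deferredIdentity.getNow(null);
        if (identity != null) {
            attributes.put("enduser.id", identity.userId);
            // Sorted so the output is stable; the joined format is an assumption.
            attributes.put("enduser.role", String.join(",", new TreeSet<>(identity.roles)));
        }
        return attributes;
    }

    public static void main(String[] args) {
        CompletableFuture<Identity> resolved = CompletableFuture.completedFuture(
                new Identity("alice", Set.of("admin", "user")));
        CompletableFuture<Identity> pending = new CompletableFuture<>();

        System.out.println("resolved: " + attributesAtSpanStart(resolved));
        System.out.println("pending:  " + attributesAtSpanStart(pending));
    }
}
```

The trade-off is the one discussed in this thread: spans started before the identity is known simply carry no user attributes, which is predictable and can be documented, instead of attributes that sometimes arrive too late.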
I agree with @geoand
Well, that's the point I wasn't sure about. The conclusion is fine if we document that, for this feature to reliably report the user when credentials are available, users will need proactive authentication enabled (because that's what is happening when lazy proactive auth is in place and FYI I also don't think this is a good first issue @vonatzigenc. I am sure @brunobat and @sberyozkin would be happy to suggest many issues where you could help if you are interested in contributing. Just ping them if that is the case.
It absolutely is not a good first issue :)
We need 2 different actions here.
On a side note, we might have a similar problem on the security events collection. We need to make sure the span |
Probably no, I adjusted my expectations based on Georgios comment, we just need to document expectations. However as I mentioned, please note that when you inject the identity and call My suggestion is that I need to dig into this to understand it properly. I'll provide a fix within a week. |
I didn't know about |
IMHO, if we can't provide a proper fix along the lines described, we should remove the feature altogether, as it's very broken currently.
+1
Even with a proper fix, it's very likely that we will not be able to provide the authentication data in all cases, for all execution flows... And this is fine. Another option we can explore is trying to extract that information from the security events, when available.
sounds to me, thanks
This feature can be reliable; we just need to rewrite it and document which setups always provide the information and which provide it only in documented scenarios. I don't think we can extract it from security events, either because context propagation would be expensive (using duplicated context local data) or because of cardinality (adding it to every created span vs. only the current span when authentication happened).
This is your domain, so you know best. I'm just saying that this sounds super weird to me and I would not have opted to include such behavior in an analogous situation in some of the extensions I maintain. |
Slept on it... This will cause a lot of frustration and questions, just as it does now.
I've removed my assignment and will leave it to you guys, but personally I don't think you should draw conclusions without thorough analysis. Maybe you did that, I don't know. As I said, until I can study the Quarkus OTEL internals I can't say for sure whether it can be done. My plan was to leverage the fact that the Vert.x HTTP route event is handled on a duplicated context (the mainstream scenario), so the OTEL context storage is also stored on the duplicated context. Before the HTTP authenticator starts to authenticate when proactive authentication is enabled, the OTEL context is already stored on the Vert.x duplicated context. I planned to investigate if when the identity is already in the |
Please do investigate. I still think we need the feature, even if it doesn't work for all flows. We just need to be more precise in the documentation and about what the user experience will be.
Sounds good!
There is a thing I didn't know - https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#onendspan says that inside the |
This is working - 8f7673d. I need to write some more tests and handle edge cases next week, but you can see that the current test coverage is already big enough to know this concept is alright. The fix, however, won't be backportable, as it relies on code that I have changed so many times in past releases that it would be merge hell.
Describe the bug
With the OpenTelemetry Extension, the property
quarkus.otel.traces.eusp.enabled=true
can be used to add the user id and the roles to the span attributes. (Based on Guide - Using OpenTelemetry - User data)
However, this is not reliable and there are spans where the attributes are missing.
I have found two different types of faulty behaviour:
In both cases, the attributes are missing on the exported span.
Since Quarkus 3.7.2 the second case occurs more often. (Most likely because of the change #38605)
With 3.7.1 there was no exception, but the attributes were written to the closed span.
The user data is read from the SecurityIdentity in io.quarkus.opentelemetry.runtime.exporter.otlp.EndUserSpanProcessor and set to the span.
The onStart method must be non-blocking, which is why access to SecurityIdentity was implemented via a ManagedExecutor. See #34595 (comment). (Access to the properties of SecurityIdentity is blocking. See io.quarkus.security.runtime.SecurityIdentityProxy.)
However, this can lead to the processing taking place too late.
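The timing problem described above can be reproduced in a small plain-Java model. This is only a sketch under stated assumptions: `FakeSpan` is a hypothetical stand-in for an SDK span that silently drops writes after it has ended, and the latches force the "executor runs too late" ordering that in the real application only happens under load. No Quarkus or OpenTelemetry classes are used.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LateAttributeDemo {

    /** Hypothetical stand-in for a span: writes after end() are dropped. */
    static final class FakeSpan {
        private final Map<String, String> attributes = new ConcurrentHashMap<>();
        private volatile boolean ended;

        void setAttribute(String key, String value) {
            if (ended) {
                return; // too late: the span is already closed
            }
            attributes.put(key, value);
        }

        void end() {
            ended = true;
        }

        Map<String, String> exported() {
            return Map.copyOf(attributes);
        }
    }

    /** The attribute is written from an executor thread, after the span ended. */
    public static Map<String, String> asyncWriteAfterEnd() {
        FakeSpan span = new FakeSpan();
        CountDownLatch spanEnded = new CountDownLatch(1);
        CountDownLatch writeAttempted = new CountDownLatch(1);
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            executor.submit(() -> {
                try {
                    spanEnded.await(); // models a slow, blocking identity lookup
                    span.setAttribute("enduser.id", "alice");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    writeAttempted.countDown();
                }
            });
            span.end(); // the request finishes first
            spanEnded.countDown();
            writeAttempted.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            executor.shutdownNow();
        }
        return span.exported();
    }

    /** The attribute is written on the calling thread, before the span ends. */
    public static Map<String, String> syncWriteBeforeEnd() {
        FakeSpan span = new FakeSpan();
        span.setAttribute("enduser.id", "alice");
        span.end();
        return span.exported();
    }

    public static void main(String[] args) {
        System.out.println("async, after end: " + asyncWriteAfterEnd());
        System.out.println("sync, before end: " + syncWriteBeforeEnd());
    }
}
```

In this model the asynchronous variant exports a span without the attribute, while the synchronous variant keeps it, which matches the missing-attribute symptom reported here.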
Expected behavior
Actual behavior
With "slightly higher load" the attributes are partially missing
How to Reproduce?
Reproducer project: https://github.com/vonatzigenc/quarkus-otel-eusp-reproducer/tree/main
In reproducer:
a) there are spans without the attributes (logged using a simple SpanExporter)
b) (sometimes) a ContextNotActiveException (the stack trace is logged)
c) (sometimes) an attempt is made to write attributes to closed spans (log statement: FINE [io.ope.sdk.tra.SdkSpan] (executor-thread-5) Calling setAttribute() on an ended Span.)
In general:
Output of uname -a or ver
alpine 3.19.1
Output of java -version
openjdk version "21.0.1.0.101" 2024-01-16 LTS
Quarkus version or git rev
Tested with 3.8.2
Build tool (i.e. output of mvnw --version or gradlew --version)
Apache Maven 3.9.5
Additional information
I tried to fix the error in the reproducer project.
However, I'm not sure if using CurrentIdentityAssociation instead of SecurityIdentity is legitimate (and whether the non-blocking variant is correct).
CurrentIdentityAssociationEndUserSpanProcessor.java
securityIdentityAssociation.getDeferredIdentity()
instead of
If helpful I can create a PR with the customisation.