Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ignore TLS components (SSLContext, TrustManager, KeyManager) if plain HTTP protocol is used for exporting #6329

Conversation

serkan-ozal
Copy link
Contributor

@serkan-ozal serkan-ozal commented Mar 29, 2024

Even though plain HTTP protocol is used for exporting, still security components (SSLContext, TrustManager, KeyManager) are initialized. Most of the time, this is not a problem other than adding some minor startup overhead.

But there might be some cases that security components should not be initialized initially by OTEL Java agent. Here is a case which effects one of our customers:

  • They have their own custom KeyStoreSpi implementation.
  • They bundle their Spring application as executable jar, so artifact layout is customized for Spring's custom Launcher classloader.
  • OTEL Java agent is initialized at startup and so creates exporters. But even though traces are sent to OTEL collector over plain HTTP, still some security components are initialized by OTEL Java agent at system classloader level.
  • Since artifact layout is customized for Spring's custom Launcher classloader as mentioned above, JDK level security components are initialized without custom KeyStoreSpi implementation, because custom KeyStoreSpi implementation is not visible to system classloader, but to Spring's custom Launcher classloader.
  • Then security component usages (custom keystore import and SSL connection to remote service by using this keystore) at application level fails.

Also there is a similarity between the case I have mentioned above and this issue open-telemetry/opentelemetry-java-instrumentation#10921. They are not the same errors, but caused by the initialization of some JDK level SPIs at system classloader level instead of Launcher classloader for Spring applications.

@serkan-ozal serkan-ozal requested a review from a team March 29, 2024 12:55
Copy link

codecov bot commented Mar 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.14%. Comparing base (9845ac9) to head (dacf5b9).

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #6329      +/-   ##
============================================
+ Coverage     91.12%   91.14%   +0.01%     
- Complexity     5856     5863       +7     
============================================
  Files           636      636              
  Lines         17062    17071       +9     
  Branches       1733     1740       +7     
============================================
+ Hits          15548    15559      +11     
+ Misses         1019     1018       -1     
+ Partials        495      494       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@tylerbenson tylerbenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks reasonable to me. Definitely makes sense to avoid the extra work when not sending via SSL which is likely very common (eg local collector).

@serkan-ozal
Copy link
Contributor Author

As far as I see, Build (macos-latest, 8) job has been cancelled because of timeout after 6 hours and don't think it is related to my changes. Can someone retry the failed job?

@@ -207,8 +208,8 @@ public GrpcExporter<T> build() {
grpcChannel,
grpcStubFactory,
retryPolicy,
tlsConfigHelper.getSslContext(),
tlsConfigHelper.getTrustManager());
isPlainHttp ? null : tlsConfigHelper.getSslContext(),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since artifact layout is customized for Spring's custom Launcher classloader as mentioned above, JDK level security components are initialized without custom KeyStoreSpi implementation, because custom KeyStoreSpi implementation is not visible to system classloader, but to Spring's custom Launcher classloader.

It looks like the call to SSLContext.getInstance("TLS") is what is responsible for initializing this state? Seems brittle to need to rely on the otel java agent and otel java SDK to not call ordinary java network APIs like SSLContext.getInstance("TLS").

Why is it not ok to call SSLContext.getInstance("TLS") when the scheme is http, but acceptable to call it when scheme is https? Wouldn't spring users that want to implement KeyStoreSpi and use https be subject to this problem?

Copy link
Contributor Author

@serkan-ozal serkan-ozal Mar 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jack-berg,

Actually, in my case, extra SSLContext.getInstance("TLS") call is not a problem. But I have though that there might be similar problem for javax.net.ssl.SSLContextSpi and java.security.Provider as they are also loaded by Java ServiceLoader. So I have prevented extra SSLContext.getInstance("TLS") calls if they are not needed (scheme is http).

If you want, I can revert those changes and only keep

builder.connectionSpecs(Collections.singletonList(ConnectionSpec.CLEARTEXT));

parts as those fix the my original issue.

Regarding to your this question

Wouldn't spring users that want to implement KeyStoreSpi and use https be subject to this problem?

Yes, you're right. They are still subject to this problem. In fact, this PR doesn't fix all the possible issues (when OTEL Java agent uses secure connection while exporting to collector), but most of the cases (I believe), because as @tylerbenson mentioned, many are exporting to local collector first without secure connection, and then local collector deals with exporting to remove vendor target.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be evaluated as a performance optimization where we try to avoid doing unnecessary work in obvious cases rather than creating a guarantee that it never happens.

While this has a similar appearance as java util logging which we take great pains to deal with in the javaagent, doing so adds a significant amount of complexity in the javaagent.

For logging this complexity is worth it because logging initialization is so hard to avoid, while SSL initialization is only done in very specific locations and not difficult to avoid (with the added benefit of reducing performance overhead and not even requiring an additional setting).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The perf argument is quite different than how this is being framed. Some points to consider:

  • On my machine, SSLContext.getInstance("TLS") takes ~50ms. Anyone know how long the agent typically takes to initialize?
  • Optimizing initialization for perf reasons seems reasonable, but we shouldn't be under the impression that this is a reliable fix for the reported issue. The same issue is likely to pop up again the future.
  • If we want to optimize initialization for perf reasons, can we look at a profile of CPU time for the initialization of the agent / SDK? There's likely other low hanging fruit we can optimize.

Copy link
Contributor Author

@serkan-ozal serkan-ozal Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, for the performance optimization perspective, getting rid of redundant SSLContext.getInstance("TLS") is not the only part optimized here. But also getting rid of redundant TrustManager and KeyStore initialization in the OkHttpClient. In our case, initialization of the TrustManager is the real problem.

Here is the stacktrace to see the flow:

OpenTelemetry Javaagent failed to start
java.security.KeyStoreException: problem accessing trust store
	at java.base/sun.security.ssl.TrustManagerFactoryImpl.engineInit(TrustManagerFactoryImpl.java:73)
	at java.base/javax.net.ssl.TrustManagerFactory.init(TrustManagerFactory.java:282)
	at okhttp3.internal.platform.Platform.platformTrustManager(Platform.kt:80)
	at okhttp3.OkHttpClient.<init>(OkHttpClient.kt:237)
	at okhttp3.OkHttpClient$Builder.build(OkHttpClient.kt:1069)
	at io.opentelemetry.exporter.sender.okhttp.internal.OkHttpGrpcSender.<init>(OkHttpGrpcSender.java:97)
	at io.opentelemetry.exporter.sender.okhttp.internal.OkHttpGrpcSenderProvider.createSender(OkHttpGrpcSenderProvider.java:44)
	at io.opentelemetry.exporter.internal.grpc.GrpcExporterBuilder.build(GrpcExporterBuilder.java:189)
	at io.opentelemetry.exporter.otlp.metrics.OtlpGrpcMetricExporterBuilder.build(OtlpGrpcMetricExporterBuilder.java:243)
	at io.opentelemetry.exporter.otlp.internal.OtlpMetricExporterProvider.createExporter(OtlpMetricExporterProvider.java:71)

...
...

Caused by: java.io.IOException: Keystore was tampered with, or password was incorrect
	at java.base/sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:813)
	at java.base/sun.security.util.KeyStoreDelegator.engineLoad(KeyStoreDelegator.java:242)
	at java.base/java.security.KeyStore.load(KeyStore.java:1473)
	at java.base/sun.security.ssl.TrustStoreManager$TrustAnchorManager.loadKeyStore(TrustStoreManager.java:390)
	at java.base/sun.security.ssl.TrustStoreManager$TrustAnchorManager.getTrustedCerts(TrustStoreManager.java:336)
	at java.base/sun.security.ssl.TrustStoreManager.getTrustedCerts(TrustStoreManager.java:57)
	at java.base/sun.security.ssl.TrustManagerFactoryImpl.engineInit(TrustManagerFactoryImpl.java:49)
	... 43 more
Caused by: java.security.UnrecoverableKeyException: Password verification failed
	at java.base/sun.security.provider.JavaKeyStore.engineLoad(JavaKeyStore.java:811)
	... 49 more

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tylerbenson @jack-berg What you are thinking about this PR? Is there anything I can add/fix this PR? Or the Do changes make sense you?

Copy link
Member

@jack-berg jack-berg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving this, but not for the reasons discussed in the PR description. As discussed in this comment chain, this isn't a reliable solution for the problem described. And folks shouldn't depend on or expect the SDK to jump through hoops to add a bandaid without addressing the underlying issue in a general way.

Accepting this because PR description aside, its a innocuous change with a minor improvement in perf for a narrow slice of users.

@jack-berg
Copy link
Member

@serkan-ozal can you update with latest changes from main to fix the build? Thanks.

@serkan-ozal serkan-ozal force-pushed the improvement/exporter/ignore-tls-for-plain-http branch from 957b6d4 to dacf5b9 Compare April 30, 2024 19:45
@serkan-ozal
Copy link
Contributor Author

@jack-berg Thanks. Just rebased from main branch.

@jack-berg jack-berg merged commit 35bc345 into open-telemetry:main Apr 30, 2024
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants