Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate from gRPC to armeria for testing agent in memory exporter #5314

Closed
wants to merge 14 commits into from

Conversation

anuraaga
Copy link
Contributor

@anuraaga anuraaga commented Feb 7, 2022

There is currently a big gap between the published agent and our tests because for historical reasons we include gRPC in the testing exporter. This switches it to Armeria to avoid the gRPC dependency, which automatically switches the tests to using the okhttp export codepath.

@anuraaga anuraaga requested a review from a team February 7, 2022 03:19
@anuraaga anuraaga marked this pull request as draft February 7, 2022 05:12
Arrays.asList(
AgentTestingTracingCustomizer.spanProcessor.forceFlush(),
AgentTestingLogsCustomizer.logProcessor.forceFlush());
CompletableResultCode.ofAll(results).join(10, TimeUnit.SECONDS);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WDYT about using OpenTelemetrySdkAccess instead? It flushes the meter provider too

Copy link
Contributor Author

@anuraaga anuraaga Feb 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops realized that we don't want to flush metrics here anyways (there's no such thing as pending metric exports really, all exports happen at random times and are valid)

Copy link
Member

@trask trask left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

grpc without grpc 🤯

@anuraaga anuraaga force-pushed the testing-agent-armeria branch from e06c774 to dc4d91a Compare February 8, 2022 04:17
@anuraaga anuraaga force-pushed the testing-agent-armeria branch from 7059ef3 to 390b772 Compare February 8, 2022 05:49
@trask
Copy link
Member

trask commented Feb 8, 2022

This PR seems to cause a core dump on Java 15 OpenJ9 reliably, anyone have an idea what could cause it?

ya this is weird. it seems to consistently seg fault on the dubbo and camel tests, and none of the others.

and there's no consistency in the "current thread" in the core dumps between runs, which is normally what i'd use to search on to see if it's a known bug

and there's a lot of openj9 segfault issues

and even if we could run integration tests on Java 17 today, I don't think we could run them on openj9 17 due to #5051 (comment) and https://adoptopenjdk.net/archive.html?variant=openjdk8&jvmVariant=openj9

maybe it's best to go with the @SuppressWarnings route you mentioned yesterday for updating to OTel SDK 1.11.0 and postponing this PR for a bit?

i'm also ok with skipping those two tests on openj9 15 if we want to move forward with this PR

@laurit
Copy link
Contributor

laurit commented Feb 9, 2022

I played around with this a bit. Firstly reporting this to openj9 is probably futile. Afaik they build all their runtimes from the same code base and if the bug was still there then there is a good chance 8 & 11 would also fail similarly.
The bug goes away (tested with :instrumentation:apache-dubbo-2.7:javaagent:test) when executor instrumentation is disabled or when in AgentTestingCustomizer

    autoConfigurationCustomizer.addMeterProviderCustomizer(
        (meterProvider, config) ->
            meterProvider.registerMetricReader(
                PeriodicMetricReader.builder(AgentTestingExporterFactory.metricExporter)
                    .setInterval(Duration.ofMillis(300))
                    .newMetricReaderFactory()));

is commented out or when in AgentInstaller

    // If noop OpenTelemetry is enabled, autoConfiguredSdk will be null and AgentListeners are not
    // called
    AutoConfiguredOpenTelemetrySdk autoConfiguredSdk = null;
    if (config.getBoolean(JAVAAGENT_NOOP_CONFIG, false)) {
      logger.info("Tracing and metrics are disabled because noop is enabled.");
      GlobalOpenTelemetry.set(NoopOpenTelemetry.getInstance());
    } else {
      autoConfiguredSdk = installOpenTelemetrySdk(config);
    }

    if (autoConfiguredSdk != null) {
      runBeforeAgentListeners(agentListeners, config, autoConfiguredSdk);
    }

is moved after byte-buddy instrumenter is set up or when in AbstractExecutorInstrumentation java.util.concurrent.ForkJoinPool is removed from included executors list.
The key to this crash probably has something to to with java.util.concurrent.ForkJoinPool. While commenting out addMeterProviderCustomizer seems somewhat random I think it is still indirectly related to ForkJoinPool. Metrics send data to armeria which uses caffeine bounded cache, which uses ForkJoinPool. Changing interval Duration.ofSeconds(3) seems to also avoid the crash (idk maybe just makes it not repro on each run).
Similarly moving code around in AgentInstaller doesn't have an obvious relation to ForkJoinPool. ForkJoinPool is loaded early in AgentInstaller.installBytebuddyAgent from call to List<AgentListener> agentListeners = loadOrdered(AgentListener.class);. Seems like moving this block ensures that first metrics are sent after byte-buddy transformer is installed and byte-buddy has redefined ForkJoinPool.
The most unintrusive fix for this crash that I could come up with is to modify OtlpInMemoryMetricExporter so that it would fail export attempts before AgentListener.afterAgent is called.
To avoid similar issues in the future I'd urge maintainers to consider altering agent initialization sequence so that byte-buddy transformer is installed before sdk is initialized.

@mateuszrzeszutek
Copy link
Member

To avoid similar issues in the future I'd urge maintainers to consider altering agent initialization sequence so that byte-buddy transformer is installed before sdk is initialized.

I don't think we can do that either - that could cause some instrumentations to initialize with the no-op OpenTelemetry implementation (since the global is set after SDK is initialized)

@anuraaga
Copy link
Contributor Author

Thanks @laurit for the great digging! Delayed the metric export to after agent initializes

}

@SuppressWarnings("ImmutableEnumChecker")
private enum StartableMetricReader implements MetricReaderFactory, MetricReader {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a brief comment explaining why this is needed (or just pointing to the PR discussion)?

@anuraaga
Copy link
Contributor Author

For reference, I was expecting test slowdown with this PR going from in-memory transport to using localhost network traffic. After getting some builds through, dunno if I like it. Instead sent #5332

@anuraaga anuraaga marked this pull request as draft February 10, 2022 06:54
@laurit
Copy link
Contributor

laurit commented Feb 10, 2022

To avoid similar issues in the future I'd urge maintainers to consider altering agent initialization sequence so that byte-buddy transformer is installed before sdk is initialized.

I don't think we can do that either - that could cause some instrumentations to initialize with the no-op OpenTelemetry implementation (since the global is set after SDK is initialized)

I didn't say it would be easy :) There wouldn't be too many instrumentations that are affected, these could use an extra agent started check and bail out when sdk isn't ready yet. Perhaps this agent started check could even be baked into instrumenter api somehow? For another case where running initial retransformation concurrently with background thread started sdk code produces strangeness see #4697 Basically it boils down to replacing one set of problems with another, hopefully more manageable, set of problems.

@anuraaga anuraaga closed this Feb 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants