Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add context_propagation_only config option, spec existing disable_send config option #461

Merged
merged 8 commits into from
Oct 26, 2021

Conversation

trentm
Copy link
Member

@trentm trentm commented Jun 29, 2021

This is part of my adding disableSend support to the Node.js APM agent.

Currently the Java, Python, and Ruby agents implement disable_send:

I'm not sure if they meet the current MUST / SHOULD semantics I've proposed.


@felixbarny I'd appreciate a sanity check from you on this:

  1. I didn't see an obvious existing spec file for this (didn't feel right in transport.md), so I've started a general section at the bottom of configuration.md.
  2. Do the proposed MUSTs and SHOULDs raise any red flags for my being Node.js-centric?

@trentm trentm requested a review from felixbarny June 29, 2021 00:40
@trentm trentm self-assigned this Jun 29, 2021
@apmmachine
Copy link

apmmachine commented Jun 29, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Duration: 3 min 9 sec

Copy link
Member

@felixbarny felixbarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't see an obvious existing spec file for this (didn't feel right in transport.md), so I've started a general section at the bottom of configuration.md.

I think transport.md would be quite fitting. If you disagree, prefer creating a new file over adding to configuration.md.

Do the proposed MUSTs and SHOULDs raise any red flags for my being Node.js-centric?

LGTM 👍

specs/agents/README.md Outdated Show resolved Hide resolved
specs/agents/configuration.md Outdated Show resolved Hide resolved
Comment on lines 99 to 101
- SHOULD attempt to reduce runtime overhead where possible. For example,
because events will be dropped there is no need to collect stack traces,
collect metrics, or to calculate breakdown metrics.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree with this one. I like to think of DISABLE_SEND as a way to run all of the agent except only the parts that communicate with APM Server. That means doing everything as usual up until the very last point before opening a connection. If perf is an issue, users can use the RECORDING or ENABLED options instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But a big use case for this option is to allow for distributed context propagation and log correlation. To reduce overhead for that use case, it does make sense to turn off features that don't have any noticeable side-effects but add overhead. If recording or enabled is set to false, log correlation doesn't work either.

I like to think of DISABLE_SEND as a way to run all of the agent except only the parts that communicate with APM Server. That means doing everything as usual up until the very last point before opening a connection.

What's the advantage of not disabling side-effect-free features such as breakdown metrics if disable_send is false? Is it something you need for testing? Is it because the option name doesn't reflect that? Any suggestions for a better option name while we're standardizing on the behavior?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is maybe only my own use case, but for my own apps I like to run the agent as integrated as possible, doing as many things as possible, to test that nothing breaks. However, I don't need the resulting data as it's not really relevant when in development/test mode.

It may be because I am the only person who's both an agent AND application developer, but I can see other folks wanting something like this too. Like, staging or whatever, where you want to mirror prod as closely as possible but don't wan't the burden of setting up a dedicated APM Server for it or don't want the data clutter.

idk, ready to be voted down. This definition was just not how I, personally, used it 😊

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can understand the dev/test use case. For that I wrote a small mock APM server: https://gist.github.com/trentm/4b8c54a8bdb1c1eba2ade871253860f6 It shows all requests nicely, which can be helpful for dev as well:

[2021-06-29T23:18:13.255Z]  INFO: mockapmserver/47289 on pink.local: request (req.remoteAddress=::ffff:127.0.0.1, req.remotePort=58727, req.bodyLength=2067, res.bodyLength=2)
    POST /intake/v2/events HTTP/1.1
    accept: application/json
    user-agent: elasticapm-node/3.16.0 elastic-apm-http-client/9.8.1 node/12.22.1
    content-type: application/x-ndjson
    content-encoding: gzip
    host: localhost:8200
    connection: keep-alive
    transfer-encoding: chunked

    {
        "metadata": {
            "service": {
                "name": "play-s3",
                "environment": "development",
                ...

However, I think the user use case (the motivator for this is usage by Kibana core for elastic/kibana#101711) either trumps the dev case, or we look into two separate config vars here.

@elastic/apm-agent-python What's your understanding/usage of disable_send?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late reply. The reason we added disable_send back in the day was indeed to facilitate the use of the agent for local development and CI (not our CI, the user's CI). The other option is to disable the agent completely, which bears the risk that issues of the agent inter-playing with the app only show up on deployment.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's two votes for everything but sending 😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as all the mentioned ways to limit overhead can already be achieved with existing settings AFAICT.

Not entirely, at least in the node.js agent. There isn't another way to avoid:

  1. encoding of the span/transaction objects to data objects for sending
  2. capturing an error (especially its stacktrace) for captured errors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't recording or enabled do the trick? Maybe source_lines_error_app_frames for stacktraces?

Copy link
Member Author

@trentm trentm Jul 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't recording or enabled do the trick?

The Node.js agent doesn't currently implement recording :) -- and I'm not entirely clear from the description of recording in the Python and Java agent docs that it matches. The Java docs imply recording: false means not tracking incoming HTTP requests, which means no trace-context propagation.

Maybe source_lines_error_app_frames for stacktraces?

Mostly, yes. Stacktrace collection is the large part of error (and span) capture, but not all of it.


tl;dr: Would anyone object to the spec here saying something like "Agents MAY reduce runtime overhead where possible, for example skipping stack trace, metrics, error collection and breakdown metric calculation."? Effectively this would allow agent implementations to differ slightly without yet having get into specifying exactly how to support comparatively rare use cases. I imagine this particular Kibana use case -- using APM, but not for traces and metrics -- is rare. This is "option 1" in the details below.

Details (you can skip if your eyes are glazing over):

We discussed this at the apm-agents-nodejs call this morning and Alan and I chatted a bit about it. Here are options that I see:

  1. Change the disable_send spec to say "agents MAY limit work ..." and basically have divergent use cases for the various agents.
  2. Add a separate config option to more clearly state the use case for "do all your work, but be useful for CI where there isn't an APM server" (as Python and Ruby are using disable_send=true now). Seems unnecessary to have Python and Ruby teams prioritize this.
  3. Add a separate config option to more clearly state the use case for "do the minimal work to support trace-context propagation and log correlation" (as Node.js and Java, sort of, are using disable_send=true now). I'm not sure what I would call this option.
  4. Add separate config options to disable particular functionality (like span serialization and error capture) such that the "do the minimal work to support trace-context propagation and log correlation" use case is: disable_send=true, metrics_interval=0s, breakdown_metrics=false, serialize_objects_for_sending=false(?), capture_errors=false(?).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. This feels like under-specifying the option to a point where users have to rely on the implementation details of a specific agent.
  2. I think the name disable_send conveys that quite nicely already.
  3. To me, this sounds like a good option. Finding a better name that better suits the context propagation use case (context_propagation_only?) makes it easier to understand the intent of the option. Agents that have already implemented disable_send can just use it as an alias for the new config in the first phase. In a later stage, they can implement the listed optimizations.
    The spec should be more concrete on the things agents SHOULD update (such as putting events in the queue, capturing errors and stack traces)
  4. Seems like adding even more options and making it harder for the users to set the right set of options for the context propagation/log correlation use case.

@trentm
Copy link
Member Author

trentm commented Jun 29, 2021

I think transport.md would be quite fitting.

Okay, done.

@trentm trentm changed the title spec DISABLE_SEND config option spec disable_send config option Jul 6, 2021
@trentm trentm requested a review from felixbarny July 6, 2021 00:15
specs/agents/transport.md Outdated Show resolved Hide resolved
@trentm
Copy link
Member Author

trentm commented Oct 6, 2021

I dropped the ball on this one. I'd like to get back to this so I can settle on the new name used in the Node.js APM agent because Kibana is going to be using it.

Going on @felixbarny's "To me, this sounds like a good option." the plan is:

Add a separate config option to more clearly state the use case for "do the minimal work to support trace-context propagation and log correlation" (as Node.js and Java, sort of, are using disable_send=true now). I'm not sure what I would call this option.

I like Felix's suggestion of a boolean config var named context_propagation_only. Any objections?
I'll update this PR and re-request review.

@trentm trentm changed the title spec disable_send config option add context_propagation_only config options, spec existing disable_send config option Oct 6, 2021
@trentm trentm changed the title add context_propagation_only config options, spec existing disable_send config option add context_propagation_only config option, spec existing disable_send config option Oct 6, 2021
- MUST NOT log warnings/errors related to failures to communicate with APM server.
- SHOULD attempt to reduce runtime overhead where possible. For example,
because events will be dropped there is no need to collect stack traces,
collect metrics, or to calculate breakdown metrics.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

REVIEW NOTE: I'm considering adding that there is no need to create spans (other than the top-level transaction) in this mode, pending some discussion on slack.

Copy link
Member

@felixbarny felixbarny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

…xt_propagation_only (from separate discussion)
@trentm trentm marked this pull request as ready for review October 7, 2021 17:38
@trentm trentm requested review from a team as code owners October 7, 2021 17:38
@trentm trentm requested review from a team as code owners October 7, 2021 17:38
specs/agents/transport.md Outdated Show resolved Hide resolved
Copy link
Contributor

@basepi basepi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we explicitly define these as optional configuration options that agents can choose not to implement if they're not relevant? (My understanding is that they're optional, and if that's correct I think we should define that in the spec.)

@trentm
Copy link
Member Author

trentm commented Oct 12, 2021

Should we explicitly define these as optional configuration options that agents can choose not to implement if they're not relevant?

Perhaps something like starting each of those sections with Agents MAY implement this configuration option. ?

@basepi
Copy link
Contributor

basepi commented Oct 12, 2021

Perhaps something like starting each of those sections with Agents MAY implement this configuration option. ?

Seems reasonable. Maybe I'm nit-picking and it's not necessary. I haven't checked if we're explicit on other optionals.

@trentm
Copy link
Member Author

trentm commented Oct 12, 2021

Maybe I'm nit-picking

I like the added clarity. The existing language -- "Agents that implement this configuration option" -- is vague.

I haven't checked if we're explicit on other optionals.

It varies. Some say "should" in their opening sentence, implying that it is optional, but there is room for interpretation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants