-
Notifications
You must be signed in to change notification settings - Fork 115
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add context_propagation_only config option, spec existing disable_send config option #461
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't see an obvious existing spec file for this (didn't feel right in transport.md), so I've started a general section at the bottom of configuration.md.
I think transport.md
would be quite fitting. If you disagree, prefer creating a new file over adding to configuration.md
.
Do the proposed MUSTs and SHOULDs raise any red flags for my being Node.js-centric?
LGTM 👍
specs/agents/configuration.md
Outdated
- SHOULD attempt to reduce runtime overhead where possible. For example, | ||
because events will be dropped there is no need to collect stack traces, | ||
collect metrics, or to calculate breakdown metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't agree with this one. I like to think of DISABLE_SEND
as a way to run all of the agent except only the parts that communicate with APM Server. That means doing everything as usual up until the very last point before opening a connection. If perf is an issue, users can use the RECORDING
or ENABLED
options instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But a big use case for this option is to allow for distributed context propagation and log correlation. To reduce overhead for that use case, it does make sense to turn off features that don't have any noticeable side-effects but add overhead. If recording
or enabled
is set to false
, log correlation doesn't work either.
I like to think of DISABLE_SEND as a way to run all of the agent except only the parts that communicate with APM Server. That means doing everything as usual up until the very last point before opening a connection.
What's the advantage of not disabling side-effect-free features such as breakdown metrics if disable_send
is false
? Is it something you need for testing? Is it because the option name doesn't reflect that? Any suggestions for a better option name while we're standardizing on the behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is maybe only my own use case, but for my own apps I like to run the agent as integrated as possible, doing as many things as possible, to test that nothing breaks. However, I don't need the resulting data as it's not really relevant when in development/test mode.
It may be because I am the only person who's both an agent AND application developer, but I can see other folks wanting something like this too. Like, staging or whatever, where you want to mirror prod as closely as possible but don't wan't the burden of setting up a dedicated APM Server for it or don't want the data clutter.
idk, ready to be voted down. This definition was just not how I, personally, used it 😊
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can understand the dev/test use case. For that I wrote a small mock APM server: https://gist.github.com/trentm/4b8c54a8bdb1c1eba2ade871253860f6 It shows all requests nicely, which can be helpful for dev as well:
[2021-06-29T23:18:13.255Z] INFO: mockapmserver/47289 on pink.local: request (req.remoteAddress=::ffff:127.0.0.1, req.remotePort=58727, req.bodyLength=2067, res.bodyLength=2)
POST /intake/v2/events HTTP/1.1
accept: application/json
user-agent: elasticapm-node/3.16.0 elastic-apm-http-client/9.8.1 node/12.22.1
content-type: application/x-ndjson
content-encoding: gzip
host: localhost:8200
connection: keep-alive
transfer-encoding: chunked
{
"metadata": {
"service": {
"name": "play-s3",
"environment": "development",
...
However, I think the user use case (the motivator for this is usage by Kibana core for elastic/kibana#101711) either trumps the dev case, or we look into two separate config vars here.
@elastic/apm-agent-python What's your understanding/usage of disable_send
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late reply. The reason we added disable_send
back in the day was indeed to facilitate the use of the agent for local development and CI (not our CI, the user's CI). The other option is to disable the agent completely, which bears the risk that issues of the agent inter-playing with the app only show up on deployment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's two votes for everything but sending 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as all the mentioned ways to limit overhead can already be achieved with existing settings AFAICT.
Not entirely, at least in the node.js agent. There isn't another way to avoid:
- encoding of the span/transaction objects to data objects for sending
- capturing an error (especially its stacktrace) for captured errors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't recording
or enabled
do the trick? Maybe source_lines_error_app_frames
for stacktraces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't
recording
orenabled
do the trick?
The Node.js agent doesn't currently implement recording
:) -- and I'm not entirely clear from the description of recording
in the Python and Java agent docs that it matches. The Java docs imply recording: false
means not tracking incoming HTTP requests, which means no trace-context propagation.
Maybe
source_lines_error_app_frames
for stacktraces?
Mostly, yes. Stacktrace collection is the large part of error (and span) capture, but not all of it.
tl;dr: Would anyone object to the spec here saying something like "Agents MAY reduce runtime overhead where possible, for example skipping stack trace, metrics, error collection and breakdown metric calculation."? Effectively this would allow agent implementations to differ slightly without yet having get into specifying exactly how to support comparatively rare use cases. I imagine this particular Kibana use case -- using APM, but not for traces and metrics -- is rare. This is "option 1" in the details below.
Details (you can skip if your eyes are glazing over):
We discussed this at the apm-agents-nodejs call this morning and Alan and I chatted a bit about it. Here are options that I see:
- Change the
disable_send
spec to say "agents MAY limit work ..." and basically have divergent use cases for the various agents. - Add a separate config option to more clearly state the use case for "do all your work, but be useful for CI where there isn't an APM server" (as Python and Ruby are using
disable_send=true
now). Seems unnecessary to have Python and Ruby teams prioritize this. - Add a separate config option to more clearly state the use case for "do the minimal work to support trace-context propagation and log correlation" (as Node.js and Java, sort of, are using
disable_send=true
now). I'm not sure what I would call this option. - Add separate config options to disable particular functionality (like span serialization and error capture) such that the "do the minimal work to support trace-context propagation and log correlation" use case is: disable_send=true, metrics_interval=0s, breakdown_metrics=false, serialize_objects_for_sending=false(?), capture_errors=false(?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This feels like under-specifying the option to a point where users have to rely on the implementation details of a specific agent.
- I think the name
disable_send
conveys that quite nicely already. - To me, this sounds like a good option. Finding a better name that better suits the context propagation use case (
context_propagation_only
?) makes it easier to understand the intent of the option. Agents that have already implementeddisable_send
can just use it as an alias for the new config in the first phase. In a later stage, they can implement the listed optimizations.
The spec should be more concrete on the things agents SHOULD update (such as putting events in the queue, capturing errors and stack traces) - Seems like adding even more options and making it harder for the users to set the right set of options for the context propagation/log correlation use case.
Okay, done. |
I dropped the ball on this one. I'd like to get back to this so I can settle on the new name used in the Node.js APM agent because Kibana is going to be using it. Going on @felixbarny's "To me, this sounds like a good option." the plan is:
I like Felix's suggestion of a boolean config var named |
Co-authored-by: Felix Barnsteiner <[email protected]>
…ting usage of disable_send in Python and Ruby agents
specs/agents/transport.md
Outdated
- MUST NOT log warnings/errors related to failures to communicate with APM server. | ||
- SHOULD attempt to reduce runtime overhead where possible. For example, | ||
because events will be dropped there is no need to collect stack traces, | ||
collect metrics, or to calculate breakdown metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
REVIEW NOTE: I'm considering adding that there is no need to create spans (other than the top-level transaction) in this mode, pending some discussion on slack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…xt_propagation_only (from separate discussion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we explicitly define these as optional configuration options that agents can choose not to implement if they're not relevant? (My understanding is that they're optional, and if that's correct I think we should define that in the spec.)
Co-authored-by: eyalkoren <[email protected]>
Perhaps something like starting each of those sections with |
Seems reasonable. Maybe I'm nit-picking and it's not necessary. I haven't checked if we're explicit on other optionals. |
I like the added clarity. The existing language -- "Agents that implement this configuration option" -- is vague.
It varies. Some say "should" in their opening sentence, implying that it is optional, but there is room for interpretation. |
… implement these)
This is part of my adding
disableSend
support to the Node.js APM agent.Currently the Java, Python, and Ruby agents implement
disable_send
:I'm not sure if they meet the current
MUST
/SHOULD
semantics I've proposed.@felixbarny I'd appreciate a sanity check from you on this:
MUST
s andSHOULD
s raise any red flags for my being Node.js-centric?