-
Notifications
You must be signed in to change notification settings - Fork 225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
new "run context" management to fix numerous span hierarchy issues #2181
Conversation
💚 Build Succeeded
Expand to view the summary
Build stats
Test stats 🧪
🤖 GitHub commentsTo re-run your PR in the CI, just comment with:
|
This is still very much draft work.
The other half is this in elastic-apm-http-client: --- apm-nodejs-http-client/index.js 2021-06-09 15:57:41.000000000 -0700 +++ node_modules/elastic-apm-http-client/index.js 2021-08-09 15:15:37.000000000 -0700 @@ -205,9 +205,11 @@ switch (this._conf.serverUrl.protocol) { case 'http:': this._transport = http + this._uninstrumentedRequest = http.request[this._conf.originalSym] break case 'https:': this._transport = https + this._uninstrumentedRequest = https.request[this._conf.originalSym] break default: throw new Error('Unknown protocol ' + this._conf.serverUrl.protocol) @@ -794,7 +796,8 @@ } // Start the request and set its timeout. - const intakeReq = client._transport.request(client._conf.requestIntake) + console.warn('XXX making intake request', client._conf.requestIntake.path) + const intakeReq = client._uninstrumentedRequest(client._conf.requestIntake) if (Number.isFinite(client._conf.serverTimeout)) { intakeReq.setTimeout(client._conf.serverTimeout) }
…orks (see examples/parent-child.js case) We will have to do the same for other aws-sdk, for ES, and evaluate for other instrumentations. Note this is an improvement over current behaviour which doesn't handle cases like the parent/child reln above anyway.
e48bf5f
to
334d7ef
Compare
For example, this handles the problem case from #1964
need to use setTimeout to avoid the race, though it *is* slower running tests. Just a ~15s test run, whcih is fine for one test file.
…ripped me up on test/sanitize-field-names tests which assume that done() call is immediate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've taken a few passes at this and left some comments/questions, but overall it scans, is well thought out, and looks like it'll work. 👍 There's a few additional "real world" tests I intend to perform for an extra level of certainty, but I want to get this review out now instead of later :)
if (!this._started) { | ||
return null | ||
} | ||
return this._runCtxMgr.active().currTransaction() || null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Deep calls here (_runCtxMgr.active()
) should be safe. _runCtxMgr
is always set if the instrumentation is started, and the base class active
method of a context manager will always return an object.
// interface. | ||
// | ||
// (This class has largerly the same API as @opentelemetry/api `ContextManager`. | ||
// The implementation is adapted from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: oTel's license is Apache-2 and we're BSD-2. Do we want/need to include a copy of the Apache license along with these files, or is the code unique enough that we don't need to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did add an entry to NOTICE.md following the pattern there. I did not include the full Apache-2 license text, however, because that struck me as redundant with the link to it, and the fact that the apache license text is very large and would grokking the notice file more difficult. I'd be fine including the full text, however.
https://github.com/elastic/apm-agent-nodejs/pull/2181/files?w=1#diff-53337bf7ea91d782c573ac8653c18f95b7b23c38da9de08288639bcec9993406
// A run context manager that uses an async hook to automatically track | ||
// run context across async tasks. | ||
// | ||
// (Adapted from https://github.com/open-telemetry/opentelemetry-js/blob/main/packages/opentelemetry-context-async-hooks/src/AsyncHooksContextManager.ts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: oTel's license is Apache-2 and we're BSD-2. Do we want/need to include a copy of the Apache license along with these files, or is the code unique enough that we don't need to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See NOTICE.md
// "patch-async.js" it does an adequate job of context tracking for much of the | ||
// core Node.js API. | ||
// | ||
// (Adapted from https://github.com/open-telemetry/opentelemetry-js/blob/main/packages/opentelemetry-context-async-hooks/src/AsyncHooksContextManager.ts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: oTel's license is Apache-2 and we're BSD-2. Do we want/need to include a copy of the Apache license along with these files, or is the code unique enough that we don't need to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See NOTICE.md
// anything for other code bound to the original RunContext (e.g. via | ||
// `ins.bindFunction` or `ins.bindEmitter`). | ||
// | ||
// RunContext is roughly equivalent to OTel's `Context` interface in concept. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: oTel's license is Apache-2 and we're BSD-2. Do we want/need to include a copy of the Apache license along with these files, or is the code unique enough that we don't need to do this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See NOTICE.md
@@ -110,7 +109,3 @@ function unwrap (nodule, name) { | |||
return nodule[name][symbols.unwrap]() | |||
} | |||
} | |||
|
|||
function isWrapped (wrapped) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting that isWrapped
is an exported function from lib/instrumentation/shimmer.js
, and removing it will cause things to break if someone's directly using our shimmer. The shimmer isn't listed as a formal API, but its importance in custom instrumentation means people may be using it.
Question: Lacking clear evidence either way -- do you think this is safe to remove, or should we leave it here out of an abundance of caution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really want an abundance of caution to dictate not being able to remove undocumented internal APIs. The same could be said for all the API changes I'm making here on the Instrumentation
object, no?
This isWrapped
method was added in #429 solely for a fix to a single version of mimic-response (version 1.0.0). The issue was fixed in [email protected], which was released in 2018.
I'd like to say that if someone is doing a custom instrumentation and then adding it via apm.addPatch()
, that they should bring along their own shimmer.js and not use the private one.
How strongly do you stand on this? :)
} else { | ||
this._runCtxMgr = new BasicRunContextManager(this._log) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OOP Question: Should loading the patch-async
module (require('./patch-async')(this)
) be the responsibility of the context manager object, or is there value to having this happen seperately? (I'm thinking a PatchAsyncRunContextManager
class that extends the BasicRunContextManager
class and calls require('./patch-async')(this)
in its constructor (after calling super()
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a reasonable thing to add, yah. I'd prefer to do it in a separate change, or at least not unless there is other work we want to do this PR. Note that I do intend to come back to this area later to add a RunContextManager based on Async Local Storage.
if (typeof target !== 'function') { | ||
return target | ||
} | ||
// this._log.trace('bind %s to fn "%s"', runContext, target.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: stray comment or deliberate debugging code left in place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deliberate. When debugging context propagation issues, this debug print is sometimes useful to understand some of the mechanics. However, it is waay too noisy to leave in by default, even at 'trace' level. Also the use of target.name
isn't obvious to reproduce from memory later.
If you prefer, I could move this out to this comment in DEVELOPMENT.md.
@trentm I've finished the last of the real world smoke testing I wanted to do. One possible thing of interest -- I noticed was that with a handlebars template we went from having the spans listed Before: After: I believe this behavior is within expectations and more accurately reflects the behavior of the wrapped still a 👍 /approval from here. |
@astorm Could this possibly have been due to #2326 (which is merged to master, but not yet in a release)? I.e. could your "before" code have been the latest release rather than the latest master? |
@dgieselaar If/when you have a chance, I would love to get your feedback on whether this branch of the APM agent seems to work for you for self-APM-tracing of Kibana. The following are my notes from playing with using this branch to trace a dev build of the current 7.x branch of Kibana. No problem if you don't have time. My hope is to merge this later this week. Some Kibana self-APM playing with this:
diff --git a/package.json b/package.json
index 76177b90191..22d1ba1c0e7 100644
--- a/package.json
+++ b/package.json
@@ -221,7 +221,7 @@
"deep-freeze-strict": "^1.1.1",
"deepmerge": "^4.2.2",
"del": "^5.1.0",
- "elastic-apm-node": "^3.21.1",
+ "elastic-apm-node": "git://github.com/elastic/apm-agent-nodejs.git#trentm/run-ctx-mgr",
"execa": "^4.0.2",
"exit-hook": "^2.2.0",
"expiry-js": "0.1.7", and create a "config/apm.dev.js" with content like this: module.exports = {
active: true,
centralConfig: false,
serverUrl: 'https://trentm-play5.apm.us-west1.gcp.cloud.es.io',
breakdownMetrics: false,
transactionSampleRate: 1, // Note I'm going for no sampling here. Often this is reduced to 0.1 or something for fear APM perf impact may be large.
metricsInterval: '120s',
captureHeaders: false,
captureBody: 'off',
logUncaughtExceptions: true,
captureSpanStackTraces: false,
secretToken: '[REDACTED]'
}
The only real thing to notice here is that the parent/child relationship for the HTTP spans under ES spans is handled (this is #1963) Separate work, i.e. a different PR, will remove those HTTP spans. |
@trentm Love to see the progress here, will try to have a look tomorrow. Thanks for doing this! |
@trentm some quick observations:
|
@dgieselaar Thanks for looking!
Good. That is expected, assuming I understand you correctly.
My understanding is this is the same issue as you described earlier in #1964 I expect that: ending span A, then starting span B in the same async task, should result in span B not being a child of A.
However, the possible subtlety is: if span B is started in an async task that was created when span A was still active, then span B will be a child of A, even if A has ended by the time that async task runs. var spanA = apm.startSpan('A')
setImmediate(function () {
var spanB = apm.startSpan('B') // B *is* a child of A, even though A ended before this async task ran.
})
spanA.end() |
@trentm I think it happens in this scenario:
E.g., for this route, we end up with Is that expected? |
@dgieselaar Thanks very much for the scenario. I believe I've reproduced and understand roughly what is happening. tl;dr: Yes, this (still requiring the My sample code is at https://gist.github.com/trentm/6f118e2abb8188b8fe3d92cd1d7cbd8f if you care. The beginnings of the Running the "dario.js" sample code using
|
@trentm: perfect! looks like your version will work with sync functions too (ie a function that does not return a promise), is that correct? |
@dgieselaar That is the plan. I haven't tested that yet, though. |
💚 Build Succeeded
Expand to view the summary
Build stats
Test stats 🧪
🤖 GitHub commentsTo re-run your PR in the CI, just comment with:
|
This is a big change. Thanks for reviewing it. If you are just here to see what behaviour has changed, scroll down to the "Behaviour changes" section.
Suggestions for reviewing
Consider looking at https://github.com/elastic/apm-agent-nodejs/pull/2181/files?w=1 which ignores whitespace diffs. At least the change in
s3.js
will be much smaller.Consider doing a first pass that checks
[ ] Viewed
for all test files whose only changes are:.testReset()
is the explicit mechanism for tests to reset run context tracking and still (mostly) be able to re-use the Instrumentation in the same test process.I've tried to add a "REVIEW NOTE: ..." comment in a number of places in the diff to given rationale for changes.
The instrumentation changes I've made to "lib/instrumentation/modules/aws-sdk/s3.js" is the best example of using the new run context handling. I plan to make some similar improvements to some other instrumentations in subsequent changes.
Terminology
I've use the term "run context" for what we've been tending to call "context" or "async context". A bare "context" is ambiguous -- we have (a) the "context" object on span and transaction objects sent to APM server, (b) trace-"context" for the W3C trace-context headers/data, (c) lambda's "context" object. I avoided using "async context" because we also track context through synchronous execution.
Overview
This PR makes a number of changes to the
Instrumentation
API that is used by the instrumentations under "lib/instrumentation/modules/". Tracking the "current transaction" and "current span" is now entirely encapsulated in tracking the current "run context" (seeRunContext.js
) via a "run context manager" singleton (agent._instrumentation._runCtxMgr
). This is the thing that usesasync_hooks
(AsyncHooksRunContextManager
) or not (BasicRunContextManager
) to track the run context for the currently running JS. They use an interface (and some implementation) very similar to OTel's ContextManager.The primary reason our context management can't be closer to OTel's is because our
apm.startTransaction(...)
,apm.startSpan(...)
,span.end()
, et al APIs change the current run context. IIUC, this is pretty odd as current APM APIs go. It is also reponsible for some confusion and bugs around parent/child relationships. More typical is an API that wraps a block to run in the context of a new span. Something like the following (much like the withSpan util Dario added to Kibana):Eventually we should add an API like this.
Here is brief summary of the parts of the Instrumentation API used for run context tracking:
Part of run context tracking is handled by an async-hook (or eventually other mechanisms) tracking new async tasks. The rest of run context tracking is in explicit calls to "bind" a function call to a particular run context.
Creating and ending transactions and spans:
Getting and working with run contexts:
Behaviour changes
First, this PR introduces no changes to the public API.
There are, however, the following changes in behaviour.
If user code creates span A, then creates span B in the same async task: B will be a child of A.
Before this change this was not the case. See
startSpan('A'); startSpan('B')
in the same async op: B should be a child of A, but isn't #1889 for discussion. This is why the HTTP span(s) under an ES span is not a child of the ES span; ditto for aws-sdk instrumentation. The "nested spans" case in "test/instrumentation/index.test.js" tests this case.Fixing this, also lays the ground work for excluding the HTTP span under ES and aws-sdk spans (S3, SQS); also for exit span handling.
Before this PR, an ended transaction would linger as
apm.currentTransaction
. Not any more. The'errors should have a transaction id - ended transaction'
test case was removed from "test/instrumentation/index.test.js" for this. This test was added in Addtransaction.id
to all errors #147. My read of that is that there is no need to associate an error with a transaction if it is captured after the transaction has ended. In other words, I don't think this is a breaking change.Not a behaviour change, but a possibly significant bug fix is the new implementation of
ins.bindEmitter
. It fixes an issue whereremoveListener
would fail to actually remove if the same handler function was added to multiple events. The'removes listeners properly'
test case in "test/instrumentation/index.test.js" tests this.Checklist