Eliminate Monkey Patching for Diagnostic Instrumentation #134
Comments
What are the cons of using |
AsyncHooks are fine for correlating the initiator and the corresponding callbacks for native calls. But I don't think it's really possible to extract the data passed to the functions. Besides that, I think AsyncHooks have some blind spots:
|
In AsyncHooks the embedder can pass a custom resource object where they can store any information they like. Such as the data passed to a function. |
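As a rough illustration of the point above, here is a minimal sketch of an embedder attaching data to its resource object so that a hook can read it later. The `QueryResource` class, the `HypotheticalDbQuery` type name and the `sql` field are assumptions made up for this example; only `AsyncResource`, `createHook` and `runInAsyncScope` come from `async_hooks`.

```javascript
const { AsyncResource, createHook } = require('async_hooks');

// Hypothetical embedder resource: carries the query text with the async context.
class QueryResource extends AsyncResource {
  constructor(sql) {
    super('HypotheticalDbQuery'); // type name is made up for this sketch
    this.sql = sql;               // assigned after super(), so init() cannot see it yet
  }
}

const resources = new Map(); // asyncId -> resource object
const observed = [];         // what a tracing tool would have collected

const hook = createHook({
  init(asyncId, type, triggerAsyncId, resource) {
    // Only remember resources created by our hypothetical embedder.
    if (type === 'HypotheticalDbQuery') resources.set(asyncId, resource);
  },
  before(asyncId) {
    // By now the constructor has finished, so the data is available.
    const resource = resources.get(asyncId);
    if (resource) observed.push(resource.sql);
  },
  after(asyncId) {
    // The resource is single-shot in this sketch, so drop it here.
    resources.delete(asyncId);
  },
}).enable();

const query = new QueryResource('SELECT 1');
// A real driver would call this when the result arrives; here it runs right away.
const result = query.runInAsyncScope(() => 'row');

hook.disable();
```

Note that the data is read in `before` rather than `init`, because `init` fires inside the `AsyncResource` constructor, before the subclass has assigned its own fields.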
Wouldn't you normally only implement the Embedder API on async boundaries - when queueing a callback, just before calling it, just after calling it, etc.? In APM we sometimes need hooks into other places in the code. For example, instrumenting all middleware functions in Express, or templating languages that don't make any async calls, are both quite common. |
You could also use the Embedder API for making the context flow easier to understand. A database request, even if there is no queue, can involve a hundred async operations. Wrapping that with
Indeed, for the purely synchronous cases it doesn't make sense to use I think
This we can fix. The V8 team is already working on it.
I've done a great deal of benchmarking with If someone would like to benchmark this for themselves, I have already implemented a basic version of
What details do you need? You can emit arbitrary events in
For sure |
Maybe I haven't fully understood Async Hooks yet. If my task is to trace e.g. a In our case we usually monitor high level requests, e.g. a DB request, not all the 100 async calls needed to get it done. As a result we would like to extract the high level data, not the low level protocol messages. Sure, the 100 async interactions in between also need to be tracked in a lot of cases, but just to correlate request and cb, without extracting specific data. Here async hooks look really fine. |
We started working with the OpenCensus folks to create a generic, vendor neutral tracing API. Maybe this is something we can incorporate into Node.js as well. Did anyone already research using anything like OpenCensus, Dapper, OpenTracing? I'd almost expect that, as Google is driving that. |
|
@danielkhan - I'm not aware of anything. Would be great to see something like this, at a minimum just an inventory of different approaches so we can be somewhat systematic. nodejs/node-eps#48 looks like an effort to expose Also, to be clear, we should scope this issue to addressing the "monkey-patching for gathering metadata about API calls" problem. As discussed above, people are monkey-patching for two distinct reasons - tracking async context and capturing details about API calls (e.g., params passed to a DB call). Async hooks is intended for the former. I'm not following suggestions to use Async Hooks APIs for the latter, but perhaps I'm missing something. If this is a valid approach, we should add it to the list above. |
Ted from OpenTracing here. Would you consider the OpenTracing C++11 API, or a C API? We're in the process of integrating into a number of technologies, now would be a good time to start working with you! https://github.com/opentracing/opentracing-cpp |
@mike-kaufman if I understand well, you are suggesting we would want some kind of proper overriding tooling in Node.js? |
@vdeturckheim - not sure what you mean by "proper overriding tooling". What's been expressed here is that monkey patching is invasive & fragile - i.e., it relies on internal details of some library, and when those details change, monkey-patching breaks. Consequently, there's a desire to get away from it. I think we'd like to find some solution where
|
@tedsuo last year Node.js integrated a trace engine from Google and we have been discussing how to add tracepoints etc. I'm not sure if the use of OpenTracing is relevant to this thread but it might be interesting as a discussion on its own. Ideally we would be able to plug in different trace engines under the APIs that we expose in C and JavaScript in Node.js for tracing. I've not read enough on OpenTracing to see how it would fit into that model. If you have some cycles to think about it and possibly give the diagnostics team an overview, I'd suggest opening a new issue to start that discussion. |
@mike-kaufman thanks for pointing me to this thread. @danielkhan we are discussing node.js OpenCensus SDK with Google. I think diagnostics channel and OpenCensus are very aligned. We just need to understand the proper layering and requirements of those. |
For background, OpenCensus (github) is a new project to make a language and vendor neutral distributed tracing API and SDK. This work is in concert with the w3c proposal for distributed trace context. We are hoping to extract Node.js instrumentation code from Stackdriver Trace and incorporate that into the OpenCensus SDK as a starting point. Some of the other languages already have fairly functional SDKs, but we haven't gotten that far on JavaScript just yet. Even with this in place, the question asked by @mike-kaufman in the OP still applies. How do we connect producers of high level tracing data (e.g. express) to the consumers (APM, debugging tools). This may be trace-events (but OP has valid concerns) or it may be something like the diagnostic channel module from Microsoft. I see both of these attempting to do the same thing from different ends (low-level vs. high level), and one thing I would like to achieve at the upcoming diagnostic summit is whether there is a path to intersect. Some thoughts on the specific concerns by OP:
This is a problem. We need the notion of context to be well defined within Node core. Once this notion is available, it would be fairly easy to bind trace events to the context.
This is a problem. We will need a way for the VMs to accelerate (JIT) tracing calls from JavaScript. I think we would probably want the VMs to provide an intrinsic.
IMO, this is addressable.
If we can avoid the JavaScript to C transitions, it would be hard to beat the performance of trace-events with pure JS. Trace-event is designed first of all as a low level tracing mechanism to be used primarily from C/C++. The additional benefit it provides is that it acts like a single bus that aggregates all performance event data. Diagnostic channel does the same thing, but it will be limited to only high-level (i.e. not too high frequency) performance event data. Perhaps diagnostic channel is a good starting point. Once/if we have the trace-event API available from JS, we can additionally inject the diagnostic channel data into trace events as well, giving us a single stream of all performance event data? Just some thoughts. |
Based on @ofrobots comment above, I think there's a path forward here by the following:
Notes:
Does this sound like a fair interpretation of thread so far? Please let me know if I'm missing something. |
I would also note that monkey patching requires a custom loader to work with ESM. I would like to add this as one of the problems that removing monkey patching would solve. As part of the goals, we should consider the LTS cycle. When opening PRs to 3rd-party libraries, we should be focusing on making sure that the library could run without modifications on old node releases (down to 4 ideally). Moreover, we should ensure that when turned off, this tracing layer has zero overhead. One of the advantages of monkey patching is that when it's turned off it's off. |
@mcollina - Thanks, updated summary above to reflect your comments. Note I changed "zero overhead" to "near zero" in case anyone is being pedantic. With tracing disabled, overhead should be a few instructions. |
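For readers wondering what "a few instructions when disabled" can look like in practice, here is a minimal sketch using the `diagnostics_channel` module that Node.js core shipped after this discussion. The channel name `hypothetical-db.query` and the `runQuery` function are made up for illustration; this is one possible shape under those assumptions, not the design agreed in this thread.

```javascript
const dc = require('diagnostics_channel');

// Hypothetical channel name; a real library would document its own.
const channel = dc.channel('hypothetical-db.query');

// Stand-in for a library function that wants to report what it is doing.
function runQuery(sql) {
  // With no subscribers, the cost of tracing is this one property check.
  if (channel.hasSubscribers) {
    channel.publish({ sql });
  }
  return `result of ${sql}`; // placeholder for the real work
}

const received = [];

// Tracing off: nothing is published.
runQuery('SELECT 1');

// Tracing on: a consumer (for example an APM agent) subscribes by name.
dc.subscribe('hypothetical-db.query', (message) => {
  received.push(message.sql);
});
runQuery('SELECT 2');
```

The library never needs to know who is listening, and the consumer never touches the library's internals, which is the decoupling the summary above is after.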
I don't think that only emitting trace events will fit all use cases. Currently we have to modify the arguments array (e.g. by wrapping the passed callback) in quite a few cases. Maybe async hooks can help to avoid the need to wrap callbacks in most cases, as we get the call context via these hooks. If I compare NodeJS instrumentation with other techs like Java or .NET, the main difference is in my opinion not the access to internals (for Java/.NET byte code is patched). The main difference is that in NodeJs monkey patching is done by a lot of people - not just by a few APM vendors (which are anyway not used in the same application at the same time). Catching up with changes in internals is definitely not my favorite task. But so far, experience has shown that it's harder and more cumbersome to find issues caused by combining modules that use monkey patching. |
@Flarna - if you could be more explicit about use cases that aren't handled by emitting "trace events", and any potential workarounds, that would be helpful here. E.g., for adding http headers, can you inject a piece of middleware? |
@ofrobots and I were discussing this just now -- to summarize, we worked with the following three requirements for getting out of monkeypatching for tracing:
The ecosystem adoption of What we think is that Node core could expose some API where an APM vendor would be able to specify that they want a set of key-value pairs to be added to every outgoing HTTP request upon calling |
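To make the key-value idea above a bit more concrete, here is a minimal sketch of what such a registration point might look like. Everything in it is hypothetical: `addOutgoingHeaderProvider`, `buildOutgoingHeaders` and the `x-hypothetical-trace-id` header are names invented for this example, and nothing like this is claimed to exist in Node core.

```javascript
// Hypothetical registry of functions that contribute headers to outgoing requests.
const headerProviders = [];

// An APM vendor would call this once at startup.
function addOutgoingHeaderProvider(provider) {
  headerProviders.push(provider);
}

// Stand-in for the place in core where request options are turned into headers.
function buildOutgoingHeaders(options) {
  const headers = { ...(options.headers || {}) };
  for (const provider of headerProviders) {
    // Each provider returns key-value pairs computed at call time,
    // so a trace tag can be unique per request.
    Object.assign(headers, provider(options));
  }
  return headers;
}

let nextId = 0;
addOutgoingHeaderProvider(() => ({ 'x-hypothetical-trace-id': String(++nextId) }));

const first = buildOutgoingHeaders({ host: 'example.com', headers: { accept: '*/*' } });
const second = buildOutgoingHeaders({ host: 'example.com' });
```

Using a provider function rather than a static map is a design choice of this sketch: it lets the tag be created during the actual call instead of being fixed at registration time.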
@mike-kaufman - I will try to get more concrete by describing some samples. Please note that this does not describe every detail/corner case. Example 1: Simple outgoing request without tagging; e.g. database requests
Once the wrapped callback is called (or the promise then/catch function) we do similar actions:
In case another patched function like I could imagine that this use case can be completely covered by events and async_hooks (assuming enough info is passed to the events, e.g. we have begin/end events for all functions,...)
Example 2: "Complicated" outgoing request without tagging; e.g. database request via query/stream object
Not sure if we are able to trace transactional context via async_hooks along such multi step operations. If not, the trace events emitted by the module could contain a unique identifier for the operation to allow tracking context.
Example 3: Simple outgoing request with tagging; e.g. HTTP request
Besides events to extract data and async_hooks to track context, we need some hook here to inject an HTTP header. For other RPC/Messaging protocols a similar hook would be needed. Important here is that the trace tag injected shall be unique to the operation and is created during the actual call (at (2)).
Example 4: Incoming request, e.g.
Once wrapped events or
With trace events emitted we could for sure cover most or even all of the above functionality. Not sure if async_hooks allows tracking context for all possible combinations of events/requests on HTTP request/response objects. I would expect that we need at least something on top of it to link the request event and the corresponding resp.end(). We know that there are frameworks out there which do pooling.
Example 5: Incoming HTTP request injecting a JS agent
I fear that this use case is too heavy and specific to be covered by generic hooks/events.... |
@Flarna thanks for the detailed examples. Can you elaborate on what you mean by 'request injecting a JS agent'? I think I am understanding with the rest of your description, but I am not sure what a 'JS agent' is. |
I interpreted this as needing to inject a script into a html response, but @Flarna please clarify. :) |
@mike-kaufman, @ofrobots Stepping in here. Yes it's about changing the response by adding a script tag to the body. |
@mhdawson replying in regards to using OpenTracing for C/JS instrumentation. We are adding support for dynamic loading (opentracing/opentracing-cpp#45) and will be adding a C bridge as the next step. This will allow any API-compliant tracer to be dynamically linked to the nodejs binary, much like what we are doing with envoy, nginx, postgres, etc. This has the advantage of not requiring NodeJS to be tied to a particular tracing implementation or version, as the code is separated. I can start another issue to discuss this. |
I'm not convinced it's possible to fully avoid monkey-patches. There will always be cases like needing to inspect a connection object of a database driver during a query to include host/port information with contextual data of the trace layer, needing to inspect the socket of an incoming http request to get the remote address data, and many other cases where the patches do a lot more than just naively capturing function arguments. It's very common to need to dig about in the function context or augment things to intercept things like errors as transparently as possible. edit: had this issue open since yesterday and it hadn't refreshed with any of the posts from the large one from @Flarna on. Oops. 😅 |
@Flarna - thanks for the detail above. Will be interesting to use your use cases as a function to see if we can get the right API in place. Won't be 100% at first, but should be able to evolve in that direction & the exercise will bring more clarity/definition to the shape of the necessary APIs. |
I think part of the difficulty is that given an arbitrarily implemented database driver, we can't predict how to reach (for example) its connection object automatically, and the implementor/maintainer of that driver cannot predict what information they should provide to some new potential tracing hooks/reporting API to satisfy the needs of every tool (APM or otherwise). I have seen cases of monkeypatch instrumentation tracking state across nested objects or calls in ways that the instrumented module itself doesn't track (or otherwise need to). In these cases the instrumented module itself might have a hard time reporting the data collected by the monkeypatching instrumentation. For example, finding the full route of an express request currently requires monkeypatching every layer in the routing tree and pushing each layer's route fragment into a stack during traversal. Unless I am mistaken, Express does not otherwise keep track of or have a way to easily find this full route. I imagine there are more compelling examples of this difficulty; this is just one I am familiar with. Maybe the potential tracing reporting API would allow Express to report the full route piece-by-piece, but then we're still relying on Express maintainers to understand and accept a pull request that does this. Maybe we can count on Express to do such a thing, but can we count on every tracing-relevant module out there to do such a thing? And to collate and report every potential data point that any tool out there might find interesting? And can we count on end-users to use updated versions of all these modules? I think it could be within the realm of possibility to eliminate the need to monkeypatch Node core libraries, but similar to @Qard I am not convinced that it is possible to avoid monkeypatching 3rd-party ecosystem modules. 
I think it'll be enough of a challenge to have everyone in the ecosystem propagate asynchronous contexts correctly in the face of various queueing/pooling/promise-caching scenarios, much less spit out sufficient trace data for every tracing/diagnostic use case to fully avoid monkeypatching. We probably don't want to end up with every APM/diagnostic-tool vendor trying to send PRs to every library they want to trace, trying to have them report some additional piece of diagnostic data that one vendor cares about but another doesn't. I imagine that with the natural competitive inclinations of vendors wanting to provide features/capabilities that others don't, we would either see monkeypatching continue or we might even see ecosystem module maintainers caught in the middle of an awkward "what all should this module report?" tug of war. I don't mean to just be a naysayer - I think reducing the need for/incidence of monkeypatching would be excellent, and I think the plan/steps Mike laid out above can satisfy a huge portion of use cases and go a very long way toward that goal. I just hesitate at the notion that monkeypatching can be eliminated entirely; despite the downsides, it's practically a feature of JavaScript. I would be very interested to read about how other languages/platforms handle this sort of need, if anybody is familiar with good resources or examples. I know some other languages have handled asynchronous context tracking brilliantly, so maybe there is also something to be learned in this adjacent area. |
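The Express route example mentioned above can be sketched roughly as follows. This is not Express: the layer objects, `instrument`, `routeStack` and `fullRoutes` are all hypothetical names for a toy router, used only to show how a patch can assemble state that the instrumented module itself never tracks.

```javascript
const routeStack = []; // fragments of the route matched so far
const fullRoutes = []; // what the instrumentation would report

// Wrap each layer's handle function so it pushes its fragment on entry
// and pops it on exit. Leaf layers report the joined route.
function instrument(layer) {
  const original = layer.handle;
  layer.handle = function patchedHandle(req) {
    routeStack.push(layer.fragment);
    try {
      if (!layer.children) fullRoutes.push(routeStack.join(''));
      return original.call(this, req);
    } finally {
      routeStack.pop();
    }
  };
  (layer.children || []).forEach(instrument);
}

// Hypothetical routing tree: /api -> /users -> /:id
const idLayer = { fragment: '/:id', handle: (req) => `user ${req.id}` };
const usersLayer = {
  fragment: '/users',
  children: [idLayer],
  handle: (req) => idLayer.handle(req), // looks up handle at call time
};
const apiLayer = {
  fragment: '/api',
  children: [usersLayer],
  handle: (req) => usersLayer.handle(req),
};

instrument(apiLayer);
const response = apiLayer.handle({ id: 7 });
```

The sketch only works because the toy router is fully synchronous; with async handlers the stack would have to travel with the async context, which is where the two problems discussed in this thread meet.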
Speaking as a vendor, I would appreciate a generic API. The agent part of an APM product is just a fraction of the value proposition of APM. I don't think that there is a generic way to instrument every module there is. For me the MongoDB APM vendor API is a good example of how a module vendor provides a way for APM vendors to register callbacks instead of monkey patching. |
Fully agree here. The trace events have to give more info than just function arguments. Depending on the concrete use case it may be quite some effort to put all this data together.
I fear not - we even see customers insisting on use of NodeJS 0.12. Others use ancient versions of NodeJs 4.
As far as I know the approach in Java and .NET is patching bytecode during loading. This requires you to be an "agent module" loaded in a special way - not like in NodeJS. |
RE other runtimes, .net is now using something called DiagnosticSource, which is similar in function to DiagnosticChannel. |
@LewisJEllis - thanks for your comments. A brief response to some of your points:
|
Even if you want to use OpenCensus, they seem to be interested in implementing the OpenTracing interface. OpenTracing is an interface behind which the actual tracer can do tracer-specific work. Here is the issue regarding the OpenCensus Go client, which easily applies to all of the clients: census-instrumentation/opencensus-go#502. |
closing in lieu of #180. |
Forked from issue #95 so we can track separately:
@watson wrote:
To frame the problem a bit more concisely: Current APM vendors have to monkey-patch libraries to produce diagnostic data. E.g., if you want to know specific details of a DB query, one would monkey-patch the DB driver to capture any necessary params and stats.
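For anyone less familiar with the technique, here is a minimal sketch of the kind of patch described above. The `driver` object and its `query(sql, callback)` signature are hypothetical stand-ins for a third-party DB module, and the driver calls back synchronously only to keep the example short.

```javascript
// Hypothetical driver standing in for a third-party module.
// A real driver would perform I/O before calling back.
const driver = {
  query(sql, callback) {
    callback(null, [{ ok: 1 }]);
  },
};

const captured = []; // what the APM agent collects

// The patch: swap driver.query for a wrapper that records the SQL and wraps
// the callback to capture the outcome and duration. It relies on the exact
// (sql, callback) signature, so it breaks if the driver changes that shape.
const originalQuery = driver.query;
driver.query = function patchedQuery(sql, callback) {
  const start = process.hrtime.bigint();
  return originalQuery.call(this, sql, function wrappedCallback(err, rows) {
    captured.push({
      sql,
      failed: Boolean(err),
      durationNs: process.hrtime.bigint() - start,
    });
    return callback.call(this, err, rows);
  });
};

// Application code is unchanged and unaware of the patch.
let rowsSeen = null;
driver.query('SELECT 1', (err, rows) => {
  rowsSeen = rows;
});
```

The application keeps working, but the agent now depends on an internal detail of the driver that the driver's maintainers never promised to keep stable.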
This is problematic because
We would like to get to a solution that has the following characteristics: