-
Notifications
You must be signed in to change notification settings - Fork 108
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tracking the NodeJS event loop #998
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #998 +/- ##
==========================================
+ Coverage 80.75% 80.84% +0.09%
==========================================
Files 134 135 +1
Lines 10733 10813 +80
==========================================
+ Hits 8667 8742 +75
- Misses 1559 1562 +3
- Partials 507 509 +2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
@@ -121,7 +121,6 @@ int BPF_KPROBE(kprobe_tcp_rcv_established, struct sock *sk, struct sk_buff *skb) | |||
// If the source port for a client call is lower, we'll get this wrong. | |||
// TODO: Need to fix this. | |||
pid_info.orig_dport = pid_info.p_conn.conn.s_port, | |||
task_tid(&pid_info.c_tid); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't required anymore. I figured out a better way to track what I need for cleaning up the parent trace information. Essentially, now I store the namespaced threadID and the extra runtime ID on the HTTP request. It's the most accurate information to use on cleanup, since we use those to create the parent request trace.
@@ -628,8 +624,8 @@ int BPF_KPROBE(kprobe_sys_exit, int status) { | |||
return 0; | |||
} | |||
|
|||
pid_key_t task = {0}; | |||
task_tid(&task); | |||
trace_key_t task = {0}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extended data structure. It now has an extra ID, which is the current runtime ID. For now it's 0 for all other languages than NodeJS, and it's the current async_id for NodeJS.
@@ -96,6 +95,8 @@ typedef struct http_info { | |||
// with other instrumented processes | |||
pid_info pid; | |||
tp_info_t tp; | |||
u64 extra_id; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are now stored here so we can correctly clean-up the server trace information when the HTTP request is finished.
__uint(pinning, LIBBPF_PIN_BY_NAME); | ||
} async_reset_args SEC(".maps"); | ||
|
||
SEC("uprobe/node:AsyncReset") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracks this NodeJS function to remember the AsyncWrap NodeJS pointer. We then use the pointer to read the async_id_ and async_trigger_id_ in EmitAsyncInit.
bpf_probe_read_user(&trigger_async_id, sizeof(u64), ((void *)wrap) + async_wrap_trigger_async_id_off); | ||
|
||
if (async_id) { | ||
bpf_map_update_elem(&active_nodejs_ids, &id, &async_id, BPF_ANY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Saves the current async_id_ in play and the child -> parent relationship with the trigger_async_id_.
#include "pid_types.h" | ||
#include "nodejs.h" | ||
|
||
static __always_inline u64 extra_runtime_id() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meant to support other runtimes which have internal threading models, for now only checks NodeJS.
if (p_tid) { | ||
// Lookup now to see if the parent was a request | ||
c_tid = *p_tid; | ||
if (t_key.extra_id) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have runtime ID, we lookup in the runtime parent table, else we look at the OS level thread information as before.
|
||
func (p *Tracer) UProbes() map[string]map[string]ebpfcommon.FunctionPrograms { | ||
return map[string]map[string]ebpfcommon.FunctionPrograms{ | ||
"node": { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These symbols have been verified with Node 18, 20 and 22.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing! I like the idea of extra ID information for some runtimes
We've had a long standing issue with context propagation for NodeJS applications, where we can't really process more than one context at a time, essentially making context propagation for NodeJS broken.
This PR ads some tracking of the active async_id for the NodeJS event loop.
The idea is to make the context tracking in Beyla based on the PID:TID + additional ID, which will be 0 for languages that don't need another ID. For NodeJS we'll poke at the active Node transaction for the PID:TID and match that way.
Ideally only a probe on EmitAsyncInit is needed, but the IDs are double floats (e.g. number in Javascript). The system ABI for float arguments is register based and at present time it's not possible to read floating point registers in kprobes/uprobes. Some kernel mailing list discussion about adding support for floats here https://lore.kernel.org/bpf/CAO658oXvAN12PFQhAQR2UXs78K-1vF3tAefd6-ToEzzQucNM=Q@mail.gmail.com/T/. Since we can't read the float registers, we resort to an additional probe on AsyncReset, which allows us to save the AsyncWrap pointer and then in EmitAsyncInit we read the two values, async_id_ and trigger_async_id_.
TODO:
Limitations: