Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking the NodeJS event loop #998

Merged
merged 7 commits into from
Jul 8, 2024
Merged

Tracking the NodeJS event loop #998

merged 7 commits into from
Jul 8, 2024

Conversation

grcevski
Copy link
Contributor

@grcevski grcevski commented Jul 5, 2024

We've had a long standing issue with context propagation for NodeJS applications, where we can't really process more than one context at a time, essentially making context propagation for NodeJS broken.

This PR ads some tracking of the active async_id for the NodeJS event loop.

The idea is to make the context tracking in Beyla based on the PID:TID + additional ID, which will be 0 for languages that don't need another ID. For NodeJS we'll poke at the active Node transaction for the PID:TID and match that way.

Ideally only a probe on EmitAsyncInit is needed, but the IDs are double floats (e.g. number in Javascript). The system ABI for float arguments is register based and at present time it's not possible to read floating point registers in kprobes/uprobes. Some kernel mailing list discussion about adding support for floats here https://lore.kernel.org/bpf/CAO658oXvAN12PFQhAQR2UXs78K-1vF3tAefd6-ToEzzQucNM=Q@mail.gmail.com/T/. Since we can't read the float registers, we resort to an additional probe on AsyncReset, which allows us to save the AsyncWrap pointer and then in EmitAsyncInit we read the two values, async_id_ and trigger_async_id_.

TODO:

  • Harder to pass integration tests

Limitations:

  • Doesn't work with SSL yet. SSL seems to increase the async_id_ in JavaScript interpreted code, so we can't match the parent transaction. More work is required here.

@codecov-commenter
Copy link

codecov-commenter commented Jul 5, 2024

Codecov Report

Attention: Patch coverage is 95.29412% with 4 lines in your changes missing coverage. Please review.

Project coverage is 80.84%. Comparing base (31e98da) to head (4686a89).

Files Patch % Lines
pkg/internal/ebpf/nodejs/nodejs.go 94.59% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #998      +/-   ##
==========================================
+ Coverage   80.75%   80.84%   +0.09%     
==========================================
  Files         134      135       +1     
  Lines       10733    10813      +80     
==========================================
+ Hits         8667     8742      +75     
- Misses       1559     1562       +3     
- Partials      507      509       +2     
Flag Coverage Δ
integration-test 56.30% <95.29%> (+0.31%) ⬆️
k8s-integration-test 58.91% <4.70%> (-0.32%) ⬇️
oats-test 36.18% <4.70%> (-0.26%) ⬇️
unittests 50.44% <0.00%> (-0.40%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@grcevski grcevski marked this pull request as ready for review July 5, 2024 20:39
@grcevski grcevski requested review from mariomac and marctc as code owners July 5, 2024 20:39
@grcevski grcevski changed the title WIP: Tracking the NodeJS event loop Tracking the NodeJS event loop Jul 5, 2024
@@ -121,7 +121,6 @@ int BPF_KPROBE(kprobe_tcp_rcv_established, struct sock *sk, struct sk_buff *skb)
// If the source port for a client call is lower, we'll get this wrong.
// TODO: Need to fix this.
pid_info.orig_dport = pid_info.p_conn.conn.s_port,
task_tid(&pid_info.c_tid);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't required anymore. I figured out a better way to track what I need for cleaning up the parent trace information. Essentially, now I store the namespaced threadID and the extra runtime ID on the HTTP request. It's the most accurate information to use on cleanup, since we use those to create the parent request trace.

@@ -628,8 +624,8 @@ int BPF_KPROBE(kprobe_sys_exit, int status) {
return 0;
}

pid_key_t task = {0};
task_tid(&task);
trace_key_t task = {0};
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Extended data structure. It now has an extra ID, which is the current runtime ID. For now it's 0 for all other languages than NodeJS, and it's the current async_id for NodeJS.

@@ -96,6 +95,8 @@ typedef struct http_info {
// with other instrumented processes
pid_info pid;
tp_info_t tp;
u64 extra_id;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are now stored here so we can correctly clean-up the server trace information when the HTTP request is finished.

__uint(pinning, LIBBPF_PIN_BY_NAME);
} async_reset_args SEC(".maps");

SEC("uprobe/node:AsyncReset")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracks this NodeJS function to remember the AsyncWrap NodeJS pointer. We then use the pointer to read the async_id_ and async_trigger_id_ in EmitAsyncInit.

bpf_probe_read_user(&trigger_async_id, sizeof(u64), ((void *)wrap) + async_wrap_trigger_async_id_off);

if (async_id) {
bpf_map_update_elem(&active_nodejs_ids, &id, &async_id, BPF_ANY);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Saves the current async_id_ in play and the child -> parent relationship with the trigger_async_id_.

#include "pid_types.h"
#include "nodejs.h"

static __always_inline u64 extra_runtime_id() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meant to support other runtimes which have internal threading models, for now only checks NodeJS.

if (p_tid) {
// Lookup now to see if the parent was a request
c_tid = *p_tid;
if (t_key.extra_id) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have runtime ID, we lookup in the runtime parent table, else we look at the OS level thread information as before.


func (p *Tracer) UProbes() map[string]map[string]ebpfcommon.FunctionPrograms {
return map[string]map[string]ebpfcommon.FunctionPrograms{
"node": {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These symbols have been verified with Node 18, 20 and 22.

Copy link
Contributor

@mariomac mariomac left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing! I like the idea of extra ID information for some runtimes

@grcevski grcevski merged commit 31390b2 into grafana:main Jul 8, 2024
6 checks passed
@grcevski grcevski deleted the nodejs branch July 8, 2024 15:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants