[PROF-11003] Fix unsafe initialization when using profiler with otel tracing #4195

Merged: 4 commits into master from ivoanjo/prof-11003-fix-profiler-otel-initialization on Dec 5, 2024

ivoanjo (Member) commented Dec 5, 2024

What does this PR do?

This PR fixes two issues in the profiler's support for reading from the OpenTelemetry ("otel") context:

  1. If an exception was raised during the initialization of our otel reading code, we rescued it but did not properly clear it, so it could still confuse the app or cause weird behaviors.

  2. Initialization of the otel reading code could happen during an allocation sample, where it's not safe to run arbitrary Ruby code (including raising exceptions).

Motivation:

I suspect these issues may be linked to a customer crash:

[BUG] unexpected situation - recordd:1 current:0

-- C level backtrace information -------------------------------------------
ruby(rb_print_backtrace+0x11) [0x55ba03ccf90f] vm_dump.c:820
ruby(rb_vm_bugreport) vm_dump.c:1151
ruby(bug_report_end+0x0) [0x55ba03e91607] error.c:1042
ruby(rb_bug_without_die) error.c:1042
ruby(die+0x0) [0x55ba03ac0998] error.c:1050
ruby(rb_bug) error.c:1052
ruby(disallow_reentry+0x0) [0x55ba03ab6dcc] vm_sync.c:226
ruby(rb_ec_vm_lock_rec_check+0x1a) [0x55ba03cb17aa] eval_intern.h:144
ruby(rb_ec_tag_state) eval_intern.h:155
ruby(rb_vm_exec) vm.c:2484
ruby(vm_invoke_proc+0x201) [0x55ba03cb62b1] vm.c:1509
ruby(rb_vm_invoke_proc+0x33) [0x55ba03cb65d3] vm.c:1728
ruby(thread_do_start_proc+0x176) [0x55ba03c63516] thread.c:598
ruby(thread_do_start+0x12) [0x55ba03c648a2] thread.c:615
ruby(thread_start_func_2) thread.c:672
ruby(nt_start+0x107) [0x55ba03c65137] thread_pthread.c:2187
/lib/x86_64-linux-gnu/libpthread.so.0(start_thread+0xd9) [0x7ff360b66609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7ff360a70353]

...but I could not reproduce it myself. Nevertheless, the things being fixed were still definitely bugs ;)

Change log entry

Yes. Fix unsafe initialization when using profiler with otel tracing

Additional Notes:

N/A

How to test the change?

This change includes test coverage (took longer than the fixes to code :P)

The docs for `rb_protect` clearly say that we must call
`rb_set_errinfo(Qnil)` if we want the exception to be cleanly ignored.
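
The rescue-then-clear pattern can be sketched without the Ruby headers. What follows is a toy model in plain C, not the Ruby C API: `toy_protect`, `toy_raise`, and `toy_errinfo` are hypothetical stand-ins for `rb_protect`, raising an exception, and the VM's pending-exception slot, and `toy_clear_errinfo()` stands in for `rb_set_errinfo(Qnil)`. It only illustrates that catching the non-local jump does not, by itself, forget the recorded error.

```c
#include <setjmp.h>
#include <stddef.h>

/* Toy model only: these names are hypothetical stand-ins, not the Ruby C API. */
static jmp_buf toy_jump_target;         /* where toy_raise() unwinds to */
static const char *toy_errinfo = NULL;  /* models the pending-exception slot */

/* Models raising an exception: record it, then unwind to toy_protect(). */
static void toy_raise(const char *message) {
  toy_errinfo = message;
  longjmp(toy_jump_target, 1);
}

/* Models rb_protect: runs func and returns 0 on success, nonzero if it raised.
 * Note that it does NOT clear the recorded error; that is the caller's job. */
static int toy_protect(void (*func)(void)) {
  if (setjmp(toy_jump_target) == 0) {
    func();
    return 0;
  }
  return 1;
}

/* Models rb_set_errinfo(Qnil). */
static void toy_clear_errinfo(void) { toy_errinfo = NULL; }

/* Stand-in for initialization code that fails. */
static void failing_init(void) { toy_raise("constant not found"); }

/* Rescues the failure and cleans up; returns 1 if init succeeded, 0 otherwise. */
static int try_init_ignoring_errors(void) {
  int state = toy_protect(failing_init);
  if (state != 0) {
    toy_clear_errinfo(); /* without this, the stale error stays visible */
    return 0;
  }
  return 1;
}
```

In this model, calling `toy_protect(failing_init)` directly leaves `toy_errinfo` set, while `try_init_ignoring_errors()` leaves it `NULL`.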

When running inside an allocation sample (NEWOBJ tracepoint), it's not
safe to allocate further new objects (including exceptions to be
raised).

One such example of an allocation is when calling
`read_otel_current_span_key_const` to initialize the otel span key.

Thus, this commit introduces the `is_safe_to_allocate_objects`
flag (and plumbs it around a bunch of methods...) so that we can
gate calls to `read_otel_current_span_key_const` and not perform
them when they're not safe.
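
A rough sketch of the gating idea in plain C, with no Ruby headers. Every name other than `is_safe_to_allocate_objects` is hypothetical, and the real `read_otel_current_span_key_const` is replaced by a stub; the actual plumbing in the profiler may look quite different.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified collector state used only for this illustration. */
typedef struct {
  const char *otel_current_span_key; /* NULL until lazily initialized */
  int init_attempts;                 /* counts stub calls, for illustration */
} collector_state;

/* Stub standing in for read_otel_current_span_key_const, which in the real
 * code may allocate objects or raise. */
static const char *stub_read_otel_current_span_key(collector_state *state) {
  state->init_attempts++;
  return "current-span-key";
}

/* Lazily initializes the key, but only when the caller says allocating is safe.
 * Returns the key, or NULL if it is not available yet. */
static const char *otel_span_key_for_sample(collector_state *state,
                                            bool is_safe_to_allocate_objects) {
  if (state->otel_current_span_key == NULL) {
    /* e.g. inside a NEWOBJ tracepoint: skip initialization for now */
    if (!is_safe_to_allocate_objects) return NULL;
    state->otel_current_span_key = stub_read_otel_current_span_key(state);
  }
  return state->otel_current_span_key;
}
```

In this sketch, a sample taken with the flag set to `false` simply goes without the key until some later sample, taken where allocating is safe, performs the initialization; after that, the cached key is returned regardless of the flag.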

I was testing on 3.1 so I missed this one >_>
@ivoanjo ivoanjo requested review from a team as code owners December 5, 2024 12:26

github-actions bot commented Dec 5, 2024

Thank you for updating Change log entry section 👏

Visited at: 2024-12-05 14:06:54 UTC

@github-actions github-actions bot added the profiling Involves Datadog profiling label Dec 5, 2024
@ivoanjo ivoanjo added bug Involves a bug otel OpenTelemetry-related changes labels Dec 5, 2024

datadog-datadog-prod-us1 bot commented Dec 5, 2024

Datadog Report

Branch report: ivoanjo/prof-11003-fix-profiler-otel-initialization
Commit report: 47b1706
Test service: dd-trace-rb

✅ 0 Failed, 22411 Passed, 1459 Skipped, 5m 51.73s Total Time

pr-commenter bot commented Dec 5, 2024

Benchmarks

Benchmark execution time: 2024-12-05 12:48:45

Comparing candidate commit 47b1706 in PR branch ivoanjo/prof-11003-fix-profiler-otel-initialization with baseline commit f354358 in branch master.

Found 0 performance improvements and 1 performance regressions! Performance is the same for 30 metrics, 2 unstable metrics.

scenario:profiler - sample timeline=false

  • 🟥 throughput [-0.511op/s; -0.463op/s] or [-7.901%; -7.172%]

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.75%. Comparing base (f354358) to head (47b1706).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4195      +/-   ##
==========================================
- Coverage   97.76%   97.75%   -0.01%     
==========================================
  Files        1357     1357              
  Lines       81890    81914      +24     
  Branches     4164     4164              
==========================================
+ Hits        80060    80076      +16     
- Misses       1830     1838       +8     


@ivoanjo ivoanjo merged commit 6f2057b into master Dec 5, 2024
348 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-11003-fix-profiler-otel-initialization branch December 5, 2024 14:07
@github-actions github-actions bot added this to the 2.8.0 milestone Dec 5, 2024
@ivoanjo ivoanjo mentioned this pull request Dec 10, 2024