
[NO-TICKET] Redesign GC profiling to add timeline support and reduce overhead #3313

Merged: 21 commits merged into master on Dec 14, 2023

@ivoanjo (Member) commented on Dec 8, 2023

What does this PR do?

This PR redesigns the garbage collection profiling feature to add support for timeline and reduce its overhead.

Here's how it looks in timeline mode:

[screenshot: timeline view of GC profiling]

Here's how it looks in flamegraph mode:

[screenshot: flamegraph view of GC profiling]

A TL;DR description of how it works now is below:

  1. Wiring it up
  • Ruby exposes native-level tracepoint events for when it starts/stops doing GC (`RUBY_INTERNAL_EVENT_GC_ENTER`/`RUBY_INTERNAL_EVENT_GC_EXIT`).
  • Inside the `CpuAndWallTimeWorker` we create a tracepoint that listens for these events and calls the `on_gc_start` and `on_gc_finish` events on the `ThreadContext` collector.
  • The `RUBY_INTERNAL_EVENT_GC_ENTER`/`RUBY_INTERNAL_EVENT_GC_EXIT` tracepoint events are emitted while Ruby is doing GC, which means we can't allocate anything. The code we use is thus built to not allocate anything.
  • We can't do all the work we want in `on_gc_finish` because we can't allocate anything, so that function returns a flag that tells the `CpuAndWallTimeWorker` to call `sample_after_gc` as soon as the GC finishes.
  2. Tracking
  • When a thread starts doing GC (`on_gc_start`), we record its current cpu-time and wall-time.
  • When a thread stops doing GC (`on_gc_finish`), we record how much cpu-time and wall-time elapsed.
  • When `sample_after_gc` gets called, we turn the data recorded above + some metadata from Ruby into a sample event to represent GC as a fake Garbage Collection thread.
  3. Overhead
  • To avoid long pauses in applications, Ruby's GC works incrementally: rather than doing a full GC (marking, sweeping) in a single step, Ruby breaks it into smaller steps.
  • We get told about every one of these smaller steps: each produces a pair of `on_gc_start`/`on_gc_finish` calls.
  • In the past, we always called `sample_after_gc` after every `on_gc_finish` to record the GC. This was a big source of overhead: in a sample application that allocates as much as possible in a loop, I observed more than 50k calls to `on_gc_start`/`on_gc_finish` per second, which would represent a lot of overhead if we created one GC sample for each of them.
  • To avoid this overhead, we coalesce/accumulate the cpu-time and wall-time from multiple calls to `on_gc_start`/`on_gc_finish` and only create a sample for this data from time to time (by default every 10ms or every full GC cycle, whichever happens first). A sketch of this approach follows the list.
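
To make the wiring and coalescing above more concrete, here is a minimal sketch of the idea as a standalone C extension. This is not the dd-trace-rb code: the extension name (`gc_probe`), the `FLUSH_INTERVAL_NS` constant, and the stderr printing are invented for illustration, it only tracks wall-time, and the real collector creates proper samples outside of GC rather than printing.

```c
// gc_probe.c -- illustrative sketch only, NOT the dd-trace-rb implementation.
// Names such as gc_probe and FLUSH_INTERVAL_NS are invented for this example.
#include <ruby.h>
#include <ruby/debug.h>
#include <stdio.h>
#include <time.h>

#define FLUSH_INTERVAL_NS (10 * 1000 * 1000LL) // mirrors the 10ms default described above

static VALUE gc_tracepoint = Qnil;
static long long gc_step_started_at_ns = 0;  // set on GC_ENTER ("on_gc_start")
static long long accumulated_gc_wall_ns = 0; // coalesced across incremental GC steps
static long long last_flush_at_ns = 0;

static long long monotonic_wall_time_ns(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (long long) ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

// Runs while the VM is inside GC, so it must not allocate any Ruby objects.
static void on_gc_event(VALUE tracepoint, void *unused) {
  (void) unused;
  rb_event_flag_t event = rb_tracearg_event_flag(rb_tracearg_from_tracepoint(tracepoint));
  long long now = monotonic_wall_time_ns();

  if (event == RUBY_INTERNAL_EVENT_GC_ENTER) {
    gc_step_started_at_ns = now; // "on_gc_start": remember when this GC step began
  } else { // RUBY_INTERNAL_EVENT_GC_EXIT
    accumulated_gc_wall_ns += now - gc_step_started_at_ns; // "on_gc_finish": coalesce the step

    // "sample_after_gc" stand-in: flush only from time to time, not once per step.
    // A real profiler defers sample creation to run outside of GC (the PR uses a flag
    // checked by the CpuAndWallTimeWorker), since creating a sample may allocate.
    if (now - last_flush_at_ns >= FLUSH_INTERVAL_NS) {
      fprintf(stderr, "GC wall-time since last flush: %lld ns\n", accumulated_gc_wall_ns);
      accumulated_gc_wall_ns = 0;
      last_flush_at_ns = now;
    }
  }
}

void Init_gc_probe(void) {
  gc_tracepoint = rb_tracepoint_new(Qnil,
      RUBY_INTERNAL_EVENT_GC_ENTER | RUBY_INTERNAL_EVENT_GC_EXIT,
      on_gc_event, NULL);
  rb_global_variable(&gc_tracepoint); // keep the tracepoint from being garbage collected
  rb_tracepoint_enable(gc_tracepoint);
  last_flush_at_ns = monotonic_wall_time_ns();
}
```

The property the sketch preserves is that the handler runs while the VM is collecting, so it only does arithmetic and bookkeeping and defers everything that could allocate.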

Compared to the old implementation, here's what changed:

  • Garbage collection events get attributed to a placeholder "Garbage Collection" thread.
  • We no longer record "Garbage Collection" samples on the thread that triggered GC (and thus no longer include that thread's stack when recording GC information). This enabled a nice cleanup of the code in the stack collector.
  • GCs across several threads can be included in the same sample.
  • Timeline support is included.
  • Extra metadata about the GC can be seen in the timeline.
  • Overhead is lower, even during periods of heavy GC activity.

Motivation:

I worked on this as part of the Profiler R&D week. I've been wanting to go back and fix the problems with the GC profiling implementation for a long time :)

Additional Notes:

I'm opening this PR on top of @AlexJF's #3281 because the two would otherwise conflict, given the changes to the stack recorder. They are otherwise completely independent.

I'll wait until #3281 is merged to master to merge this PR as well.

How to test the change?

Test coverage is included.

During development, this tiny benchmark was useful:

```ruby
require 'benchmark/ips'
require 'ddtrace'

puts RUBY_DESCRIPTION

Datadog.configure do |c|
  c.profiling.enabled = true
  c.profiling.advanced.allocation_counting_enabled = false
  c.telemetry.enabled = false
end

class Potato
  def initialize(val)
    @potato = val
  end
end

Benchmark.ips do |x|
  x.config(time: 10, warmup: 2)

  x.report("DD_PROFILING_FORCE_ENABLE_GC=#{ENV['DD_PROFILING_FORCE_ENABLE_GC']} Potato.new(10)", "Potato.new(10)")

  x.compare!
end
```

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

@ivoanjo ivoanjo requested review from a team as code owners December 8, 2023 16:46
@github-actions github-actions bot added the profiling (Involves Datadog profiling) label Dec 8, 2023
Comment on lines +647 to +654
```c
.wall_time_ns = state->gc_tracking.accumulated_wall_time_ns,
.timeline_wall_time_ns = state->gc_tracking.accumulated_wall_time_ns,
```
@AlexJF (Contributor) commented on Dec 11, 2023:

I don't get the need for both? Is this a backend limitation (i.e., why doesn't it just parse wall_time_ns values directly)? Why does it (seem to) work fine for non-GC frames?

@ivoanjo (Member, Author) replied:

This is a great question. I've pushed a note in 391db52 to address this:

      // This event gets both a regular cpu/wall-time duration, as a normal cpu/wall-time sample would, as well as a
      // timeline duration.
      // This is done to enable two use-cases:
      // * regular cpu/wall-time makes this event show up as a regular stack in the flamegraph
      // * the timeline duration is used when the event shows up in the timeline

In practice, we could change the backend to use the regular cpu/wall-time duration for the timeline event, but doing it this way keeps us consistent with what other runtimes do for timeline (e.g. dotnet), and thus makes it easy to share code across them.

@codecov-commenter

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (56745c9) 98.22% compared to head (807a519) 98.23%.
Report is 34 commits behind head on alexjf/prof-8667-heap-profiling-part1.

| Files | Patch % | Lines |
| --- | --- | --- |
| ...atadog/profiling/collectors/thread_context_spec.rb | 99.16% | 1 Missing ⚠️ |
Additional details and impacted files
```
@@                          Coverage Diff                           @@
##           alexjf/prof-8667-heap-profiling-part1    #3313   +/-   ##
======================================================================
  Coverage                                  98.22%   98.23%
======================================================================
  Files                                       1253     1253
  Lines                                      72794    72774   -20
  Branches                                    3419     3414    -5
======================================================================
- Hits                                       71504    71486   -18
+ Misses                                      1290     1288    -2
```


@AlexJF (Contributor) left a comment:

Still LGTM, 👍 Heads-up on incoming conflicts 😄

@marcotc (Member) left a comment:

Awesome work, @ivoanjo!

Base automatically changed from alexjf/prof-8667-heap-profiling-part1 to master December 12, 2023 12:28
When this feature was added, we weren't yet working on allocation
profiling, so having stacks would hint at where the allocation that
triggered the GC was.

This... is basically a very crappy allocation profiler, and as
an anti-bonus it can also get quite expensive.

As part of making this feature shape up for us to have it on by
default, let's no longer collect stacks. As an alternative, you
can get similar context either from allocation profiling OR by
looking at the thread timeline for what the thread was doing around the
time the GC happened.

These abstractions added some complexity to the stack collector, and
were only needed to allow us to put extra placeholder frames on top
of a real collected stack.

Since this was only being used by the GC placeholder feature, and
we've changed that to not collect a stack anymore, we can clean up
this mechanism and simplify the code.

This function is rather trivial, BUT we have a really nice big comment
to explain it, so it's useful to extract it and keep the comment
separate, instead of moving the big comment into the already
very complex `sample_thread_internal`.

While still accurate, I don't think this comment is still relevant --
we no longer have the skid issue, and the rest is just historical
context that I don't think is that interesting to keep around.

This event type will be used to report GC information for use in the
profiling timeline.

This helper takes care of mapping `rb_gc_latest_gc_info` into metadata
that gets added to GC samples.

This helper will get tested via the thread context specs (changes
upcoming).
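
For context, `rb_gc_latest_gc_info` is the C-level counterpart of Ruby's `GC.latest_gc_info`. Below is a small hypothetical example (not the actual helper from this PR; the function name and the printing are made up) of reading a couple of entries from it, assuming the Symbol-key form that mirrors `GC.latest_gc_info(:major_by)`:

```c
// Hypothetical example, NOT the ddtrace helper: read a few entries from
// rb_gc_latest_gc_info, the C API behind GC.latest_gc_info.
// This must run outside of GC (e.g. from sample_after_gc), since it can allocate.
#include <ruby.h>

static void inspect_latest_gc_metadata(void) {
  // Passing a Symbol key returns just that entry, mirroring GC.latest_gc_info(:major_by).
  VALUE major_by = rb_gc_latest_gc_info(ID2SYM(rb_intern("major_by"))); // e.g. :force; nil for a minor GC
  VALUE gc_by    = rb_gc_latest_gc_info(ID2SYM(rb_intern("gc_by")));    // e.g. :newobj, :malloc, :method

  rb_p(major_by); // prints the values; a stand-in for turning them into sample labels
  rb_p(gc_by);
}
```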
Instead, we expose the `record_placeholder_stack` directly.

No longer needed!

…cer-2.7 docker image ships

This allows us to keep using `-Werror` in CI without issue, which is
pretty valuable.

Because GC samples can be coalesced, we can't assert on 5 GCs resulting
in exactly 5 GC events.
@ivoanjo ivoanjo force-pushed the ivoanjo/refactor-gc-tracking-timeline-try3 branch from 807a519 to bfdd1b7 on December 13, 2023 14:30
@ivoanjo (Member, Author) commented on Dec 13, 2023

Update: This PR was originally opened on top of #3281 but there were a number of changes to that branch, so I chose to rebase on top of master and force-push.

The rebase was clean other than a manual fix in profiling/component.rb plus a new spec from master that needed updating (bfdd1b7). No other changes were made.

It seems that even with `GC.start()` we can observe different reasons,
so I've relaxed the spec.

Because GC samples are pushed with `thread id: GC`, this was breaking
`object_id_from` in tests that happened to have some GC samples.

There is no `object_id` for the "GC thread" because there is no such
thread -- it's just a way of representing GC information.
@ivoanjo (Member, Author) commented on Dec 14, 2023

Thanks y'all for the help and reviews. Merging away!

@ivoanjo ivoanjo merged commit 94d7371 into master Dec 14, 2023
218 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/refactor-gc-tracking-timeline-try3 branch December 14, 2023 08:51
@github-actions github-actions bot added this to the 1.19.0 milestone Dec 14, 2023
@ivoanjo ivoanjo restored the ivoanjo/refactor-gc-tracking-timeline-try3 branch December 14, 2023 11:49
@ivoanjo ivoanjo deleted the ivoanjo/refactor-gc-tracking-timeline-try3 branch January 9, 2024 17:05