Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-10201] Reduce allocation profiling overhead by using coarse timestamps #3797

Merged
merged 5 commits into from
Jul 24, 2024

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Jul 22, 2024

What does this PR do?

This PR reduces the allocation profiling overhead (or: optimizes allocation profiling :D ) by using coarse timestamps on the on_newobj_event hot path.

Motivation:

When allocation profiling is enabled, the profiler gets called for almost every object allocated in the Ruby VM. Thus, this code path is extremely sensitive: the less work we can do before we return control over to Ruby, the less impact allocation profiling has on the application.

The dynamic sampling rate mechanism we employ takes the current timestamp as an input, to decide if "enough time" has elapsed since it last readjusted itself. But "enough time" right now is one second and thus we can get away with using CLOCK_MONOTONIC_COARSE on Linux which is noticeably cheaper than the regular CLOCK_MONOTONIC.

Additional Notes:

Enabling the use of the monotonic clock sometimes on the discrete dynamic sampler required it to "spill" some of its guts out to the caller, so that the caller could correctly use the coarse clock on the hot path.

How to test the change?

Here's my experiment to compare three different clock sources I evaluated:

 // Build with
 // `g++ time_sources.cpp -o time_sources -lbenchmark -lpthread`
 // where benchmark is <https://github.com/google/benchmark> aka
 // `apt install libbenchmark1 libbenchmark-dev` on ubuntu/debian

 #include <benchmark/benchmark.h>
 #include <x86intrin.h> // For __rdtsc
 #include <ctime>       // For clock_gettime

static void BM_RDTSC(benchmark::State& state) {
    for (auto _ : state) {
        benchmark::DoNotOptimize(__rdtsc());
    }
}

static void BM_ClockMonotonic(benchmark::State& state) {
    timespec ts;
    for (auto _ : state) {
        clock_gettime(CLOCK_MONOTONIC, &ts);
        benchmark::DoNotOptimize(ts);
    }
}

static void BM_ClockMonotonicCoarse(benchmark::State& state) {
    timespec ts;
    for (auto _ : state) {
        clock_gettime(CLOCK_MONOTONIC_COARSE, &ts);
        benchmark::DoNotOptimize(ts);
    }
}

BENCHMARK(BM_RDTSC);
BENCHMARK(BM_ClockMonotonic);
BENCHMARK(BM_ClockMonotonicCoarse);

BENCHMARK_MAIN();

Results on my machine:

./time_sources --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
2024-07-19T10:48:20+01:00
Running ./time_sources
Run on (20 X 4900 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1280 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 1.23, 1.30, 1.11
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
BM_RDTSC_mean                        5.52 ns         5.52 ns           10
BM_RDTSC_median                      5.44 ns         5.44 ns           10
BM_RDTSC_stddev                     0.148 ns        0.147 ns           10
BM_RDTSC_cv                          2.67 %          2.67 %            10
BM_ClockMonotonic_mean               15.8 ns         15.8 ns           10
BM_ClockMonotonic_median             15.4 ns         15.4 ns           10
BM_ClockMonotonic_stddev             1.07 ns         1.07 ns           10
BM_ClockMonotonic_cv                 6.77 %          6.77 %            10
BM_ClockMonotonicCoarse_mean         5.92 ns         5.92 ns           10
BM_ClockMonotonicCoarse_median       5.93 ns         5.93 ns           10
BM_ClockMonotonicCoarse_stddev      0.041 ns        0.041 ns           10
BM_ClockMonotonicCoarse_cv           0.68 %          0.68 %            10

and here's the result of running benchmarks/profiler_allocation.rb comparing master to this branch:

ruby 2.7.7p221 (2022-11-24 revision 168ec2b1e5) [x86_64-linux]
Warming up --------------------------------------
Allocations (baseline)   1.431M i/100ms
Calculating -------------------------------------
Allocations (baseline)   14.370M (± 2.0%) i/s -    144.541M in  10.062635s

Warming up --------------------------------------
Allocations (master)     1.014M i/100ms
Calculating -------------------------------------
Allocations (master)     10.165M (± 1.0%) i/s -    102.390M in  10.074151s

Warming up --------------------------------------
Allocations (coarse)     1.179M i/100ms
Calculating -------------------------------------
Allocations (coarse)     11.495M (± 2.5%) i/s -    115.573M in  10.059971s

Comparison:
Allocations (baseline): 14369960.1 i/s
Allocations (coarse): 11495418.2 i/s - 1.25x  slower
Allocations (master): 10164615.7 i/s - 1.41x  slower

I've specifically used Ruby 2.7 for this comparison since this benchmark had a lot more variance (including baseline) on latter Rubies, and I wanted to isolate the specific changes to this code path.

This will be used by the discrete dynamic sampler to make cheaper checks
of when to re-evaluate if it should work again.
…estamps

**What does this PR do?**

This PR reduces the allocation profiling overhead (or: optimizes
allocation profiling :D ) by using coarse timestamps on the
`on_newobj_event` hot path.

**Motivation:**

When allocation profiling is enabled, the profiler gets called for
almost every object allocated in the Ruby VM. Thus, this code path is
extremely sensitive: the less work we can do before we return control
over to Ruby, the less impact allocation profiling has on the
application.

The dynamic sampling rate mechanism we employ takes the current
timestamp as an input, to decide if "enough time" has elapsed since it
last readjusted itself. But "enough time" right now is *one second*
and thus we can get away with using `CLOCK_MONOTONIC_COARSE` on Linux
which is noticeably cheaper than the regular `CLOCK_MONOTONIC`.

**Additional Notes:**

Enabling the use of the monotonic clock _sometimes_ on the discrete
dynamic sampler required it to "spill" some of its guts out to the
caller, so that the caller could correctly use the coarse clock on the
hot path.

**How to test the change?**

Here's my experiment to compare three different clock sources I
evaluated:

```c++
 // Build with
 // `g++ time_sources.cpp -o time_sources -lbenchmark -lpthread`
 // where benchmark is <https://github.com/google/benchmark> aka
 // `apt install libbenchmark1 libbenchmark-dev` on ubuntu/debian

 #include <benchmark/benchmark.h>
 #include <x86intrin.h> // For __rdtsc
 #include <ctime>       // For clock_gettime

static void BM_RDTSC(benchmark::State& state) {
    for (auto _ : state) {
        benchmark::DoNotOptimize(__rdtsc());
    }
}

static void BM_ClockMonotonic(benchmark::State& state) {
    timespec ts;
    for (auto _ : state) {
        clock_gettime(CLOCK_MONOTONIC, &ts);
        benchmark::DoNotOptimize(ts);
    }
}

static void BM_ClockMonotonicCoarse(benchmark::State& state) {
    timespec ts;
    for (auto _ : state) {
        clock_gettime(CLOCK_MONOTONIC_COARSE, &ts);
        benchmark::DoNotOptimize(ts);
    }
}

BENCHMARK(BM_RDTSC);
BENCHMARK(BM_ClockMonotonic);
BENCHMARK(BM_ClockMonotonicCoarse);

BENCHMARK_MAIN();
```

Results on my machine:

```
./time_sources --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
2024-07-19T10:48:20+01:00
Running ./time_sources
Run on (20 X 4900 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x10)
  L1 Instruction 32 KiB (x10)
  L2 Unified 1280 KiB (x10)
  L3 Unified 24576 KiB (x1)
Load Average: 1.23, 1.30, 1.11
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-------------------------------------------------------------------------
Benchmark                               Time             CPU   Iterations
-------------------------------------------------------------------------
BM_RDTSC_mean                        5.52 ns         5.52 ns           10
BM_RDTSC_median                      5.44 ns         5.44 ns           10
BM_RDTSC_stddev                     0.148 ns        0.147 ns           10
BM_RDTSC_cv                          2.67 %          2.67 %            10
BM_ClockMonotonic_mean               15.8 ns         15.8 ns           10
BM_ClockMonotonic_median             15.4 ns         15.4 ns           10
BM_ClockMonotonic_stddev             1.07 ns         1.07 ns           10
BM_ClockMonotonic_cv                 6.77 %          6.77 %            10
BM_ClockMonotonicCoarse_mean         5.92 ns         5.92 ns           10
BM_ClockMonotonicCoarse_median       5.93 ns         5.93 ns           10
BM_ClockMonotonicCoarse_stddev      0.041 ns        0.041 ns           10
BM_ClockMonotonicCoarse_cv           0.68 %          0.68 %            10
```

and here's the result of running `benchmarks/profiler_allocation.rb`
comparing master to this branch:

```
ruby 2.7.7p221 (2022-11-24 revision 168ec2b1e5) [x86_64-linux]
Warming up --------------------------------------
Allocations (baseline)   1.431M i/100ms
Calculating -------------------------------------
Allocations (baseline)   14.370M (± 2.0%) i/s -    144.541M in  10.062635s

Warming up --------------------------------------
Allocations (master)     1.014M i/100ms
Calculating -------------------------------------
Allocations (master)     10.165M (± 1.0%) i/s -    102.390M in  10.074151s

Warming up --------------------------------------
Allocations (coarse)     1.179M i/100ms
Calculating -------------------------------------
Allocations (coarse)     11.495M (± 2.5%) i/s -    115.573M in  10.059971s

Comparison:
Allocations (baseline): 14369960.1 i/s
Allocations (coarse): 11495418.2 i/s - 1.25x  slower
Allocations (master): 10164615.7 i/s - 1.41x  slower
```

I've specifically used Ruby 2.7 for this comparison since this benchmark
had a lot more variance (including baseline) on latter Rubies, and I
wanted to isolate the specific changes to this code path.
@ivoanjo ivoanjo requested review from a team as code owners July 22, 2024 10:50
@github-actions github-actions bot added the profiling Involves Datadog profiling label Jul 22, 2024
@pr-commenter
Copy link

pr-commenter bot commented Jul 22, 2024

Benchmarks

Benchmark execution time: 2024-07-22 10:57:07

Comparing candidate commit 446f2ad in PR branch ivoanjo/prof-10201-coarse-timestamps with baseline commit 40d80cd in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 10 metrics, 2 unstable metrics.

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.91%. Comparing base (40d80cd) to head (446f2ad).

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3797   +/-   ##
=======================================
  Coverage   97.91%   97.91%           
=======================================
  Files        1248     1248           
  Lines       75192    75192           
  Branches     3638     3638           
=======================================
  Hits        73627    73627           
  Misses       1565     1565           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

ivoanjo added a commit that referenced this pull request Jul 22, 2024
**What does this PR do?**

This PR introduces a new setting for the profiler:
`profiling.advanced.allocation_counting_enabled` which controls the
profiler allocation counting feature.

This setting is off by default, enabling us to reduce allocation
profiling overhead slightly.

**Motivation:**

We actually used to have this setting back in the ddtrace 1.x series.
We introduced it in #2635 and removed it in #3281 -- by tieing it
directly to allocation profiling.

I decided to re-introduce it so we can disable this feature by
default.

**Additional Notes:**

This PR sits atop #3797 because it makes sense to measure both overhead
improvements together, but is otherwise completely independent.

**How to test the change?**

Here's the results for running the `profiler_allocation` benchmark:

```
ruby 2.7.7p221 (2022-11-24 revision 168ec2b1e5) [x86_64-linux]
Warming up --------------------------------------
Allocations (baseline)   1.419M i/100ms
Calculating -------------------------------------
Allocations (baseline)   14.535M (± 2.1%) i/s -    146.197M in  10.062717s

Warming up --------------------------------------
Allocations (alloc_profiling_enabled) 1.122M i/100ms
Calculating -------------------------------------
Allocations (alloc_profiling_enabled) 11.636M (± 2.2%) i/s -    116.679M in  10.032209s

Warming up --------------------------------------
Allocations (alloc_profiling_disabled) 1.124M i/100ms
Calculating -------------------------------------
Allocations (alloc_profiling_disabled) 11.866M (± 2.6%) i/s -    119.175M in  10.050580s

Comparison:
Allocations (baseline): 14534979.3 i/s
Allocations (alloc_profiling_disabled): 11865926.7 i/s - 1.22x  slower
Allocations (alloc_profiling_enabled): 11635919.9 i/s - 1.25x  slower
```

The difference is close to the margin of error; nevertheless
this feature was showing up on the native profiler, and since it was
on the hot path for allocation profiling, I think it's worth it.
Copy link
Member

@marcotc marcotc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🏎️ 🔥

@ivoanjo
Copy link
Member Author

ivoanjo commented Jul 24, 2024

Thanks for the review! :)

@ivoanjo ivoanjo merged commit 2cdc371 into master Jul 24, 2024
171 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-10201-coarse-timestamps branch July 24, 2024 08:30
@github-actions github-actions bot added this to the 2.3.0 milestone Jul 24, 2024
@TonyCTHsu TonyCTHsu mentioned this pull request Aug 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants