From 7d22e545c2a1adc4856faf1956c1b5bab1338396 Mon Sep 17 00:00:00 2001
From: Ivo Anjo
Date: Fri, 3 Mar 2023 12:25:56 +0000
Subject: [PATCH] Tweak threshold on flaky profiler test

This test failed in a flaky way in CI (). An interesting observation from the output:

```
Failure/Error:
  expect(sample_count).to be >= 8,
    "sample_count: #{sample_count}, stats: #{stats}, debug_failures: #{debug_failures}"

  sample_count: 7,
  stats: {:trigger_sample_attempts=>8, :trigger_simulated_signal_delivery_attempts=>8,
          :simulated_signal_delivery=>8, :signal_handler_enqueued_sample=>8,
          :signal_handler_wrong_thread=>0, :sampled=>7, :sampling_time_ns_min=>122586,
          :sampling_time_ns_max=>2849779, :sampling_time_ns_total=>3664094,
          :sampling_time_ns_avg=>523442.0},
  debug_failures: {:thread_list=>[#, #, #],
```

At least one sample was very slow (`sampling_time_ns_max` is almost 3ms). Because of that, the
profiler throttled sampling down (using the dynamic sampling rate component), and so did not take
as many samples as we were expecting. This is by design, although it's really hard to know _why_
the profiler slowed down -- maybe the CI machine was busy? Maybe that leftover webrick thread that
leaked from another test was the culprit?

Right now the dynamic sampling rate is, by design, not very configurable, so I've chosen to tweak
the expected number of samples down yet again. If this last adjustment isn't enough, I'll bite the
bullet and make it possible to disable (or tweak) the dynamic sampling rate, so that it doesn't
cause issues for this spec.

(Like I mentioned in the comment, I still think this test is worth the trouble: it validates an
important behavior of the profiler -- profiling while idle -- that may otherwise not get exercised
by most of the validations we run.)
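To make the throttling described above more concrete, here is a minimal Ruby sketch of the dynamic
sampling rate idea. This is an illustration only, not the profiler's actual implementation -- the
class, constants, and method names (`DynamicSamplingRateSketch`, `TARGET_INTERVAL_NS`,
`MAX_OVERHEAD`, `next_wait_ns`) are made up for this example:

```ruby
# Illustrative sketch only: shows why one slow sample can reduce the number of
# samples taken in a fixed window. Not the real dd-trace-rb implementation.
class DynamicSamplingRateSketch
  TARGET_INTERVAL_NS = 10_000_000 # aim for ~100 samples per second (one every 10ms)
  MAX_OVERHEAD = 0.02             # spend at most ~2% of wall-clock time sampling

  def initialize
    @last_sampling_time_ns = 0
  end

  # Remember how long the last sample took (a real implementation would smooth
  # this over several samples instead of using only the latest one).
  def record_sample_duration(sampling_time_ns)
    @last_sampling_time_ns = sampling_time_ns
  end

  # How long to wait before the next sample: the target interval, unless samples
  # are so slow that waiting only that long would blow the overhead budget.
  def next_wait_ns
    minimum_wait_for_overhead = (@last_sampling_time_ns / MAX_OVERHEAD).to_i
    [TARGET_INTERVAL_NS, minimum_wait_for_overhead].max
  end
end

sketch = DynamicSamplingRateSketch.new
sketch.record_sample_duration(2_849_779) # the ~2.8ms `sampling_time_ns_max` from the failure above
puts sketch.next_wait_ns # => 142488950 (~142ms), i.e. roughly 7 samples/second instead of ~100
```

Under that kind of back-off, a 0.2s idle window can easily produce fewer than 8 samples, which is
why the expectation below is lowered to 5.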
---
 .../collectors/cpu_and_wall_time_worker_spec.rb    | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb b/spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb
index 27817e44e17..df3579d68e2 100644
--- a/spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb
+++ b/spec/datadog/profiling/collectors/cpu_and_wall_time_worker_spec.rb
@@ -310,11 +310,12 @@
       cpu_and_wall_time_worker.stop
 
-      result = samples_for_thread(samples_from_pprof_without_gc_and_overhead(recorder.serialize!), Thread.current)
+      all_samples = samples_from_pprof_without_gc_and_overhead(recorder.serialize!)
+      result = samples_for_thread(all_samples, Thread.current)
       sample_count = result.map { |it| it.values.fetch(:'cpu-samples') }.reduce(:+)
 
       stats = cpu_and_wall_time_worker.stats
-      debug_failures = { thread_list: Thread.list, result: result }
+      debug_failures = { thread_list: Thread.list, all_samples: all_samples }
 
       trigger_sample_attempts = stats.fetch(:trigger_sample_attempts)
       simulated_signal_delivery = stats.fetch(:simulated_signal_delivery)
 
@@ -324,9 +325,16 @@
       # Sanity checking
 
-      # We're currently targeting 100 samples per second, so this is a conservative approximation that hopefully
-      # will not cause flakiness
-      expect(sample_count).to be >= 8, "sample_count: #{sample_count}, stats: #{stats}, debug_failures: #{debug_failures}"
+      # We're currently targeting 100 samples per second (aka 20 in the 0.2 period above), so expecting 5 samples is a
+      # conservative choice that hopefully will not cause flakiness.
+      #
+      # @ivoanjo: One source of flakiness here in the past has been the dynamic sampling rate component -- if for some
+      # reason the profiler takes longer to sample (maybe the CI machine is busy today...), it will by design slow down
+      # and take less samples. For now, I tried to tune down this value, and am hesitating on adding a way to disable
+      # the dynamic sampling rate while we're running this test (because this is something we'd never do in production).
+      # But if spec proves to be flaky again, then I think we'll need to go with that solution (as I think this test
+      # is still quite valuable).
+      expect(sample_count).to be >= 5, "sample_count: #{sample_count}, stats: #{stats}, debug_failures: #{debug_failures}"
 
       expect(trigger_sample_attempts).to be >= sample_count
     end
   end