Refactor of ActiveDefrag to reduce latencies #1242

Open · wants to merge 1 commit into base: unstable
Conversation

@JimB123 (Contributor) commented Oct 31, 2024

Refer to: #1141

This update refactors the defrag code to:

  • Make the overall code more readable and maintainable
  • Reduce latencies incurred during defrag processing

(See #1141 for more complete details.)

This update focuses mostly on the high-level processing and does NOT address lower-level functions which aren't currently time-bounded (e.g. kvstoreDictLUTDefrag(), activeDefragSdsDict(), and moduleDefragGlobals()). These are out of scope for this update and are left for a future one.

During unit tests, the following max latencies were measured (in verbose mode):

| Unit test name | Old latency (ms) | New latency (ms) |
|---|---|---|
| Active defrag main dictionary: cluster | 8 | 7 |
| Active defrag big keys: cluster | 8 | 6 |
| Active defrag main dictionary: standalone | 8 | 0 |
| Active defrag - AOF loading | 20 | 0 |
| Active defrag big keys: standalone | 8 | 0 |
| Active defrag big list: standalone | 8 | 0 |

Note that the longer (non-zero) latencies in the first two tests are due to the processing functions that are still not time-bounded. In cluster mode (single shard), these process all 16k slots at once. This is a separate TODO, out of scope for this update.

In addition, the following test was run on both old and new versions of the software:

```shell
# Create a fragmented host
./src/valkey-benchmark -r 10000000 -n 10000000 -d 3 -t set
./src/valkey-benchmark -r 9000000 -n 10000000 -d 11 -t set
./src/valkey-benchmark -r 8000000 -n 10000000 -d 19 -t set
./src/valkey-benchmark -r 7000000 -n 10000000 -d 27 -t set

# Enable defrag while running some traffic
./src/valkey-cli config set activedefrag yes; ./src/valkey-benchmark -r 7000000 -c 1 -n 1000000 -l -t get
```

Configuration was set so that both the old and new versions would use 10% CPU for defrag.
Defrag time OLD: 120 sec
Defrag time NEW: 105 sec
Times were based on a 5-second poll. (I didn't have logs running.)

The improvement in run time is believed to result from the unskewed duty cycle of the new code, which delivers a more accurate 10% of CPU to defrag.
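As a rough illustration of how a CPU percentage translates into a per-cycle time budget, here is a small sketch; the arithmetic and the name `defrag_budget_us` are assumptions for illustration, not the actual Valkey code or configuration model.

```c
/* Hypothetical helper: given a defrag CPU limit in percent and the timer
 * period in microseconds, compute how long defrag may run per cycle so
 * that its average CPU share matches the limit. */
static long long defrag_budget_us(long long pct, long long period_us) {
    return period_us * pct / 100;
}

/* With a 100 ms period and a 10% limit, each cycle may run for 10 ms of
 * defrag -- and that 10 ms is also the worst-case stall a client can
 * observe. Shrinking the period (e.g. to 10 ms, giving 1 ms bursts)
 * keeps the same average CPU share while shrinking the tail latency. */
```

If the old code's effective burst length was on the order of 10 ms, that would be consistent with the ~10 ms tails in the OLD distribution below; this is an inference from the numbers, not something verified against the code.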

This is the OLD distribution for the GET benchmark while defragging:

```
Latency by percentile distribution:
0.000% <= 0.023 milliseconds (cumulative count 5)
50.000% <= 0.031 milliseconds (cumulative count 953602)
96.875% <= 0.039 milliseconds (cumulative count 980927)
98.438% <= 0.047 milliseconds (cumulative count 990566)
99.219% <= 0.055 milliseconds (cumulative count 993349)
99.609% <= 0.351 milliseconds (cumulative count 996097)
99.805% <= 0.535 milliseconds (cumulative count 998320)
99.902% <= 0.559 milliseconds (cumulative count 999088)
99.951% <= 0.759 milliseconds (cumulative count 999516)
99.976% <= 10.191 milliseconds (cumulative count 999765)
99.988% <= 10.231 milliseconds (cumulative count 999889)
99.994% <= 10.263 milliseconds (cumulative count 999941)
99.997% <= 10.335 milliseconds (cumulative count 999971)
99.998% <= 10.687 milliseconds (cumulative count 999985)
99.999% <= 10.759 milliseconds (cumulative count 999994)
100.000% <= 10.823 milliseconds (cumulative count 999997)
100.000% <= 11.167 milliseconds (cumulative count 999999)
100.000% <= 11.175 milliseconds (cumulative count 1000000)
100.000% <= 11.175 milliseconds (cumulative count 1000000)
```

This is the equivalent NEW distribution:

```
Latency by percentile distribution:
0.000% <= 0.023 milliseconds (cumulative count 11)
50.000% <= 0.031 milliseconds (cumulative count 939178)
96.875% <= 0.047 milliseconds (cumulative count 979234)
98.438% <= 0.055 milliseconds (cumulative count 985116)
99.219% <= 0.535 milliseconds (cumulative count 992519)
99.609% <= 0.543 milliseconds (cumulative count 996577)
99.805% <= 0.551 milliseconds (cumulative count 998137)
99.902% <= 0.599 milliseconds (cumulative count 999028)
99.951% <= 0.767 milliseconds (cumulative count 999522)
99.976% <= 1.023 milliseconds (cumulative count 999776)
99.988% <= 1.047 milliseconds (cumulative count 999893)
99.994% <= 1.063 milliseconds (cumulative count 999947)
99.997% <= 1.111 milliseconds (cumulative count 999970)
99.998% <= 1.263 milliseconds (cumulative count 999986)
99.999% <= 1.511 milliseconds (cumulative count 999994)
100.000% <= 1.535 milliseconds (cumulative count 999997)
100.000% <= 1.735 milliseconds (cumulative count 999999)
100.000% <= 2.311 milliseconds (cumulative count 1000000)
100.000% <= 2.311 milliseconds (cumulative count 1000000)
```

You can see in the new distribution a very slight increase in latencies up through the 99.951% percentile, in exchange for a large reduction in tail latencies.

codecov bot commented Oct 31, 2024

Codecov Report

Attention: Patch coverage is 95.19231% with 15 lines in your changes missing coverage. Please review.

Project coverage is 70.73%. Comparing base (a37dee4) to head (87cac16).
Report is 74 commits behind head on unstable.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/defrag.c | 95.08% | 15 Missing ⚠️ |
Additional details and impacted files

```
@@             Coverage Diff              @@
##           unstable    #1242      +/-   ##
============================================
+ Coverage     70.62%   70.73%   +0.11%
============================================
  Files           114      114
  Lines         61694    63192    +1498
============================================
+ Hits          43569    44702    +1133
- Misses        18125    18490     +365
```
| Files with missing lines | Coverage Δ |
|---|---|
| src/ae.c | 75.37% <100.00%> (+0.47%) ⬆️ |
| src/config.c | 78.83% <ø> (+0.13%) ⬆️ |
| src/dict.c | 97.33% <100.00%> (+0.07%) ⬆️ |
| src/kvstore.c | 96.31% <100.00%> (+0.16%) ⬆️ |
| src/server.c | 87.67% <ø> (-0.99%) ⬇️ |
| src/server.h | 100.00% <ø> (ø) |
| src/defrag.c | 92.01% <95.08%> (+6.25%) ⬆️ |

... and 86 files with indirect coverage changes
