Refactor of ActiveDefrag to reduce latencies #1242

Open · wants to merge 1 commit into base: unstable
Conversation

@JimB123 (Contributor) commented Oct 31, 2024

Refer to: #1141

This update refactors the defrag code to:

  • Make the overall code more readable and maintainable
  • Reduce latencies incurred during defrag processing

(See #1141 for more complete details.)

This update focuses mostly on the high-level processing and does NOT address lower-level functions which aren't currently time-bounded (e.g. kvstoreDictLUTDefrag(), activeDefragSdsDict(), and moduleDefragGlobals()). These are out of scope for this update and are left for a future one.

During unit tests, the following max latencies were measured (in verbose mode):

| Unit test name | Old latency (ms) | New latency (ms) |
|---|---|---|
| Active defrag main dictionary: cluster | 8 | 7 |
| Active defrag big keys: cluster | 8 | 6 |
| Active defrag main dictionary: standalone | 8 | 0 |
| Active defrag - AOF loading | 20 | 0 |
| Active defrag big keys: standalone | 8 | 0 |
| Active defrag big list: standalone | 8 | 0 |

Note that the longer (non-zero) latencies in the first two tests are due to the processing functions that are still not time-bounded. In cluster mode (single shard), these process all 16k slots at once. This is a separate TODO, out of scope for this update.

In addition, the following test was run on both old and new versions of the software:

```shell
# Create a fragmented host
./src/valkey-benchmark -r 10000000 -n 10000000 -d 3 -t set
./src/valkey-benchmark -r 9000000 -n 10000000 -d 11 -t set
./src/valkey-benchmark -r 8000000 -n 10000000 -d 19 -t set
./src/valkey-benchmark -r 7000000 -n 10000000 -d 27 -t set

# Enable defrag while running some traffic
./src/valkey-cli config set activedefrag yes; ./src/valkey-benchmark -r 7000000 -c 1 -n 1000000 -l -t get
```

Configuration was set so that both the old and new versions would use 10% CPU for defrag.
Defrag time OLD: 120 sec
Defrag time NEW: 105 sec
Times were based on a 5-second poll. (I didn't have logs running.)

The improvement in run time is believed to result from the unskewed duty cycle of the new code, which delivers a more accurate 10% of CPU to defrag.
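As a rough illustration of how a CPU percentage translates into a per-cycle time budget, here is a small sketch; the arithmetic and the name `defrag_budget_us` are assumptions for illustration, not the actual Valkey code or configuration model.

```c
/* Hypothetical helper: given a defrag CPU limit in percent and the timer
 * period in microseconds, compute how long defrag may run per cycle so
 * that its average CPU share matches the limit. */
static long long defrag_budget_us(long long pct, long long period_us) {
    return period_us * pct / 100;
}

/* With a 100 ms period and a 10% limit, each cycle may run for 10 ms of
 * defrag -- and that 10 ms is also the worst-case stall a client can
 * observe. Shrinking the period (e.g. to 10 ms, giving 1 ms bursts)
 * keeps the same average CPU share while shrinking the tail latency. */
```

If the old code's effective burst length was on the order of 10 ms, that would be consistent with the ~10 ms tails in the OLD distribution below; this is an inference from the numbers, not something verified against the code.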

This is the OLD distribution for the GET benchmark while defragging:

```
Latency by percentile distribution:
0.000% <= 0.023 milliseconds (cumulative count 5)
50.000% <= 0.031 milliseconds (cumulative count 953602)
96.875% <= 0.039 milliseconds (cumulative count 980927)
98.438% <= 0.047 milliseconds (cumulative count 990566)
99.219% <= 0.055 milliseconds (cumulative count 993349)
99.609% <= 0.351 milliseconds (cumulative count 996097)
99.805% <= 0.535 milliseconds (cumulative count 998320)
99.902% <= 0.559 milliseconds (cumulative count 999088)
99.951% <= 0.759 milliseconds (cumulative count 999516)
99.976% <= 10.191 milliseconds (cumulative count 999765)
99.988% <= 10.231 milliseconds (cumulative count 999889)
99.994% <= 10.263 milliseconds (cumulative count 999941)
99.997% <= 10.335 milliseconds (cumulative count 999971)
99.998% <= 10.687 milliseconds (cumulative count 999985)
99.999% <= 10.759 milliseconds (cumulative count 999994)
100.000% <= 10.823 milliseconds (cumulative count 999997)
100.000% <= 11.167 milliseconds (cumulative count 999999)
100.000% <= 11.175 milliseconds (cumulative count 1000000)
100.000% <= 11.175 milliseconds (cumulative count 1000000)
```

This is the equivalent NEW distribution:

```
Latency by percentile distribution:
0.000% <= 0.023 milliseconds (cumulative count 11)
50.000% <= 0.031 milliseconds (cumulative count 939178)
96.875% <= 0.047 milliseconds (cumulative count 979234)
98.438% <= 0.055 milliseconds (cumulative count 985116)
99.219% <= 0.535 milliseconds (cumulative count 992519)
99.609% <= 0.543 milliseconds (cumulative count 996577)
99.805% <= 0.551 milliseconds (cumulative count 998137)
99.902% <= 0.599 milliseconds (cumulative count 999028)
99.951% <= 0.767 milliseconds (cumulative count 999522)
99.976% <= 1.023 milliseconds (cumulative count 999776)
99.988% <= 1.047 milliseconds (cumulative count 999893)
99.994% <= 1.063 milliseconds (cumulative count 999947)
99.997% <= 1.111 milliseconds (cumulative count 999970)
99.998% <= 1.263 milliseconds (cumulative count 999986)
99.999% <= 1.511 milliseconds (cumulative count 999994)
100.000% <= 1.535 milliseconds (cumulative count 999997)
100.000% <= 1.735 milliseconds (cumulative count 999999)
100.000% <= 2.311 milliseconds (cumulative count 1000000)
100.000% <= 2.311 milliseconds (cumulative count 1000000)
```

You can see in the new distribution a very slight increase in latencies up through the 99.951% percentile, in exchange for a large reduction in tail latencies.

codecov bot commented Oct 31, 2024

Codecov Report

Attention: Patch coverage is 95.19231% with 15 lines in your changes missing coverage. Please review.

Project coverage is 70.73%. Comparing base (a37dee4) to head (87cac16).
Report is 74 commits behind head on unstable.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/defrag.c | 95.08% | 15 Missing ⚠️ |
Additional details and impacted files

```
@@             Coverage Diff              @@
##           unstable    #1242      +/-   ##
============================================
+ Coverage     70.62%   70.73%   +0.11%
============================================
  Files           114      114
  Lines         61694    63192    +1498
============================================
+ Hits          43569    44702    +1133
- Misses        18125    18490     +365
```
| Files with missing lines | Coverage Δ |
|---|---|
| src/ae.c | 75.37% <100.00%> (+0.47%) ⬆️ |
| src/config.c | 78.83% <ø> (+0.13%) ⬆️ |
| src/dict.c | 97.33% <100.00%> (+0.07%) ⬆️ |
| src/kvstore.c | 96.31% <100.00%> (+0.16%) ⬆️ |
| src/server.c | 87.67% <ø> (-0.99%) ⬇️ |
| src/server.h | 100.00% <ø> (ø) |
| src/defrag.c | 92.01% <95.08%> (+6.25%) ⬆️ |

... and 86 files with indirect coverage changes
