Refactor of ActiveDefrag to reduce latencies #1242
Open
+689
−455
Refer to: #1141
This update refactors the defrag code to reduce latencies. (See #1141 for more complete details.)
This update is focused mostly on the high-level processing, and does NOT address lower-level functions which aren't currently time-bound (e.g. `kvstoreDictLUTDefrag`, `activeDefragSdsDict()`, and `moduleDefragGlobals()`). These are out of scope for this update and are left for a future one.

During unit tests, the following max latencies were measured (in verbose mode):
Note that the longer (non-zero) latencies in the first two tests are due to the still-unbounded processing functions. In cluster mode (single shard), these process all 16k slots at once. This is a separate TODO, out of scope for this update.
In addition, the following test was run on both old and new versions of the software:
Configuration was set so that both old/new clusters would use 10% defrag CPU.
Defrag time OLD: 120 sec
Defrag time NEW: 105 sec
Times were based on a 5-second poll. (I didn't have logs running.)
The improvement in run time is believed to be due to the unskewed nature of the new code, which delivers a more accurate 10% of CPU to defrag.
This is the OLD distribution for the GET benchmark while defragging:
This is the equivalent NEW distribution:
You can see in the new distribution that there is a very slight increase in latencies up through the 99.951th percentile, in exchange for a huge reduction in tail latencies.