defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code #692

zvi-code · 2024-06-25T07:43:04Z

Summary of the change

This is a base PR for refactoring defrag. It moves the defrag logic onto jemalloc's native API instead of the custom changes Valkey carries in its copy of the jemalloc library (je_defrag_hint). This lets Valkey use the latest vanilla jemalloc without maintaining code changes across jemalloc versions.

This change requires some modifications because the new API only provides information, not a yes/no defrag decision, so the decision logic has to be implemented in Valkey code. In addition, a single call to the API does not return all the information needed to make the decision; the rest is available through an additional API call. To reduce the number of calls to jemalloc, this PR collects the required information during computeDefragCycles rather than for every single pointer, which avoids the additional API call.
Follow-up work will use the options this opens up to further improve the defrag decision and process.

Added files:

allocator_defrag.c / allocator_defrag.h - These files implement the allocator-specific knowledge for making the defrag decision. Knowledge about slabs, allocation logic and so on all goes into these files. This improves the separation between jemalloc-specific code and other possible implementations.

Moved functions:

zmalloc_no_tcache, zfree_no_tcache - these encode very jemalloc-specific assumptions and are tied to how we defrag with jemalloc. The move also anticipates that, from a performance perspective, we should consider using tcache; we only need to make sure we don't recycle entries without going through the arena [for example: we can use private tcaches, one for free and one for alloc].
frag_smallbins_bytes - the logic and implementation moved to the new file

Existing API:

[once a second + when completed full cycle] computeDefragCycles
- zmalloc_get_allocator_info : gets from jemalloc allocated, active, resident, retained, muzzy, frag_smallbins_bytes
- frag_smallbins_bytes : for each bin; gets from jemalloc bin_info, curr_regs, cur_slabs
[during defrag, for each pointer]
- je_defrag_hint takes a memory pointer and returns {0,1}. Internally it uses these data points:
  - #nonfull_slabs
  - #total_slabs
  - #free regs in the ptr slab
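
As an illustration of the per-bin pass described above, here is a minimal sketch of how a frag_smallbins_bytes-style total could be derived from the three per-bin values (bin_info, curr_regs, cur_slabs). The `bin_snapshot` struct and the function name are hypothetical and only stand in for values that would be read from jemalloc; this is not the code in the PR.

```c
#include <stddef.h>

/* Hypothetical per-bin snapshot. The fields mirror the values named above
 * (bin_info gives reg_size and nregs; curregs and curslabs are bin stats),
 * but the struct itself is an illustration, not a type from the PR. */
typedef struct {
    size_t reg_size; /* bytes per region in this bin */
    size_t nregs;    /* regions per slab */
    size_t curregs;  /* regions currently allocated in the bin */
    size_t curslabs; /* slabs currently held by the bin */
} bin_snapshot;

/* Sum, over all bins, the bytes that sit in slabs but are not allocated. */
size_t frag_smallbins_bytes_sketch(const bin_snapshot *bins, size_t nbins) {
    size_t frag = 0;
    for (size_t i = 0; i < nbins; i++) {
        size_t capacity = bins[i].nregs * bins[i].curslabs;
        frag += (capacity - bins[i].curregs) * bins[i].reg_size;
    }
    return frag;
}
```

Because this runs once per query cycle rather than once per pointer, its cost does not grow with the number of keys scanned.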

Jemalloc API (via ctl interface)

[BATCH]experimental_utilization_batch_query_ctl : takes an array of pointers and returns 3 values for each pointer:

number of free regions in the extent
number of regions in the extent
size of the extent in terms of bytes

[EXTENDED]experimental_utilization_query_ctl :

memory address of the extent a potential reallocation would go into
number of free regions in the extent
number of regions in the extent
size of the extent in terms of bytes
[stats-enabled]total number of free regions in the bin the extent belongs to
[stats-enabled]total number of regions in the bin the extent belongs to

`experimental_utilization_batch_query_ctl` vs valkey `je_defrag_hint`?

[good]

We can query pointers in a batch, reducing the overall overhead
The per-pointer decision algorithm lives outside the jemalloc API: jemalloc only provides information, so Valkey can tune/configure/optimize it easily

[bad]

In the batch API we only know the utilization of the slab (of that memory ptr); we don’t get data about #nonfull_slabs or the total allocated regs.

New functions:

defrag_jemalloc_init: reduces the cost of calls to je_ctl by using the MIB interface, which makes the calls faster. See this quote from the jemalloc documentation:

The mallctlnametomib() function provides a way to avoid repeated name lookups for
applications that repeatedly query the same portion of the namespace, by translating
a name to a “Management Information Base” (MIB) that can be passed repeatedly to
mallctlbymib().
jemalloc_sz2binind_lgq* : this API supports a reverse map from a bin size to its info without a lookup. The mapping depends on the number of size classes, which are derived from lg_quantum
defrag_jemalloc_get_frag_smallbins : this function replaces frag_smallbins_bytes; the logic moved to the new file allocator_defrag
defrag_jemalloc_should_defrag_multi → handle_results - unpacks the results
should_defrag : implements the same logic as the existing implementation inside je_defrag_hint
defrag_jemalloc_should_defrag_multi : implements the hint for an array of pointers, utilizing the new batch API. Currently only 1 pointer is passed.
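
To make the flow from batch results to a per-pointer decision concrete, here is a minimal sketch under stated assumptions. The struct names, function names, and the 12.5% margin are all hypothetical; the PR text only says that the decision uses #nonfull_slabs, the region counts, and the free regions in the pointer's slab. The sketch flags a slab for defrag when its utilization is at or below the average utilization of the non-full slabs in its bin (plus the assumed margin) and clears every pointer that should stay where it is.

```c
#include <stddef.h>

/* The three values the batch query returns for each pointer. */
typedef struct {
    size_t nfree; /* free regions in the pointer's slab */
    size_t nregs; /* total regions in the slab */
    size_t size;  /* slab size in bytes */
} slab_util;

/* Per-bin totals gathered once per query cycle (hypothetical struct). */
typedef struct {
    size_t regs_in_nonfull; /* allocated regions living in non-full slabs */
    size_t nonfull_slabs;   /* number of non-full slabs in the bin */
} bin_totals;

/* Return 1 if moving the allocation is likely to help, 0 otherwise. */
int should_defrag_sketch(const slab_util *u, const bin_totals *t) {
    size_t used = u->nregs - u->nfree;
    if (u->nfree == 0) return 0;        /* full slab: nothing to gain */
    if (t->nonfull_slabs < 2) return 0; /* no better slab to move into */
    /* used <= average * 1.125, cross-multiplied to avoid division.
     * The 1/8 margin is an assumption made for this sketch. */
    return used * t->nonfull_slabs <= t->regs_in_nonfull + t->regs_in_nonfull / 8;
}

/* Set ptrs[i] to NULL for every pointer that should not be defragged.
 * bins[i] is the totals entry for the bin that ptrs[i] belongs to. */
void filter_defrag_candidates(void **ptrs, const slab_util *results,
                              const bin_totals *const *bins, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (ptrs[i] != NULL && !should_defrag_sketch(&results[i], bins[i])) {
            ptrs[i] = NULL;
        }
    }
}
```

With a single pointer per call, as in this PR, `n` is 1; the same loop would work unchanged if a later change passes larger batches.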

Logical differences:

To get the information about #nonfull_slabs and #regs, we use the query cycle to collect it per size class. To find the index of the bin information for a given bin size in O(1), we use jemalloc_sz2binind_lgq*.
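
A lookup-free mapping from a bin size to its index could look like the sketch below. It assumes lg_quantum = 3 (8-byte quantum) and a size-class layout of 8, 16, ..., 64 in steps of 8, followed by four classes per doubling (80, 96, 112, 128, 160, ...). That layout is my reading of jemalloc's size-class scheme and is not spelled out in this PR; the function name is hypothetical and is not the PR's jemalloc_sz2binind_lgq* implementation. The input must be an exact size-class size.

```c
#include <stddef.h>

/* Smallest p such that (1 << p) >= sz, for sz >= 1. */
static unsigned ceil_log2(size_t sz) {
    unsigned p = 0;
    while (((size_t)1 << p) < sz) p++;
    return p;
}

/* Map an exact small size-class size (>= 8) to its bin index,
 * assuming lg_quantum == 3. */
unsigned size_to_bin_index_lgq3(size_t sz) {
    /* First eight classes: 8, 16, 24, ..., 64. */
    if (sz <= 64) return (unsigned)(sz >> 3) - 1;
    /* Later classes come in groups of four inside (2^(p-1), 2^p],
     * spaced 2^(p-3) apart. */
    unsigned p = ceil_log2(sz);
    unsigned k = (unsigned)((sz - ((size_t)1 << (p - 1))) >> (p - 3)); /* 1..4 */
    return 8 + (p - 7) * 4 + (k - 1);
}
```

The loop in `ceil_log2` keeps the sketch portable; a compiler builtin for counting leading zeros would do the same job in constant time.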

Testing

This is the first draft. I did some initial testing that basically creates fragmentation by reducing maxmemory and then waits for defrag to reach the desired level. The test only serves as a sanity check that defrag eventually succeeds; no data is provided here regarding efficiency and performance.

Test:

disable activedefrag
run valkey benchmark on overlapping address ranges with different block sizes
wait until used_memory reaches 10GB
set maxmemory to 5GB and maxmemory-policy to allkeys-lru
stop load
wait for mem_fragmentation_ratio to reach 2
enable activedefrag - start test timer
wait until mem_fragmentation_ratio reaches 1.1

Results*:

(With this PR)Test results: 56 sec
(Without this PR)Test results: 67 sec

*both runs perform the same "work" (number of buffers moved) to reach the fragmentation target

Next benchmarking is to compare to:

DONE // existing je_get_defrag_hint
compare with naive defrag all: int defrag_hint() {return 1;}

deps/Makefile

src/defrag.c

deps/Makefile

zvi-code · 2024-06-25T07:57:00Z

src/zmalloc.c

+    return buf;
+}
+
+#define ARENA_TO_QUERY  0 //MALLCTL_ARENAS_ALL


need to fix assuming we will not set 1 arena in config

src/zmalloc.c

codecov · 2024-06-26T02:08:09Z

Codecov Report

Attention: Patch coverage is 93.56725% with 11 lines in your changes missing coverage. Please review.

Project coverage is 70.78%. Comparing base (3c32ee1) to head (955da04).
Report is 1 commits behind head on unstable.

Files with missing lines	Patch %	Lines
src/allocator_defrag.c	92.99%	11 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##           unstable     #692      +/-   ##
============================================
+ Coverage     70.65%   70.78%   +0.12%     
============================================
  Files           114      115       +1     
  Lines         63158    63284     +126     
============================================
+ Hits          44624    44794     +170     
+ Misses        18534    18490      -44

Files with missing lines	Coverage Δ
src/defrag.c	`86.64% <100.00%> (+1.98%)`	⬆️
src/server.c	`87.75% <100.00%> (+0.05%)`	⬆️
src/zmalloc.c	`82.60% <100.00%> (-2.07%)`	⬇️
src/allocator_defrag.c	`92.99% <92.99%> (ø)`

... and 13 files with indirect coverage changes

ranshid · 2024-06-26T05:21:01Z

@valkey-io/core-team this is the change we discussed at the last summit about aligning with vanilla jemalloc. Can you take a look and say whether it makes sense before we start diving deeper into the review?

src/defrag.c

PingXie · 2024-07-02T05:53:33Z

Thanks, @zvi-code! This is fantastic news!

@valkey-io/core-team, can we consider including this change in version 8.2? Devendoring jemalloc would enable us to leverage the optimized jemalloc library tailored for the target platform, along with all the benefits of devendoring. However, I believe incorporating this into version 8.0 would be quite challenging given our current timeline and the existing PR backlog. I want to make sure we are all on the same page regarding expectations.

zvi-code · 2024-07-02T11:12:26Z

@PingXie, thanks. My only concern with waiting for 8.2 is that I have several follow-up improvements to the defrag mechanism that will be delayed as a result.
The changes cover several aspects [some examples off the top of my head]:

Reduce the cost of defrag per byte of defragmented memory. Do so by: 1) making the defrag iteration memory-access friendly 2) utilizing the batch support to reduce the cost of calls to jemalloc 3) fine-tuning the use of tcache in defrag alloc/free 4) running defrag in parallel with worker threads
Improve when and how defrag runs, for example: 1) prioritize free over defrag 2) if many frees were done to a bin, re-evaluate the defrag quota 3) add QoS to defrag to smooth the impact on latency and tail latency 4) there is no point in defragging before we have completed purge/flush 5) adjust defrag based on the progress achieved in a full key space scan + consider alloc/dealloc activity during the scan cycle [for example, if we made no progress, we can predict with high probability that we will not make progress in the next run either] ++
Improve the what-to-defrag decision algorithm: 1) differentiate between even bin distribution and balanced [these require different prioritization because they solve different problems] 2) consider change velocity 3) consider total locked capacity in the bin ++

Memory fragmentation is still a big problem in many use cases and I want Valkey to be a top performer in this respect, because it greatly affects the usability of the memory resource. Ideally I want it to work so well that we will all agree to enable it by default (if supported)

PingXie · 2024-07-02T16:18:18Z

Thanks for the additional context, @zvi-code. I am aligned, directionally.

For transparency, we are going to cut our first RC in a few weeks in anticipation of a fall GA. So my concern is purely on the risk management side and I am not sure if we could converge on this PR soon. That said, I have marked this PR for "major-decision-pending" (in the sense of "when" as opposed to "whether"). Will check out the PR next.

zuiderkwast · 2024-07-02T16:21:42Z

@zvi-code When we release Valkey 8.0.0-rc1, we create the 8.0 branch, and we continue to merge new features into unstable. There is no freeze of the development. You will have time to finish the follow ups.

Valkey 8 rc1 is very soon. We have to work hard to review and merge the features already planned for Valkey 8. Probably we'll even have to exclude some of them. We don't want to delay releases indefinitely like they used to do in the R*dis times. Valkey 8.2 will be within the next 6 months. (That's the plan.)

zvi-code · 2024-07-02T16:33:54Z

> @zvi-code When we release Valkey 8.0.0-rc1, we create the 8.0 branch, and we continue to merge new features into unstable. There is no freeze of the development. You will have time to finish the follow ups.
>
> Valkey 8 rc1 is very soon. We have to work hard to review and merge the features already planned for Valkey 8. Probably we'll even have to exclude some of them. We don't want to delay releases indefinitely like they used to do in the R*dis times. Valkey 8.2 will be within the next 6 months. (That's the plan.)

@zuiderkwast & @PingXie I appreciate your replies, makes sense!

Signed-off-by: Zvi Schneider <[email protected]>

Signed-off-by: zvi-code <[email protected]>

madolson · 2024-09-06T16:51:30Z

We are now just cherry-picking changes from unstable into the 8.0 branch, so we can resume investigating this now if it makes sense to go into the next minor release.

madolson

I started to take a look and wasn't really able to follow the logic. Is there a better high-level explanation of how this change works? (My comments are just minor readability things while I was trying to grok the change). If it works as intended, it seems like a good thing to start incorporating ASAP so you can implement your additional changes for 8.2.

src/defrag.c

src/allocator_defrag.c

src/server.c

madolson · 2024-09-10T02:57:45Z

src/server.c

@@ -5591,6 +5592,12 @@ sds genValkeyInfoString(dict *section_dict, int all_sections, int everything) {
        /* clang-format on */
        freeMemoryOverheadData(mh);
    }
+    if (all_sections || (dictFind(section_dict, "defrag") != NULL)) {
+        if (sections++) info = sdscat(info, "\r\n");
+        /* clang-format off */


There is no corresponding clang-format on

poor copy paste, will fix

src/allocator_defrag.c

zvi-code · 2024-09-17T11:58:29Z

@madolson I wrote a more elaborate overview in the top comment; please let me know if it helps when reading through this change

src/allocator_defrag.c

zuiderkwast

It's a lot of code that I don't understand. Is it basically the same logic that we used to have in our patched jemalloc that is converted to use mallctl instead of jemalloc's internals?

src/allocator_defrag.h

zuiderkwast · 2024-09-30T21:03:56Z

src/allocator_defrag.c

+ * For each result it checks if defragmentation should be performed based on should_defrag function.
+ * If defragmentation should NOT be performed, it sets the corresponding pointer in the ptrs array to NULL.
+ * */
+void handle_results(je_bins_conf *conf,


static?

The name is a little too generic to be a non-local function.

+1 on static, will fix the name as well

src/allocator_defrag.c

zvi-code · 2024-10-09T04:38:30Z

> It's a lot of code that I don't understand. Is it basically the same logic that we used to have in our patched jemalloc that is converted to use mallctl instead of jemalloc's internals?

Yes, with adjustments for the differences between the APIs

madolson

I get it all now. It seems pretty consistent with the previous behavior and generally all seems right to me.

src/allocator_defrag.c

madolson · 2024-10-10T04:35:04Z

src/allocator_defrag.c

+    je_defrag_bstats stat;            ///< Defragmentation statistics for the bin.
+} je_busage;
+
+/// @brief Struct representing the latest usage information across all bins.


We typically don't use either /// for comments nor do we use doxygen style comments.

Suggested change

/// @brief Struct representing the latest usage information across all bins.

/* Struct representing the latest usage information across all bins. */

src/allocator_defrag.c

madolson · 2024-10-10T04:37:54Z

src/allocator_defrag.c

+ * 4. Set the `defrag_supported` flag to indicate that defragmentation is enabled.
+ *
+ * Note: This function must be called before using any other defragmentation-related functionality.
+ * It should be called during the initialization phase of the application or module that uses the


Are you actually referring to modules here? I think modules are loaded after you call the init from the core.

module as in "code component" not in the valkey meaning, will remove

fixed to code

madolson · 2024-10-10T05:09:20Z

src/allocator_defrag.c

+    unsigned long nmalloc;  ///< Number of malloc operations (unused in this implementation).
+    unsigned long ndealloc; ///< Number of dealloc operations (unused in this implementation).


Can we drop them if they are unused?

madolson · 2024-10-10T05:17:44Z

src/allocator_defrag.c

+                                "[%d][%lu]::"
+                                "nregs:%lu,nslabs:%lu,nnonfull:%lu,"
+                                "hit_rate:%lu%%,hit:%lu,miss:%lu,nmalloc:%lu,ndealloc:%lu\r\n",


This is even weirder syntax. I don't think all this low level defrag information should be printed by default. Also, do we expect anyone to really be able to use this information? Exposing data is a one way door.

This information is at the bin level. We don't have an equivalent today, and it is useful for someone debugging defrag or wanting more advanced enable/disable defrag logic. I agree it's a lot of very low-level detail. If we remove it from info all and only output it when info defrag is queried, would that make more sense? Or do you think we should use another mechanism to expose this info when needed?

@zuiderkwast @madolson would appreciate your thoughts on the info defrag section. If you agree it is useful, how would you suggest combining each bin's info? I don't want to have a line for each stat member of each bin...

madolson · 2024-10-10T05:20:25Z

src/allocator_defrag.c

+    unsigned nbins = arena_bin_conf.nbins;
+    if (nbins > 0) {
+        info = sdscatprintf(info,
+                            "jemalloc_quantom:%d\r\n"


This isn't specifically related to the defragmentation logic, not sure why it's in that section. There is a separate section, mem_allocator which maybe it should go instead.

+1, and correct the spelling of "quantum"

Ack on the spelling. Regarding the info, the issue is that today the defrag information is scattered. We have some in the stats section (the defrag activity), but memory and fragmentation status are in a different section. I think it's logical to have the information in one place.

madolson · 2024-10-10T05:21:39Z

src/allocator_defrag.c

+            busage = &usage_latest.bins_usage[j];
+            info = sdscatprintf(info,
+                                "[%d][%lu]::"
+                                "nregs:%lu,nslabs:%lu,nnonfull:%lu,"


Suggested change

"nregs:%lu,nslabs:%lu,nnonfull:%lu,"

"regions:%lu,slabs:%lu,nonfull:%lu,"

Would rather have more useful names. We also seem to omit "number" info fields. "number_of_x" would also be fine.

madolson · 2024-10-10T05:23:37Z

src/allocator_defrag.c

+    if (nbins > 0) {
+        info = sdscatprintf(info,
+                            "jemalloc_quantom:%d\r\n"
+                            "hit_ratio:%lu%%,hits:%lu,misses:%lu\r\n"


Suggested change

"hit_ratio:%lu%%,hits:%lu,misses:%lu\r\n"

"defrag_hit_ratio:%lu%%\r\ndefrag_hits:%lu,defrag_misses:%lu\r\n"

This is a weird info syntax. We normally don't have multiple pairs on the same line separated by commas. I think each field should get its own line. Each info field also needs to stand on its own, since some clients like python decompose each element into a key-value pair, and hit_ratio isn't a very descriptive name.

How is this different from the other metrics like active_defrag_hits?

I didn't want to change the existing defrag metrics tracked in defrag.c. Those existing metrics could potentially apply to non-jemalloc defrag and carry upper-level logic, like keys defragged. I can think of other nice metrics at the defrag.c level, such as information about the types of keys that were defragged. Here the information is from the perspective of the allocator; this specific metric does correspond to active_defrag_hits in the existing code. @madolson Does this make sense? Or do you still think we should consolidate?
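
For reference, the one-field-per-line layout requested in this thread could be produced along the lines of the sketch below. It uses plain snprintf instead of sdscatprintf so that it stands alone; the function name is hypothetical, and the field names are taken from the suggested change above rather than from any merged code.

```c
#include <stdio.h>
#include <stddef.h>

/* Write three independent "key:value" lines, each terminated by CRLF,
 * so that clients splitting on lines get one pair per field.
 * Returns what snprintf returns. */
int format_defrag_info(char *buf, size_t len, unsigned long hits, unsigned long misses) {
    unsigned long total = hits + misses;
    unsigned long ratio = total ? (hits * 100) / total : 0; /* avoid divide by zero */
    return snprintf(buf, len,
                    "defrag_hit_ratio:%lu%%\r\n"
                    "defrag_hits:%lu\r\n"
                    "defrag_misses:%lu\r\n",
                    ratio, hits, misses);
}
```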

madolson · 2024-10-14T14:16:16Z

This doesn't seem like a major decision, so removing the tag. If we were to de-vendor it, we would want to see that in a separate PR.

zvi-code · 2024-10-20T08:00:10Z

Will post a PR with fixes to existing comments later this week

zvi-code · 2024-11-05T12:58:28Z

forced push by mistake

zvi-code · 2024-11-05T15:40:38Z

@madolson , @zuiderkwast , @ranshid, I created a new PR with current code and a separate commit for CR fixes. Seems I used wrong update method (rebased) and I could not align the branch without encountering issues.

ranshid reviewed Jun 25, 2024

deps/Makefile (outdated)

ranshid reviewed Jun 25, 2024

src/defrag.c

zvi-code commented Jun 25, 2024

deps/Makefile (outdated)

zvi-code commented Jun 25, 2024

src/zmalloc.c (outdated)

zvi-code marked this pull request as draft June 25, 2024 08:05

zvi-code commented Jun 25, 2024

src/zmalloc.c (outdated)

zvi-code changed the title ~~defrag: use jemalloc api to align with jemalloc oss~~ defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code Jun 25, 2024

zvi-code mentioned this pull request Jun 27, 2024

Jemalloc defrag situation #364

Open

hpatro reviewed Jun 27, 2024

src/defrag.c

zvi-code force-pushed the align_defrag_vanila branch 2 times, most recently from 77d0c2e to c466bb1 Compare June 30, 2024 12:58

zvi-code force-pushed the align_defrag_vanila branch 3 times, most recently from 4318895 to 677ad5c Compare July 2, 2024 10:20

zvi-code marked this pull request as ready for review July 2, 2024 10:23

zvi-code force-pushed the align_defrag_vanila branch from 677ad5c to 0213d93 Compare July 2, 2024 10:54

PingXie added the major-decision-pending Major decision pending by TSC team label Jul 2, 2024

zvi-code force-pushed the align_defrag_vanila branch from 1ae70df to b0db4b1 Compare July 2, 2024 16:45

defrag: use jemalloc api to align with vanila jemalloc

003849d

Signed-off-by: Zvi Schneider <[email protected]>

zvi-code force-pushed the align_defrag_vanila branch from 24150e7 to 003849d Compare July 2, 2024 18:46

Merge branch 'valkey-io:unstable' into align_defrag_vanila

8597a83

PingXie mentioned this pull request Jul 10, 2024

[NEW] Better branching strategy for Valkey #769

Open

Merge branch 'unstable' into align_defrag_vanila

fd17a09

Signed-off-by: zvi-code <[email protected]>

madolson reviewed Sep 10, 2024

zuiderkwast reviewed Sep 30, 2024

src/allocator_defrag.c

zuiderkwast reviewed Sep 30, 2024

madolson reviewed Oct 10, 2024

madolson added major-decision-approved Major decision approved by TSC team and removed major-decision-pending Major decision pending by TSC team labels Oct 14, 2024

Merge branch 'valkey-io:unstable' into align_defrag_vanila

955da04

zvi-code force-pushed the align_defrag_vanila branch from 955da04 to 91b4dde Compare November 5, 2024 11:12

zvi-code closed this Nov 5, 2024

zvi-code force-pushed the align_defrag_vanila branch from 91b4dde to 3c32ee1 Compare November 5, 2024 12:45

zvi-code reopened this Nov 5, 2024

zvi-code force-pushed the align_defrag_vanila branch 2 times, most recently from 73535fe to 955da04 Compare November 5, 2024 13:52

zvi-code mentioned this pull request Nov 5, 2024

Remove valkey specific changes in jemalloc source code #1266

Merged

zvi-code closed this Nov 5, 2024
