-
Notifications
You must be signed in to change notification settings - Fork 678
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code #692
Conversation
src/zmalloc.c
Outdated
return buf; | ||
} | ||
|
||
#define ARENA_TO_QUERY 0 //MALLCTL_ARENAS_ALL |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to fix assuming we will not set 1 arena in config
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #692 +/- ##
============================================
+ Coverage 70.65% 70.78% +0.12%
============================================
Files 114 115 +1
Lines 63158 63284 +126
============================================
+ Hits 44624 44794 +170
+ Misses 18534 18490 -44
|
@valkey-io/core-team this is the change we discussed in the last summit about aligning JEMALLOC vanila. Can you take a look and state if that makes sense before we start diving deeper to the review? |
77d0c2e
to
c466bb1
Compare
Thanks, @zvi-code! This is fantastic news! @valkey-io/core-team, can we consider including this change in version 8.2? Devendoring jemalloc would enable us to leverage the optimized jemalloc library tailored for the target platform, along with all the benefits of devendoring. However, I believe incorporating this into version 8.0 would be quite challenging given our current timeline and the existing PR backlog. I want to make sure we are all on the same page regarding expectations. |
4318895
to
677ad5c
Compare
677ad5c
to
0213d93
Compare
@PingXie , Thanks. My only concern with waiting for 8.2 is that I have several followup improvements to defrag mechanism that will be delayed as a result.
memory fragmentation is still a big problem in many use cases and I want valkey to be top performer in this aspect because it greatly affects the usability of memory resource. Ideally I want it to work so well that we will all agree to enabled it by default (if supported) |
Thanks for the additional context, @zvi-code. I am aligned, directionally. For transparency, we are going to cut our first RC in a few weeks in anticipation of a fall GA. So my concern is purely on the risk management side and I am not sure if we could converge on this PR soon. That said, I have marked this PR for "major-decision-pending" (in the sense of "when" as opposed to "whether"). Will check out the PR next. |
@zvi-code When we release Valkey 8.0.0-rc1, we create the 8.0 branch, and we continue to merge new features into unstable. There is no freeze of the development. You will have time to finish the follow ups. Valkey 8 rc1 is very soon. We have to work hard to review and merge the features already planned for Valkey 8. Probably we'll even have to exclude some of them. We don't want to delay releases indefinitely like they used to do in the R*dis times. Valkey 8.2 will be within the next 6 months. (That's the plan.) |
@zuiderkwast & @PingXie appriciate your reply, make sense! |
1ae70df
to
b0db4b1
Compare
Signed-off-by: Zvi Schneider <[email protected]>
24150e7
to
003849d
Compare
Signed-off-by: zvi-code <[email protected]>
We are now just cherry-picking changes from unstable into the 8.0 branch, so we can resume investigating this now if it makes sense to go into the next minor release. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started to take a look and wasn't really able follow the logic. Is there a better high level explanation of how this change works? (My comments are just minor readability things while I was trying to grok the change). If it works as intended it seems like a good thing to start incorporating ASAP so you can implement your additional changes for 8.2.
@@ -5591,6 +5592,12 @@ sds genValkeyInfoString(dict *section_dict, int all_sections, int everything) { | |||
/* clang-format on */ | |||
freeMemoryOverheadData(mh); | |||
} | |||
if (all_sections || (dictFind(section_dict, "defrag") != NULL)) { | |||
if (sections++) info = sdscat(info, "\r\n"); | |||
/* clang-format off */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is no corresponding clang-format on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
poor copy paste, will fix
@madolson wrote a more elaborated overview in the top comment please let me know if it helps reading through this change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a lot of code that I don't understand. Is it basically the same logic that we used to have in our patched jemalloc that is converted to use mallctl instead of jemalloc's internals?
* For each result it checks if defragmentation should be performed based on should_defrag function. | ||
* If defragmentation should NOT be performed, it sets the corresponding pointer in the ptrs array to NULL. | ||
* */ | ||
void handle_results(je_bins_conf *conf, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
static?
The name is a little too generic to be a non-local function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on static, will fix the name as well
Yes with adjustment to differences between the API's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get it all now. It seems pretty consistent to the previous behavior and generally all seems right to me.
je_defrag_bstats stat; ///< Defragmentation statistics for the bin. | ||
} je_busage; | ||
|
||
/// @brief Struct representing the latest usage information across all bins. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We typically don't use either ///
for comments nor do we use doxygen style comments.
/// @brief Struct representing the latest usage information across all bins. | |
/* Struct representing the latest usage information across all bins. */ |
* 4. Set the `defrag_supported` flag to indicate that defragmentation is enabled. | ||
* | ||
* Note: This function must be called before using any other defragmentation-related functionality. | ||
* It should be called during the initialization phase of the application or module that uses the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you actually referring to modules here? I think modules are loaded after you call the init from the core.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
module as in "code component" not in the valkey meaning, will remove
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed to code
unsigned long nmalloc; ///< Number of malloc operations (unused in this implementation). | ||
unsigned long ndealloc; ///< Number of dealloc operations (unused in this implementation). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we drop them if they are unused?
"[%d][%lu]::" | ||
"nregs:%lu,nslabs:%lu,nnonfull:%lu," | ||
"hit_rate:%lu%%,hit:%lu,miss:%lu,nmalloc:%lu,ndealloc:%lu\r\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is even weirder syntax. I don't think all this low level defrag information should be printed by default. Also, do we expect anyone to really be able to use this information? Exposing data is a one way door.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This information is at the bin level, we don't have the equivalent today and it is useful for someone debugging defrag or if they want to have advance enable\disable defrag logic. I agree it's a lot of very low level details, if we remove it from info all
and only output if info defrag
is queried, would this make more sense? or do you think we should use other mechanism to expose this info in case of need?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@zuiderkwast @madolson would appreciate your thoughts on the info defrag
section. If you agree it is useful, how would you suggest to combine each bin's info? I don't want to have a line for each stat member of each bin...
unsigned nbins = arena_bin_conf.nbins; | ||
if (nbins > 0) { | ||
info = sdscatprintf(info, | ||
"jemalloc_quantom:%d\r\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't specifically related to the defragmentation logic, not sure why it's in that section. There is a separate section, mem_allocator
which maybe it should go instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, and correct the spelling of "quantum"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack on the spelling, regarding the info, the issue is that today the defrag information is scattered. We have some in the stats section (the defrag activity) but memory and fragmentation status are in different section. I think it's logical to have the information in one place.
busage = &usage_latest.bins_usage[j]; | ||
info = sdscatprintf(info, | ||
"[%d][%lu]::" | ||
"nregs:%lu,nslabs:%lu,nnonfull:%lu," |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"nregs:%lu,nslabs:%lu,nnonfull:%lu," | |
"regions:%lu,slabs:%lu,nonfull:%lu," |
Would rather have more useful names. We also seem to omit "number" info fields. "numbe_of_x" would also be fine.
if (nbins > 0) { | ||
info = sdscatprintf(info, | ||
"jemalloc_quantom:%d\r\n" | ||
"hit_ratio:%lu%%,hits:%lu,misses:%lu\r\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"hit_ratio:%lu%%,hits:%lu,misses:%lu\r\n" | |
"defrag_hit_ratio:%lu%%\r\ndefrag_hits:%lu,defrag_misses:%lu\r\n" |
This is a weird info syntax. We normally don't have multiple pairs on the same line separated by commands. I think each field should get its own line. Each info field needs to also stand on its own, some clients like python decompose each element into a key-value pair, and hit_ratio isn't a very descriptive name.
How is this from the other metrics like ative_defrag_hits
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want to change exiting defrag metrics tracked from defrag.c
file. These (existing metrics) could be, potentially, for non jemalloc defrag and have upper level logic, like keys defragged. I can think of nice other metrics at the defrag.c level, like have information about types of keys that where defragged. Here this information is from the perspective of the allocator, this specific metric does correspond to the active_defrag_hits
in existing code. @madolson Does this make sense? or do you still think we should consolidate?
This doesn't seem like a major decision, so removing the tag. If we were to de-vendor it, we would want to see that in a separate PR. |
Will post a PR with fixes to existing comments later this week |
955da04
to
91b4dde
Compare
91b4dde
to
3c32ee1
Compare
forced push by mistake |
73535fe
to
955da04
Compare
@madolson , @zuiderkwast , @ranshid, I created a new PR with current code and a separate commit for CR fixes. Seems I used wrong update method (rebased) and I could not align the branch without encountering issues. |
Summary of the change
This is a base PR for refactoring defrag. It moves the defrag logic to rely on jemalloc native api instead of relying on custom code changes made by valkey in the jemalloc (je_defrag_hint) library. This enables valkey to use latest vanila jemalloc without the need to maintain code changes cross jemalloc versions.
This change requires some modifications because the new api is providing only the information, not a yes\no defrag. The logic needs to be implemented at valkey code. Additionally, the api does not provide, within single call, all the information needed to make a decision, this information is available through additional api call. To reduce the calls to jemalloc, in this PR the required information is collected during the
computeDefragCycles
and not for every single ptr, this way we are avoiding the additional api call.Followup work will utilize the new options that are now open and will further improve the defrag decision and process.
Added files:
allocator_defrag.c
/allocator_defrag.h
- This files implement the allocator specific knowledge for making defrag decision. The knowledge about slabs and allocation logic and so on, all goes into this file. This improves the separation between jemalloc specific code and other possible implementation.Moved functions:
zmalloc_no_tcache
,zfree_no_tcache
- these are very jemalloc specific logic assumptions, and are very specific to how we defrag with jemalloc. This is also with the vision that from performance perspective we should consider using tcache, we only need to make sure we don't recycle entries without going through the arena [for example: we can use private tcache, one for free and one for alloc].frag_smallbins_bytes
- the logic and implementation moved to the new fileExisting API:
computeDefragCycles
zmalloc_get_allocator_info
: gets from jemalloc allocated, active, resident, retained, muzzy,frag_smallbins_bytes
frag_smallbins_bytes
: for each bin; gets from jemalloc bin_info,curr_regs
,cur_slabs
je_defrag_hint
is getting a memory pointer and returns {0,1} . Internally it uses this information points:nonfull_slabs
total_slabs
Jemalloc API (via ctl interface)
[BATCH]
experimental_utilization_batch_query_ctl
: gets an array of pointers, returns for each pointer 3 values,[EXTENDED]
experimental_utilization_query_ctl
:experimental_utilization_batch_query_ctl
vs valkeyje_defrag_hint
?[good]
[bad]
nonfull_slabs
and total allocated regs.New functions:
defrag_jemalloc_init
: Reducing the cost of call to je_ctl: use the MIB interface to get a faster calls. See this quote from the jemalloc documentation:The mallctlnametomib() function provides a way to avoid repeated name lookups for
applications that repeatedly query the same portion of the namespace,by translating
a name to a “Management Information Base” (MIB) that can be passed repeatedly to
mallctlbymib().
jemalloc_sz2binind_lgq*
: this api is to support reverse map between bin size and it’s info without lookup. This mapping depends on the number of size classes we have that are derived fromlg_quantum
defrag_jemalloc_get_frag_smallbins
: This function replacesfrag_smallbins_bytes
the logic moved to the new file allocator_defragdefrag_jemalloc_should_defrag_multi
→handle_results
- unpacks the resultsshould_defrag
: implements the same logic as the existing implementation inside je_defrag_hintdefrag_jemalloc_should_defrag_multi
: implements the hint for an array of pointers, utilizing the new batch api. currently only 1 pointer is passed.Logical differences:
In order to get the information about #
nonfull_slabs
and #regs
, we use the query cycle to collect the information per size class. In order to find the index of bin information given bin size, in o(1), we usejemalloc_sz2binind_lgq*
.Testing
This is the first draft. I did some initial testing that basically fragmentation by reducing max memory and than waiting for defrag to reach desired level. The test only serves as sanity that defrag is succeeding eventually, no data provided here regarding efficiency and performance.
Test:
activedefrag
used_memory
reaches 10GBmaxmemory
to 5GB andmaxmemory-policy
toallkeys-lru
mem_fragmentation_ratio
to reach 2activedefrag
- start test timermem_fragmentation_ratio
= 1.1Results*:
(With this PR)Test results:
56 sec
(Without this PR)Test results:
67 sec
*both runs perform same "work" number of buffers moved to reach fragmentation target
Next benchmarking is to compare to:
je_get_defrag_hint
int defrag_hint() {return 1;}