Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code #692

Closed
wants to merge 4 commits into from

Conversation

zvi-code
Copy link
Contributor

@zvi-code zvi-code commented Jun 25, 2024

Summary of the change

This is a base PR for refactoring defrag. It moves the defrag logic to rely on jemalloc native api instead of relying on custom code changes made by valkey in the jemalloc (je_defrag_hint) library. This enables valkey to use latest vanila jemalloc without the need to maintain code changes cross jemalloc versions.

This change requires some modifications because the new api is providing only the information, not a yes\no defrag. The logic needs to be implemented at valkey code. Additionally, the api does not provide, within single call, all the information needed to make a decision, this information is available through additional api call. To reduce the calls to jemalloc, in this PR the required information is collected during the computeDefragCycles and not for every single ptr, this way we are avoiding the additional api call.
Followup work will utilize the new options that are now open and will further improve the defrag decision and process.

Added files:

allocator_defrag.c / allocator_defrag.h - This files implement the allocator specific knowledge for making defrag decision. The knowledge about slabs and allocation logic and so on, all goes into this file. This improves the separation between jemalloc specific code and other possible implementation.

Moved functions:

zmalloc_no_tcache , zfree_no_tcache - these are very jemalloc specific logic assumptions, and are very specific to how we defrag with jemalloc. This is also with the vision that from performance perspective we should consider using tcache, we only need to make sure we don't recycle entries without going through the arena [for example: we can use private tcache, one for free and one for alloc].
frag_smallbins_bytes - the logic and implementation moved to the new file

Existing API:

  • [once a second + when completed full cycle] computeDefragCycles
    • zmalloc_get_allocator_info : gets from jemalloc allocated, active, resident, retained, muzzy, frag_smallbins_bytes
    • frag_smallbins_bytes : for each bin; gets from jemalloc bin_info, curr_regs, cur_slabs
  • [during defrag, for each pointer]
    • je_defrag_hint is getting a memory pointer and returns {0,1} . Internally it uses this information points:
      • #nonfull_slabs
      • #total_slabs
      • #free regs in the ptr slab

Jemalloc API (via ctl interface)

[BATCH]experimental_utilization_batch_query_ctl : gets an array of pointers, returns for each pointer 3 values,

  • number of free regions in the extent
  • number of regions in the extent
  • size of the extent in terms of bytes

[EXTENDED]experimental_utilization_query_ctl :

  • memory address of the extent a potential reallocation would go into
  • number of free regions in the extent
  • number of regions in the extent
  • size of the extent in terms of bytes
  • [stats-enabled]total number of free regions in the bin the extent belongs to
  • [stats-enabled]total number of regions in the bin the extent belongs to

experimental_utilization_batch_query_ctl vs valkey je_defrag_hint?

[good]

  • We can query pointers in a batch, reduce the overall overhead
  • The per ptr decision algorithm is not within jemalloc api, jemalloc only provides information, valkey can tune\configure\optimize easily

[bad]

  • In the batch API we only know the utilization of the slab (of that memory ptr), we don’t get the data about #nonfull_slabs and total allocated regs.

New functions:

  1. defrag_jemalloc_init: Reducing the cost of call to je_ctl: use the MIB interface to get a faster calls. See this quote from the jemalloc documentation:

    The mallctlnametomib() function provides a way to avoid repeated name lookups for
    applications that repeatedly query the same portion of the namespace,by translating
    a name to a “Management Information Base” (MIB) that can be passed repeatedly to
    mallctlbymib().

  2. jemalloc_sz2binind_lgq* : this api is to support reverse map between bin size and it’s info without lookup. This mapping depends on the number of size classes we have that are derived from lg_quantum

  3. defrag_jemalloc_get_frag_smallbins : This function replaces frag_smallbins_bytes the logic moved to the new file allocator_defrag
    defrag_jemalloc_should_defrag_multihandle_results - unpacks the results

  4. should_defrag : implements the same logic as the existing implementation inside je_defrag_hint

  5. defrag_jemalloc_should_defrag_multi : implements the hint for an array of pointers, utilizing the new batch api. currently only 1 pointer is passed.

Logical differences:

In order to get the information about #nonfull_slabs and #regs, we use the query cycle to collect the information per size class. In order to find the index of bin information given bin size, in o(1), we use jemalloc_sz2binind_lgq* .

Testing

This is the first draft. I did some initial testing that basically fragmentation by reducing max memory and than waiting for defrag to reach desired level. The test only serves as sanity that defrag is succeeding eventually, no data provided here regarding efficiency and performance.

Test:

  1. disable activedefrag
  2. run valkey benchmark on overlapping address ranges with different block sizes
  3. wait untill used_memory reaches 10GB
  4. set maxmemory to 5GB and maxmemory-policy to allkeys-lru
  5. stop load
  6. wait for mem_fragmentation_ratio to reach 2
  7. enable activedefrag - start test timer
  8. wait until reach mem_fragmentation_ratio = 1.1

Results*:

(With this PR)Test results: 56 sec
(Without this PR)Test results: 67 sec

*both runs perform same "work" number of buffers moved to reach fragmentation target

Next benchmarking is to compare to:

  • DONE // existing je_get_defrag_hint
  • compare with naive defrag all: int defrag_hint() {return 1;}

deps/Makefile Outdated Show resolved Hide resolved
src/defrag.c Show resolved Hide resolved
deps/Makefile Outdated Show resolved Hide resolved
src/zmalloc.c Outdated
return buf;
}

#define ARENA_TO_QUERY 0 //MALLCTL_ARENAS_ALL
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to fix assuming we will not set 1 arena in config

src/zmalloc.c Outdated Show resolved Hide resolved
@zvi-code zvi-code marked this pull request as draft June 25, 2024 08:05
src/zmalloc.c Outdated Show resolved Hide resolved
@zvi-code zvi-code changed the title defrag: use jemalloc api to align with jemalloc oss defrag: replace je_get_defrag_hint with jemalloc native interface and remove valkey specific changes in jemalloc source code Jun 25, 2024
Copy link

codecov bot commented Jun 26, 2024

Codecov Report

Attention: Patch coverage is 93.56725% with 11 lines in your changes missing coverage. Please review.

Project coverage is 70.78%. Comparing base (3c32ee1) to head (955da04).
Report is 1 commits behind head on unstable.

Files with missing lines Patch % Lines
src/allocator_defrag.c 92.99% 11 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable     #692      +/-   ##
============================================
+ Coverage     70.65%   70.78%   +0.12%     
============================================
  Files           114      115       +1     
  Lines         63158    63284     +126     
============================================
+ Hits          44624    44794     +170     
+ Misses        18534    18490      -44     
Files with missing lines Coverage Δ
src/defrag.c 86.64% <100.00%> (+1.98%) ⬆️
src/server.c 87.75% <100.00%> (+0.05%) ⬆️
src/zmalloc.c 82.60% <100.00%> (-2.07%) ⬇️
src/allocator_defrag.c 92.99% <92.99%> (ø)

... and 13 files with indirect coverage changes

@ranshid
Copy link
Member

ranshid commented Jun 26, 2024

@valkey-io/core-team this is the change we discussed in the last summit about aligning JEMALLOC vanila. Can you take a look and state if that makes sense before we start diving deeper to the review?

src/defrag.c Show resolved Hide resolved
@zvi-code zvi-code force-pushed the align_defrag_vanila branch 2 times, most recently from 77d0c2e to c466bb1 Compare June 30, 2024 12:58
@PingXie
Copy link
Member

PingXie commented Jul 2, 2024

Thanks, @zvi-code! This is fantastic news!

@valkey-io/core-team, can we consider including this change in version 8.2? Devendoring jemalloc would enable us to leverage the optimized jemalloc library tailored for the target platform, along with all the benefits of devendoring. However, I believe incorporating this into version 8.0 would be quite challenging given our current timeline and the existing PR backlog. I want to make sure we are all on the same page regarding expectations.

@zvi-code zvi-code force-pushed the align_defrag_vanila branch 3 times, most recently from 4318895 to 677ad5c Compare July 2, 2024 10:20
@zvi-code zvi-code marked this pull request as ready for review July 2, 2024 10:23
@zvi-code
Copy link
Contributor Author

zvi-code commented Jul 2, 2024

@PingXie , Thanks. My only concern with waiting for 8.2 is that I have several followup improvements to defrag mechanism that will be delayed as a result.
The changes are in several aspects [some examples from the top of my head]:

  1. Reduce the cost of defrag per byte of defragmented memory. Do so by: 1) improving the iteration of the defrag to be memory access friendly 2) utilize the batch support to reduce the cost of call to jemalloc 3) fine tune the use of tcache in defrag alloc/free 4) run defrag in parallel with worker threads
  2. Improve the when and how defrag is running: for example: 1) prioritize free over defrag 2) if many free where done to a bin re-evaluate the defrag quota 3) add qos into defrag to smooth the impact on latency+tail latency 4) no point to defrag before we completed purge/flush 5) make adjustment to defrag based on progress achieved in full key space scan completion + consider alloc/dealloc activity during the scan cycle period [for example if we made no progress, we can predict with high probability that we will not make progress also in next run] ++
  3. Improve the what to defrag decision algorithm: 1) differentiate between even bin distribution and balanced [these require different prioritization cause the solve different problem] 2) consider change velocity 3) consider total locked capacity in bin ++

memory fragmentation is still a big problem in many use cases and I want valkey to be top performer in this aspect because it greatly affects the usability of memory resource. Ideally I want it to work so well that we will all agree to enabled it by default (if supported)

@PingXie PingXie added the major-decision-pending Major decision pending by TSC team label Jul 2, 2024
@PingXie
Copy link
Member

PingXie commented Jul 2, 2024

Thanks for the additional context, @zvi-code. I am aligned, directionally.

For transparency, we are going to cut our first RC in a few weeks in anticipation of a fall GA. So my concern is purely on the risk management side and I am not sure if we could converge on this PR soon. That said, I have marked this PR for "major-decision-pending" (in the sense of "when" as opposed to "whether"). Will check out the PR next.

@zuiderkwast
Copy link
Contributor

@zvi-code When we release Valkey 8.0.0-rc1, we create the 8.0 branch, and we continue to merge new features into unstable. There is no freeze of the development. You will have time to finish the follow ups.

Valkey 8 rc1 is very soon. We have to work hard to review and merge the features already planned for Valkey 8. Probably we'll even have to exclude some of them. We don't want to delay releases indefinitely like they used to do in the R*dis times. Valkey 8.2 will be within the next 6 months. (That's the plan.)

@zvi-code
Copy link
Contributor Author

zvi-code commented Jul 2, 2024

@zvi-code When we release Valkey 8.0.0-rc1, we create the 8.0 branch, and we continue to merge new features into unstable. There is no freeze of the development. You will have time to finish the follow ups.

Valkey 8 rc1 is very soon. We have to work hard to review and merge the features already planned for Valkey 8. Probably we'll even have to exclude some of them. We don't want to delay releases indefinitely like they used to do in the R*dis times. Valkey 8.2 will be within the next 6 months. (That's the plan.)

@zuiderkwast & @PingXie appriciate your reply, make sense!

@madolson
Copy link
Member

madolson commented Sep 6, 2024

We are now just cherry-picking changes from unstable into the 8.0 branch, so we can resume investigating this now if it makes sense to go into the next minor release.

Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started to take a look and wasn't really able follow the logic. Is there a better high level explanation of how this change works? (My comments are just minor readability things while I was trying to grok the change). If it works as intended it seems like a good thing to start incorporating ASAP so you can implement your additional changes for 8.2.

src/defrag.c Show resolved Hide resolved
src/allocator_defrag.c Show resolved Hide resolved
src/server.c Show resolved Hide resolved
@@ -5591,6 +5592,12 @@ sds genValkeyInfoString(dict *section_dict, int all_sections, int everything) {
/* clang-format on */
freeMemoryOverheadData(mh);
}
if (all_sections || (dictFind(section_dict, "defrag") != NULL)) {
if (sections++) info = sdscat(info, "\r\n");
/* clang-format off */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no corresponding clang-format on

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

poor copy paste, will fix

src/allocator_defrag.c Show resolved Hide resolved
src/allocator_defrag.c Show resolved Hide resolved
@zvi-code
Copy link
Contributor Author

zvi-code commented Sep 17, 2024

@madolson wrote a more elaborated overview in the top comment please let me know if it helps reading through this change

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a lot of code that I don't understand. Is it basically the same logic that we used to have in our patched jemalloc that is converted to use mallctl instead of jemalloc's internals?

src/allocator_defrag.h Show resolved Hide resolved
* For each result it checks if defragmentation should be performed based on should_defrag function.
* If defragmentation should NOT be performed, it sets the corresponding pointer in the ptrs array to NULL.
* */
void handle_results(je_bins_conf *conf,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static?

The name is a little too generic to be a non-local function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on static, will fix the name as well

src/allocator_defrag.c Show resolved Hide resolved
@zvi-code
Copy link
Contributor Author

zvi-code commented Oct 9, 2024

It's a lot of code that I don't understand. Is it basically the same logic that we used to have in our patched jemalloc that is converted to use mallctl instead of jemalloc's internals?

Yes with adjustment to differences between the API's

Copy link
Member

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get it all now. It seems pretty consistent to the previous behavior and generally all seems right to me.

src/allocator_defrag.c Show resolved Hide resolved
src/allocator_defrag.c Show resolved Hide resolved
je_defrag_bstats stat; ///< Defragmentation statistics for the bin.
} je_busage;

/// @brief Struct representing the latest usage information across all bins.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We typically don't use either /// for comments nor do we use doxygen style comments.

Suggested change
/// @brief Struct representing the latest usage information across all bins.
/* Struct representing the latest usage information across all bins. */

src/allocator_defrag.c Show resolved Hide resolved
* 4. Set the `defrag_supported` flag to indicate that defragmentation is enabled.
*
* Note: This function must be called before using any other defragmentation-related functionality.
* It should be called during the initialization phase of the application or module that uses the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you actually referring to modules here? I think modules are loaded after you call the init from the core.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

module as in "code component" not in the valkey meaning, will remove

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed to code

Comment on lines +56 to +57
unsigned long nmalloc; ///< Number of malloc operations (unused in this implementation).
unsigned long ndealloc; ///< Number of dealloc operations (unused in this implementation).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we drop them if they are unused?

Comment on lines +172 to +174
"[%d][%lu]::"
"nregs:%lu,nslabs:%lu,nnonfull:%lu,"
"hit_rate:%lu%%,hit:%lu,miss:%lu,nmalloc:%lu,ndealloc:%lu\r\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is even weirder syntax. I don't think all this low level defrag information should be printed by default. Also, do we expect anyone to really be able to use this information? Exposing data is a one way door.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This information is at the bin level, we don't have the equivalent today and it is useful for someone debugging defrag or if they want to have advance enable\disable defrag logic. I agree it's a lot of very low level details, if we remove it from info all and only output if info defrag is queried, would this make more sense? or do you think we should use other mechanism to expose this info in case of need?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zuiderkwast @madolson would appreciate your thoughts on the info defrag section. If you agree it is useful, how would you suggest to combine each bin's info? I don't want to have a line for each stat member of each bin...

unsigned nbins = arena_bin_conf.nbins;
if (nbins > 0) {
info = sdscatprintf(info,
"jemalloc_quantom:%d\r\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't specifically related to the defragmentation logic, not sure why it's in that section. There is a separate section, mem_allocator which maybe it should go instead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, and correct the spelling of "quantum"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack on the spelling, regarding the info, the issue is that today the defrag information is scattered. We have some in the stats section (the defrag activity) but memory and fragmentation status are in different section. I think it's logical to have the information in one place.

busage = &usage_latest.bins_usage[j];
info = sdscatprintf(info,
"[%d][%lu]::"
"nregs:%lu,nslabs:%lu,nnonfull:%lu,"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"nregs:%lu,nslabs:%lu,nnonfull:%lu,"
"regions:%lu,slabs:%lu,nonfull:%lu,"

Would rather have more useful names. We also seem to omit "number" info fields. "numbe_of_x" would also be fine.

if (nbins > 0) {
info = sdscatprintf(info,
"jemalloc_quantom:%d\r\n"
"hit_ratio:%lu%%,hits:%lu,misses:%lu\r\n"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"hit_ratio:%lu%%,hits:%lu,misses:%lu\r\n"
"defrag_hit_ratio:%lu%%\r\ndefrag_hits:%lu,defrag_misses:%lu\r\n"

This is a weird info syntax. We normally don't have multiple pairs on the same line separated by commands. I think each field should get its own line. Each info field needs to also stand on its own, some clients like python decompose each element into a key-value pair, and hit_ratio isn't a very descriptive name.

How is this from the other metrics like ative_defrag_hits ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to change exiting defrag metrics tracked from defrag.c file. These (existing metrics) could be, potentially, for non jemalloc defrag and have upper level logic, like keys defragged. I can think of nice other metrics at the defrag.c level, like have information about types of keys that where defragged. Here this information is from the perspective of the allocator, this specific metric does correspond to the active_defrag_hits in existing code. @madolson Does this make sense? or do you still think we should consolidate?

@madolson madolson added major-decision-approved Major decision approved by TSC team and removed major-decision-pending Major decision pending by TSC team labels Oct 14, 2024
@madolson
Copy link
Member

This doesn't seem like a major decision, so removing the tag. If we were to de-vendor it, we would want to see that in a separate PR.

@zvi-code
Copy link
Contributor Author

Will post a PR with fixes to existing comments later this week

@zvi-code
Copy link
Contributor Author

zvi-code commented Nov 5, 2024

forced push by mistake

@zvi-code zvi-code reopened this Nov 5, 2024
@zvi-code zvi-code force-pushed the align_defrag_vanila branch 2 times, most recently from 73535fe to 955da04 Compare November 5, 2024 13:52
@zvi-code
Copy link
Contributor Author

zvi-code commented Nov 5, 2024

@madolson , @zuiderkwast , @ranshid, I created a new PR with current code and a separate commit for CR fixes. Seems I used wrong update method (rebased) and I could not align the branch without encountering issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
major-decision-approved Major decision approved by TSC team
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

6 participants