Expire dedups on tick #297

Draft: 4 commits to be merged into base branch main


the-mikedavis commented Sep 23, 2024

This change moves the work we do in khepri_machine:drop_expired_dedups/2 out of apply/3 and into the handler for Ra's periodic tick aux effect.

The commit message has more details, but in short: the maps:filter/2 call within drop_expired_dedups/2 can become expensive when many transactions are handled before their dedup acks are received. Instead, the handler for the tick effect can check whether any dedups have expired and, if so, expire them by committing a new #expire_dedups{} command.

Also included is a change to use the new handle_aux/5 API introduced in Ra 2.10, but that change shouldn't affect behavior.

the-mikedavis added the enhancement (New feature or request) label Sep 23, 2024
the-mikedavis added this to the v0.17.0 milestone Sep 23, 2024
the-mikedavis self-assigned this Sep 23, 2024

codecov bot commented Sep 23, 2024

Codecov Report

Attention: Patch coverage is 96.29630% with 1 line in your changes missing coverage. Please review.

Project coverage is 90.25%. Comparing base (caf1ae3) to head (504ffd9).
Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/khepri_machine.erl | 96.29% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #297      +/-   ##
==========================================
+ Coverage   89.69%   90.25%   +0.56%     
==========================================
  Files          22       22              
  Lines        3259     3642     +383     
==========================================
+ Hits         2923     3287     +364     
- Misses        336      355      +19     
| Flag | Coverage Δ |
|---|---|
| erlang-25 | 89.53% <96.29%> (+0.86%) ⬆️ |
| erlang-26 | 89.95% <92.59%> (+0.65%) ⬆️ |
| erlang-27 | 90.11% <96.29%> (+0.57%) ⬆️ |
| os-ubuntu-latest | 90.06% <92.59%> (+0.37%) ⬆️ |
| os-windows-latest | 90.25% <96.29%> (+0.83%) ⬆️ |

@the-mikedavis

I think it would be easier to start with the other optimization (which must also bump the machine version). This PR should depend on rabbitmq/ra#474, so I'll mark it as a draft for now. We should be able to come back to it after the other PR is merged.

the-mikedavis marked this pull request as draft September 24, 2024 20:00
Ra 2.10.1 introduced a new `handle_aux/5` callback that takes the place
of `handle_aux/6`. Instead of passing the log state and machine state
separately, this API passes a single opaque `ra_aux:internal_state()`
argument. Helper functions in the `ra_aux` module read the log or
machine state out of it.

This commit only updates to use the new callback: there should be no
functional change from this commit.
This will be used in the child commit to write a test which sets a low
`tick_timeout` configuration.
As part of the `khepri_machine:post_apply/2` helper, which runs after
any command is applied, we use `maps:filter/2` to eliminate any entries
in the `dedups` field of the machine state which are expired according
to the command's timestamp. This `drop_expired_dedups/2` step becomes a
bottleneck, though, when a Khepri store handles many transactions at
once.

For example in RabbitMQ, queue deletion is done with a transaction
submitted from each queue process. When many (for example five thousand)
queues are deleted at once, `drop_expired_dedups/2`, and specifically
the `maps:filter/2` call within it, becomes a noticeable chunk of a
flamegraph.

`maps:filter/2` is slow here because the BIF used to implement it
collects a list of all key-value pairs for which the predicate returns
true, sorts it, and then creates a new hashmap from it. We are unlikely
to expire any given dedup when handling a command, especially when
submitting many commands at once, so we end up essentially calling
`maps:from_list(maps:to_list(Dedups))`.

It is a small improvement to replace `maps:filter/2` with `maps:fold/3`,
a case expression and `maps:remove/2` (reflecting that we will always
fold over the map but rarely remove elements) but it is not enough to
eliminate this rather large chunk of the flamegraph.
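
The `maps:fold/3` variant described above could look roughly like the following. This is only a sketch: the module name is hypothetical, and it assumes that `Dedups` maps a command reference to a `{Reply, Expiry}` tuple and that an entry counts as expired when `Expiry` is lower than the command's timestamp.

```erlang
-module(dedup_fold_sketch).
-export([drop_expired/2]).

%% Assumption: Dedups is #{CommandRef => {Reply, Expiry}} and an entry is
%% expired when Expiry < Timestamp.
drop_expired(Timestamp, Dedups) ->
    maps:fold(
      fun(CommandRef, {_Reply, Expiry}, Acc) ->
              case Expiry < Timestamp of
                  %% Rare case: remove the expired entry from the map.
                  true  -> maps:remove(CommandRef, Acc);
                  %% Common case: keep the accumulator untouched.
                  false -> Acc
              end
      end, Dedups, Dedups).
```

When nothing is expired the fold returns the original map without building a new one, but it still visits every entry on every command.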

Instead, the solution in this commit is to move the detection of expired
dedups to the "tick" aux effect. Ra emits this effect periodically:
every 1s by default, but this is configurable. By moving this detection
to `handle_aux/5` we remove it from the "hot path" of `apply/3`. Even
better, we are unlikely to actually need to expire any dedups. In the
case of queue deletion in RabbitMQ, for example, we are likely to handle
all in-flight transactions and the subsequent `#dedup_ack{}` commands
before even evaluating a `tick` effect. If we do handle a `tick` effect
while transactions are in flight, their dedups are unlikely to have
expired yet, so we only scan the map with the new
`khepri_utils:maps_any/2` helper rather than rebuilding it.
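
A helper in the spirit of `khepri_utils:maps_any/2` could be sketched as below. The module name and the body are assumptions for illustration, not the code from this PR: it walks the map with `maps:iterator/1` and `maps:next/1` and stops at the first entry for which the predicate returns `true`.

```erlang
-module(maps_any_sketch).
-export([maps_any/2]).

%% Returns true if Pred(Key, Value) is true for at least one entry.
%% No intermediate list or new map is built.
maps_any(Pred, Map) when is_function(Pred, 2), is_map(Map) ->
    any_next(Pred, maps:next(maps:iterator(Map))).

any_next(_Pred, none) ->
    %% Reached the end of the map without a match.
    false;
any_next(Pred, {Key, Value, Iterator}) ->
    case Pred(Key, Value) of
        true  -> true;
        false -> any_next(Pred, maps:next(Iterator))
    end.
```

In the common case where no dedup is expired this is a read-only pass over the map, which is the cost the commit message accepts on each tick.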

If we need to expire dedups, the aux handler for the `tick` effect
appends a new `#expire_dedups{}` command to the log. Applying that
command does the same work that `drop_expired_dedups/2` did prior to
this commit.
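
The two halves of that flow can be sketched as plain functions. Everything here is illustrative rather than taken from the PR: the module and function names are hypothetical, the record is declared locally with no fields, `Dedups` is assumed to be `#{CommandRef => {Reply, Expiry}}`, and the `{append, Command}` tuple stands in for the effect the real `handle_aux/5` clause would return to Ra.

```erlang
-module(expire_on_tick_sketch).
-export([tick_effects/2, apply_expire/2]).

%% Hypothetical, field-less stand-in for the command added by this commit.
-record(expire_dedups, {}).

%% Tick side: decide whether an expiry command is worth appending.
tick_effects(Timestamp, Dedups) ->
    case any_expired(Timestamp, maps:next(maps:iterator(Dedups))) of
        true  -> [{append, #expire_dedups{}}];
        false -> []
    end.

any_expired(_Timestamp, none) ->
    false;
any_expired(Timestamp, {_Ref, {_Reply, Expiry}, Iterator}) ->
    case Expiry < Timestamp of
        true  -> true;
        false -> any_expired(Timestamp, maps:next(Iterator))
    end.

%% Apply side: the filtering only runs when the command is applied.
apply_expire(Timestamp, Dedups) ->
    maps:filter(fun(_Ref, {_Reply, Expiry}) -> Expiry >= Timestamp end,
                Dedups).
```

With this split, the `maps:filter/2` cost is paid only after a tick has found at least one expired entry, instead of after every command.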