-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CHANGED] refactored message delivery for future SubscribePullAsync #772
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## lev-bench2 #772 +/- ##
=============================================
Coverage ? 68.88%
=============================================
Files ? 49
Lines ? 15210
Branches ? 3136
=============================================
Hits ? 10478
Misses ? 1688
Partials ? 3044 ☔ View full report in Codecov by Sentry. |
@@ -0,0 +1,491 @@ | |||
// Copyright 2015-2024 The NATS Authors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NS: consider removing glib_
prefix from most filenames, and using more intention-revealing names (e.g. glib_async_cb -> async_error_cb?)
@@ -840,19 +842,10 @@ _timeoutPubAsync(natsTimer *t, void *closure) | |||
m->sub = js->rsub; | |||
natsMsg_setTimeout(m); | |||
|
|||
// Best attempt, ignore NATS_SLOW_CONSUMER errors which may be returned |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not test the changes in this path beyond passing the existing tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a big one! Will likely have to have another look, but this is a first pass. I think we need to run some perf test to make sure that we are not degrading performance too much.
src/conn.c
Outdated
} | ||
else | ||
if (s == NATS_OK) | ||
s = natsSub_enqueueMsg(sub, msg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you tested the performance implication to invoke a function here as opposed to add to the list/notify as it used to be? Does not have to be super elaborate, but you could have nats-sub and nats-pub and see. If there is a major difference, you should see right away.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did some cursory checks, but was planning to focus on it weekend, on the final 'PullSubscribeAsync' branch once I finish my struggle with the missing heartbeats. Any other ideas beyond [js-]pub->sub a million messages? I will add the comparison here. Also on this, did you have any tooling for CPU/mem profiling, and is there a built-in thread counter that I missed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my experience, you need more than 1M message, because otherwise the run may be a bit too fast to be representative. In early dev, I was using callgrind to do profiling.
re: is there a built-in thread counter, not sure what you are asking.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kozlovic after some struggles I had to roll back significantly, going back to the original 2 deliverMsg
functions.
- My locking changes generally outperformed with a large number of threads/subs, but underperformed with a small number which is the dominant use case.
- The
bool
->unsigned :1
change was surprisingly (for me) slower -1% to -3% on x86, while barely noticeable on my M2. It also didn't affect the direct injection test much, but was showing a consistent slowness onLarge
and especiallySmall
(no flush) tests.
I left the 2 deliverMsg
functions in their places for now to compare the diffs, will move them to dispatch.c
in the next PR.
My trickery with changing the PR base to get the numbers compared here did not work, but here is the summary from my fork,
== BenchSubscribeAsync_Small ==
Best: -3.88%
Average: -4.58%
Worst: -6.61%
...
== BenchSubscribeAsync_Inject ==
Best: -7.36%
Average: -3.22%
Worst: -0.43%
...
== BenchSubscribeAsync_InjectSlow ==
Best: -0.35%
Average: -0.02%
Worst: 0.15%
...
== BenchSubscribeAsync_Large ==
Best: -0.03%
Average: -0.49%
Worst: 1.90%
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you say if negative numbers here mean better or worse? Meaning is it that the new code is -3.88%
number of messages per second, meaning it is slower than main?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Negative is better; these are "sums of averages of all runs", not directly proportionate to the TPM. I pasted from here: https://github.com/levb/nats.c/actions/runs/10243884983/job/28336496607?pr=102, you can see both run's results and the comparison details (the comparison does not display records with less than 5% difference).
(Unfortunately rebasing this PR did not change the github internal "base" reference from main
, I'd need to cut a new PR to run the comparison here)
@levb I have to say that your PR reviews become hard to follow. You make use a lot of merge and separate PRs (which I understand why), but specifically for this PR, I am not sure what to review, and may have to do full review again. When branches don't live for too long, I think a rebase makes more sense. Say that you are submitting a PR and during the review process, we agree that something else needs to be done before. Maybe open a PR that has the new change, once committed, update (with a rebase) the original PR so that the new code "sits on top" of the main + the intermediate change. I think this one is different because you had to rollback some changes, but still. I will wait on your comment to know what I have to review, and it is ok if the answer is "everything" :-) |
@kozlovic Yes the entire PR, please. If it makes it easier for you I'd be very happy to replace this with a new (single commit) PR now, it would also have the proper base and run the benchmark properly. Apologies for so many changes, the performance benchmark/investigation went down too many rabbit holes. |
If it is the entire PR, it does not really matter to not have all those commits because I need to review everything.
Then maybe it is better to do so if you don't mind. I am busy with a NATS Server issue investigation at the moment, so I may not be able to review right away anyway, so that give you time to "clean it up". Ok? |
@levb Just to be clear: you are going to post a new PR instead of this one, correct? |
deliverMsg
logic into a single function (seedispatch.c
)nats_OpenWithConfig
for thread pool control, and in preparation for JSON-ifying default options, a separate request from @jnmoynenats.c
into ./glib/*.c