
[CHANGED] refactored message delivery for future SubscribePullAsync #772

Closed
levb wants to merge 33 commits

Conversation

levb (Collaborator) commented Jul 19, 2024

  • Consolidated the deliverMsg logic into a single function (see dispatch.c).
  • Added nats_OpenWithConfig for thread-pool control, and in preparation for JSON-ifying the default options (a separate request from @jnmoyne); see the sketch after this list.
  • [EXPERIMENTAL, not used/tested] Allow using a thread pool for reply delivery.
  • Code organization: broke up nats.c into ./glib/*.c.
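
For readers not following the branch, here is a rough usage sketch of the new open-with-config entry point. The natsClientConfig field names below are placeholders, not the final API, which was still being settled in this PR:

// Hypothetical usage sketch only; the field names are placeholders, not
// the final nats.c API.
#include <stdbool.h>
#include <nats/nats.h>

int main(void)
{
    natsClientConfig cfg = {
        .ThreadPoolMax       = 4,    // assumed field: cap on shared delivery threads
        .DefaultToThreadPool = true, // assumed field: run subscription callbacks on the pool
    };

    // Initialize the library with the config instead of the default nats_Open().
    if (nats_OpenWithConfig(&cfg) != NATS_OK)
        return 1;

    // ... connect, subscribe, publish as usual ...

    nats_Close();
    return 0;
}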

src/conn.c (review thread resolved)

codecov bot commented Jul 19, 2024

Codecov Report

Attention: Patch coverage is 74.89025% with 286 lines in your changes missing coverage. Please review.

Please upload report for BASE (lev-bench2@1e675d6). Learn more about missing BASE report.

Files | Patch % | Lines
src/glib/glib.c | 69.53% | 21 Missing and 57 partials ⚠️
src/glib/glib_last_error.c | 64.82% | 18 Missing and 33 partials ⚠️
src/nats.c | 56.97% | 28 Missing and 9 partials ⚠️
src/sub.c | 84.37% | 10 Missing and 25 partials ⚠️
src/glib/glib_timer.c | 81.50% | 3 Missing and 24 partials ⚠️
src/glib/glib_dispatch_pool.c | 70.12% | 7 Missing and 16 partials ⚠️
src/glib/glib_async_cb.c | 82.81% | 5 Missing and 6 partials ⚠️
src/glib/glib_gc.c | 78.57% | 2 Missing and 4 partials ⚠️
src/js.c | 81.48% | 1 Missing and 4 partials ⚠️
src/dispatch.c | 80.00% | 0 Missing and 3 partials ⚠️
... and 6 more
Additional details and impacted files
@@              Coverage Diff              @@
##             lev-bench2     #772   +/-   ##
=============================================
  Coverage              ?   68.88%           
=============================================
  Files                 ?       49           
  Lines                 ?    15210           
  Branches              ?     3136           
=============================================
  Hits                  ?    10478           
  Misses                ?     1688           
  Partials              ?     3044           


src/conn.c (outdated; review thread resolved)
src/glib/glib.c (review thread resolved)
@@ -0,0 +1,491 @@
// Copyright 2015-2024 The NATS Authors
levb (PR author):

NS: consider removing glib_ prefix from most filenames, and using more intention-revealing names (e.g. glib_async_cb -> async_error_cb?)

@@ -840,19 +842,10 @@ _timeoutPubAsync(natsTimer *t, void *closure)
m->sub = js->rsub;
natsMsg_setTimeout(m);

// Best attempt, ignore NATS_SLOW_CONSUMER errors which may be returned
levb (PR author):

I did not test the changes in this path beyond passing the existing tests.

Review threads resolved on: src/js.c, src/nats.h, src/natstime.c, test/test.c (outdated).
levb changed the title from "WIP: [CHANGED] refactored message delivery in preparation for SubscribePullAsync" to "[CHANGED] refactored message delivery for future SubscribePullAsync" on Jul 19, 2024
@levb levb requested a review from kozlovic July 19, 2024 13:50
@levb levb requested a review from mtmk July 19, 2024 13:50
@levb levb marked this pull request as ready for review July 19, 2024 15:02
kozlovic (Member) left a comment:

That's a big one! I will likely have to take another look, but this is a first pass. I think we need to run some perf tests to make sure that we are not degrading performance too much.

src/conn.c (outdated; review thread resolved)
src/conn.c (outdated), excerpt:
}
else
if (s == NATS_OK)
s = natsSub_enqueueMsg(sub, msg);
kozlovic (Member):

Have you tested the performance implication of invoking a function here, as opposed to adding to the list and notifying inline as it used to be? It does not have to be super elaborate, but you could run nats-pub and nats-sub and see. If there is a major difference, you should see it right away.
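
(For reference, the change being asked about moves the per-message "append to the list and notify the delivery thread" work behind a helper call. A minimal self-contained model of such a helper, with illustrative names that are not the nats.c code, looks like this:)

// Minimal model of an "enqueue and notify" helper, for illustration only;
// this is not the nats.c implementation, just the shape of the per-message
// work that now sits behind a function call.
#include <pthread.h>
#include <stddef.h>

typedef struct msgNode { struct msgNode *next; } msgNode;

typedef struct msgQueue
{
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    msgNode         *head;
    msgNode         *tail;
    int             count;
} msgQueue;

static void enqueueMsg(msgQueue *q, msgNode *m)
{
    pthread_mutex_lock(&q->lock);
    m->next = NULL;
    if (q->tail != NULL)
        q->tail->next = m;
    else
        q->head = m;
    q->tail = m;
    // Wake the delivery thread only on the empty -> non-empty transition.
    if (q->count++ == 0)
        pthread_cond_signal(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

In this model the per-message cost is dominated by the lock and the occasional signal, so the extra function call itself should be in the noise; the measurements later in the thread bear on exactly that question.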

levb (PR author):

I did some cursory checks, but I was planning to focus on it this weekend, on the final 'PullSubscribeAsync' branch, once I finish my struggle with the missing heartbeats. Any other ideas beyond [js-]pub->sub'ing a million messages? I will add the comparison here. Also on this: did you have any tooling for CPU/mem profiling, and is there a built-in thread counter that I missed?
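
(A minimal single-process version of the pub->sub million-message check mentioned above, using the public nats.c API against a locally running nats-server; the subject name and message count are arbitrary, and in this naive setup the default pending limits can drop messages, so the wait is bounded:)

// Rough throughput check: publish N small messages and time how long it
// takes the async subscription callback to see them. This is a sketch of
// the measurement idea, not the benchmark used in CI.
#include <stdio.h>
#include <stdatomic.h>
#include <nats/nats.h>

#define TOTAL 1000000   // bump this up for a more representative run

static atomic_long received = 0;

static void
onMsg(natsConnection *nc, natsSubscription *sub, natsMsg *msg, void *closure)
{
    (void) nc; (void) sub; (void) closure;
    atomic_fetch_add(&received, 1);
    natsMsg_Destroy(msg);
}

int main(void)
{
    natsConnection   *nc    = NULL;
    natsSubscription *sub   = NULL;
    natsStatus       s;
    int64_t          start;

    s = natsConnection_ConnectTo(&nc, NATS_DEFAULT_URL);
    if (s == NATS_OK)
        s = natsConnection_Subscribe(&sub, nc, "bench", onMsg, NULL);

    start = nats_Now();
    for (int i = 0; (s == NATS_OK) && (i < TOTAL); i++)
        s = natsConnection_PublishString(nc, "bench", "hello");
    if (s == NATS_OK)
        s = natsConnection_Flush(nc);

    // Bounded wait: with default pending limits some messages may be
    // dropped as slow-consumer, so do not spin forever.
    for (int waited = 0; (s == NATS_OK) && (atomic_load(&received) < TOTAL) && (waited < 30000); waited += 10)
        nats_Sleep(10);

    printf("received %ld/%d in %lld ms\n",
           (long) atomic_load(&received), TOTAL, (long long) (nats_Now() - start));

    natsSubscription_Destroy(sub);
    natsConnection_Destroy(nc);
    nats_Close();
    return (s == NATS_OK) ? 0 : 1;
}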

kozlovic (Member):

From my experience, you need more than 1M messages, because otherwise the run may be a bit too fast to be representative. In early dev, I was using callgrind to do profiling.

Re: "is there a built-in thread counter" — not sure what you are asking.

levb (PR author) commented Jul 22, 2024:

[benchmark chart comparing the two branches; DISPATCH is shown in blue]

So there is a slowdown, if a minimal one. Strangely, I see it more on my machine (M2), with a particularly large difference at 5 threads (500 subs). Increasing the number of messages 10x did not make a meaningful difference. Investigating.

levb (PR author):

@kozlovic With the new benchmark changes in #774, I will be posting updates soon; please stand by.

levb (PR author):

@kozlovic after some struggles I had to roll back significantly, going back to the original 2 deliverMsg functions.

  • My locking changes generally outperformed with a large number of threads/subs, but underperformed with a small number, which is the dominant use case.
  • The bool -> unsigned :1 change was, surprisingly to me, slower by 1% to 3% on x86, while barely noticeable on my M2. It also didn't affect the direct-injection test much, but it showed a consistent slowdown on the Large and especially the Small (no flush) tests (see the sketch below).
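
(A hypothetical illustration of the layout change in the second bullet; the struct and field names are placeholders, not the actual nats.c structures:)

#include <stdbool.h>

typedef struct flagsAsBools
{
    bool closed;            // one byte per flag; a store is a plain byte write
    bool draining;
    bool connected;
} flagsAsBools;

typedef struct flagsAsBitfields
{
    unsigned closed    : 1; // packed into one word; a store becomes a
    unsigned draining  : 1; // read-modify-write (load, mask, store), which
    unsigned connected : 1; // is one plausible cause of the small x86 regression
} flagsAsBitfields;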

I left the two deliverMsg functions in their places for now to keep the diffs comparable; I will move them to dispatch.c in the next PR.

My trickery with changing the PR base to get the numbers compared here did not work, but here is the summary from my fork:

== BenchSubscribeAsync_Small ==
Best: -3.88%
Average: -4.58%
Worst: -6.61%
...
== BenchSubscribeAsync_Inject ==
Best: -7.36%
Average: -3.22%
Worst: -0.43%
...
== BenchSubscribeAsync_InjectSlow ==
Best: -0.35%
Average: -0.02%
Worst: 0.15%
...
== BenchSubscribeAsync_Large ==
Best: -0.03%
Average: -0.49%
Worst: 1.90%

kozlovic (Member):

Can you say whether negative numbers here mean better or worse? That is, does -3.88% mean the new code does 3.88% fewer messages per second, i.e. that it is slower than main?

levb (PR author):

Negative is better; these are "sums of averages of all runs", not directly proportionate to the TPM. I pasted from here: https://github.com/levb/nats.c/actions/runs/10243884983/job/28336496607?pr=102; there you can see both runs' results and the comparison details (the comparison does not display records with less than a 5% difference).

(Unfortunately, rebasing this PR did not change GitHub's internal "base" reference from main; I'd need to cut a new PR to run the comparison here.)

Outdated review threads resolved on: src/dispatch.c (×3), src/nats.h, src/opts.c, src/sub.c (×3).
Base automatically changed from lev-ci-logging-threadsanitizer to main July 22, 2024 22:00
@levb levb changed the base branch from main to lev-bench2 August 5, 2024 07:02

kozlovic commented Aug 5, 2024

@levb I have to say that your PRs are becoming hard to follow for review. You make heavy use of merges and separate PRs (and I understand why), but specifically for this PR I am not sure what to review, and I may have to do a full review again.

When branches don't live for too long, I think a rebase makes more sense. Say you are submitting a PR and, during the review process, we agree that something else needs to be done first. Maybe open a PR that has the new change; once it is committed, update (with a rebase) the original PR so that the new code "sits on top" of main + the intermediate change.

I think this one is different because you had to roll back some changes, but still.

I will wait on your comment to know what I have to review, and it is ok if the answer is "everything" :-)


levb commented Aug 5, 2024

@kozlovic Yes, the entire PR, please. If it makes it easier for you, I'd be very happy to replace this with a new (single-commit) PR now; it would also have the proper base and run the benchmark properly.

Apologies for so many changes; the performance benchmarking/investigation went down too many rabbit holes.

@levb levb requested a review from kozlovic August 5, 2024 17:12

kozlovic commented Aug 5, 2024

Yes the entire PR, please. If it makes it easier for you I'd be very happy to replace this with a new (single commit) PR now

If it is the entire PR, it does not really matter whether all those commits are there, because I need to review everything anyway.

it would also have the proper base and run the benchmark properly.

Then maybe it is better to do so, if you don't mind. I am busy with a NATS Server issue investigation at the moment, so I may not be able to review right away anyway; that gives you time to "clean it up". Ok?


kozlovic commented Aug 5, 2024

@levb Just to be clear: you are going to post a new PR instead of this one, correct?

@levb levb closed this Aug 6, 2024
@levb levb deleted the lev-dispatchers branch October 1, 2024 15:37