Alloc count regression in #8 #14

Closed

tbg opened this issue Dec 22, 2022 · 5 comments

tbg commented Dec 22, 2022

Looks like #8 gave us a regression in alloc count.

$ git checkout 65a0bf3
$ benchdiff --old HEAD~1 -r BenchmarkRawNode -c 5 .
checking out '91eb751'
building benchmark binaries for '91eb751' 1/1 |
test binaries already exist for '65a0bf3'; skipping build

  pkg=1/1 iter=5/5 go.etcd.io/raft/v3 -

name                     old time/op        new time/op        delta
RawNode/two-voters-10          1.14µs ± 1%        1.39µs ± 3%  +22.47%  (p=0.008 n=5+5)
RawNode/single-voter-10         685ns ± 2%         918ns ± 2%  +33.99%  (p=0.008 n=5+5)

name                     old firstIndex/op  new firstIndex/op  delta
RawNode/single-voter-10          2.00 ± 0%          1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
RawNode/two-voters-10            6.00 ± 0%          3.00 ± 0%  -50.00%  (p=0.008 n=5+5)

name                     old lastIndex/op   new lastIndex/op   delta
RawNode/single-voter-10          2.00 ± 0%          2.00 ± 0%     ~     (all equal)
RawNode/two-voters-10            2.00 ± 0%          2.00 ± 0%     ~     (all equal)

name                     old ready/op       new ready/op       delta
RawNode/single-voter-10          2.00 ± 0%          2.00 ± 0%     ~     (all equal)
RawNode/two-voters-10            2.00 ± 0%          2.00 ± 0%     ~     (all equal)

name                     old term/op        new term/op        delta
RawNode/two-voters-10            1.00 ± 0%          1.00 ± 0%     ~     (all equal)
RawNode/single-voter-10          0.00 ± 0%          0.00 ± 0%  +33.33%  (p=0.008 n=5+5)

name                     old alloc/op       new alloc/op       delta
RawNode/single-voter-10          415B ± 1%          534B ± 0%  +28.62%  (p=0.016 n=5+4)
RawNode/two-voters-10            637B ± 0%          888B ± 1%  +39.43%  (p=0.008 n=5+5)

name                     old allocs/op      new allocs/op      delta
RawNode/two-voters-10            7.00 ± 0%          8.00 ± 0%  +14.29%  (p=0.008 n=5+5)
RawNode/single-voter-10          5.00 ± 0%          6.00 ± 0%  +20.00%  (p=0.008 n=5+5)

tbg commented Dec 22, 2022

cc @nvanbenschoten

nvanbenschoten commented Dec 22, 2022

Thanks @tbg for filing this.

The alloc/op (allocation size) metric regressed, but that was more than compensated for by the improvement from 55e6a83, so we did not see a regression in alloc/op across this work as a whole. That was part of the reason why we landed the two together. To some extent, the same applies to the time/op metric.

➜ benchdiff --new 55e6a83 -r BenchmarkRawNode -c 10 .
test binaries already exist for '07da8bd'; skipping build
test binaries already exist for '55e6a83'; skipping build

  pkg=1/1 iter=10/10 go.etcd.io/etcd/raft/v3 |

name                     old time/op        new time/op        delta
RawNode/two-voters-10          1.30µs ± 1%        1.13µs ± 2%  -12.97%  (p=0.000 n=10+10)
RawNode/single-voter-10         732ns ± 2%         716ns ± 1%   -2.17%  (p=0.000 n=10+10)

name                     old firstIndex/op  new firstIndex/op  delta
RawNode/single-voter-10          5.00 ± 0%          5.00 ± 0%     ~     (all equal)
RawNode/two-voters-10            9.00 ± 0%          9.00 ± 0%     ~     (all equal)

name                     old lastIndex/op   new lastIndex/op   delta
RawNode/single-voter-10          2.00 ± 0%          2.00 ± 0%     ~     (all equal)
RawNode/two-voters-10            2.00 ± 0%          2.00 ± 0%     ~     (all equal)

name                     old ready/op       new ready/op       delta
RawNode/single-voter-10          2.00 ± 0%          2.00 ± 0%     ~     (all equal)
RawNode/two-voters-10            2.00 ± 0%          2.00 ± 0%     ~     (all equal)

name                     old term/op        new term/op        delta
RawNode/single-voter-10          0.00 ± 0%          0.00 ± 0%     ~     (all equal)
RawNode/two-voters-10            1.00 ± 0%          1.00 ± 0%     ~     (all equal)

name                     old alloc/op       new alloc/op       delta
RawNode/two-voters-10            965B ± 0%          637B ± 0%  -33.96%  (p=0.000 n=10+10)
RawNode/single-voter-10          428B ± 0%          420B ± 1%   -1.94%  (p=0.000 n=8+10)

name                     old allocs/op      new allocs/op      delta
RawNode/single-voter-10          5.00 ± 0%          5.00 ± 0%     ~     (all equal)
RawNode/two-voters-10            7.00 ± 0%          7.00 ± 0%     ~     (all equal)

When I looked into the allocs/op regression earlier (with 2c2960c), I found that it was due to the introduction of the msgsAfterAppend field, which splits outgoing messages across two slices. Since each slice needs its own backing array, we now pay two allocs for this bookkeeping instead of one.

raft/raft.go, lines 333 to 348 at 65a0bf3:

// msgs contains the list of messages that should be sent out immediately to
// other nodes.
//
// Messages in this list must target other nodes.
msgs []pb.Message
// msgsAfterAppend contains the list of messages that should be sent after
// the accumulated unstable state (e.g. term, vote, []entry, and snapshot)
// has been persisted to durable storage. This includes waiting for any
// unstable state that is already in the process of being persisted (i.e.
// has already been handed out in a prior Ready struct) to complete.
//
// Messages in this list may target other nodes or may target this node.
//
// Messages in this list have the type MsgAppResp, MsgVoteResp, or
// MsgPreVoteResp. See the comment in raft.send for details.
msgsAfterAppend []pb.Message
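
To illustrate the cost, here is a minimal sketch (hypothetical helper name, not the actual code): every Ready takes ownership of both slices and resets them to nil, so the next batch of messages pays one fresh backing-array allocation per slice.

// Hypothetical sketch: handing both slices to a Ready and nil-ing them out
// means the next r.send grows each slice from scratch, i.e. two allocs of
// bookkeeping per Ready iteration where there used to be one.
func takeMessages(r *raft) (msgs, msgsAfterAppend []pb.Message) {
	msgs, msgsAfterAppend = r.msgs, r.msgsAfterAppend
	r.msgs, r.msgsAfterAppend = nil, nil
	return msgs, msgsAfterAppend
}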

I think at some point we'll want to find a way to avoid reallocating either of these slices per Ready iteration in the common case, but that runs into questions of ownership because the slices can be handed back to the app. Do we want to expand the library's API for this? For the non-async writes logic, we can probably do something simpler to recycle the msgsAfterAppend slice across Ready iterations within RawNode, because the slice is never handed to the user of the library.
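
A sketch of what that simpler recycling could look like (hypothetical, and assuming RawNode fully consumes the slice before the next Ready when AsyncStorageWrites is off): truncate instead of nil-ing, so the backing array is reused.

// Hypothetical sketch, not the current code: once RawNode has handled the
// msgsAfterAppend messages internally (AsyncStorageWrites=false path), keep
// the backing array around by truncating to length zero rather than
// dropping it.
func (rn *RawNode) recycleMsgsAfterAppend() {
	for i := range rn.raft.msgsAfterAppend {
		rn.raft.msgsAfterAppend[i] = pb.Message{} // allow GC of payloads
	}
	rn.raft.msgsAfterAppend = rn.raft.msgsAfterAppend[:0] // keep capacity
}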

Do you think it's worth it to go part way and shave a heap allocation off the AsyncStorageWrites = false case (at the expense of a bit of slice re-use logic)? If so, I can look into that.


tbg commented Dec 22, 2022

> Do you think it's worth it to go part way and shave a heap allocation off the AsyncStorageWrites = false case (at the expense of a bit of slice re-use logic)? If so, I can look into that.

No, I don't think so. Thanks for the explanation, I had forgotten about 55e6a83.


tbg commented Dec 22, 2022

> I think at some point we'll want to find a way to avoid reallocating either of these slices per Ready iteration in the common case, but that runs into questions of ownership because the slices can be handed back to the app. Do we want to expand the library's API for this? For the non-async writes logic, we can probably do something simpler to recycle the msgsAfterAppend slice across Ready iterations within RawNode, because the slice is never handed to the user of the library.

Yeah, (as you know) these kinds of ownership questions keep coming up on our side of the fence (CRDB) as well, most recently here.

Maybe a model that could work is a static method in raft that takes a *[]raftpb.Entry, zeroes it, and puts the slice into a sync.Pool that RawNode pulls from. We could then say that raft does not claim ownership over any data it has handed out, but that the application is free to optimize allocations by returning slices to the pool once it no longer references them. This lets those who want to optimize do so, while sparing those who don't need that level of performance from pool-reuse bugs.
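
As a rough sketch of that model (hypothetical names; nothing like this exists in the library today):

// Hypothetical sketch of the opt-in pooling model described above.
var entryPool = sync.Pool{
	New: func() any { s := make([]raftpb.Entry, 0, 8); return &s },
}

// ReleaseEntries zeroes a slice previously handed out by raft and returns
// it to the pool. Calling it is strictly optional: applications that don't
// need the allocation savings can simply drop the slice and never risk a
// pool-reuse bug.
func ReleaseEntries(ents *[]raftpb.Entry) {
	for i := range *ents {
		(*ents)[i] = raftpb.Entry{} // drop references to Data payloads
	}
	*ents = (*ents)[:0]
	entryPool.Put(ents)
}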

It would also be nice to pool the entry.Data slice, which, I believe, raft never "owns" in the first place (i.e. it never allocates it, except for ConfChanges): entries come in either through Propose or from raft.Storage. So memory reuse is completely up to the app anyway. In CRDB, it's currently hard to track the lifecycle of these slices because they end up both on the raft transport and in the raft entry cache.
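
For completeness, an app-side sketch of what pooling entry.Data could look like (hypothetical; raft itself is not involved):

// Hypothetical app-side sketch: raft never allocates entry.Data (except for
// ConfChanges), so the application can pool the payload buffers it passes
// to Propose. The hard part, as noted above, is knowing when every holder
// (raft transport, entry cache, ...) has released the buffer.
var dataPool = sync.Pool{
	New: func() any { b := make([]byte, 0, 256); return &b },
}

func propose(rn *raft.RawNode, payload []byte) error {
	bp := dataPool.Get().(*[]byte)
	*bp = append((*bp)[:0], payload...)
	return rn.Propose(*bp) // return *bp to dataPool only once fully released
}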


tbg commented Dec 22, 2022

Filed #15 for discussing the memory reuse angle.
