storage/cmdq: O(1) copy-on-write btree clones and atomic refcount GC policy #32251
Merged: craig merged 1 commit into cockroachdb:master from nvanbenschoten:nvanbenschoten/cmdqTreeCOW on Nov 16, 2018
Conversation
nvanbenschoten force-pushed the nvanbenschoten/cmdqTreeCOW branch from 8edd89d to 40584a5 on November 15, 2018 at 22:39
nvanbenschoten force-pushed the nvanbenschoten/cmdqTreeCOW branch 2 times, most recently from c9ba6a1 to 509bccf, on November 15, 2018 at 23:03
storage/cmdq: O(1) copy-on-write btree clones and atomic refcount GC policy

All commits from cockroachdb#32165 except the last one.

This change introduces O(1) btree cloning and a new copy-on-write scheme, essentially giving the btree an immutable API (for which I took inspiration from https://docs.rs/crate/im/). This is made efficient by the second part of the change: a new garbage collection policy for btrees. Nodes are now reference counted atomically and freed into global `sync.Pool`s when they are no longer referenced.

One of the main ideas in cockroachdb#31997 is to treat the btrees backing the command queue as immutable structures. In doing so, we adopt a copy-on-write scheme. Trees are cloned under lock and then accessed concurrently. When future writers want to modify the tree, they can do so by cloning any nodes that they touch. This commit provides this functionality in a much more elegant manner than 6994347. Instead of giving each node a "copy-on-write context", we give each node a reference count. We then use the following rules:

1. Trees with exclusive ownership (refcount == 1) over a node can modify it in place.
2. Trees without exclusive ownership over a node must clone the node in order to modify it. Once cloned, the tree has exclusive ownership over that node. When cloning the node, the reference count of each of the node's children must be incremented.

By following these simple rules, we end up with a really nice property: trees gain more and more "ownership" as they make modifications, meaning that subsequent modifications are much less likely to need to clone nodes. Essentially, we transparently incorporate the idea of local mutations (e.g. Clojure's transients or Haskell's ST monad) without needing any external API. (A code sketch of these rules follows the first benchmark table below.)

Even better, reference counting internal nodes ties directly into the new GC policy, which allows us to recycle old nodes and makes the copy-on-write scheme zero-allocation in almost all cases. When a node's reference count drops to 0, we simply toss it into a `sync.Pool`. We keep two separate pools: one for leaf nodes and one for non-leaf nodes. This wasn't possible with the previous "copy-on-write context" approach.

The atomic reference counting does have an effect on benchmarks, but it's not a big one (single/double-digit ns) and is negligible compared to the speedup observed in cockroachdb#32165.

```
name                             old time/op  new time/op  delta
BTreeInsert/count=16-4           73.2ns ± 4%  84.4ns ± 4%  +15.30%  (p=0.008 n=5+5)
BTreeInsert/count=128-4           152ns ± 4%   167ns ± 4%   +9.89%  (p=0.008 n=5+5)
BTreeInsert/count=1024-4          250ns ± 1%   263ns ± 2%   +5.21%  (p=0.008 n=5+5)
BTreeInsert/count=8192-4          381ns ± 1%   394ns ± 2%   +3.36%  (p=0.008 n=5+5)
BTreeInsert/count=65536-4         720ns ± 6%   746ns ± 1%     ~     (p=0.119 n=5+5)
BTreeDelete/count=16-4            127ns ±15%   131ns ± 9%     ~     (p=0.690 n=5+5)
BTreeDelete/count=128-4           182ns ± 8%   192ns ± 8%     ~     (p=0.222 n=5+5)
BTreeDelete/count=1024-4          323ns ± 3%   340ns ± 4%   +5.20%  (p=0.032 n=5+5)
BTreeDelete/count=8192-4          532ns ± 2%   556ns ± 1%   +4.55%  (p=0.008 n=5+5)
BTreeDelete/count=65536-4        1.15µs ± 2%  1.22µs ± 7%     ~     (p=0.222 n=5+5)
BTreeDeleteInsert/count=16-4      166ns ± 4%   174ns ± 3%   +4.70%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=128-4     370ns ± 2%   383ns ± 1%   +3.57%  (p=0.008 n=5+5)
BTreeDeleteInsert/count=1024-4    548ns ± 3%   575ns ± 5%   +4.89%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=8192-4    775ns ± 1%   789ns ± 1%   +1.86%  (p=0.016 n=5+5)
BTreeDeleteInsert/count=65536-4  2.20µs ±22%  2.10µs ±18%     ~     (p=0.841 n=5+5)
```
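To make the two rules concrete, here is a minimal Go sketch of the O(1) clone and the refcount-based clone-on-write step. The names and layouts (`btree`, `node`, `mutableFor`, `maxItems`) are illustrative assumptions, not the actual cmdq code:

```go
// Package cowsketch is a simplified illustration of the refcount-based
// copy-on-write rules described in this PR, not the cmdq btree itself.
package cowsketch

import "sync/atomic"

const maxItems = 16 // hypothetical node fanout

type item struct{ key int }

type node struct {
	ref      int32 // atomic reference count
	count    int16 // number of items in use
	leaf     bool
	items    [maxItems]item
	children [maxItems + 1]*node // unused when leaf
}

type btree struct {
	root *node
}

// Clone is O(1): both trees share the entire node graph, and the shared
// root gains one reference. Copying happens lazily, on later writes.
func (t *btree) Clone() btree {
	if t.root != nil {
		atomic.AddInt32(&t.root.ref, 1)
	}
	return btree{root: t.root}
}

// mutableFor returns a version of n that the calling tree may mutate.
// Rule 1: if the tree owns n exclusively (ref == 1), mutate in place.
// Rule 2: otherwise copy n; the copy is exclusively owned, and every
// child now has one more parent, so each child's refcount is bumped.
func mutableFor(n *node) *node {
	if atomic.LoadInt32(&n.ref) == 1 {
		return n
	}
	c := &node{ref: 1, count: n.count, leaf: n.leaf}
	c.items = n.items
	if !n.leaf {
		c.children = n.children
		for i := int16(0); i <= n.count; i++ {
			atomic.AddInt32(&c.children[i].ref, 1)
		}
	}
	// The calling tree gives up its reference to the original node; the
	// full release path (recycling into pools) is sketched further down.
	atomic.AddInt32(&n.ref, -1)
	return c
}
```

An insert or delete would call something like `mutableFor` on each node along its root-to-leaf search path, which is how a tree accumulates exclusive ownership of exactly the subtrees it touches.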
We can see how important the GC and memory re-use policy is by comparing the following few benchmarks. Specifically, notice the difference in operation speed and allocation count in `BenchmarkBTreeDeleteInsertCloneEachTime` between the tests that `Reset` old clones (allowing nodes to be freed into `sync.Pool`s) and the tests that don't `Reset` old clones.

```
name                                                      time/op
BTreeDeleteInsert/count=16-4                               198ns ±28%
BTreeDeleteInsert/count=128-4                              375ns ± 3%
BTreeDeleteInsert/count=1024-4                             577ns ± 2%
BTreeDeleteInsert/count=8192-4                             798ns ± 1%
BTreeDeleteInsert/count=65536-4                           2.00µs ±13%
BTreeDeleteInsertCloneOnce/count=16-4                      173ns ± 2%
BTreeDeleteInsertCloneOnce/count=128-4                     379ns ± 2%
BTreeDeleteInsertCloneOnce/count=1024-4                    584ns ± 4%
BTreeDeleteInsertCloneOnce/count=8192-4                    800ns ± 2%
BTreeDeleteInsertCloneOnce/count=65536-4                  2.04µs ±32%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      535ns ± 8%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4    1.29µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   2.22µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   2.55µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  5.89µs ±20%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       240ns ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      610ns ± 4%
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4    1.20µs ± 2%
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4    1.69µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4   3.52µs ±18%

name                                                      alloc/op
BTreeDeleteInsert/count=16-4                               0.00B
BTreeDeleteInsert/count=128-4                              0.00B
BTreeDeleteInsert/count=1024-4                             0.00B
BTreeDeleteInsert/count=8192-4                             0.00B
BTreeDeleteInsert/count=65536-4                            0.00B
BTreeDeleteInsertCloneOnce/count=16-4                      0.00B
BTreeDeleteInsertCloneOnce/count=128-4                     0.00B
BTreeDeleteInsertCloneOnce/count=1024-4                    0.00B
BTreeDeleteInsertCloneOnce/count=8192-4                    0.00B
BTreeDeleteInsertCloneOnce/count=65536-4                   1.00B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4       288B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4      897B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   1.61kB ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   1.47kB ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  2.40kB ±12%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4    0.00B

name                                                      allocs/op
BTreeDeleteInsert/count=16-4                               0.00
BTreeDeleteInsert/count=128-4                              0.00
BTreeDeleteInsert/count=1024-4                             0.00
BTreeDeleteInsert/count=8192-4                             0.00
BTreeDeleteInsert/count=65536-4                            0.00
BTreeDeleteInsertCloneOnce/count=16-4                      0.00
BTreeDeleteInsertCloneOnce/count=128-4                     0.00
BTreeDeleteInsertCloneOnce/count=1024-4                    0.00
BTreeDeleteInsertCloneOnce/count=8192-4                    0.00
BTreeDeleteInsertCloneOnce/count=65536-4                   0.00
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      1.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4     2.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4    3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4    3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4   4.40 ±14%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4     0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4     0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4    0.00
```

Release note: None
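Where the description says freed nodes are tossed into a `sync.Pool`, the following continuation of the earlier sketch shows one plausible shape for that GC policy; `newNode`, `decRef`, and the two pool variables are illustrative assumptions. In the real change the two pools hold different node layouts (leaves carry no child pointers), while this sketch reuses one `node` type for both. This is also where `Reset` gets its effect: releasing a clone's references lets its exclusively owned nodes flow back into the pools, which is why the reset=true benchmarks above stay at zero allocations.

```go
// Continuing the sketch above (same package; also import "sync").

var (
	leafPool     = sync.Pool{New: func() interface{} { return new(node) }}
	interiorPool = sync.Pool{New: func() interface{} { return new(node) }}
)

// newNode recycles a pooled node (or allocates one if the pool is empty)
// and hands it out with a fresh exclusive reference.
func newNode(leaf bool) *node {
	var n *node
	if leaf {
		n = leafPool.Get().(*node)
	} else {
		n = interiorPool.Get().(*node)
	}
	n.ref = 1
	n.leaf = leaf
	return n
}

// decRef drops one reference to n. The final owner recursively releases
// the children and returns the node to the appropriate pool.
func (n *node) decRef() {
	if n == nil || atomic.AddInt32(&n.ref, -1) > 0 {
		return
	}
	leaf := n.leaf
	if !leaf {
		for i := int16(0); i <= n.count; i++ {
			n.children[i].decRef()
		}
	}
	*n = node{} // scrub items and child pointers so pooled nodes pin no garbage
	if leaf {
		leafPool.Put(n)
	} else {
		interiorPool.Put(n)
	}
}

// Reset releases the tree's reference to its node graph, letting
// exclusively owned nodes flow back into the pools for reuse.
func (t *btree) Reset() {
	if t.root != nil {
		t.root.decRef()
		t.root = nil
	}
}
```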
nvanbenschoten force-pushed the nvanbenschoten/cmdqTreeCOW branch from 509bccf to c855b45 on November 15, 2018 at 23:05
ajwerner approved these changes on Nov 16, 2018
Neat!
Reviewed 2 of 2 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained
TFTR! bors r+
Build failed
Test flake: #32062. bors r+
craig bot pushed a commit that referenced this pull request on Nov 16, 2018
32251: storage/cmdq: O(1) copy-on-write btree clones and atomic refcount GC policy r=nvanbenschoten a=nvanbenschoten

(Commit message identical to the PR description above.)

Co-authored-by: Nathan VanBenschoten <[email protected]>
Build succeeded