Implement append and split_off for BTreeMap and BTreeSet #26227

jooert · 2015-06-11T20:57:33Z

Changes the internal SearchStack API to return the key on removal as well.

rust-highfive · 2015-06-11T20:57:37Z

(rust_highfive has picked a reviewer for you, use r? to override)

bluss · 2015-06-12T12:29:38Z

src/libcollections/btree/map.rs

+               reason = "recently added as part of collections reform 2")]
+    pub fn append(&mut self, other: &mut Self) {
+        // Read all values from `other` into `self`.
+        // The use of `ptr::read` is safe as we clear `other` afterwards


This doesn't look safe -- clear will run drop on all keys & values, so it leads to double drop.

Oops. 😳
I already have a fix for that.
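
For illustration only, here is a minimal sketch of one way to move every entry out of `other` without `ptr::read`, so each key and value is dropped exactly once. It is not the fix that was pushed to this PR; the function name `append_by_moving` is hypothetical, and it uses `std::mem::take`, which did not exist when this PR was written. It is still the simple O(m log n) insert loop.

```rust
use std::collections::BTreeMap;
use std::mem;

/// Hypothetical helper: moves all entries of `src` into `dst`.
/// `mem::take` leaves an empty map behind in `src` and hands the old map
/// over by value, so ownership of every key and value is transferred once
/// and nothing can be dropped twice.
fn append_by_moving<K: Ord, V>(dst: &mut BTreeMap<K, V>, src: &mut BTreeMap<K, V>) {
    let drained = mem::take(src);
    for (key, value) in drained {
        // On a duplicate key, `insert` overwrites the value already in `dst`.
        dst.insert(key, value);
    }
}
```

After the call, `src` is empty and still usable, which is the same observable state that "read everything, then clear" was aiming for.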

bluss · 2015-06-12T12:57:54Z

I'm not familiar enough with the search stack code, so I'd love it if that part were reviewed by someone else.

jooert · 2015-06-12T17:57:17Z

Updated with fixes to bluss's remarks

aturon · 2015-06-12T18:03:26Z

r? @bluss

(I'm handing this off officially, since you've started reviewing it anyway!)

Gankra · 2015-06-12T20:05:36Z

@glaebhoerl is the most experienced with the current design, but I'll be happy to review this a bit later.

glaebhoerl · 2015-06-12T20:25:40Z

@gankro I think that must have been a typo?

Gankra · 2015-06-12T20:33:04Z

Oops, yes I meant @gereeter

bluss · 2015-06-13T00:40:37Z

Thanks, the parts I looked at look great now

gereeter · 2015-06-13T02:40:04Z

src/libcollections/btree/map.rs

+        {
+            // `unwrap` won't panic because `self.len()` > 0.
+            if at <= self.keys().next().unwrap().borrow() {
+                should_swap = true;


Can this condition just be assigned to should_swap?

gereeter · 2015-06-13T03:32:53Z

This looks like this should work fine, but I'm not sure about the algorithms. append is O(m log n), doing m inserts into a tree of size n. Since the two trees are already sorted, it should be possible to merge in O(m + n) time. split_off is O(n log n), but, by preserving much of the original tree, I think it can be done in O(log n) time.

jooert · 2015-06-14T20:43:46Z

First, thanks everyone for review and input, this is great!
@gereeter I expected that there are algorithms faster than the ones implemented here, but I couldn't find good resources for better ones, do you have any helpful links?
Do you think O(m + n) for append is also possible if the key ranges of the two trees overlap? I can't imagine how that would work, as nodes might get full and must be split.
Regarding the implementation of split_off, I found something about level-balanced B-trees, which add parent pointers to nodes and allow splitting in O(log n).

gereeter · 2015-06-16T17:18:15Z

I can't seem to find any references talking about splitting or merging B-Trees, unfortunately - it just isn't an operation that most users have to do.

For append, a very simple (and probably inefficient in terms of constant factor) algorithm would be to iterate through both trees, doing a linear time merge of the two sorted sequences. From there, it is possible to build a B-Tree from a sorted sequence in linear time. "Implementing Sets Efficiently in a Functional Language" by Stephen Adams describes an algorithm (hedge_union) for efficiently merging binary search trees, which probably can be adapted with some work to B-Trees.
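
A rough sketch of the linear-time merge step, written as a lazy iterator adapter over two already-sorted sequences of key-value pairs. The names `MergeSorted` and `merge_sorted` are made up for this example, and the rule that the right-hand value wins on equal keys is an assumption, not something decided in this thread. Building the tree from the merged sequence is not shown.

```rust
use std::cmp::Ordering;
use std::iter::Peekable;

/// Hypothetical adapter that merges two iterators sorted by key.
struct MergeSorted<A: Iterator, B: Iterator> {
    left: Peekable<A>,
    right: Peekable<B>,
}

fn merge_sorted<A: Iterator, B: Iterator>(left: A, right: B) -> MergeSorted<A, B> {
    MergeSorted { left: left.peekable(), right: right.peekable() }
}

impl<K: Ord, V, A, B> Iterator for MergeSorted<A, B>
where
    A: Iterator<Item = (K, V)>,
    B: Iterator<Item = (K, V)>,
{
    type Item = (K, V);

    fn next(&mut self) -> Option<(K, V)> {
        // Compare the two front keys without consuming them.
        let order = match (self.left.peek(), self.right.peek()) {
            (Some(l), Some(r)) => l.0.cmp(&r.0),
            (Some(_), None) => Ordering::Less,
            (None, Some(_)) => Ordering::Greater,
            (None, None) => return None,
        };
        match order {
            Ordering::Less => self.left.next(),
            Ordering::Greater => self.right.next(),
            // Equal keys: drop the left entry and yield the right one
            // (an assumed policy for this sketch).
            Ordering::Equal => {
                self.left.next();
                self.right.next()
            }
        }
    }
}
```

Each call to `next` does one comparison and advances at least one input, so draining the adapter takes O(m + n) steps. Because equal keys collapse into one entry, the length of the output is not known until the merge has finished.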

For split_off, you could just go down the tree as if you were searching for the key to be split, splitting the individual nodes around the point where the key would be found, sort of zipping the tree apart. I know this explanation is probably very unclear, so to demonstrate with pictures:

Suppose we want to split the following tree at 7:

                 +----+----+
                 | 5  | 10 |
      ___________+----+----+___________
     /                |                \
+----+----+    +----+----+----+    +----+----+
| 1  | 3  |    | 6  | 7  | 9  |    | 11 | 13 |
+----+----+    +----+----+----+    +----+----+

We start at the root, looking for where seven would go:

                 7 would go here
                      V
                 +----+----+
                 | 5  | 10 |
      ___________+----+----+___________
     /                |                \
+----+----+    +----+----+----+    +----+----+
| 1  | 3  |    | 6  | 7  | 9  |    | 11 | 13 |
+----+----+    +----+----+----+    +----+----+

Since that point is in the middle of the root, we split the root into two pieces, one larger and one smaller:

          +----+             +----+
          | 5  |             | 10 |
      ____+----+_____    ____+----+____
     /               \  /              \
+----+----+    +----+----+----+    +----+----+
| 1  | 3  |    | 6  | 7  | 9  |    | 11 | 13 |
+----+----+    +----+----+----+    +----+----+

Once we've done that, we move on to the next node searching for 7:

          +----+             +----+
          | 5  |             | 10 |
      ____+----+_____    ____+----+____
     /               \  /              \
+----+----+    +----+----+----+    +----+----+
| 1  | 3  |    | 6  | 7  | 9  |    | 11 | 13 |
+----+----+    +----+----+----+    +----+----+
                       ^
                   7 is right here

At that point, as in the root node, we split the node into a greater part and a lesser part:

          +----+                     +----+
          | 5  |                     | 10 |
      ____+----+_____              __+----+__
     /               \            /          \
+----+----+     +----+----+    +----+    +----+----+
| 1  | 3  |     | 6  | 7  |    | 9  |    | 11 | 13 |
+----+----+     +----+----+    +----+    +----+----+

Since the node we just split was a leaf node (also, because we actually found our splitting key), we don't need to do any more splitting, and we are left with two trees, one greater than our key and the other less than our key. There is still some more work involved, as this splitting process probably left many of the nodes that we split underfull, requiring steps to recoalesce nodes, but once that is done, the B-Tree is split. Note that since this just goes up and down the search path to the node, it only takes log(n) time.
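
The "zipping apart" walk can be sketched on a toy tree. Everything here is hypothetical: the `Node` type is a simplified stand-in (plain `i32` keys, no values, `Vec`-backed), not the real `BTreeMap` node, and the recoalescing of underfull nodes is left out entirely. Unlike the pictures, which leave 7 in the left half, this sketch sends keys greater than or equal to the split key to the right half, which is how `split_off` behaves in today's standard library.

```rust
/// Toy node: `children` is empty for a leaf, otherwise it has
/// `keys.len() + 1` entries.
#[derive(Debug)]
struct Node {
    keys: Vec<i32>,
    children: Vec<Node>,
}

/// Splits the tree rooted at `node` into (keys < at, keys >= at).
/// Only nodes on the search path are touched; the halves may be underfull.
fn zip_apart(node: Node, at: i32) -> (Node, Node) {
    let Node { keys: mut left_keys, children: mut left_children } = node;
    // Index of the first key that belongs to the right half.
    let idx = left_keys.partition_point(|k| *k < at);
    let right_keys = left_keys.split_off(idx);
    let mut right_children = Vec::new();
    if !left_children.is_empty() {
        // Children after the straddling child go entirely to the right.
        right_children = left_children.split_off(idx + 1);
        // The child at `idx` may hold keys on both sides, so recurse into it.
        let straddling = left_children.pop().expect("internal node has a child here");
        let (lo, hi) = zip_apart(straddling, at);
        left_children.push(lo);
        right_children.insert(0, hi);
    }
    (
        Node { keys: left_keys, children: left_children },
        Node { keys: right_keys, children: right_children },
    )
}

/// Collects the keys of a tree in sorted order (used to check the result).
fn in_order(node: &Node, out: &mut Vec<i32>) {
    for (i, key) in node.keys.iter().enumerate() {
        if let Some(child) = node.children.get(i) {
            in_order(child, out);
        }
        out.push(*key);
    }
    if let Some(last) = node.children.get(node.keys.len()) {
        in_order(last, out);
    }
}
```

The recursion visits one node per level, so the number of nodes split is proportional to the height of the tree. In a real node with bounded capacity the per-node work is constant; the `Vec` operations here are only a convenience of the sketch.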

gereeter · 2015-06-16T17:22:04Z

Note: Since the splitting and merging algorithms are fairly involved and badly documented, I would not at all be opposed to merging this PR as is and opening performance issues to use the better algorithms.

jooert · 2015-06-18T14:06:18Z

That sounds interesting, thank you for the explanation! I will try to implement something like that.

bors · 2015-06-18T21:35:30Z

☔ The latest upstream changes (presumably #26192) made this pull request unmergeable. Please resolve the merge conflicts.

jooert · 2015-06-27T14:55:49Z

I pushed a linear time implementation of append; @gereeter, could you please have a look at it?

gereeter · 2015-06-27T19:45:04Z

src/libcollections/btree/map.rs

+
+        // Second, we build a tree from the sorted sequence in linear time.
+        self.length = elements.len();
+        let (depth, root) = Node::from_sorted_iter(elements.into_iter(), self.length, self_b);


If you can build from an iterator, it would be more efficient to make an iterator that merges two iterators instead of going through a Vec.

Yeah, I don't like allocating a new Vec here either, but for the algorithm in from_sorted_iter to work, I have to know how many elements I have after the merge.

Bleh, I hadn't thought of the equal key case. It still might be possible and not too difficult, as in the worst case, only the "right edge" of the BTree will be left underfull, and every underfull node except for the root (which, if I remember correctly, is allowed to be underfull) is adjacent to a completely full node, allowing a simple steal to fix things.

gereeter · 2015-06-27T21:19:20Z

src/libcollections/btree/node.rs

+
+        loop {
+            // Determine how many nodes we need on this level.
+            let num_nodes = num_elements / (capacity + 1) + 1;


Can you add a comment explaining why this calculation is correct?
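
One possible justification, offered as a guess at the reasoning rather than as the author's explanation: assume that between every two adjacent nodes on a level, one element is promoted to the level above as a separator, and that `num_elements` counts those separators too. Then k nodes account for at most k * capacity + (k - 1) elements, and the smallest k that is large enough is ceil((n + 1) / (capacity + 1)), which equals n / (capacity + 1) + 1 in integer arithmetic. The sketch below checks the formula against a brute-force search; both function names are made up.

```rust
/// The formula from the diff, extracted into a hypothetical helper.
fn num_nodes(num_elements: usize, capacity: usize) -> usize {
    num_elements / (capacity + 1) + 1
}

/// Smallest node count k such that k nodes plus the k - 1 separators
/// promoted from between them can account for `num_elements` elements.
/// The separator model is an assumption of this sketch.
fn min_nodes_brute_force(num_elements: usize, capacity: usize) -> usize {
    let mut k = 1;
    while k * capacity + (k - 1) < num_elements {
        k += 1;
    }
    k
}
```

Under that model the two functions agree for every input, including the boundary cases where `num_elements` is exactly `capacity` (one node) or `capacity + 1` (two nodes and one separator).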

Gankra · 2015-07-18T23:19:41Z

@jooert what's the status of this PR?

jooert · 2015-07-20T15:19:32Z

@gankro Sorry for the very long delay and thanks for the kind words! I've just pushed a new version of append that builds the merged tree entirely using an iterator and only allocates memory for keeping track of the different levels of the tree.
@gereeter Could you please have a look at this, again?

I haven't started implementing the split_off method in log(n) time, so this PR is still very much WIP.

Gankra · 2015-07-20T16:52:55Z

@jooert So I've become increasingly convinced that BTree needs to be refactored to use parent pointers (parent_ptr, edge_index). This would remove all allocations except for split/merge which are obviously necessary, and the whole "search stack" system which is pretty complicated.

Is this something that you think would simplify your code? Something that you'd be interested in doing?

Note that one can actually implement parent pointers but not bother to replace all the search stacks to start. That is it's theoretically possible for the two to co-exist temporarily.

jooert · 2015-07-21T09:23:27Z

Parent pointers would simplify my code insofar as I wouldn't need to keep track of the different levels of the tree using the levels vector. In general, I agree with you that the implementation of BTree would be easier to grasp if the whole search stack machinery were replaced with an implementation using parent pointers. But I'm not sure what the performance implications of having two extra pointers per node would be; do you think these are negligible?
I am definitely interested in doing this refactoring, but it will take some time.

Gankra · 2015-07-21T17:55:29Z

To my knowledge, it's a performance slam dunk. This is the strategy used in Google's https://code.google.com/p/cpp-btree/

arthurprs · 2015-07-21T19:20:55Z

@jooert Possibly stupid question. You said two pointers because we'd need not only a "parent pointer" but also an "index in parent"?

We can always use u8 for len/cap/parent_index and move those to the end of the struct. That'd save quite a bit of space. The google implementation linked by @gankro uses u8 as default.

jooert · 2015-07-21T20:24:05Z

@arthurprs Yes, exactly.

Gankra · 2015-07-21T20:38:07Z

Rust will pad the struct to have a size that's a multiple of its alignment, though. So using a u8 doesn't actually save you anything if it will just be rounded up to what a u64 would have done.

arthurprs · 2015-07-21T20:58:54Z

@gankro not really, it depends on where in the struct you put your smaller types. If you stick a u8 in between 2 usizes it'll have 7 bytes of padding to allow aligned access on the second. At least in a C-ish standard ABI.

What I'm proposing is sticking them at the end.

Example: http://is.gd/b7ChEQ

This way the new Node will have the same size as the current one (40 bytes in 64bit builds), the only limitation is having a B <= 255. If that's not enough we can use u16s instead and still keep the 40 bytes size, not true for 32bit builds though.
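
The size argument can be checked with `std::mem::size_of`. This is a sketch under stated assumptions, not a reproduction of the linked example: the field names and types are placeholders rather than the real `Node` definition, and `#[repr(C)]` is used so that the declared field order is the actual layout (without it, Rust is free to reorder fields).

```rust
use std::mem::size_of;

/// Stand-in for the current node header: three pointers and two usizes.
#[allow(dead_code)]
#[repr(C)]
struct CurrentNode {
    keys: *const u8,
    vals: *const u8,
    edges: *const u8,
    len: usize,
    capacity: usize,
}

/// Adds a parent pointer and shrinks the counters to u8, placed at the end.
#[allow(dead_code)]
#[repr(C)]
struct NodeWithParentU8 {
    keys: *const u8,
    vals: *const u8,
    edges: *const u8,
    parent: *const u8,
    len: u8,
    capacity: u8,
    parent_idx: u8,
}

/// Same idea with u16 counters.
#[allow(dead_code)]
#[repr(C)]
struct NodeWithParentU16 {
    keys: *const u8,
    vals: *const u8,
    edges: *const u8,
    parent: *const u8,
    len: u16,
    capacity: u16,
    parent_idx: u16,
}

/// A single u8 between two usizes is padded up to a full word.
#[allow(dead_code)]
#[repr(C)]
struct Interleaved {
    a: usize,
    small: u8,
    b: usize,
}

/// Returns (current, with u8 counters, with u16 counters, interleaved) sizes.
fn layout_sizes() -> (usize, usize, usize, usize) {
    (
        size_of::<CurrentNode>(),
        size_of::<NodeWithParentU8>(),
        size_of::<NodeWithParentU16>(),
        size_of::<Interleaved>(),
    )
}
```

With these placeholder layouts, the u8 variant is the same size as the current header on both 32-bit and 64-bit targets, while the u16 variant matches only on 64-bit targets, and the interleaved struct occupies three full words.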

Gankra · 2015-07-21T21:17:47Z

Yes if you fold other values to be smaller, you will get savings. I was assuming you were just suggesting only making one field a u8 (which would be useless).

Gankra · 2015-07-21T21:18:45Z

Note that in a defunct PR I removed the ability to set B at all, so we can safely just make it a constant that "happens" to work. (the current default has always seemed fine, and gains from changing it are small to trivial).

arthurprs · 2015-07-21T21:21:58Z

Cool, so if we ever go this route (parent pointers) we should consider using these smaller integers types to save space.

apasel422 · 2015-08-08T19:24:52Z

Is anyone working on the parent pointer implementation?

jooert · 2015-08-10T06:50:24Z

I'm not, I haven't found time for it yet. Go for it. :-)

bors · 2015-08-29T01:45:58Z

☔ The latest upstream changes (presumably #28043) made this pull request unmergeable. Please resolve the merge conflicts.

alexcrichton · 2015-09-28T18:51:33Z

What's the status of this? Blocked on the parent pointer rewrite? Ready to go with a rebase? Outstanding concerns?

apasel422 · 2015-09-28T18:53:21Z

I think this should wait pending the rewrite at https://github.com/gereeter/btree-rewrite.

alexcrichton · 2015-09-28T18:55:23Z

Ah ok, in that case I'm gonna close this for now (clearing out the queue).

apasel422 · 2016-01-19T16:03:23Z

@jooert Now that #30426 has landed, are you interested in reopening this PR?

@gereeter

Implement `append` for b-trees. I have finally found time to revive #26227, this time only with an `append` implementation. The algorithm implemented here is linear in the size of the two b-trees. It first creates a `MergeIter` from the two b-trees and then builds a new b-tree by pushing key-value pairs from the `MergeIter` into nodes at the right heights. Three functions for stealing have been added to the implementation of `Handle` as well as a getter for the height of a `NodeRef`. The docs have been updated with performance information about `BTreeMap::append` and the remark about B has been removed now that it is the same for all instances of `BTreeMap`. cc @gereeter @gankro @apasel422

rust-highfive assigned aturon Jun 11, 2015

bluss reviewed Jun 12, 2015

jooert force-pushed the btree_append_split_off branch from bd1f270 to 6d959cc Compare June 12, 2015 17:53

rust-highfive assigned bluss and unassigned aturon Jun 12, 2015

gereeter reviewed Jun 13, 2015

jooert force-pushed the btree_append_split_off branch from 6d959cc to 8026390 Compare June 27, 2015 14:54

gereeter reviewed Jun 27, 2015

jooert added 2 commits July 20, 2015 16:21

Implement append and split_off for BTreeMap and BTreeSet

9b903bf

Changes the internal SearchStack API to return the key on removal as well.

Add linear time implementation of append.

5f1a116

jooert force-pushed the btree_append_split_off branch from 8026390 to 5f1a116 Compare July 20, 2015 15:11

apasel422 mentioned this pull request Aug 17, 2015

Change BTreeMap to use parent pointers #27865

Closed

alexcrichton closed this Sep 28, 2015

gereeter mentioned this pull request Nov 30, 2015

Implement BTreeMap::append gereeter/btree-rewrite#5

Open

jooert mentioned this pull request Mar 24, 2016

Implement append for b-trees. #32466

Merged

apasel422 mentioned this pull request Apr 29, 2016

Tracking issue for collections reform part 2 (RFC 509) #19986

Closed
