Implement append and split_off for BTreeMap and BTreeSet #26227

Closed
wants to merge 2 commits into from

Conversation

jooert
Contributor

@jooert jooert commented Jun 11, 2015

Changes the internal SearchStack API to return the key on removal as well.

@rust-highfive
Collaborator

r? @aturon

(rust_highfive has picked a reviewer for you, use r? to override)

reason = "recently added as part of collections reform 2")]
pub fn append(&mut self, other: &mut Self) {
// Read all values from `other` into `self`.
// The use of `ptr::read` is safe as we clear `other` afterwards
Member

This doesn't look safe -- clear will run drop on all keys & values, so it leads to double drop.

Contributor Author

Oops. 😳
I already have a fix for that.
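For readers following along, here is a minimal sketch of the hazard and one way around it. It uses plain `Vec`s rather than the B-tree internals, and `Counted`, `move_all` and `drops_after_move` are hypothetical names for illustration, not code from this PR. The idea is that after `ptr::read` the source must forget its elements (here via `set_len(0)`) instead of calling `clear`, which would drop them a second time.

```rust
use std::cell::Cell;
use std::ptr;

// Counts how many times a value has been dropped.
struct Counted<'a>(&'a Cell<usize>);

impl<'a> Drop for Counted<'a> {
    fn drop(&mut self) {
        self.0.set(self.0.get() + 1);
    }
}

/// Moves every element of `src` to the end of `dst` with `ptr::read`.
fn move_all<T>(dst: &mut Vec<T>, src: &mut Vec<T>) {
    let n = src.len();
    dst.reserve(n);
    unsafe {
        // Make `src` forget its elements first, so they are never dropped
        // through `src`. Calling `src.clear()` after the reads would instead
        // drop values that `dst` now owns.
        src.set_len(0);
        for i in 0..n {
            // The buffer is still allocated and initialized; each slot is
            // read exactly once, so every value ends up with one owner.
            let value = ptr::read(src.as_ptr().add(i));
            dst.push(value);
        }
    }
}

/// Returns the total number of drops observed after moving and
/// then dropping both vectors.
fn drops_after_move() -> usize {
    let drops = Cell::new(0);
    {
        let mut a = vec![Counted(&drops)];
        let mut b = vec![Counted(&drops), Counted(&drops)];
        move_all(&mut a, &mut b);
        assert!(b.is_empty());
        assert_eq!(a.len(), 3);
    }
    drops.get()
}
```

With three values in total, `drops_after_move()` should observe exactly three drops; a `clear`-based version would observe more.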

@bluss
Member

bluss commented Jun 12, 2015

I'm not familiar enough with the search stack code, so I'd love it if that part were reviewed by someone else.

@jooert jooert force-pushed the btree_append_split_off branch from bd1f270 to 6d959cc Compare June 12, 2015 17:53
@jooert
Contributor Author

jooert commented Jun 12, 2015

Updated with fixes to bluss's remarks

@aturon
Member

aturon commented Jun 12, 2015

r? @bluss

(I'm handing this off officially, since you've started reviewing it anyway!)

@rust-highfive rust-highfive assigned bluss and unassigned aturon Jun 12, 2015
@Gankra
Contributor

Gankra commented Jun 12, 2015

@glaebhoerl is the most experienced with the current design, but I'll be happy to review this a bit later.

@glaebhoerl
Contributor

@gankro I think that must have been a typo?

@Gankra
Contributor

Gankra commented Jun 12, 2015

Oops, yes I meant @gereeter

@bluss
Member

bluss commented Jun 13, 2015

Thanks, the parts I looked at look great now

{
// `unwrap` won't panic because `self.len()` > 0.
if at <= self.keys().next().unwrap().borrow() {
should_swap = true;
Contributor

Can this condition just be assigned to should_swap?

Contributor Author

Of course.

@gereeter
Contributor

This looks like it should work fine, but I'm not sure about the algorithms. append is O(m log n), doing m inserts into a tree of size n. Since the two trees are already sorted, it should be possible to merge in O(m + n) time. split_off is O(n log n), but, by preserving much of the original tree, I think it can be done in O(log n) time.

@jooert
Contributor Author

jooert commented Jun 14, 2015

First, thanks everyone for review and input, this is great!
@gereeter I expected there to be faster algorithms than the ones implemented here, but I couldn't find good resources on them. Do you have any helpful links?
Do you think O(m + n) for append is also possible if the key ranges of the two trees overlap? I can't imagine how that would work, as nodes might get full and must be split.
Regarding the implementation of split_off, I found something about level-balanced B-trees, which add parent pointers to nodes and allow splitting in O(log n).

@gereeter
Contributor

I can't seem to find any references talking about splitting or merging B-Trees, unfortunately - it just isn't an operation that most users have to do.

For append, a very simple (and probably inefficient in terms of constant factor) algorithm would be to iterate through both trees, doing a linear time merge of the two sorted sequences. From there, it is possible to build a B-Tree from a sorted sequence in linear time. "Implementing Sets Efficiently in a Functional Language" by Stephen Adams describes an algorithm (hedge_union) for efficiently merging binary search trees, which probably can be adapted with some work to B-Trees.
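The linear-time merge step described above can be sketched with two peekable iterators. This is an illustration only: `MergeSorted` and `merge_sorted` are hypothetical names, the result is collected into a `Vec` rather than built into a tree, and the choice that the right-hand entry wins on equal keys is an assumption made for this example.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;
use std::iter::Peekable;

/// Merges two iterators whose keys are each in strictly increasing order.
/// When both sides hold the same key, the right-hand entry is kept.
struct MergeSorted<L: Iterator, R: Iterator> {
    left: Peekable<L>,
    right: Peekable<R>,
}

impl<K: Ord, V, L, R> Iterator for MergeSorted<L, R>
where
    L: Iterator<Item = (K, V)>,
    R: Iterator<Item = (K, V)>,
{
    type Item = (K, V);

    fn next(&mut self) -> Option<(K, V)> {
        // Compare the two front keys without consuming them.
        let order = match (self.left.peek(), self.right.peek()) {
            (Some(l), Some(r)) => l.0.cmp(&r.0),
            (Some(_), None) => Ordering::Less,
            (None, Some(_)) => Ordering::Greater,
            (None, None) => return None,
        };
        match order {
            Ordering::Less => self.left.next(),
            Ordering::Greater => self.right.next(),
            Ordering::Equal => {
                // Same key on both sides: discard the left entry.
                self.left.next();
                self.right.next()
            }
        }
    }
}

/// Produces the merged, sorted sequence of two maps in one pass over each.
fn merge_sorted<K: Ord, V>(a: BTreeMap<K, V>, b: BTreeMap<K, V>) -> Vec<(K, V)> {
    let merged = MergeSorted {
        left: a.into_iter().peekable(),
        right: b.into_iter().peekable(),
    };
    merged.collect()
}
```

Each element is looked at a constant number of times, so the merge itself is O(m + n); building the tree from the resulting sequence is a separate step.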

For split_off, you could just go down the tree as if you were searching for the key to be split, then split the individual nodes around the point where the key would be found, sort of zipping the tree apart. I know this explanation is probably very unclear, so to demonstrate with pictures:

Suppose we want to split the following tree at 7:

                 +----+----+
                 | 5  | 10 |
      ___________+----+----+___________
     /                |                \
+----+----+    +----+----+----+    +----+----+
| 1  | 3  |    | 6  | 7  | 9  |    | 11 | 13 |
+----+----+    +----+----+----+    +----+----+

We start at the root, looking for where seven would go:

                 7 would go here
                      V
                 +----+----+
                 | 5  | 10 |
      ___________+----+----+___________
     /                |                \
+----+----+    +----+----+----+    +----+----+
| 1  | 3  |    | 6  | 7  | 9  |    | 11 | 13 |
+----+----+    +----+----+----+    +----+----+

Since that point is in the middle of the root, we split the root into two pieces, one larger and one smaller:

          +----+             +----+
          | 5  |             | 10 |
      ____+----+_____    ____+----+____
     /               \  /              \
+----+----+    +----+----+----+    +----+----+
| 1  | 3  |    | 6  | 7  | 9  |    | 11 | 13 |
+----+----+    +----+----+----+    +----+----+

Once we've done that, we move on to the next node searching for 7:

          +----+             +----+
          | 5  |             | 10 |
      ____+----+_____    ____+----+____
     /               \  /              \
+----+----+    +----+----+----+    +----+----+
| 1  | 3  |    | 6  | 7  | 9  |    | 11 | 13 |
+----+----+    +----+----+----+    +----+----+
                       ^
                   7 is right here

At that point, as in the root node, we split the node into a greater part and a lesser part:

          +----+                     +----+
          | 5  |                     | 10 |
      ____+----+_____              __+----+__
     /               \            /          \
+----+----+     +----+----+    +----+    +----+----+
| 1  | 3  |     | 6  | 7  |    | 9  |    | 11 | 13 |
+----+----+     +----+----+    +----+    +----+----+

Since the node we just split was a leaf node (and because we actually found our splitting key), we don't need to do any more splitting, and we are left with two trees, one with the keys greater than our key and the other with the remaining keys. There is still some more work involved, as this splitting process probably left many of the nodes that we split underfull, requiring steps to recoalesce nodes, but once that is done, the B-Tree is split. Note that since this just goes up and down the search path to the node, it only takes O(log n) time.
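For comparison, the keys from the pictures above can be run through `BTreeMap::split_off` as it exists in today's standard library. This shows only the observable result, not the node-level algorithm, and `split_example` is a hypothetical helper name. Note that the standard library method returns everything from the given key onwards, so 7 lands in the returned (greater) map, unlike in the drawing where it stays on the lesser side.

```rust
use std::collections::BTreeMap;

/// Splits the example key set at 7 and returns the keys of both halves.
fn split_example() -> (Vec<u32>, Vec<u32>) {
    let mut low: BTreeMap<u32, ()> = vec![1, 3, 5, 6, 7, 9, 10, 11, 13]
        .into_iter()
        .map(|k| (k, ()))
        .collect();
    // `split_off` keeps keys < 7 in `low` and returns keys >= 7.
    let high = low.split_off(&7);
    (
        low.keys().cloned().collect(),
        high.keys().cloned().collect(),
    )
}
```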

@gereeter
Contributor

Note: Since the splitting and merging algorithms are fairly involved and badly documented, I would not at all be opposed to merging this PR as is and opening performance issues to use the better algorithms.

@jooert
Contributor Author

jooert commented Jun 18, 2015

That sounds interesting, thank you for the explanation! I will try to implement something like that.

@bors
Contributor

bors commented Jun 18, 2015

☔ The latest upstream changes (presumably #26192) made this pull request unmergeable. Please resolve the merge conflicts.

@jooert jooert force-pushed the btree_append_split_off branch from 6d959cc to 8026390 Compare June 27, 2015 14:54
@jooert
Contributor Author

jooert commented Jun 27, 2015

I pushed a linear time implementation of append; @gereeter, could you please have a look at it?


// Second, we build a tree from the sorted sequence in linear time.
self.length = elements.len();
let (depth, root) = Node::from_sorted_iter(elements.into_iter(), self.length, self_b);
Contributor

If you can build from an iterator, it would be more efficient to make an iterator that merges two iterators instead of going through a Vec.

Contributor Author

Yeah, I don't like allocating a new Vec here either, but for the algorithm in from_sorted_iter to work, I have to know how many elements I have after the merge.

Contributor

Bleh, I hadn't thought of the equal key case. It still might be possible and not too difficult, as in the worst case, only the "right edge" of the BTree will be left underfull, and every underfull node except for the root (which, if I remember correctly, is allowed to be underfull) is adjacent to a completely full node, allowing a simple steal to fix things.


loop {
// Determine how many nodes we need on this level.
let num_nodes = num_elements / (capacity + 1) + 1;
Contributor

Can you add a comment explaining why this calculation is correct?
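One possible justification, offered as a guess rather than as the author's reasoning: assume each node holds at most `capacity` elements and that, between k adjacent nodes on a level, k - 1 separator elements move up to the parent level. Then k nodes account for at most k * capacity + (k - 1) elements, and the smallest sufficient k is ceil((n + 1) / (capacity + 1)), which equals n / (capacity + 1) + 1 in integer arithmetic. The sketch below checks that identity by brute force; `min_nodes_brute_force` is a hypothetical helper.

```rust
/// The formula from the diff, in integer arithmetic.
fn num_nodes(num_elements: usize, capacity: usize) -> usize {
    num_elements / (capacity + 1) + 1
}

/// Smallest k >= 1 such that k nodes of `capacity` elements each, plus the
/// k - 1 separators assumed to move up to the parent level, account for
/// at least `num_elements` elements.
fn min_nodes_brute_force(num_elements: usize, capacity: usize) -> usize {
    let mut k = 1;
    while k * capacity + (k - 1) < num_elements {
        k += 1;
    }
    k
}
```

Under these assumptions the two functions agree for every input, for example with a capacity of 11 a level of 11 elements needs one node and a level of 12 elements needs two.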

@Gankra
Contributor

Gankra commented Jul 18, 2015

@jooert what's the status of this PR?

jooert added 2 commits July 20, 2015 16:21
Changes the internal SearchStack API to return the key on removal as well.
@jooert jooert force-pushed the btree_append_split_off branch from 8026390 to 5f1a116 Compare July 20, 2015 15:11
@jooert
Contributor Author

jooert commented Jul 20, 2015

@gankro Sorry for the very long delay and thanks for the kind words! I've just pushed a new version of append that builds the merged tree entirely using an iterator and only allocates memory for keeping track of the different levels of the tree.
@gereeter Could you please have a look at this, again?

I haven't started implementing the split_off method in log(n) time, so this PR is still very much WIP.

@Gankra
Contributor

Gankra commented Jul 20, 2015

@jooert So I've become increasingly convinced that BTree needs to be refactored to use parent pointers (parent_ptr, edge_index). This would remove all allocations except for split/merge which are obviously necessary, and the whole "search stack" system which is pretty complicated.

Is this something that you think would simplify your code? Something that you'd be interested in doing?

Note that one can actually implement parent pointers but not bother to replace all the search stacks to start. That is it's theoretically possible for the two to co-exist temporarily.
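To make the (parent_ptr, edge_index) idea concrete, here is a small sketch of how a traversal can climb the tree without a search stack. It is not the proposed implementation: `Node`, `Tree`, `successor_of_leaf` and `example_tree` are hypothetical, and indices into a `Vec` stand in for raw parent pointers.

```rust
/// Hypothetical node layout; indices into `Tree::nodes` stand in for pointers.
struct Node {
    keys: Vec<u32>,
    /// Child node indices; empty for leaves.
    children: Vec<usize>,
    /// (parent node, index of the edge in the parent that leads here).
    parent: Option<(usize, usize)>,
}

struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    /// Given a leaf whose keys are exhausted, finds the next key in order
    /// by following parent links only, with no stack and no allocation.
    fn successor_of_leaf(&self, mut node: usize) -> Option<u32> {
        while let Some((parent, edge)) = self.nodes[node].parent {
            // Edge i sits just left of key i in the parent, if that key exists.
            if edge < self.nodes[parent].keys.len() {
                return Some(self.nodes[parent].keys[edge]);
            }
            // We came from the rightmost edge: keep climbing.
            node = parent;
        }
        None
    }
}

/// A two-level tree: root [5, 10] with leaves [1, 3], [6, 7, 9], [11, 13].
fn example_tree() -> Tree {
    let leaf = |keys: Vec<u32>, edge: usize| Node {
        keys,
        children: Vec::new(),
        parent: Some((0, edge)),
    };
    Tree {
        nodes: vec![
            Node { keys: vec![5, 10], children: vec![1, 2, 3], parent: None },
            leaf(vec![1, 3], 0),
            leaf(vec![6, 7, 9], 1),
            leaf(vec![11, 13], 2),
        ],
    }
}
```

In this sketch the successor of the first leaf is 5, the successor of the middle leaf is 10, and the last leaf has none.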

@jooert
Contributor Author

jooert commented Jul 21, 2015

Parent pointers would simplify my code insofar as I wouldn't need to keep track of the different levels of the tree using the levels vector. In general, I agree with you that the implementation of BTree would be easier to grasp if the whole search stack machinery were replaced with an implementation using parent pointers. But I'm not sure what the performance implications of having two extra pointers per node would be; do you think these are negligible?
I am definitely interested in doing this refactoring, but it will take some time.

@Gankra
Contributor

Gankra commented Jul 21, 2015

To my knowledge, it's a performance slam dunk. This is the strategy used in Google's https://code.google.com/p/cpp-btree/

@arthurprs
Contributor

@jooert Possibly stupid question. You said two pointers because we'd need not only a "parent pointer" but also an "index in parent"?

We can always use u8 for len/cap/parent_index and move those to the end of the struct. That'd save quite a bit of space. The google implementation linked by @gankro uses u8 as default.

@jooert
Contributor Author

jooert commented Jul 21, 2015

@arthurprs Yes, exactly.

@Gankra
Contributor

Gankra commented Jul 21, 2015

Rust will pad the struct to have a size that's a multiple of its alignment, though. So using a u8 doesn't actually save you anything if it will just be rounded up to what a u64 would have done.

@arthurprs
Contributor

@gankro not really, it depends where in the struct you want to have your smaller types. If you stick a u8 in between 2 usizes it'll have 7 bytes of padding to allow aligned access on the second. At least in Cish standard ABI.

What I'm proposing is sticking them at the end.

Example: http://is.gd/b7ChEQ

This way the new Node will have the same size as the current one (40 bytes in 64bit builds), the only limitation is having a B <= 255. If that's not enough we can use u16s instead and still keep the 40 bytes size, not true for 32bit builds though.
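Both points about padding can be checked with `std::mem::size_of`. The structs below are hypothetical stand-ins, not the real `Node`; `#[repr(C)]` is used so that fields stay in declared order (Rust's default representation is free to reorder them), and the sizes in the comments assume a typical 64-bit target where `u64` is 8-byte aligned.

```rust
use std::mem::size_of;

/// Small fields placed between wide ones: each u8 drags 7 padding bytes.
#[repr(C)]
struct Interleaved {
    keys: u64,
    len: u8,
    vals: u64,
    parent_idx: u8,
}

/// Same fields, small ones grouped at the end: they share one padded slot.
#[repr(C)]
struct SmallFieldsLast {
    keys: u64,
    vals: u64,
    len: u8,
    parent_idx: u8,
}

/// Shrinking a single trailing field saves nothing because the struct
/// size is rounded up to a multiple of its alignment.
#[repr(C)]
struct OneSmallField {
    keys: u64,
    vals: u64,
    len: u8,
}

#[repr(C)]
struct OneWideField {
    keys: u64,
    vals: u64,
    len: u64,
}

/// Sizes in bytes, in the order the structs are declared above.
fn sizes() -> [usize; 4] {
    [
        size_of::<Interleaved>(),     // 32 on the assumed target
        size_of::<SmallFieldsLast>(), // 24
        size_of::<OneSmallField>(),   // 24
        size_of::<OneWideField>(),    // 24
    ]
}
```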

@Gankra
Contributor

Gankra commented Jul 21, 2015

Yes if you fold other values to be smaller, you will get savings. I was assuming you were just suggesting only making one field a u8 (which would be useless).

@Gankra
Contributor

Gankra commented Jul 21, 2015

Note that in a defunct PR I removed the ability to set B at all, so we can safely just make it a constant that "happens" to work. (the current default has always seemed fine, and gains from changing it are small to trivial).

@arthurprs
Contributor

Cool, so if we ever go this route (parent pointers) we should consider using these smaller integer types to save space.

@apasel422
Contributor

Is anyone working on the parent pointer implementation?

@jooert
Contributor Author

jooert commented Aug 10, 2015

On 08.08.2015 at 21:25, Andrew Paseltiner wrote:

Is anyone working on the parent pointer implementation?


I'm not, I haven't found time for it yet. Go for it. :-)

@bors
Contributor

bors commented Aug 29, 2015

☔ The latest upstream changes (presumably #28043) made this pull request unmergeable. Please resolve the merge conflicts.

@alexcrichton
Member

What's the status of this? Blocked on the parent pointer rewrite? Ready to go with a rebase? Outstanding concerns?

@apasel422
Contributor

I think this should wait pending the rewrite at https://github.com/gereeter/btree-rewrite.

@alexcrichton
Member

Ah ok, in that case I'm gonna close this for now (clearing out the queue).

@apasel422
Contributor

@jooert Now that #30426 has landed, are you interested in reopening this PR?

bors added a commit that referenced this pull request Apr 23, 2016
Implement `append` for b-trees.

I have finally found time to revive #26227, this time only with an `append` implementation.

The algorithm implemented here is linear in the size of the two b-trees. It first creates
a `MergeIter` from the two b-trees and then builds a new b-tree by pushing
key-value pairs from the `MergeIter` into nodes at the right heights.

Three functions for stealing have been added to the implementation of `Handle` as
well as a getter for the height of a `NodeRef`.

The docs have been updated with performance information about `BTreeMap::append` and
the remark about B has been removed now that it is the same for all instances of `BTreeMap`.

cc @gereeter @gankro @apasel422
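
As a usage illustration of the `BTreeMap::append` method that eventually landed, the following sketch shows its documented behavior in today's standard library: all entries move from `other` into `self`, `other` is left empty, and on a duplicate key the value from `other` replaces the one in `self`. `append_example` is a hypothetical helper name.

```rust
use std::collections::BTreeMap;

/// Appends `b` to `a` and returns the merged entries plus the final
/// length of `b`.
fn append_example() -> (Vec<(u32, &'static str)>, usize) {
    let mut a = BTreeMap::new();
    a.insert(1, "a1");
    a.insert(3, "a3");

    let mut b = BTreeMap::new();
    b.insert(3, "b3");
    b.insert(4, "b4");

    // Key 3 exists in both maps; the value from `b` wins.
    a.append(&mut b);

    (a.into_iter().collect(), b.len())
}
```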