sort-merge join #4539
Timings in the first post were produced on 103b9c6.
Adding support for multiple columns revealed a bug in binary search ( Moreover, I've made a draft version parallel
This reverts commit 2acaf23.
Addresses #4538.
This is now blocked by #4566 and by providing a C interface to `bmerge` for non-equi join.

todo:
- `x,y` to `i,x`?
- `nomatch` argument
- `mult` argument

some added features over `bmerge`
- `bmerge` knows only one side: `allLen1`, and needs an expensive `anyDuplicated(starts, incomparables = c(0L, NA_integer_))` to know the other side
- `lens` when `allLen1=T` (for compatibility, only for `out.bmerge=F`)

features that it does not have compared to `bmerge`, as of now

Below are timings of a big-to-big join: 1e9 rows, integers in the range 1 to 1e9 on both LHS and RHS. The machine has 40 threads.
no duplicates

Note that the time of `bmerge` is also greatly reduced for sorted input.

Looking more in-depth at `smerge` vs `bmerge`, rather than at `[` which uses them, the last, most optimistic case "all threads sorted index" took:

single duplicate entry on both sides

"all threads sorted index", `smerge` vs `bmerge`
Full script on https://gist.github.com/jangorecki/cf2ad1a01e7f1493a4bd3ef4444e1cbc
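
For readers unfamiliar with the technique, here is a minimal single-column sketch of a sort-merge join in C. It is not the code from this PR. The names `smerge_sketch`, `merge_flags`, `x_len1` and `y_len1` are hypothetical, and the conventions (1-based `starts`, `0` for no match, `lens` as the match count) are assumptions chosen to resemble the `starts`/`lens`/`allLen1` outputs discussed in the description. It shows how one pass over two sorted inputs can report multiple matches on both sides, which is the piece of information the description says requires an extra `anyDuplicated` call after `bmerge`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result flags; these are not data.table's actual structures. */
typedef struct {
    bool x_len1; /* no row of x matched more than one row of y */
    bool y_len1; /* no row of y was matched by more than one row of x */
} merge_flags;

/*
 * Sort-merge join on a single integer key.
 * Both x and y must already be sorted in ascending order.
 * For each x[i]: starts[i] is the 1-based position of the first match in y
 * (0 when there is no match), and lens[i] is the number of matching rows.
 */
merge_flags smerge_sketch(const int *x, size_t nx,
                          const int *y, size_t ny,
                          int *starts, int *lens)
{
    assert(starts != NULL && lens != NULL);
    merge_flags f = { true, true };
    size_t i = 0, j = 0;

    while (i < nx) {
        /* Advance the y cursor to the first value >= x[i]. */
        while (j < ny && y[j] < x[i]) j++;

        /* Measure the run of values in y equal to x[i]. */
        size_t k = j;
        while (k < ny && y[k] == x[i]) k++;
        size_t len = k - j;

        /* Every row in the run of equal x values shares the same match. */
        size_t i2 = i;
        while (i2 < nx && x[i2] == x[i]) {
            starts[i2] = len ? (int)(j + 1) : 0;
            lens[i2] = (int)len;
            i2++;
        }

        if (len > 1) f.x_len1 = false;               /* duplicates in y */
        if (len > 0 && i2 - i > 1) f.y_len1 = false; /* duplicates in x */

        i = i2; /* both cursors only ever move forward */
        j = k;
    }
    return f;
}
```

Because neither cursor moves backwards, the merge itself is linear in `nx + ny` once both inputs are sorted. A binary-search join instead does one lookup per row of `x`.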