sort-merge join #4539

jangorecki · 2020-06-10T22:38:44Z

Addresses #4538.

This is now blocked by #4566 and by the need to provide a C interface to bmerge for non-equi joins.

todo:

collect feedback
rename x,y to i,x?
support nomatch argument
support mult argument

some added features over bmerge

knows exactly if many-to-many match happened
knows if multiple matches happened on both sides (bmerge knows only one side, via allLen1, and needs an expensive anyDuplicated(starts, incomparables = c(0L, NA_integer_)) to know the other side)
knows the count of output
compact 0 length lens when allLen1=T (for compatibility only for out.bmerge=F)
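
To illustrate why a sort-merge pass gets these facts for free, here is a minimal sketch in C. It is not the code from this PR: the struct `smerge_info`, the function `smerge_sorted`, and the convention that `starts` is 1-based with 0 meaning "no match" are assumptions made for illustration. It assumes a single integer column that is already sorted on both sides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one merge pass; names are illustrative only. */
typedef struct {
  uint64_t n_out;     /* number of matched (x row, y row) pairs */
  bool x_multi;       /* some matched key occurs more than once in x */
  bool y_multi;       /* some matched key occurs more than once in y */
  bool many_to_many;  /* some key is duplicated on both sides */
} smerge_info;

/* Merge two ascending integer columns. For each x row, starts[i] is the
 * 1-based index of the first matching y row (0 if none) and lens[i] is the
 * number of matching y rows. Indices are stored as int, so this sketch
 * only handles inputs shorter than INT_MAX rows. */
smerge_info smerge_sorted(const int *x, size_t nx, const int *y, size_t ny,
                          int *starts, int *lens) {
  assert(nx == 0 || (x && starts && lens));
  smerge_info info = {0, false, false, false};
  size_t i = 0, j = 0;
  while (i < nx && j < ny) {
    if (x[i] < y[j]) {          /* x key absent from y */
      starts[i] = 0; lens[i] = 0; i++;
    } else if (x[i] > y[j]) {   /* y key absent from x */
      j++;
    } else {                    /* equal keys: measure the run on each side */
      int key = x[i];
      size_t i1 = i, j1 = j;
      while (i1 < nx && x[i1] == key) i1++;
      while (j1 < ny && y[j1] == key) j1++;
      size_t cx = i1 - i, cy = j1 - j;
      for (size_t k = i; k < i1; k++) {
        starts[k] = (int)(j + 1);
        lens[k] = (int)cy;
      }
      info.n_out += (uint64_t)cx * cy;
      if (cx > 1) info.x_multi = true;
      if (cy > 1) info.y_multi = true;
      if (cx > 1 && cy > 1) info.many_to_many = true;
      i = i1; j = j1;
    }
  }
  for (; i < nx; i++) { starts[i] = 0; lens[i] = 0; }  /* trailing non-matches */
  return info;
}
```

Because both run lengths are known at the moment a key matches, the output row count and the duplicate flags for both sides come out of the same pass, with no separate anyDuplicated scan afterwards.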

features that it does not have compared to bmerge, as of now

non-equi join
rolling join
join on other types than integer
join on multiple columns

Below are timings of a big-to-big join: 1e9 rows, with integer keys in the range 1 to 1e9 on both LHS and RHS. The machine has 40 threads.

system.time(d1[d2, on="x==y"])

no duplicates

## single thread no index
           user  system elapsed 
bmerge  394.931  32.802 427.750 
smerge  336.447  73.674 410.139 

## all threads no index
            user   system  elapsed 
bmerge   819.928  143.668  368.546 
smerge  1136.770  166.111  100.471 

## all threads index
           user  system elapsed 
bmerge  658.381 103.473 377.290 
smerge  579.559 109.165  78.886 

## all threads sorted index
          user  system elapsed 
bmerge  68.579  47.075  69.485 
smerge  37.985  42.468  20.842

Note that time of bmerge is also greatly reduced for sorted input.

Looking more in depth at smerge vs bmerge themselves, rather than at [ which calls them, the last, most optimistic case "all threads sorted index" took:

bmerge done in 45.6s elapsed (42.7s cpu) 
smerge done in 0.836s elapsed (14.8s cpu)

single duplicate entry on both sides

## single thread no index
           user  system elapsed 
bmerge  429.513  34.142 463.676 
smerge  367.919  80.007 447.952 

## all threads no index
            user   system  elapsed 
bmerge   819.215  149.786  368.750 
smerge  1191.925  212.859  137.033 

## all threads index
           user  system elapsed 
bmerge  654.823  98.173 379.881 
smerge  623.015 160.435 115.979 

## all threads sorted index
          user  system elapsed 
bmerge  87.594  47.124  94.729 
smerge  71.851  64.715  42.599

"all threads sorted index" smerge vs bmerge

bmerge done in 46.6s elapsed (42.9s cpu) 
smerge done in 1.760s elapsed (34.0s cpu)

Full script on https://gist.github.com/jangorecki/cf2ad1a01e7f1493a4bd3ef4444e1cbc

jangorecki · 2020-06-20T21:51:03Z

Timings in the first post were produced on 103b9c6

Adding support for multiple columns revealed a bug in the binary search (rollbs function) used when splitting input into chunks.
This function needs to perform a non-equi join (or rolling join) on multiple columns, but it is now probably doing so only on the last column.
Instead of fixing that routine here, I think it makes sense to re-use the existing bmerge. For that we need to provide a C interface to bmerge.
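
For readers unfamiliar with the chunking step, the sketch below shows the general idea on a single sorted integer column. It is not the rollbs routine from this PR: `batch_t`, `make_batches`, `lower_bound`, and `upper_bound` are hypothetical names. One side is cut into contiguous chunks, and a binary search on the other side finds the row range each chunk can possibly match, so chunks can be merged independently (for example one per thread).

```c
#include <assert.h>
#include <stddef.h>

/* First index in v[0..n) whose value is >= key (n if there is none). */
static size_t lower_bound(const int *v, size_t n, int key) {
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (v[mid] < key) lo = mid + 1; else hi = mid;
  }
  return lo;
}

/* First index in v[0..n) whose value is > key (n if there is none). */
static size_t upper_bound(const int *v, size_t n, int key) {
  size_t lo = 0, hi = n;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (v[mid] <= key) lo = mid + 1; else hi = mid;
  }
  return lo;
}

/* Hypothetical batch descriptor: half-open row ranges [from, to). */
typedef struct { size_t x_from, x_to, y_from, y_to; } batch_t;

/* Split sorted x into up to n_batch contiguous chunks and, for each chunk,
 * locate the range of sorted y that can contain its keys. Returns the
 * number of non-empty batches written to out (at most n_batch). */
size_t make_batches(const int *x, size_t nx, const int *y, size_t ny,
                    size_t n_batch, batch_t *out) {
  assert(n_batch > 0 && out);
  size_t n = 0;
  for (size_t b = 0; b < n_batch; b++) {
    size_t from = b * nx / n_batch, to = (b + 1) * nx / n_batch;
    if (from == to) continue;  /* more batches than rows: skip empty chunk */
    out[n].x_from = from;
    out[n].x_to = to;
    out[n].y_from = lower_bound(y, ny, x[from]);    /* smallest key in chunk */
    out[n].y_to = upper_bound(y, ny, x[to - 1]);    /* largest key in chunk */
    n++;
  }
  return n;
}
```

When a run of duplicate keys in x straddles a chunk boundary, the y ranges of neighbouring batches overlap. That is harmless here because y is only read and every x row belongs to exactly one batch. With several join columns the comparisons inside the two bound functions would have to consider all columns, not just one, which is the kind of issue described above.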

Moreover, I've made a draft version of a parallel bmerge (info in #4566), where a dedicated struct dt_t is proposed to carry data pointers for all columns. Re-using the dt_t struct from #4566 in smerge will make it easier to support column types other than integer.

jangorecki added 23 commits June 4, 2020 22:12

dev

38f7cee

devdev

febf07c

unsorted works as well

3480d32

testing and fixing

99589c5

devdevdev

94a0faa

cleanup

6df1415

prepare for batching

c13113a

finally parallel

ffb521a

parallel

518b36c

cleanup and speedup

fd03e04

bucketing now uses binary search

71c348b

rename vars

86a4bcc

getting closer

36b6762

switch magic option

b1f3463

more robust, batching into new function and struct

cba274a

static funs and comments

9b0d195

better verbose

4c5d1ea

simpler batching, more strict type defs

19d1df6

last batch not balanced anymore

7efaa6e

batching balanced again

6423c0b

remove extra checks in bmerge to smerge opt

045e9e8

fix test function

739a35a

improve verbose msg

47cfd8a

jangorecki mentioned this pull request Jun 10, 2020

sort-merge join #4538

Closed

jangorecki changed the title ~~sort-join merge~~ sort-merge join Jun 10, 2020

jangorecki linked an issue Jun 10, 2020 that may be closed by this pull request

sort-merge join #4538

Closed

jangorecki added 2 commits June 11, 2020 00:38

cleanup dev mostly

2806017

comment about thread utilization

4e5f56d

jangorecki requested review from mattdowle and arunsrinivasan June 11, 2020 20:23

jangorecki added 7 commits June 12, 2020 11:47

mult support

405f55d

avoid lens allocation and unsort for mult=first|last

9ee0a7a

move R allocs

5816549

avoid one more unsort

d7c2b4b

mult already supported

49ea9da

mult already supported2

9f1ea21

better algo description

103b9c6

jangorecki added this to the 1.12.11 milestone Jun 13, 2020

multiple columns support

2acaf23

jangorecki added the WIP label Jun 19, 2020

Revert "multiple columns support"

e1643c2

This reverts commit 2acaf23.

mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020

jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022

jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023

jangorecki closed this Dec 10, 2023

jangorecki removed this from the 1.16.0 milestone Dec 22, 2023

MichaelChirico removed the WIP label Feb 19, 2024
