sort-merge join #4539

Closed
wants to merge 34 commits into from
@jangorecki jangorecki commented Jun 10, 2020

Addresses #4538.

This is now blocked by #4566 and by the need to provide a C interface to bmerge for non-equi joins.


todo:

  • collect feedback
  • rename x,y to i,x?
  • support nomatch argument
  • support mult argument

Some added features over bmerge:

  • knows exactly if many-to-many match happened
  • knows if multiple matches happened on both sides (bmerge knows only one side, via allLen1, and needs an expensive anyDuplicated(starts, incomparables = c(0L, NA_integer_)) to know the other side)
  • knows the row count of the output
  • compact 0 length lens when allLen1=T (for compatibility only for out.bmerge=F)
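
To illustrate how a merge pass over sorted keys can report these facts as a by-product, here is a minimal sketch in C. It is not the code from this PR: the function name smerge_sketch, the merge_info struct and the convention of writing 0 into starts and lens for unmatched rows are all assumptions made for the example. It handles a single integer key and assumes both inputs are already sorted ascending.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of a join; field names are illustrative only. */
typedef struct {
  size_t n_out;      /* number of matching (inner join) output rows */
  bool x_multi;      /* some matched key occurs more than once in x */
  bool y_multi;      /* some matched key occurs more than once in y */
  bool many_to_many; /* some matched key is duplicated on both sides */
} merge_info;

/* x and y must be sorted ascending. For every row k of x, starts[k] gets the
 * 1-based position of the first matching row in y (0 when there is none) and
 * lens[k] gets the number of matching rows in y. */
merge_info smerge_sketch(const int *x, size_t nx, const int *y, size_t ny,
                         int *starts, int *lens) {
  merge_info info = {0, false, false, false};
  size_t i = 0, j = 0;
  while (i < nx) {
    /* skip y rows that are smaller than the current x key */
    while (j < ny && y[j] < x[i]) j++;
    /* find the run of equal keys in x */
    size_t i2 = i + 1;
    while (i2 < nx && x[i2] == x[i]) i2++;
    /* find the run of y rows equal to the current x key */
    size_t j2 = j;
    while (j2 < ny && y[j2] == x[i]) j2++;
    size_t xrun = i2 - i, yrun = j2 - j;
    for (size_t k = i; k < i2; k++) {
      starts[k] = yrun ? (int)j + 1 : 0;
      lens[k] = (int)yrun;
    }
    if (yrun) {
      info.n_out += xrun * yrun;
      if (xrun > 1) info.x_multi = true;
      if (yrun > 1) info.y_multi = true;
      if (xrun > 1 && yrun > 1) info.many_to_many = true;
    }
    i = i2;
    j = j2;
  }
  return info;
}
```

Because both key runs are visible at the same time, the duplicate flags and the output row count cost nothing extra, whereas a per-row binary search only sees the run on the side being searched.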

Features that it does not yet have compared to bmerge:

  • non-equi join
  • rolling join
  • join on other types than integer
  • join on multiple columns

Below are timings of a big-to-big join: 1e9 rows, with integer keys in the range 1 to 1e9 on both LHS and RHS. The machine has 40 threads.

system.time(d1[d2, on="x==y"])

no duplicates

## single thread no index
           user  system elapsed 
bmerge  394.931  32.802 427.750 
smerge  336.447  73.674 410.139 

## all threads no index
            user   system  elapsed 
bmerge   819.928  143.668  368.546 
smerge  1136.770  166.111  100.471 

## all threads index
           user  system elapsed 
bmerge  658.381 103.473 377.290 
smerge  579.559 109.165  78.886 

## all threads sorted index
          user  system elapsed 
bmerge  68.579  47.075  69.485 
smerge  37.985  42.468  20.842 

Note that the bmerge time is also greatly reduced for sorted input.

Looking in more depth at smerge vs bmerge themselves, rather than at [ which calls them, the last and most optimistic case, "all threads sorted index", took:

bmerge done in 45.6s elapsed (42.7s cpu) 
smerge done in 0.836s elapsed (14.8s cpu) 

single duplicate entry on both sides

## single thread no index
           user  system elapsed 
bmerge  429.513  34.142 463.676 
smerge  367.919  80.007 447.952 

## all threads no index
            user   system  elapsed 
bmerge   819.215  149.786  368.750 
smerge  1191.925  212.859  137.033 

## all threads index
           user  system elapsed 
bmerge  654.823  98.173 379.881 
smerge  623.015 160.435 115.979 

## all threads sorted index
          user  system elapsed 
bmerge  87.594  47.124  94.729 
smerge  71.851  64.715  42.599 

"all threads sorted index" smerge vs bmerge

bmerge done in 46.6s elapsed (42.9s cpu) 
smerge done in 1.760s elapsed (34.0s cpu) 

Full script on https://gist.github.com/jangorecki/cf2ad1a01e7f1493a4bd3ef4444e1cbc

@jangorecki jangorecki mentioned this pull request Jun 10, 2020
@jangorecki jangorecki changed the title from "sort-join merge" to "sort-merge join" Jun 10, 2020
@jangorecki jangorecki linked an issue Jun 10, 2020 that may be closed by this pull request
@jangorecki jangorecki added this to the 1.12.11 milestone Jun 13, 2020
@jangorecki jangorecki added the WIP label Jun 19, 2020
jangorecki commented Jun 20, 2020

Timings in the first post were produced on 103b9c6

Adding support for multiple columns revealed a bug in the binary search (rollbs function) used when splitting input into chunks.
This function needs to perform a non-equi join (or rolling join) on multiple columns, but it currently appears to use only the last column.
Instead of fixing that routine here, I think it makes sense to re-use the existing bmerge. For that we need to provide a C interface to bmerge.
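
As an illustration of what a chunk-boundary search over several columns has to do, the sketch below does a binary search that compares whole key tuples column by column. It is not the rollbs function from this PR and not bmerge: the names cmp_row_key and lower_bound_rows and the column-major layout (an array of pointers to int columns) are assumptions for the example.

```c
#include <stddef.h>

/* Compare row r of a column-major table with a key tuple, column by column.
 * Returns -1, 0 or 1. Earlier columns take precedence over later ones. */
static int cmp_row_key(const int *const *cols, size_t ncol, size_t r,
                       const int *key) {
  for (size_t c = 0; c < ncol; c++) {
    if (cols[c][r] < key[c]) return -1;
    if (cols[c][r] > key[c]) return 1;
  }
  return 0;
}

/* Rows must be sorted by (cols[0], cols[1], ...). Returns the index of the
 * first row whose tuple is >= key, or nrow if every row is smaller. */
size_t lower_bound_rows(const int *const *cols, size_t ncol, size_t nrow,
                        const int *key) {
  size_t lo = 0, hi = nrow;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (cmp_row_key(cols, ncol, mid, key) < 0)
      lo = mid + 1;
    else
      hi = mid;
  }
  return lo;
}
```

Searching on the last column alone would not work in general, because that column is only sorted within groups of equal values in the preceding columns.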

Moreover, I've made a draft version of a parallel bmerge (info in #4566), where a dedicated struct dt_t is proposed to carry data pointers for all columns. Re-using the dt_t struct from #4566 in smerge will make it easier to support column types other than integer.
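
To show why a struct of column pointers helps with other column types, here is a hypothetical sketch. It is not the dt_t definition from #4566: the names col_view, table_view and cmp_cell, the fields, and the two supported types are all assumptions. The point is only that a merge routine written against such a view can dispatch on the column type instead of hard-coding int.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical column types; a real implementation would cover more. */
typedef enum { COL_INT32, COL_DOUBLE } col_type;

/* One column: a type tag plus an untyped pointer to its data. */
typedef struct {
  col_type type;
  const void *data;
} col_view;

/* A table as a set of column views; layout is illustrative only. */
typedef struct {
  size_t nrow;
  size_t ncol;
  const col_view *cols;
} table_view;

/* Three-way compare of cell (ra, c) in a with cell (rb, c) in b.
 * Assumes column c has the same type in both tables; NA/NaN handling
 * is deliberately left out of this sketch. */
int cmp_cell(const table_view *a, size_t ra, const table_view *b, size_t rb,
             size_t c) {
  const col_view *ca = &a->cols[c], *cb = &b->cols[c];
  switch (ca->type) {
  case COL_INT32: {
    int32_t va = ((const int32_t *)ca->data)[ra];
    int32_t vb = ((const int32_t *)cb->data)[rb];
    return (va > vb) - (va < vb);
  }
  case COL_DOUBLE: {
    double va = ((const double *)ca->data)[ra];
    double vb = ((const double *)cb->data)[rb];
    return (va > vb) - (va < vb);
  }
  }
  return 0;
}
```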

@mattdowle mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020
@jangorecki jangorecki modified the milestones: 1.14.3, 1.14.5 Jul 19, 2022
@jangorecki jangorecki modified the milestones: 1.14.11, 1.15.1 Oct 29, 2023
@jangorecki jangorecki closed this Dec 10, 2023
@jangorecki jangorecki removed this from the 1.16.0 milestone Dec 22, 2023