[R-Forge #2461] Faster version of Reduce(merge, list(DT1,DT2,DT3,...)) called mergelist (a la rbindlist) #599

arunsrinivasan · 2014-06-08T13:16:25Z

Submitted by: Patrick Nicholson; Assigned to: Nobody; R-Forge link

Many large datasets are split into multiple tables, especially when they are release as flat files. Many datasets that track a lot of variables over time are released as separate files for separate periods. It is useful to write a quick wrapper to read these files into a list:

tabs <- lapply(dir(), function(file) as.data.table(read.csv(file)))

If we were interested in appending the tables in this list, data.table provides a very fast and useful function, rbindlist. Similar functionality exists in SAS DATA steps when you can list multiple datasets in the SET statement to append them. However, SAS also allows you to list many tables in a merge statement.* Without a BY variable, this amounts to do.call("cbind", ...) in R. But with a BY variable....

I am proposing a function that would merge data.tables contained within a list. This would So that the following is possible:

tabs <- lapply(dir(), function(file) data.table(read.csv(file), key="primary_key"))
data <- do.call("[", tabs)
or
data <- mergelist(tabs)

This would be a killer feature. It does not exist elsewhere in R, as far as I can tell. It would allow data.table code to be more concise and require less updating. (Think about going from creating t2011, t2012, t2013... and merging them with t2011[t2012[t2013.... Now think about higher frequency data!) It would also take a bullet out of SAS's gun.

This is the comparable SAS approach if it helps:

%macro readfiles(number_of_files);
%do i=1 %to &number_of_files;
proc import out=imported&i. datafile="C:/file&number_of_files..csv" dbms=csv;
run;
%end;
%mend;

data merged;
set imported:;
by primary_key;
run;

proc datasets library=work nolist;
delete imported:;
run;

mattdowle · 2014-08-12T11:50:14Z

Similar to this suggestion from Aleksandr Blekh as comment on S.O. :
http://stackoverflow.com/questions/4322219/whats-the-fastest-way-to-merge-join-data-frames-in-r/4483202?noredirect=1#comment39346690_4483202
which links to example here :
https://github.com/abnova/diss-floss/blob/master/analysis/sem-models/floss-success-v2-flat/mergeTest.R

Also similar to #694 .

abnova · 2014-08-12T18:55:41Z

Hello, Matt! Thank you for inviting me here. First of all, I apologize for using an abbreviation without first defining it - it's not my style, I was just trying to save some space in the comment area. Having said that, I can tell that my abbreviation SEM refers to structural equation modeling, a very popular set of quantitative research methods, based on multivariate statistical analysis (http://en.wikipedia.org/wiki/Structural_equation_modeling).

As matrices represent a foundation of SEM models, functions, implementing SEM analysis methods, operate on matrices or corresponding data frames. These data structures often have to be constructed from several underlying matrices or data frames, which represent certain variables or indicators in a SEM model. For example, consider my very simple test module in R, which attempts to perform a PLS-PM analysis (a variant of SEM analysis) on a test dataset: https://github.com/abnova/diss-floss/blob/master/analysis/sem-models/floss-success-v2-flat/floss-success-plspm.R. You see that I have to merge several data frames, each representing a particular indicator in SEM model, into a summary data frame, which I then pass to the SEM analysis function plspm() on line 211.

I hope that my explanation is clear enough to get an idea about a scenario, that requires merging multiple data frames or data tables (the corresponding number of indicators in SEM models may be significant for large and complex models, making manual approach to merge not feasible). Your questions or comments are welcome!

mattdowle · 2014-08-14T12:42:17Z

@abnova That's very useful info, thanks.
Does the joinbyv function in the pull request (#694) from @jangorecki work for this?
And is that the same as the mergelist() from Patrick above as well?
I mentioned in a comment on #694 that it'd be nice to tie this up with secondary keys as well to avoiding needing to setkey() each data.table input.

abnova · 2014-08-14T13:07:15Z

@mattdowle You're welcome! I will take a look at functions and issues you're referring to - can't say anything now, as I'm not familiar with those. It might take some time, especially considering my lack of knowledge of data.table codebase and even its user-level functionality. But, I will be in touch eventually.

geneorama · 2014-12-08T22:18:06Z

This would be awesome. I often am reconciling data across several data.frames.

jangorecki · 2014-12-08T23:47:11Z

Just to clarify, #694 code is slightly outdated. The latest version of joinbyv is available here: https://github.com/jangorecki/dwtools/blob/master/R/joinbyv.R
It should not have any other depedency than data.table so you should be able just to run the joinbyv.R code and use function directly without the need of installing dwtools.
In case if you want to read data from csv files or database it could be even combined with db function from dwtools.

MichaelChirico · 2019-05-26T12:56:18Z

I guess #2576 should be closed at the same time.

Like rbind(...) builds an rbindlist call, mergelist build a cbindlist call?

jangorecki · 2019-05-26T13:14:21Z

no, mergelist would iterate joins, cbind just stack datasets

jangorecki · 2020-03-05T03:07:01Z

another use case https://stackoverflow.com/questions/60529112/data-table-join-multiple-tables-in-a-single-join
and a very popular question here: https://stackoverflow.com/q/13273833/2490497

jangorecki · 2020-04-10T16:05:16Z

Function linked in previous comment joinbyv has a much bigger scope. Functionality like that was not requested in this issue, thus I think we can safely strip that out and provide more transparent simpler implementation. I am shaping interface for mergelist in #4370, as of now it is only on specifying columns to join on, and a how that specify type of join (inner/left/right/full).

jangorecki · 2020-05-03T20:03:37Z

I went through linked SO questions and it seems that more commonly left outer join is needed, rather than inner join (this is merge default). Also my use cases that I have in mind (and old #694) would prefer left join as well.

Are there any strong objections against using how="left" as a default in mergelist?
What should be the default for mult argument?

My suggestion is to use new mult="error".
Having that as a default gives us possibility to switch allow.cartesian (many-to-many join) to TRUE by default. Using current [.data.table default mult="all" we will have to keep cartesian check enabled by default to protect users from OOM. This check adds some overhead: anyDuplicated call, in worst case scenario (also run in fresh session without warmup) for:

1e8 adds 5 seconds
1e9 adds 120 seconds

On the other hand mult="error" raises as early as possible, directly from bmerge not needing to complete merging. Prompting user to fix data, or chose proper mult option.
Moreover mult!="all" also makes perfect sense for left join. When we do left join in update-on-join d1[d2, col := i.col] it already works like mult="last".

jangorecki · 2020-05-04T17:56:24Z

Comparison of #4370 to Reduce-merge approach, left join 100 small tables

library(data.table)
test.data.table() ## warmup

N = 100L
l = lapply(1:N, function(i) as.data.table(setNames(list(sample(N, N-i+1L), i), c("id1",paste0("v",i)))))
system.time(a1<-mergelist(l, on="id1", how="left", mult="all", join.many=FALSE)) ## same as defaults in [.data.table
#   user  system elapsed 
#  1.065   0.000   0.058 
system.time(a2<-mergelist(l, on="id1", how="left"))
#   user  system elapsed 
#  1.009   0.000   0.056 
system.time(a3<-mergelist(l, on="id1", how="left", copy=FALSE))
#   user  system elapsed 
#  1.061   0.000   0.057 
system.time(b1<-Reduce(function(...) merge(..., all.x=TRUE, allow.cartesian=FALSE), l))
#   user  system elapsed 
#  6.021   0.007   0.303 
system.time(b2<-Reduce(function(...) merge(..., all.x=TRUE, allow.cartesian=TRUE), l)) ## default in mergelist, but it does mult='error' which is cheap and prevent cartesian already
#   user  system elapsed 
#  6.027   0.000   0.304 
all.equal(a1, a2) && all.equal(a1, a3) && all.equal(a1, b1, check.attributes=FALSE, ignore.row.order=TRUE) && all.equal(a1, b2, check.attributes=FALSE, ignore.row.order=TRUE)
#[1] TRUE

jangorecki · 2020-05-11T23:20:09Z

Current implementation in #4370 will use default how="left".
mult argument default value depends on the how:

left, inner, full, right: mult="error"
forcing user to explicitly mult="all" if duplicated RHS on rows matches to LHS. Even when mult="all", then default for join.many (new allow.cartesian) is still TRUE, so rows explosion may happen, unlike join in [.data.table which uses different default. Manual suggests to use join.many=FALSE whenever using mult="all".
semi, anti: mult="last"
it doesn't really matter there, because multiple rows in RHS are ignored.
cross: mult="all"
we can safely assume that users running how="cross" are aware of rows explosion.

jangorecki self-assigned this Dec 31, 2018

jangorecki added this to the 1.12.4 milestone Jan 22, 2019

jangorecki modified the milestones: 1.12.4, 1.13.0 Jul 25, 2019

mattdowle modified the milestones: 1.12.7, 1.12.9 Dec 8, 2019

jangorecki linked a pull request Apr 10, 2020 that will close this issue

cbindlist, mergelist #4370

Open

9 tasks

jangorecki added the joins Use label:"non-equi joins" for rolling, overlapping, and non-equi joins label Apr 10, 2020

jangorecki mentioned this issue Apr 10, 2020

join vignette #2181

Closed

jangorecki linked a pull request Apr 11, 2020 that will close this issue

cbindlist, mergelist #4370

Open

9 tasks

zeomal mentioned this issue Apr 25, 2020

Create datatable-joins.Rmd for Joins vignette #4398

Closed

mattdowle modified the milestones: 1.13.1, 1.13.3 Oct 17, 2020

mattdowle removed this from the 1.14.1 milestone Aug 27, 2021

MichaelChirico mentioned this issue Mar 16, 2024

Master list of most-requested issues #3189

Open

75 tasks

MichaelChirico added the top request One of our most-requested issues label Apr 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[R-Forge #2461] Faster version of Reduce(merge, list(DT1,DT2,DT3,...)) called mergelist (a la rbindlist) #599

[R-Forge #2461] Faster version of Reduce(merge, list(DT1,DT2,DT3,...)) called mergelist (a la rbindlist) #599

arunsrinivasan commented Jun 8, 2014

mattdowle commented Aug 12, 2014

abnova commented Aug 12, 2014

mattdowle commented Aug 14, 2014

abnova commented Aug 14, 2014

geneorama commented Dec 8, 2014

jangorecki commented Dec 8, 2014

MichaelChirico commented May 26, 2019

jangorecki commented May 26, 2019

jangorecki commented Mar 5, 2020 •

edited

Loading

jangorecki commented Apr 10, 2020

jangorecki commented May 3, 2020 •

edited

Loading

jangorecki commented May 4, 2020 •

edited

Loading

jangorecki commented May 11, 2020 •

edited

Loading

[R-Forge #2461] Faster version of Reduce(merge, list(DT1,DT2,DT3,...)) called mergelist (a la rbindlist) #599

[R-Forge #2461] Faster version of Reduce(merge, list(DT1,DT2,DT3,...)) called mergelist (a la rbindlist) #599

Comments

arunsrinivasan commented Jun 8, 2014

mattdowle commented Aug 12, 2014

abnova commented Aug 12, 2014

mattdowle commented Aug 14, 2014

abnova commented Aug 14, 2014

geneorama commented Dec 8, 2014

jangorecki commented Dec 8, 2014

MichaelChirico commented May 26, 2019

jangorecki commented May 26, 2019

jangorecki commented Mar 5, 2020 • edited Loading

jangorecki commented Apr 10, 2020

jangorecki commented May 3, 2020 • edited Loading

jangorecki commented May 4, 2020 • edited Loading

jangorecki commented May 11, 2020 • edited Loading

jangorecki commented Mar 5, 2020 •

edited

Loading

jangorecki commented May 3, 2020 •

edited

Loading

jangorecki commented May 4, 2020 •

edited

Loading

jangorecki commented May 11, 2020 •

edited

Loading