[R-Forge #2461] Faster version of Reduce(merge, list(DT1,DT2,DT3,...)) called mergelist (a la rbindlist) #599
Similar to this suggestion from Aleksandr Blekh, posted as a comment on Stack Overflow. Also similar to #694.
Hello, Matt! Thank you for inviting me here. First of all, I apologize for using an abbreviation without first defining it - it's not my style, I was just trying to save some space in the comment area. The abbreviation refers to structural equation modeling (SEM). As matrices represent the foundation of SEM models, functions implementing SEM analysis methods operate on matrices or corresponding data frames. These data structures often have to be constructed from several underlying matrices or data frames, which represent certain variables or indicators in a SEM model. For example, consider my very simple test module. I hope my explanation is clear enough to convey a scenario that requires merging multiple data frames or data tables (the number of indicators in large and complex SEM models may be significant, making a manual approach to merging infeasible). Your questions or comments are welcome!
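To make the scenario concrete, here is a minimal sketch of merging several indicator tables on a shared key with the current `Reduce(merge, ...)` idiom that this issue proposes to speed up. The table and column names are illustrative, not taken from the comment above.

```r
library(data.table)

# Hypothetical indicator tables sharing an "id" key.
dt1 <- data.table(id = 1:4, x1 = c(10, 20, 30, 40))
dt2 <- data.table(id = 1:4, x2 = c("a", "b", "c", "d"))
dt3 <- data.table(id = 1:4, x3 = c(TRUE, FALSE, TRUE, FALSE))

# The idiom this issue proposes to replace: fold merge() over the list.
merged <- Reduce(function(x, y) merge(x, y, by = "id"), list(dt1, dt2, dt3))
dim(merged)
# [1] 4 4
```

With many indicators, writing each pairwise merge by hand quickly becomes impractical, which is the motivation for a single list-based function.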
@abnova That's very useful info, thanks.
@mattdowle You're welcome! I will take a look at the functions and issues you're referring to - I can't say anything now, as I'm not familiar with them. It might take some time, especially considering my lack of knowledge of …
This would be awesome. I am often reconciling data across several data.frames.
Just to clarify, the #694 code is slightly outdated. The latest version of joinbyv is available here: https://github.com/jangorecki/dwtools/blob/master/R/joinbyv.R
I guess #2576 should be closed at the same time.
No, …
Another use case: https://stackoverflow.com/questions/60529112/data-table-join-multiple-tables-in-a-single-join
The function linked in the previous comment …
I went through the linked SO questions, and it seems that a left outer join is more commonly needed than an inner join (this is …).
My suggestion is to use the new …
On the other hand, …
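The inner-vs-left-outer distinction above can be sketched with data.table's `merge()`; the toy tables here are my own, not from the linked questions.

```r
library(data.table)

# Toy tables: 'b' has no row for id 1.
a <- data.table(id = 1:3, v1 = c(10, 20, 30))
b <- data.table(id = 2:3, v2 = c("x", "y"))

nrow(merge(a, b, by = "id"))               # inner join: 2 rows (id 1 dropped)
nrow(merge(a, b, by = "id", all.x = TRUE)) # left outer join: 3 rows, v2 is NA for id 1
```

A left outer join keeps every row of the first table, which matches the typical "enrich a master table with lookups" pattern in the SO questions.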
Comparison of #4370 to `Reduce(merge, ...)`:

```r
library(data.table)
test.data.table() ## warmup
N = 100L
l = lapply(1:N, function(i) as.data.table(setNames(list(sample(N, N-i+1L), i), c("id1", paste0("v",i)))))
system.time(a1 <- mergelist(l, on="id1", how="left", mult="all", join.many=FALSE)) ## same as defaults in [.data.table
#   user  system elapsed
#  1.065   0.000   0.058
system.time(a2 <- mergelist(l, on="id1", how="left"))
#   user  system elapsed
#  1.009   0.000   0.056
system.time(a3 <- mergelist(l, on="id1", how="left", copy=FALSE))
#   user  system elapsed
#  1.061   0.000   0.057
system.time(b1 <- Reduce(function(...) merge(..., all.x=TRUE, allow.cartesian=FALSE), l))
#   user  system elapsed
#  6.021   0.007   0.303
system.time(b2 <- Reduce(function(...) merge(..., all.x=TRUE, allow.cartesian=TRUE), l)) ## default in mergelist, but mult='error' is cheap and already prevents a cartesian join
#   user  system elapsed
#  6.027   0.000   0.304
all.equal(a1, a2) && all.equal(a1, a3) &&
  all.equal(a1, b1, check.attributes=FALSE, ignore.row.order=TRUE) &&
  all.equal(a1, b2, check.attributes=FALSE, ignore.row.order=TRUE)
# [1] TRUE
```
The current implementation in #4370 will use the default …
Submitted by: Patrick Nicholson; Assigned to: Nobody; R-Forge link
Many large datasets are split into multiple tables, especially when they are released as flat files. Many datasets that track a lot of variables over time are released as separate files for separate periods. It is useful to write a quick wrapper to read these files into a list:
```r
tabs <- lapply(dir(), function(file) as.data.table(read.csv(file)))
```
If we were interested in appending the tables in this list, data.table provides a very fast and useful function, rbindlist. Similar functionality exists in SAS DATA steps, where you can list multiple datasets in the SET statement to append them. However, SAS also allows you to list many tables in a MERGE statement.* Without a BY variable, this amounts to `do.call("cbind", ...)` in R. But with a BY variable....
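The contrast between positional binding and key-matched merging can be sketched as follows; the toy tables are my own, not from the request.

```r
library(data.table)

d1 <- data.table(k = 1:3, a = c("a1", "a2", "a3"))
d2 <- data.table(k = 3:1, b = c("b3", "b2", "b1"))

# Without a BY variable: columns are bound positionally, rows are NOT matched.
do.call(cbind, list(d1, d2))

# With a BY variable: rows are matched on the key, like SAS MERGE ... BY.
merge(d1, d2, by = "k")
```

In the `cbind` result, row 1 pairs `k = 1` with `k = 3`; in the keyed `merge`, `a1` correctly lines up with `b1`.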
I am proposing a function that would merge the data.tables contained within a list, so that the following is possible:
```r
tabs <- lapply(dir(), function(file) data.table(read.csv(file), key="primary_key"))
data <- do.call("[", tabs)
```
or
```r
data <- mergelist(tabs)
```
This would be a killer feature. It does not exist elsewhere in R, as far as I can tell. It would allow data.table code to be more concise and require less updating. (Think about going from creating t2011, t2012, t2013... and merging them with t2011[t2012[t2013.... Now think about higher frequency data!) It would also take a bullet out of SAS's gun.
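As a rough illustration of the requested interface, here is a minimal user-level sketch of `mergelist()` written in terms of `Reduce(merge, ...)`. This is only a fallback sketch: the eventual data.table implementation is separate (C-level, with its own `on=`, `how=`, `mult=` arguments), and the function name and defaults here are assumptions.

```r
library(data.table)

# Minimal sketch: fold a keyed merge over a list of data.tables.
mergelist_sketch <- function(l, by, all.x = TRUE) {
  Reduce(function(x, y) merge(x, y, by = by, all.x = all.x), l)
}

# Usage, mimicking the t2011/t2012/t2013 example above.
tabs <- list(
  data.table(primary_key = 1:3, v2011 = 1:3),
  data.table(primary_key = 1:3, v2012 = 4:6),
  data.table(primary_key = 1:3, v2013 = 7:9)
)
mergelist_sketch(tabs, by = "primary_key")
```

This already removes the nested `t2011[t2012[t2013[...` pattern, though without the performance benefit the C implementation targets.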
```sas
%macro readfiles(number_of_files);
  %do i=1 %to &number_of_files;
    proc import out=imported&i. datafile="C:/file&i..csv" dbms=csv;
    run;
  %end;
%mend;

data merged;
  set imported:;
  by primary_key;
run;

proc datasets library=work nolist;
  delete imported:;
run;
```
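For comparison, a rough R equivalent of the SAS workflow above can be sketched with `fread()`/`fwrite()` and `Reduce(merge, ...)`. The file names, column names, and key are assumptions chosen to make the sketch self-contained (it writes its own CSVs to a temporary directory rather than reading from `C:/`).

```r
library(data.table)

# Write a few small CSVs to a temporary directory, standing in for the
# period-per-file flat files described above.
dir <- tempfile("csvs"); dir.create(dir)
for (i in 1:3) {
  d <- as.data.table(setNames(list(1:3, (1:3) * i),
                              c("primary_key", paste0("v", i))))
  fwrite(d, file.path(dir, sprintf("file%d.csv", i)))
}

# Read them back and merge on the shared primary key.
tabs   <- lapply(list.files(dir, full.names = TRUE), fread)
merged <- Reduce(function(x, y) merge(x, y, by = "primary_key", all = TRUE), tabs)
dim(merged)
# [1] 3 4
```

The proposed `mergelist(tabs)` would replace the `Reduce` line with a single, faster call.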