File-backed data.tables #1336

Closed
zachmayer opened this issue Sep 16, 2015 · 18 comments
Labels
feature request top request One of our most-requested issues

Comments

@zachmayer

SFrames are GraphLab Create's version of data.frames, and have some impressive performance benchmarks on single machines.

I'd really love to see something similar for data.table that could use disk rather than RAM to store the data.

@arunsrinivasan
Member

Agreed. Probably for v2.0.0, depending on how much time and motivation we have.

@jaapwalhout

The links in the original post of @zachmayer are not valid anymore. The GitHub repo of Graphlab/Dato/Turi can be found here. Because Graphlab/Dato/Turi has been acquired by Apple, this repo has been moved to here. It looks like it has evolved into a library for the development of machine learning models.

In case the two links above stop working, I've created a fork in my own profile.

@aquasync

One potential implementation strategy is via R's custom allocator mechanism. I constructed a file-backed data.table with individual columns backed by mmap-d files based on the code here.

See this gist, where I create the 2B row dataset (~75GB) from the benchmarks and run some aggregations on my laptop (16GB RAM). There are many missing pieces that make this far from a user-friendly solution, though. Among them: R's custom allocator is used for the entire array object, so there is an R-implementation-specific header prepended to the data; because of that header, the files can't be shared between R sessions, even read-only; data.table allocations for new objects (columns/indices) can't be hooked, so they won't be memory-mapped; there is no support for real string columns; and column attributes must be persisted manually.

All those caveats aside, I've already found it to be quite useful when working with a large number of moderately sized datasets: each is sequentially memory-mapped, data.table is told it's already sorted (attr(DT, 'order') = ...), and a "roll" join then extracts data with a given lookback, so that only the data needed for the binary search and the subsequent values has to be read from disk.
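To make the "roll" join with a lookback concrete, here is a minimal sketch. It is not the commenter's code: the table and column names (`prices`, `queries`, `time`, `price`) are hypothetical, the data is a small in-memory table rather than memory-mapped files, and the key is set with `setkey()` rather than by marking the table as pre-sorted.

```r
library(data.table)

# Hypothetical stand-in for one dataset, already ordered by `time`.
prices <- data.table(time = c(1, 5, 10, 20), price = c(100, 101, 103, 102))
setkey(prices, time)  # assumption: keyed normally instead of flagged as pre-sorted

# Times at which we want the most recent observation.
queries <- data.table(time = c(4, 10, 17))

# roll = 5 carries the last observation forward by at most 5 time units,
# i.e. a lookback window of 5. The match is found by binary search on the key.
res <- prices[queries, on = "time", roll = 5]

# time 4  -> observation at time 1 (3 units back)  -> price 100
# time 10 -> exact match                            -> price 103
# time 17 -> nearest earlier observation is 7 back  -> NA
print(res)
```

Because the lookup is a binary search over a sorted column, only a handful of rows per query are touched, which is what makes the approach attractive when the column lives in a memory-mapped file.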

@jonekeat

Is something similar to what @aquasync proposed already implemented? I have tried using the mmap package to memory-map each column in a list and then calling setDT, but the result does not work with data.table methods. I am looking for alternatives before resorting to databases/Spark or rewriting in C/C++.

@jangorecki
Member

@jonekeat disk.frame is possibly an alternative but I haven't tried it myself.

@GitHunter0

> @jonekeat disk.frame is possibly an alternative but I haven't tried it myself.

disk.frame is the most promising R solution for this matter I've seen so far. It would be very interesting to see data.table and disk.frame contributors working together.

@r2evans
Contributor

r2evans commented Apr 9, 2024

As a current-day workaround, what about the use of arrow::open_dataset and dtplyr or similar? The data is immutable so "saving" data would need to be an explicit step, but at least fast access to on-disk data should be feasible. (I recognize this does not fully address all likely use-cases for on-disk data.table operations; it is mostly a technique for mitigating large-data operations.)
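A minimal sketch of that workaround, under stated assumptions: the dataset path, the column names (`g`, `x`), and the tiny example data are hypothetical, and the query uses dplyr verbs on the Arrow dataset directly, converting to a data.table only after `collect()`. It assumes the arrow, dplyr, and data.table packages are installed and that arrow was built with Parquet support.

```r
library(arrow)
library(dplyr)
library(data.table)

# Write a small hypothetical dataset to disk so the example is self-contained.
path <- file.path(tempdir(), "example_dataset")
write_dataset(data.frame(g = c("a", "a", "b"), x = c(1, 2, 3)), path)

# open_dataset() only reads metadata; the rows stay on disk.
ds <- open_dataset(path)

# The filter and aggregation are evaluated by Arrow; collect() pulls just
# the (small) result into memory.
res <- ds |>
  filter(x > 1) |>
  group_by(g) |>
  summarise(total = sum(x)) |>
  collect()

# Continue with ordinary data.table operations on the in-memory result.
setDT(res)
setkey(res, g)
print(res)
```

Since the on-disk data is immutable, persisting a modified result is a separate explicit step, for example another `write_dataset()` call to a new path.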

@tdhock
Member

tdhock commented Apr 9, 2024

This is currently out of scope https://github.com/Rdatatable/data.table/blob/master/GOVERNANCE.md#the-r-package and I don't think anyone has the time/interest/skill to implement, so I'm closing.

@tdhock closed this as completed Apr 9, 2024

@r2evans
Contributor

r2evans commented Apr 9, 2024

I don't disagree, it's definitely big-scope. I offered my comment to illustrate alternative paths.

@MichaelChirico
Member

> This is currently out of scope https://github.com/Rdatatable/data.table/blob/master/GOVERNANCE.md#the-r-package and I don't think anyone has the time/interest/skill to implement, so I'm closing.

To clarify, I'd be glad to have the scope expanded for this high-demand FR, but as noted, the current maintainer core has no time/ability to support this. Outside contributions (and commitment to ownership) are welcome.
