
Integration with databases #726

Closed
hgera000 opened this issue Jul 9, 2014 · 9 comments

Comments

@hgera000

hgera000 commented Jul 9, 2014

I couldn't find this information anywhere, but I was interested to know whether there are any plans for data.table to also integrate with databases such as MySQL (through R), to avoid loading extremely large datasets into memory. I believe dplyr has this functionality. I'm not sure what the hit to performance would be, however.

@arunsrinivasan arunsrinivasan changed the title Feature Request - integration with databases Integration with databases Jul 9, 2014
@arunsrinivasan
Member

@hgera000 this would be a great feature to have, but most likely not for the next stable release (1.9.4); perhaps for the release after, or sometime in the near future. It'd also be nice to know how many other users are interested in this feature in data.table (based on which we could bump the priority).

@jangorecki
Member

@hgera000 you should be aware of the limitations of such integration. AFAIK dplyr translates R function calls into SQL and executes them on the db (correct me if I'm wrong). That means you are extremely limited, not only by the database's functions (which, compared to R's, are few) but also by the R-to-SQL translation process, which will most probably not handle all database features (and currently handles very few). Summarizing, this would be a big project, big enough to be a separate package.
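For illustration, a minimal sketch of that translation with current dplyr and its database backend, using an in-memory SQLite database as a stand-in for MySQL (the `trades` table and its columns are made up for this example):

```r
library(DBI)
library(dplyr)   # database tables also need the dbplyr backend installed

# in-memory SQLite database standing in for a real MySQL server
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "trades", data.frame(symbol = c("A", "A", "B"),
                                       price  = c(1.0, 2.0, 3.0)))

# dplyr builds a lazy query; nothing is pulled into R yet
avg_price <- tbl(con, "trades") %>%
  group_by(symbol) %>%
  summarise(mean_price = mean(price, na.rm = TRUE))

show_query(avg_price)   # prints the SQL that dplyr generated
collect(avg_price)      # only now is the result fetched into R

dbDisconnect(con)
```

Only operations that the translation layer and the database both support can be pushed down this way; anything else has to be collected into R first.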

@arunsrinivasan
Member

@jangorecki totally on point. Matt and I very briefly discussed this, on this very point. I'm not that familiar with dplyr's SQL backend functionality, but I presume it'd be relatively clumsy/limited - correct me if I'm wrong.

That's why it'd be nice to know (in spite of it) whether more people are interested in this FR (which would also tell us how they intend to use it and why they'd need it specifically).

@hgera000
Author

Thanks very much @arunsrinivasan and @jangorecki for responding on this. I need to look more closely into this feature of dplyr, as I have not really tested it out yet. My impression from the docs was that any R command could be run on the database, but as commented here, perhaps that is not actually the case.

My workflow is usually: 1) run queries in MySQL to get data into R; 2) convert to a data.table; 3) analyse in R. So there are times when it would be useful not to have to load entire (large) datasets into memory in order to calculate statistics that MySQL does not provide as built-in functions (for example, the median).
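A rough sketch of that workflow, assuming an already open DBI connection `con` and a hypothetical `trades` table (the query and column names are illustrative only):

```r
library(DBI)
library(data.table)

# 1) run the query in MySQL; only the selected rows/columns come back to R
trades <- dbGetQuery(con, "SELECT symbol, price FROM trades
                           WHERE trade_date >= '2014-01-01'")

# 2) convert to a data.table by reference (no copy made)
setDT(trades)

# 3) analysis in R: statistics MySQL lacks, e.g. the median, per group
trades[, .(median_price = median(price)), by = symbol]
```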

Certainly would be interesting to hear from others if they have used this feature of dplyr and whether or not they found it useful.

@jangorecki
Member

@hgera000, remember to use RMySQL instead of RODBC. Direct database drivers should be a lot faster than ODBC.
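For comparison, a minimal sketch of both routes (host, credentials, and the DSN below are placeholders):

```r
# native driver: DBI + RMySQL
library(DBI)
con <- dbConnect(RMySQL::MySQL(), host = "localhost", dbname = "mydb",
                 user = "me", password = "secret")
x <- dbGetQuery(con, "SELECT COUNT(*) FROM mytable")
dbDisconnect(con)

# same query through RODBC (goes via the ODBC layer, typically slower)
library(RODBC)
ch <- odbcConnect("my_dsn", uid = "me", pwd = "secret")
y <- sqlQuery(ch, "SELECT COUNT(*) FROM mytable")
odbcClose(ch)
```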

You might try my latest work: dwtools#db.
My db function is a handy wrapper for off-memory storage (DBI / ODBC / csv / custom defined) which was designed to be easily chainable with data.table.
Full examples: db_examples.R
It does not allow computation on off-memory storage other than by providing a get-query statement, yet it can improve memory usage by not keeping all of the tables used in memory but fetching them on the fly.

If you use it on MySQL, please let me know whether a table name provided as a scalar character ("myschema.mytable") is correctly mapped to the expected schema and table on the db.

As for the feature in data.table, I believe supporting a range of off-memory systems would be heavily time-consuming; I think it is better to focus on a single off-memory storage. fread already addresses csv well, and fwrite would be handy.

@b3nj1

b3nj1 commented Jul 4, 2015

@arunsrinivasan I'd be interested in this feature. I have a large code base (4k lines) built around data.table (Thanks!!!). Currently we use csvs and fread (again, Thanks!!!), but some of our data sets are uncomfortably large, so this doesn't scale anymore. We'd love to be able to drop in a MySQL database without rewriting large amounts of code.

Thanks,
Benjamin

@wolkym

wolkym commented Sep 18, 2015

See the SSDB project, which has a Redis API and is based on LevelDB. Together with RcppRedis it might be a way to go. The current key limit is 120MB.

@jangorecki
Member

jangorecki commented Jan 29, 2019

There are no plans for transparent integration with existing databases. I suggest using the DBI package, which is battle-tested for that.
We do still plan transparent integration with off-memory, on-disk storage. Data stored in binary files will be able to serve some of the features that databases do, like processing data bigger than available memory, or sharing data between different sessions. The status of file-backed data.table can be tracked in #1336.
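As a hedged illustration of working around memory limits with DBI today, a result set can be fetched and summarised in chunks rather than loaded whole (the connection `con` and the `trades` table are placeholders):

```r
library(DBI)
library(data.table)

res <- dbSendQuery(con, "SELECT symbol, price FROM trades")
chunks <- list()

# process the result set in 100,000-row pieces instead of loading it at once
while (!dbHasCompleted(res)) {
  piece <- setDT(dbFetch(res, n = 1e5))
  chunks[[length(chunks) + 1L]] <- piece[, .(sum_price = sum(price), n = .N), by = symbol]
}
dbClearResult(res)

# combine the per-chunk summaries; sums and counts merge exactly
# (an exact median, unlike a mean, cannot be combined this way)
rbindlist(chunks)[, .(mean_price = sum(sum_price) / sum(n)), by = symbol]
```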

@daroczig

Although it's probably not exactly the integration functionality you are looking for, my dbr pkg at https://github.com/daroczig/dbr does something like this:

  1. define your preferred data frame format as data.table:
     options('dbr.output_format' = 'data.table')
  2. set up the connection params in a YAML file
  3. query your database via db_query, e.g.:
     db_query('select 42', db = 'foobar')

That will automatically return the results as a data.table.
