
Integration with databases #726

Closed
hgera000 opened this issue Jul 9, 2014 · 9 comments

Comments

@hgera000

hgera000 commented Jul 9, 2014

I couldn't find this information anywhere, but I was interested to know whether there are any plans for data.table to also integrate with databases such as MySQL (through R), to avoid loading extremely large datasets into memory. I believe dplyr has this functionality. I'm not sure what the hit to performance would be, however.

@arunsrinivasan arunsrinivasan changed the title Feature Request - integration with databases Integration with databases Jul 9, 2014
@arunsrinivasan
Member

@hgera000 this would be a great feature to have, but most likely not for the next stable release (1.9.4); perhaps for the release after, or sometime in the near future. It'd also be nice to know how many other users are interested in this feature in data.table (based on which we could bump the priority).

@jangorecki
Member

@hgera000 you should be aware of the limitations of such integration. AFAIK dplyr translates R function calls into SQL and executes them on the db (correct me if I'm wrong). That means you are extremely limited, not only by the database's functions (which, compared to R's, are few) but also by the R-to-SQL translation process, which will most probably not handle all database features (and currently handles very few). Summarizing, this would be a big project, big enough to be a separate package.
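For illustration, a minimal sketch of that translation with current dplyr and its database backend, using an in-memory SQLite database as a stand-in for MySQL (the `trades` table and its columns are made up for this example):

```r
library(DBI)
library(dplyr)   # database tables also need the dbplyr backend installed

# in-memory SQLite database standing in for a real MySQL server
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "trades", data.frame(symbol = c("A", "A", "B"),
                                       price  = c(1.0, 2.0, 3.0)))

# dplyr builds a lazy query; nothing is pulled into R yet
avg_price <- tbl(con, "trades") %>%
  group_by(symbol) %>%
  summarise(mean_price = mean(price, na.rm = TRUE))

show_query(avg_price)   # prints the SQL that dplyr generated
collect(avg_price)      # only now is the result fetched into R

dbDisconnect(con)
```

Only operations that the translation layer and the database both support can be pushed down this way; anything else has to be collected into R first.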

@arunsrinivasan
Member

@jangorecki totally on point. Matt and I very briefly discussed this, on this very point. I'm not that familiar with dplyr's SQL backend functionality, but I presume it'd be relatively clumsy/limited - correct me if I'm wrong.

That's why it'd be nice to know (in spite of it) whether more people are interested in this FR (which would also tell us how they intend to use it and why they'd need it specifically).

@hgera000
Author

Thanks very much @arunsrinivasan and @jangorecki for responding on this. I need to look more closely into this feature of dplyr, as I have not really tested it out yet. My impression from the docs was that any R command could be run on the database, but as commented here, perhaps that is not actually the case.

My workflow is usually: 1) run queries in MySQL to get data into R; 2) convert to a data.table; 3) analyse in R. So there are times when it would be useful not to have to load entire (large) datasets into memory in order to calculate statistics that MySQL does not provide as built-in functions (for example, the median).
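A rough sketch of that workflow, assuming an already open DBI connection `con` and a hypothetical `trades` table (the query and column names are illustrative only):

```r
library(DBI)
library(data.table)

# 1) run the query in MySQL; only the selected rows/columns come back to R
trades <- dbGetQuery(con, "SELECT symbol, price FROM trades
                           WHERE trade_date >= '2014-01-01'")

# 2) convert to a data.table by reference (no copy made)
setDT(trades)

# 3) analysis in R: statistics MySQL lacks, e.g. the median, per group
trades[, .(median_price = median(price)), by = symbol]
```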

Certainly would be interesting to hear from others if they have used this feature of dplyr and whether or not they found it useful.

@jangorecki
Member

@hgera000, remember to use RMySQL instead of RODBC. Direct database drivers should be a lot faster than ODBC.
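For comparison, a minimal sketch of both routes (host, credentials, and the DSN below are placeholders):

```r
# native driver: DBI + RMySQL
library(DBI)
con <- dbConnect(RMySQL::MySQL(), host = "localhost", dbname = "mydb",
                 user = "me", password = "secret")
x <- dbGetQuery(con, "SELECT COUNT(*) FROM mytable")
dbDisconnect(con)

# same query through RODBC (goes via the ODBC layer, typically slower)
library(RODBC)
ch <- odbcConnect("my_dsn", uid = "me", pwd = "secret")
y <- sqlQuery(ch, "SELECT COUNT(*) FROM mytable")
odbcClose(ch)
```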

You might try my latest work: dwtools#db.
My db function is a handy wrapper for off-memory storage (DBI / ODBC / csv / custom defined) which was designed to be easily chainable with data.table.
Full examples: db_examples.R
It does not allow computation on off-memory storage other than by providing a get-query statement, yet it can improve memory usage by not keeping all of the tables used in memory but fetching them on the fly.

If you use it on MySQL, please let me know whether a table name provided as a scalar character ("myschema.mytable") is correctly mapped to the expected schema and table on the db.

As for the feature in data.table, I believe supporting a range of off-memory systems would be heavily time-consuming; I think it is better to focus on a single off-memory storage. fread already addresses csv well, and fwrite would be handy.

@b3nj1

b3nj1 commented Jul 4, 2015

@arunsrinivasan I'd be interested in this feature. I have a large code base (4k lines) built around data.table (Thanks!!!). Currently we use csvs and fread (again, Thanks!!!), but some of our data sets are uncomfortably large, so this doesn't scale anymore. We'd love to be able to drop in a MySQL database without rewriting large amounts of code.

Thanks,
Benjamin

@wolkym

wolkym commented Sep 18, 2015

See the SSDB project, which has a Redis API and is based on LevelDB. Together with RcppRedis it might be a way to go. The current key limit is 120MB.

@jangorecki
Member

jangorecki commented Jan 29, 2019

There are no plans for transparent integration with existing databases. I suggest using the DBI package, which is battle-tested for that.
We do still plan transparent integration with off-memory, on-disk storage. Data stored in binary files will be able to serve some of the features that databases do, like processing data bigger than available memory, or sharing data between different sessions. The status of file-backed data.table can be tracked in #1336.
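As a hedged illustration of working around memory limits with DBI today, a result set can be fetched and summarised in chunks rather than loaded whole (the connection `con` and the `trades` table are placeholders):

```r
library(DBI)
library(data.table)

res <- dbSendQuery(con, "SELECT symbol, price FROM trades")
chunks <- list()

# process the result set in 100,000-row pieces instead of loading it at once
while (!dbHasCompleted(res)) {
  piece <- setDT(dbFetch(res, n = 1e5))
  chunks[[length(chunks) + 1L]] <- piece[, .(sum_price = sum(price), n = .N), by = symbol]
}
dbClearResult(res)

# combine the per-chunk summaries; sums and counts merge exactly
# (an exact median, unlike a mean, cannot be combined this way)
rbindlist(chunks)[, .(mean_price = sum(sum_price) / sum(n)), by = symbol]
```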

@daroczig

Although it's probably not exactly the integration functionality you are looking for, my dbr pkg at https://github.com/daroczig/dbr does something like this:

  1. define your preferred data frame format as data.table:
     options('dbr.output_format' = 'data.table')
  2. set up the connection params in a YAML file
  3. query your database via db_query, e.g.:
     db_query('select 42', db = 'foobar')

That will automatically return the results as a data.table.
