Is vroom too memory greedy and disk intensive? #507

Closed
CharlesNickmilder opened this issue Aug 16, 2023 · 1 comment

Comments

@CharlesNickmilder

Hello,

I discovered vroom recently while searching for a way to read only specific rows of a CSV file. The exact requirement was that the "index" column of the output contain all the indices I need and no other values.

After some research online, I found the following snippet, which looked promising:

test=vroom::vroom(i)|> dplyr::filter(idpixel %in% IndicesNeeded)

where i is the filename.

Looking at the task manager, I noticed heavy read and write activity on my internal drive during the mapping. This raised a first red flag for me: I need to process around 3000 files, each between 1 and 20 GB. That implies a lot of stress on my internal drive and I don't want to wear it out. Can we designate another location where these intensive operations are performed?

Another point I noticed is the memory greed: once the filter has been applied, there is no need to keep the whole memory allocation, especially since I don't know of a way to recover the full data in the middle of a pipe. To free the memory allocated for the whole dataset, I have to convert the extracted object test to another class, e.g. a data.table, and call gc() afterwards, with a command like:

test=as.data.table(test)

As far as I understand, this means that the pointers behind the memory allocation are passed along through the pipe rather than recomputed for the filtered data. As the rest of my workflow relies on data.table, this workaround does not hamper my work. However, I did not find any information about this behaviour anywhere.

Regards,

Charles

PS:

vroom version 1.6.1

dplyr version 1.1.2

@jennybc
Member

jennybc commented Sep 28, 2023

I was just re-watching the video below to answer a different question, but I think it's also relevant to your use case. My main advice is that perhaps you should be pre-filtering the input on the way in to R, as opposed to after reading the entire file into R. Based on what I see above, you should be able to express your filter in some concise way inside a pipe() call, which you can use with vroom(). vroom comes up in the video around the 9 minute mark.

https://youtu.be/RYhwZW6ofbI?si=HEGTk4o2P6-4zG6m
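
For illustration, here is a minimal sketch of what pre-filtering inside a pipe() call could look like. It is not taken from the video. It assumes that awk is on the PATH, that R's pipe() runs a POSIX-style shell (on Windows this would need something like Rtools or WSL), that the file is comma-delimited with no quoted commas, and that idpixel is the first column. The example data, csv_path and ids_path are hypothetical stand-ins for one of the real files and the real index list.

```r
library(vroom)

# Hypothetical example data: a small CSV whose first column is `idpixel`.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("idpixel,value", "1,0.5", "2,0.7", "3,0.9", "4,1.1"), csv_path)

# Indices to keep, written one per line so awk can load them as a lookup table.
IndicesNeeded <- c(2, 4)
ids_path <- tempfile(fileext = ".txt")
writeLines(as.character(IndicesNeeded), ids_path)

# awk reads the ids file first (NR == FNR), then passes through the header
# line (FNR == 1) and every CSV row whose first field is a wanted id.
# Assumption: idpixel is field 1; adjust $1 if it sits in another column.
cmd <- sprintf(
  "awk -F, 'NR == FNR { keep[$1]; next } FNR == 1 || ($1 in keep)' %s %s",
  shQuote(ids_path), shQuote(csv_path)
)

# Only the rows that survive the awk filter ever reach R.
test <- vroom(pipe(cmd), delim = ",", show_col_types = FALSE)
```

With this approach the filtering happens in the external process, so R only ever allocates memory for the matching rows, and the result can be passed to data.table::as.data.table() as before.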

@jennybc jennybc closed this as completed Sep 28, 2023