Is vroom too memory greedy and disk intensive? #507

CharlesNickmilder · 2023-08-16T15:05:57Z

Hello,

I discovered vroom recently while searching for a way to read only specific rows from a csv. Specifically, the output should contain only the rows whose "index" column matches one of the indices I need, and none of the other rows.

After some research online, I found the following snippet, which looked promising:

test=vroom::vroom(i)|> dplyr::filter(idpixel %in% IndicesNeeded)

where i is the filename.

Looking at the task manager, I noticed heavy read and write activity on my internal drive while the file was being mapped. This raised a first red flag for me: I need to process around 3000 files, each between 1 and 20 GB. That implies a lot of stress on my internal drive, and I don't want to wear it out. Can we define another location where these intensive operations are carried out?

Another point I noticed is the memory usage: once the filter has been applied, there is no need to keep the whole allocation, especially since I don't know of a way to get at the data in the middle of a pipe. To free the memory allocated for the whole dataset, I have to convert the extracted test object to another class, e.g. a data.table, and then call gc(), with a command like:

test=as.data.table(test)

As far as I understand, this means that the pointers behind the memory allocation are carried through the pipe rather than recomputed for the filtered data. As the rest of my workflow relies on data.table, this workaround does not hamper my work. However, I could not find any information about this behaviour anywhere.
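For reference, the workaround described above can be written out as one small script. This is only a sketch: the toy file, its value column, and the contents of IndicesNeeded are made up for illustration, and it makes no claim about why the conversion releases memory, only that this is the sequence of calls reported to do so.

```r
library(vroom)
library(data.table)

# Hypothetical toy input standing in for one of the large csv files.
i <- tempfile(fileext = ".csv")
writeLines(c("idpixel,value", "1,10.5", "2,11.0", "3,9.75"), i)

# Hypothetical set of wanted indices.
IndicesNeeded <- c(1, 3)

# Read the whole file, then keep only the wanted rows.
test <- vroom::vroom(i, show_col_types = FALSE) |>
  dplyr::filter(idpixel %in% IndicesNeeded)

# Convert to a data.table, then trigger garbage collection.
test <- as.data.table(test)
invisible(gc())
```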

Regards,

Charles

PS:

vroom version 1.6.1

dplyr version 1.1.2


jennybc · 2023-09-28T22:17:52Z

I was just re-watching the video below to answer a different question, but I think it's also relevant to your use case. My main advice is that perhaps you should be pre-filtering the input on the way in to R, as opposed to after reading the entire file into R. Based on what I see above, you should be able to express your filter in some concise way inside a pipe() call, which you can use with vroom(). vroom comes up in the video around the 9 minute mark.

https://youtu.be/RYhwZW6ofbI?si=HEGTk4o2P6-4zG6m
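A minimal sketch of that pre-filtering idea follows. It is not taken from the video, and it rests on several assumptions: awk is available on the PATH, idpixel is the first column, and no field contains a quoted comma (awk splits naively on commas here). The toy file, the ids file, and the value column are hypothetical.

```r
library(vroom)

# Hypothetical toy input standing in for one of the large csv files.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("idpixel,value", "1,10.5", "2,11.0", "3,9.75", "4,12.25"), csv_path)

# Hypothetical set of wanted indices, written one per line for awk to load.
IndicesNeeded <- c(2, 4)
ids_path <- tempfile(fileext = ".txt")
writeLines(as.character(IndicesNeeded), ids_path)

# awk loads the wanted ids from the first file, then prints the header line
# and every row of the second file whose first field is a wanted id.
awk_program <- "NR == FNR { keep[$1]; next } FNR == 1 || ($1 in keep)"
cmd <- sprintf("awk -F, %s %s %s",
               shQuote(awk_program), shQuote(ids_path), shQuote(csv_path))

# vroom only ever sees the rows that survive the awk filter.
test <- vroom(pipe(cmd), delim = ",", show_col_types = FALSE)
```

With this approach R never holds the full file, so there is nothing large to release afterwards. When vroom reads from a connection it buffers the incoming data in a temporary file; as far as I understand the vroom documentation, the directory for that file can be redirected with the VROOM_TEMP_PATH environment variable, which may be relevant to the question about moving disk activity to another drive.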

jennybc closed this as completed Sep 28, 2023
