Hello,

I discovered vroom recently while searching for a way to read only specific rows from a csv. The exact condition was that the column "index" of the output had to contain all the indices I needed but none of the other values.
After some research online, I found the following snippet that interested me:

test = vroom::vroom(i) |> dplyr::filter(idpixel %in% IndicesNeeded)

where i is the filename.
Looking at the task manager, I noticed heavy read and write activity on my internal drive during the mapping. This raised a first red flag for me: I need to scrape around 3000 files, each between 1 and 20 GB. That implies a lot of stress on my internal drive and I don't want to wear it out. Can we define another location where these intensive operations could be done?
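One possible workaround, assuming the intermediate writes go through R's session temporary directory (which is fixed at startup from the TMPDIR/TMP/TEMP environment variables), is to point that directory at another drive before launching R, e.g. in ~/.Renviron, and confirm it from inside the session:

# Hypothetical redirection: set this before R starts, e.g. in ~/.Renviron
#   TMPDIR=/mnt/scratch/r-tmp
# Then verify where temporary files will land for this session:
Sys.getenv("TMPDIR")
tempdir()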
Another point I noticed is the memory greed: once the filter is performed, there is no need to keep the whole allocation, especially since I don't know a way to get at the data in the middle of a pipe. To free the memory allocated for the whole database, I have to convert the extracted test data to another class, e.g. a data.table, and run a gc() afterwards, with a command like:
test = data.table::as.data.table(test)
gc()
As far as I understand, this means that the pointers that create the memory allocation are transferred through the pipe rather than recomputed for the targeted data. As the rest of my workflow relies on data.table, this patch does not hamper my work. However, I did not find any information about this behaviour anywhere.
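In context, the per-file workflow looks roughly like the sketch below. It is a minimal illustration under two assumptions: files is a hypothetical character vector holding the paths of the ~3000 csv files, and IndicesNeeded is the vector of wanted ids from above.

library(vroom)
library(dplyr)
library(data.table)

# hypothetical list of the ~3000 input files
files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)

results <- lapply(files, function(i) {
  # read the file, then keep only the wanted rows
  test <- vroom::vroom(i) |> dplyr::filter(idpixel %in% IndicesNeeded)
  # copy the filtered rows into a plain data.table so the allocations
  # tied to the full file can be released, then collect the garbage
  test <- data.table::as.data.table(test)
  gc()
  test
})

# stack the per-file results into one data.table
combined <- data.table::rbindlist(results)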
Regards,
Charles
PS:
vroom version 1.6.1
dplyr version 1.1.2
I was just re-watching the video below to answer a different question, but I think it's also relevant to your use case. My main advice is that you should probably pre-filter the input on the way into R, rather than after reading the entire file into R. Based on what I see above, you should be able to express your filter concisely inside a pipe() call, which you can use with vroom(). vroom comes up in the video around the 9 minute mark.
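For example, a rough sketch of that approach, assuming idpixel is the first column, the files are comma-delimited, and the wanted ids sit one per line in a hypothetical wanted_ids.txt:

# Build a shell command that keeps the header plus only the wanted rows:
# awk reads wanted_ids.txt first (NR==FNR) to fill a lookup table, then
# streams big.csv, printing the header line (FNR==1) and any row whose
# first field is in the table.
cmd <- "awk -F',' 'NR==FNR { wanted[$1]; next } FNR==1 || ($1 in wanted)' wanted_ids.txt big.csv"

# vroom() accepts a connection, so only the pre-filtered rows ever reach R
test <- vroom::vroom(pipe(cmd))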