add 'chunk' option for getData #292
Comments
Hi @RJDan, thank you for bringing this back to our attention. We already had a discussion about reading data in chunks, but back then we switched to reading the whole file because it was faster (see #141). Now, with even more samples and larger tabix files, this seems less stable and possibly less reasonable.

Chunk-wise reading might lower the memory pressure and prevent segmentation faults, but the whole purpose of getData is to read the entire dataset into memory, so it is questionable whether reading data in chunks will solve memory issues in the long run. Probably the best we can do here is to approximate the expected memory consumption and warn/stop if it exceeds the available memory.

I guess what you are looking for is some kind of streaming function that reads a new chunk of data on every invocation, does some calculations, and repeats until the whole file is processed. We actually have this kind of functionality available internally to process tabix files in chunks, per chromosome, or only for specific ranges. These functions are not exported and are only accessible via the internal (:::) namespace. We are already using this functionality for memory-efficient extraction of the percent methylation matrix and for coverage calculations prior to bigwig export.
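As a rough illustration (not our internal code), this kind of streaming loop over a methylKit tabix file can be sketched with Rsamtools; the file path, chunk size, and per-chunk computation are placeholders:

```r
# Rough illustration only (not methylKit's internal helpers): stream a bgzipped,
# tabix-indexed methylation file in chunks via Rsamtools. The file path and
# chunk size below are placeholders.
library(Rsamtools)

tbx <- TabixFile("methylDB/sample1.txt.bgz", yieldSize = 100000)
open(tbx)                               # keep the file open so successive
                                        # scanTabix() calls advance through it
n_records <- 0
repeat {
  chunk <- scanTabix(tbx)[[1]]          # character vector of raw tabix records
  if (length(chunk) == 0) break         # end of file
  df <- read.table(text = chunk, sep = "\t", stringsAsFactors = FALSE)
  # ... per-chunk calculations would go here ...
  n_records <- n_records + nrow(df)
}
close(tbx)
n_records
```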
Thanks for the reply. I would suggest making a more prominent statement on the website and in the manual about what does and does not work with chunks, for those of us who don't read things carefully (such as myself). Perhaps you could also mention that, if we encounter segfault issues, the first thing to check is whether we are using a function that reads everything into memory. I had a line in my script that created a non-DB object as an intermediate step, and it took me a while to figure out that this was what caused the segfault. Thanks for the pointers about the unexported functions.
Hello
I have had many issues trying to do custom analyses with the output of getData when the data sets are large.
I get a segmentation fault that does not necessarily go away with more RAM. For instance, my LSF job scheduler reports that I was only using 20 GB of RAM when I got the segmentation fault, even though I had requested a total of 40 GB, so another 20 GB was available and unused.
I would like to request that the getData function get an option to read data in chunks, as the *DB functions currently have.
I know that I could change my code to read the file in chunks (i.e., not use getData) or split my files into chunks outside of R, but I think this option would be super useful for the many people who don't know how to do that kind of manual file manipulation.
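For context, the kind of manual chunked reading I mean looks roughly like this (the file name, chunk size, and column position are placeholders, not from my actual script):

```r
# Placeholder sketch of the manual workaround: stream a plain-text,
# tab-separated methylation call file in fixed-size chunks with base R.
# "calls.txt", the chunk size, and the use of column 5 are placeholders.
con <- file("calls.txt", open = "r")
chunk_size <- 100000L
running_total <- 0

repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break         # end of file
  df <- read.table(text = lines, sep = "\t", stringsAsFactors = FALSE)
  # per-chunk work goes here; e.g. accumulate a (placeholder) coverage column
  running_total <- running_total + sum(df[[5]])
}
close(con)
running_total
```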