
optimize lazy execution during pvar read #11

Merged · 1 commit merged into AI-sandbox:main on Nov 21, 2024

Conversation

kscott-1 (Contributor)

  • Row indices create an issue with lazy execution: every function polars provides for adding indices requires the LazyFrame to be read into memory.
  • Instead of reading the entire LazyFrame into memory, we track indices using the variant ID.
  • We can detect a blank header from the number of columns in a pvar file: the pvar spec allows only 5 or 6 columns when the header is blank.
  • When scanning the CSV, ignore any line starting with "#" to skip both the metadata and the header (if it exists), so that we can supply the header manually from the earlier control block.
  • The ordering of the control block for filtering variants caused the entire LazyFrame to be read when variant_idxs was not specified, even if variant_ids was.
  • Consolidate the ordering logic into a single if/elif/else block and copy the structure to the psam section (see the sketch after this list).
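
A minimal sketch of that reordered block, assuming a LazyFrame `lf` with an `ID` column and keyword arguments named as in this description (the actual snputils code may differ):

```python
import polars as pl

def filter_variants(lf: pl.LazyFrame, variant_idxs=None, variant_ids=None) -> pl.LazyFrame:
    """Filter a lazily scanned pvar by row index or variant ID without collecting it."""
    if variant_idxs is not None:
        # Resolve the requested row indices to variant IDs by collecting
        # only the ID column, then filter lazily on those IDs. This avoids
        # materializing the whole LazyFrame just to attach row numbers.
        ids = lf.select("ID").collect().to_series()
        wanted = ids.gather(variant_idxs)  # `take` in older polars versions
        return lf.filter(pl.col("ID").is_in(wanted))
    elif variant_ids is not None:
        return lf.filter(pl.col("ID").is_in(variant_ids))
    else:
        # No filter requested: return the LazyFrame untouched. Previously,
        # this path could read the entire frame even when variant_ids was set.
        return lf
```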

I noticed some issues with the lazy execution on the pvar file. To the best of my knowledge, there is only so much that can be done to save memory when row indices are important, but these fixes do help. I did not change any of the behavior regarding row indices for psam, so there are some logical differences between the pvar/psam code blocks here. psam files are so small that it probably isn't worth optimizing except for consistency.

I also did some work on the header checking. As it stands, if the header doesn't exist (i.e. #CHROM is never found), the entire file is read while the if statement keeps looking for #CHROM all the way to the end. That doesn't seem like the ideal behavior, so I propose some fixes to that here; a sketch of the early-exit check follows.
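
Roughly, the check only needs to peek at the leading "#" lines (a sketch under the column-count assumption above; the path is hypothetical):

```python
# Metadata lines start with "##" and a header line with "#CHROM"; the
# first line starting with neither is data, so we can stop immediately
# instead of scanning to the end of the file.
path = "example.pvar"  # hypothetical path
header_cols = None
with open(path) as f:
    for line in f:
        if line.startswith("#CHROM"):
            header_cols = line.rstrip("\n").split("\t")
            break
        if not line.startswith("#"):
            # No header: per the pvar spec the file then has 5 or 6
            # columns, which determines the implicit column names.
            n_cols = len(line.rstrip("\n").split("\t"))
            break
```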

Sorry to drop it all in one commit, I figured I would put up what I have before bed. I may try to look at this a little more, but do with the PR what you wish :)

kscott-1 (Contributor, Author) commented Nov 20, 2024

I can only replicate the error here with the .zst files; I'll need to investigate further.

  • The problem is that polars doesn't allow scanning of compressed files. However, polars has a simple method for reading zst files: it seems the engine can read the CSV immediately into memory instead, but that breaks the header handling. It's a problem I need to think about more.

salcc added the enhancement label on Nov 20, 2024
kscott-1 (Contributor, Author) commented Nov 21, 2024

The more I look at this, the more it seems like a job for duckdb + polars rather than polars alone. In particular, when there is an INFO column (a large one, at that), polars chokes hard regardless of optimization strategy, whereas a tool like duckdb could efficiently ignore that column.

As an example of why I don't feel polars alone is enough here: I have a ~10 GB pvar file whose INFO column takes up 80-90% of that size. Scanning the file, selecting just the ID column, and collecting with polars maxes out at ~11 GB of memory, as shown in the image below; that is still not good enough, IMO. For reference, reading in the entire file instead takes ~23 GB of memory, so the improvement over that is still good. I don't have much experience optimizing memory management for full projects like this, but I think it is worth considering other options. A sketch of the polars measurement follows.
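
Something like the following, with a hypothetical file name and an assumed column layout (comment_prefix is the recent-polars name; older versions call it comment_char):

```python
import polars as pl

# Lazily scan the pvar, supplying the column names manually since
# comment_prefix drops the "#"-prefixed header line along with metadata,
# then collect only the ID column.
names = ["CHROM", "POS", "ID", "REF", "ALT", "FILTER", "INFO"]  # assumed layout
ids = (
    pl.scan_csv(
        "example.pvar",       # hypothetical file
        separator="\t",
        comment_prefix="#",   # skips metadata and header lines
        has_header=False,
        new_columns=names,
    )
    .select("ID")
    .collect()
)
```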

[Figure: memusage_plscan — memory-usage profile of the polars scan described above]

Curious to hear thoughts on the direction you guys are wishing this to go.

Edit: just as a further demonstration of my suggestion, running the same benchmark (tracking indices too) and filtering down to a small list of 5 variants took 160 MB using duckdb, whereas with polars I was already past 10 GB just getting row indices and starting to filter columns. Speed is moderately decreased (my query took 20 seconds from reading to filtering); on smaller files, the slowdown would be even less noticeable. A sketch of the duckdb query is below.
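
A sketch of that duckdb variant, with a hypothetical file name and variant IDs, and assuming the header is the only "#"-prefixed line in the file:

```python
import duckdb

# Number the rows once in a subquery so the indices refer to the full
# file, then project and filter; duckdb's projection pushdown means the
# INFO column is never kept in memory.
query = r"""
    SELECT idx, "ID"
    FROM (
        SELECT row_number() OVER () - 1 AS idx, "ID"
        FROM read_csv_auto('example.pvar', delim='\t', header=true)
    )
    WHERE "ID" IN ('rs1', 'rs2', 'rs3', 'rs4', 'rs5')
"""
rows = duckdb.sql(query).fetchall()
```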

salcc (Member) commented Nov 21, 2024

Thanks for investigating this! We tried to optimize for the reading speed and memory usage of the SNPObject, but haven't done many tests on the memory usage of reading itself, so some improvements might be possible. However, we would not like to compromise speed since it might often be preferable to use more memory if the processing is faster, for example, when working in a cluster with a lot of memory and processing whole biobanks.

We have not used duckdb in the past, and the fourth Google result for duckdb is https://www.reddit.com/r/dataengineering/comments/zp6ai6/what_are_the_actual_use_cases_for_duckdb_when_you/ which says that it's very slow compared to polars, but the post is 2 years old, so things might have changed. However, if you believe that using duckdb or any other modification can improve memory usage (without compromising speed, or even improving speed), we are happy to accept your contributions!

salcc (Member) commented Nov 21, 2024

I hadn't read the edit before. I understand that memory usage is improved as a tradeoff of the reading speed. If you believe the improvements are considerable, we could consider having two backends, one with only polars and one integrating duckdb. This is actually similar to what we did for the VCF reader, since VCF files are effectively CSV/TSV too (what we did there is to use our own polars reader by default, but the user can also choose to use scikit-allel as an alternative backend, which tends to be slower but more memory-efficient).

kscott-1 (Contributor, Author)

Understood, it is good to know the overall goals. I think the polars solution is clearly best if speed is paramount. I think more testing needs to be done before making any serious changes, but it is something to think about.

For the time being, I will clean up this PR into a single commit that takes care of the issues I found with .zst files and improves the header-reading logic. .zst files need to be read in with a plain read_csv rather than the scan; a sketch of that fallback is below.
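
A minimal sketch of the fallback, assuming recent polars (read_csv can decompress zstd in memory, while scan_csv cannot scan compressed files); the path handling is illustrative:

```python
import polars as pl

def scan_pvar(path: str) -> pl.LazyFrame:
    """Return a LazyFrame over a pvar, falling back to an eager read for .zst."""
    kwargs = dict(separator="\t", comment_prefix="#", has_header=False)
    if path.endswith(".zst"):
        # Compressed files can't be scanned lazily; read eagerly (polars
        # decompresses zstd in memory) and wrap the result so the rest of
        # the pipeline stays lazy.
        return pl.read_csv(path, **kwargs).lazy()
    return pl.scan_csv(path, **kwargs)
```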

* row indices create an issue with lazy execution: any function polars provides to add indices requires the lf be read into memory.
* instead of reading the entire lf into memory, we track indices by using the variant ID.
* we can determine a blank header from the number of columns in a pvar file - pvar spec allows only for 5/6 cols when header is blank.
* the order of the control block for filtering variants provided an issue where the entire lf was read in the case that variant_idxs were not specified, even if variant_ids was.
* fix the ordering logic into one if/elif/else block and copy the structure to the psam section.

Signed-off-by: Kyle Scott <[email protected]>
@kscott-1 kscott-1 marked this pull request as ready for review November 21, 2024 21:09
@salcc salcc self-assigned this Nov 21, 2024
@salcc salcc self-requested a review November 21, 2024 21:26
Review thread on snputils/snp/io/read/pgen.py (resolved)
@salcc salcc merged commit 5dd876b into AI-sandbox:main Nov 21, 2024
4 checks passed
@kscott-1 kscott-1 deleted the dev branch November 22, 2024 13:06
salcc (Member) commented Nov 22, 2024

We released v0.2.10, which includes this enhancement and some other bug fixes!
