
optimize lazy execution during pvar read #11

Merged · 1 commit merged into AI-sandbox:main on Nov 21, 2024

Conversation

kscott-1 (Contributor)

  • Row indices create an issue with lazy execution: every function polars provides for adding indices requires the LazyFrame to be read into memory.
  • Instead of reading the entire LazyFrame into memory, we track indices using the variant ID.
  • We can detect a blank header from the number of columns in a pvar file: the pvar spec allows only 5 or 6 columns when the header is blank.
  • When scanning the CSV, ignore any line starting with "#" to skip both the metadata and the header (if it exists), so that we can supply the header manually from the earlier control block.
  • The ordering of the control block for filtering variants caused the entire LazyFrame to be read when variant_idxs was not specified, even if variant_ids was.
  • Consolidate the ordering logic into a single if/elif/else block and copy the structure to the psam section (see the sketch after this list).
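
A minimal sketch of that reordered block, assuming a LazyFrame `lf` with an `ID` column and keyword arguments named as in this description (the actual snputils code may differ):

```python
import polars as pl

def filter_variants(lf: pl.LazyFrame, variant_idxs=None, variant_ids=None) -> pl.LazyFrame:
    """Filter a lazily scanned pvar by row index or variant ID without collecting it."""
    if variant_idxs is not None:
        # Resolve the requested row indices to variant IDs by collecting
        # only the ID column, then filter lazily on those IDs. This avoids
        # materializing the whole LazyFrame just to attach row numbers.
        ids = lf.select("ID").collect().to_series()
        wanted = ids.gather(variant_idxs)  # `take` in older polars versions
        return lf.filter(pl.col("ID").is_in(wanted))
    elif variant_ids is not None:
        return lf.filter(pl.col("ID").is_in(variant_ids))
    else:
        # No filter requested: return the LazyFrame untouched. Previously,
        # this path could read the entire frame even when variant_ids was set.
        return lf
```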

I noticed some issues with the lazy execution on the pvar file. To the best of my knowledge, there is only so much that can be done to save memory when row indices are important, but these fixes do help. I did not change any of the behavior regarding row indices for psam, so there are some logical differences between the pvar/psam code blocks here. psam files are so small that it probably isn't worth optimizing except for consistency.

I also did some work on the header checking. As it stands, if the header doesn't exist (i.e. #CHROM is never found), the entire file is read while the if statement keeps looking for #CHROM all the way to the end. That doesn't seem like the ideal behavior, so I propose some fixes to that here; a sketch of the early-exit check follows.
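
Roughly, the check only needs to peek at the leading "#" lines (a sketch under the column-count assumption above; the path is hypothetical):

```python
# Metadata lines start with "##" and a header line with "#CHROM"; the
# first line starting with neither is data, so we can stop immediately
# instead of scanning to the end of the file.
path = "example.pvar"  # hypothetical path
header_cols = None
with open(path) as f:
    for line in f:
        if line.startswith("#CHROM"):
            header_cols = line.rstrip("\n").split("\t")
            break
        if not line.startswith("#"):
            # No header: per the pvar spec the file then has 5 or 6
            # columns, which determines the implicit column names.
            n_cols = len(line.rstrip("\n").split("\t"))
            break
```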

Sorry to drop it all in one commit, I figured I would put up what I have before bed. I may try to look at this a little more, but do with the PR what you wish :)

kscott-1 (Contributor, Author) commented Nov 20, 2024

I can only replicate the error here with the .zst files; I'll need to investigate further.

  • The problem is that polars doesn't allow scanning of compressed files. However, polars has a simple method for reading zst files: it seems the engine can read the CSV immediately into memory instead, but that breaks the header handling. It's a problem I need to think about more.

salcc added the enhancement label on Nov 20, 2024
kscott-1 (Contributor, Author) commented Nov 21, 2024

The more I look at this, the more it seems like a job for duckdb + polars rather than polars alone. In particular, when there is an INFO column (a large one, at that), polars chokes hard regardless of optimization strategy, whereas a tool like duckdb could efficiently ignore that column.

As an example of why I don't feel polars alone is enough here: I have a ~10 GB pvar file whose INFO column takes up 80-90% of that size. Scanning the file, selecting just the ID column, and collecting with polars maxes out at ~11 GB of memory, as shown in the image below; that is still not good enough, IMO. For reference, reading in the entire file instead takes ~23 GB of memory, so the improvement over that is still good. I don't have much experience optimizing memory management for full projects like this, but I think it is worth considering other options. A sketch of the polars measurement follows.
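
Something like the following, with a hypothetical file name and an assumed column layout (comment_prefix is the recent-polars name; older versions call it comment_char):

```python
import polars as pl

# Lazily scan the pvar, supplying the column names manually since
# comment_prefix drops the "#"-prefixed header line along with metadata,
# then collect only the ID column.
names = ["CHROM", "POS", "ID", "REF", "ALT", "FILTER", "INFO"]  # assumed layout
ids = (
    pl.scan_csv(
        "example.pvar",       # hypothetical file
        separator="\t",
        comment_prefix="#",   # skips metadata and header lines
        has_header=False,
        new_columns=names,
    )
    .select("ID")
    .collect()
)
```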

[Figure: memusage_plscan — memory-usage profile of the polars scan described above]

Curious to hear thoughts on the direction you guys are wishing this to go.

Edit: just as a further demonstration of my suggestion, running the same benchmark (tracking indices too) and filtering down to a small list of 5 variants took 160 MB using duckdb, whereas with polars I was already past 10 GB just getting row indices and starting to filter columns. Speed is moderately decreased (my query took 20 seconds from reading to filtering); on smaller files, the slowdown would be even less noticeable. A sketch of the duckdb query is below.
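
A sketch of that duckdb variant, with a hypothetical file name and variant IDs, and assuming the header is the only "#"-prefixed line in the file:

```python
import duckdb

# Number the rows once in a subquery so the indices refer to the full
# file, then project and filter; duckdb's projection pushdown means the
# INFO column is never kept in memory.
query = r"""
    SELECT idx, "ID"
    FROM (
        SELECT row_number() OVER () - 1 AS idx, "ID"
        FROM read_csv_auto('example.pvar', delim='\t', header=true)
    )
    WHERE "ID" IN ('rs1', 'rs2', 'rs3', 'rs4', 'rs5')
"""
rows = duckdb.sql(query).fetchall()
```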

salcc (Member) commented Nov 21, 2024

Thanks for investigating this! We tried to optimize for the reading speed and memory usage of the SNPObject, but haven't done many tests on the memory usage of reading itself, so some improvements might be possible. However, we would not like to compromise speed since it might often be preferable to use more memory if the processing is faster, for example, when working in a cluster with a lot of memory and processing whole biobanks.

We have not used duckdb in the past, and the fourth Google result for duckdb is https://www.reddit.com/r/dataengineering/comments/zp6ai6/what_are_the_actual_use_cases_for_duckdb_when_you/ which says that it's very slow compared to polars, but the post is 2 years old, so things might have changed. However, if you believe that using duckdb or any other modification can improve memory usage (without compromising speed, or even improving speed), we are happy to accept your contributions!

salcc (Member) commented Nov 21, 2024

I hadn't read the edit before. I understand that memory usage is improved as a tradeoff of the reading speed. If you believe the improvements are considerable, we could consider having two backends, one with only polars and one integrating duckdb. This is actually similar to what we did for the VCF reader, since VCF files are effectively CSV/TSV too (what we did there is to use our own polars reader by default, but the user can also choose to use scikit-allel as an alternative backend, which tends to be slower but more memory-efficient).

kscott-1 (Contributor, Author)

Understood, it is good to know the overall goals. I think the polars solution is clearly best if speed is paramount. I think more testing needs to be done before making any serious changes, but it is something to think about.

For the time being, I will clean up this PR into a single commit that takes care of the issues I found with .zst files and improves the header-reading logic. .zst files need to be read in with a plain read_csv rather than the scan; a sketch of that fallback is below.
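
A minimal sketch of the fallback, assuming recent polars (read_csv can decompress zstd in memory, while scan_csv cannot scan compressed files); the path handling is illustrative:

```python
import polars as pl

def scan_pvar(path: str) -> pl.LazyFrame:
    """Return a LazyFrame over a pvar, falling back to an eager read for .zst."""
    kwargs = dict(separator="\t", comment_prefix="#", has_header=False)
    if path.endswith(".zst"):
        # Compressed files can't be scanned lazily; read eagerly (polars
        # decompresses zstd in memory) and wrap the result so the rest of
        # the pipeline stays lazy.
        return pl.read_csv(path, **kwargs).lazy()
    return pl.scan_csv(path, **kwargs)
```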

* row indices create an issue with lazy execution: any function polars provides to add indices requires the lf be read into memory.
* instead of reading the entire lf into memory, we track indices by using the variant ID.
* we can determine a blank header from the number of columns in a pvar file - pvar spec allows only for 5/6 cols when header is blank.
* the order of the control block for filtering variants provided an issue where the entire lf was read in the case that variant_idxs were not specified, even if variant_ids was.
* fix the ordering logic into one if/elif/else block and copy the structure to the psam section.

Signed-off-by: Kyle Scott <[email protected]>
@kscott-1 kscott-1 marked this pull request as ready for review November 21, 2024 21:09
@salcc salcc self-assigned this Nov 21, 2024
@salcc salcc self-requested a review November 21, 2024 21:26
Review thread on snputils/snp/io/read/pgen.py (resolved)
@salcc salcc merged commit 5dd876b into AI-sandbox:main Nov 21, 2024
4 checks passed
@kscott-1 kscott-1 deleted the dev branch November 22, 2024 13:06
salcc (Member) commented Nov 22, 2024

We released v0.2.10, which includes this enhancement and some other bug fixes!
