Sequential VCF parsing #94

Merged: 10 commits, Jan 23, 2024

Conversation

jeromekelleher
Collaborator

Sticking a prototype of a sequential VCF parsing method here for experimentation.

Currently, this is parsing the 1M sample VCF at a rate of about 50-60 variants per second, which gives about 36 hours for the whole thing (which is totally acceptable). Memory usage seems very predictable, and in line with what you would expect. CPU usage averages about 150%, with some peaks when the chunks are being flushed. So, the bottleneck is the main thread, where we're moving information from the decoded VCF record (done in a background thread) into the numpy buffers. The main question then, I guess, is how much this will be slowed down by adding all the necessary complexity of INFO fields etc.
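To sketch the shape of this (not the actual code in this PR): a background thread drives cyvcf2 over the file and hands batches of records to the main thread, which copies genotypes into a pre-allocated numpy chunk buffer and flushes it. The chunk size, the GT-only diploid layout and `flush_chunk` are illustrative placeholders.

```python
# Minimal sketch only: a background thread drives cyvcf2, the main thread
# copies genotypes into numpy chunk buffers. Assumes diploid calls; chunk
# size, the GT-only layout and flush_chunk are illustrative placeholders.
import queue
import threading

import numpy as np
from cyvcf2 import VCF


def flush_chunk(gt_chunk):
    # Placeholder: the real code writes the finished chunk out (e.g. to Zarr).
    pass


def decode_records(path, q, chunk_size):
    # Background thread: cyvcf2/htslib does the record decoding here.
    vcf = VCF(path)
    batch = []
    for variant in vcf:
        batch.append(variant)
        if len(batch) == chunk_size:
            q.put(batch)
            batch = []
    if batch:
        q.put(batch)
    q.put(None)  # sentinel: no more records


def parse_sequential(path, num_samples, chunk_size=1000):
    q = queue.Queue(maxsize=4)
    t = threading.Thread(target=decode_records, args=(path, q, chunk_size))
    t.start()
    gt_buffer = np.zeros((chunk_size, num_samples, 2), dtype=np.int8)
    while True:
        batch = q.get()
        if batch is None:
            break
        for j, variant in enumerate(batch):
            # Main-thread bottleneck: move data from the decoded record
            # into the numpy buffer (dropping the phasing column).
            gt_buffer[j] = np.asarray(variant.genotypes, dtype=np.int8)[:, :2]
        flush_chunk(gt_buffer[: len(batch)])
    t.join()
```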

@benjeffery, do you think you could try this out on your ugly VCFs, and see how it goes? Maybe hard-code in a few extra fields to see how it performs when we put in more?

@jeromekelleher
Collaborator Author

Note: opened an issue about the num_records thing here: brentp/cyvcf2#294

@jeromekelleher
Collaborator Author

If this seems to work reasonably well, I would consider trying to contribute GIL-dropping methods upstream that decode GTs and other large arrays into a supplied numpy array. This would allow us to do that decoding in background threads as well.
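To illustrate the pattern (the decoder name here is hypothetical; no such method exists in cyvcf2 today): the per-variant copy into a caller-supplied array would be farmed out to a thread pool, and would only genuinely overlap if that decode released the GIL.

```python
# Sketch of the pattern only. copy_gt stands in for a hoped-for GIL-dropping
# upstream decoder (something like a hypothetical decode_genotypes_into(out));
# with current cyvcf2 the body below still holds the GIL, so the threads
# would not actually run in parallel.
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def copy_gt(variant, out_row):
    # Today: build a Python list and copy it. The proposal is to fill
    # out_row directly in C with the GIL released.
    out_row[:] = np.asarray(variant.genotypes, dtype=np.int8)[:, :2]


def fill_chunk_threaded(variants, gt_buffer, pool: ThreadPoolExecutor):
    futures = [pool.submit(copy_gt, v, gt_buffer[j]) for j, v in enumerate(variants)]
    for f in futures:
        f.result()
```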

@benjeffery
Contributor

Will try this out now!

@jeromekelleher
Collaborator Author

Update.

I've added support here for working on "partitioned" VCFs, where a chromosome has been split into contiguous parts. These are consumed in parallel processes. It seems to work quite well on the simulated data I have.
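Roughly, the partitioned mode looks like the sketch below (illustrative only; `convert_part` is a placeholder for the per-partition sequential parser, and the real code coordinates where each partition writes its output).

```python
# Illustrative sketch: one process per contiguous part of the chromosome,
# each running the sequential parser. convert_part is a placeholder.
from concurrent.futures import ProcessPoolExecutor


def convert_part(part_path, part_index):
    # Run the sequential parser over this part, writing its chunks into the
    # slots this partition owns in the shared output.
    ...


def convert_partitioned(part_paths, max_workers=None):
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(convert_part, path, i) for i, path in enumerate(part_paths)
        ]
        for f in futures:
            f.result()
```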

@benjeffery is testing these on some real world horror shows.

We're also looking at how it performs on real-world full chromosome VCFs for 200k samples.

@jeromekelleher
Collaborator Author

Great idea from @benjeffery: use a shared memory mutex on the first and last chunks of each partition rather than running odds and evens separately. This will improve parallelism when the number of chunks is < 100.
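One way to picture it (illustrative only, using multiprocessing Manager locks rather than raw shared memory): each boundary between adjacent partitions gets its own lock, and a partition only takes that lock when flushing its first or last chunk, so interior chunks flush lock-free and all partitions can run concurrently instead of scheduling odds and evens in separate passes.

```python
# Sketch of the boundary-locking idea; the lock type and function names are
# assumptions, not the code in this repo. Assumes each partition has at
# least two chunks, so a chunk touches at most one boundary.
import multiprocessing as mp


def make_boundary_locks(num_partitions, manager):
    # One lock per boundary between partition i and partition i + 1.
    # Manager locks are picklable proxies, so they can be shipped to workers.
    return [manager.Lock() for _ in range(num_partitions - 1)]


def flush_with_boundary_lock(part_index, chunk_index, num_chunks, locks, write_chunk):
    # Only a partition's first and last chunks can be shared with a
    # neighbour; everything in between flushes without locking.
    lock = None
    if chunk_index == 0 and part_index > 0:
        lock = locks[part_index - 1]
    elif chunk_index == num_chunks - 1 and part_index < len(locks):
        lock = locks[part_index]
    if lock is None:
        write_chunk()
    else:
        with lock:
            write_chunk()


def example_setup(num_partitions):
    # Created in the parent process before the worker pool starts.
    manager = mp.Manager()
    return make_boundary_locks(num_partitions, manager)
```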

@jeromekelleher
Collaborator Author

Some experiments on UKB WGS data (200k samples), single large VCFs:

  • chr16: 55 it/s, estimated 65 hours
  • chr2: estimated 113 hours

@benjeffery
Contributor

benjeffery commented Jan 22, 2024

Some data points with GeL data:
3.7TB of chr20 VCF split into 28 parts, parsed with this code (which implies a max parallelism of 14): ~7 hours (estimate, run still ongoing), at ~600 it/s.

Wow!

@jeromekelleher
Collaborator Author

After discussing this on the community call yesterday, we're going to move this into sgkit. I'm going to merge this here for now as a useful reference; we can delete it later.

@jeromekelleher jeromekelleher merged commit 35f0a5b into sgkit-dev:main Jan 23, 2024
1 check passed
@jeromekelleher jeromekelleher deleted the sequential-vcf branch January 23, 2024 11:03