Two-pass non-Dask VCF conversion #1185

Closed
jeromekelleher wants to merge 45 commits

jeromekelleher
Collaborator

Very much WIP - not ready for review!

@jeromekelleher
Collaborator Author

I've just added a basic PLINK conversion approach, which converts HAPNEST chr21 in about 20 minutes (6 workers, 8 encode threads per worker, and a maximum of about 40 GB of RAM per worker). It's chugging through chr2 in what looks like linearly scaling time, so something on the order of an hour. Watching with Linux perf, the vast majority of the time is spent on Blosc encoding and compressing the chunks.
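For readers unfamiliar with the "encode threads per worker" layout mentioned above, here is a minimal sketch of what one worker's encode step could look like. It is not the code in this PR: `split_into_chunks` and `encode_chunks` are hypothetical names, the input is made-up data, and `zlib` from the standard library stands in for Blosc so the example has no dependencies.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(buffer: bytes, chunk_size: int) -> list[bytes]:
    """Cut a flat byte buffer into fixed-size chunks (the last may be shorter)."""
    return [buffer[i:i + chunk_size] for i in range(0, len(buffer), chunk_size)]


def encode_chunks(chunks: list[bytes], encode_threads: int = 8) -> list[bytes]:
    """Compress each chunk on a thread pool, preserving chunk order.

    zlib is a stand-in for Blosc here (an assumption for illustration only).
    CPython's zlib releases the GIL on large buffers, so threads can overlap.
    """
    with ThreadPoolExecutor(max_workers=encode_threads) as pool:
        return list(pool.map(zlib.compress, chunks))


# Hypothetical genotype-like data: a repetitive byte pattern that compresses well.
genotypes = bytes([0, 1, 1, 2] * 250_000)
chunks = split_into_chunks(genotypes, chunk_size=100_000)
encoded = encode_chunks(chunks, encode_threads=8)

# Round-trip to confirm the chunks decode back to the original buffer.
decoded = b"".join(zlib.decompress(c) for c in encoded)
```

In a multi-worker setup, each worker process would presumably run something like `encode_chunks` on its own partition of the data, which is why per-worker memory scales with the number of chunks held in flight.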

In contrast, using the existing plink_to_zarr function, Dask seems to sit there thinking about the task graph for several minutes before doing anything useful (and emits an opaque and unhelpful warning for users who just want to convert their data). Looking at perf, the time seems to be mostly spent in NumPy operations, with Blosc encoding coming much further down (although it's using lz4, so it's not a like-for-like comparison).

I'll update when it finishes to give the overall timing.

@jeromekelleher
Collaborator Author

Update - it failed after about an hour with a bunch of completely cryptic messages.

@jeromekelleher
Collaborator Author

Closing as development has moved to https://github.com/jeromekelleher/bio2zarr
