Investigate using bgzipped and indexed fasta as alternative to 2bit, and standardize on a standard reference format for spark tools #1718

Closed
2 tasks
akiezun opened this issue Apr 17, 2016 · 7 comments

Comments

@akiezun
Contributor

akiezun commented Apr 17, 2016

the 2bit reference is a bit of a pain to deal with. #1580 shows that, size-wise, a bgzipped fasta is comparable.

This ticket is to:

  • establish whether we can use a bgzipped+indexed fasta as a reference (htsjdk support and speed)
  • if it works for walkers, evaluate performance of distributing the reference+index to spark executors and using it as a replacement for 2bit
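
For context on what "indexed" buys here: a `.fai` index stores, per contig, the name, length, byte offset of the first base, bases per line, and bytes per line. A minimal sketch of how a random-access lookup can turn a 0-based position into a byte offset in the uncompressed fasta (the helper names `parse_fai_line` and `base_offset` are hypothetical, not htsjdk API):

```python
from typing import NamedTuple


class FaiEntry(NamedTuple):
    name: str
    length: int      # number of bases in the contig
    offset: int      # byte offset of the contig's first base in the uncompressed fasta
    line_bases: int  # bases per line
    line_width: int  # bytes per line, including the line terminator


def parse_fai_line(line: str) -> FaiEntry:
    """Parse one tab-separated .fai record (first five columns)."""
    name, length, offset, line_bases, line_width = line.rstrip("\n").split("\t")[:5]
    return FaiEntry(name, int(length), int(offset), int(line_bases), int(line_width))


def base_offset(entry: FaiEntry, pos0: int) -> int:
    """Byte offset of the 0-based position pos0 within the uncompressed fasta."""
    if not 0 <= pos0 < entry.length:
        raise ValueError(f"position {pos0} is outside contig {entry.name}")
    full_lines, within_line = divmod(pos0, entry.line_bases)
    return entry.offset + full_lines * entry.line_width + within_line
```

With a bgzipped fasta this offset refers to the uncompressed stream, so a second index (the `.gzi`) is needed to map it to a compressed block.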
@SHuang-Broad
Contributor

@akiezun, when you say "reference+index" do you mean .fai only or other index files like the ones required/generated by bwa like .bwt index files?

@droazen
Contributor

droazen commented Apr 19, 2016

@akiezun Since 2bit format has a more compact in-memory representation than fasta (~4x smaller), you'd need to document the effects on Spark memory usage as well.
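
To illustrate where the ~4x figure comes from: a fasta held as bytes uses 8 bits per base, while a 2-bit encoding fits four bases into each byte. A rough sketch, assuming the UCSC .2bit code assignment (T=0, C=1, A=2, G=3) and ignoring N runs and soft-masking, which the real format tracks separately (`pack_2bit` and `unpack_2bit` are hypothetical helpers):

```python
CODES = {"T": 0, "C": 1, "A": 2, "G": 3}  # assumed UCSC .2bit ordering
BASES = "TCAG"


def pack_2bit(seq: str) -> bytes:
    """Pack an ACGT-only sequence into 2 bits per base, first base in the high bits."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        out[i // 4] |= CODES[base] << (6 - 2 * (i % 4))
    return bytes(out)


def unpack_2bit(packed: bytes, n_bases: int) -> str:
    """Inverse of pack_2bit for the first n_bases bases."""
    return "".join(
        BASES[(packed[i // 4] >> (6 - 2 * (i % 4))) & 0b11] for i in range(n_bases)
    )
```

A 10-base sequence packs into 3 bytes instead of 10, which approaches the 4x ratio as sequences get longer.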

@akiezun
Contributor Author

akiezun commented Apr 19, 2016

it would be a whole bunch of work though. htsjdk knows about neither bgzipped fastas nor gzi index files. I'm looking for a way to estimate the benefits without doing the work yet. Suggestions welcome.
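
For a sense of the scope of the gzi part: as described in the bgzip documentation, a `.gzi` file is a little-endian uint64 entry count followed by that many (compressed offset, uncompressed offset) uint64 pairs. A rough sketch of reading one and locating the block that holds a given uncompressed offset, assuming the first block at (0, 0) is implicit and not stored (`parse_gzi` and `find_block` are hypothetical names, not htsjdk API):

```python
import bisect
import struct
from typing import List, Tuple


def parse_gzi(data: bytes) -> List[Tuple[int, int]]:
    """Return the (compressed_offset, uncompressed_offset) pairs stored in a .gzi."""
    (count,) = struct.unpack_from("<Q", data, 0)
    return [struct.unpack_from("<QQ", data, 8 + 16 * i) for i in range(count)]


def find_block(entries: List[Tuple[int, int]], uoffset: int) -> Tuple[int, int]:
    """Map an uncompressed offset to (compressed block start, offset within the block)."""
    # Assumption: the block starting at compressed/uncompressed offset 0 is implicit.
    compressed_starts = [0] + [c for c, _ in entries]
    uncompressed_starts = [0] + [u for _, u in entries]
    i = bisect.bisect_right(uncompressed_starts, uoffset) - 1
    return compressed_starts[i], uoffset - uncompressed_starts[i]
```

Combined with a `.fai` lookup, this gives the compressed block to seek to and how far to skip after inflating it.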

@akiezun akiezun added the Spark label Apr 21, 2016
@akiezun
Contributor Author

akiezun commented May 3, 2016

Moving this past alpha2 - requires substantial development in htsjdk

@droazen
Contributor

droazen commented Jun 27, 2016

Reassigning to @jamesemery

@droazen droazen assigned jamesemery and unassigned akiezun Jun 27, 2016
@droazen droazen changed the title from "Investigate using bgzipped and indexed fasta as alternative to broadcast" to "Investigate using bgzipped and indexed fasta as alternative to 2bit, and standardize on a standard reference format for spark tools" Jun 27, 2016
@droazen droazen added this to the alpha-3 milestone Jun 27, 2016
@droazen
Contributor

droazen commented Aug 3, 2016

Untagging for alpha-3, since we're going to try #2074 first as a cheaper alternative. Keeping this as an alpha3-candidate

@droazen droazen removed this from the alpha-3 milestone Aug 3, 2016
@droazen droazen added this to the 4.0 release milestone Mar 22, 2017
@tomwhite tomwhite removed their assignment Jun 13, 2017
@droazen droazen removed this from the Engine-4.0 milestone Oct 17, 2017
@tomwhite
Contributor

We no longer require 2bit for Spark: #5127
