Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bundle the whole HG19 and HG38 reference in git lfs for tests #5111

Closed
SHuang-Broad opened this issue Aug 14, 2018 · 15 comments
Closed

Bundle the whole HG19 and HG38 reference in git lfs for tests #5111

SHuang-Broad opened this issue Aug 14, 2018 · 15 comments

Comments

@SHuang-Broad
Copy link
Contributor

Feature request

Tool(s) or class(es) involved

SV pipeline, Funcotator, etc.

Description

In trying to build test data for SV, time and time again we face the problem of not being able to find actual desired events on the two chromosomes 20 and 21, hence end up having to painfully perform all kinds of coordinate hacks in order to have enough test coverage.

It seems that the Funcotator team is also facing a similar issue.

Therefore it will be great if the whole reference genome for HG38, and maybe HG19 as well, can be included in the tests, so that tool developers spend less time worrying about hassles in moving real events to chr20 and chr21.

One of the potential downside is obvious: it increases the repo size and time for running tests (downloading a bigger file) on Travis.

@magicDGS
Copy link
Contributor

Now that htsjdk has support for indexed block-compressed FASTA, maybe the best approach is to add support also in GATK to allow this test resource to be smaller (and at the same time, check integration with other tools).

@droazen - is there any plan to include support for bgzip FASTA in GATK soon? I can take that as a small project if you are interested, but I should plan it somehow to be sure about the rettirements in GATK to support them.

@droazen
Copy link
Contributor

droazen commented Aug 16, 2018

@magicDGS @lbergelson has a branch coming soon that adds GATK support for fasta.gz, I believe. We definitely plan to check in the references in compressed form, as they are ~4x smaller!

@droazen droazen changed the title Investigate possibility of bundling the whole HG19 and HG38 reference in tests Bundle the whole HG19 and HG38 reference in git lfs for tests Sep 21, 2018
@droazen
Copy link
Contributor

droazen commented Sep 21, 2018

Now that the fasta.gz support has been merged, this ticket is unblocked. As a first step, we need to select exactly which references go in.

HG38 is easy -- I'd suggest we just use the official copy in the Broad filesystem at /seq/references/Homo_sapiens_assembly38/v0/Homo_sapiens_assembly38.fasta, and make our compressed version from that.

HG19 is more complicated, since we need to choose between several different variations (like "b37"). The official copy in the Broad filesystem at /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta appears to actually be b37 going by the contig naming convention ("1" vs. "chr1"). It's worth noting that most of our existing tests are written against a subsetted version of b37 in large (human_g1k_v37.20.21.fasta). @jonn-smith and @SHuang-Broad please weigh in as to whether a b37 reference would be ok, or whether you require a different HG19 variant. I don't think we can afford to check in more than one.

@SHuang-Broad
Copy link
Contributor Author

SHuang-Broad commented Sep 21, 2018

I'm OK with the HG38 only, considering that we are evaluating against HG38, unless other SV team members have different opinions.
Also, it's OK to remove those HLA stuff, since we don't evaluate SV calls on them, AND they can reck havoc for constructor SimpleInterval(String).

@droazen
Copy link
Contributor

droazen commented Sep 21, 2018

I think we'll definitely want to keep the HLA contigs, as they are great for finding bugs in our parsing code :)

@SHuang-Broad
Copy link
Contributor Author

Even better!

@jonn-smith
Copy link
Collaborator

We ultimately need both an "hg19" reference and HG38.

For "hg19" I'm fine with B37 - it is almost completely equivalent with respect to the sequence (only with different contig names).

@jonn-smith
Copy link
Collaborator

FYI, for me getting a B37 reference checked in is currently a higher priority.

@sooheelee
Copy link
Contributor

The full GRCh38 analysis set allows us to test allosomal contig contexts, e.g. as presented by DetermineGermlineContigPloidy.

@lbergelson
Copy link
Member

I'm opposed to including 2 entire references since it will raise our git lfs files to somewhere around 5gb. This is a significant drag on downloading / building / testing gatk and should be avoided if possible. I understand that I may be overruled here, but keeping the test files to a reasonable size was and should remain an important goal of gatk4.

It looks like there may be some options to slim down the existing test files that we should take advantage of if possible. There are a number of large vcfs and fasta files which are NOT currently compressed in our large files. We should compress them.

@lbergelson
Copy link
Member

A rough estimate of reduction from compressing uncompressed vcf and fasta files in our test data is that we'll go from ~2.8G in the large folder currently -> 2.1G. We may not want to compress everything, since we probably want tests on uncompressed files as well, but it would be a good thing to look into.

@droazen
Copy link
Contributor

droazen commented Sep 21, 2018

@lbergelson Not having these full references available is a significant drag on development, has wasted massive amounts of both Jonn's and Steve's time (and others too), and resulted in inferior tests compared to what we could have. I think this outweighs the other considerations you mention.

I think that we can afford the extra lfs usage purely from a quota perspective given that we've just cut total usage in ~half. Removing or compressing some existing files should help quite a bit as well, as you suggested.

@lbergelson
Copy link
Member

Can't we get away with just the hg38 one?

@droazen
Copy link
Contributor

droazen commented Sep 21, 2018

@lbergelson We can't, unfortunately -- in order to write good Funcotator tests we need b37/hg19 as well.

@jonn-smith
Copy link
Collaborator

jonn-smith commented Sep 21, 2018

I've created a regression test corpus for Funcotator that exercises all of the variant classifications that it can produce, but to do so I pulled from data sets that I had already been annotating. As such, it spans more than the chromosomes that are checked in as references for Funcotator tests already. These are also HG19 (really, B37) variants, because Oncotator can't do HG38 and I needed ground truth. So I need the B37 reference to go in so that these regression tests can be run.

It will also preemptively solve the issue of needing to add unit tests for variants outside of the "supported testing regions" in Funcotator (where "supported testing regions" are loci supported by references in the tests) . This would be for when we find a problem variant in user data and need to add a new unit test. Anecdotally, I have found cases like this in some of the germline data that I've been running.

On the plus side, adding in a complete HG19 reference will allow me to delete the references checked in for my unit tests. It won't be much saved space, but it will be some (~120Mb).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants