Bundle the whole HG19 and HG38 reference in git lfs for tests #5111
Comments
Now that htsjdk has support for indexed block-compressed FASTA, maybe the best approach is to add support in GATK as well, so that this test resource can be smaller (and, at the same time, to check integration with other tools). @droazen - is there any plan to include support for bgzip FASTA in GATK soon? I can take that on as a small project if you are interested, but I would need to plan it to be sure about the requirements in GATK to support them.
@magicDGS @lbergelson has a branch coming soon that adds GATK support for
Now that the HG38 is easy -- I'd suggest we just use the official copy in the Broad filesystem at HG19 is more complicated, since we need to choose between several different variations (like "b37"). The official copy in the Broad filesystem at
I'm OK with the HG38 only, considering that we are evaluating against HG38, unless other SV team members have different opinions.
I think we'll definitely want to keep the HLA contigs, as they are great for finding bugs in our parsing code :)
Even better!
We ultimately need both an "hg19" reference and HG38. For "hg19" I'm fine with B37 - it is almost completely equivalent with respect to the sequence (only with different contig names).
FYI, for me getting a B37 reference checked in is currently a higher priority.
The full GRCh38 analysis set allows us to test allosomal contig contexts, e.g. as presented by DetermineGermlineContigPloidy.
I'm opposed to including 2 entire references since it will raise our git lfs files to somewhere around 5 GB. This is a significant drag on downloading / building / testing gatk and should be avoided if possible. I understand that I may be overruled here, but keeping the test files to a reasonable size was and should remain an important goal of gatk4. It looks like there may be some options to slim down the existing test files that we should take advantage of if possible. There are a number of large vcf and fasta files among our large files which are NOT currently compressed. We should compress them.
A rough estimate of the reduction from compressing the uncompressed vcf and fasta files in our test data is that the large folder would go from ~2.8 GB currently to ~2.1 GB. We may not want to compress everything, since we probably want tests on uncompressed files as well, but it would be a good thing to look into.
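An estimate like this can be reproduced with a short script. The sketch below is a hypothetical helper, not part of GATK: it walks a directory, streams each uncompressed vcf/fasta file through a gzip compressor, and sums the original and compressed sizes. The suffix list and function names are assumptions, and plain gzip is used only as a stand-in for bgzip, whose block structure gives slightly different sizes.

```python
import os
import zlib

# Suffixes treated as "uncompressed" test data; an illustrative assumption.
UNCOMPRESSED_SUFFIXES = (".vcf", ".fasta", ".fa")


def gzipped_size(path, chunk_size=1 << 20):
    """Return the size in bytes that `path` would have after gzip compression.

    The file is streamed in chunks, so large references are never held
    fully in memory and nothing is written to disk.
    """
    compressor = zlib.compressobj(level=6, wbits=31)  # wbits=31 selects the gzip container
    total = 0
    with open(path, "rb") as src:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            total += len(compressor.compress(chunk))
    total += len(compressor.flush())
    return total


def estimate_savings(root):
    """Return (original_bytes, compressed_bytes) summed over uncompressed files under `root`."""
    original = 0
    compressed = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Already-compressed files such as foo.vcf.gz do not match these suffixes.
            if name.endswith(UNCOMPRESSED_SUFFIXES):
                path = os.path.join(dirpath, name)
                original += os.path.getsize(path)
                compressed += gzipped_size(path)
    return original, compressed
```

Calling `estimate_savings(...)` on the directory that holds the large test files returns the two totals; their difference is the approximate space that compression would free, before deciding which files should stay uncompressed for test coverage.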
@lbergelson Not having these full references available is a significant drag on development, has wasted massive amounts of both Jonn's and Steve's time (and others too), and resulted in inferior tests compared to what we could have. I think this outweighs the other considerations you mention. I think that we can afford the extra lfs usage purely from a quota perspective given that we've just cut total usage in ~half. Removing or compressing some existing files should help quite a bit as well, as you suggested.
Can't we get away with just the hg38 one?
@lbergelson We can't, unfortunately -- in order to write good Funcotator tests we need b37/hg19 as well.
I've created a regression test corpus for Funcotator that exercises all of the variant classifications that it can produce, but to do so I pulled from data sets that I had already been annotating. As such, it spans more than the chromosomes that are already checked in as references for Funcotator tests. These are also It will also preemptively solve the issue of needing to add unit tests for variants outside of the "supported testing regions" in Funcotator (where "supported testing regions" are loci supported by references in the tests). This would be for when we find a problem variant in user data and need to add a new unit test. Anecdotally, I have found cases like this in some of the germline data that I've been running. On the plus side, adding a complete HG19 reference will allow me to delete the references checked in for my unit tests. It won't save much space, but it will save some (~120 MB).
Feature request
Tool(s) or class(es) involved
SV pipeline, Funcotator, etc.
Description
In trying to build test data for SV, time and time again we face the problem of not being able to find the desired events on chromosomes 20 and 21, and hence end up having to painfully perform all kinds of coordinate hacks in order to have enough test coverage.
It seems that the Funcotator team is also facing a similar issue.
Therefore it would be great if the whole reference genome for HG38, and maybe HG19 as well, could be included in the tests, so that tool developers spend less time on the hassle of moving real events to chr20 and chr21.
One potential downside is obvious: it increases the repo size and the time for running tests (downloading a bigger file) on Travis.