-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LocusDepthtoBAF tool #7776
LocusDepthtoBAF tool #7776
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @tedsharpe this is going to be a major QOL improvement for users, who sometimes spend weeks wrangling SNP calls from various sources.
Mostly minor comments below. How much larger at the LD files after adding the ref / alt indices? These files can live in nearline storage for years, so minimizing their size is important, even if it means slowing down LD -> BAF conversion a bit. It probably compresses out well though.
On a related note, sometimes projects will mix/match samples. It may happen that someone would try to combine LD files generated with different site lists. This is not an issue with PE/SR which are collected genome-wide nor for counts where we enforce a check on the bin intervals. Is it possible to add something to the LD header to ensure consistency? Maybe a hash on the sites list would be sufficient? Or better yet an explicit check that the input LD sites are consistent across all samples? I believe all biallelic snp sites are emitted in CollectSVEvidence.
Edit: we also need to check this against the current pipeline, ie using HaplotypeCaller to call SNPs. I suspect the chi squared model will be good enough, but we should do a head-to-head on at least one sample before merging this.
@@ -24,6 +24,14 @@ public BafEvidence( final String sample, final String contig, | |||
this.value = value; | |||
} | |||
|
|||
// value-altering constructor | |||
public BafEvidence( final BafEvidence that, final double value ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add a quick unit test for this?
@@ -11,25 +12,36 @@ public final class LocusDepth implements SVFeature { | |||
private final String contig; | |||
private final int position; | |||
private final String sample; | |||
private final byte refCall; // index into nucleotideValues | |||
// next three fields index "ACGT" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you create an enum type for this?
public final static String BCI_VERSION = "1.0"; | ||
|
||
public LocusDepth( final Locatable loc, final String sample, final byte refCall ) { | ||
public LocusDepth( final Locatable loc, final String sample, final int refIndex, final int altIndex ) { | ||
Utils.validateArg(refIndex >= 0 && refIndex <= 3, "refIndex must be between 0 and 3"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you feel about Utils.nonNull()
as well for loc
and sample
? I tend to use them where I can in constructors for clarity. You can even combine it with the initialization like this.sample = Utils.nonNull(sample);
|
||
/** | ||
* <p>Merges locus-sorted LocusDepth evidence files, and calculates the bi-allelic frequency for each | ||
* sample and site, and writes these values as a BafEvidence output file.</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a description of the filtering methodology
final double refDiff = locusDepth.getRefDepth() - expectRefAlt; | ||
final double altDiff = locusDepth.getAltDepth() - expectRefAlt; | ||
double chiSq = (refDiff * refDiff + altDiff * altDiff) / expectRefAlt; | ||
double fitProb = 1. - chiSqDist.cumulativeProbability(chiSq); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment that this is a Pearson chi-squared test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
double fitProb = 1. - chiSqDist.cumulativeProbability(chiSq); | |
final double fitProb = 1. - chiSqDist.cumulativeProbability(chiSq); |
if ( fitProb < minHetLikelihood ) { | ||
return 0.; | ||
} | ||
return (double)locusDepth.getAltDepth() / totalDepth; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return (double)locusDepth.getAltDepth() / totalDepth; | |
return locusDepth.getAltDepth() / (double) totalDepth; |
I can never remember if casts get applied before division
@@ -694,14 +695,16 @@ private boolean readNextLocus() { | |||
return false; | |||
} | |||
VariantContext snp = snpSourceItr.next(); | |||
while ( !snp.isSNP() ) { | |||
while ( !snp.isSNP() || !snp.isBiallelic() ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's future-proof this and allow multiallelic by default, but have a command line flag to restrict only to biallelic
1 1000 smpl1 0 3 4 0 0 6 | ||
1 1000 smpl2 0 3 5 0 0 5 | ||
1 1000 smpl3 0 3 6 0 0 4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a few more records at other loci, including some on another contig?
shortName = StandardArgumentDefinitions.FEATURE_SHORT_NAME ) | ||
private List<FeatureInput<LocusDepth>> locusDepthFiles; | ||
|
||
@Argument(fullName = SAMPLE_NAMES_NAME, doc = "Sample names", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you elaborate on what this does in the doc? Is it used for subsetting?
@Argument( | ||
doc = "Minimum likelihood of site being biallelic and heterozygous", | ||
fullName = MIN_HET_LIKELIHOOD, optional = true ) | ||
private double minHetLikelihood = .5; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a probability instead of likelihood
I believe that this is ready for a re-review. I think all comments were addressed. |
Codecov Report
@@ Coverage Diff @@
## master #7776 +/- ##
================================================
+ Coverage 18.644% 79.577% +60.933%
- Complexity 4635 34845 +30210
================================================
Files 1261 2214 +953
Lines 73745 173540 +99795
Branches 11768 18736 +6968
================================================
+ Hits 13749 138098 +124349
+ Misses 57944 29026 -28918
- Partials 2052 6416 +4364
|
for ( final int altIndex : altIndices ) { | ||
double baf = calcBAF(ld, refIndex, altIndex); | ||
if ( baf != BafEvidence.MISSING_VALUE ) { | ||
beList.add(new BafEvidence(ld.getSample(), ld.getContig(), ld.getStart(), baf)); | ||
break; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In some cases this will lead to bias. I didn't think through some of the problems multi-allelics are going to cause us with compatibility here. Apologies for flip-flopping on this, but let's stick to bi-allelic for both LD collection and BAF generation. Regardless, it seems like this implementation would be problematic in some edge cases. I think we will need to either provide the sites list to this tool, or go back to storing the ref/alt alleles in LD (which would be fine, probably simpler and faster).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove the multi-allelic option, which kind of weirded me out anyway.
However, I'm confused by the last part of your comment: we do provide the sites list to this tool. That's where I'm getting the ref/alt call info, since it's no longer in the LD records. (See lines 210-223.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry just realized I missed that, all good to just rip out the multi-allelics.
No description provided.