-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added optional mappability and segmental-duplication annotation to AnnotateIntervals. #5162
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @samuelklee! I had a few questions about the tracks themselves and scoring logic that I left in the comments
private FeatureInput<BEDFeature> mappabilityTrackPath; | ||
|
||
@Argument( | ||
doc = "Path to segmental-duplication track in .bed or .bed.gz format (see https://bismap.hoffmanlab.org/). " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't find the segmental-duplication track on that page, only mappability one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, yup---forgot to remove the link after copying/pasting.
reference = ReferenceDataSource.of(referenceArguments.getReferencePath()); //the GATKTool ReferenceDataSource is package-protected, so we cannot access it directly | ||
features = new FeatureManager( //the GATKTool FeatureManager is package-protected, so we cannot access it directly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FeatureManager is public, I'm confused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The FeatureManager
class is public, but the instance field in the GATKTool
class is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see
@@ -150,4 +258,59 @@ public Double apply(final Locatable interval, | |||
return totalCount == 0 ? Double.NaN : gcCount / (double) totalCount; | |||
} | |||
} | |||
|
|||
/** | |||
* If scores are provided, intervals will be annotated with the length-weighted average; scores may not be NaN. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
scores may not be NaN.
It seems that you do allow scores to take NaN value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is supposed to convey that scores in the input tracks may not be NaN. However, they may be NaN in the output. (Actually, I'm not sure what happens in the former case, though---I'll add some exception throwing if needed and a corresponding test.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, you're right, I think the doc/code is inconsistent. Right now it will take a provided NaN and turn it into a 1.
I think the issue is that an explicit NaN
in the track is the same as a missing score---both are represented as NaN
. However, we want the former to be NaN
(or throw) and the latter to be 1.
Since it's unlikely that input tracks will actually have NaN
, I'll update the documentation to indicate that these will be taken as 1. Not ideal, but I don't see an easy way around it.
} | ||
} | ||
|
||
public static class MappabilityAnnotator extends BEDLengthWeightedAnnotator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Access could be private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, done for all.
final FeatureContext featureContext) { | ||
final int intervalLength = interval.getLengthOnReference(); | ||
if (intervalLength == 0) { | ||
return Double.NaN; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we just throw an exception here? or a warning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a check in the traversal loop.
if (intervalLength == 0) { | ||
return Double.NaN; | ||
} | ||
double lengthWeightedSum = 0.; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to assign a score of 0 if a region is not in the mappability track?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is what we indicate in the documentation.
*/ | ||
enum AnnotatedIntervalTableColumn { | ||
CONTIG, | ||
START, | ||
END, | ||
GC_CONTENT; | ||
END; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this enum be now moved up to AbstractLocatableCollection
and shared with SimpleIntervalCollection
since it's the same minimal header?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AllelicCountCollection
doesn't have the same minimal header (it has a single POSITION
= START
= END
header instead).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I didn't see that AllelicCountCollection
also extends it
import java.util.Map; | ||
|
||
/** | ||
* Represents an immutable ordered collection of named, typed annotations for an interval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't believe it's an ordered collection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe ImmutableMap
is ordered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you're right it has user-specified order
*/ | ||
private static final AnnotatedIntervalCollection EXPECTED_ALL_ANNOTATIONS = new AnnotatedIntervalCollection( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be made package private and shared with AnnotatedIntervalCollectionUnitTest
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not in the same package, but I can make it public
if that's acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that seems reasonable since it's the same test case
new AnnotationMap(Arrays.asList( | ||
Pair.of(CopyNumberAnnotations.GC_CONTENT, Double.NaN), | ||
Pair.of(CopyNumberAnnotations.MAPPABILITY, 0.0), | ||
Pair.of(CopyNumberAnnotations.SEGMENTAL_DUPLICATION_CONTENT, 0.)))))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example doesn't have any intervals which intersect with the intervals in seg dup track. Can we add a few, for example: 20:1238782-1239901,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done, added 20:1238001-1239000 to make the numbers a bit nicer.
f3ea4c4
to
f06546b
Compare
@asmirnov239 OK, think I got everything! Thanks for your careful review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great @slee! Just one comment feel free to merge after
@@ -68,6 +71,11 @@ | |||
Pair.of(CopyNumberAnnotations.GC_CONTENT, 0.448), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add another interval which partially overlaps with intervals in mappability track?
This produces a resource that will be used as input to an upcoming tool to filter intervals based on these annotations (as well as coverage statistics). Currently, we have an external python script performing this step in the gCNV pipeline.
I also updated the AnnotateIntervals task and calls in WDL, but these changes are untested; the reviewer should check carefully for typos.
Currently, all annotations are of double type, but I've added code that can support all types supported by the TSV code as well. Additional tracks can also be added relatively easily. Currently, allowed annotations and their corresponding types are hardcoded; we could possibly move this information to the SAM-style header in the future.
For the Umap hg19 k100 single-read mappability track and the segmental-duplication track used by the Talkowski lab, annotation of 1kb bins on hg19 takes less than a minute with the default feature lookahead (which is exposed as a parameter).
I tested using the Umap multi-read mappability track (which is orders of magnitude larger, but is actually what is used in the external script), but this is much slower (documentation indicates that the single-read track should be used to dissuade this). We should evaluate whether or not using the single-read track suffices for filtering.