Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added optional mappability and segmental-duplication annotation to AnnotateIntervals. #5162

Merged
merged 4 commits into from
Sep 13, 2018

Conversation

samuelklee
Copy link
Contributor

@samuelklee samuelklee commented Sep 6, 2018

This produces a resource that will be used as input to an upcoming tool to filter intervals based on these annotations (as well as coverage statistics). Currently, we have an external python script performing this step in the gCNV pipeline.

I also updated the AnnotateIntervals task and calls in WDL, but these changes are untested; the reviewer should check carefully for typos.

Currently, all annotations are of double type, but I've added code that can support all types supported by the TSV code as well. Additional tracks can also be added relatively easily. Currently, allowed annotations and their corresponding types are hardcoded; we could possibly move this information to the SAM-style header in the future.

For the Umap hg19 k100 single-read mappability track and the segmental-duplication track used by the Talkowski lab, annotation of 1kb bins on hg19 takes less than a minute with the default feature lookahead (which is exposed as a parameter).

I tested using the Umap multi-read mappability track (which is orders of magnitude larger, but is actually what is used in the external script), but this is much slower (documentation indicates that the single-read track should be used to dissuade this). We should evaluate whether or not using the single-read track suffices for filtering.

Copy link
Collaborator

@asmirnov239 asmirnov239 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @samuelklee! I had a few questions about the tracks themselves and scoring logic that I left in the comments

private FeatureInput<BEDFeature> mappabilityTrackPath;

@Argument(
doc = "Path to segmental-duplication track in .bed or .bed.gz format (see https://bismap.hoffmanlab.org/). " +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find the segmental-duplication track on that page, only mappability one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, yup---forgot to remove the link after copying/pasting.

reference = ReferenceDataSource.of(referenceArguments.getReferencePath()); //the GATKTool ReferenceDataSource is package-protected, so we cannot access it directly
features = new FeatureManager( //the GATKTool FeatureManager is package-protected, so we cannot access it directly
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FeatureManager is public, I'm confused

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The FeatureManager class is public, but the instance field in the GATKTool class is not.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see

@@ -150,4 +258,59 @@ public Double apply(final Locatable interval,
return totalCount == 0 ? Double.NaN : gcCount / (double) totalCount;
}
}

/**
* If scores are provided, intervals will be annotated with the length-weighted average; scores may not be NaN.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scores may not be NaN. It seems that you do allow scores to take NaN value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is supposed to convey that scores in the input tracks may not be NaN. However, they may be NaN in the output. (Actually, I'm not sure what happens in the former case, though---I'll add some exception throwing if needed and a corresponding test.)

Copy link
Contributor Author

@samuelklee samuelklee Sep 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait, you're right, I think the doc/code is inconsistent. Right now it will take a provided NaN and turn it into a 1.

I think the issue is that an explicit NaN in the track is the same as a missing score---both are represented as NaN. However, we want the former to be NaN (or throw) and the latter to be 1.

Since it's unlikely that input tracks will actually have NaN, I'll update the documentation to indicate that these will be taken as 1. Not ideal, but I don't see an easy way around it.

}
}

public static class MappabilityAnnotator extends BEDLengthWeightedAnnotator {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Access could be private

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done for all.

final FeatureContext featureContext) {
final int intervalLength = interval.getLengthOnReference();
if (intervalLength == 0) {
return Double.NaN;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just throw an exception here? or a warning

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, good point.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check in the traversal loop.

if (intervalLength == 0) {
return Double.NaN;
}
double lengthWeightedSum = 0.;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to assign a score of 0 if a region is not in the mappability track?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is what we indicate in the documentation.

*/
enum AnnotatedIntervalTableColumn {
CONTIG,
START,
END,
GC_CONTENT;
END;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this enum be now moved up to AbstractLocatableCollection and shared with SimpleIntervalCollection since it's the same minimal header?

Copy link
Contributor Author

@samuelklee samuelklee Sep 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AllelicCountCollection doesn't have the same minimal header (it has a single POSITION = START = END header instead).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I didn't see that AllelicCountCollection also extends it

import java.util.Map;

/**
* Represents an immutable ordered collection of named, typed annotations for an interval.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe it's an ordered collection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe ImmutableMap is ordered.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you're right it has user-specified order

*/
private static final AnnotatedIntervalCollection EXPECTED_ALL_ANNOTATIONS = new AnnotatedIntervalCollection(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be made package private and shared with AnnotatedIntervalCollectionUnitTest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not in the same package, but I can make it public if that's acceptable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think that seems reasonable since it's the same test case

new AnnotationMap(Arrays.asList(
Pair.of(CopyNumberAnnotations.GC_CONTENT, Double.NaN),
Pair.of(CopyNumberAnnotations.MAPPABILITY, 0.0),
Pair.of(CopyNumberAnnotations.SEGMENTAL_DUPLICATION_CONTENT, 0.))))));
Copy link
Collaborator

@asmirnov239 asmirnov239 Sep 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example doesn't have any intervals which intersect with the intervals in seg dup track. Can we add a few, for example: 20:1238782-1239901,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, added 20:1238001-1239000 to make the numbers a bit nicer.

@samuelklee
Copy link
Contributor Author

@asmirnov239 OK, think I got everything! Thanks for your careful review.

Copy link
Collaborator

@asmirnov239 asmirnov239 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @slee! Just one comment feel free to merge after

@@ -68,6 +71,11 @@
Pair.of(CopyNumberAnnotations.GC_CONTENT, 0.448),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add another interval which partially overlaps with intervals in mappability track?

@samuelklee samuelklee merged commit 63f7b63 into master Sep 13, 2018
@samuelklee samuelklee deleted the sl_annot branch September 13, 2018 14:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants