Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added a few of the remaining useful features to MarkDuplicatesSpark and associated tests #5377

Merged
merged 5 commits into from
Dec 11, 2018

Conversation

jamesemery
Copy link
Collaborator

@jamesemery jamesemery commented Oct 30, 2018

Note: This branch is still blocked on several changes in Picard (broadinstitute/picard#1236, and possibly broadinstitute/picard#1245), once those are resolved and released then this branch should hopefully get the stamp of approval from @takutosato.

  • Added optical/library duplicate marking of reads (note: this does not include library tagging)
  • Added the ability to remove reads from the output bam based on their duplicate marking status

Resolves #4675
Resolves #5377

@jamesemery jamesemery requested a review from lbergelson October 30, 2018 21:13
Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery Back to you. Looks good modulo typos and some random stuff.

mutex = {MarkDuplicatesSparkArgumentCollection.DUPLICATE_TAGGING_POLICY_LONG_NAME, MarkDuplicatesSparkArgumentCollection.REMOVE_SEQUENCING_DUPLICATE_READS}, optional = true)
public boolean removeAllDuplicates = false;

@Argument(fullName = MarkDuplicatesSparkArgumentCollection.REMOVE_SEQUENCING_DUPLICATE_READS, doc = "If true do not write duplicates to the output file instead of writing them with appropriate flags set.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is confusing because it has the same comment as removeAllDuplicates.

@@ -93,5 +94,6 @@ private void registerGATKClasses(Kryo kryo) {
kryo.register(ReadsKey.class, new FieldSerializer(kryo, ReadsKey.class));
kryo.register(ReadsKey.KeyForFragment.class, new FieldSerializer(kryo, ReadsKey.KeyForFragment.class));
kryo.register(ReadsKey.KeyForPair.class, new FieldSerializer(kryo, ReadsKey.KeyForPair.class));
kryo.register(SerializableOpticalDuplicatesFinder.class, new FieldSerializer(kryo, SerializableOpticalDuplicatesFinder.class));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weren't we getting rid of this class? Should this be replaces with a registration of OpticalDuplicateFinder?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

@@ -68,7 +68,7 @@ protected void runTool(final JavaSparkContext ctx) {
try (final BwaSparkEngine bwaEngine = new BwaSparkEngine(ctx, referenceArguments.getReferenceFileName(), bwaArgs.indexImageFile, getHeaderForReads(), getReferenceSequenceDictionary())) {
final ReadFilter filter = makeReadFilter(bwaEngine.getHeader());
final JavaRDD<GATKRead> alignedReads = bwaEngine.alignPaired(getUnfilteredReads()).filter(filter::test);
final JavaRDD<GATKRead> markedReads = MarkDuplicatesSpark.mark(alignedReads, bwaEngine.getHeader(), markDuplicatesSparkArgumentCollection.duplicatesScoringStrategy, new SerializableOpticalDuplicatesFinder(), getRecommendedNumReducers(), markDuplicatesSparkArgumentCollection.dontMarkUnmappedMates);
final JavaRDD<GATKRead> markedReads = MarkDuplicatesSpark.mark(alignedReads, bwaEngine.getHeader(), new SerializableOpticalDuplicatesFinder(), markDuplicatesSparkArgumentCollection, getRecommendedNumReducers());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SerializableOpticalDuplicateFinder -> OpticalDuplicateFinder?

@@ -174,7 +174,7 @@ protected void runTool(final JavaSparkContext ctx) {
header = getHeaderForReads();
}

final JavaRDD<GATKRead> markedReads = MarkDuplicatesSpark.mark(alignedReads, header, markDuplicatesSparkArgumentCollection.duplicatesScoringStrategy, new SerializableOpticalDuplicatesFinder(), getRecommendedNumReducers(), markDuplicatesSparkArgumentCollection.dontMarkUnmappedMates);
final JavaRDD<GATKRead> markedReads = MarkDuplicatesSpark.mark(alignedReads, header, new SerializableOpticalDuplicatesFinder(), markDuplicatesSparkArgumentCollection, getRecommendedNumReducers());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same

// Reads with this marker will be treated as non-duplicates always
public static int MARKDUPLICATES_NO_OPTICAL_MARKER = -1;
// Reads with this marker will be treated and marked as optical duplicates
public static int MARKDUPLICATES_OPPTICAL_DUPLICATE_MARKER = -2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo OPPTICAL

try {
metricsOutput.read(new FileReader(metricsFile));
} catch (final FileNotFoundException ex) {
System.err.println("Metrics file not found: " + ex);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a test failure?

} catch (final FileNotFoundException ex) {
System.err.println("Metrics file not found: " + ex);
}
final List<GATKDuplicationMetrics> nonEmptyMetrics = metricsOutput.getMetrics().stream().filter(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a big block of complicated duplicated code, could you extract it to a method instead?

Assert.assertEquals(observedMetrics.READ_PAIR_OPTICAL_DUPLICATES, expectedList.get(5));
Assert.assertEquals(observedMetrics.PERCENT_DUPLICATION, expectedList.get(6));

//Note: IntelliJ does not like it when a parameter for a test is null (can't print it and skips the test)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this comment is true? ImmutableList.of doesn't like nulls but I didn't think intellij had any trouble with it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not my comment, was originally adams

List<?> expectedList = metricsExpected.get(observedMetrics.LIBRARY);
expectedOpticalDuplicatesGroups += (Long) expectedList.get(5);
}
// NOTE: this test will fail if we add a more comprehensive example set with optical duplicates containing secondary/supplementary reads
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes it sound like there's a bug in the code, maybe make it clear that that it's the test that can't handle a more complicated scenerio?

oneLineSummary = "Compares the base qualities of two SAM/BAM/CRAM files",
programGroup = DiagnosticsAndQCProgramGroup.class
)
public class CompareReads extends GATKTool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to add this tool here? I don't see any tests for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should either be removed or tested.

@jamesemery jamesemery force-pushed the je_finalFeaturesForMarkDuplicatesTieout branch from 298349b to 891d2f0 Compare November 28, 2018 20:47
Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery One comment I somehow missed in my previous review, can we push compare reads to another branch or add tests?

build.gradle Outdated
@@ -58,7 +58,7 @@ repositories {

final requiredJavaVersion = "8"
final htsjdkVersion = System.getProperty('htsjdk.version','2.16.1')
final picardVersion = System.getProperty('picard.version','2.18.15')
final picardVersion = System.getProperty('picard.version','2.18.14-22-ge9e24bf-SNAPSHOT')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's a conflict here now

oneLineSummary = "Compares the base qualities of two SAM/BAM/CRAM files",
programGroup = DiagnosticsAndQCProgramGroup.class
)
public class CompareReads extends GATKTool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should either be removed or tested.

@jamesemery
Copy link
Collaborator Author

@lbergelson I removed compareReads. Can you give me another review? I would like to get this branch in in the shortish term.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@lbergelson lbergelson merged commit d2b7bb8 into master Dec 11, 2018
@lbergelson lbergelson deleted the je_finalFeaturesForMarkDuplicatesTieout branch December 11, 2018 16:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tie Out MarkDuplicatesSpark Compared to Picard Mark Duplicates Output
3 participants