-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CompareReferences tool #7930
CompareReferences tool #7930
Conversation
Codecov Report
@@ Coverage Diff @@
## master #7930 +/- ##
===============================================
- Coverage 87.078% 87.062% -0.017%
- Complexity 36997 37007 +10
===============================================
Files 2217 2218 +1
Lines 173849 173758 -91
Branches 18791 18769 -22
===============================================
- Hits 151385 151277 -108
- Misses 15832 15896 +64
+ Partials 6632 6585 -47
|
@orlicohen It looks like you accidentally rebased a bunch of unrelated commits -- we'll have to repair the branch before we can review |
a0e00bb
to
437eb84
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. A few bigger comments and mostly a few minor comments and requests for some more documentation.
@@ -697,6 +697,10 @@ public final <T extends Feature> Object getHeaderForFeatures( final FeatureInput | |||
return hasFeatures() ? features.getHeader(featureDescriptor) : null; | |||
} | |||
|
|||
public final GATKPath getReferencePath(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Public methods in GATKTool definitely need javadocs explaining themselves.
* | ||
*/ | ||
@CommandLineProgramProperties( | ||
summary = "", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fill these summaries in.
src/main/java/org/broadinstitute/hellbender/tools/reference/CompareReferences.java
Show resolved
Hide resolved
@@ -469,6 +469,7 @@ public static Set<String> getContigNames(SAMSequenceDictionary dict) { | |||
for (SAMSequenceRecord dictionaryEntry : dict.getSequences()) | |||
contigNames.add(dictionaryEntry.getSequenceName()); | |||
return contigNames; | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accidental, remove
src/main/java/org/broadinstitute/hellbender/tools/reference/ReferencePair.java
Show resolved
Hide resolved
* @return the status set as a formatted String | ||
*/ | ||
public String statusAsString(){ | ||
String output = ""; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm... this currently is missing any indication of WHAT lines the status' are being generated by. I don't see where those are getting printed and it looks like we forgot to do that. I think we need to point the user at the specific lines that are problemeatic somehow for example "chr3 mismatches with reference A and B" for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery We said that we would print just the statuses per reference pair in a compact tabular form -- linking the statuses to individual TableRows is not a feature that has been implemented yet, and is probably out of scope for this PR.
*/ | ||
private void writeTableToStdOutput(ReferenceSequenceTable table){ | ||
// print header | ||
List<String> columnNames = table.getColumnNames(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think now is the time to go back and merge these two writers. They are both printing the same data (and are both also writing the header for example). I would advocate simply providing the table writer an output stream that prints to stdout here. I think that will be much cleaner rather than having these two slightly incongruent methods.
@@ -59,7 +59,8 @@ public void testAddFileArgument() throws Exception { | |||
Assert.assertEquals(args.getArgsArray(), new String[]{"--foo", file.getAbsolutePath()}); | |||
} | |||
|
|||
private enum AnEnum { OPTION1, OPTION2} | |||
private enum | |||
AnEnum { OPTION1, OPTION2} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accidental undo
...est/java/org/broadinstitute/hellbender/tools/reference/CompareReferencesIntegrationTest.java
Show resolved
Hide resolved
|
||
@Test | ||
public void testCalculateMD5(){ | ||
final File reference = new File( "src/test/resources/org/broadinstitute/hellbender/tools/reference/CompareReferences/hg19mini.fasta"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test looks good but I would like to get an idea for how long it takes on a full reference. If you try running it on Homo_sapiens_assembly38.20.21.fasta how long does it take?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orlicohen Posted an initial round of comments -- still need to review the tests
* | ||
* Note: If no output file provided, table will print to standard output. | ||
*/ | ||
@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME, doc = "If provided, file to output table.", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the doc = " "
attribute for this argument, mention that the output is in TSV format. Eg., "If specified, output reference sequence table in TSV format to this file. Otherwise print it to stdout."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orlicohen looks like you missed this
src/main/java/org/broadinstitute/hellbender/tools/reference/CompareReferences.java
Show resolved
Hide resolved
private GATKPath output; | ||
|
||
@Argument(fullName = "md5-calculation-mode", shortName = "md5-calculation-mode", doc = "MD5CalculationMode indicating method of MD5 calculation.", optional = true) | ||
private MD5CalculationMode md5CalculationMode = MD5CalculationMode.USE_DICT; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think James is correct that we probably want to default to RECALCULATE_IF_MISSING
@Argument(fullName = "display-sequences-by-name", doc = "If provided, the table by sequence name will be printed.", optional = true) | ||
private boolean displaySequencesByName = false; | ||
|
||
@Argument(fullName = "display-only-differing-sequences", doc = "If provided, only display sequence names ", optional = true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"only display sequence names that differ in their actual sequence"
src/main/java/org/broadinstitute/hellbender/tools/reference/CompareReferences.java
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/reference/ReferenceSequenceTable.java
Show resolved
Hide resolved
* Given a sequence name, returns the set of its corresponding rows | ||
* | ||
* @param sequenceName The sequence name as a String | ||
* @return the set of TableRows that contain the sequence name |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"or an empty Set if there are no such rows"
* @return the set of TableRows that contain the sequence name | ||
*/ | ||
public Set<TableRow> queryBySequenceName(String sequenceName){ | ||
return tableBySequenceName.get(sequenceName) == null ? Collections.emptySet() : tableBySequenceName.get(sequenceName); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Save the return value from get()
in a local variable so that you don't have to call it twice.
src/main/java/org/broadinstitute/hellbender/tools/reference/ReferenceSequenceTable.java
Show resolved
Hide resolved
} | ||
|
||
@Override | ||
public boolean equals(Object o) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document the reason why equals()
and hashCode()
only check the md5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orlicohen I would add a little more detail as to why you chose to do this specifically.
} | ||
break; | ||
case ALWAYS_RECALCULATE: | ||
md5 = ReferenceUtils.calculateMD5(source, referenceInterval); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be a check here if this md5 mismatches with the md5 from the dictionary (if it exists of course). If it ever does we should use the logger to throw a warning to user listing the two md5s and the contig name and the reference for which it is mismatching.
src/main/java/org/broadinstitute/hellbender/tools/reference/CompareReferences.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orlicohen Here are the rest of my review comments
|
||
if(sequenceNameFound == 2){ | ||
pair.addStatus(ReferencePair.Status.DIFFER_IN_SEQUENCE); | ||
} else if(sequenceNameFound == 1){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a comment explaining the logic here (eg., why 2 means DIFFER_IN_SEQUENCE and 1 means DIFFER_IN_SEQUENCES_PRESENT)
src/main/java/org/broadinstitute/hellbender/tools/reference/ReferenceSequenceTable.java
Show resolved
Hide resolved
src/main/java/org/broadinstitute/hellbender/tools/reference/CompareReferences.java
Show resolved
Hide resolved
int ref1Index = pair.getRef1ColumnIndex(); | ||
int ref2Index = pair.getRef2ColumnIndex(); | ||
Set<ReferenceSequenceTable.TableRow> rows = queryBySequenceName(sequenceName); | ||
int sequenceNameFound = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really counting the number of times a sequence name was found and the other reference was missing the sequence. Consider renaming to make this clearer. Eg., sequenceNameFoundInOneRef
|
||
import java.util.*; | ||
|
||
public class ReferenceSequenceTableUnitTest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extends GATKBaseTest
for (ReferenceSequenceTable.TableEntry entry : row.getEntries()) { | ||
if (entry.getColumnValue().equals(expectedSequenceName)) { | ||
contigPresent = true; | ||
continue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think you want break;
here to break out of the inner loop
continue; | ||
} | ||
} | ||
Assert.assertTrue(contigPresent); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a String as a second arg to assertTrue()
that explains what went wrong when the assert fails. Eg.,
"Contig " + expectedSequenceName + " not found in all rows returned by queryBySequenceName(expectedSequenceName)"
This will get output when the assert fails
src/test/java/org/broadinstitute/hellbender/tools/reference/ReferenceSequenceTableUnitTest.java
Show resolved
Hide resolved
} | ||
} | ||
|
||
public Map<ReferencePair, Set<ReferencePair.Status>> manuallySetReferencePairStatus(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private
@@ -0,0 +1,5 @@ | |||
MD5 Length hg19mini.fasta hg19mini_1renamed.fasta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove the three files in the tempreferences
directory from this PR using git rm
return tableBySequenceName.get(sequenceName) == null ? Collections.emptySet() : tableBySequenceName.get(sequenceName); | ||
} | ||
|
||
private String calculateMD5(SAMSequenceRecord record, ReferenceDataSource source) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mehtod is somewhat confusing in name. calculateMD5
implies we are actually calculating the md5 but we aren't here and indeed there is a second method you added that does that. Change this to something like getMD5ForRecord()
or some such.
src/main/java/org/broadinstitute/hellbender/utils/reference/ReferenceUtils.java
Show resolved
Hide resolved
@Override | ||
public void traverse(){ | ||
ReferenceSequenceTable table = new ReferenceSequenceTable(referenceSources, md5CalculationMode); | ||
logger.warn("Building reference sequence table."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make these two logger.wan()
messages logger.info()
messages
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple things you missed and we few pretty minor comments but otherwise this is looking good. Let me know when you want me to look at that refacotred mode. When you do that refactor remember that you should probably have at least a unit test asserting that you can build the table from a sequence dictionary.
writeTableBySequenceName(table); | ||
} | ||
|
||
logger.warn("Analyzing table."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.info()
DIFFER_IN_SEQUENCE_NAMES, | ||
// found sequences with same names but inconsistent MD5s between the 2 references | ||
DIFFER_IN_SEQUENCE, | ||
// a sequence is present in one reference but not the other |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"at least one sequence"
src/main/java/org/broadinstitute/hellbender/utils/reference/ReferenceUtils.java
Show resolved
Hide resolved
/** | ||
* Given a reference data source and a sequence interval, calculates the MD5 for the given sequence. | ||
* | ||
* Note: does not close the ReferenceDataSource it's passed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a to this comment a line stating that this md5 code mimics the MD5 calculation code in {@link picard.sam.CreateSequenceDictionary}
in that it allws IUPAC bases but uppercases all the bases.
} | ||
|
||
public boolean compareRows(ReferenceSequenceTable.TableRow row1, ReferenceSequenceTable.TableRow row2){ | ||
return row1.getMd5().equals(row2.getMd5()) && row1.getLength() == row2.getLength(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case ALWAYS_RECALCULATE: | ||
md5 = ReferenceUtils.calculateMD5(referencePath, referenceInterval); | ||
if(md5FromDict != null && !md5FromDict.equals(md5)){ | ||
logger.warn(String.format("MD5 Mismatch for sequence %s. Found '%s', but calculated '%s'", record.getSequenceName(), md5FromDict, md5)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add "This could indicate that the sequence dictionary is invalid"
/** | ||
* Display reference comparison as a tab-delimited table and summarize reference differences. | ||
* | ||
* <p>This tool generates a MD5-keyed table comparing specified references and does an analysis to summarize the differences |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good javadoc for the class
} | ||
|
||
@Override | ||
public Object onTraversalSuccess() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you aren't using it you can get rid of this override.
} | ||
} | ||
|
||
public static class CompareReferencesOutputTableWriter extends TableWriter<ReferenceSequenceTable.TableRow> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a quick javadoc here explaining this is necessary to write the table output
* @return the corresponding TableRow from the tableByMD5 | ||
*/ | ||
public TableRow queryByMD5(String md5){ | ||
if(!tableBuilt){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you repeat this line in a bunch of places. Can you make a helper method checkTableStatus()
or something like that and pull out all these if(!tableBuilt)
statements to call it for readability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@orlicohen A few final comments for you, then this should be ready for merge. Tool is looking good!
* @return path for the reference specified at the command line | ||
*/ | ||
public final GATKPath getReferencePath(){ | ||
return referenceArguments.getReferenceSpecifier(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since referenceArguments
is protected
, this method is not needed and should be deleted. You can just access referenceArguments.getReferenceSpecifier()
directly from your tool.
* <pre> | ||
* gatk CompareReferences \ | ||
* -R reference1.fasta \ | ||
* -ref-comp reference2.fasta |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The argument is actually -refcomp
(or --references-to-compare
)
USE_DICT, | ||
RECALCULATE_IF_MISSING, | ||
ALWAYS_RECALCULATE; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add a one-line comment above each enum value explaining what each mode does.
private void writeTable(ReferenceSequenceTable table){ | ||
TableColumnCollection columns = new TableColumnCollection(table.getColumnNames()); | ||
try(CompareReferences.CompareReferencesOutputTableWriter writer = output == null | ||
? new CompareReferences.CompareReferencesOutputTableWriter(new OutputStreamWriter(System.out), columns) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there is no test coverage for the stdout case, have you run the tool manually without -O
and confirmed that the table gets printed to stdout correctly? It might be a good idea to also add an integration test that runs the tool without -O
, and just makes sure it completes successfully/doesn't crash (don't need to check the actual output).
* @param table | ||
*/ | ||
public void writeTableBySequenceName(ReferenceSequenceTable table){ | ||
List<String> output = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since it turns out we didn't need to save the output after all, you can remove this output
array and just print directly in your main loop below. This cuts down on unnecessary memory usage.
case ALWAYS_RECALCULATE: | ||
md5 = ReferenceUtils.calculateMD5(referencePath, referenceInterval); | ||
if(md5FromDict != null && !md5FromDict.equals(md5)){ | ||
logger.warn(String.format("MD5 Mismatch for sequence %s. Found '%s', but calculated '%s'. Sequence dictionary may be invalid.", record.getSequenceName(), md5FromDict, md5)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This warning should mention the name of the reference involved. It should also say "Found %s in sequence dictionary" for clarity
pair.removeStatus(ReferencePair.Status.EXACT_MATCH); | ||
} | ||
if(!ref1Value.getColumnValue().equals(ref2Value.getColumnValue()) && | ||
!(ref1Value.getColumnValue().equals(MISSING_ENTRY_DISPLAY_STRING) || ref2Value.getColumnValue().equals(MISSING_ENTRY_DISPLAY_STRING))){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the TableEntry.isEmpty()
method whenever you need to check whether an entry is missing. Eg., ref1Value.isEmpty()
instead of ref1Value.getColumnValue().equals(MISSING_ENTRY_DISPLAY_STRING)
. Comment also applies to the rest of the method below.
public String toString(){ | ||
String row = String.format("%s\t", this.md5); | ||
for(int i = 0; i < entries.length; i++){ | ||
row += entries[i].columnValue; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the toString()
method end up adding the MD5 twice, since it's in column 0?
/** | ||
* Given a reference data source and a sequence interval, calculates the MD5 for the given sequence. | ||
* | ||
* Note: does not close the ReferenceDataSource it's passed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment is obsolete, since the method no longer takes a ReferenceDataSource
throw new GATKException("Incorrect MessageDigest algorithm specified in calculateMD5()", exception); | ||
} | ||
|
||
try(final ReferenceDataSource source = ReferenceDataSource.of(referencePath.toPath(), true)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add comment: we pass in "true" as the second arg to ReferenceDataSource.of()
to tell it not to modify IUPAC bases or capitalization for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks like all comments have been addressed! Will merge once tests pass
New tool to compare references.
(Linked to #6837)