Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CompareReferences tool #7930

Merged
merged 22 commits into from
Jul 21, 2022
Merged

CompareReferences tool #7930

merged 22 commits into from
Jul 21, 2022

Conversation

orlicohen
Copy link
Contributor

New tool to compare references.
(Linked to #6837)

@codecov
Copy link

codecov bot commented Jul 7, 2022

Codecov Report

Merging #7930 (a736923) into master (804d1e2) will decrease coverage by 0.017%.
The diff coverage is 83.402%.

@@               Coverage Diff               @@
##              master     #7930       +/-   ##
===============================================
- Coverage     87.078%   87.062%   -0.017%     
- Complexity     36997     37007       +10     
===============================================
  Files           2217      2218        +1     
  Lines         173849    173758       -91     
  Branches       18791     18769       -22     
===============================================
- Hits          151385    151277      -108     
- Misses         15832     15896       +64     
+ Partials        6632      6585       -47     
Impacted Files Coverage Δ
.../hellbender/tools/reference/CompareReferences.java 66.667% <66.667%> (ø)
...ute/hellbender/utils/reference/ReferenceUtils.java 73.529% <66.667%> (-5.418%) ⬇️
...ls/reference/CompareReferencesIntegrationTest.java 77.465% <77.465%> (ø)
...bender/tools/reference/ReferenceSequenceTable.java 85.311% <85.311%> (ø)
...tute/hellbender/tools/reference/ReferencePair.java 86.486% <86.486%> (ø)
...ools/reference/ReferenceSequenceTableUnitTest.java 97.872% <97.872%> (ø)
...bender/utils/reference/ReferenceUtilsUnitTest.java 75.556% <100.000%> (+6.984%) ⬆️
...er/tools/spark/sv/evidence/BreakpointEvidence.java 66.822% <0.000%> (-11.916%) ⬇️
...on/FindBreakpointEvidenceSparkIntegrationTest.java 96.296% <0.000%> (-3.704%) ⬇️
...rg/broadinstitute/hellbender/utils/SVInterval.java 90.000% <0.000%> (-2.857%) ⬇️
... and 31 more

@orlicohen orlicohen requested a review from droazen July 8, 2022 15:35
@droazen droazen self-assigned this Jul 8, 2022
@droazen
Copy link
Contributor

droazen commented Jul 18, 2022

@orlicohen It looks like you accidentally rebased a bunch of unrelated commits -- we'll have to repair the branch before we can review

@orlicohen orlicohen force-pushed the oc_referencecomparator_6837 branch from a0e00bb to 437eb84 Compare July 18, 2022 18:40
Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. A few bigger comments and mostly a few minor comments and requests for some more documentation.

@@ -697,6 +697,10 @@ public final <T extends Feature> Object getHeaderForFeatures( final FeatureInput
return hasFeatures() ? features.getHeader(featureDescriptor) : null;
}

public final GATKPath getReferencePath(){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Public methods in GATKTool definitely need javadocs explaining themselves.

*
*/
@CommandLineProgramProperties(
summary = "",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fill these summaries in.

@@ -469,6 +469,7 @@ public static Set<String> getContigNames(SAMSequenceDictionary dict) {
for (SAMSequenceRecord dictionaryEntry : dict.getSequences())
contigNames.add(dictionaryEntry.getSequenceName());
return contigNames;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accidental, remove

* @return the status set as a formatted String
*/
public String statusAsString(){
String output = "";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... this currently is missing any indication of WHAT lines the status' are being generated by. I don't see where those are getting printed and it looks like we forgot to do that. I think we need to point the user at the specific lines that are problemeatic somehow for example "chr3 mismatches with reference A and B" for example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery We said that we would print just the statuses per reference pair in a compact tabular form -- linking the statuses to individual TableRows is not a feature that has been implemented yet, and is probably out of scope for this PR.

*/
private void writeTableToStdOutput(ReferenceSequenceTable table){
// print header
List<String> columnNames = table.getColumnNames();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think now is the time to go back and merge these two writers. They are both printing the same data (and are both also writing the header for example). I would advocate simply providing the table writer an output stream that prints to stdout here. I think that will be much cleaner rather than having these two slightly incongruent methods.

@@ -59,7 +59,8 @@ public void testAddFileArgument() throws Exception {
Assert.assertEquals(args.getArgsArray(), new String[]{"--foo", file.getAbsolutePath()});
}

private enum AnEnum { OPTION1, OPTION2}
private enum
AnEnum { OPTION1, OPTION2}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

accidental undo


@Test
public void testCalculateMD5(){
final File reference = new File( "src/test/resources/org/broadinstitute/hellbender/tools/reference/CompareReferences/hg19mini.fasta");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test looks good but I would like to get an idea for how long it takes on a full reference. If you try running it on Homo_sapiens_assembly38.20.21.fasta how long does it take?

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orlicohen Posted an initial round of comments -- still need to review the tests

*
* Note: If no output file provided, table will print to standard output.
*/
@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME, doc = "If provided, file to output table.", optional = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the doc = " " attribute for this argument, mention that the output is in TSV format. Eg., "If specified, output reference sequence table in TSV format to this file. Otherwise print it to stdout."

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orlicohen looks like you missed this

private GATKPath output;

@Argument(fullName = "md5-calculation-mode", shortName = "md5-calculation-mode", doc = "MD5CalculationMode indicating method of MD5 calculation.", optional = true)
private MD5CalculationMode md5CalculationMode = MD5CalculationMode.USE_DICT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think James is correct that we probably want to default to RECALCULATE_IF_MISSING

@Argument(fullName = "display-sequences-by-name", doc = "If provided, the table by sequence name will be printed.", optional = true)
private boolean displaySequencesByName = false;

@Argument(fullName = "display-only-differing-sequences", doc = "If provided, only display sequence names ", optional = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"only display sequence names that differ in their actual sequence"

* Given a sequence name, returns the set of its corresponding rows
*
* @param sequenceName The sequence name as a String
* @return the set of TableRows that contain the sequence name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"or an empty Set if there are no such rows"

* @return the set of TableRows that contain the sequence name
*/
public Set<TableRow> queryBySequenceName(String sequenceName){
return tableBySequenceName.get(sequenceName) == null ? Collections.emptySet() : tableBySequenceName.get(sequenceName);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Save the return value from get() in a local variable so that you don't have to call it twice.

}

@Override
public boolean equals(Object o) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Document the reason why equals() and hashCode() only check the md5

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orlicohen I would add a little more detail as to why you chose to do this specifically.

}
break;
case ALWAYS_RECALCULATE:
md5 = ReferenceUtils.calculateMD5(source, referenceInterval);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be a check here if this md5 mismatches with the md5 from the dictionary (if it exists of course). If it ever does we should use the logger to throw a warning to user listing the two md5s and the contig name and the reference for which it is mismatching.

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orlicohen Here are the rest of my review comments


if(sequenceNameFound == 2){
pair.addStatus(ReferencePair.Status.DIFFER_IN_SEQUENCE);
} else if(sequenceNameFound == 1){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment explaining the logic here (eg., why 2 means DIFFER_IN_SEQUENCE and 1 means DIFFER_IN_SEQUENCES_PRESENT)

int ref1Index = pair.getRef1ColumnIndex();
int ref2Index = pair.getRef2ColumnIndex();
Set<ReferenceSequenceTable.TableRow> rows = queryBySequenceName(sequenceName);
int sequenceNameFound = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really counting the number of times a sequence name was found and the other reference was missing the sequence. Consider renaming to make this clearer. Eg., sequenceNameFoundInOneRef


import java.util.*;

public class ReferenceSequenceTableUnitTest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extends GATKBaseTest

for (ReferenceSequenceTable.TableEntry entry : row.getEntries()) {
if (entry.getColumnValue().equals(expectedSequenceName)) {
contigPresent = true;
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think you want break; here to break out of the inner loop

continue;
}
}
Assert.assertTrue(contigPresent);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a String as a second arg to assertTrue() that explains what went wrong when the assert fails. Eg.,

"Contig " + expectedSequenceName + " not found in all rows returned by queryBySequenceName(expectedSequenceName)"

This will get output when the assert fails

}
}

public Map<ReferencePair, Set<ReferencePair.Status>> manuallySetReferencePairStatus(){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private

@@ -0,0 +1,5 @@
MD5 Length hg19mini.fasta hg19mini_1renamed.fasta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the three files in the tempreferences directory from this PR using git rm

return tableBySequenceName.get(sequenceName) == null ? Collections.emptySet() : tableBySequenceName.get(sequenceName);
}

private String calculateMD5(SAMSequenceRecord record, ReferenceDataSource source) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mehtod is somewhat confusing in name. calculateMD5 implies we are actually calculating the md5 but we aren't here and indeed there is a second method you added that does that. Change this to something like getMD5ForRecord() or some such.

@droazen droazen changed the title **DO NOT MERGE** CompareReferences tool CompareReferences tool Jul 20, 2022
@Override
public void traverse(){
ReferenceSequenceTable table = new ReferenceSequenceTable(referenceSources, md5CalculationMode);
logger.warn("Building reference sequence table.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make these two logger.wan() messages logger.info() messages

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple things you missed and we few pretty minor comments but otherwise this is looking good. Let me know when you want me to look at that refacotred mode. When you do that refactor remember that you should probably have at least a unit test asserting that you can build the table from a sequence dictionary.

writeTableBySequenceName(table);
}

logger.warn("Analyzing table.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logger.info()

DIFFER_IN_SEQUENCE_NAMES,
// found sequences with same names but inconsistent MD5s between the 2 references
DIFFER_IN_SEQUENCE,
// a sequence is present in one reference but not the other
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"at least one sequence"

/**
* Given a reference data source and a sequence interval, calculates the MD5 for the given sequence.
*
* Note: does not close the ReferenceDataSource it's passed.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a to this comment a line stating that this md5 code mimics the MD5 calculation code in {@link picard.sam.CreateSequenceDictionary} in that it allws IUPAC bases but uppercases all the bases.

}

public boolean compareRows(ReferenceSequenceTable.TableRow row1, ReferenceSequenceTable.TableRow row2){
return row1.getMd5().equals(row2.getMd5()) && row1.getLength() == row2.getLength();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case ALWAYS_RECALCULATE:
md5 = ReferenceUtils.calculateMD5(referencePath, referenceInterval);
if(md5FromDict != null && !md5FromDict.equals(md5)){
logger.warn(String.format("MD5 Mismatch for sequence %s. Found '%s', but calculated '%s'", record.getSequenceName(), md5FromDict, md5));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add "This could indicate that the sequence dictionary is invalid"

/**
* Display reference comparison as a tab-delimited table and summarize reference differences.
*
* <p>This tool generates a MD5-keyed table comparing specified references and does an analysis to summarize the differences
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good javadoc for the class

}

@Override
public Object onTraversalSuccess() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you aren't using it you can get rid of this override.

}
}

public static class CompareReferencesOutputTableWriter extends TableWriter<ReferenceSequenceTable.TableRow> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a quick javadoc here explaining this is necessary to write the table output

* @return the corresponding TableRow from the tableByMD5
*/
public TableRow queryByMD5(String md5){
if(!tableBuilt){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since you repeat this line in a bunch of places. Can you make a helper method checkTableStatus() or something like that and pull out all these if(!tableBuilt) statements to call it for readability.

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orlicohen A few final comments for you, then this should be ready for merge. Tool is looking good!

* @return path for the reference specified at the command line
*/
public final GATKPath getReferencePath(){
return referenceArguments.getReferenceSpecifier();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since referenceArguments is protected, this method is not needed and should be deleted. You can just access referenceArguments.getReferenceSpecifier() directly from your tool.

* <pre>
* gatk CompareReferences \
* -R reference1.fasta \
* -ref-comp reference2.fasta
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The argument is actually -refcomp (or --references-to-compare)

USE_DICT,
RECALCULATE_IF_MISSING,
ALWAYS_RECALCULATE;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a one-line comment above each enum value explaining what each mode does.

private void writeTable(ReferenceSequenceTable table){
TableColumnCollection columns = new TableColumnCollection(table.getColumnNames());
try(CompareReferences.CompareReferencesOutputTableWriter writer = output == null
? new CompareReferences.CompareReferencesOutputTableWriter(new OutputStreamWriter(System.out), columns)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there is no test coverage for the stdout case, have you run the tool manually without -O and confirmed that the table gets printed to stdout correctly? It might be a good idea to also add an integration test that runs the tool without -O, and just makes sure it completes successfully/doesn't crash (don't need to check the actual output).

* @param table
*/
public void writeTableBySequenceName(ReferenceSequenceTable table){
List<String> output = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it turns out we didn't need to save the output after all, you can remove this output array and just print directly in your main loop below. This cuts down on unnecessary memory usage.

case ALWAYS_RECALCULATE:
md5 = ReferenceUtils.calculateMD5(referencePath, referenceInterval);
if(md5FromDict != null && !md5FromDict.equals(md5)){
logger.warn(String.format("MD5 Mismatch for sequence %s. Found '%s', but calculated '%s'. Sequence dictionary may be invalid.", record.getSequenceName(), md5FromDict, md5));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warning should mention the name of the reference involved. It should also say "Found %s in sequence dictionary" for clarity

pair.removeStatus(ReferencePair.Status.EXACT_MATCH);
}
if(!ref1Value.getColumnValue().equals(ref2Value.getColumnValue()) &&
!(ref1Value.getColumnValue().equals(MISSING_ENTRY_DISPLAY_STRING) || ref2Value.getColumnValue().equals(MISSING_ENTRY_DISPLAY_STRING))){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use the TableEntry.isEmpty() method whenever you need to check whether an entry is missing. Eg., ref1Value.isEmpty() instead of ref1Value.getColumnValue().equals(MISSING_ENTRY_DISPLAY_STRING). Comment also applies to the rest of the method below.

public String toString(){
String row = String.format("%s\t", this.md5);
for(int i = 0; i < entries.length; i++){
row += entries[i].columnValue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the toString() method end up adding the MD5 twice, since it's in column 0?

/**
* Given a reference data source and a sequence interval, calculates the MD5 for the given sequence.
*
* Note: does not close the ReferenceDataSource it's passed.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment is obsolete, since the method no longer takes a ReferenceDataSource

throw new GATKException("Incorrect MessageDigest algorithm specified in calculateMD5()", exception);
}

try(final ReferenceDataSource source = ReferenceDataSource.of(referencePath.toPath(), true)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comment: we pass in "true" as the second arg to ReferenceDataSource.of() to tell it not to modify IUPAC bases or capitalization for us.

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks like all comments have been addressed! Will merge once tests pass

@droazen droazen merged commit 64f9953 into master Jul 21, 2022
@droazen droazen deleted the oc_referencecomparator_6837 branch July 21, 2022 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants