Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CompareReferences tool #7930

Merged
merged 22 commits into from
Jul 21, 2022
Merged
Show file tree
Hide file tree
Changes from 19 commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
92bc20a
initial version of reference comparator tool
orlicohen Jun 29, 2022
85bc87a
basic skeleton code
orlicohen Jun 29, 2022
4335f4f
basic output table implementation
orlicohen Jun 30, 2022
aea17a5
working on output table
orlicohen Jul 1, 2022
7869a86
updates to mode 1 comparison
orlicohen Jul 5, 2022
41ce2f6
refactoring/cleaning up existing code and adding write to file output…
orlicohen Jul 5, 2022
fa26001
table to file output and integration tests
orlicohen Jul 6, 2022
2671b43
checking in references and expected outputs, minor tweaks
orlicohen Jul 6, 2022
8f74dd7
added md5 calculation mode code and integration tests
orlicohen Jul 7, 2022
006ada6
added table query functionality
orlicohen Jul 8, 2022
b344c64
refactored TableEntry
orlicohen Jul 8, 2022
b2507a1
prelim unit tests
orlicohen Jul 8, 2022
e3b1aac
unit tests for query methods, initial table analysis
orlicohen Jul 11, 2022
ee0db1a
unit tests updated
orlicohen Jul 12, 2022
6c00ada
initial table analysis, work in progress
orlicohen Jul 14, 2022
233bc97
table analysis for exact match and diffs in sequence, sequence name, …
orlicohen Jul 15, 2022
9a5cf45
superset/subset table analysis
orlicohen Jul 15, 2022
437eb84
finished table analysis and added documentation
orlicohen Jul 18, 2022
5c887a1
update to subset/superset logic, added case to analyze table unit test
orlicohen Jul 18, 2022
dc59744
responding to review comments
orlicohen Jul 20, 2022
70ec346
responding to review comments, refactored ReferenceSequenceTable cons…
orlicohen Jul 20, 2022
a736923
responding to final review comments
orlicohen Jul 21, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -697,6 +697,10 @@ public final <T extends Feature> Object getHeaderForFeatures( final FeatureInput
return hasFeatures() ? features.getHeader(featureDescriptor) : null;
}

public final GATKPath getReferencePath(){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Public methods in GATKTool definitely need javadocs explaining themselves.

return referenceArguments.getReferenceSpecifier();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since referenceArguments is protected, this method is not needed and should be deleted. You can just access referenceArguments.getReferenceSpecifier() directly from your tool.

}

/**
* Initialize our data sources, make sure that all tool requirements for input data have been satisfied
* and start the progress meter.
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,217 @@
package org.broadinstitute.hellbender.tools.reference;

import org.broadinstitute.barclay.argparser.Argument;
import org.broadinstitute.barclay.argparser.CommandLineProgramProperties;
import org.broadinstitute.hellbender.cmdline.StandardArgumentDefinitions;
import org.broadinstitute.hellbender.engine.GATKPath;
import org.broadinstitute.hellbender.engine.GATKTool;
import org.broadinstitute.hellbender.engine.ReferenceDataSource;
import org.broadinstitute.hellbender.exceptions.UserException;
import org.broadinstitute.hellbender.utils.tsv.DataLine;
import org.broadinstitute.hellbender.utils.tsv.TableColumnCollection;
import org.broadinstitute.hellbender.utils.tsv.TableWriter;
import picard.cmdline.programgroups.ReferenceProgramGroup;

import java.io.IOException;
import java.nio.file.Path;
import java.util.*;

/**
*
jamesemery marked this conversation as resolved.
Show resolved Hide resolved
*/
@CommandLineProgramProperties(
summary = "",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fill these summaries in.

oneLineSummary = "",
programGroup = ReferenceProgramGroup.class
)
jamesemery marked this conversation as resolved.
Show resolved Hide resolved
public class CompareReferences extends GATKTool {

@Argument(fullName = "references-to-compare", shortName = "refcomp", doc = "Reference sequence file(s) to compare.")
private List<GATKPath> references;

/**
* Output file will be written here.
*
* Note: If no output file provided, table will print to standard output.
*/
@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME, doc = "If provided, file to output table.", optional = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the doc = " " attribute for this argument, mention that the output is in TSV format. Eg., "If specified, output reference sequence table in TSV format to this file. Otherwise print it to stdout."

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@orlicohen looks like you missed this

private GATKPath output;

@Argument(fullName = "md5-calculation-mode", shortName = "md5-calculation-mode", doc = "MD5CalculationMode indicating method of MD5 calculation.", optional = true)
private MD5CalculationMode md5CalculationMode = MD5CalculationMode.USE_DICT;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Think James is correct that we probably want to default to RECALCULATE_IF_MISSING

@Argument(fullName = "display-sequences-by-name", doc = "If provided, the table by sequence name will be printed.", optional = true)
jamesemery marked this conversation as resolved.
Show resolved Hide resolved
private boolean displaySequencesByName = false;

@Argument(fullName = "display-only-differing-sequences", doc = "If provided, only display sequence names ", optional = true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"only display sequence names that differ in their actual sequence"

private boolean onlyDisplayDifferingSequences = false;

public enum MD5CalculationMode {
USE_DICT,
RECALCULATE_IF_MISSING,
jamesemery marked this conversation as resolved.
Show resolved Hide resolved
ALWAYS_RECALCULATE;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a one-line comment above each enum value explaining what each mode does.


private Map<GATKPath, ReferenceDataSource> referenceSources;

@Override
public boolean requiresReference() {
return true;
}

@Override
public void onTraversalStart() {
// add data source for -R reference
referenceSources = new LinkedHashMap<>();
referenceSources.put(getReferencePath(), directlyAccessEngineReferenceDataSource());

// add data sources for remaining references
for(GATKPath path : references){
referenceSources.put(path, ReferenceDataSource.of(path.toPath()));
}
}

@Override
public void traverse(){
ReferenceSequenceTable table = new ReferenceSequenceTable(referenceSources, md5CalculationMode);
table.build();
jamesemery marked this conversation as resolved.
Show resolved Hide resolved

if(output == null){
writeTableToStdOutput(table);
}
else{
writeTableToFileOutput(table);
}

if(displaySequencesByName){
tableBySequenceName(table);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to add a way to capture this optional second table to a file. We can discuss at one of our planning meetings.

}

List<GATKPath> refs = new ArrayList<>();
refs.addAll(referenceSources.keySet());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refs is unused and can be deleted, I believe.


List<ReferencePair> referencePairs = table.analyzeTable();
jamesemery marked this conversation as resolved.
Show resolved Hide resolved
for(ReferencePair pair : referencePairs){
jamesemery marked this conversation as resolved.
Show resolved Hide resolved
System.out.println(pair);
}
}

/**
* Given a table, write table to standard output.
*
* @param table
*/
private void writeTableToStdOutput(ReferenceSequenceTable table){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To unify the table writing as James suggests, you'll need to add a second constructor to your CompareReferencesOutputTableWriter class below that takes a Writer rather than a Path, and calls super(writer, columns). To make it write to stdout, you could then pass in a new OutputStreamWriter(System.out) to this constructor.

new OutputStreamWriter(System.out)

// print header
List<String> columnNames = table.getColumnNames();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think now is the time to go back and merge these two writers. They are both printing the same data (and are both also writing the header for example). I would advocate simply providing the table writer an output stream that prints to stdout here. I think that will be much cleaner rather than having these two slightly incongruent methods.

for(int i = 0 ; i < columnNames.size(); i++){
if(i == 0){
System.out.print(columnNames.get(i));
}
else{
System.out.print("\t" + columnNames.get(i));
}
}
System.out.println();

// use string format to output as a table
for(ReferenceSequenceTable.TableRow row : table){
for(ReferenceSequenceTable.TableEntry currEntry : row.getEntries()){
System.out.printf("%s\t", currEntry.getColumnValue());
}
System.out.println();
}
}

/**
* Given a table, write table to file output.
*
* @param table
*/
private void writeTableToFileOutput(ReferenceSequenceTable table) {
TableColumnCollection columns = new TableColumnCollection(table.getColumnNames());
try(CompareReferences.CompareReferencesOutputTableWriter writer = new CompareReferences.CompareReferencesOutputTableWriter(output.toPath(), columns)){
writer.writeHeaderIfApplies();
for(ReferenceSequenceTable.TableRow row : table){
writer.writeRecord(row);
}
}
catch(IOException exception){
throw new UserException.CouldNotCreateOutputFile(output, "Failed to write output table.", exception);
}
}

@Override
public Object onTraversalSuccess() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you aren't using it you can get rid of this override.

return null;
}

/**
* Given a table, write table by sequence name to standard output
*
* @param table
*/
public void tableBySequenceName(ReferenceSequenceTable table){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tableBySequenceName() -> writeTableBySequenceName()

List<String> output = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it turns out we didn't need to save the output after all, you can remove this output array and just print directly in your main loop below. This cuts down on unnecessary memory usage.

output.add("Sequence \tMD5 \tReference\n");

for(String sequenceName : table.getAllSequenceNames()){
Set<ReferenceSequenceTable.TableRow> rows = table.queryBySequenceName(sequenceName);
if(onlyDisplayDifferingSequences) {
if(rows.size() > 1) {
output.addAll(displayBySequenceName(rows, sequenceName));
}
} else {
output.addAll(displayBySequenceName(rows, sequenceName));
}
}

for(String str : output){
System.out.print(str);
}
System.out.println();
}

private List<String> displayBySequenceName(Set<ReferenceSequenceTable.TableRow> rows, String sequenceName){
List<String> output = new ArrayList<>();
output.add(sequenceName);
for(ReferenceSequenceTable.TableRow row : rows) {
ReferenceSequenceTable.TableEntry[] entries = row.getEntries();
output.add("\n\t" + row.getMd5() + "\t");

for(int i = 2; i < entries.length; i++) {
if(entries[i].getColumnValue().equals(sequenceName)) {
output.add(entries[i].getColumnName() + "\t");
}
}
}
output.add("\n");
return output;
}

@Override
public void closeTool() {
for(Map.Entry<GATKPath, ReferenceDataSource> entry : referenceSources.entrySet()){
if(!entry.getKey().equals(getReferencePath())){
entry.getValue().close();
}
}
}

public static class CompareReferencesOutputTableWriter extends TableWriter<ReferenceSequenceTable.TableRow> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a quick javadoc here explaining this is necessary to write the table output

private TableColumnCollection columnCollection;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The columnCollection instance variable is unused and can be deleted.


public CompareReferencesOutputTableWriter(final Path table, TableColumnCollection columns) throws IOException {
super(table, columns);
columnCollection = columns;
}

@Override
protected void composeLine(final ReferenceSequenceTable.TableRow record, final DataLine dataLine) {
List<String> columnNames = record.getColumnNames();
ReferenceSequenceTable.TableEntry[] entries = record.getEntries();
for(int i = 0; i < columnNames.size(); i++){
dataLine.set(entries[i].getColumnName(), entries[i].getColumnValue());
}
}
}
}
Original file line number Diff line number Diff line change
@@ -0,0 +1,90 @@
package org.broadinstitute.hellbender.tools.reference;

import org.broadinstitute.hellbender.engine.GATKPath;

import java.util.*;

public class ReferencePair {
jamesemery marked this conversation as resolved.
Show resolved Hide resolved

private final GATKPath ref1;
private final GATKPath ref2;
private final int ref1ColumnIndex;
private final int ref2ColumnIndex;
public enum Status{
jamesemery marked this conversation as resolved.
Show resolved Hide resolved
EXACT_MATCH,
DIFFER_IN_SEQUENCE_NAMES,
DIFFER_IN_SEQUENCE,
DIFFER_IN_SEQUENCES_PRESENT,
SUPERSET,
SUBSET;
}
private EnumSet<Status> analysis;

public ReferencePair(ReferenceSequenceTable table, GATKPath reference1, GATKPath reference2){
ref1 = reference1;
ref2 = reference2;
ref1ColumnIndex = table.getColumnIndices().get(ref1.toPath().getFileName().toString());
ref2ColumnIndex = table.getColumnIndices().get(ref2.toPath().getFileName().toString());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When looking up the reference column indices, call the ReferenceSequenceTable.getReferenceDisplayName() method to turn each reference path into its column name.

analysis = EnumSet.of(Status.EXACT_MATCH);
jamesemery marked this conversation as resolved.
Show resolved Hide resolved
}

public void addStatus(Status status) {
jamesemery marked this conversation as resolved.
Show resolved Hide resolved
analysis.add(status);
}

public void removeStatus(Status status){
analysis.remove(status);
}

public int getRef1ColumnIndex() {
return ref1ColumnIndex;
}

public int getRef2ColumnIndex() {
return ref2ColumnIndex;
}

public String getRef1(){
return ref1.toPath().getFileName().toString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call ReferenceSequenceTable.getReferenceDisplayName() here too

}
public String getRef2(){
return ref2.toPath().getFileName().toString();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call ReferenceSequenceTable.getReferenceDisplayName()

}

/**
* Displays a ReferencePair's status set
*
* @return the status set as a formatted String
*/
public String statusAsString(){
String output = "";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... this currently is missing any indication of WHAT lines the status' are being generated by. I don't see where those are getting printed and it looks like we forgot to do that. I think we need to point the user at the specific lines that are problemeatic somehow for example "chr3 mismatches with reference A and B" for example.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jamesemery We said that we would print just the statuses per reference pair in a compact tabular form -- linking the statuses to individual TableRows is not a feature that has been implemented yet, and is probably out of scope for this PR.

for(Status status : analysis){
output += String.format("\t%s\n", status.name());
}
return output;
}

public EnumSet<Status> getStatus(){
return analysis;
jamesemery marked this conversation as resolved.
Show resolved Hide resolved
}

public String toString(){
return String.format("REFERENCE PAIR: %s, %s\nStatus:\n%s",
ReferenceSequenceTable.getReferenceDisplayName(ref1),
ReferenceSequenceTable.getReferenceDisplayName(ref2),
statusAsString());
}

@Override
public boolean equals(Object o) {
if (this == o) return true;
if (o == null || getClass() != o.getClass()) return false;
ReferencePair that = (ReferencePair) o;
return Objects.equals(ref1, that.ref1) && Objects.equals(ref2, that.ref2);
}

@Override
public int hashCode() {
return Objects.hash(ref1, ref2);
}
}
Loading