-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
CNNVariant Update models, validate scores, cleanup training #5175
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry it took a while to get to this - did a first pass on everything except the CNNVariantTrain changes, but I had some questions, and the concordance test fails (not sure why that doesn't show up as a red x on the PR...), I think its just that the test file name is misspelled in the test.
scripts/gatkcondaenv.yml.template
Outdated
@@ -43,9 +43,8 @@ dependencies: | |||
- scikit-learn==0.19.1 | |||
- scipy==1.0.0 | |||
- six==1.11.0 | |||
- $tensorFlowDependency | |||
- tensorflow-tensorboard==0.4.0rc3 | |||
- tensorflow==1.9.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as we still have the build script generating two conda yml files, we should keep the $tensorFlowDependency
reference in here. Also, the version referenced in the yml doesn't match the version in build.gradle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
import org.broadinstitute.hellbender.utils.tsv.TableWriter; | ||
import org.broadinstitute.hellbender.utils.tsv.TableReader; | ||
|
||
public class InfoConcordanceRecord { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whole class needs javadoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
return writer; | ||
} | ||
catch (IOException e) { | ||
throw new UserException(String.format("Encountered an IO exception while reading from %s.", outputTable), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is writing, not reading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
programGroup=VariantEvaluationProgramGroup.class) | ||
@DocumentedFeature | ||
@BetaFeature | ||
public class VcfInfoConcordance extends AbstractConcordanceWalker { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this sufficiently different from the Concordance
tool to be a separate tool ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tool name should probably have a verb. Maybe EvaluateInfoFieldConcordance
or EvaluateSiteConcordance
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs javadoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, and yes I think this is different enough from concordance which evaluates a caller, because this tool is intended for filter scores or annotations.
@Argument(doc="A table of summary statistics (true positives, sensitivity, etc.)", fullName="summary", shortName=SUMMARY_SHORT_NAME) | ||
protected File summary; | ||
@Argument(fullName="eval-info-key", shortName="eval-info-key", doc="Info key from eval vcf", optional=true) | ||
protected String evalInfoKey = "CNN_2D"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the existing GATKVCFConstant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
scoreScan.useDelimiter("\\n"); | ||
writeVCFHeader(vcfWriter); | ||
} catch (IOException e) { | ||
throw new GATKException("Error when trying to write annotated VCF.", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could be the score file - might be worth having a nested try/catch for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the Scanner is never closed anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it seems createVCFWriter wont throw IOExceptions, so I just fixed the error message. Scanner is now closed.
logger.info("Done scoring variants with CNN."); | ||
if ( vcfWriter != null ) { | ||
vcfWriter.close(); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should test the Scanner for null and also close that if not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -486,5 +495,20 @@ private void setArchitectureAndWeightsFromResources() { | |||
} | |||
} | |||
|
|||
private void startPythonSession(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you rename this to something more specific - maybe one suggestion would be initializePythonArgsAndModel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be a lot simple if all of the code that figures out the values for weights
and architecture
was in one place, i.e., all of the if blocks in setArchitectureAndWeightsFromResources
and startPythonSession
together in one place. The case where you use Python None
would require special treatment, but it would be much easier to read if the logic were consolidated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed and consolidated the fxns.
pythonExecutor.sendSynchronousCommand(String.format("tempFile = open('%s', 'w+')" + NL, scoreFile.getAbsolutePath())); | ||
pythonExecutor.sendSynchronousCommand("import vqsr_cnn" + NL); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move the above import vqsr_cnn
into the python code like you did with the keras import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not easily, because the executePythonCommand
calls vqsr_cnn.score_and_write_batch so this python session needs the vqsr_cnn module loaded
@@ -92,15 +92,15 @@ | |||
* -weights path/to/my_weights.hd5 | |||
* </pre> | |||
*/ | |||
@BetaFeature |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done only superficial review of the python code behind this. Has anyone else ever reviewed it ? We should talk about how to get that done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbabadi did a review too, but I would be happy for someone to take closer look if there's a volunteer.
@cmnbroad thanks for the review back to you! |
Codecov Report
@@ Coverage Diff @@
## master #5175 +/- ##
============================================
- Coverage 86.77% 86.75% -0.02%
- Complexity 29910 29935 +25
============================================
Files 1835 1838 +3
Lines 138574 138742 +168
Branches 15255 15276 +21
============================================
+ Hits 120250 120369 +119
- Misses 12772 12808 +36
- Partials 5552 5565 +13
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Round 2.
import org.broadinstitute.hellbender.utils.variant.GATKVCFConstants; | ||
import picard.cmdline.programgroups.VariantEvaluationProgramGroup; | ||
|
||
@CommandLineProgramProperties( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tool needs javadoc, including a usage example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The doc here, and the summary lines below, should reflect/mention that this handles one key at a time (currently, the doc uses plural "keys" so it sounds like it can evaluate more than one at a time).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc added, we can use two different keys from 1 VCF or 1 or 2 keys for 2 different vcfs. Clarified in the doc.
static final String USAGE_SUMMARY = "This tool evaluates info fields from an input VCF against a VCF that has been validated and is considered to represent ground truth.\n"; | ||
public static final String SUMMARY_LONG_NAME = "summary"; | ||
public static final String SUMMARY_SHORT_NAME = "S"; | ||
@Argument(doc="A table of summary statistics (true positives, sensitivity, etc.)", fullName="summary", shortName=SUMMARY_SHORT_NAME) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fullName
can use the constant above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
public static final String SUMMARY_SHORT_NAME = "S"; | ||
@Argument(doc="A table of summary statistics (true positives, sensitivity, etc.)", fullName="summary", shortName=SUMMARY_SHORT_NAME) | ||
protected String summary; | ||
@Argument(fullName="eval-info-key", shortName="eval-info-key", doc="Info key from eval vcf", optional=true) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another nit - can you put a blank line between each command line arg definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
@Argument(doc="A table of summary statistics (true positives, sensitivity, etc.)", fullName="summary", shortName=SUMMARY_SHORT_NAME) | ||
protected String summary; | ||
@Argument(fullName="eval-info-key", shortName="eval-info-key", doc="Info key from eval vcf", optional=true) | ||
protected String evalInfoKey = GATKVCFConstants.CNN_2D_KEY; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these are going to be optional, and have this default, the doc should specify that that it uses this key, and where it comes from.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made them required, I think that is less error-prone.
|
||
private void infoDifference(VariantContext eval, VariantContext truth) { | ||
double evalVal = Double.valueOf((String)eval.getAttribute(this.evalInfoKey)); | ||
double truthVal = Double.valueOf((String)truth.getAttribute(this.truthInfoKey)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These should handle the case where the key isn't present, otherwise it will throw a null pointer exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This tool should inspect both truth and eval headers and throw, or at least warn/log, if the header doesn't contain an info header line for the respective keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
import htsjdk.variant.variantcontext.VariantContext; | ||
|
||
import java.io.File; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
if (Math.abs(delta) > this.epsilon) { | ||
this.logger.warn(String.format("Difference (%f) greater than epsilon (%f) at %s:%d %s:", delta, this.epsilon, eval.getContig(), eval.getStart(), eval.getAlleles().toString())); | ||
this.logger.warn(String.format("\t\tTruth info: " + truth.getAttributes().toString())); | ||
this.logger.warn(String.format("\t\t Eval info: " + eval.getAttributes().toString())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extra space before " Eval".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
final double mean; | ||
final double std; | ||
|
||
public InfoConcordanceRecord(VariantContext.Type type, String evalKey, String trueKey, double mean, double std) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Public methods and classes need javadoc - it should be trivial to add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
} | ||
|
||
String getArgsAndModel; | ||
if (weights != null && architecture != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my comment in the last review round, I was thinking more along the lines of collapsing the logic to make it easier to see whats going on, ie.
if (!(tensorType.equals(TensorType.read_tensor) || tensorType.equals(TensorType.reference))) {
throw new GATKException("No default architecture for tensor type:" + tensorType.name());
}
if (architecture == null) {
architecture = tensorType.equals(TensorType.read_tensor) ?
IOUtils.writeTempResourceFromPath(resourcePathReadTensor, null).getAbsolutePath() :
IOUtils.writeTempResourceFromPath(resourcePathReferenceTensor, null).getAbsolutePath();
}
if (weights == null) {
weights = tensorType.equals(TensorType.read_tensor) ?
IOUtils.writeTempResourceFromPath(
resourcePathReadTensor.replace(".json", ".hd5"),
null).getAbsolutePath() :
IOUtils.writeTempResourceFromPath(
resourcePathReferenceTensor.replace(".json", ".hd5"), null).getAbsolutePath();
}
// single logger call...
// single start_session_get_args_and_model call...
There would always be some initial value for weights
, even if the user provides an architecture
but not weights, so there could be a single call to the python code with everything, instead of varying the number of args. It would be much cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, what is the behavior if the user provides arch/weights that are out of sync with each other ? I still think it might make more sense to have the input just be a folder name, so it matches the output from CNNVariantTrain, maybe with optional per-file arch/weight overrides.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, but can we save this for the PEP8 PR that is on-deck? It will require several python changes to the python code as well as updates to the WDLs, etc so I would rather only refactor that stuff once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
@@ -5,6 +5,7 @@ | |||
|
|||
TENSOR_MAPS_2D = ['read_tensor'] | |||
TENSOR_MAPS_1D = ['reference'] | |||
# noinspection PyInterpreter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious about why this is necessary, or if its accidental ? this applies to the next line right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accident, I don't know.
final Path summary = createTempPath("summary", ".txt"); | ||
final ArgumentsBuilder argsBuilder = new ArgumentsBuilder(); | ||
argsBuilder.addArgument(AbstractConcordanceWalker.EVAL_VARIANTS_SHORT_NAME, inputVcf) | ||
.addArgument(AbstractConcordanceWalker.TRUTH_VARIANTS_LONG_NAME, inputVcf) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also meant to suggest adding at least one test that uses a different file for truth than eval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
@cmnbroad thanks for the review. I think addressed all comments except the arguments/weights simplification, which I would prefer to save for the PEP8 refactor we discussed. Back to you! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there - a few remaining cleanup requests and a bug.
switch (concordanceState) { | ||
case TRUE_POSITIVE: { | ||
snpCount++; | ||
indelCount++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is counting every variant as a snp and an indel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh man that's bad. Thanks for the catch, fixed.
this.logger.warn(String.format("Difference (%f) greater than epsilon (%f) at %s:%d %s:", delta, this.epsilon, eval.getContig(), eval.getStart(), eval.getAlleles().toString())); | ||
this.logger.warn(String.format("\t\tTruth info: " + truth.getAttributes().toString())); | ||
this.logger.warn(String.format("\t\t Eval info: " + eval.getAttributes().toString())); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We generally don't leave debugging code enabled. If its just for (your) debugging, I'd either remove it, or else make it conditional on an @Advanced
or @Hidden
arg (and probably make epsilon @Advanced
or @Hidden
as well).
this.std = std; | ||
} | ||
|
||
public VariantContext.Type getVariantType() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of the public methods and (embedded classes) still need javadoc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
} | ||
|
||
String getArgsAndModel; | ||
if (weights != null && architecture != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
} | ||
|
||
@Test | ||
public void test2Vcfs() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two test methods are identical except for the test params. It would be preferable to use a DataProvider with truth/eval files, keys, and snp/indel mean/stdev and remove the redundancy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup that is much nicer. Thanks!
@cmnbroad back to you. The build.gradle conflict is from the genomicsdb and tensorflow version updates. I can resolve and rebase if it looks good to you. |
@lucidtronix Looks good now - lets squash and rebase/resolve on master and run tests again and should be good to go. |
fb661e4
to
65d0edd
Compare
Squashed rebased and checks pass, @cmnbroad good to merge? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @lucidtronix!
Code cleanup and technical debt payback, plus updated models.