Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SNVQ recalibration for flow based reads #8697

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
56 commits
Select commit Hold shift + click to select a range
0ce6afe
ApplySNVQR tool squashed
dror27 Nov 14, 2023
bf5d2d8
[BIOIN-1385] FlowBasedRead now can estimate filling_value from the read
ilyasoifer Dec 27, 2023
876a4e6
[BIOIN-1385] Removed probabilityRatioThreshold
ilyasoifer Dec 27, 2023
86d3ae4
[BIOIN-1385] Removed mistaken usage by fillingValue
ilyasoifer Dec 28, 2023
c5e2ad8
[BIOIN-1385] Fixed bug in t0 parsing
ilyasoifer Dec 30, 2023
fcbbd3c
[BIOIN-1385] Reverted the aggressive change in filling probability in…
ilyasoifer Jan 1, 2024
ba29b3c
[BIOIN-1385] Minor update of poorly_modelded_reads
ilyasoifer Jan 1, 2024
ac66953
[BIOIN-1385] FlowBasedRead class cleanup
ilyasoifer Jan 2, 2024
8d9782f
[BIOIN-1385] Updated ground truth scorer for the case when the fillin…
ilyasoifer Jan 2, 2024
a032030
Fixed typing error + missed file
ilyasoifer Jan 3, 2024
626a458
ceil -> round
ilyasoifer Jan 3, 2024
b4d8341
Set phred score for bins with no zeros to maximum rather than zero
ilyasoifer Jan 5, 2024
2d69e85
Updated tests
ilyasoifer Jan 5, 2024
92aab19
Updated tests
ilyasoifer Jan 5, 2024
c564dc6
Updated tests
ilyasoifer Jan 5, 2024
51ebdb5
-10 qual fix
dror27 Jan 18, 2024
a7947b9
Guard against negative qualities
dror27 Jan 21, 2024
5e6da62
(not final) reuse AddFlowBaseQuality code to accelerate
dror27 Jan 22, 2024
fcb1f83
--debug-use-add-flow-base-quality-algorithm flasg
dror27 Jan 23, 2024
8f17ea5
use of add_base_q now default
dror27 Jan 23, 2024
8634c11
ApplySNVQR test data rebuilt
dror27 Jan 23, 2024
9ba77d3
add_base_q integration in progress ... missing mapping function
dror27 Jan 23, 2024
44b6763
allow unsorted/unmapped input to ApplySNVQR when no model specified
dror27 Jan 24, 2024
2f478de
SNVQ of called base adjusted to sum to 1
dror27 Jan 24, 2024
13aa6ee
update called base snvq
dror27 Jan 25, 2024
5850557
--limit-phred-score replaces --limit-score
dror27 Jan 28, 2024
d6707d2
flow flags passed to add base quality
dror27 Jan 29, 2024
d0e4a0f
BQ adjusted
dror27 Jan 29, 2024
f79da59
Fixed boundary case
ilyasoifer Jan 31, 2024
b2337dd
Snvq modes added. see --snvq-mode param
dror27 Feb 4, 2024
78818b4
SNVq corrected (all 80 issue)
dror27 Feb 5, 2024
f081cb5
Add missing X_ADJACENT_REF_DIFF
dror27 Feb 19, 2024
1a8b734
Tool renamed to AddFlowSNVQuality and model/conf removed
dror27 Feb 26, 2024
1461c82
further cleanup tests
dror27 Feb 26, 2024
1170d4b
FlowFeatureMapper changes removed
dror27 Feb 26, 2024
7da3aa6
expected results adjusted
dror27 Feb 26, 2024
2b25b21
snvq-mode removed. Geometric mode is one used
dror27 Mar 3, 2024
3cb409c
some comments addressed
dror27 Mar 3, 2024
94d4fe7
Update src/main/java/org/broadinstitute/hellbender/tools/walkers/feat…
dror27 Mar 3, 2024
4afc042
Update src/main/java/org/broadinstitute/hellbender/tools/walkers/feat…
dror27 Mar 3, 2024
4f8320e
0.0001 to keep consistent with GATK
dror27 Mar 3, 2024
9c17210
Merge branch 'ultima.snv-qual.develop.rebase' of github.com:Ultimagen…
dror27 Mar 3, 2024
5577f79
includeDupReads removed
dror27 Mar 3, 2024
ef7ea5e
additional updates
dror27 Mar 3, 2024
eacc100
DEFAULT_FILLING_VALUE restored to 0.001
dror27 Mar 11, 2024
a2b7f54
snvq-mode restored
dror27 Mar 11, 2024
06284b0
AddFlowSNVQuality integeration test expected data adjusted
Mar 16, 2024
4ee1255
fix comment spec on expected output
dror27 Mar 16, 2024
bc0ff83
some of the comments addressed
dror27 Mar 18, 2024
ff06e36
additional comments addressed
dror27 Mar 21, 2024
403bd29
additional comments addressed
dror27 Mar 25, 2024
a6afd8e
Clarified function
ilyasoifer Mar 28, 2024
516c152
limit scores precision to regain platform independence
dror27 Mar 31, 2024
4ee5b6d
documented generateHmerBaseErrorProbabilities
dror27 Mar 31, 2024
d9c090b
AddFlowSNVQuality now overrides read base quality by default
dror27 Mar 31, 2024
0c73917
additional doc comments addressed
dror27 Apr 4, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,6 @@ public class FlowBasedArgumentCollection implements Serializable {
private static final long serialVersionUID = 0;

public static final String FLOW_USE_T0_TAG = "flow-use-t0-tag";
public static final String PROBABILITY_RATIO_THRESHOLD_LONG_NAME = "flow-probability-threshold";
public static final String REMOVE_LONGER_THAN_ONE_INDELS_LONG_NAME = "flow-remove-non-single-base-pair-indels";
public static final String REMOVE_ONE_TO_ZERO_PROBS_LONG_NAME = "flow-remove-one-zero-probs";
public static final String NUMBER_OF_POSSIBLE_PROBS_LONG_NAME = "flow-quantization-bins";
Expand All @@ -27,8 +26,7 @@ public class FlowBasedArgumentCollection implements Serializable {



private static final double DEFAULT_RATIO_THRESHOLD = 0.003;
private static final double DEFAULT_FILLING_VALUE = 0.001;
public static final double DEFAULT_FILLING_VALUE = 0.001;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not 0.0001?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to keep it 0.001 to be consistent with the GATK

private static final boolean DEFAULT_REMOVE_LONGER_INDELS = false;
private static final boolean DEFAULT_REMOVE_ONE_TO_ZERO = false;
private static final boolean DEFAULT_SYMMETRIC_INDELS = false;
Expand All @@ -45,10 +43,6 @@ public class FlowBasedArgumentCollection implements Serializable {
@Argument(fullName = FLOW_USE_T0_TAG, doc = "Use t0 tag if exists in the read to create flow matrix", optional = true)
public boolean useT0Tag = DEFAULT_FLOW_USE_T0_TAG;

@Advanced
@Argument(fullName = PROBABILITY_RATIO_THRESHOLD_LONG_NAME, doc = "Lowest probability ratio to be used as an option", optional = true)
public double probabilityRatioThreshold = DEFAULT_RATIO_THRESHOLD;

@Advanced
@Argument(fullName = REMOVE_LONGER_THAN_ONE_INDELS_LONG_NAME, doc = "Should the probabilities of more then 1 indel be used", optional = true)
public boolean removeLongerThanOneIndels = DEFAULT_REMOVE_LONGER_INDELS;
Expand Down

Large diffs are not rendered by default.

Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
package org.broadinstitute.hellbender.tools.walkers.featuremapping;

import org.broadinstitute.barclay.argparser.Advanced;
import org.broadinstitute.barclay.argparser.Argument;
import org.broadinstitute.barclay.argparser.Hidden;

import java.io.Serializable;
import java.util.List;

/**
* Set of arguments for the {@link AddFlowSNVQuality}
*/
public class AddFlowSNVQualityArgumentCollection implements Serializable{
private static final long serialVersionUID = 1L;
public static final String MAX_PHRED_SCORE_FULL_NAME = "max-phred-score";
public static final String KEEP_SUPPLEMENTARY_ALIGNMENTS_FULL_NAME = "keep-supplementary-alignments";
public static final String INCLUDE_QC_FAILED_READ_FULL_NAME = "include-qc-failed-read";
public static final String SNVQ_MODE_FULL_NAME = "snvq-mode";
public static final String OUTPUT_QUALITY_ATTRIBUTE_FULL_NAME = "output-quality-attribute";
public static final String DEBUG_READ_NAME_FULL_NAME = "debug-read-name";
public static final String DEBUG_COLLECT_STATS_INTO_FULL_NAME = "debug-collect-stats-into";

public enum SnvqModeEnum {
Legacy,
Optimistic,
Pessimistic,
Geometric
};

/**
* maximum value for
* delta in score
**/
@Argument(fullName = MAX_PHRED_SCORE_FULL_NAME, doc = "Limit value for phred scores", optional = true)
public double maxPhredScore = Double.NaN;

/**
* keep supplementary alignments?
**/
@Argument(fullName = KEEP_SUPPLEMENTARY_ALIGNMENTS_FULL_NAME, doc = "keep supplementary alignments ?", optional = true)
public boolean keepSupplementaryAlignments = true;

@Advanced
@Argument(fullName= INCLUDE_QC_FAILED_READ_FULL_NAME, doc = "include reads with QC failed flag", optional = true)
public boolean includeQcFailedReads = true;

/**
* snvq computation mode
*/
@Argument(fullName = SNVQ_MODE_FULL_NAME, doc = "snvq calculation mode.", optional = true)
public SnvqModeEnum snvMode = SnvqModeEnum.Geometric;

/**
* By default this tool overwrites the QUAL field with the new qualities. Setting this argument saves the original qualities in the specified SAM tag.
*/
@Argument(fullName = OUTPUT_QUALITY_ATTRIBUTE_FULL_NAME, doc = "alternate SAM tag to put original quality scores instead of overwriting the QUAL field. If not used, QUAL will be overwritten.", optional = true)
public String outputQualityAttribute = null;

/**
* debug read names?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think more comments on all of these arguments would be useful. For example, this doc should be more like:

Read names of reads to output details of as part of debugging. 
Statistics about the reads will be output to xxx filename.

Copy link
Contributor

@dror27 dror27 Mar 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thank you for the texts

**/
@Hidden
@Argument(fullName = DEBUG_READ_NAME_FULL_NAME, doc = "Read names of reads to output details of as part of debugging. ", optional = true)
public List<String> debugReadName = null;

@Advanced
@Hidden
@Argument(fullName= DEBUG_COLLECT_STATS_INTO_FULL_NAME, doc = "Statistics about the reads will be output to given filename.", optional = true)
public String debugCollectStatsInto = null;
}
Original file line number Diff line number Diff line change
Expand Up @@ -239,6 +239,25 @@ private static double[][] extractErrorProbBands(final FlowBasedRead flowRead, fi
return result;
}

/**
* The following functions estimate the error probability for an hmer. specifically two error
* probability values are generated: one for the first base of the hmer and another for the
* rest of its bases.
*
* The computation itself is performed in a subsequent function: generateSidedHmerBaseErrorProbability
* It iterates over the possible valid combinations of errors and sums them up.
*
* @param key - key (hmer length) in flow space
* @param errorProbBands - for each flow (position in the key) three error probabilities are provided:
* [0] - for the hmer being one base shorter
* [1] - for the hmer to be at its length
* [2] - for the hmer to be one base longer
* @param flow - the flow (index) for which to generate the probabilities (0 <= flow < key.length)
* @param flowOrderLength - the cycle length of of the flow order (usually 4)
* @return an array of two probabilities:
* [0] - probability for the first base of the hmer
* [1] - probability for the rest of the bases of the hmer
*/
@VisibleForTesting
protected static double[] generateHmerBaseErrorProbabilities(final int[] key, final double[][] errorProbBands, final int flow, final int flowOrderLength) {

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -858,9 +858,9 @@ else if ( hasQ || hasZ ) {
cols.put("ReadName", read.getName());

// haplotypes and reference scores
cols.put("PaternalHaplotypeScore", paternal.score);
cols.put("MaternalHaplotypeScore", maternal.score);
cols.put("RefHaplotypeScore", refScore);
cols.put("PaternalHaplotypeScore", String.format("%.6f", paternal.score));
cols.put("MaternalHaplotypeScore", String.format("%.6f", maternal.score));
cols.put("RefHaplotypeScore", String.format("%.6f", refScore));

// build haplotype keys
final FlowBasedReadUtils.ReadGroupInfo rgInfo = FlowBasedReadUtils.getReadGroupInfo(getHeaderForReads(), read);
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -157,6 +157,7 @@ public class GroundTruthScorer extends ReadWalker {
private static final int BASE_VALUE_MAX = FlowBasedRead.DEFAULT_FLOW_ORDER.length() - 1;

private static final double NORMALIZED_SCORE_THRESHOLD_DEFAULT = -0.1;
private static final double DEFAULT_RATIO_THRESHOLD = 0.003;

/*
Private accumulator class for counting false/true observations (hence Boolean).
Expand Down Expand Up @@ -502,7 +503,7 @@ public void closeTool() {
// write reports
if ( reportFilePath != null ) {
final GATKReport report = new GATKReport(
BooleanAccumulator.newReportTable(qualReport, "qual", fbargs.probabilityRatioThreshold, omitZerosFromReport),
BooleanAccumulator.newReportTable(qualReport, "qual", DEFAULT_RATIO_THRESHOLD, omitZerosFromReport),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related to the SNVQ changes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not. It is a FlowBaseRead update. right @ilyasoifer ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and no, there were some issues in the calculation because of how the flow matrix was parsed in the haplotypeCaller (in the context of the haplotypeCaller we always allow pretty high error probability) that caused artefacts in the SNVQ calculation.
The update made the SNVQ calculation more precise

BooleanAccumulator.newReportTable(qualReport, "qual", "hmer", omitZerosFromReport),
BooleanAccumulator.newReportTable(qualReport, "qual", "hmer", "deviation", "base", omitZerosFromReport),
PercentileReport.newReportTable(percentileReports, qualityPercentiles)
Expand Down
Original file line number Diff line number Diff line change
@@ -1,23 +1,51 @@
package org.broadinstitute.hellbender.tools.walkers.groundtruth;

import org.apache.commons.collections.map.LazySortedMap;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

import java.io.IOException;
import java.io.PrintWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SeriesStats {

private static final Logger logger = LogManager.getLogger(SeriesStats.class);

// local state
private double last = Double.NaN;
private int count = 0;
private double sum = 0;
private double min = Double.NaN;
private double max = Double.NaN;
private SortedMap<Double, AtomicInteger> bins = new TreeMap<>();
private int intCount = 0;
private Map<Double, SeriesStats> auxBins = new LinkedHashMap<>();

public void csvWrite(final String path) throws IOException {
logger.info("Writing SeriesStats " + toDigest() + " into " + path);
PrintWriter pw = new PrintWriter(path);
pw.println("value,count");
boolean intKeys = isIntKeys();
for (Map.Entry<Double, AtomicInteger> entry : bins.entrySet() ) {
if ( intKeys ) {
pw.println(String.format("%d,%d", entry.getKey().intValue(), entry.getValue().get()));
} else {
pw.println(String.format("%f,%d", entry.getKey(), entry.getValue().get()));
}
}
pw.close();
}

void add(double v) {
public void add(int v) {
add((double)v);
intCount++;
}

public void add(double v) {

// save in simple values
last = v;
Expand All @@ -31,10 +59,11 @@ void add(double v) {
count++;

// save in bins
if ( bins.containsKey(v) ) {
bins.get(v).incrementAndGet();
final Double key = v;
if ( bins.containsKey(key) ) {
bins.get(key).incrementAndGet();
} else {
bins.put(v, new AtomicInteger(1));
bins.put(key, new AtomicInteger(1));
}
}

Expand Down Expand Up @@ -109,4 +138,23 @@ public double getStd() {
return Math.sqrt(variance);
}

public Map<Double, AtomicInteger> getBins() {
return this.bins;
}

public Map<Double, SeriesStats> getAuxBins() {
return this.auxBins;
}

public String toDigest() {
if ( isIntKeys() ) {
return String.format("count=%d, min=%d, max=%d, median=%d, bin.count=%d", getCount(), (int)getMin(), (int)getMax(), (int)getMedian(), getBins().size());
} else {
return String.format("count=%d, min=%f, max=%f, median=%f, bin.count=%d", getCount(), getMin(), getMax(), getMedian(), getBins().size());
}
}

private boolean isIntKeys() {
return (count == intCount);
}
}
Original file line number Diff line number Diff line change
Expand Up @@ -133,12 +133,12 @@ public AlleleLikelihoods<GATKRead, Haplotype> computeReadLikelihoods(final List<
@Override
public ToDoubleFunction<GATKRead> log10MinTrueLikelihood(final double expectedErrorRate, final boolean capLikelihoods) {
final double log10ErrorRate = Math.log10(expectedErrorRate);
final double catastrophicErrorRate = fbargs.fillingValue;
final double log10catastrophicErrorRate = Math.log10(fbargs.fillingValue);
final double largeEventErrorRate = Math.max(fbargs.fillingValue, 0.000001); // error rate for non-hmer/snv errors that are not seq. errors.
final double log10catastrophicErrorRate = Math.log10(largeEventErrorRate);
return read -> {
final double maxErrorsForRead = capLikelihoods ? Math.max(MAX_ERRORS_FOR_READ_CAP, Math.ceil(read.getLength() * expectedErrorRate)) : Math.ceil(read.getLength() * expectedErrorRate);
final double maxCatastrophicErrorsForRead = capLikelihoods ? Math.max(MAX_CATASTROPHIC_ERRORS_FOR_READ_CAP, Math.ceil(read.getLength() * fbargs.fillingValue)) :
Math.ceil(read.getLength() * fbargs.fillingValue);
final double maxCatastrophicErrorsForRead = capLikelihoods ? Math.max(MAX_CATASTROPHIC_ERRORS_FOR_READ_CAP, Math.ceil(read.getLength() * largeEventErrorRate)) :
Math.ceil(read.getLength() * largeEventErrorRate);
return maxErrorsForRead * log10ErrorRate + maxCatastrophicErrorsForRead * log10catastrophicErrorRate;
};
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -132,12 +132,13 @@ public AlleleLikelihoods<GATKRead, Haplotype> computeReadLikelihoods(final List<
@Override
public ToDoubleFunction<GATKRead> log10MinTrueLikelihood(final double expectedErrorRate, final boolean capLikelihoods) {
final double log10ErrorRate = Math.log10(expectedErrorRate);
final double catastrophicErrorRate = Math.log10(fbargs.fillingValue);
final double largeEventErrorRate = 0.001; // error rate for non-hmer/snv errors that are not seq. errors.
final double log10catastrophicErrorRate = Math.log10(largeEventErrorRate);

return read -> {
final double maxErrorsForRead = Math.max(3.0, Math.ceil(read.getLength() * expectedErrorRate));
final double maxCatastrophicErrorsForRead = Math.max(2.0, Math.ceil(read.getLength() * fbargs.fillingValue));
return maxErrorsForRead * log10ErrorRate + maxCatastrophicErrorsForRead*catastrophicErrorRate;
final double maxCatastrophicErrorsForRead = Math.max(2.0, Math.ceil(read.getLength() * largeEventErrorRate));
return maxErrorsForRead * log10ErrorRate + maxCatastrophicErrorsForRead*log10catastrophicErrorRate;
};
}

Expand Down
Loading
Loading