Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding soft clipped ratio to OverclippedReadFilter #5995

Merged
merged 5 commits into from
Jun 14, 2019

Conversation

jonn-smith
Copy link
Collaborator

Added two new arguments to OverclippedReadFilter that enable filtering based on a ratio of soft clipped reads to total read length.

Necessary for long reads analysis.

@codecov
Copy link

codecov bot commented Jun 10, 2019

Codecov Report

Merging #5995 into master will decrease coverage by 68.497%.
The diff coverage is 7.921%.

@@               Coverage Diff               @@
##              master    #5995        +/-   ##
===============================================
- Coverage     86.927%   18.43%   -68.497%     
+ Complexity     32732     9207     -23525     
===============================================
  Files           2014     2016         +2     
  Lines         151333   151426        +93     
  Branches       16612    16623        +11     
===============================================
- Hits          131549    27908    -103641     
- Misses         13724   120739    +107015     
+ Partials        6060     2779      -3281
Impacted Files Coverage Δ Complexity Δ
...lbender/cmdline/ReadFilterArgumentDefinitions.java 0% <ø> (ø) 0 <0> (ø) ⬇️
.../engine/filters/OverclippedReadFilterUnitTest.java 5.263% <0%> (-94.737%) 1 <0> (-3)
...llbender/engine/filters/OverclippedReadFilter.java 13.043% <12.5%> (-86.957%) 0 <0> (-9)
...llbender/engine/filters/SoftClippedReadFilter.java 14.706% <14.706%> (ø) 1 <1> (?)
.../engine/filters/SoftClippedReadFilterUnitTest.java 3.571% <3.571%> (ø) 1 <1> (?)
...ls/variant/writers/GVCFBlockCombiningIterator.java 0% <0%> (-100%) 0% <0%> (-1%)
...nder/tools/copynumber/utils/TagGermlineEvents.java 0% <0%> (-100%) 0% <0%> (-3%)
...r/tools/spark/pathseq/PSBwaArgumentCollection.java 0% <0%> (-100%) 0% <0%> (-1%)
...titute/hellbender/utils/mcmc/DecileCollection.java 0% <0%> (-100%) 0% <0%> (-5%)
...ender/tools/readersplitters/ReadGroupSplitter.java 0% <0%> (-100%) 0% <0%> (-3%)
... and 1675 more

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple of issues, and one or two questions. The bigger question I have however is why continue to use minAlignedBases in the ratio tests at all, especially since it's no longer configurable. You are enforcing that reads below (right now 30) still fail the overclipped filter and then asserting on top of that that their clipping is below your threshold. I would say you have two options:

  1. Make them mutually exclusive, thus if you are in one of the ratio modes the aligned length isn't considered at all. This has the drawback that you can never apply both tests if you want but does away with the hard coded 30 value that should affect the ratio tests.
  2. Make them all apply configurably. I.e. you can run with both a clipping ratio test AND a minimum absolute number of aligned bases test (and perhaps even both clipping ratio and edge clipping ratio). This is doable in this class with relative ease (test just gathers everything at once and decides based on settings which ones to run) but you may also want to pull the ratio tests out into separate filters if you do that.


@Override
public boolean test(final GATKRead read) {
logger.debug( "About to test read: " + read.toString());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This debug statement is going to be expensive, as it has to call read.toString() for every read which means every read gets parsed completely even if debug is off. Either use the message supplier paradigm or encase this in an if statement.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to remove all debug logging statements anyway - they were in for... well.. debugging.

public boolean test(final GATKRead read) {
logger.debug( "About to test read: " + read.toString());

// NOTE: Since we have mutex'd the args for the clipping ratios, we only need to see if they
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a line about this behavior to the javadoc for this filter as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


final double softClipRatio = ((double)numSoftClippedBases / (double)totalLength);

logger.debug( " Num Soft Clipped Bases: " + numSoftClippedBases );
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, use suppliers or hide them behind if statements to prevent unnecessary method calls.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug statements are getting removed.

static final long serialVersionUID = 1L;
private final Logger logger = LogManager.getLogger(this.getClass());

private static final double DEFAULT_MIN_SOFT_CLIPPED_RATIO = Double.MIN_VALUE;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These default values get filled in on the documentation page here: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_engine_filters_OverclippedReadFilter.php meaning that you will get some godawful representation of Double.MIN_VALUE filled in on this page. I would suggest either choosing a more manageable default value like 1.0, or setting this to a Double object and making the default value null so it shows up as such and is not confusing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Double / null it is!

private int minAlignedBases = 30;

@VisibleForTesting
@Argument(fullName = "soft-clipped-ratio-threshold",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move these names to constants in ReadFilterArgumentDefinitions

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


@VisibleForTesting
@Argument(fullName = "leading-trailing-soft-clipped-ratio",
doc = "Minimum ratio of soft clipped bases (leading / trailing the cigar string) to total bases in read for read to be included.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to necessarily make these ("leading-trailing-soft-clipped-ratio", "soft-clipped-ratio-threshold") mutually exclusive? It seems like it may be possible that one cares about different values for one or the other.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For my use case there is never a time when you want to do a combined filter on both. I don't think it's useful to have them both on at the same time.

We can fix it if someone ever wants to make this happen together.

logger.debug( " Read Filter status: " + (softClipRatio > minimumSoftClippedRatio) );
logger.debug( " Aligned Length: " + alignedLength );

return (alignedLength >= minAlignedBases) && (softClipRatio > minimumSoftClippedRatio);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should either be || or you should drop the first clause

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OBE

final double clipRatio,
final boolean expectedResult) {

final OverclippedReadFilter filter = new OverclippedReadFilter(10);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks


// ---------------------------------------
// Min bases test:
testData.add(new Object[] { "20S5M", 0.2, false }); // 20/25 = .8
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont think these tests actually demonstrate the minAlignedBases behavior, as both of them have <30 matching cigar bases.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They were showing that strings below 30 bases will return false even though by the other metric they would pass. The other tests have a smaller minAlignedBases and all pass because the threshold has been hit.

These tests have since been removed anyway because of the new class structure.

logger.debug( " Read Filter status: " + (softClipRatio > minimumLeadingTrailingSoftClippedRatio) );
logger.debug( " Aligned Length: " + alignedLength );

return (alignedLength >= minAlignedBases) && (softClipRatio > minimumLeadingTrailingSoftClippedRatio);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these arguments are really mutually exclusive I wouldn't have both options (minAlignedBases, and minimumLeadingTrailingSoftClippedRatio) apply in every case. Either let them both exist at once (as they are all 3 separate tests).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OBE

@jamesemery
Copy link
Collaborator

Perhaps call the alternative a SoftclipRatioReadFilter or something similar

@jonn-smith
Copy link
Collaborator Author

Yeah - I'm gonna pull this out into a new read filter.

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One comment about the argument name and then you should feel free to merge.

public static final String SOFT_CLIPPED_RATIO_THRESHOLD = "soft-clipped-ratio-threshold";
public static final String SOFT_CLIPPED_LEADING_TRAILING_RATIO_THRESHOLD = "soft-clipped-leading-trailing-ratio";

public static final String INVERT_FILTER = "invert-filter";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would make this "invert-soft-clip-ratio" or something like that to make it a little clearer that its associated with THIS filter not just any filter on the command line

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Really, this should be a feature of every reader. I'll add a ticket for it.

@jonn-smith jonn-smith merged commit 744987b into master Jun 14, 2019
@jonn-smith jonn-smith deleted the jts_overclipped_read_filter_ratio branch June 14, 2019 19:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants