Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multisample segmentation, as an extension of ModelSegments #5524

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,285 @@
package org.broadinstitute.hellbender.tools.copynumber.segmentation;

import htsjdk.samtools.util.Locatable;
import htsjdk.samtools.util.OverlapDetector;
import org.apache.commons.math3.distribution.NormalDistribution;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;
import org.broadinstitute.hellbender.tools.copynumber.formats.collections.AllelicCountCollection;
import org.broadinstitute.hellbender.tools.copynumber.formats.collections.CopyRatioCollection;
import org.broadinstitute.hellbender.tools.copynumber.formats.collections.MultidimensionalSegmentCollection;
import org.broadinstitute.hellbender.tools.copynumber.formats.records.AllelicCount;
import org.broadinstitute.hellbender.tools.copynumber.formats.records.CopyRatio;
import org.broadinstitute.hellbender.tools.copynumber.formats.records.MultidimensionalSegment;
import org.broadinstitute.hellbender.tools.copynumber.utils.segmentation.KernelSegmenter;
import org.broadinstitute.hellbender.utils.SimpleInterval;
import org.broadinstitute.hellbender.utils.Utils;
import org.broadinstitute.hellbender.utils.param.ParamUtils;

import java.util.*;
import java.util.function.BiFunction;
import java.util.function.Function;
import java.util.stream.Collectors;

/**
* Segments copy-ratio and alternate-allele-fraction data using kernel segmentation. Segments do not span chromosomes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This documentation appears to be unchanged from the single-sample MultidimensionalKernelSegmenter.

* Only the first allele-fraction site in each copy-ratio interval is used. The alternate-allele fraction in
* copy-ratio intervals that do not contain any sites is imputed to be balanced at 0.5.
*
* @author Samuel Lee <[email protected]>
* @author Marton Kanasz-Nagy <[email protected]>
*/

public final class MultisampleMultidimensionalKernelSegmenter {
private static final Logger logger = LogManager.getLogger(MultisampleMultidimensionalKernelSegmenter.class);

private static final int MIN_NUM_POINTS_REQUIRED_PER_CHROMOSOME = 10;

//assume alternate-allele fraction is 0.5 for missing data
private static final SimpleInterval DUMMY_INTERVAL = new SimpleInterval("DUMMY", 1, 1);
private static final AllelicCount BALANCED_ALLELIC_COUNT = new AllelicCount(DUMMY_INTERVAL, 1, 1);

//Gaussian kernel for a specified variance; if variance is zero, use a linear kernel
private static final Function<Double, BiFunction<Double, Double, Double>> KERNEL =
standardDeviation -> standardDeviation == 0.
? (x, y) -> x * y
: (x, y) -> new NormalDistribution(null, x, standardDeviation).density(y);

private static final class MultisampleMultidimensionalPoint implements Locatable {
private final SimpleInterval interval;
private final List<Double> log2CopyRatiosPerSample;
private final List<Double> alternateAlleleFractionsPerSample;
private int nSamples;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be final and named numSamples (which is more in line with the naming conventions used elsewhere in the CNV code).


MultisampleMultidimensionalPoint(final SimpleInterval interval,
List<Double> log2CopyRatiosPerSample,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these parameters should be final.

List<Double> alternateAlleleFractionsPerSample) {
Utils.nonNull(log2CopyRatiosPerSample);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a private helper class, it's probably overkill to do argument checking; hopefully, you will always be passing valid arguments to this class and its methods. I would probably remove these checks. However, if you want to keep them, I'd do a non-null check of interval and non-empty checks for the lists.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, sorry, I now see that you are always constructing this with empty lists and then adding the points from each sample one by one. This is not really a good pattern for various reasons. Why not try to make this class more "immutable" and just add all of the samples to both lists upon construction? (Although it's probably overkill to actually make the lists immutable, since we can trust the outer class not to muck with them.)

Utils.nonNull(alternateAlleleFractionsPerSample);
Utils.validateArg(log2CopyRatiosPerSample.size() == alternateAlleleFractionsPerSample.size(),
"Number of copy ratio and allele fraciont samples needs to be the same.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in this message.

this.interval = interval;
this.log2CopyRatiosPerSample = log2CopyRatiosPerSample;
this.alternateAlleleFractionsPerSample = alternateAlleleFractionsPerSample;
this.nSamples = log2CopyRatiosPerSample.size();
}

public final void add(SimpleInterval interval, double log2CopyRatio, double alternateAlleleFraction) {
Utils.validateArg(this.interval == interval, "Intervals need to match.");
this.log2CopyRatiosPerSample.add(log2CopyRatio);
this.alternateAlleleFractionsPerSample.add(alternateAlleleFraction);
this.nSamples += 1;
}

@Override
public String getContig() {
return interval.getContig();
}

@Override
public int getStart() {
return interval.getStart();
}

@Override
public int getEnd() {
return interval.getEnd();
}

public int getNumberOfSamples() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These getters are not used (and are probably not necessary, since this is an inner helper class).

return this.nSamples;
}

public List<Double> getLog2CopyRatiosPerSample() {
return this.log2CopyRatiosPerSample;
}

public List<Double> getAlternateAlleleFractionsPerSample() {
return this.alternateAlleleFractionsPerSample;
}
}

private final List<CopyRatioCollection> denoisedCopyRatiosPerSample;
private final List<OverlapDetector<CopyRatio>> copyRatioMidpointOverlapDetectorPerSample;
private final List<AllelicCountCollection> allelicCountsPerSample;
private final List<OverlapDetector<AllelicCount>> allelicCountOverlapDetectorPerSample;
private final List<Comparator<Locatable>> comparatorsPerSample;
private final Map<String, List<MultisampleMultidimensionalPoint>> multisampleMultidimensionalPointsPerChromosome;
private final int numberOfSamples;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably would name this numSamples as well. It's odd to choose something verbose here and something less verbose in the inner class.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that you actually do name the corresponding variable numSamples in your test code, which adds to the inconsistency.



public MultisampleMultidimensionalKernelSegmenter(final List<CopyRatioCollection> denoisedCopyRatiosPerSample,
final List<AllelicCountCollection> allelicCountsPerSample) {
Utils.nonNull(denoisedCopyRatiosPerSample);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would do non-empty checks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also fail early if the copy-ratio bins and allelic-count sites do not match across all samples.

Utils.nonNull(allelicCountsPerSample);
Utils.validateArg(denoisedCopyRatiosPerSample.size() == allelicCountsPerSample.size()
&& denoisedCopyRatiosPerSample.size() > 0,
"Number of copy ratio and allele fraciont samples needs to be the same and non-zero.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Modify this accordingly and fix the typo in the message.

for (int i=0; i<denoisedCopyRatiosPerSample.size(); i++) {
Copy link
Contributor

@samuelklee samuelklee Jan 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop pattern appears many times in this class and in your test code. It's not really an elegant pattern and you should replace it with use of streams instead. Furthermore, there are inconsistencies with the style, from use of horizontal white space to naming of the index variable (I would probably choose sampleIndex to be consistent with the rest of the CNV code, and note that Google style doesn't permit underscores in variable names).

Another odd choice is that you use denoisedCopyRatiosPerSample.size() in the termination condition here and numberOfSamples elsewhere---why not just define the latter one line before, instead of directly below?

Utils.nonNull(denoisedCopyRatiosPerSample.get(i));
Utils.nonNull(allelicCountsPerSample.get(i));
Utils.validateArg(denoisedCopyRatiosPerSample.get(i).getMetadata().equals(allelicCountsPerSample.get(i).getMetadata()),
"Metadata do not match.");
}
this.numberOfSamples = denoisedCopyRatiosPerSample.size();
this.denoisedCopyRatiosPerSample = denoisedCopyRatiosPerSample;
Copy link
Contributor

@samuelklee samuelklee Jan 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good practice to make defensive copies when it's not too onerous.

this.copyRatioMidpointOverlapDetectorPerSample = denoisedCopyRatiosPerSample
Copy link
Contributor

@samuelklee samuelklee Jan 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The style of this stream statement doesn't match typical CNV style. Typically, we would put .stream() on the first line and give each of the chained methods its own line. We would also prefer method references (CopyRatioCollection::getMidpointOverlapDetector) to lambdas (cr -> cr.getMidpointOverlapDetector()) when possible. If the statement is succinct enough, as it arguably is in this case, we could just put it all on one line. Please go through your code and check for consistent style; I won't comment on it elsewhere.

.stream()
.map(cr -> cr.getMidpointOverlapDetector()).collect(Collectors.toList());
this.allelicCountsPerSample = allelicCountsPerSample;
this.allelicCountOverlapDetectorPerSample = allelicCountsPerSample
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

.stream()
.map(ac -> ac.getOverlapDetector()).collect(Collectors.toList());
final List<Integer> numAllelicCountsToUsePerSample = new ArrayList<>();
for (int i_sample=0; i_sample<this.numberOfSamples; i_sample++) {
numAllelicCountsToUsePerSample.add((int) denoisedCopyRatiosPerSample.get(i_sample).getRecords().stream()
.filter(allelicCountOverlapDetectorPerSample.get(i_sample)::overlapsAny)
.count());
logger.info(String.format("Using first allelic-count site in each copy-ratio interval (%d / %d) for multidimensional segmentation...",
numAllelicCountsToUsePerSample.get(i_sample), allelicCountsPerSample.get(i_sample).size()));
}
this.comparatorsPerSample = denoisedCopyRatiosPerSample.stream().map(cr -> cr.getComparator()).collect(Collectors.toList());
List<MultisampleMultidimensionalPoint> multisampleMultidimensionalPointList = denoisedCopyRatiosPerSample
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be final, named multisampleMultidimensionalPoints, and constructed using a stream.

.get(0)
.getRecords()
.stream()
.map(cr -> new MultisampleMultidimensionalKernelSegmenter.MultisampleMultidimensionalPoint(
cr.getInterval(),
new ArrayList<Double>(),
new ArrayList<Double>())).collect(Collectors.toList());

for (int i_sample=0; i_sample<denoisedCopyRatiosPerSample.size(); i_sample++) {
for (int i_segment=0; i_segment<denoisedCopyRatiosPerSample.get(i_sample).getRecords().size(); i_segment++) {
CopyRatio cr = denoisedCopyRatiosPerSample.get(i_sample).getRecords().get(i_segment);
multisampleMultidimensionalPointList.get(i_segment).add(
cr.getInterval(),
cr.getLog2CopyRatioValue(),
allelicCountOverlapDetectorPerSample.get(i_sample).getOverlaps(cr).stream()
.min(comparatorsPerSample.get(i_sample)::compare)
.orElse(BALANCED_ALLELIC_COUNT).getAlternateAlleleFraction()
);
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean up white space as appropriate once you've made changes to this constructor.

multisampleMultidimensionalPointsPerChromosome = multisampleMultidimensionalPointList.stream()
.collect(Collectors.groupingBy(
MultisampleMultidimensionalKernelSegmenter.MultisampleMultidimensionalPoint::getContig,
LinkedHashMap::new,
Collectors.toList()));
}

/**
* Segments the internally held {@link CopyRatioCollection} and {@link AllelicCountCollection}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this documentation. You should describe how the kernel is constructed.

* using a separate {@link KernelSegmenter} for each chromosome.
* @param kernelVarianceCopyRatio variance of the Gaussian kernel used for copy-ratio data;
* if zero, a linear kernel is used instead
* @param kernelVarianceAlleleFraction variance of the Gaussian kernel used for allele-fraction data;
* if zero, a linear kernel is used instead
* @param kernelScalingAlleleFraction relative scaling S of the kernel K_AF for allele-fraction data
* to the kernel K_CR for copy-ratio data;
* the total kernel is K_CR + S * K_AF
*/
public List<MultidimensionalSegmentCollection> findSegmentation(final int maxNumChangepointsPerChromosome,
final double kernelVarianceCopyRatio,
final double kernelVarianceAlleleFraction,
final double kernelScalingAlleleFraction,
final int kernelApproximationDimension,
final List<Integer> windowSizes,
final double numChangepointsPenaltyLinearFactor,
final double numChangepointsPenaltyLogLinearFactor) {
ParamUtils.isPositiveOrZero(maxNumChangepointsPerChromosome, "Maximum number of changepoints must be non-negative.");
ParamUtils.isPositiveOrZero(kernelVarianceCopyRatio, "Variance of copy-ratio Gaussian kernel must be non-negative (if zero, a linear kernel will be used).");
ParamUtils.isPositiveOrZero(kernelVarianceAlleleFraction, "Variance of allele-fraction Gaussian kernel must be non-negative (if zero, a linear kernel will be used).");
ParamUtils.isPositiveOrZero(kernelScalingAlleleFraction, "Scaling of allele-fraction Gaussian kernel must be non-negative.");
ParamUtils.isPositive(kernelApproximationDimension, "Dimension of kernel approximation must be positive.");
Utils.validateArg(windowSizes.stream().allMatch(ws -> ws > 0), "Window sizes must all be positive.");
Utils.validateArg(new HashSet<>(windowSizes).size() == windowSizes.size(), "Window sizes must all be unique.");
ParamUtils.isPositiveOrZero(numChangepointsPenaltyLinearFactor,
"Linear factor for the penalty on the number of changepoints per chromosome must be non-negative.");
ParamUtils.isPositiveOrZero(numChangepointsPenaltyLogLinearFactor,
"Log-linear factor for the penalty on the number of changepoints per chromosome must be non-negative.");

final BiFunction<MultisampleMultidimensionalPoint, MultisampleMultidimensionalPoint, Double> kernel = constructKernel(
kernelVarianceCopyRatio, kernelVarianceAlleleFraction, kernelScalingAlleleFraction);

for (int i_sample=0; i_sample<this.numberOfSamples; i_sample++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will log duplicate messages for each sample, which is confusing and unnecessary. Even if you added the sample index or name to each message, this would be extremely misleading, since you log the messages well before you actually perform the action. This would unnecessarily complicate debugging, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the this keyword is unnecessary. (However, as discussed above, ideally you will get rid of these loops and the numberOfSamples field.)

logger.info(String.format("Finding changepoints in (%d, %d) data points and %d chromosomes...",
denoisedCopyRatiosPerSample.get(i_sample).size(),
allelicCountsPerSample.get(i_sample).size(),
multisampleMultidimensionalPointsPerChromosome.size()));
}

//loop over chromosomes, find changepoints, and create allele-fraction segments
final List<ArrayList<MultidimensionalSegment>> segmentsPerSample = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things. It's always better to program to an interface, i.e. specify List<List<MultidimensionalSegment>> rather than List<ArrayList<MultidimensionalSegment>>. Also, as we discussed, since the changepoints are shared across all samples, it's simpler to first find the master list of changepoints/segments and then use this to create the MultidimensionalSegmentCollection for each sample afterwards (rather than try to do both at the same time, as you do here). This should let you get rid of this intermediate list of lists (these are sometimes awkward to work with in Java) and your procedural code will be more cleanly split between the two tasks (which you could even break up into separate methods, if you wanted).

for (int i_sample=0; i_sample<this.numberOfSamples; i_sample++) {
segmentsPerSample.add(new ArrayList<>());
}
for (final String chromosome : this.multisampleMultidimensionalPointsPerChromosome.keySet()) {
final List<MultisampleMultidimensionalPoint> multisampleMultidimensionalPointsInChromosome = multisampleMultidimensionalPointsPerChromosome.get(chromosome);
final int numMultisampleMultidimensionalPointsInChromosome = multisampleMultidimensionalPointsInChromosome.size();
logger.info(String.format("Finding changepoints in %d data points in chromosome %s...",
numMultisampleMultidimensionalPointsInChromosome, chromosome));

if (numMultisampleMultidimensionalPointsInChromosome < MIN_NUM_POINTS_REQUIRED_PER_CHROMOSOME) {
logger.warn(String.format("Number of points in chromosome %s (%d) is less than that required (%d), skipping segmentation...",
chromosome, numMultisampleMultidimensionalPointsInChromosome, MIN_NUM_POINTS_REQUIRED_PER_CHROMOSOME));
final int start = multisampleMultidimensionalPointsInChromosome.get(0).getStart();
final int end = multisampleMultidimensionalPointsInChromosome.get(numMultisampleMultidimensionalPointsInChromosome - 1).getEnd();
for (int i_sample=0; i_sample<this.numberOfSamples; i_sample++) {
segmentsPerSample.get(i_sample).add(new MultidimensionalSegment(
new SimpleInterval(chromosome, start, end),
this.comparatorsPerSample.get(i_sample),
this.copyRatioMidpointOverlapDetectorPerSample.get(i_sample),
this.allelicCountOverlapDetectorPerSample.get(i_sample)));
}
continue;
}

final List<Integer> changepoints = new ArrayList<>(new KernelSegmenter<>(multisampleMultidimensionalPointsInChromosome)
.findChangepoints(maxNumChangepointsPerChromosome, kernel, kernelApproximationDimension,
windowSizes, numChangepointsPenaltyLinearFactor, numChangepointsPenaltyLogLinearFactor, KernelSegmenter.ChangepointSortOrder.INDEX));

if (!changepoints.contains(numMultisampleMultidimensionalPointsInChromosome)) {
changepoints.add(numMultisampleMultidimensionalPointsInChromosome - 1);
}
int previousChangepoint = -1;
for (final int changepoint : changepoints) {
final int start = this.multisampleMultidimensionalPointsPerChromosome.get(chromosome).get(previousChangepoint + 1).getStart();
final int end = this.multisampleMultidimensionalPointsPerChromosome.get(chromosome).get(changepoint).getEnd();
for (int i_sample=0; i_sample<this.numberOfSamples; i_sample++) {
segmentsPerSample.get(i_sample).add(new MultidimensionalSegment(
new SimpleInterval(chromosome, start, end),
this.comparatorsPerSample.get(i_sample),
this.copyRatioMidpointOverlapDetectorPerSample.get(i_sample),
this.allelicCountOverlapDetectorPerSample.get(i_sample)));
}
previousChangepoint = changepoint;
}
}
logger.info(String.format("Found %d segments in %d chromosomes.", segmentsPerSample.get(0).size(), multisampleMultidimensionalPointsPerChromosome.keySet().size()));

List<MultidimensionalSegmentCollection> segmentationsPerSample = new ArrayList<>();
for (int i_sample=0; i_sample<this.numberOfSamples; i_sample++) {
segmentationsPerSample.add(new MultidimensionalSegmentCollection(
allelicCountsPerSample.get(i_sample).getMetadata(),
segmentsPerSample.get(i_sample)
));
}
return segmentationsPerSample;
}

private BiFunction<MultisampleMultidimensionalPoint, MultisampleMultidimensionalPoint, Double> constructKernel(final double kernelVarianceCopyRatio,
final double kernelVarianceAlleleFraction,
final double kernelScalingAlleleFraction) {
final double standardDeviationCopyRatio = Math.sqrt(kernelVarianceCopyRatio);
final double standardDeviationAlleleFraction = Math.sqrt(kernelVarianceAlleleFraction);
return (p1, p2) -> {
double kernelSum = 0.;
for (int i_sample=0; i_sample<this.numberOfSamples; i_sample++) {
kernelSum += KERNEL.apply(standardDeviationCopyRatio).apply(p1.log2CopyRatiosPerSample.get(i_sample), p2.log2CopyRatiosPerSample.get(i_sample)) +
kernelScalingAlleleFraction * KERNEL.apply(standardDeviationAlleleFraction).apply(p1.alternateAlleleFractionsPerSample.get(i_sample), p2.alternateAlleleFractionsPerSample.get(i_sample));
}
return kernelSum;
};
}
}
Loading