Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multisample segmentation, as an extension of ModelSegments #5524

Closed
wants to merge 1 commit into from

Conversation

MartonKN
Copy link
Contributor

ModelSegments can currently deal with single sample segmentation. This branch contains the backend class (and the corresponding unit tests) that is able to segment based on multiple data samples. The updated version of the front-end class ModelSegments will be addressed in a different branch.

…gments. Also wrote a number of unit tests for it.
@samuelklee
Copy link
Contributor

@MartonKN thanks for turning this around quickly. I'll review today now that I'm back from vacation.

@samuelklee
Copy link
Contributor

The title of this PR is not really accurate, as you even point out in your first comment. Can you change it?

Copy link
Contributor

@samuelklee samuelklee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at the code for the backend class and skimmed the test code. There are some stylistic changes that I think are worth making throughout before I continue the review. While you make these changes, I'll take a more detailed look at your test code.

import java.util.stream.Collectors;

/**
* Segments copy-ratio and alternate-allele-fraction data using kernel segmentation. Segments do not span chromosomes.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This documentation appears to be unchanged from the single-sample MultidimensionalKernelSegmenter.

private final SimpleInterval interval;
private final List<Double> log2CopyRatiosPerSample;
private final List<Double> alternateAlleleFractionsPerSample;
private int nSamples;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be final and named numSamples (which is more in line with the naming conventions used elsewhere in the CNV code).

private int nSamples;

MultisampleMultidimensionalPoint(final SimpleInterval interval,
List<Double> log2CopyRatiosPerSample,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both of these parameters should be final.

MultisampleMultidimensionalPoint(final SimpleInterval interval,
List<Double> log2CopyRatiosPerSample,
List<Double> alternateAlleleFractionsPerSample) {
Utils.nonNull(log2CopyRatiosPerSample);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this is a private helper class, it's probably overkill to do argument checking; hopefully, you will always be passing valid arguments to this class and its methods. I would probably remove these checks. However, if you want to keep them, I'd do a non-null check of interval and non-empty checks for the lists.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, sorry, I now see that you are always constructing this with empty lists and then adding the points from each sample one by one. This is not really a good pattern for various reasons. Why not try to make this class more "immutable" and just add all of the samples to both lists upon construction? (Although it's probably overkill to actually make the lists immutable, since we can trust the outer class not to muck with them.)

Utils.nonNull(log2CopyRatiosPerSample);
Utils.nonNull(alternateAlleleFractionsPerSample);
Utils.validateArg(log2CopyRatiosPerSample.size() == alternateAlleleFractionsPerSample.size(),
"Number of copy ratio and allele fraciont samples needs to be the same.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in this message.

}

/**
* Segments the internally held {@link CopyRatioCollection} and {@link AllelicCountCollection}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this documentation. You should describe how the kernel is constructed.

final BiFunction<MultisampleMultidimensionalPoint, MultisampleMultidimensionalPoint, Double> kernel = constructKernel(
kernelVarianceCopyRatio, kernelVarianceAlleleFraction, kernelScalingAlleleFraction);

for (int i_sample=0; i_sample<this.numberOfSamples; i_sample++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will log duplicate messages for each sample, which is confusing and unnecessary. Even if you added the sample index or name to each message, this would be extremely misleading, since you log the messages well before you actually perform the action. This would unnecessarily complicate debugging, etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the this keyword is unnecessary. (However, as discussed above, ideally you will get rid of these loops and the numberOfSamples field.)

* @author Samuel Lee &lt;[email protected]&gt;
* @author Marton Kanasz-Nagy &lt;[email protected]&gt;
*/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean up white space here and elsewhere in this class.

import java.util.stream.IntStream;

/**
* @author Samuel Lee &lt;[email protected]&gt;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably a good idea to add some documentation here briefly explaining how these tests are adapted from other tests. You have some documentation below, which you can point to when appropriate, but this will be the first place someone will look when trying to figure out what this class is all about.

}

//loop over chromosomes, find changepoints, and create allele-fraction segments
final List<ArrayList<MultidimensionalSegment>> segmentsPerSample = new ArrayList<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few things. It's always better to program to an interface, i.e. specify List<List<MultidimensionalSegment>> rather than List<ArrayList<MultidimensionalSegment>>. Also, as we discussed, since the changepoints are shared across all samples, it's simpler to first find the master list of changepoints/segments and then use this to create the MultidimensionalSegmentCollection for each sample afterwards (rather than try to do both at the same time, as you do here). This should let you get rid of this intermediate list of lists (these are sometimes awkward to work with in Java) and your procedural code will be more cleanly split between the two tasks (which you could even break up into separate methods, if you wanted).

@davidbenjamin davidbenjamin removed their request for review June 6, 2019 14:45
@davidbenjamin davidbenjamin removed their assignment Jun 6, 2019
@samuelklee
Copy link
Contributor

We may revisit this work in the future. Closing for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants