-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multisample segmentation, as an extension of ModelSegments #5524
Conversation
…gments. Also wrote a number of unit tests for it.
@MartonKN thanks for turning this around quickly. I'll review today now that I'm back from vacation. |
The title of this PR is not really accurate, as you even point out in your first comment. Can you change it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a look at the code for the backend class and skimmed the test code. There are some stylistic changes that I think are worth making throughout before I continue the review. While you make these changes, I'll take a more detailed look at your test code.
import java.util.stream.Collectors; | ||
|
||
/** | ||
* Segments copy-ratio and alternate-allele-fraction data using kernel segmentation. Segments do not span chromosomes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This documentation appears to be unchanged from the single-sample MultidimensionalKernelSegmenter
.
private final SimpleInterval interval; | ||
private final List<Double> log2CopyRatiosPerSample; | ||
private final List<Double> alternateAlleleFractionsPerSample; | ||
private int nSamples; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be final and named numSamples
(which is more in line with the naming conventions used elsewhere in the CNV code).
private int nSamples; | ||
|
||
MultisampleMultidimensionalPoint(final SimpleInterval interval, | ||
List<Double> log2CopyRatiosPerSample, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both of these parameters should be final.
MultisampleMultidimensionalPoint(final SimpleInterval interval, | ||
List<Double> log2CopyRatiosPerSample, | ||
List<Double> alternateAlleleFractionsPerSample) { | ||
Utils.nonNull(log2CopyRatiosPerSample); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is a private helper class, it's probably overkill to do argument checking; hopefully, you will always be passing valid arguments to this class and its methods. I would probably remove these checks. However, if you want to keep them, I'd do a non-null check of interval
and non-empty checks for the lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, sorry, I now see that you are always constructing this with empty lists and then adding the points from each sample one by one. This is not really a good pattern for various reasons. Why not try to make this class more "immutable" and just add all of the samples to both lists upon construction? (Although it's probably overkill to actually make the lists immutable, since we can trust the outer class not to muck with them.)
Utils.nonNull(log2CopyRatiosPerSample); | ||
Utils.nonNull(alternateAlleleFractionsPerSample); | ||
Utils.validateArg(log2CopyRatiosPerSample.size() == alternateAlleleFractionsPerSample.size(), | ||
"Number of copy ratio and allele fraciont samples needs to be the same."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in this message.
} | ||
|
||
/** | ||
* Segments the internally held {@link CopyRatioCollection} and {@link AllelicCountCollection} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update this documentation. You should describe how the kernel is constructed.
final BiFunction<MultisampleMultidimensionalPoint, MultisampleMultidimensionalPoint, Double> kernel = constructKernel( | ||
kernelVarianceCopyRatio, kernelVarianceAlleleFraction, kernelScalingAlleleFraction); | ||
|
||
for (int i_sample=0; i_sample<this.numberOfSamples; i_sample++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will log duplicate messages for each sample, which is confusing and unnecessary. Even if you added the sample index or name to each message, this would be extremely misleading, since you log the messages well before you actually perform the action. This would unnecessarily complicate debugging, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, the this
keyword is unnecessary. (However, as discussed above, ideally you will get rid of these loops and the numberOfSamples
field.)
* @author Samuel Lee <[email protected]> | ||
* @author Marton Kanasz-Nagy <[email protected]> | ||
*/ | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clean up white space here and elsewhere in this class.
import java.util.stream.IntStream; | ||
|
||
/** | ||
* @author Samuel Lee <[email protected]> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably a good idea to add some documentation here briefly explaining how these tests are adapted from other tests. You have some documentation below, which you can point to when appropriate, but this will be the first place someone will look when trying to figure out what this class is all about.
} | ||
|
||
//loop over chromosomes, find changepoints, and create allele-fraction segments | ||
final List<ArrayList<MultidimensionalSegment>> segmentsPerSample = new ArrayList<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things. It's always better to program to an interface, i.e. specify List<List<MultidimensionalSegment>>
rather than List<ArrayList<MultidimensionalSegment>>
. Also, as we discussed, since the changepoints are shared across all samples, it's simpler to first find the master list of changepoints/segments and then use this to create the MultidimensionalSegmentCollection
for each sample afterwards (rather than try to do both at the same time, as you do here). This should let you get rid of this intermediate list of lists (these are sometimes awkward to work with in Java) and your procedural code will be more cleanly split between the two tasks (which you could even break up into separate methods, if you wanted).
We may revisit this work in the future. Closing for now. |
ModelSegments can currently deal with single sample segmentation. This branch contains the backend class (and the corresponding unit tests) that is able to segment based on multiple data samples. The updated version of the front-end class ModelSegments will be addressed in a different branch.