-
-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add DownsampleVcf tool #974
base: main
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #974 +/- ##
==========================================
+ Coverage 95.62% 95.64% +0.01%
==========================================
Files 126 127 +1
Lines 7364 7457 +93
Branches 506 521 +15
==========================================
+ Hits 7042 7132 +90
- Misses 322 325 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, lots of little stuff to add sufficient polish for users
|
||
def hasNext: Boolean = iter.hasNext | ||
|
||
def isInOrder(current: Variant, next: Variant, currentIndex: Int, nextIndex: Int): Boolean = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def isInOrder(current: Variant, next: Variant, currentIndex: Int, nextIndex: Int): Boolean = { | |
private def isInOrder(current: Variant, next: Variant, currentIndex: Int, nextIndex: Int): Boolean = { |
/** Removes variants that are within a specified distance from a previous variant | ||
* The end position of the current variant is compared with the start position of the following variant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/** Removes variants that are within a specified distance from a previous variant | |
* The end position of the current variant is compared with the start position of the following variant | |
/** Removes variants that are within a specified distance from a previous variant. | |
* The end position of the current variant is compared with the start position of the following variant. |
} | ||
} | ||
|
||
/** Downsamples variants using Allele Depths |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/** Downsamples variants using Allele Depths | |
/** Downsamples variants by randomly sampling the total allele depths at the given proportion. |
* calculated using total base count from the index and a target base count | ||
* @return a new IndexedSeq of allele depths of the same length as `oldAds` | ||
*/ | ||
def downsampleADs(oldAds: IndexedSeq[Int], proportion: Double, random: Random): IndexedSeq[Int] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Function parameters should be the most general type possible. You could change this to IterableOnce[Int]
and then below add .toIndexedSeq
.
} | ||
|
||
/** | ||
* Does the downsampling on a Variant |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Does the downsampling on a Variant | |
* Re-genotypes a variant for each sample after downsampling the allele counts based on the given per-sample proportions. |
|
||
val random = new Random(42) | ||
winnowed.foreach { v => | ||
val ds = downsampleAndRegenotype(v, proportions = proportions, random = random, epsilon = epsilon) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no spaces between equals to be consistent with the rest of the codebase
if (writeNoCall) { | ||
outputVcf += ds | ||
progress.record(ds) | ||
} | ||
else if (!ds.gts.forall(g => g.isNoCall)) { | ||
outputVcf += ds | ||
progress.record(ds) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if (writeNoCall) { | |
outputVcf += ds | |
progress.record(ds) | |
} | |
else if (!ds.gts.forall(g => g.isNoCall)) { | |
outputVcf += ds | |
progress.record(ds) | |
} | |
} | |
if (writeNoCall || !ds.gts.forall(g => g.isNoCall)) { | |
outputVcf += ds | |
progress.record(ds) | |
} | |
} |
} | ||
} | ||
|
||
progress.logLast() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
probably nice to log how many reads were read, how many remained after winnowing, and how many were written. Consider three progress loggers?
outputVcf.close() | ||
} | ||
|
||
def buildOutputHeader(in: VcfHeader): VcfHeader = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private?
object Sample { | ||
/** Load a set of samples from the 1KG metadata file. */ | ||
def read(path: FilePath): Seq[Sample] = { | ||
val lines = Io.readLines(path).dropWhile(_.startsWith("##")).map(line => line.dropWhile(_ == '#')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This behavior is not described in the public API. Consider making this object and function private
- [fgbio](#fgbio) | ||
- [Goals](#goals) | ||
- [Overview](#overview) | ||
- [List of tools](#list-of-tools) | ||
- [Building](#building) | ||
- [Cloning the Repository](#cloning-the-repository) | ||
- [Running the build](#running-the-build) | ||
- [Command line](#command-line) | ||
- [Include fgbio in your project](#include-fgbio-in-your-project) | ||
- [Contributing](#contributing) | ||
- [Authors](#authors) | ||
- [License](#license) | ||
- [Sponsorship](#sponsorship) | ||
- [Become a sponsor](#become-a-sponsor) | ||
- [Sponsors](#sponsors) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GitHub has TOC support so we could consider removing this maintenance burden.
https://github.blog/changelog/2021-04-13-table-of-contents-support-in-markdown-files/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
VS code also automatically updates the TOC. I don't care either way.
Add support for multi-allelic variants
Adds a tool to downsample VCFs based on allele depths. Currently only handles biallelic variants.