Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DownsampleVcf tool #974

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
Open

Add DownsampleVcf tool #974

wants to merge 11 commits into from

Conversation

TedBrookings
Copy link

Adds a tool to downsample VCFs based on allele depths. Currently only handles biallelic variants.

@TedBrookings TedBrookings requested a review from nh13 March 29, 2024 21:03
Copy link

codecov bot commented Mar 29, 2024

Codecov Report

Attention: Patch coverage is 95.69892% with 4 lines in your changes are missing coverage. Please review.

Project coverage is 95.64%. Comparing base (3a74fd2) to head (9a2d510).

❗ Current head 9a2d510 differs from pull request most recent head 75de344. Consider uploading reports for the commit 75de344 to get more accurate results

Files Patch % Lines
.../scala/com/fulcrumgenomics/vcf/DownsampleVcf.scala 95.69% 4 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #974      +/-   ##
==========================================
+ Coverage   95.62%   95.64%   +0.01%     
==========================================
  Files         126      127       +1     
  Lines        7364     7457      +93     
  Branches      506      521      +15     
==========================================
+ Hits         7042     7132      +90     
- Misses        322      325       +3     
Flag Coverage Δ
unittests 95.64% <95.69%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Member

@nh13 nh13 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good, lots of little stuff to add sufficient polish for users


def hasNext: Boolean = iter.hasNext

def isInOrder(current: Variant, next: Variant, currentIndex: Int, nextIndex: Int): Boolean = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def isInOrder(current: Variant, next: Variant, currentIndex: Int, nextIndex: Int): Boolean = {
private def isInOrder(current: Variant, next: Variant, currentIndex: Int, nextIndex: Int): Boolean = {

Comment on lines 18 to 19
/** Removes variants that are within a specified distance from a previous variant
* The end position of the current variant is compared with the start position of the following variant
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/** Removes variants that are within a specified distance from a previous variant
* The end position of the current variant is compared with the start position of the following variant
/** Removes variants that are within a specified distance from a previous variant.
* The end position of the current variant is compared with the start position of the following variant.

}
}

/** Downsamples variants using Allele Depths
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/** Downsamples variants using Allele Depths
/** Downsamples variants by randomly sampling the total allele depths at the given proportion.

* calculated using total base count from the index and a target base count
* @return a new IndexedSeq of allele depths of the same length as `oldAds`
*/
def downsampleADs(oldAds: IndexedSeq[Int], proportion: Double, random: Random): IndexedSeq[Int] = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Function parameters should be the most general type possible. You could change this to IterableOnce[Int] and then below add .toIndexedSeq.

}

/**
* Does the downsampling on a Variant
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Does the downsampling on a Variant
* Re-genotypes a variant for each sample after downsampling the allele counts based on the given per-sample proportions.


val random = new Random(42)
winnowed.foreach { v =>
val ds = downsampleAndRegenotype(v, proportions = proportions, random = random, epsilon = epsilon)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no spaces between equals to be consistent with the rest of the codebase

Comment on lines 217 to 225
if (writeNoCall) {
outputVcf += ds
progress.record(ds)
}
else if (!ds.gts.forall(g => g.isNoCall)) {
outputVcf += ds
progress.record(ds)
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (writeNoCall) {
outputVcf += ds
progress.record(ds)
}
else if (!ds.gts.forall(g => g.isNoCall)) {
outputVcf += ds
progress.record(ds)
}
}
if (writeNoCall || !ds.gts.forall(g => g.isNoCall)) {
outputVcf += ds
progress.record(ds)
}
}

}
}

progress.logLast()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably nice to log how many reads were read, how many remained after winnowing, and how many were written. Consider three progress loggers?

outputVcf.close()
}

def buildOutputHeader(in: VcfHeader): VcfHeader = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

private?

object Sample {
/** Load a set of samples from the 1KG metadata file. */
def read(path: FilePath): Seq[Sample] = {
val lines = Io.readLines(path).dropWhile(_.startsWith("##")).map(line => line.dropWhile(_ == '#'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This behavior is not described in the public API. Consider making this object and function private

@jdidion jdidion requested a review from nh13 April 3, 2024 00:17
Comment on lines +17 to +31
- [fgbio](#fgbio)
- [Goals](#goals)
- [Overview](#overview)
- [List of tools](#list-of-tools)
- [Building](#building)
- [Cloning the Repository](#cloning-the-repository)
- [Running the build](#running-the-build)
- [Command line](#command-line)
- [Include fgbio in your project](#include-fgbio-in-your-project)
- [Contributing](#contributing)
- [Authors](#authors)
- [License](#license)
- [Sponsorship](#sponsorship)
- [Become a sponsor](#become-a-sponsor)
- [Sponsors](#sponsors)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub has TOC support so we could consider removing this maintenance burden.

https://github.blog/changelog/2021-04-13-table-of-contents-support-in-markdown-files/

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VS code also automatically updates the TOC. I don't care either way.

@jdidion jdidion requested review from jdidion and clintval April 4, 2024 17:15
@jdidion jdidion removed their request for review April 5, 2024 18:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants