Skip to content

Commit

Permalink
feat: implement "seqvars ingest" command (#199)
Browse files Browse the repository at this point in the history
  • Loading branch information
holtgrewe committed Oct 5, 2023
1 parent a188c70 commit 720fac7
Show file tree
Hide file tree
Showing 21 changed files with 1,411 additions and 0 deletions.
1 change: 1 addition & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

1 change: 1 addition & 0 deletions Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -47,6 +47,7 @@ thousands = "0.2"
tracing = "0.1"
tracing-subscriber = "0.3"
uuid = { version = "1.4", features = ["v4", "fast-rng", "serde"] }
noodles-vcf = "0.40.0"


[build-dependencies]
Expand Down
58 changes: 58 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,64 @@ $ varfish-server-worker db mk-inhouse \
[--path-input-tsvs @IN/path-list2.txt]
```

## The `seqvars ingest` Command

This command takes as the input a single VCF file from a (supported) variant caller and converts it into a file for further querying.
The command interprets the following fields which are written out by the commonly used variant callers such as GATK UnifiedGenotyper, GATK HaplotypeCaller, and Illumina Dragen.

- `FORMAT/AD` -- allelic depth, one value per allele (including reference0)
- `FORMAT/DP` -- total read coverage
- `FORMAT/GQ` -- genotype quality
- `FORMAT/GT` -- genotype
- `FORMAT/PID` -- physical phasing information as written out by GATK HaplotypeCaller in GVCF workflow
- `FORMAT/PIS` -- physical phasing information as written out by Dragen variant caller
- this field fill be written as `FORMAT/PID`
- `FORMAT/SQ` -- "somatic quality" for each alternate allele, as written out by Illumina Dragen variant caller
- this field will be written as `FORMAT/GQ`

The `seqvars ingest` command will annotate the variants with the following information:

- gnomAD genomes and exomes allele frequencies
- gnomAD-mtDNA and HelixMtDb allele frequencies
- functional annotation following the [VCF ANN field standard](https://pcingola.github.io/SnpEff/adds/VCFannotationformat_v1.0.pdf)

The command will emit one output line for each variant allele from the input and each affected gene.
That is, if two variant alleles affect two genes, four records will be written to the output file.
The annotation will be written out for one highest impact.

Overall, the command will emit the following header rows in addition to the `##contig=<ID=.,length=.>` lines.

```
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=gnomad_exomes_an,Number=1,Type=Integer,Description="Number of samples in gnomAD exomes">
##INFO=<ID=gnomad_exomes_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_exomes_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_exomes_hemi,Number=1,Type=Integer,Description="Number of hemi. alt. carriers in gnomAD exomes">
##INFO=<ID=gnomad_genomes_an,Number=1,Type=Integer,Description="Number of samples in gnomAD genomes">
##INFO=<ID=gnomad_genomes_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in gnomAD genomes">
##INFO=<ID=gnomad_genomes_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in gnomAD genomes">
##INFO=<ID=gnomad_genomes_hemi,Number=1,Type=Integer,Description="Number of hemi. alt. carriers in gnomAD genomes">
##INFO=<ID=helix_an,Number=1,Type=Integer,Description="Number of samples in HelixMtDb">
##INFO=<ID=helix_hom,Number=1,Type=Integer,Description="Number of hom. alt. carriers in HelixMtDb">
##INFO=<ID=helix_het,Number=1,Type=Integer,Description="Number of het. alt. carriers in HelixMtDb">
##INFO=<ID=ANN,Number=.,Type=String,Description="Functional annotations: 'Allele | Annotation | Annotation_Impact | Gene_Name | Gene_ID | Feature_Type | Feature_ID | Transcript_BioType | Rank | HGVS.c | HGVS.p | cDNA.pos / cDNA.length | CDS.pos / CDS.length | AA.pos / AA.length | Distance | ERRORS / WARNINGS / INFO'">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##x-varfish-version=<ID=varfish-server-worker,Version=x.y.z>
##x-varfish-version=<ID=orig-caller,Name=Dragen,Version=SW: 07.021.624.3.10.9, HW: 07.021.624>
##x-varfish-version=<ID=orig-caller,Name=GatkHaplotypeCaller,Version=4.4.0.0>
```

> [!NOTE]
> The gnomad-mtDNA information is written to the `INFO/gnomdad_genome_*` fields.
> [!NOTE]
> Future versions of the worker will annotate the worst effect on a MANE select or MANE Clinical transcript.
# Developer Information

This section is only relevant for developers of `varfish-server-worker`.
Expand Down
8 changes: 8 additions & 0 deletions src/common.rs
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,14 @@ pub struct Args {
pub verbose: Verbosity<InfoLevel>,
}

impl Default for Args {
fn default() -> Self {
Self {
verbose: Verbosity::new(0, 0),
}
}
}

/// Helper to print the current memory resident set size via `tracing`.
pub fn trace_rss_now() {
let me = procfs::process::Process::myself().unwrap();
Expand Down
23 changes: 23 additions & 0 deletions src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,6 +2,7 @@
pub mod common;
pub mod db;
pub mod seqvars;
pub mod sv;

use clap::{Args, Parser, Subcommand};
Expand Down Expand Up @@ -33,6 +34,8 @@ enum Commands {
Db(Db),
/// SV filtration related commands.
Sv(Sv),
/// Sequence variant related commands.
Seqvars(Seqvars),
}

/// Parsing of "db *" sub commands.
Expand Down Expand Up @@ -67,6 +70,21 @@ enum SvCommands {
Query(sv::query::Args),
}

/// Parsing of "seqvars *" sub commands.
#[derive(Debug, Args)]
#[command(args_conflicts_with_subcommands = true)]
struct Seqvars {
/// The sub command to run
#[command(subcommand)]
command: SeqvarsCommands,
}

/// Enum supporting the parsing of "sv *" sub commands.
#[derive(Debug, Subcommand)]
enum SeqvarsCommands {
Ingest(seqvars::ingest::Args),
}

fn main() -> Result<(), anyhow::Error> {
let cli = Cli::parse();

Expand Down Expand Up @@ -98,6 +116,11 @@ fn main() -> Result<(), anyhow::Error> {
db::to_bin::cli::run(&cli.common, args)?;
}
},
Commands::Seqvars(seqvars) => match &seqvars.command {
SeqvarsCommands::Ingest(args) => {
seqvars::ingest::run(&cli.common, args)?;
}
},
Commands::Sv(sv) => match &sv.command {
SvCommands::Query(args) => {
sv::query::run(&cli.common, args)?;
Expand Down
113 changes: 113 additions & 0 deletions src/seqvars/ingest/header.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,113 @@
use noodles_vcf as vcf;

use crate::common::GenomeRelease;

/// Enumeration for the known variant callers.
#[derive(Debug, Clone, PartialEq, Eq, serde::Deserialize, serde::Serialize)]
pub enum VariantCaller {
GatkHaplotypeCaller { version: String },
GatkUnifiedGenotyper { version: String },
Dragen { version: String },
Other,
}

impl VariantCaller {
pub fn guess(header: &vcf::Header) -> Option<Self> {
for (other, collection) in header.other_records() {
if other.as_ref().starts_with("GATKCommandLine")
|| other.as_ref().starts_with("DRAGENCommandLine")
{
use vcf::header::record::value::collection::Collection;
if let Collection::Structured(map) = collection {
for (key, values) in map.iter() {
if let ("HaplotypeCaller", Some(version)) =
(key.as_str(), values.other_fields().get("Version").cloned())
{
return Some(VariantCaller::GatkHaplotypeCaller { version });
}
if let ("UnifiedGenotyper", Some(version)) =
(key.as_str(), values.other_fields().get("Version").cloned())
{
return Some(VariantCaller::GatkUnifiedGenotyper { version });
}
if let ("dragen", Some(version)) =
(key.as_str(), values.other_fields().get("Version").cloned())
{
return Some(VariantCaller::Dragen { version });
}
}
}
}
}
None
}
}

/// Generate the output header from the input header.
pub fn build_output_header(
input_header: &vcf::Header,
genomebuild: GenomeRelease,

Check warning on line 49 in src/seqvars/ingest/header.rs

View workflow job for this annotation

GitHub Actions / clippy

unused variable: `genomebuild`

warning: unused variable: `genomebuild` --> src/seqvars/ingest/header.rs:49:5 | 49 | genomebuild: GenomeRelease, | ^^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_genomebuild`
) -> Result<vcf::Header, anyhow::Error> {
let variant_caller = VariantCaller::guess(input_header)

Check warning on line 51 in src/seqvars/ingest/header.rs

View workflow job for this annotation

GitHub Actions / clippy

unused variable: `variant_caller`

warning: unused variable: `variant_caller` --> src/seqvars/ingest/header.rs:51:9 | 51 | let variant_caller = VariantCaller::guess(input_header) | ^^^^^^^^^^^^^^ help: if this is intentional, prefix it with an underscore: `_variant_caller` | = note: `#[warn(unused_variables)]` on by default
.ok_or_else(|| anyhow::anyhow!("Unable to guess variant caller"))?;
todo!()
}

#[cfg(test)]
mod test {
use rstest::rstest;

use super::VariantCaller;

macro_rules! set_snapshot_suffix {
($($expr:expr),*) => {
let mut settings = insta::Settings::clone_current();
settings.set_snapshot_suffix(format!($($expr,)*));
let _guard = settings.bind_to_scope();
}
}

#[rstest]
#[case("tests/seqvars/ingest/example_dragen.07.021.624.3.10.4.vcf")]
#[case("tests/seqvars/ingest/example_dragen.07.021.624.3.10.9.vcf")]
#[case("tests/seqvars/ingest/example_gatk_hc.3.7-0.vcf")]
#[case("tests/seqvars/ingest/example_gatk_hc.4.4.0.0.vcf")]
fn variant_caller_guess(#[case] path: &str) -> Result<(), anyhow::Error> {
set_snapshot_suffix!("{:?}", path.split('/').last().unwrap());

let vcf_header = noodles_vcf::reader::Builder::default()
.build_from_path(path)?
.read_header()?;

insta::assert_yaml_snapshot!(VariantCaller::guess(&vcf_header));

Ok(())
}

#[rstest]
#[case("tests/seqvars/ingest/example_dragen.07.021.624.3.10.4.vcf")]
#[case("tests/seqvars/ingest/example_dragen.07.021.624.3.10.9.vcf")]
#[case("tests/seqvars/ingest/example_gatk_hc.3.7-0.vcf")]
#[case("tests/seqvars/ingest/example_gatk_hc.4.4.0.0.vcf")]
fn build_output_header(#[case] path: &str) -> Result<(), anyhow::Error> {
set_snapshot_suffix!("{:?}", path.split('/').last().unwrap());
let tmpdir = temp_testdir::TempDir::default();

let input_vcf_header = noodles_vcf::reader::Builder::default()
.build_from_path(path)?
.read_header()?;
let output_vcf_header =
super::build_output_header(&input_vcf_header, crate::common::GenomeRelease::Grch37)?;

let out_path = tmpdir.join("out.vcf");
let out_path_str = out_path.to_str().expect("invalid path");
{
noodles_vcf::writer::Writer::new(std::fs::File::create(out_path_str)?)
.write_header(&input_vcf_header)?;
}

insta::assert_snapshot!(std::fs::read_to_string(out_path_str)?);

Ok(())
}
}
103 changes: 103 additions & 0 deletions src/seqvars/ingest/mod.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@
//! Implementation of `seqvars ingest` subcommand.
use crate::common::{self, GenomeRelease};
use noodles_vcf as vcf;

pub mod header;

/// Command line arguments for `seqvars ingest` subcommand.
#[derive(Debug, clap::Parser)]
#[command(author, version, about = "ingest sequence variant VCF", long_about = None)]
pub struct Args {
/// The assumed genome build.
#[clap(long)]
pub genomebuild: GenomeRelease,
/// Path to input file.
#[clap(long)]
pub path_in: String,
/// Path to output file.
#[clap(long)]
pub path_out: String,
}

/// Main entry point for `seqvars ingest` sub command.
pub fn run(args_common: &crate::common::Args, args: &Args) -> Result<(), anyhow::Error> {
let before_anything = std::time::Instant::now();
tracing::info!("args_common = {:#?}", &args_common);
tracing::info!("args = {:#?}", &args);

common::trace_rss_now();

tracing::info!("opening input file...");
let mut input_reader = {
let file = std::fs::File::open(&args.path_in)
.map_err(|e| anyhow::anyhow!("could not open input file {}: {}", &args.path_in, e))
.map(std::io::BufReader::new)?;
vcf::reader::Builder::default()

Check warning on line 36 in src/seqvars/ingest/mod.rs

View workflow job for this annotation

GitHub Actions / clippy

use of `default` to create a unit struct

warning: use of `default` to create a unit struct --> src/seqvars/ingest/mod.rs:36:29 | 36 | vcf::reader::Builder::default() | ^^^^^^^^^^^ help: remove this call to `default` | = help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#default_constructed_unit_structs = note: `#[warn(clippy::default_constructed_unit_structs)]` on by default
.build_from_reader(file)
.map_err(|e| anyhow::anyhow!("could not build VCF reader: {}", e))?
};

tracing::info!("analyzing header...");
let input_header = input_reader
.read_header()
.map_err(|e| anyhow::anyhow!("problem reading VCF header: {}", e))?;
let output_header = header::build_output_header(&input_header, args.genomebuild)?;

let mut output_writer = {
let writer = std::fs::File::create(&args.path_out).map_err(|e| {
anyhow::anyhow!(
"could not output file for writing {}: {}",
&args.path_out,
e
)
})?;
let writer = std::io::BufWriter::new(writer);
vcf::writer::Writer::new(writer)
};
output_writer.write_header(&output_header)?;

tracing::info!(
"All of `seqvars ingest` completed in {:?}",
before_anything.elapsed()
);
Ok(())
}

#[cfg(test)]
mod test {
use rstest::rstest;

use crate::common::GenomeRelease;

macro_rules! set_snapshot_suffix {
($($expr:expr),*) => {
let mut settings = insta::Settings::clone_current();
settings.set_snapshot_suffix(format!($($expr,)*));
let _guard = settings.bind_to_scope();
}
}

#[rstest]
#[case("tests/seqvars/ingest/example_dragen.07.021.624.3.10.4.vcf")]
#[case("tests/seqvars/ingest/example_dragen.07.021.624.3.10.9.vcf")]
#[case("tests/seqvars/ingest/example_gatk_hc.3.7-0.vcf")]
#[case("tests/seqvars/ingest/example_gatk_hc.4.4.0.0.vcf")]
fn smoke_test_run(#[case] path: &str) {
set_snapshot_suffix!("{:?}", path.split('/').last().unwrap().replace(".", "_"));

let tmpdir = temp_testdir::TempDir::default();

let args_common = Default::default();
let args = super::Args {
genomebuild: GenomeRelease::Grch37,
path_in: path.into(),
path_out: tmpdir
.join("out.vcf")
.to_str()
.expect("invalid path")
.into(),
};
super::run(&args_common, &args).unwrap();
}
}
Loading

0 comments on commit 720fac7

Please sign in to comment.