feat: transform "db to-bin" to "strucvars txt-to-bin" (#218) (#219)
holtgrewe authored Oct 10, 2023
1 parent b51f027 commit ea6e387
Showing 27 changed files with 121 additions and 233 deletions.
74 changes: 37 additions & 37 deletions README.md
@@ -14,15 +14,15 @@ They are written in the Rust programming language to speed up the execution of c
At the moment, the following sub commands exist:

- `db` -- subcommands to build binary (protobuf) database files
- `db to-bin` -- convert text files downloaded by [varfish-db-downloader](https://github.com/bihealth/varfish-db-downloader/) to binary for fast use in query sub commands
- `db mk-inhouse` -- compile per-case structural variant into an in-house database previously created by `db compile`
- `seqvars` -- subcommands for processing sequence (aka small/SNV/indel) variants
- `seqvars ingest` -- convert single VCF file into internal format for use with `seqvars query`
- `seqvars query` -- perform sequence variant filtration and on-the-fly annotation
- `seqvars prefilter` -- limit the result of `seqvars ingest` by population frequency and/or distance to exon
- `seqvars aggregate` -- read through multiple VCF files written by `seqvars ingest` and compute a carrier counts table
- `seqvars query` -- perform sequence variant filtration and on-the-fly annotation
- `strucvars` -- subcommands for processing structural variants (aka large variants, CNVs, etc.)
- `strucvars ingest` -- convert one or more structural variant files for use with `strucvars query`
- `strucvars aggregate` -- compile per-case structural variants into an in-house database, to be converted to `.bin` with `strucvars txt-to-bin`
- `strucvars txt-to-bin` -- convert text files downloaded by [varfish-db-downloader](https://github.com/bihealth/varfish-db-downloader/) to binary for fast use in `strucvars query` commands
- `strucvars query` -- perform structural variant filtration and on-the-fly annotation

## Overall Design
@@ -39,40 +39,6 @@ The worker will create a result file that can be directly imported by the server

Future versions may provide persistently running HTTP/REST servers that provide functionality without startup cost.

## The `db to-bin` Command

Convert output of [varfish-db-downloader](https://github.com/bihealth/varfish-db-downloader/) to a directory with databases to be used by query commands such as `strucvars query`.

```
$ varfish-server-worker db to-bin \
--input-type {ClinvarSv,StrucvarInhouse,...} \
--path-input IN.txt \
--path-output-bin DST.bin
```

## The `db mk-inhouse` Command

Import multiple files created by `strucvars ingest` into a database previously created by `db compile`.
You can specify the files individually.
Paths starting with an at (`@`) character are interpreted as files with lists of paths.
You can mix paths with `@` and without.

```
$ varfish-server-worker db mk-inhouse \
--genome-release {Grch37,Grch38} \
--path-output-tsv OUT.tsv \
--path-input-tsvs IN/file1.gts.tsv.gz \
[--path-input-tsv IN/file1.gts.tsv.gz] \
# OR:
$ varfish-server-worker db mk-inhouse \
--genome-release {Grch37,Grch38} \
--path-output-tsv OUT.tsv \
--path-input-tsvs @IN/path-list.txt \
[--path-input-tsvs @IN/path-list2.txt]
```

## The `seqvars ingest` Command

This command takes as the input a single VCF file from a (supported) variant caller and converts it into a file for further querying.
@@ -286,6 +252,40 @@ Overall, the command will emit the following header rows in addition to the `##c
> It only merges the input VCF files from multiple callers (all files must have the same samples) and converts them into the internal format.
> The `INFO/annsv` field is filled by `strucvars query`.
## The `strucvars aggregate` Command

Import multiple files created by `strucvars ingest` into a database that can be converted to `.bin` with `strucvars txt-to-bin` and then used by `strucvars query`.
You can specify the files individually.
Paths starting with an at (`@`) character are interpreted as files with lists of paths.
You can mix paths with `@` and without.

```
$ varfish-server-worker strucvars aggregate \
--genome-release {Grch37,Grch38} \
--path-output OUT.tsv \
--path-input IN/file1.vcf.gz \
[--path-input IN/file2.vcf.gz]
# OR:
$ varfish-server-worker strucvars aggregate \
--genome-release {Grch37,Grch38} \
--path-output OUT.tsv \
--path-input @IN/path-list.txt \
[--path-input @IN/path-list2.txt]
```
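The `@` convention described above can be sketched in a few lines of Rust. This is a minimal illustration and not the code used by `varfish-server-worker`: the helper name `expand_path_args` is hypothetical, and it assumes that a list file holds one path per line and that blank lines are skipped.

```rust
use std::fs;
use std::io;

/// Expand path arguments: a leading `@` marks a file that lists paths,
/// anything else is taken as a path itself.
///
/// NOTE: hypothetical helper for illustration; the one-path-per-line format
/// and the skipping of blank lines are assumptions.
pub fn expand_path_args(args: &[String]) -> io::Result<Vec<String>> {
    let mut paths = Vec::new();
    for arg in args {
        if let Some(list_file) = arg.strip_prefix('@') {
            // Read the list file and collect every non-empty line.
            let content = fs::read_to_string(list_file)?;
            for line in content.lines() {
                let line = line.trim();
                if !line.is_empty() {
                    paths.push(line.to_string());
                }
            }
        } else {
            // Plain argument: use it as a path directly.
            paths.push(arg.clone());
        }
    }
    Ok(paths)
}
```

Because each argument is handled on its own, plain paths and `@` list files can be mixed in any order, and the resulting paths keep the order in which they were given.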

## The `strucvars txt-to-bin` Command

Convert output of [varfish-db-downloader](https://github.com/bihealth/varfish-db-downloader/) to a directory with databases to be used by query commands such as `strucvars query`.

```
$ varfish-server-worker strucvars txt-to-bin \
--input-type {ClinvarSv,StrucvarInhouse,...} \
--path-input IN.txt \
--path-output-bin DST.bin
```

# Developer Information

This section is only relevant for developers of `varfish-server-worker`.
46 changes: 46 additions & 0 deletions src/common.rs
@@ -511,6 +511,52 @@ where
Ok(buffer)
}

/// Enum for the supported gene/transcript databases.
#[derive(
serde::Serialize,
serde::Deserialize,
enum_map::Enum,
PartialEq,
Eq,
Clone,
Copy,
Debug,
Default,
strum::EnumString,
strum::Display,
)]
#[serde(rename_all = "lowercase")]
#[strum(serialize_all = "lowercase")]
pub enum Database {
/// RefSeq
#[default]
RefSeq,
/// ENSEMBL
Ensembl,
}

/// Enum for the supported TADs.
#[derive(
serde::Serialize,
serde::Deserialize,
enum_map::Enum,
PartialEq,
Eq,
Clone,
Copy,
Debug,
Default,
strum::EnumString,
strum::Display,
)]
#[serde(rename_all = "lowercase")]
#[strum(serialize_all = "lowercase")]
pub enum TadSet {
/// hESC
#[default]
Hesc,
}

#[cfg(test)]
mod test {
use noodles_vcf as vcf;
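
The `Database` and `TadSet` enums added to `src/common.rs` derive `strum::EnumString` and `strum::Display` with `serialize_all = "lowercase"`, so `Database::RefSeq` should be written and parsed as `refseq`. The sketch below spells out that behavior by hand using only the standard library. It is an illustration under assumptions, not the code that strum generates: the error type and the case-sensitive matching are choices made for this example.

```rust
use std::fmt;
use std::str::FromStr;

/// Stand-in for the `Database` enum from the diff, without the derives.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub enum Database {
    #[default]
    RefSeq,
    Ensembl,
}

impl fmt::Display for Database {
    /// Write the all-lowercase variant name.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let name = match self {
            Database::RefSeq => "refseq",
            Database::Ensembl => "ensembl",
        };
        f.write_str(name)
    }
}

impl FromStr for Database {
    // Assumption: a plain `String` error; strum uses its own error type.
    type Err = String;

    /// Parse the all-lowercase variant name (case-sensitive in this sketch).
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "refseq" => Ok(Database::RefSeq),
            "ensembl" => Ok(Database::Ensembl),
            other => Err(format!("unknown database: {}", other)),
        }
    }
}
```

With this in place, `Database::default().to_string()` and `"ensembl".parse::<Database>()` round-trip between the enum and its lowercase text form, which is the form that the `serde` and `strum` attributes in the diff request.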
129 changes: 0 additions & 129 deletions src/db/conf.rs

This file was deleted.

5 changes: 0 additions & 5 deletions src/db/mod.rs

This file was deleted.

28 changes: 4 additions & 24 deletions src/main.rs
@@ -1,7 +1,6 @@
//! VarFish Server Worker main executable
pub mod common;
pub mod db;
pub mod seqvars;
pub mod strucvars;

@@ -30,30 +29,12 @@ struct Cli {
#[allow(clippy::large_enum_variant)]
#[derive(Debug, Subcommand)]
enum Commands {
/// Database building related commands.
Db(Db),
/// Structural variant related commands.
Strucvars(Strucvars),
/// Sequence variant related commands.
Seqvars(Seqvars),
}

/// Parsing of "db *" sub commands.
#[derive(Debug, Args)]
#[command(args_conflicts_with_subcommands = true)]
struct Db {
/// The sub command to run
#[command(subcommand)]
command: DbCommands,
}

/// Enum supporting the parsing of "db *" sub commands.
#[allow(clippy::large_enum_variant)]
#[derive(Debug, Subcommand)]
enum DbCommands {
ToBin(db::to_bin::cli::Args),
}

/// Parsing of "strucvars *" sub commands.
#[derive(Debug, Args)]
#[command(args_conflicts_with_subcommands = true)]
@@ -69,6 +50,7 @@ enum StrucvarsCommands {
Aggregate(strucvars::aggregate::cli::Args),
Ingest(strucvars::ingest::Args),
Query(strucvars::query::Args),
TxtToBin(strucvars::txt_to_bin::cli::Args),
}

/// Parsing of "seqvars *" sub commands.
@@ -111,11 +93,6 @@ fn main() -> Result<(), anyhow::Error> {
let term = Term::stderr();
tracing::subscriber::with_default(collector, || {
match &cli.command {
Commands::Db(db) => match &db.command {
DbCommands::ToBin(args) => {
db::to_bin::cli::run(&cli.common, args)?;
}
},
Commands::Seqvars(seqvars) => match &seqvars.command {
SeqvarsCommands::Aggregate(args) => {
seqvars::aggregate::run(&cli.common, args)?;
@@ -137,6 +114,9 @@ fn main() -> Result<(), anyhow::Error> {
StrucvarsCommands::Query(args) => {
strucvars::query::run(&cli.common, args)?;
}
StrucvarsCommands::TxtToBin(args) => {
strucvars::txt_to_bin::cli::run(&cli.common, args)?;
}
},
}
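
The net effect of the `src/main.rs` changes is that the top-level `db` group disappears and `TxtToBin` becomes one more variant of the `strucvars` group. The two-level dispatch can be sketched without `clap` as a nested `match`. The enum names mirror those in the diff, but the variants carry no arguments here, the `SeqvarsCommands` variant list is assumed from the README, and the `route` function is hypothetical.

```rust
/// Second-level commands below `strucvars` (argument structs omitted).
#[derive(Debug)]
pub enum StrucvarsCommands {
    Aggregate,
    Ingest,
    Query,
    TxtToBin,
}

/// Second-level commands below `seqvars` (variant list assumed from the README).
#[derive(Debug)]
pub enum SeqvarsCommands {
    Aggregate,
    Ingest,
    Prefilter,
    Query,
}

/// Top-level commands after the change: there is no `Db` variant any more.
#[derive(Debug)]
pub enum Commands {
    Strucvars(StrucvarsCommands),
    Seqvars(SeqvarsCommands),
}

/// Hypothetical router: return the command line name that a parsed command
/// corresponds to. The real `main` calls the matching `run` function instead.
pub fn route(command: &Commands) -> &'static str {
    match command {
        Commands::Strucvars(sub) => match sub {
            StrucvarsCommands::Aggregate => "strucvars aggregate",
            StrucvarsCommands::Ingest => "strucvars ingest",
            StrucvarsCommands::Query => "strucvars query",
            StrucvarsCommands::TxtToBin => "strucvars txt-to-bin",
        },
        Commands::Seqvars(sub) => match sub {
            SeqvarsCommands::Aggregate => "seqvars aggregate",
            SeqvarsCommands::Ingest => "seqvars ingest",
            SeqvarsCommands::Prefilter => "seqvars prefilter",
            SeqvarsCommands::Query => "seqvars query",
        },
    }
}
```

Since both `match` expressions are exhaustive, adding a variant such as `TxtToBin` without a corresponding arm is a compile error, which is why the diff touches the enum and the dispatch in `main` together.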

2 changes: 2 additions & 0 deletions src/strucvars/mod.rs
@@ -1,3 +1,5 @@
pub mod aggregate;
pub mod ingest;
pub mod pbs;
pub mod query;
pub mod txt_to_bin;
File renamed without changes.
4 changes: 2 additions & 2 deletions src/strucvars/query/bgdbs.rs
@@ -10,8 +10,8 @@ use strum_macros::{Display, EnumString};
use tracing::info;

use crate::{
common::{trace_rss_now, CHROMS},
db::{conf::GenomeRelease, pbs},
common::{trace_rss_now, GenomeRelease, CHROMS},
strucvars::pbs,
};

use super::{
5 changes: 1 addition & 4 deletions src/strucvars/query/clinvar.rs
@@ -6,10 +6,7 @@ use prost::Message;
use thousands::Separable;
use tracing::{info, warn};

use crate::{
common::{reciprocal_overlap, CHROMS},
db::conf::GenomeRelease,
};
use crate::common::{reciprocal_overlap, GenomeRelease, CHROMS};

use super::{
records::ChromRange,