In order to start using V-pipe, you need to provide three things:
- Samples in a specific directory structure
- (optional) TSV file listing the samples
- Configuration file
The utils subdirectory provides tools that can assist in importing sample files and structuring them.
The V-pipe workflow is customized using a structured configuration file called config.yaml, config.json or, for backward compatibility, vpipe.config (INI-like format).
This configuration file is a text file written using a basic structure composed of sections, properties and values. When using the YAML or JSON format, use these languages' associative arrays/dictionaries in two levels for sections and properties. When using the older INI format, sections are expected in square brackets, and properties are followed by their corresponding values.
Furthermore, it is possible to specify additional options on the command line, using Snakemake's --configfile to pass additional YAML/JSON configuration files, and/or using Snakemake's --config to pass sections and properties in a YAML Flow style/JSON syntax.
Here is an example of config.yaml:
general:
  virus_base_config: hiv

input:
  datadir: samples
  samples_file: config/samples.tsv

output:
  datadir: results
  snv: true
  local: true
  global: false
  visualization: true
  QA: true
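Since the same two-level structure of sections and properties applies to the JSON format, the example above could also be generated programmatically. The following is a minimal Python sketch using only the standard library; the helper name `build_config` is hypothetical and not part of V-pipe.

```python
import json


def build_config():
    """Return the example configuration as a two-level dict:
    section -> property -> value (hypothetical helper, not a V-pipe API)."""
    return {
        "general": {"virus_base_config": "hiv"},
        "input": {"datadir": "samples", "samples_file": "config/samples.tsv"},
        "output": {
            "datadir": "results",
            "snv": True,
            "local": True,
            "global": False,
            "visualization": True,
            "QA": True,
        },
    }


# Serialize the nested dictionaries; this text could be saved as config.json.
config_json = json.dumps(build_config(), indent=2)
print(config_json)
```

Python booleans are serialized as JSON `true`/`false`, which matches the values used in the YAML example.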
At minimum, a valid configuration MUST provide a reference sequence against which to align the short reads from the raw data. This can be done in several ways:
- by using a virus base config that will provide default presets for specific viruses
- by directly passing a reference .fasta file in the section input -> property reference that will override the default
We provide virus-specific base configuration files which contain handy defaults for some viruses.
Currently, the following virus base configs are available:
- hiv: provides HXB2 as a reference sequence for HIV, and sets the default aligner to ngshmmalign.
- sars-cov-2: provides NC_045512.2 as a reference sequence for SARS-CoV-2, sets the default aligner to bwa and sets the variant calling to be done against the reference instead of the cohort's consensus. In addition, a look-up for the recent versions of ARTIC protocol is provided; this makes it possible to set per-sample protocol in the sample table, and to turn on amplicon trimming (see amplicon protocols).
More information about all the available configuration options and an exhaustive list can be found in config.html or online.
You can re-use your old configuration from a legacy V-pipe v1.x/2.x installation or from the sars-cov2 branch, provided you keep in mind the following caveats:
- The older INI-like syntax is still supported for a vpipe.config configuration file.
  - This configuration will be overridden by config.yaml or config.json; you might want to delete those files from your working directory if you are not using them.
- V-pipe starting from version 2.99.1 follows the Standardized usage rules of the Snakemake Workflow Catalog.
  - This defines a newer directory structure:
    - the samples TSV table is now expected to be in config/samples.tsv (use the section input -> property samples_file to override).
    - the per-sample output isn't written in the same samples/ directory as the input anymore, but in a separate directory called results/ (use the section output -> property datadir to override).
    - the cohort-wide output isn't written in a different variants/ directory anymore, but at the base of the output datadir, i.e. by default in results/ (use the section output -> property cohortdir to specify a different path relative to the output datadir).
  - Add the following sections and properties to your vpipe.config configuration file to bring back the legacy behaviour:
[input]
datadir=samples
samples_file=samples.tsv
[output]
datadir=samples
cohortdir=../variants
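To illustrate how the INI-like sections and properties map onto the same two-level structure used by the YAML and JSON formats, here is a minimal Python sketch based on the standard `configparser` module. It is not V-pipe's own parser, and the helper name `ini_to_sections` is hypothetical.

```python
import configparser

# The legacy fragment shown above, embedded as a string for the example.
LEGACY_INI = """\
[input]
datadir=samples
samples_file=samples.tsv

[output]
datadir=samples
cohortdir=../variants
"""


def ini_to_sections(text):
    """Parse INI text into a dict of section -> property -> value."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep property names case-sensitive
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


legacy = ini_to_sections(LEGACY_INI)
print(legacy["output"]["cohortdir"])
```

Note that `configparser` returns every value as a string, so any booleans or numbers would need to be converted explicitly.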
As of version 2.99.1, only the analysis of viral sequencing data has been extensively tested and is guaranteed stable. For other more advanced functionality you might want to wait until a future release.
The samples TSV file lists the samples' unique identifiers and dates as tab-separated values.
Example: here, we have two samples from patient 1 and one sample from patient 2:
patient1 20100113
patient1 20110202
patient2 20081130
By default, V-pipe searches for a file named config/samples.tsv; if this file does not exist, a list of samples is built by searching the contents of the input datadir.
The samples' read length is used for critical steps of the pipeline (e.g., quality filtering). There are several ways to set its value:
- by default, V-pipe expects a read length of 250bp
- this default can be globally overridden in the configuration file in section input -> property read_length:
  input:
    read_length: 150
- the samples TSV file can contain an optional third column specifying the read length. This is particularly useful when samples are sequenced using protocols with different read lengths.
  patient1 20100113 150
  patient1 20110202 200
  patient2 20081130 150
The utils subdirectory contains mass-importer tools that can generate this third column while importing samples.
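The fallback logic described above (per-sample third column, otherwise a default value) can be sketched in Python as follows. This is only an illustration using the standard `csv` module, not V-pipe's actual implementation; the function name `read_samples` is hypothetical, and the default is passed as a parameter to stand in for the read_length property.

```python
import csv
import io

DEFAULT_READ_LENGTH = 250  # the default read length stated above


def read_samples(handle, default_read_length=DEFAULT_READ_LENGTH):
    """Parse a tab-separated samples table into (id, date, read_length) tuples.

    The third column is optional; rows without it get the default value.
    """
    samples = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        sample_id, date = row[0], row[1]
        if len(row) > 2 and row[2].strip():
            read_length = int(row[2])
        else:
            read_length = default_read_length
        samples.append((sample_id, date, read_length))
    return samples


# Second row deliberately omits the third column to show the fallback.
example = io.StringIO(
    "patient1\t20100113\t150\n"
    "patient1\t20110202\n"
    "patient2\t20081130\t150\n"
)
rows = read_samples(example)
for row in rows:
    print(row)
```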
Samples can be the result of PCR amplification. This can require some additional processing, e.g., primers might need trimming:
output:
  trim_primers: true
In order to complete these steps, additional information needs to be provided, e.g., a BED file describing the primers to be trimmed.
- This can be specified globally with several properties in the configuration file in section input:
  input:
    primers_bedfile: references/primers/SARS-CoV-2.primer.bed
    inserts_bedfile: references/primers/SARS-CoV-2.insert.bed
- The samples TSV file can contain an optional fourth column specifying the protocol:
  - When different samples have been processed with different library protocols, a lookup table with per-protocol specifics (primers bed and fasta) can be provided in a YAML file.
    references/primers.yaml:
    v41:
      name: SARS-CoV-2 ARTIC V4.1
      inserts_bedfile: references/primers/v41/SARS-CoV-2.insert.bed
      primers_bedfile: references/primers/v41/SARS-CoV-2.primer.bed
    v4:
      name: SARS-CoV-2 ARTIC V4
      inserts_bedfile: references/primers/v4/SARS-CoV-2.insert.bed
      primers_bedfile: references/primers/v4/SARS-CoV-2.primer.bed
    v3:
      name: SARS-CoV-2 ARTIC V3
      inserts_bedfile: references/primers/v3/nCoV-2019.insert.bed
      primers_bedfile: references/primers/v3/nCoV-2019.primer.bed
  - In the configuration file, this look-up can then be specified in section input -> property protocols_file:
    config/config.yaml:
    input:
      protocols_file: references/primers.yaml
  - The short name can now be referenced in the fourth column of the samples TSV table file:
    config/samples.tsv:
    sample_a 20211108 250 v3
    sample_b 20220214 250 v4
  This is useful if multiple different amplicon schemes have been used over the lifetime of a long-running project, as new variants appear over time with SNVs that require adapting amplicons.
- A virus base config can provide some defaults for either of the above, e.g., sars-cov-2 provides BED files for ARTIC v3, v4 and v4.1.
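The per-sample protocol look-up can be pictured as a simple dictionary access keyed by the short name from the fourth column. The Python sketch below mirrors the content of references/primers.yaml as a literal dictionary to stay dependency-free; the function name `bedfiles_for` is hypothetical and this is not V-pipe's actual code.

```python
# Look-up table mirroring the references/primers.yaml example above.
PROTOCOLS = {
    "v41": {
        "name": "SARS-CoV-2 ARTIC V4.1",
        "inserts_bedfile": "references/primers/v41/SARS-CoV-2.insert.bed",
        "primers_bedfile": "references/primers/v41/SARS-CoV-2.primer.bed",
    },
    "v4": {
        "name": "SARS-CoV-2 ARTIC V4",
        "inserts_bedfile": "references/primers/v4/SARS-CoV-2.insert.bed",
        "primers_bedfile": "references/primers/v4/SARS-CoV-2.primer.bed",
    },
    "v3": {
        "name": "SARS-CoV-2 ARTIC V3",
        "inserts_bedfile": "references/primers/v3/nCoV-2019.insert.bed",
        "primers_bedfile": "references/primers/v3/nCoV-2019.primer.bed",
    },
}


def bedfiles_for(protocol, protocols=PROTOCOLS):
    """Return (primers_bedfile, inserts_bedfile) for a protocol short name."""
    try:
        entry = protocols[protocol]
    except KeyError:
        raise ValueError(f"unknown protocol short name: {protocol!r}") from None
    return entry["primers_bedfile"], entry["inserts_bedfile"]


# Resolve the protocols of the two samples from the example samples table.
for sample, protocol in [("sample_a", "v3"), ("sample_b", "v4")]:
    primers, inserts = bedfiles_for(protocol)
    print(sample, primers, inserts)
```

Raising an explicit error for an unknown short name is a design choice of this sketch; it makes a typo in the samples table visible early instead of silently skipping primer trimming.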
V-pipe expects the input samples to be organized in a two-level directory hierarchy.
- The first level can be, e.g., patient samples or biological replicates of an experiment.
- The second level can be, e.g., different sampling dates or different sequencing runs of the same sample.
- Inside that directory, the sub-directory raw_data/ holds the sequencing data in FASTQ format (optionally compressed with GZip).
For example:
samples
├── patient1
│   ├── 20100113
│   │   └──raw_data
│   │      ├──patient1_20100113_R1.fastq
│   │      └──patient1_20100113_R2.fastq
│   └── 20110202
│       └──raw_data
│          ├──patient1_20100202_R1.fastq
│          └──patient1_20100202_R2.fastq
└── patient2
    └── 20081130
        └──raw_data
           ├──patient2_20081130_R1.fastq.gz
           └──patient2_20081130_R2.fastq.gz
The utils subdirectory contains mass-importer tools to assist you in generating this hierarchy.
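As an illustration of how a sample list could be derived from this two-level hierarchy, here is a minimal Python sketch using `pathlib`. It is not the code V-pipe uses to search the input datadir, and the function name `find_samples` is hypothetical; it simply reports every first-level/second-level pair that contains a raw_data/ sub-directory.

```python
import tempfile
from pathlib import Path


def find_samples(datadir):
    """Return sorted (first_level, second_level) pairs that hold a raw_data/ directory."""
    root = Path(datadir)
    return sorted(
        (raw.parent.parent.name, raw.parent.name)
        for raw in root.glob("*/*/raw_data")
        if raw.is_dir()
    )


# Build a throw-away copy of the example hierarchy and scan it.
with tempfile.TemporaryDirectory() as tmp:
    layout = [("patient1", "20100113"), ("patient1", "20110202"), ("patient2", "20081130")]
    for first, second in layout:
        (Path(tmp) / "samples" / first / second / "raw_data").mkdir(parents=True)
    found = find_samples(Path(tmp) / "samples")

# Print the pairs as tab-separated rows, in the same shape as a samples table.
for first, second in found:
    print(f"{first}\t{second}")
```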