Configuring V-pipe

In order to start using V-pipe, you need to provide three things:

  1. Samples in a specific directory structure
  2. (optional) TSV file listing the samples
  3. Configuration file

The utils subdirectory provides tools that can assist in importing sample files and arranging them in this structure.

Configuration file

The V-pipe workflow is customized using a structured configuration file called config.yaml, config.json or, for backward compatibility, vpipe.config (INI-like format).

This configuration file is a text file with a basic structure composed of sections, properties and values. When using the YAML or JSON format, use these languages' associative arrays/dictionaries in two levels, one for sections and one for properties. When using the older INI format, sections are expected in square brackets, and properties are followed by their corresponding values.
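To illustrate how the INI layout maps onto the same two-level structure of sections and properties, here is a minimal Python sketch using the standard library's configparser. The helper name ini_to_sections and the sample text are hypothetical, and V-pipe's own parsing may differ. Note that configparser lower-cases property names by default and keeps every value as a string.

```python
import configparser

# Hypothetical vpipe.config content in the legacy INI-like syntax.
INI_TEXT = """
[input]
datadir = samples
samples_file = config/samples.tsv

[output]
datadir = results
snv = true
"""


def ini_to_sections(text: str) -> dict[str, dict[str, str]]:
    """Parse INI text into a two-level dict: section -> property -> value."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # Each [section] becomes an outer key; its properties become the inner dict.
    return {section: dict(parser[section]) for section in parser.sections()}


config = ini_to_sections(INI_TEXT)
print(config["output"]["datadir"])  # prints "results"
```

The resulting dictionary has the same shape as a config.yaml with top-level input and output sections, except that INI values such as true remain strings.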

Furthermore, additional options can be specified on the command line, using Snakemake's --configfile to pass additional YAML/JSON configuration files and/or Snakemake's --config to pass sections and properties in YAML flow style/JSON syntax.

Here is an example of config.yaml:

general:
  virus_base_config: hiv

input:
  datadir: samples
  samples_file: config/samples.tsv

output:
  datadir: results
  snv: true
  local: true
  global: false
  visualization: true
  QA: true

At minimum, a valid configuration MUST provide a reference sequence against which to align the short reads from the raw data. This can be done in several ways:

  • by using a virus base config that will provide default presets for specific viruses
  • by directly passing a reference .fasta file in the section input -> property reference that will override the default

virus base config

We provide virus-specific base configuration files which contain handy defaults for some viruses.

Currently, the following virus base configs are available:

  • hiv: provides HXB2 as a reference sequence for HIV, and sets the default aligner to ngshmmalign.
  • sars-cov-2: provides NC_045512.2 as a reference sequence for SARS-CoV-2, sets the default aligner to bwa and sets the variant calling to be done against the reference instead of the cohort's consensus. In addition, a look-up for the recent versions of the ARTIC protocol is provided; this makes it possible to set a per-sample protocol in the samples table and to turn on amplicon trimming (see amplicon protocols).

configuration manual

An exhaustive list of the available configuration options, with more information about each, can be found in config.html or online.

legacy V-pipe 1.xx/2.xx users

If you want to re-use your old configuration from a legacy V-pipe v1.x/2.x installation or the sars-cov2 branch, this is possible as long as you keep the following caveats in mind:

  • The older INI-like syntax is still supported for a vpipe.config configuration file.
    • This configuration will be overridden by config.yaml or config.json; you might want to delete those files from your working directory if you are not using them.
  • V-pipe starting from version 2.99.1 follows the Standardized usage rules of the Snakemake Workflow Catalog
    • This defines a newer directory structure
      • the samples TSV table is now expected to be in config/samples.tsv (use the section input -> property samples_file to override).
      • the per-sample output is no longer written in the same samples/ directory as the input, but in a separate directory called results/ (use the section output -> property datadir to override).
      • the cohort-wide output is no longer written in a different variants/ directory, but at the base of the output datadir, i.e., by default in results/ (use the section output -> property cohortdir to specify a different path relative to the output datadir).
    • Add the following sections and properties to your vpipe.config configuration file to bring back the legacy behaviour:
[input]
datadir=samples
samples_file=samples.tsv

[output]
datadir=samples
cohortdir=../variants
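
To see why cohortdir=../variants brings back the old variants/ directory, recall that cohortdir is interpreted relative to the output datadir. The following Python sketch shows only that path arithmetic. The helper name cohort_path is hypothetical and the empty default for cohortdir is an assumption; this is not V-pipe's actual code.

```python
import os.path


def cohort_path(output_datadir: str, cohortdir: str = "") -> str:
    """Resolve cohortdir relative to the output datadir (assumed default: empty)."""
    return os.path.normpath(os.path.join(output_datadir, cohortdir))


# Current layout: cohort-wide output at the base of the output datadir.
print(cohort_path("results"))  # prints "results"

# Legacy settings: datadir=samples and cohortdir=../variants.
print(cohort_path("samples", "../variants"))  # prints "variants"
```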

As of version 2.99.1, only the analysis of viral sequencing data has been extensively tested and is guaranteed stable. For other more advanced functionality you might want to wait until a future release.

samples tsv

File containing sample unique identifiers and dates as tab-separated values.

Example: here, we have two samples from patient 1 and one sample from patient 2:

patient1	20100113
patient1	20110202
patient2	20081130

By default, V-pipe searches for a file named config/samples.tsv. If this file does not exist, a list of samples is built by searching the contents of the input datadir.
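
The fallback described above can be sketched in Python as follows. This is an illustration under stated assumptions, not V-pipe's implementation: the function name list_samples is hypothetical, and the directory scan simply treats every second-level directory under the input datadir as one sample.

```python
import csv
from pathlib import Path


def list_samples(samples_file: Path, datadir: Path) -> list[tuple[str, str]]:
    """Return (sample, date) pairs from the TSV, or from the directory tree if the TSV is missing."""
    if samples_file.is_file():
        with samples_file.open(newline="") as handle:
            rows = csv.reader(handle, delimiter="\t")
            # Only the first two columns identify a sample; blank lines are skipped.
            return [(row[0], row[1]) for row in rows if len(row) >= 2]
    # Fallback (assumption): every <datadir>/<sample>/<date>/ directory is one sample.
    return sorted(
        (date_dir.parent.name, date_dir.name)
        for date_dir in datadir.glob("*/*")
        if date_dir.is_dir()
    )
```

With the default layout, a call would look like list_samples(Path("config/samples.tsv"), Path("samples")).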

read-length

The samples' read length is used in critical steps of the pipeline (e.g., quality filtering). There are several ways to set its value:

  • by default, V-pipe expects a read-length of 250bp

  • this default can be globally overridden in the configuration file in section input -> property read_length

    input:
      read_length: 150
  • the samples TSV file can contain an optional third column specifying the read length. This is particularly useful when samples are sequenced using protocols with different read lengths.

    patient1	20100113	150
    patient1	20110202	200
    patient2	20081130	150

    The utils subdirectory contains mass-importer tools that can generate this third column while importing samples.
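
The three options above suggest a precedence order, which the following Python sketch makes explicit. The function name read_length_for is hypothetical, and the order (third TSV column, then input -> read_length, then 250) is an assumption drawn from the description rather than from V-pipe's source.

```python
DEFAULT_READ_LENGTH = 250  # default stated in the text above


def read_length_for(row: list[str], config: dict) -> int:
    """Pick the read length for one row of the samples TSV file.

    Assumed precedence: third TSV column, then input -> read_length, then 250.
    """
    if len(row) >= 3 and row[2].strip():
        return int(row[2])
    return int(config.get("input", {}).get("read_length", DEFAULT_READ_LENGTH))
```

For example, the row patient1, 20110202, 200 resolves to 200 regardless of the configuration, while a two-column row falls back to the configured or default value.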

amplicon protocols

Samples can be the result of PCR amplification. This can require some additional processing, e.g., primers might need trimming:

output:
  trim_primers: true

In order to complete these steps, additional information needs to be provided, e.g., a BED file describing the primers to be trimmed.

  • This can be specified globally with several properties in the configuration file in section input:

    input:
      primers_bedfile: references/primers/SARS-CoV-2.primer.bed
      inserts_bedfile: references/primers/SARS-CoV-2.insert.bed
  • The samples TSV file can contain an optional fourth column specifying the protocol:

    • When different samples have been processed with different library protocols, a look-up table with per-protocol specifics (primers BED and FASTA files) can be provided in a YAML file. references/primers.yaml:
      v41:
        name: SARS-CoV-2 ARTIC V4.1
        inserts_bedfile: references/primers/v41/SARS-CoV-2.insert.bed
        primers_bedfile: references/primers/v41/SARS-CoV-2.primer.bed
      v4:
        name: SARS-CoV-2 ARTIC V4
        inserts_bedfile: references/primers/v4/SARS-CoV-2.insert.bed
        primers_bedfile: references/primers/v4/SARS-CoV-2.primer.bed
      v3:
        name: SARS-CoV-2 ARTIC V3
        inserts_bedfile: references/primers/v3/nCoV-2019.insert.bed
        primers_bedfile: references/primers/v3/nCoV-2019.primer.bed
    • In the configuration file, this look-up can then be specified in section input -> property protocols_file. config/config.yaml:
      input:
        protocols_file: references/primers.yaml
    • The short name can now be referenced in the fourth column of the samples TSV table. config/samples.tsv:
      sample_a	20211108	250	v3
      sample_b	20220214	250	v4

    This is useful if several different amplicon schemes have been used over the lifetime of a long-running project, as new variants appear over time with SNVs that require adapting the amplicons.

  • A virus base config can provide defaults for either of the above, e.g., sars-cov-2 provides BED files for ARTIC v3, v4 and v4.1.
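
How a per-sample protocol could be resolved to its BED files is sketched below in Python. The look-up table mirrors references/primers.yaml from above, written as a dictionary so the example needs no YAML library. The function name bedfiles_for is hypothetical, and the fallback to the global input properties when the fourth column is absent is an assumption, not confirmed V-pipe behaviour.

```python
# Same content as references/primers.yaml above, already loaded into a dict.
PROTOCOLS = {
    "v41": {
        "name": "SARS-CoV-2 ARTIC V4.1",
        "inserts_bedfile": "references/primers/v41/SARS-CoV-2.insert.bed",
        "primers_bedfile": "references/primers/v41/SARS-CoV-2.primer.bed",
    },
    "v4": {
        "name": "SARS-CoV-2 ARTIC V4",
        "inserts_bedfile": "references/primers/v4/SARS-CoV-2.insert.bed",
        "primers_bedfile": "references/primers/v4/SARS-CoV-2.primer.bed",
    },
    "v3": {
        "name": "SARS-CoV-2 ARTIC V3",
        "inserts_bedfile": "references/primers/v3/nCoV-2019.insert.bed",
        "primers_bedfile": "references/primers/v3/nCoV-2019.primer.bed",
    },
}


def bedfiles_for(row: list[str], protocols: dict, config: dict) -> tuple:
    """Return (primers_bedfile, inserts_bedfile) for one samples-TSV row."""
    if len(row) >= 4 and row[3].strip():
        # Per-sample protocol: the short name must exist in the look-up table,
        # otherwise a KeyError is raised.
        entry = protocols[row[3].strip()]
    else:
        # Assumed fallback: the global properties in section input.
        entry = config.get("input", {})
    return entry.get("primers_bedfile"), entry.get("inserts_bedfile")
```

Calling bedfiles_for(["sample_a", "20211108", "250", "v3"], PROTOCOLS, {}) selects the ARTIC V3 files listed in the table.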

samples

V-pipe expects the input samples to be organized in a two-level directory hierarchy.

  • The first level can be, e.g., patient samples or biological replicates of an experiment.
  • The second level can be, e.g., different sampling dates or different sequencing runs of the same sample.
  • Inside that directory, the sub-directory raw_data/ holds the sequencing data in FASTQ format (optionally compressed with GZip).

For example:

samples
├── patient1
│   ├── 20100113
│   │   └──raw_data
│   │      ├──patient1_20100113_R1.fastq
│   │      └──patient1_20100113_R2.fastq
│   └── 20110202
│       └──raw_data
│          ├──patient1_20110202_R1.fastq
│          └──patient1_20110202_R2.fastq
└── patient2
    └── 20081130
        └──raw_data
           ├──patient2_20081130_R1.fastq.gz
           └──patient2_20081130_R2.fastq.gz

The utils subdirectory contains mass-importer tools to assist you in generating this hierarchy.
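
Before running the pipeline, it can help to check that a directory follows this hierarchy. The Python sketch below lists the FASTQ files found under each raw_data/ directory. The function name find_raw_reads is hypothetical, and matching only the suffixes .fastq and .fastq.gz is an assumption based on the example tree; it is not part of V-pipe or its utils.

```python
from pathlib import Path

# Assumption: only these suffixes are treated as sequencing data.
FASTQ_SUFFIXES = (".fastq", ".fastq.gz")


def find_raw_reads(datadir: Path) -> dict[tuple[str, str], list[str]]:
    """Map each (first level, second level) pair to the FASTQ file names in its raw_data/."""
    found = {}
    for raw_dir in sorted(datadir.glob("*/*/raw_data")):
        if not raw_dir.is_dir():
            continue
        date_dir = raw_dir.parent
        key = (date_dir.parent.name, date_dir.name)
        found[key] = sorted(
            f.name for f in raw_dir.iterdir() if f.name.endswith(FASTQ_SUFFIXES)
        )
    return found
```

An entry with an empty list points to a raw_data/ directory that contains no recognised FASTQ files.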