diff --git a/docs/ARGUMENTS.md b/docs/ARGUMENTS.md
index 9764330..7c0a69a 100644
--- a/docs/ARGUMENTS.md
+++ b/docs/ARGUMENTS.md
@@ -45,7 +45,6 @@

 * **_--vsearchMinQual_** Minimum Phred base quality score required to retain read or read pair (default:30)
 * **_--vsearchMaxee_** Maximum number of expected errors tolerated to retain read or read pair (default:0.5)
-* **_--vsearchMinlen_** Discard read (or read pair) if its length is shorter than this (default:64)
 * **_--vsearchMinovlen_** Discard read pair if the alignment length is shorter than this (default:10)

 ## PROCESS Arguments
@@ -55,6 +54,7 @@
 * **_--permittedSequences_** Nucleotide sequence of IUPAC ambiguity codes (A/C/G/T/R/Y/S/W/K/M/B/D/H/V/N) with length matching the number of mutated positions (i.e upper-case letters) in '_--wildtypeSequence_' (default:N i.e. any substitution mutation allowed)
 * **_--sequenceType_** Coding potential of sequence: either 'noncoding', 'coding' or 'auto'. If the specified wild-type nucleotide sequence ('_--wildtypeSequence_') has a valid translation without a premature STOP codon, it is assumed to be 'coding' (default:'auto')
 * **_--mutagenesisType_** Whether mutagenesis was performed at the nucleotide or codon/amino acid level; either 'random' or 'codon' (default:'random')
+* **_--indels_** Indel variants to be retained: either 'all', 'none' or a comma-separated list of sequence lengths (default:'none')
 * **_--maxSubstitutions_** Maximum number of nucleotide or amino acid substitutions for coding or non-coding sequences respectively (default:2)
 * **_--mixedSubstitutions_** For coding sequences, are nonsynonymous variants with silent/synonymous substitutions in other codons allowed? (default:F)

diff --git a/docs/FILEFORMATS.md b/docs/FILEFORMATS.md
index ee4fd43..e55741d 100644
--- a/docs/FILEFORMATS.md
+++ b/docs/FILEFORMATS.md
@@ -71,11 +71,11 @@ Primary output files:

 Additional output files:
 * **fitness_wildtype.txt** Wild-type fitness score and associated error.
-* **fitness_singles.txt** Single amino acid or nucleotide variant fitness scores and associated errors.
-* **fitness_doubles.txt** Double amino acid or nucleotide variant fitness scores and associated errors.
-* **fitness_silent.txt** Silent (synonymous) variant fitness scores and associated errors (for coding sequences only).
-* **fitness_singles_MaveDB.csv** [MaveDB](https://www.mavedb.org/) compatible .csv file with single amino acid or nucleotide variant fitness scores and associated errors.
+* **fitness_singles.txt** Single amino acid or nucleotide substitution variant fitness scores and associated errors.
+* **fitness_doubles.txt** Double amino acid or nucleotide substitution variant fitness scores and associated errors.
+* **fitness_silent.txt** Silent (synonymous) substitution variant fitness scores and associated errors (for coding sequences only).
+* **fitness_singles_MaveDB.csv** [MaveDB](https://www.mavedb.org/) compatible .csv file with single amino acid or nucleotide substitution variant fitness scores and associated errors.
 * **DiMSum_Project_variant_data_merge.tsv** Tab-separated plain text file with variant counts and statistics.
 * **DiMSum_Project_nobarcode_variant_data_merge.tsv** Tab-separated plain text file with sequenced barcodes that were not found in the variant identity file.
-* **DiMSum_Project_indel_variant_data_merge.tsv** Tab-separated plain text file with indel variants.
-* **DiMSum_Project_rejected_variant_data_merge.tsv** Tab-separated plain text file with rejected variants (internal constant region mutants, mutations inconsistent with the library design or variants with too many substitutions).
+* **DiMSum_Project_indel_variant_data_merge.tsv** Tab-separated plain text file with rejected indel variants.
+* **DiMSum_Project_rejected_variant_data_merge.tsv** Tab-separated plain text file with remaining rejected variants (internal constant region mutants, mutations inconsistent with the library design or variants with too many substitutions).
diff --git a/docs/INSTALLATION.md b/docs/INSTALLATION.md
index cdd341e..132a5a8 100644
--- a/docs/INSTALLATION.md
+++ b/docs/INSTALLATION.md
@@ -26,7 +26,7 @@ curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
 sh Miniconda3-latest-Linux-x86_64.sh
 ```

-**IMPORTANT:** If in doubt, respond with "yes" when prompted during installation.
+**IMPORTANT:** If in doubt, respond with "yes" to the following question during installation: "Do you wish the installer to initialize Miniconda3 by running conda init?". In this case Conda will modify your shell scripts (*~/.bashrc* or *~/.bash_profile*) to initialize Miniconda3 on startup. Ensure that any future modifications to your *$PATH* variable in your shell scripts occur **before** this code to initialize Miniconda3.

 After installing Conda you will need to add the bioconda channel as well as the other channels bioconda depends on. Start a new console session (e.g. by closing the current window and opening a new one) and run the following:
 ```
@@ -35,7 +35,7 @@ conda config --add channels bioconda
 conda config --add channels conda-forge
 ```

-Next, optionally, create a dedicated environment for DiMSum and it's dependencies. This is recommended if you already have _R_ and/or _Python_ installations that you need to maintain separately.
+Next, optionally, create a dedicated environment for DiMSum and it's dependencies. This is recommended if you already have _R_ and/or _Python_ installations that you would like to maintain in a separate environment.
 ```
 conda create --name dimsum
 conda activate dimsum
diff --git a/docs/PIPELINE.md b/docs/PIPELINE.md
index ec4b3e0..171022c 100644
--- a/docs/PIPELINE.md
+++ b/docs/PIPELINE.md
@@ -32,15 +32,15 @@ Align overlapping read pairs using *[VSEARCH](INSTALLATION.md)* and filter resul

 Combine sample-wise variant counts and statistics to produce a unified results data.table. After aggregating counts across technical replicates, variants are processed and filtered according to user specifications (see [stage-specific arguments](ARGUMENTS.md#process-arguments)):
 * **4.1** For [Barcoded library designs](ARGUMENTS.md#barcoded-library-design), read counts are aggregated at the variant level for barcode/variant mappings specified in the [Variant Identity File](FILEFORMATS.md#variant-identity-file). Undefined/misread barcodes are ignored.
-* **4.2** Indel variants (defined as those not matching the wild-type nucleotide sequence length) are removed.
-* **4.3** If internal constant region(s) are specified, these are excised from all variants if a perfect match is found (see ['_--wildtypeSequence_' argument](ARGUMENTS.md#process-arguments)).
-* **4.4** Variants with mutations inconsistent with the library design are removed (see ['_--permittedSequences_' argument](ARGUMENTS.md#process-arguments)).
-* **4.5** Variants with more substitutions than desired are also removed (see ['_--maxSubstitutions_' argument](ARGUMENTS.md#process-arguments)).
-* **4.6** Finally, nonsynonymous variants with synonymous substitutions in other codons are removed if necessary (see ['_--mixedSubstitutions_' argument](ARGUMENTS.md#process-arguments)).
+* **4.2** Indel variants (defined as those not matching the wild-type nucleotide sequence length) are removed if necessary (see ['_--indels_' argument](ARGUMENTS.md#process-arguments)).
+* **4.3** If internal constant region(s) are specified, these are excised from all substitution variants if a perfect match is found (see ['_--wildtypeSequence_' argument](ARGUMENTS.md#process-arguments)).
+* **4.4** Substitution variants with mutations inconsistent with the library design are removed (see ['_--permittedSequences_' argument](ARGUMENTS.md#process-arguments)).
+* **4.5** Substitution variants with more substitutions than desired are also removed (see ['_--maxSubstitutions_' argument](ARGUMENTS.md#process-arguments)).
+* **4.6** Finally, nonsynonymous substitution variants with synonymous substitutions in other codons are removed if necessary (see ['_--mixedSubstitutions_' argument](ARGUMENTS.md#process-arguments)).

 ## Stage 5: **ANALYSE** counts (_STEAM_)

-Calculate fitness and error estimates for a user-specified subset of substitution variants (see [stage-specific arguments](ARGUMENTS.md#analyse-arguments)):
+Calculate fitness and error estimates for a user-specified subset of variants (see [stage-specific arguments](ARGUMENTS.md#analyse-arguments)):
 * **5.1** Optionally remove low count variants according to user-specified soft/hard thresholds to minimise the impact of "fictional" variants from sequencing errors.
 * **5.2** Calculate replicate normalisation parameters (scale and shift) to minimise inter-replicate fitness differences.
 * **5.3** Fit the error model to a high confidence subset of variants to determine additive and multiplicative error terms.
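
The '_--indels_' behaviour documented for step 4.2 in this diff can be sketched as follows. This is a minimal Python illustration, not DiMSum's own implementation: the function names `parse_indels` and `keep_variant` are hypothetical, and the sketch assumes that a comma-separated list means "retain indel variants whose nucleotide sequence length is one of the listed values". Variants whose length matches the wild type are treated as substitution variants and always pass this step.

```python
def parse_indels(value):
    """Parse a hypothetical '--indels' value.

    Returns 'all', 'none', or a set of permitted sequence lengths.
    Assumption: list entries are integer nucleotide sequence lengths.
    """
    value = value.strip()
    if value in ("all", "none"):
        return value
    return {int(token) for token in value.split(",")}


def keep_variant(nt_seq, wildtype_seq, indels):
    """Decide whether a variant survives the indel filter (step 4.2)."""
    # Same length as the wild type: not an indel by the documented definition.
    if len(nt_seq) == len(wildtype_seq):
        return True
    if indels == "all":
        return True
    if indels == "none":
        return False
    # Otherwise 'indels' is a set of sequence lengths to retain.
    return len(nt_seq) in indels


# Toy data, invented for illustration only.
wildtype = "ATGGCTAAA"
variants = ["ATGGCTAAA", "ATGGCAAAA", "ATGGCTAAATTT", "ATGAAA"]

# Retain indel variants of length 6 only.
kept = [v for v in variants if keep_variant(v, wildtype, parse_indels("6"))]
print(kept)
```

With the default 'none', only the two length-9 sequences would be kept, which matches the behaviour before this change, when all indel variants were removed.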