Nadolina/decontam_pipe_2

decontam_pipe.pdf

Following the VGP genome assembly pipeline 2.1, scaffolded assemblies require decontamination to remove terminal gaps, non-target contaminants (e.g., bacterial or human sequences), and sequences originating from the mitochondria. This pipeline has been tested on vertebrate genomes only.

Dependencies

The pipe is written as shell and Python scripts.

Inputs and running the pipeline

All scripts and programs are submitted via the shell pipeline. Note that the pipe was developed for internal VGL/VGP use and is therefore built around submission to a Slurm queueing system. The pipe can be run from the command line:

sbatch -p VGP_decontamination_pipe.sh

Fasta files must be decompressed. The unique ID can be anything; it is only used for naming throughout the pipe. For VGP purposes, we use the TOLID (e.g., bTaeGut2).
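Because the pipe expects an uncompressed fasta, a gzipped assembly has to be unpacked before submission. The following is a minimal sketch of one way to do that in Python; the helper name `decompress_fasta` is hypothetical and is not part of this repository, and it assumes the input is gzip-compressed with a `.gz` extension.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fasta(path: str) -> Path:
    """Return the path to an uncompressed copy of a fasta file.

    Hypothetical helper, not part of the pipe. If the file does not end
    in ".gz" it is assumed to be uncompressed already and is returned as-is.
    """
    src = Path(path)
    if src.suffix != ".gz":
        return src
    # with_suffix("") drops only the trailing ".gz" (x.fasta.gz -> x.fasta)
    dest = src.with_suffix("")
    with gzip.open(src, "rb") as fin, open(dest, "wb") as fout:
        # Stream in chunks so large assemblies are not loaded into memory
        shutil.copyfileobj(fin, fout)
    return dest
```

The returned path is what would then be handed to the pipeline; running `gunzip` on the command line achieves the same result.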

Outputs

  1. class-/ unclass_bTaeGut2_ - two separate files containing the classified (contaminant) and unclassified (target) scaffolds; a future update will remove the unclassified file, since a final fasta is generated at the end of the pipe.
  2. contam_scaffs_.txt - compiled list of contaminant scaffolds from the kraken2 and mito-blast subprocesses
  3. mito_blast_.report - produced by the parse_mito_blast.py script, which summarizes the blast output table and lists the highest coverage scaffold-accession number pairs (high coverage = mito-contaminant)
  4. N_sub_masked_ - fasta with hard masking after dustmasker + sub_soft_hard_mask.py
  5. trimmed_ - THE FINAL OUTPUT; a scaffolded assembly from which terminal gaps and all contaminants (non-target and mito) have been removed
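Two of the outputs above rest on simple sequence operations: converting soft masking to hard masking (output 4) and removing terminal gaps (output 5). The sketch below illustrates both on plain strings. It is not the code in sub_soft_hard_mask.py or the trimming step of the pipe; it assumes that soft-masked bases are lowercase (the usual convention for dustmasker fasta output), that hard masking means replacing them with N, and that a terminal gap is a run of N at either end of a scaffold. The function names are hypothetical.

```python
def hard_mask(seq: str) -> str:
    """Replace every soft-masked (lowercase) base with N.

    Assumption: lowercase letters mark low-complexity regions.
    """
    return "".join("N" if base.islower() else base for base in seq)


def trim_terminal_gaps(seq: str) -> str:
    """Strip runs of N (or n) from both ends of a scaffold.

    Internal gaps are left untouched, since they separate contigs
    within the scaffold.
    """
    return seq.strip("Nn")


if __name__ == "__main__":
    print(hard_mask("ACGTacgtACGT"))                # ACGTNNNNACGT
    print(trim_terminal_gaps("NNACGTNNNNACGTNNN"))  # ACGTNNNNACGT
```

The two functions are shown independently on purpose: applying the trim to an already hard-masked sequence would also remove masked bases at the scaffold ends, which may not be what is wanted.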
