CryProcessor is a high-troughtput tool for the Cry toxins mining from the fasta-files or directly from the illumina reads.
CryProcessor is a python-written tool for searching and extracting Cry toxins from illumina sequence data or from the protein fasta files. It includes several parts: an hmm-based scanning for potential Cry toxins, obtaining information about the domains, extracting Cry toxins with 3 domains only and comparing found toxins with Bt nomenclature.
The mode for performing the toxins search directly from the illumina reads implies building an assembly graph (using SPAdes) and the subsequent mining toxins directly from the obtained assebmly graph.
The following text stands for the full pipeline description (for the illumina reads). To start, SPAdes (http://cab.spbu.ru/software/spades/) or metaSPAdes (http://cab.spbu.ru/software/meta-spades/) are implemented to get the assembly graph from the fastq-files. After that, the potential Cry toxins (with at least 30% identity to the hmm-consensus) are extracted from the assembly paths via PathRacer (http://cab.spbu.ru/software/pathracer/). Then hmmsearch (http://hmmer.org/) is used to find Cry toxin domains in the obtained sequences. In the next step, the results of hmmsearch are combined to get the toxins that posses all three domains.
The coordinates of the domains are used to cut flanking sequences and save the domains with the corresponding linkers. The full sequences (without processing procedure) are used to compare the obtained toxins with the Bt nomenclature database via diamond blastp (https://github.com/bbuchfink/diamond). The non-identical sequences are extracted and marked as the potentially new toxins.
For all the found sequences (both identical to presented in Bt nomenclature and the novel sequences) an online ipg-annotation (Identical Protein Group) is performed (to see the annotation output read the annotation output section below). Finally, nucleotide sequences, corresponding to the protein sequences of the found toxins, are downloaded. Metadata will be uploaded only if the accession numbers are present in the query.
- python (3.7 or higher);
- Biopython (1.66 or higher).
To install Biopython use the following command:
~$ pip install biopython
To install CryProcessor clone git repository to your PC:
~$ git clone https://github.com/lab7arriam/cry_processor
After downloading, CryProcessor is ready to use. You can also add CryProcessor to the PATH typing:
~$ PATH=$PATH:/path/to/install
If you want to use prebuild Docker container with CryProcessor, pull it using the following command:
~$ docker pull lab7arriam/cry_processor
To extract Cry toxins from the protein fasta file simply execute the following command:
~$ /path/to/install/cry_processor.py -fi input.faa -od output_dir
This command will automatically search for the Cry toxins in the fasta file with amino acid sequences.
If you use Docker container, you should use following command:
~$ docker run --rm -v /path/to/data:/data lab7arriam/cry_processor -fi /data/<input file> -od /data/<output_dir> [args]
- fasta files with protein sequences;
- gfa files (representing genome assebly graph);
- forward and reverse illumina reads.
The full list of tool options:
-fi <input.fasta> or <input.gfa> an input file in the fasta format of in the gfa format
-hm <path to hmmer directory> path to the hmmer directory if you want to use local hmmer
-pr <1 or 2> the processig type: 1 for extracting all the domains, 2 for extrating 2-3 domains only (default 1)
-th the nubmer of threads for hmmer/SPAdes/PathRacer (default 8)
-ma <e-mail> an e-mail address for the connecting to NCBI
-od <output dir> the output directory
-r <do or fd> the running mode: do - the domain only search only; fd - the full toxins mining with the subsequent domain search
-a (--annotate) perform the data anotation with ipg (only if the accession numbers are present)
-nu <fn or pn or an> upload the nucleotide sequences: fn - the full sequences, pn - the processed sequences, an - the both variants
-pa (--pathracer) - launching PathRacer on the gfa file
-fo <input_1.fastq> - forward illumina reads
-re <input_2.fastq> - reverse illumina reads
-n (--meta) - the flag for specifying the metagenomic mode forSPAdes
-k <number> - the K-mer size for PathRacer (default 21)
-s (--silent) - disable the console output
-f --force - forced directory overwriting
To use the tool for the files in the fasta format execute the command, presented in quick usage, you can also specify the annotation (writing an e-mail address is strongly recommended):
~$ /path/to/install/cry_processor.py -fi input.faa -od output_dir -ma <e-mail address> --annotate
Use the -nu flag to download nucleotide sequences:
~$ /path/to/install/cry_processor.py -fi input.faa -od output_dir -ma <e-mail address> --annotate -nu pn
The pipeline of searching could be performed in two modes:
- do - the domain only mode. Searches the Cry toxin domains from the full queiry, then combines this data to extract the toxins with 3 domains;
- fd - the find domains mode. At the begining, hmmsearch with the full Cry toxin model is performed. Next, the domains are extracted and combined. This mode is quicker, as far as the domain search is performed on previously extracted sequences.
You can apply the Cry toxins search directly from the assembly graph in the gfa format with the following commad:
~$ /path/to/install/cry_processor.py -fi input.gfa -od output_dir --path_racer
This mode includes the reads assembly with SPAdes and the subsequent hmm-based toxins mining. To implement this use the following command:
~$ /path/to/install/cry_processor.py -fo forward_reads.fastq -re reverse_reads.fastq -od output_dir
If you want to search for the Cry toxins from metagenomic reads specify --meta flag:
~$ /path/to/install/cry_processor.py -fo forward_reads.fastq -re reverse_reads.fastq -od output_dir --meta
Note that you cannot mix modes:
- Do not use the --pathracer flag with the illumina fastq files query flags (-fo and -re);
- Do not mix the -fi agrument with -fo and -re arguments;
- Do not mix the --pathracer flag and the --meta flag;
- Do not specify the --meta mode with the -fi agrument;
- You should use both the -fo and -re argumens together;
Note that performing the anotation is not recommended for the gfa and assembly modes, because the online annotation is impossible without the accession numbers in the query.
Using the --annotate flag will perform the NCBI-search in the ipg database for the submitted accession numbers within the query and return gathered information in tsv-format with the following structure:
protein_id | initial_description | top_cry_hit | cry_identity | source | nucl_accession | start | stop | strand | ipg_prot_id | ipg_prot_name | organism | strain | assembly |
AFU17015.1 | AFU17015.1 pesticidal crystal proteinBacillus thuringiensis MC28 | Cry4Cc1 | 99.4 | INSDC | CP003690.1 | 58993 | 62628 | + | AFU17015.1 | pesticidal crystal protein | Bacillus thuringiensis MC28 | MC28 | GCA_000300475.1 |
Columns description:
- protein_id - initial sequence id in the query
- initial_description - protein description in the query
- top_cry_hit - Cry protein from Bt nomenclature with the best identity score according to diamond
- cry_identity - identity score with the closest Cry protein from Bt nomenclature
- source - database source of the sequence
- nucl_accession - accession number for the nucleotide sequence
- start - starting position in the full nucleotide sequence
- stop - ending position in the full nucleotide sequence
- strand - DNA strand orientation for the nucleotide sequence
- ipg_prot_id - protein id according to the ipg database
- ipg_prot_name - protein description to the ipg database
- organism - taxon description
- strain - strain of the organism
- assembly - assembly ids where the nucleotide sequence is present
Here's an expamle of such rows:
-- | -- | -- | -- | RefSeq | NZ_CP010111.1 | 142760 | 146194 | - | WP_080989235.1 | pesticidal protein | Bacillus thuringiensis serovar indiana | HD521 | GCF_001183785.1 |
In the output directory, specified with the -od flag, the following subdirectories with the following files could be generated:
- assembly - this directory is created if the assembly mode is enabled. It contains the files and directories that refer to the SPAdes output, for instanse, the assembly graph with the contigs in the gfa format. To see the full list of the SPAdes output look at the SPAdes manual (http://cab.spbu.ru/software/spades/);
- pathracer - this directory includes the PathRacer output. It will be created if the pathracer mode or the assembly mode are enabled. To see the full list of the PathRacer output read the manual (http://cab.spbu.ru/software/pathracer/);
- full_toxins -this directory is created if the fd mode is enabled. It contains the following files:
- <input>_full_extracted.sto - the result of hmmsearch with the full-toxin model in the sto format;
- <input>_full_extracted.fasta - the result of hmmsearch with the full-toxin model in the fasta format;
- domains. This directory contains the results of hmmsearch for all the domain models in the sto and fasta formats:
- <input>_D1_extracted.sto;
- <input>_D1_extracted.fasta;
- <input>_D2_extracted.sto;
- <input>_D2_extracted.fasta;
- <input>_D3_extracted.sto;
- <input>_D3_extracted.fasta;
- logs. This directory contains the log files for the different stages of the pipeline:
- spades.log - the log file for the assembly process;
- pathracer.log - the log file for the hmm-based search for the Cry toxins from the gfa file;
- full_extraction.log - the log file for performing hmmsearch on the full-toxin model;
- domains_extraction.log - the log file for performing hmmsearch on the domain models;
- coordinate_matches_<input>.txt - the dictionary of the domain mappings for the query;
- diamond.log - the log file for performing diamond blastp agaist the Bt nomenclature database;
- aligned_<input>.fa - the aligned sequences from diamond blastp;
- unaligned_<input>.fa - the unligned sequences from diamond blastp;
- diamond_matches_<input>.txt - the results of diamond blastp in the table form;
- cry_processor.log - the main CryProcessor logfile;
- first_search_<input>.fasta - the result of hmmsearch with the full model on the query in the fasta format. This file is later used as a queiry for the domains search;
- raw_full_<input>.fasta. This file contains the full protein sequences of the Cry toxins possesing all three domains;
- raw_processed_<input>.fasta -this file contains the processed protein sequences (without left and right lanking parts) of the Cry toxins possesing all three domains;
- proteins_domain_mapping_full_<input>.bed - the file in the bed format, containing the domain coordinate mappings for the uprocessed protein sequences;
- proteins_domain_mapping_processed_<input>.bed - the file in the bed format, containing the domain coordinate mappings for the processed protein sequences;
- unique_<input>.fasta - this file contains the potentially new Cry toxins, differing from Bt nnomenclature afrer diamond blastp;
- annotation_table_<input>.tsv - the result of the ipg annotation of the found toxins (the structure of the file is described above);
- <input>_full_nucl.fna - the full nucleotide sequences for the found toxins;
- <input>_processed_nucl.fna - the processed nucleotide sequences for the found toxins;
- nucl_domain_mapping_full_<input>.bed - the file in the bed format, containing the domain coordinate mappings for the uprocessed nucleotide sequences;
- nucl_domain_mapping_processed_<input>.bed - the file in the bed format, containing the domain coordinate mappings for the processed nucleotide sequences;
Note, that in the files, marked with the raw prefix the initial accession numbers from the query are used as ids, while in the files, obtained after diamond blastp (<input>_full_nucl.fna, <input>processed_nucl.fna and unique<input>.fasta) the id structure is modified in the following way:
- Cry53Aa1(40.7)_WP_103591149.1
- <top Cry protein hit from Bt nomenclature>(<the identity score with this hit>)_<the initial accession number>
- Bankevich A., Nurk S., Antipov D., Gurevich A., Dvorkin M., Kulikov A. S., Lesin V., Nikolenko S., Pham S., Prjibelski A., Pyshkin A., Sirotkin A., Vyahhi N., Tesler G., Alekseyev M. A., Pevzner P. A. SPAdes: A New Genome Assembly Algorithm and Its Applications to Single-Cell Sequencing. Journal of Computational Biology, 2012
- Buchfink B., Xie C. & Huson D.H. Fast and Sensitive Protein Alignment using DIAMOND. Nature Methods, 2015
- Finn R.D., Clements J., Eddy S.R. HMMER Web Server: Interactive Sequence Similarity Searching. Nucleic Acids Research, 2011.
- Nurk S., Meleshko D., Korobeynikov A., Pevzner P. A. metaSPAdes: a new versatile de novo metagenomics assembler. Genome Research, 2017
- Anton Shikov - Initial work - anton-shikov