The Distribution of Several Genomic Virulence Determinants Does Not Corroborate the Established Serotyping Classification of Bacillus thuringiensis.
A repository with working scripts for the IJMS (MDPI) 2021 paper
If you use the code/data from this repository, please cite: Shikov, A.E.; Malovichko, Y.V.; Lobov, A.A.; Belousova, M.E.; Nizhnikov, A.A.; Antonets, K.S. The Distribution of Several Genomic Virulence Determinants Does Not Corroborate the Established Serotyping Classification of Bacillus thuringiensis. Int. J. Mol. Sci. 2021, 22, 2244. https://doi.org/10.3390/ijms22052244
This repository contains scripts used for data preparation for the manuscript. Please consult the Methods section in the paper for extra details.
- agregate_cry_data.py - summarizes the spectra of 3-D Cry proteins in the assemblies based on CryProcessor results (Table S15 in the article);
- agregate_flagelin_data.py, compare_flagellin_sets.py – aggregating the distribution of lengths and abundances of the flagellin sequences clustered via Roary (Table S10 in the article);
- annotate_bt_assemblies.py – summarizing metadata of the assemblies inspected (Table S4 in the article);
- bt_pangenome_tree.R – visualizing trees and heatmaps, performing PCA (Figures 3-5 in the article);
- calculate_mash_dist.py – constructing a heatmap with paired mash distance scores for all analyzed genomes (Table S12 and Figure S6a in the article);
- calculate_mean_support.py – calculating mean supporting values for phylogenetic trees in Newick format (for Table S8 in the article);
- check_lengths.py, CheckLengths.py - calculates the length of each sequence in a fasta file and prints it as "sequence ID - sequence length";
- check_lengths_massive.py - runs check_lengths.py over a directory that contains only multifasta files;
- cluster_pivot_PCR.py - adds initial cluster names and cluster names for the obtained amplicons (not referred to in the final version of the manuscript);
- compHmmToAnnot.py - parses a table containing names of Roary-deduced orthologs, then compares them to the entries stored in the HMM output folder and extracts sequences matching the original cluster sequence names from the HMM outputs folder to a separate directory;
- compare_genomes_full.py - constructing a heatmap with paired genome identity values using minimap2 for all analyzed genomes (Table S12 and Figure S6b in the article);
- download_hags.py - downloads the hag gene sequences from Xu and Côté, 2006 (DOI: 10.1128/AEM.00328-06);
- downloading.sh – downloading Bt assemblies from the NCBI assembly database;
- extract_proteins_for_trees.py – extracting protein sequences from the Roary-generated pangenome for a specific gene cluster;
- extract_proteins.py - extracts sequences from the fasta files based on a file containing a list of identifiers;
- extract_proteins_blast.py - extracts query identifiers from the previously filtered BLAST outputs and uses them to fetch protein sequences from the files;
- ExtractByClusterName.py - extracts nucleotide sequences by the accessions stored in the Roary cluster table and fetches sequences from the cluster fasta files;
- ExtractByLength.py - finds the longest/shortest sequence in the sequence file;
- fetch_nucleotide.py - assigns sequences from the HMMer or BLAST output to the Roary cluster representatives and
- get_mean_identity.py – evaluating mean paired sequence identity in the fasta file (for Table S8 in the article);
- parse_tree_topology.py – assessing the lengths of subtrees containing representatives of Bt serovars (Table S14 in the article);
- roary_stat.py, summ_roary_stat.py – excluding assemblies from the Roary-generated pangenome based on the abundance of common gene clusters;
- summarize_proteins_from_dige.py – gathering the gene presence/absence results based on diamond blastp results (Table S6 in the article).
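
As an illustration of the kind of task handled by check_lengths.py and ExtractByLength.py, here is a minimal sketch that reports the length of every sequence in a fasta file and picks the longest and shortest records. It is not the repository's code: the function names `fasta_lengths` and `longest_and_shortest` and the example sequences are hypothetical, and the sketch uses only the Python standard library, whereas the actual scripts may rely on other tools.

```python
from typing import Dict, Iterable, Tuple


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Map each sequence ID (first word of the header) to its sequence length.

    Hypothetical helper, written for illustration only.
    """
    lengths: Dict[str, int] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the ID is the text up to the first whitespace.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            lengths[current_id] = 0
        elif current_id is not None:
            # Sequence line: multi-line records are summed up.
            lengths[current_id] += len(line)
    return lengths


def longest_and_shortest(lengths: Dict[str, int]) -> Tuple[str, str]:
    """Return the IDs of the longest and the shortest sequence."""
    longest = max(lengths, key=lengths.get)
    shortest = min(lengths, key=lengths.get)
    return longest, shortest


# Made-up example records; with a real file, pass an open file handle instead.
example = """>seq1 hypothetical flagellin
MAQVINTNSL
SLLTQ
>seq2
MKK
""".splitlines()

lengths = fasta_lengths(example)
for seq_id, length in lengths.items():
    print(f"{seq_id} - {length}")
```

To process a file on disk, the same function can be called as `fasta_lengths(open("proteins.fasta"))`, where `proteins.fasta` is a placeholder path.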