Cry phylogeny

The repository with working codes and supplementary materials for the study of Cry toxins evolution

Figures

Figures from the main text are available in the pics/ directory. For the description, please consult the Results section of the article.

Supplementary

All supplementary material is located in the supplementary/ directory. Supplementry_figures_and_text.docx is a Microsoft Word document with a description of all supplementary figures and supplemental text of the article. Supplementary_tables.xlsx is an Exel table with all supplementary tables.

The supplementary/pics/ folder contains individual supplementary figures in various formats, namely SVG, PDF, and PNG.
For convenience, in the supplementary/tables/ folder individual supporting tables in CSV and XLSX formats are provided. For a detailed description of the tables, please consult the Supplementary_tables.xlsx file.

Data

Analyzed data are included in the data/ directory. The directory contains sequences and phylogenetic trees. Other essential tables obtained when using data processing scripts are presented in the data_for_scripts/ directory.

Sequences

The data/fastas directory contains nucleotide and protein sequences, both full and domain-wise. The raw set of 3-D Cry protein sequences is provided in the source_Cry_proteins.fasta file. The file contains full-length protein sequences inferred from four datasets: BPPRC (Bacterial Pesticidal Protein Resource Center), NCBI IPG database (Bacillus thuringiensis), 467 Bt assemblies from NCBI Assembly, and NCBI Genbank (bacteria non-redundant). Three-domain sequences were obtained with CryProcessor in the “-do” mode. The presented set of sequences is redundant and was further deduplicated.

The directory data/fastas/all_sequences includes non-redundant set from the source_Cry_proteins.fasta file. The duplicates with a 100% identity threshold are discarded. Cry toxins absent in the natural strains (presented only in patents) and subjected to artificial mutagenesis are also excluded;
The directory data/fastas/ref_sequences includes reference sequences obtained by applying CD-HIT software with a 95% identity threshold on the non-redundant filtered set of sequences in the all_sequences directory. All the files in two subdirectories follow the same rules for prefixes and postfixes:
The nucl_ prefix corresponds to nucleotide sequences;
The prot_ prefix marks protein sequences;
Sequences of individual domains are marked with the _domain postfix;
The _full postfix indicates that the sequences contain all regions, including N- and C-terminal ones;
The _ processed postfix indicates that the sequences contain only three main functional domains and are cropped from the beginning of the first domain to the end of the third domain accordingly.

Phylogenetic trees

The data/phylogeny directory includes reconstructed phylogenies and multiple sequence alignments used for building phylogenetic trees. The directory contains the following subdirectories and files:

The phylogeny/data subdirectory comprises source files used for phylogeny reconstruction:

The phylogeny/data/concat_msa.fasta file includes concatenated domain-wise alignments;
The phylogeny/data/concat_msa_trimmed.fasta file is the source data for reconstructing reference phylogenetic tree derived from concat_msa.fasta using the trimAl utility;
The phylogeny/data/concat_msa.fasta file is a set of partitions for trimmed sequences of the individual domains within the concatenated alignment;
The phylogeny/data/domains_msa subdirectory includes both raw and trimmed alignments of the nucleotide sequences of individual domains;

The phylogeny/init_trees directory contains both domain-wise trees and the tree inferred from the concatenated partitioned alignment;
The phylogeny/collapsed_trees comprises domain-wise phylogenetic trees corrected for recombination events with parents and recombinants grouped in case of multifurcation. A detailed description of the filtering procedure is presented in the Methods section of the article.

Data for scripts

For better reproducibility, the data needed for obtaining all the results described in the article is provided in the data_for_scripts/ directory. The subdirectories are divided according to the subsections in the Results section of the article. The directory data_for_scripts/for_viz contains source data required for producing figures in the main text and supplementary material. The presented data could be generated by scripts in the scripts/ directory.

Data description

In this part of the study, Cry sequences’ properties were analyzed. The following characteristics encompassed the number of putative novel toxins, the lengths, pair-wise identities, and conservation of the sequences as well as the patterns of clusterisations and the distribution of species encoding cry genes. The respective data_for_scripts/general_props directory contains the following files and subdirectories.

In the general_props/domain_sequences directory the following nucleotide and protein sequences of the individual domains are presented:
- domain_sequences/all_seqs_domains - nucleotide domain sequences form the deduplicated dataset with a 100% identity threshold;
- domain_sequences/unique_domain_sequences - nucleotide domain sequences in the above-described group after domain-wise deduplication;
- domain_sequences/ref_nucleotides - nucleotide domain sequences of the reference clusters obtained by applying CD-HIT software with a 95% identity threshold;
- domain_sequences/ref_proteins - protein sequences of the domains for the above-mentioned reference clusters;
In general_props/sequences_with_patents the processed nucleotide sequences (cropped from the beginning of the first domain to the end of the third domain accordingly) for the sets of sequences with 100% and 95% identity thresholds are present. Unlike other files, the respective sequences include toxins from patents only, including those subjected to artificial mutagenesis. These sequences were discarded for further analysis;
In the general_props/multiclusters directory the following domain-wie nucleotide and protein sequences of the clusters obtained by applying CD-HIT software with a 95% identity threshold with more than 10 toxins are presented:
- nucl - nucleotide domain sequences for each cluster;
- prot - nucleotide domain sequences for each cluster;
- msa_prot - alignments of the domain-wise protein sequences for each cluster;
- nucl - codon-wise alignments of the domain-wise nucleotide sequences for each cluster;
all_prot_95_full_names.fasta.clstr - the results of CD-HIT software with clustering patterns applied on full protein Cry sequences with a 95% identity threshold;
all_orfs_processed_nucl.fasta - nucleotide processed (cropped from the beginning of the first domain to the end of the third domain) sequences form the deduplicated dataset clustered with a 100% identity threshold;
ref_nucl_processed.fasta - the set of processed nucleotide sequences for reference clusters obtained with a 95% identity threshold;
all_sequences_nucl.aln.fasta - multiple sequence alignments of the processed nucleotide sequences from the deduplicated dataset with a 100% identity threshold applied;
no_pat_cd_hit_clusters.tsv - descriptive characteristics of the CD-HIT clusters with a 95% identity threshold (the names of the reference sequence, the number of toxins in the cluster, both raw and unique according to processed nucleotide sequences). Cry toxins absent in the natural strains (presented only in patents) and subjected to artificial mutagenesis are discarded;
bt_nomenclature_table.csv - the names of known Cry proteins with the respective accession numbers (nucleotide and proteins) from BPPRC (Bacterial Pesticidal Protein Resource Center);
merged.filtered_events.s70.l3.csv - the properties of filtered recombination events. The table here is needed for the annotate_clusters_consistency.py script to assess the presence of sequences from patents in recombination events;
clusters_table.csv - the composition of reference clusters with a 95% identity threshold applied;
all_annotations.tsv - the data from the IPG database (genome assemblies, species, strain, protein, and nucleotide accessions) for each Cry sequence in the deduplicated dataset with a 100% identity threshold for clusterization;
mearged_length_and_id.csv - the lengths and mean pair-wise identity for individual domains in deduplicated datasets with 100% and 95% identity thresholds, respectively;
pairs_identity_no_pat.tsv - domain-wise pair-wise comparisons of the domain sequences within the reference dataset with a 95% identity threshold applied;
full_contat.nwk - the reference phylogenetic tree inferred from the concatenated partitioned alignment.

Recombination detection

The following section includes source files used for inferencing and characterizing recombination events, including the filtration procedure, comparing identities between parents and recombinants, reconstructing the recombination graph, etc.

all_domains - nucleotide domain sequences form the deduplicated dataset with a 100% identity threshold;
ref_domains - nucleotide domain sequences of the reference dataset with a 95% identity threshold for clustering;
no_support_trees - domain-wise phylogenetic trees and the tree based on the concatenated partitioned alignment without supporting values and branch lengths;
parsed_trees - the node-wise structure of phylogenetic trees. In the respective tables, the content of toxins, depth level, and the number of sequences for each node are presented;
collapsed_trees - domain-wise phylogenetic trees corrected for recombination events with parents and recombinants grouped in case of multifurcation. A detailed description of the filtering procedure is presented in the Methods section of the article;
domain_mapings.bed - coordinates of the domain mappings of the processed sequences (cropped from the beginning of the first domain to the end of the third domain) of the reference dataset with a 95% identity threshold applied;
full_contat.nwk - the reference phylogenetic tree inferred from the concatenated partitioned alignment;
RDP_raw_signals.csv - the output of the RDP software used for detecting recombination events for the alignment of processed nucleotide sequences;
merged.filtered_events.s70.l3.csv - the characteristics of recombination events inferred from the RDP tool. Presented are the lists of parents and recombinant, coordinates of the breakpoints, p-value levels of detection tests, and pair-wise identity of the domain sequences of parents and children. The list of the events is filtered based on congruence in phylogenetic trees. A detailed scheme for the filtration procedure is present in the Methods section of the article;
pairs_identity_no_pat.tsv - domain-wise pair-wise comparisons of the domain sequences within the reference dataset with a 95% identity threshold applied;
unique_multiclusters.tsv - the domain-wise lists of sequences within the reference clusters obtained using a 95% identity threshold. Unique nucleotide sequences for each domain are selected.
data_for_filtration is a folder containing raw recombination detection predictions coupled with all intermediate and final results of the filtering procedure. The directory includes the following files:
- Trees with the ML prefix represent unamended initial phylogenies based on the sequences of individual domains;
- Files with the ml_dom prefix are domain-wise trees with collapsed branches with a 70% bootstrap threshold specified;
- The filtered.recomb.raxml postfix points on the trees of the individual domains with collapsed branches followed by rearrangement of the leaves through combining the leaves with parents and children in an event-wise way;
- Phylogenies with the zero_fixed infix are the aforementioned transformed trees corrected for erroneous branches with zero length occurring after launching the recomb2tree.py script;
- RDP_pre_filtered_fixed.tsv is a parsed RDP4-generated table with the events classified according to the transferred domain devoid of exchanges putatively caused by mechanisms other than the recombination process;
- rdp.unpivot.csv - the transformed RDP4-based table with the new numeration of recombination events relative to children;
- recomb.unpivot.filtered.s70.l3.csv - the list of corrected recombination events in which the sets of parents are either increased or reduced according to the clades containing children and parents within the domain-wise trees;
- recomb.unpivot.filtered_events.s70.l3.csv - the table with filtered recombination events according to the congruence of children and parents in the context of phylogenetic relationships devoid of those in which the size of clades exceeds the depth threshold of three steps;
- merged.filtered_events.s70.l3.csv - the table of the events formatted as the initial list presented in the RDP_pre_filtered_fixed.tsv file devoid of incongruent events and with novel columns that include parents transferring a single domain for each phylogeny;
- filtering.log - the log file tracing the number of events excluded during the filtering procedure.

Analysis of recombination mechanisms

In this section, putative mechanisms of recombination were analyzed. Multiple approaches were applied, namely, assessing the overall homologous recombination rate in genome assemblies with loci coding for Cry toxins, studying the genomic context of cry genes, and comparing sequence identity between parents in regions surrounding recombination breakpoints. The content of the data_for_scripts/mechanisms directory goes as follows:

MGE_beds is a directory containing coordinates in the bed format of MGEs (mobile genetic elements) and cry genes within Bacillus thuringiensis genome assemblies. The directory has the following substructure:
- MGE_beds/recs - coordinates of ‘cry’ genes encoding toxins subjected to recombination-driven domain exchanges;
- MGE_beds/all_crys - coordinates of all ‘cry’ genes;
- MGE_beds/CRISPRs - coordinates of CRISPR loci;
- MGE_beds/GIs - coordinates of genomic islands;
- MGE_beds/IS - coordinates of insertion sequences. Each directory except for ‘recs’ and ‘all_crys’ includes source, crys_inter, and recs_inter subdirectories with the coordinates of genomic features and their intersections with loci encoding Cry toxins and toxins subjected to recombination, respectively.
asmbl_types.csv - the classification of genome assemblies according to the presence of cry genes and cry loci encoding toxins with recombination signals;
bt_assemblies.tsv - characteristics of genome Bt genome assemblies, including level, serovar (if present in the name of the organism), and links to the FTP repository for downloading;
cry_asmbl_stat.tsv - genomic data for cry loci in the Bt assemblies, namely, accession number, protein name, identity with the closest homolog from the BPPRC database, genomic coordinates, strand orientation, and assembly level;
Mappings_for_breakpoints_de_novo_search.csv - coordinates of breakpoints for each parent and recombinant within recombination events as well as the coordinates of the domains. The coordinates are presented both for full and processed (cropped from the beginning of the first domain to the end of the third domain) nucleotide Cry sequences;
merged.filtered_events.s70.l3.csv - the filtered set of recombination events inferred from the RDP tool based on the congruence in phylogenetic trees. A detailed scheme for the filtration procedure is present in the Methods section of the article;
no_pat_cd_hit_clusters.tsv - descriptive characteristics of the CD-HIT clusters with a 95% identity threshold (the names of the reference sequence, the number of toxins in the cluster, both raw and unique according to processed nucleotide sequences). Cry toxins absent in the natural strains (presented only in patents) and subjected to artificial mutagenesis are discarded;
RDP_pre_filtered_fixed.tsv - properties pf recombination events obtained from the RDP tool with the first step of filtration (events that are marked as dubious are discarded);
RDP_raw_signals.csv - the output of the RDP software used for detecting recombination events for the alignment of processed nucleotide sequences;
ref_domains - nucleotide sequences of the domains for reference representative of the clusters;
Reference_domain_mapppings_full.bed - coordinates of the domains in full nucleotide sequences in the reference dataset;
Reference_domain_mapppings_processed.bed - coordinates of the domains in processed nucleotide sequences in the reference dataset;
ref_all_processed_nucl_aln.fasta - multiple sequence alignments of the processed nucleotide sequences from the reference dataset;
ref_full_nucl_corrected.fasta - full nucleotide sequences from the reference dataset;
ref_nucl_processed.fasta - processed nucleotide sequences from the reference dataset.

Studying the impact of recombination on host specificity

The code in this directory was used to reveal the consequences of recombination events on the list of affected host species. In addition, the differences in the composition of toxin-encoding genes in parents and recombinants are analyzed. The directory contains the following files:

Cry_asmbl_stat.csv - genomic data for cry loci in the Bt assemblies, namely, accession number, protein name, identity with the closest homolog from the BPPRC database, genomic coordinates, strand orientation, and assembly level;
Cry_strains_fixed.csv - the list of Cry toxins presented in different strains;
merged.filtered_events.s70.l3.csv - the filtered set of recombination events inferred from the RDP tool based on the congruence in phylogenetic trees. A detailed scheme for the filtration procedure is present in the Methods section of the article;
pairs_identity_no_pat.tsv - domain-wise pair-wise comparisons of the domain sequences within the reference dataset with a 95% identity threshold applied;
Plasmids_chromosomes_per_accession.csv - genomic regions encoding cry loci in Bt assemblies of complete and chromosome levels. Presented are accession numbers of Cry toxins, genomic contigs, and the type of the contig;
Ref_cry_prots_clust.csv - the list of reference Cry toxins obtained by clusterization with a 95% identity threshold;
species_to_orders.csv - the correspondence between affected host species and their orders;
toxicity_mearged_fixed_non_redundant.csv - toxicity spectrum for Cry toxins and bacterial strains. In the case of toxins, the activity is presented following the BPPRC nomenclature ranks.

Revealing relationships between recombination and evolutionary selection

In this section, the effect of domain exchanges on evolutionary selection was analyzed. The evolutionary selection signals were revealed using three models, namely, branch, site, and branch-site. The directory includes the following files and subdirectories:

calculated_evol - results of evolutionary selection analysis using the ete3 utility. For each event, sequences of the individual domains were analyzed separately. The subset of sequences for calculating background are inferred from domain-wise phylogenetic trees. The subdirectories are named according to the parent types, i.e., monir parents start with the minor prefix, while major parents – major followed by the number based on the ascending order of the domains. For a detailed description of the procedure for preparing the data, consult the Methods section of the article;
collapsed_trees - domain-wise phylogenetic trees corrected for recombination events with parents and recombinants grouped in case of multifurcation. A detailed description of the filtering procedure is presented in the Methods section of the article;
evol_preparation_data.tsv - the properties of the data used for domain-wise calculations of evolutionary selection signals. Presented are the lists of parents, the number of the respective toxins, and the number of toxins in the background subtree;
Hosts_for_heatmap_stat_per_species.csv - toxin-wise attribution of the affected host orders for each recombination event. The data is given for toxins following the BPPRC nomenclature ranks and/or bacterial strains;
merged.filtered_events.s70.l3.csv - the filtered set of recombination events inferred from the RDP tool based on the congruence in phylogenetic trees. A detailed scheme for the filtration procedure is present in the Methods section of the article;
nucl_domains - nucleotide sequences of the individual domains for the non-redundant set of Cry toxins with a 100% identity threshold for clusterization;
pairs_identity_no_pat.tsv - domain-wise pair-wise comparisons of the domain sequences within the reference dataset with a 95% identity threshold applied;
prot_domains - protein sequences of the individual domains for the non-redundant set of Cry toxins with a 100% identity threshold for clusterization;
selection_results_summary_with_relaxed.csv - the overall summary of revealed evolutionary selection signals using all three models, namely, branch, site, and branch-site. Presented are lists of parents, transferred domains, and domains for which the estimates were obtained, LRT (log-likelihood ratio test) when comparing evolutionary models, omega values for background and foreground, and coordinates of the sites with significant signals;
Simpson_stat_for_events.csv - the overlap between the lists of affected hots (species and orders) between parents and recombinants for each recombination event;
unique_multiclusters.tsv - the domain-wie lists of sequences within the reference clusters obtained using a 95% identity threshold. Unique nucleotide sequences for each domain are selected.

Vizualizing results

The directory includes all files required for obtaining figures (both main and supplementary) presented in the article except for recombination and strain graphs which were generated by Python scripts. The subdirectories follow the same structure as the aforementioned data for processing.

Data description (the descriptive_stat directory):
- The files with the id_ prefix contain the results of pair-wise comparisons of the domain sequences for the deduplicated dataset and reference dataset obtained by clusterization with 100% (all_ interfix) and 95% (ref _ interfix) identity thresholds, respectively;
- The files with the cons_scores postfix include coordinate-wise conservation estimates (the proportion of the most frequent symbol in the alignment) for the deduplicated dataset (all_ prefix) and reference dataset (ref_ prefix);
- The files with the NOV_ids postfix contain identity estimates with the closest known homolog from the BPPRC database for novel putative Cry toxins within the deduplicated dataset (all_ prefix) and reference dataset (ref_ prefix);
- clusters_identity.csv - pair-wise identity scores for domain sequences within reference CD-HIT clusters obtained using a 95% identity threshold;
- Clusters_len_and_ids.csv - the composition of reference clusters with a 95% identity threshold applied;
- lengths_combined.csv - lengths of individual domains for nucleotide sequences of Cry toxins in the deduplicated and reference datasets;
- mearged_domain_props.csv - lengths, mean-pair-wise identity estimates, and species attributions for Cry toxins in the deduplicated and reference datasets;
- multiclusters_id_per_domain.csv - pair-wise identity between domain sequences in reference clusters containing more than 10 toxins;
- multiclusters_mismacth_sites.csv - domain-wise relative coordinates of substitutions (synonymous and non-synonymous) and indels in the alignments within reference clusters containing more than 10 toxins.
Phylogeny reconstruction (the phylogeny directory):
- The files with the msa postfix contain nucleotide sequence alignments of individual domains and the concatenated alignment of all domains (the All_ prefix);
- The files with the _heatmap postfix are pair-wise identity comparisons of nucleotide sequences of the domains arranged according to the order of toxins in the reference phylogeny inferred from the concatenated partitioned alignment;
- The files with the ML_ are reconstructed phylogenies obtained from trimmed nucleotide alignments of individual domains or the concatenated partitioned alignment (with the Full_ interfix);
- no_support - domain-wise phylogenetic trees and the tree based on the concatenated partitioned alignment without supporting values and branch lengths.
Recombination detection (the recombination directory):
- base_for_circus_plot.tsv - the order and colors of sectors in the Circos plot arranged by the order of the reference phylogeny obtained from the concatenated partitioned alignment;
- type_toxins_by_events.csv - the data on the types of sectors (toxins) based on their role in recombination events (parents, recombinants, or both);
- links_for_circus_plot.tsv - the links in the Circos plot linking parents with recombinants colored according to transferred domains;
- Comparisions_between_parents_filtered_reshaped.csv - the number of recombinants and parents in the sets of recombination events obtained using the RDP4 software. The data includes raw events and those filtered based on the transferred region length as well as the congruence in domain-wise phylogenetic trees. A detailed description of the filtering procedure is presented in the Methods section of the article;
- Comparisions_between_number_of_recs_per_filtration_reshaped.csv - the number of parents and recombinants in the final set of recombination events before and after applying filtration;
- fastgear_events.csv - the output of the fastGEAR software with the list of predicted recombination events;
- fastgear_res_per_coords_all.tsv - coordinate-wise description of sites according to fastGEAR predictions for visualizing regions with recombination signals;
- The files with the events_by_domains postfix contain the list of transferred domains for raw (raw_ prefix) recombination events, those filtered by the length (flterd_ prefix) and congruence on domain-wise phylogenetic trees (unpivot_ prefix);
- The files with the recomb_manhattan prefix contain coordinates of the breakpoints with p-values of recombination detection tests for raw (_with_partials postfix) recombination events, those filtered by the length (_filtered postfix) and congruence on domain-wise phylogenetic trees (filtered_unpivot postfix);
- Identity_per_dataset.csv - pair-wise comparisons between domain sequences of parents and recombinant in raw and filtered sets of recombination events;
- Mismathces_with_len.csv - the number of mismatches and the length of the alignment when comparing domain sequences of parents and recombinants in the final filtered set of recombination events;
- ML_Full_nuc.raxml.support.nwk - the reference phylogenetic tree inferred from the concatenated partitioned alignment;
- Num_recs_per_event_for_heatmap.csv - the number of recombinants and parents in raw and filtered sets of recombination events;
- Num_rec_filtered_heatmap.csv - the number of recombinants and parents in the final filtered set of recombination events;
- Rec_components_stat_long.csv - the properties of connected components in the graph of recombination events, including the number of nodes, edges, and recombination events in each component.
Analysis of recombination mechanisms (the mechanisms directory):
- The files with the presence postfix are presence/absence patterns of orthologous gene clusters in all analyzed Bt assemblies (all_ prefix) and those containing cry genes (cry_ prefix or loci encoding Cry toxins subjected to recombination (rec_ prefix);
- clusters_assignments_final.csv - the assignment of recombination IDs to clusters obtained when using hierarchical clusterization with a cosine distance metrics on the vectors of coordinate-wise identity between parents in the regions surrounding breakpoints. For a detailed description of the clusterization, consult the Methods section of the article;
- flanks_props_for_model.csv - identity between parents in regions flanking recombination breakpoints as well as the identity between transferred and non-transferred domains;
- MGES_num_summary_non_redundant.csv - the number of MGEs (mobile genetic elements), namely, prophages, genomic islands and insertions, in Bt assemblies containing cry loci or devoid of them;
- Parents_only_Distribution_of_flank_identities_merged.csv - coordinate-wise identity between sequences of parents in regions surrounding recombination breakpoints.
Impact of recombination of host specificity (the specificity directory):
- Components_heatmap_only_annotated.csv - the number of toxicity assays per toxins and strains reduced to the orders in the connected components of the strain graph containing parents and/of recombinants from certain recombination events;
- Graph_components_per_event_rank3_long.csv - the overlap coefficients in the composition of toxins in connected components of the strain graph containing parents and/of recombinants from certain recombination events;
- host_num_per_components.csv - properties of the connected components of the strain graph, including the number of toxins, strains, affected species, and recombination events;
- Hosts_for_heatmap_stat_per_species.csv - the distribution of affected host orders for parents and recombinants in recombination events. The toxicity data is given for toxins following the BPPRC nomenclature ranks and/or bacterial strains containing the respective toxins;
- Identity_per_dataset_with_unknowns.csv - pair-wise comparisons between domain sequences of parents and recombinant in the filtered set of recombination events;
- Num_common_tox_pars_recs.csv - the overlap coefficients in the composition of toxins in strains containing recombinants and parents within recombination events;
- num_hosts_per_dataset_in_toxin_group.csv - the number of affected species and orders in Bt genome assemblies and strains regarding the types of toxins they contain (recombinants, parents, both, or unaffected by recombination);
- num_hosts_per_toxin_group.csv - the number of affected species and orders for different types of toxins (recombinants, parents, both, or unaffected by recombination);
- Num_tox_for_assemblies_per_rank.csv - the number of toxins in Bt genome assemblies regarding the BPPRC nomenclature ranks;
- Num_tox_for_strains_per_rank.csv - the number of toxins in bacterial strains regarding the BPPRC nomenclature ranks;
- Num_tox_types_per_dataset_with_both.csv - the number of toxins in in Bt genome assemblies and strains concerning the types of toxins they contain (recombinants, parents, both, or unaffected by recombination);
- Overall_toxicity_stat.csv - the total summary of toxicity assays gathered for the study. The toxicity data is given for toxins following the BPPRC nomenclature ranks and/or bacterial strains containing the respective toxins;
- Overlap_species_components.csv - the overlap between the lists of affected host species and orders for parents and recombinants in recombination events;
- Simpson_stat_for_events.csv - the summary table with the overlap between the lists of affected host species and orders for parents and recombinants in recombination events; the number of toxins, and pairwise identity of the domain sequences between parents and recombinants;
- summary_of_cry_nums_per_dataset.csv - the number of Cry toxins in genome assemblies, contigs, plasmids, chromosomes, and bacterial strains;
- Tox_for_recs_and_refs.csv - the distribution of toxin types (recombinants, parents, both, or unaffected by recombination) in bacterial trains and Bt assemblies;
Evolutionary selection (the evol_selection directory):
- Branch_models_extended.csv - the summary of event-wise estimates of evolutionary selection using branch models. Presented are omega values for background and foreground, the best models chosen, the number of toxins, selection type, etc.;
- Identity_per_dataset_with_unknowns.csv - pair-wise comparisons between domain sequences of parents and recombinant in the filtered set of recombination events;
- Site_models_classified_res.csv - the per-site characteristics of the domain sequences of recombinants and parents using the site and branch-site models. The properties include the best models chosen, significance levels, type of selection in the site, and the probability of selection signals.

Scripts

The scripts/ directory includes all code used for pangenome analysis. The scrips are grouped into two categories, namely, those used for data processing and visualization. These groups are further subdivided following the subsections in the Results section of the article. For convenience, required data are presented in the data_for_script / directory. Here, commands to run the scripts are presented. The description of input files is given above. For convenience, the data presented in the data_for_scripts/ is also divided into subdirectories according to sections in the manuscript. The paths for the input files are given relative to the scripts/ directory. All the input files are described above. Here, only previously undescribed parameters are mentioned.

Data description

annotate_clusters_consistency.py - the script for summarizing clustering patterns obtained from the CD-HIT software. As a result, redundant sequences are reduced, and a summary table is generated. The toxins in the clusters are also checked for their presence in natural isolates by gathering the metadata from the IPG database. The toxins found only in patents were then manually inspected and removed from the analysis in case they were subjected to artificial mutagenesis. For faster access to the database, one can specify the email -m <e-mail> and NCBI API key -ap To run the script use the following command:

python3 annotate_clusters_consistency.py -o <out_dir> -cf ../data_for_scripts/general_props/all_prot_95_full_names.fasta.clstr -r ../data_for_scripts/general_props/merged.filtered_events.s70.l3.csv -rn ../data_for_scripts/general_props/sequences_with_patents/nucl_processed.fasta -cn ../data_for_scripts/general_props/sequences_with_patents/all_orfs_processed_nucl.fasta -ct ../data_for_scripts/general_props/no_pat_cd_hit_clusters.tsv -bt ../data_for_scripts/general_props/bt_nomenclature_table.csv -m <e-mail> -ap <NCBI ap>

fasta_len_count.py - the script for calculating the length of the records in the fasta file. To use it, run the following command:

python3 fasta_len_count.py <fasta_file>

calculte_within_clusters_identity_refined.py - the script for calculating domain-wise pair-wise identity between domain sequences within CD-HIT clusters. Use the following command to run the script:

python3 calculte_within_clusters_identity_refined.py -c ../data_for_scripts/general_props/clusters_table.csv -n ../data_for_scripts/general_props/domain_sequences/unique_domain_sequences

per_domain_clusters_identity.py - the script with the same function as calculte_within_clusters_identity_refined.py but for clusters containing more than 10 toxins. The script could be launched with the following command:

python3 per_domain_clusters_identity.py -n ../data_for_scripts/general_props/all_seqs_domains -cf ../data_for_scripts/general_props/no_pat_cd_hit_clusters.tsv -o <output_dir>

calculate_identity_between_sequence_pairs.py - calculates par-wise identity for records in the inputted fasta file. To run the script, use the following command:

python3 calculate_identity_between_sequence_pairs.py -f <fasta_file> -o <output_dir> -t <num_threads>

summarize_cluster_alignments.py - the script for parsing protein and nucleotide sequence alignments of toxins within the clusters containing more than 10 toxins with subsequent identification of substitutions (synonymous and non-synonymous) and indels. The script also performs the calculation of conservation scores (the proportion of the most frequent non-gap symbol) for an inputted alignment. The script requires the following command:

python3 summarize_cluster_alignments.py -al ../data_for_scripts/general_props/multiclusters -n ../data_for_scripts/general_props/

descriptive_tox_stats_species_ids.py - the script generates a summary table with the properties of the Cry toxins, including domain lengths, mean pair-wise identity with other toxins, identity with the closest homolog from the BPPRC database, and species’ attributions. To use the script, run the following command:

python3 descriptive_tox_stats_species_ids.py -a ../data_for_scripts/general_props/all_annotations.tsv -n ../data_for_scripts/general_props/domain_sequences/all_seqs_domains -p ../data_for_scripts/general_props/mearged_length_and_id.csv -l ../data_for_scripts/general_props/no_pat_cd_hit_clusters.tsv -c ../data_for_scripts/general_props/uniq_proteins_names.txt

make_data_for_heatmap_universal.py - the script performs domain-wise comparisons of domain sequences in the reference dataset of toxins obtained with a 95% identity threshold for clusterization. The script could be launched with the following command:

python3 make_data_for_heatmap_universal.py -o <output_dir> -p ../data_for_scripts/general_props/domain_sequences/ref_proteins -n /data_for_scripts/general_props/domain_sequences/ref_nucleotides

create_identity_heatmap.py - the script for obtaining the matrix with pair-wise similarities between sequences in the order corresponding to the reference phylogeny based on concatenated partitioned alignment. The elected domain is set after the -d parameter. To run the script, use the following command:

python3 create_identity_heatmap.py -i ../data_for_scripts/general_props/pairs_identity_no_pat.tsv -t ../data_for_scripts/general_props/full_contat.nwk -d <domain> -o <output_name>

Phylogeny reconstruction

calculate_mean_support.py - the script calculates mean support values for all branches in the inputted phylogenetic tree. The script is run with the following command:

python3 calculate_mean_support.py -t <tree_file>

Recombination detection

parse_tree.py - the script for transforming a phylogenetic tree into a table with the content of subtrees of a certain node. The script is launched with the following command:

python3 parse_tree.py <tree_file>

RDP_filter_combined.py - the script for processing the raw table with predicted recombination events using the RDP4 tool. The script deletes events marked as dubious and creates a summary table with the list of parents and recombinants, mean pair-wise identity between them, coordinates of the domains and breakpoints, the total p-value for detection tests, and the domain transferred from the minor parent. The transferred domain is determined by intersecting the coordinates of the breakpoints and domain mappings for the recombinant. The script also generates coordinate-wise correspondence of the breakpoints to p-values for visualization. To run the script, use the following command:

python3 RDP_filter_combined.py -r ../data_for_scripts/recombination_detection/RDP_raw_signals.csv -p ../data_for_scripts/recombination_detection/parsed_trees -s ../data_for_scripts/recombination_detection/pairs_identity_no_pat.tsv -m ../data_for_scripts/recombination_detection/domain_mapings.bed

filter_events_by_id.py - the script for filtering events according to the pair-wise identity between parents and recombinants. Events in which the identity between non-transferred domains is higher than the transferred one are discarded. The script is launched with the following command:

python3 filter_events_by_id.py -r ../data_for_scripts/recombination_detection/merged.filtered_events.s70.l3.csv -o <output_dir>

The recombination_correction.sh script performs all the necessary steps to filter the recombination events according to the phylogenetic congruence of parents and children (see the Methods section in the Supplementry_figures_and_text.docx file for extra details). The script encompasses seven scripts given below. To run the filtration pipeline, simply run the following command:

bash recombination_correction.sh

python3 filter_tree.py - the script for collapsing branches of the phylogenetic tree with a given threshold of the bootstrap support specified in the -s parameter. To run the script, use the following command:

python3 filter_tree.py -t <tree_file> -s <bootstrap support threshold> -o <output>

python3 recomb2tree.py - the script corrects the domain-wise phylogenetic tree in the context of recombination events. The algorithm implies collapsing branches with a certain bootstrap threshold (the -s flag) for a certain domain specified in the -d parameter. The script requires the following command:

python3 recomb2tree.py -r ../data_for_scripts/recombination_detection/data_for_filtration/RDP_pre_filtered_fixed.tsv -t <tree_file> -s <bootstrap support threshold> -d <domain index> -o <output name>

python3 unpivot_table.py is the script to transform the parsed RDP4-based table RDP_pre_filtered_fixed.tsv by swapping the rows and columns and re-naming the events according to the recombination children. The script is launched as given below:

python3 unpivot_table.py -r ../data_for_scripts/recombination_detection/data_for_filtration/RDP_pre_filtered_fixed.tsv -o ../data_for_scripts/recombination_detection/data_for_filtration/rdp.unpivot.csv

python3 filter_parents.py - updates the events from in the transformed table by selecting the clades in the recombination-wise corrected phylogenetic trees per each domain. Goes to the upper nodes with a given threshold (the -l parameter) for each set of children and parents. To run the script, use the command below:

python3 filter_parents.py -r ../data_for_scripts/recombination_detection/data_for_filtration/rdp.unpivot.csv -t ../data_for_scripts/recombination_detection/data_for_filtration/dom1.filtered.recomb.zero_fixed.raxml -t ../data_for_scripts/recombination_detection/data_for_filtration/dom2.filtered.recomb.zero_fixed.raxml -t ../data_for_scripts/recombination_detection/data_for_filtration/dom3.filtered.recomb.zero_fixed.raxml -s <bootstrap support threshold> -l <branch depth> -o ../data_for_scripts/recombination_detection/data_for_filtration/recomb.unpivot.filtered.s70.l3.csv

filter_events.py carries out the final filtration of the events according to the phylogenetic congruence of the parents and children per each domain-wise tree. The script takes the above-mentioned table with the updated list of parents per each event and discards the exchanges in which the respective clades that union parents and recombinants are placed in the root of the tree. The script is launched with the following command:

python3 filter_events.py -i ../data_for_scripts/recombination_detection/data_for_filtration/recomb.unpivot.filtered.s70.l3.csv -o ../data_for_scripts/recombination_detection/data_for_filtration/recomb.unpivot.filtered_events.s70.l3.csv

merge_filtered_with_original.py - the script for merging the results of recombination events filtration. The script takes the set of events from the original RDP4-based table that were retained after the filtration procedure. Apply the command below to run the script:

python3 merge_filtered_with_original.py -i ../data_for_scripts/recombination_detection/data_for_filtration/RDP_pre_filtered_fixed.tsv -f ../data_for_scripts/recombination_detection/data_for_filtration/recomb.unpivot.filtered_events.s70.l3.csv -o ../data_for_scripts/recombination_detection/data_for_filtration/merged.filtered_events.s70.l3.csv

analyze_rec_events_refined.py - the script for summarizing the properties of recombination events by calculating domain-wise identity between parents and recombinants as well as the number of toxins, unknown parents, and mismatches between the sequences of parents and recombinants. To run the script, use the following command:

python3 analyze_rec_events_refined.py -c ../data_for_scripts/recombination_detection/unique_multiclusters.tsv -an ../data_for_scripts/recombination_detection/all_domains -r ../data_for_scripts/recombination_detection/merged.filtered_events.s70.l3.csv -rn ../data_for_scripts/recombination_detection/ref_domains -o <output_dir>

predict_unknown_parents.py - the script for predicting the number of toxins possibly absent in the analyzed dataset based on the proportion of branches with recombinants and parents in domain-wise phylogenetic trees. The script could be run with the following command:

python3 predict_unknown_parents.py -r ../data_for_scripts/recombination_detection/merged.filtered_events.s70.l3.csv -t ../data_for_scripts/recombination_detection/collapsed_trees

build_recombination_graph.py - the script for reconstructing a directed graph with recombination exchanges. In the resulting graph, nodes correspond to toxins while edges represent transferred domains. The summary of connected components, including the number of toxins and events of a certain type, is also generated. In the results, the graph is visualized as well with the color of edges representing transferred domains and nodes illustrating the type of toxins (recombinants, parents, or both). The script is launched with the following command:

python3 build_recombination_graph.py -r ../data_for_scripts/recombination_detection/merged.filtered_events.s70.l3.csv -t ../data_for_scripts/recombination_detection/full_contat.nwk

Analysis of recombination mechanisms

clean_from_amb_sites.py - the script performs processing of the inputted fasta file by replacing ambiguous bases with the most frequent base to further run the ClonalFrameML software. To run the script, use the following command:

python3 clean_from_amb_sites.py -f <fasta_file>

summarize_MGES_in_genomes.py - the script generates a summary table with the number of MGEs (mobile genetic elements) for Bt assemblies by gathering coordinates of the elements and ‘cry’ loci. The script is run with the following command:

python3 summarize_MGES_in_genomes.py -r ../data_for_scripts/mechanisms/merged.filtered_events.s70.l3.csv -a ../data_for_scripts/mechanisms/asmbl_types.csv -m ../data_for_scripts/mechanisms/MGE_beds -c ../data_for_scripts/mechanisms/MGE_beds/all_crys -cl ../data_for_scripts/mechanisms/no_pat_cd_hit_clusters.tsv

get_breakpoints_from_alignments.py - the script extracts parental sequences in the regions surrounding recombination breakpoints. The coordinates are revealed by finding a substring of the recombinant sequence in the alignment of all parents with recombinants in the event. The script generates coordinates and contains functions applied in the code. To launch the script, use the following command:

python3 get_breakpoints_from_alignments.py -c ../data_for_scripts/mechanisms/cry_asmbl_stat.tsv -a ../data_for_scripts/mechanisms/bt_assemblies.tsv -an ../data_for_scripts/mechanisms/ref_full_nucl_corrected.fasta -pn ../data_for_scripts/mechanisms/ref_nucl_processed.fasta -rt ../data_for_scripts/mechanisms/merged.filtered_events.s70.l3.csv -ad ../data_for_scripts/mechanisms/ref_domains -fb ../data_for_scripts/mechanisms/Reference_domain_mapppings_full.bed -pb ../data_for_scripts/mechanisms/Reference_domain_mapppings_processed.bed

get_breakpoints_from_RDP.py - the script extracts regions flanking recombination breakpoints by extracting the sequences from the alignment of the processed sequences based on alignment-wise coordinates provided by the RDP4 utility. The extracted parental sequences are then aligned with MAFFT. The script provides the table with coordinate-wise identity estimates in the obtained alignments of parents. The script could be run with the following command:

python3 get_breakpoints_from_RDP.py -na ../data_for_scripts/mechanisms/ref_all_processed_nucl_aln.fasta -fn ../data_for_scripts/mechanisms/ref_full_nucl_corrected.fasta -rt ../data_for_scripts/mechanisms/RDP_raw_signals.csv -rf ../data_for_scripts/mechanisms/merged.filtered_events.s70.l3.csv -ad <output_directory> -bt ../data_for_scripts/mechanisms/Mappings_for_breakpoints_de_novo_search.csv -rc ../data_for_scripts/mechanisms/RDP_pre_filtered_fixed.tsv -nd ../data_for_scripts/mechanisms/ref_domains

Impact of recombination of host specificity

overall_toxicity_summary.py - the scripts summarize the total number of assays for bacterial strains and Cry toxins regarding the BRRPC nomenclature. The results of the script also include the table with the toxicity spectrum of the strains/genome assemblies containing toxins of particular types (recombinants, parents, both, and those unaffected by recombination) as well as the number of hosts for the toxins regarding the aforementioned type. To use the script, run the following command:

python3 overall_toxicity_summary.py -t ../data_for_scripts/specificity/toxicity_mearged_fixed_non_redundant.csv -r ../data_for_scripts/specificity/merged.filtered_events.s70.l3.csv -s ../data_for_scripts/specificity/Cry_strains_fixed.csv -f ../data_for_scripts/specificity/Ref_cry_prots_clust.csv -a ../data_for_scripts/specificity/Cry_asmbl_stat.csv -p ../data_for_scripts/specificity/Plasmids_chromosomes_per_accession.csv

cry_combinations_per_strain_and_assembly.py - the script performs comparisons between the sets of toxins strains/genome assemblies containing parent and recombinants. The script could be run with the following command:

python3 cry_combinations_per_strain_and_assembly.py -r ../data_for_scripts/specificity/merged.filtered_events.s70.l3.csv -s ../data_for_scripts/specificity/toxicity_mearged_fixed_non_redundant.csv -t ../data_for_scripts/specificity/Cry_strains_fixed.csv -p ../data_for_scripts/specificity/Plasmids_chromosomes_per_accession.csv -f ../data_for_scripts/specificity/Ref_cry_prots_clust.csv -a ../data_for_scripts/specificity/Cry_asmbl_stat.csv

build_strains_graph.py - the script reconstructs and visualizes a weighted undirected graph with strains corresponding to nodes and edges reflecting the composition of toxins shared between the strains. The color of the nodes illustrates the composition of affected host orders for toxins within the strain. The script also generates a summary table with the properties of connected components, namely, the number of toxins, strains, affected host species/orders, and recombination events. To use the script, run the following command:

python3 build_strains_graph.py -r ../data_for_scripts/specificity/merged.filtered_events.s70.l3.csv -s ../data_for_scripts/specificity/toxicity_mearged_fixed_non_redundant.csv -p ../data_for_scripts/specificity/Cry_strains_fixed.csv -o ../data_for_scripts/specificity/species_to_orders.csv

count_host_changes.py - the script performs comparisons between the sets of affected species and orders of recombinants and parents as well as strains and genome assemblies containing them. The script is launched with the following command.

python3 count_host_changes.py -r ../data_for_scripts/specificity/merged.filtered_events.s70.l3.csv -s ../data_for_scripts/specificity/toxicity_mearged_fixed_non_redundant.csv -t ../data_for_scripts/specificity/Cry_strains_fixed.csv -o ../data_for_scripts/specificity/species_to_orders.csv -i ../data_for_scripts/specificity/pairs_identity_no_pat.tsv -l ../data_for_scripts/specificity/Ref_cry_prots_clust.csv

prepare_combined_strain_graph.py - the script reconstructs and visualizes a weighted undirected graph with nodes representing connected components and edges illustrating recombination events between toxins in these components. The color of the edges corresponds to transferred domains, while nodes are colored according to the proportion of affected host orders for toxins and strains included in the component. To use the script, run the following command:

python3 prepare_combined_strain_graph.py -r ../data_for_scripts/specificity/merged.filtered_events.s70.l3.csv -s ../data_for_scripts/specificity/toxicity_mearged_fixed_non_redundant.csv -p ../data_for_scripts/specificity/Cry_strains_fixed.csv -o ../data_for_scripts/specificity/species_to_orders.csv

Evolutionary selection

report_gaps.py - the script generates a codon-wise nucleotide alignment based on nucleotide sequences and protein alignments. The script is run with the following command:

python3 report_gaps.py <nulc_aligment> <prot_alignment>

create_evol_data_on_fixed_recombinants.py - the script prepares the data required for performing evolutionary selection analysis, namely, protein and nucleotide sequences of the toxins in the extracted domain-wise subtrees containing recombinants and parents on one clade. The script also generates the table with the lists of recombinants and parents and the number of them as well as the total number of toxins in the subtrees. For a detailed description of the procedure, consult the Methods section of the article. To use the script, run the following command:

python3 create_evol_data_on_fixed_recombinants.py -r ../data_for_scripts/evol_selection/merged.filtered_events.s70.l3.csv -t ../data_for_scripts/evol_selection/collapsed_trees/domain1_collapsed.nwk -t ../data_for_scripts/evol_selection/collapsed_trees/domain2_collapsed.nwk -t ../data_for_scripts/evol_selection/collapsed_trees/domain3_collapsed.nwk -o evol_preparation_data.tsv -e <output_dir> -n ../data_for_scripts/evol_selection/nucl_domains/domain1.fasta -n ../data_for_scripts/evol_selection/nucl_domains/domain2.fasta -n ../data_for_scripts/evol_selection/nucl_domains/domain3.fasta -p ../data_for_scripts/evol_selection/prot_domains/domain1_prot.fasta -p ../data_for_scripts/evol_selection/prot_domains/domain2_prot.fasta -p ../data_for_scripts/evol_selection/prot_domains/domain3_prot.fasta -c ../data_for_scripts/evol_selection/unique_multiclusters.tsv -hf ../data_for_scripts/evol_selection/pairs_identity_no_pat.tsv

build_trees_for_evol.py - the script is applied to generate guiding phylogenetic trees for revealing signals of evolutionary selection. The trees are generated by creating codon-wise nucleotide alignments from protein alignments, identifying optimal evolutionary models with the ModelTest-NG utility, and reconstructing phylogeny with RAxML-NG. Therefore the paths to these instruments should be provided after the -mg and -rg flags, respectively. The script also requires the path to the directory with sequences obtained with the create_evol_data_on_fixed_recombinants.py script. To use the script, run the following command:

python3 build_trees_for_evol.py -r ../data_for_scripts/evol_selection/evol_preparation_data.tsv -e <evol_directory> -mg <modeltest_path> -rg <raxml-ng_path>

run_ete3_on_filtered_recomb.py - the script launches ete3 software based on the prepared data with guiding trees and alignments generated by create_evol_data_on_fixed_recombinants.py and build_trees_for_evol.py scripts. Three models are tested, namely, site, branch, and branch-site. The script could be run with the following command:

python3 run_ete3_on_filtered_recomb.py -r ../data_for_scripts/evol_selection/evol_preparation_data.tsv -e <evol_directory>

agregate_selection_results.py - the script summarizes the results of ete3 calculations and generates an overall table with the best models, selection type, LRT (log-likelihood ratio test), and the number of positively selected and conservative sites. The script is launched with the following command:

python3 agregate_selection_results.py -r ../data_for_scripts/evol_selection/evol_preparation_data.tsv -e ../data_for_scripts/evol_selection/calculated_evol

parse_selection_sites.py - the script generates coordinate-wise properties of sites obtained with site and branch-site models, namely, the type of the site (conservative or positively selected), the significance level for the best mode, and the probability of the selection signal. To use the script, run the following command:

python3 parse_selection_sites.py -s ../data_for_scripts/evol_selection/selection_results_summary_with_relaxed.csv -e ../data_for_scripts/evol_selection/calculated_evol -t ../data_for_scripts/evol_selection/Simpson_stat_for_events.csv -o ../data_for_scripts/evol_selection/Hosts_for_heatmap_stat_per_species.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Cry phylogeny

Contents

Figures

Supplementary

Data

Sequences

Phylogenetic trees

Data for scripts

Data description

Recombination detection

Analysis of recombination mechanisms

Studying the impact of recombination on host specificity

Revealing relationships between recombination and evolutionary selection

Vizualizing results

Scripts

Data description

Phylogeny reconstruction

Recombination detection

Analysis of recombination mechanisms

Impact of recombination of host specificity

Evolutionary selection

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 28 Commits
data		data
data_for_scripts		data_for_scripts
scripts		scripts
supplementary		supplementary
README.md		README.md

lab7arriam/Cry_phylogeny

Folders and files

Latest commit

History

Repository files navigation

Cry phylogeny

Contents

Figures

Supplementary

Data

Sequences

Phylogenetic trees

Data for scripts

Data description

Recombination detection

Analysis of recombination mechanisms

Studying the impact of recombination on host specificity

Revealing relationships between recombination and evolutionary selection

Vizualizing results

Scripts

Data description

Phylogeny reconstruction

Recombination detection

Analysis of recombination mechanisms

Impact of recombination of host specificity

Evolutionary selection

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages