This project is based on IAGS and automates its workflow.
You can use this tool via conda, or by downloading the source code.
- Create a virtual environment
conda create -n iags_auto python=3.9
- Install with conda
conda install -c gurobi -c conda-forge -c huntguo iags_auto
- Or download the source code
wget https://codeload.github.com/99gloom/IAGS_AUTO/zip/refs/heads/main
- Install mono to run DRIMM
sudo apt install mono-devel
Note: Whichever installation method you use, you need to activate Gurobi afterwards. We provide a help document to help you obtain the license.
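As a quick sanity check after installation, the sketch below reports which command-line tools are not yet on `PATH`. The helper name `find_missing_tools` is hypothetical, and the check only looks for the `iags_auto` and `mono` executables; it does not verify that a Gurobi license has been activated.

```python
import shutil


def find_missing_tools(tools=("iags_auto", "mono")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("iags_auto and mono are both available.")
```

Run it inside the activated conda environment so that the environment's `bin` directory is included in the search.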
The IAGS_AUTO tool requires three types of files: species GFF files, an orthogroup.tsv file, and a species.tree file. Please place these three types of files in the same folder. The GFF and orthogroup.tsv files are the same as those required for the previous script (processDrimm).
Each file type is described below:
- GFF files: The GFF files have the same format as the input for MCScanX. The file contains four columns, namely chromosome name, gene name, gene start coordinate, and gene end coordinate. The format is as follows:
sp_name gene_name starting_position ending_position
- Orthogroups.tsv: The output file of OrthoFinder.
- species.tree: WGD-Newick format. Essentially a modified version of the Newick format, with the addition of "[WGD]" markers at the WGD (Whole Genome Duplication) positions in the tree. In the figure below, the red dots represent WGD markers. (A trailing ';' at the end of the file is optional.)
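Before running the tool, it can help to check the inputs against the formats described above. The following is a minimal sketch, not part of IAGS_AUTO: `read_gff` and `count_wgd_markers` are hypothetical helper names, the GFF columns are assumed to be separated by tabs or spaces, and the tree check only counts the literal "[WGD]" marker without validating where it is placed.

```python
from pathlib import Path


def read_gff(path):
    """Parse a four-column GFF file into (chromosome, gene, start, end) tuples."""
    genes = []
    for line_no, line in enumerate(Path(path).read_text().splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split()  # assumption: tab- or space-separated columns
        if len(fields) != 4:
            raise ValueError(f"{path}:{line_no}: expected 4 columns, got {len(fields)}")
        chrom, gene, start, end = fields
        # int() raises ValueError if a coordinate is not an integer
        genes.append((chrom, gene, int(start), int(end)))
    return genes


def count_wgd_markers(tree_text):
    """Return the number of "[WGD]" markers and the tree text with them removed."""
    marker = "[WGD]"
    return tree_text.count(marker), tree_text.replace(marker, "")
```

Calling `read_gff` on each species GFF file surfaces malformed lines early, and `count_wgd_markers` applied to the contents of species.tree shows how many WGD events the tree declares.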
command | parameters | instructions |
---|---|---|
-f, --filepath | ./file_dir | Directory where the three required file types are stored |
-c, --cycleLength | Default: 20 | The continuity of synteny blocks |
-d, --dustLength | Default: all species copy number plus 1 | Upper limit on gene family size. A gene family is filtered out when its number of homologous genes exceeds this threshold |
-s, --shape | "s" (default), "c" | Chromosome shape. "s" represents string chromosomes and "c" represents circular chromosomes |
-m, --model | "manual", "continue" | Default is None. To specify outgroups manually, first run in "manual" mode to generate the node computation order file "model_and_outgroup.txt", then edit the outgroup information in that file. Afterwards, run in "continue" mode to generate the results from the edited file |
--dotplot | - | Generate a pairwise dotplot for each pair of species |
--expand | - | Expand synteny block coverage using a graph algorithm |
--check | "yes" (default), "no" | Whether to stop the program when more than 30% of chromosomes are empty after filtering synteny blocks by copy number (that is, stop when synteny block quality is low) |
A "Result" folder is generated in the run directory. It contains the subfolders Tree_File, Process_Drimm, and IAGS, listed in the order in which they are generated. The outputs of most interest are the "IAGS" folder and "model_and_outgroup.txt" in "Tree_File". "IAGS" holds the final reconstructed ancestral genomes and the chromosome painting. "model_and_outgroup.txt" records the outgroup used by each ancestor node and needs to be edited when outgroups are specified manually.
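To confirm that a run produced the layout described above, a small check along these lines may be useful. It is a sketch only: `missing_outputs` is a hypothetical helper, and it tests only for the three subfolders and the "model_and_outgroup.txt" file named in this section.

```python
from pathlib import Path

# Subfolders of "Result", in the order they are generated.
EXPECTED_SUBFOLDERS = ("Tree_File", "Process_Drimm", "IAGS")


def missing_outputs(run_dir="."):
    """Return the expected paths under <run_dir>/Result that do not exist."""
    result = Path(run_dir) / "Result"
    expected = [result / name for name in EXPECTED_SUBFOLDERS]
    expected.append(result / "Tree_File" / "model_and_outgroup.txt")
    return [str(path) for path in expected if not path.exists()]
```

An empty list means every expected path is present. Because the subfolders are created in order, the first missing entry gives a hint about the stage at which a run stopped.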
The role of each file is described in more detail below:
- Tree_File
- species.ratio and all.ratio: Copy number information for all species;
- Evolutionary_tree.txt: Evolutionary tree shape and distribution of all nodes;
- model_and_outgroup.txt: Information about the computation of each ancestor node, in the format: currently computed ancestor node : IAGS computation model : child node : outgroup. If the model is GMP or MultiCopyGMP, there are two child nodes. In particular, in the MultiCopyGMP model, if the outgroup does not have a sufficient copy number, its chromosomes are artificially duplicated for the computation, which is denoted by "*N".
- Process_Drimm
Essentially an automated version of processDrimm.
- Process_OrthoFind: Gene sequences generated from the coding genes using the species GFF and orthogroup.tsv files;
- Drimm_Synteny_Output: Raw results generated after running DRIMM;
- Drimm_Blocks: Result of applying LCS (the Longest Common Subsequence algorithm, described in processDrimm) to the raw DRIMM output, for downstream analysis;
- Final_Blocks: Filter Drimm_Blocks proportionally to generate blocks that can be run by IAGS.
- IAGS
- Name of each ancestor node: Ancestor node details including computed blocks, CRB ratio evaluation etc;
- painting: Chromosome painting of the ancestral genome, where "Painting_start_point.txt" records the basal ancestor of the drawing;
- shufflingEvents.txt: Species fission and fusion information.
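The record format given above for "model_and_outgroup.txt" can be read with a few lines of Python. This is a sketch under assumptions: `parse_model_line` is a hypothetical helper, each record is assumed to sit on one line with exactly four ':'-separated fields, and the child-node field is kept as raw text because the separator between two child nodes is not specified here.

```python
def parse_model_line(line):
    """Split one 'node : model : child node : outgroup' record into a dict."""
    fields = [part.strip() for part in line.split(":")]
    if len(fields) != 4:
        raise ValueError(
            f"expected 4 ':'-separated fields, got {len(fields)}: {line!r}"
        )
    node, model, children, outgroup = fields
    return {
        "node": node,
        "model": model,
        "children": children,  # raw text; may name one or two child nodes
        "outgroup": outgroup,  # raw text; may carry a "*N" marker
    }
```

Reading the file this way makes it easier to review which outgroup each ancestor node uses before switching from "manual" to "continue" mode.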
- 1. Quick start
iags_auto -f ./example
- 2. Specify parameters
iags_auto -f ./example -c 60 -d 12 -s s
- 3. Run with the graph expansion algorithm
iags_auto -f ./example --expand
- 4. Specify outgroups manually
iags_auto -f ./example -m manual
Then edit the outgroups in "Result/Tree_File/model_and_outgroup.txt" and continue the run:
iags_auto -f ./example -m continue
- 5. Generate dotplots
iags_auto -f ./example --dotplot
This command creates an additional "Dotplot" folder in "Result", where the dotplots are stored.
- 6. Disable the empty-chromosome check
iags_auto -f ./example --check no
If you use the downloaded source code instead, replace iags_auto with python IAGS_AUTO.py in the commands above to call the Python file directly.
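When the tool is driven from a script, the commands above can be assembled programmatically. The sketch below is illustrative rather than part of IAGS_AUTO: `build_command` and its keyword names are hypothetical, and it emits only the flags documented in the parameter table. The `launcher` argument defaults to the conda entry point and can be replaced by the Python launcher when running from source.

```python
import shlex

# Allowed values, copied from the parameter table.
ALLOWED = {"shape": ("s", "c"), "model": ("manual", "continue"), "check": ("yes", "no")}


def build_command(filepath, launcher=("iags_auto",), cycle_length=None,
                  dust_length=None, shape=None, model=None, check=None,
                  dotplot=False, expand=False):
    """Build an argument list; options left at their defaults are omitted."""
    choices = {"shape": shape, "model": model, "check": check}
    for name, value in choices.items():
        if value is not None and value not in ALLOWED[name]:
            raise ValueError(f"{name} must be one of {ALLOWED[name]}, got {value!r}")
    cmd = list(launcher) + ["-f", str(filepath)]
    if cycle_length is not None:
        cmd += ["-c", str(cycle_length)]
    if dust_length is not None:
        cmd += ["-d", str(dust_length)]
    if shape is not None:
        cmd += ["-s", shape]
    if model is not None:
        cmd += ["-m", model]
    if check is not None:
        cmd += ["--check", check]
    if dotplot:
        cmd.append("--dotplot")
    if expand:
        cmd.append("--expand")
    return cmd


if __name__ == "__main__":
    # Prints: iags_auto -f ./example -c 60 -d 12 -s s
    print(shlex.join(build_command("./example", cycle_length=60, dust_length=12, shape="s")))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building a list instead of a shell string avoids quoting problems when the input directory contains spaces.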
Because IAGS relies on Gurobi for integer optimization, users must download and activate a Gurobi license themselves. We provide a help document to guide users through installing and activating Gurobi.