Skip to content

Wrapper tool to reconstruct ancestral sequences using IQ-TREE

License

Notifications You must be signed in to change notification settings

YuSugihara/ancseq

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

61 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ancseq

Version 1.2.1

Table of contents

What is ancseq?

Ancestral sequence reconstruction is a technique to reconstruct ancestral states from a multiple sequence alignment. ancseq is a wrapper tool to reconstruct ancestral sequences using IQ-TREE. See more detail workflow in ancseq here.

Citation

Sugihara Y, Kourelis J, Contreras MP, Pai H, Selvaraj M, Toghani A, Martinez-Anaya C*, Kamoun S* (2024) Helper NLR immune protein NRC3 evolved to evade inhibition by a cyst nematode virulence effector. bioRxiv. *Corresponding authors

Installation

Dependencies

Softwares

Python (>=3.5) libraries

Installation using conda

You can install ancseq with the dependencies using anaconda.

git clone https://github.com/YuSugihara/ancseq.git
cd ancseq
conda env create -f ancseq.yml
conda activate ancseq
pip install .

Usage

$ ancseq -h
usage: ancseq -s <ALIGNED_FASTA> -m <MODE> -o <OUT_DIR> [-t <INT>]

ancseq version 1.2.1

options:
  -h, --help         show this help message and exit
  -s , --seq         Sequence alignment in FASTA format.
  -m , --mode        Sequence type. [DNA/AA/CODON]
  -o , --out         Output directory. The given name must not exist.
  -t , --threads     Number of threads. [8]
  -b , --bootstrap   Replicate for bootstrap. [1000]
  --max-report       Maximum number of ambiguous sites to report at the same position. [5]
  --min-prob         Minimum probability of being reported as an ambiguous site. [0.05]
  --min-gap-prob     Minimum probability of replacing the ancestral state with a gap. [0.5]
  --fast             Use -fast option in IQ-TREE [FLASE]
  --model            Specify substitution model for IQ-TREE. IQ-TREE searches the best substitution
                     model using ModelFinder in default [MFP]
  --outgroup         Specify outgroup for IQ-TREE. [None]
  --stop-codon-prob  Stop calculation of codon probabilities in DNA mode [FLASE]
  --asr-only         Skip building tree and reconstruct ancestral states only [FLASE]
  -v, --version      show program's version number and exit

We recommend to specify the outgroup to avoid misinterpretation of the ancestral states of nodes. See more detail here.

Example 1 : Running ancseq for nucleotide sequence alignment

ancseq -s test_nuc.fasta \
       -m DNA \
       -o out_dir

-s : Nucleotide sequence alignment in fasta format.

-m : Sequence type.

-o : Name of the output directory. The given name should not exist.

Example 2 : Running ancseq for amino acid sequence alignment

ancseq -s test_nuc.fasta \
       -m AA \
       -o out_dir

-s : Amino acid sequence alignment in fasta format.

-m : Sequence type.

-o : Name of the output directory. The given name should not exist.

Example 3 : Running ancseq for codon sequence alignment

!!!WARNING!!! IQ-TREE implements codon substitution models. However, it might take too long to build phylogenetic tree, depending on the alignment you input. In that case, we would recommend running ancseq in DNA mode. anceseq can calculate the probabilities of each codon in DNA mode.

ancseq -s test_codon.fasta \
       -m CODON \
       -o out_dir

-s : Codon sequence alignment in fasta format.

-m : Sequence type.

-o : Name of the output directory. The given name should not exist.

Example 4 : Running ancseq specifing outgroup

You can reconstruct the ancestral states without specifying the outgroup. However, the ancestral states of the node may be misinterpreted when you visualize the tree. Therefore, we recommend to specify the outgroup to avoid misinterpretation of ancestral states of the node. IQ-TREE converts the rooted tree to the unrooted tree in defalt.

ancseq -s test_nuc.fasta \
       -m DNA \
       --outgroup seq_id \
       -o out_dir

-s : Nucleotide sequence alignment in fasta format.

-m : Sequence type.

--outgroup : Sequence ID of outgroup.

-o : Name of the output directory. The given name should not exist.

Example 5 : Running ancseq with --fast option

ancseq -s test_nuc.fasta \
       -m DNA \
       -o out_dir \
       --fast

-s : Nucleotide sequence alignment in fasta format.

-m : Sequence type.

-o : Name of the output directory. The given name should not exist.

--fast : Use -fast option in IQ-TREE.

Outputs

Inside of OUT_DIR is like below.

├── 00_tree
│  ├── 00_iqtree.err
│  ├── 00_iqtree.out
│  ├── test_nuc.fasta
│  ├── test_nuc.fasta.bionj
│  ├── test_nuc.fasta.ckp.gz
│  ├── test_nuc.fasta.contree
│  ├── test_nuc.fasta.iqtree
│  ├── test_nuc.fasta.log
│  ├── test_nuc.fasta.mldist
│  ├── test_nuc.fasta.model.gz
│  ├── test_nuc.fasta.splits.nex
│  └── test_nuc.fasta.treefile
├── 10_asr
│  ├── 10_iqtree.err
│  ├── 10_iqtree.out
│  ├── test_nuc.fasta
│  ├── test_nuc.fasta.ckp.gz
│  ├── test_nuc.fasta.iqtree
│  ├── test_nuc.fasta.log
│  ├── test_nuc.fasta.state.gz
│  └── test_nuc.fasta.treefile
├── 20_indels
│  ├── 20_iqtree.err
│  ├── 20_iqtree.out
│  ├── test_nuc.fasta.binary
│  ├── test_nuc.fasta.binary.ckp.gz
│  ├── test_nuc.fasta.binary.iqtree
│  ├── test_nuc.fasta.binary.log
│  ├── test_nuc.fasta.binary.state.gz
│  └── test_nuc.fasta.binary.treefile
└── 30_result
   ├── ancestral_state_result.treefile
   ├── ancestral_state_result.fasta
   ├── ancestral_state_result_with_gap.fasta
   ├── ancestral_state_result.sort.tsv
   ├── ancestral_state_result.tsv.gz
   └── ancestral_state_result.codon_prob.tsv.gz
  • The phylogenetic tree reconstructed by IQ-TREE can be found in 00_tree.
  • The results of the ancestral sequence reconstruction can be found in 30_result.
    • ancestral_state_result.treefile: Phylogenetic tree with the node labels.
    • ancestral_state_result.fasta: FASTA file of the ancestral sequences without gaps.
    • ancestral_state_result_with_gap.fasta: FASTA file of the ancestral sequences with gaps.
    • ancestral_state_result.sort.tsv : Probabilities of the ancestral states.

Workflow in ancseq

IQ-TREE command 1

iqtree -s ${INPUT_FASTA} \
       -st ${SEQ_TYPE} \
       -T ${NUM_THREADS} \
       -B ${NUM_BOOTSTRAP} \
       -m MFP \
       1> /OUT_DIR/00_tree/00_iqtree.out \
       2> /OUT_DIR/00_tree/00_iqtree.err

-s : Sequence alignment in fasta format.

-st : Sequence type.

-T : Number of threads.

-B : Replicates for ultrafast bootstrap.

-m MFP : Extended model selection followed by tree inference.

If you specify the --fast option in ancseq, the IQ-TREE command 1 will change as follows.

iqtree -s ${INPUT_FASTA} \
       -st ${SEQ_TYPE} \
       -T ${NUM_THREADS} \
       --alrt ${NUM_BOOTSTRAP} \
       -m MFP \
       --fast \
       1> /OUT_DIR/00_tree/00_iqtree.out \
       2> /OUT_DIR/00_tree/00_iqtree.err

-s : Sequence alignment in fasta format.

-st : Sequence type.

-T : Number of threads.

--alrt : Replicates for SH approximate likelihood ratio test.

-m MFP : Extended model selection followed by tree inference.

--fast : Fast search to resemble FastTree.

IQ-TREE command 2

iqtree -asr \
       -s ${INPUT_FASTA} \
       -te /OUT_DIR/00_tree/${INPUT_FASTA}.treefile \
       -st ${SEQ_TYPE} \
       -T ${NUM_THREADS} \
       -m ${MODEL} \
       -o ${OUTGROUP} \
       -keep_empty_seq \
       1> /OUT_DIR/10_asr/10_iqtree.out \
       2> /OUT_DIR/10_asr/10_iqtree.err

-asr : Ancestral state reconstruction by empirical Bayes.

-s : Sequence alignment in fasta format.

-te : Tree file.

-st : Sequence type.

-T : Number of threads.

-m : Model name.

-o : Sequence ID of the outgroup if you specify the outgroup with --outgroup.

-keep_empty_seq : Keep empty sequences in the alignment.

IQ-TREE command 3

iqtree -asr \
       -s ${INPUT_FASTA}.binary \
       -te /OUT_DIR/00_tree/${INPUT_FASTA}.treefile \
       -st BIN \
       -T ${NUM_THREADS} \
       -blfix \
       -m JC2 \
       -o ${OUTGROUP} \
       -keep_empty_seq \
       1> /OUT_DIR/20_indels/20_iqtree.out \
       2> /OUT_DIR/20_indels/20_iqtree.err

-asr : Ancestral state reconstruction by empirical Bayes.

-s : Sequence alignment in fasta format.

-te : Tree file.

-st BIN : Binary sequence type.

-T : Number of threads.

-blfix : Fix branch lengths of tree passed via -t or -te.

-m JC2 : Jukes-Cantor type binary model.

-o : Sequence ID of the outgroup if you specify the outgroup with --outgroup.

-keep_empty_seq : Keep empty sequences in the alignment.

References

  1. Aadland K, Pugh C, Kolaczkowski B. 2019. High-Throughput Reconstruction of Ancestral Protein Sequence, Structure, and Molecular Function. In: Sikosek T ed. Computational Methods in Protein Evolution. Methods in Molecular Biology. New York, NY: Springer, 135–170. DOI: 10.1007/978-1-4939-8736-8_8.

  2. VanAntwerp J, Finneran P, Dolgikh B, Woldring D. 2022. Ancestral Sequence Reconstruction and Alternate Amino Acid States Guide Protein Library Design for Directed Evolution. In: Traxlmayr MW ed. Yeast Surface Display. Methods in Molecular Biology. New York, NY: Springer US, 75–86. DOI: 10.1007/978-1-0716-2285-8_4.

How does anceseq calculate codon probabilities in DNA mode?

IQ-TREE implements codon substitution models. However, it might take too long to build phylogenetic tree, depending on the alignment you input. Therefore, we implemented a function to calculate the probabilities of each codon by simply multiplying the probabilities of each nucleotide. For example, a probability of methionine at the $j$ th position, $P_{j, M}$, can be calculated by multiplying a probability of A at the 1st nucleotide of the $j$ th codon, $p_{j_{1}, A}$, that of T at the 2nd nucleotide, $p_{j_{2}, T}$, and that of G at the 3rd nucleotide, $p_{j_{3}, G}$ like below.

$P_{j, M} = p_{j_{1}, A} * p_{j_{2}, T} * p_{j_{3}, G}$

About

Wrapper tool to reconstruct ancestral sequences using IQ-TREE

Resources

License

Stars

Watchers

Forks

Packages

No packages published