Please note - this document is under development and only contains outlines of the sections being prepared
This repository aims to serve as a starting point for learning how to perform various phylogenomic analyses. The focus is on the estimation of species trees for angiosperm plants (but the methods can be used for other organisms).
The documentation is structured as follows:
- background to performing these analyses
- workflow outline to estimate a species tree and discussion of the outputs obtained at each step. Includes links to existing tutorials, including ones compiled by people at Kew
- brief introductions to some computing subjects with links to existing tutorials, including those compiled by people at Kew, and how to run analysis software
- skeleton workflow containing useful commands
If you would like to contribute content to this repository please contact an existing contributor at RBG Kew.
[ Proposed topics to include (very short paragraph on each topic)]:
- Why build trees with sequence data?
- What does it tell you that taxonomy can't?
- Discuss models of evolution and tree building algorithms
- Discuss Jukes Cantor, Kimura, F84 and REV models; model testing
- Discuss different types of tree method: distance matrix, parsimony, maximum likelihood and Bayesian analysis - this document discusses maximum likelihood only.
- Discuss what sequences are required for the analysis:
- requirement for orthologous loci
- NGS techniques, e.g.:
- RAD-Seq
- target capture of gene regions
- mention baits and enrichment - requirement for short (e.g. 120 bp) bait sequences designed against a subset of target genes selected from the genome
- Various methods but recommend HybPiper (Johnson et al, 2019)
- Requires a bait kit for enrichment and reference target sequences (Angiosperms353)
- Mention skimming is possible
- Mention MEGA353 and family-specific kits
- Brief discussion of data mining - enables new data to be analysed in the context of existing data. Example data sources: OneKP, SRA, PAFTOL, Earth BioGenome, Darwin Tree of Life, 10KP etc. - links to all large projects
- Annotating trees with taxonomy and other data sets
- WCVP
- Angiosperms353 - or section on existing tree resources incl Explorer ]
- Performing the analysis - are there any online tools for estimating a species tree?
- Kapli et al (2020) Phylogenetic tree building in the genomic age - review
- Guo et al (2022) Phylogenomics and the flowering plant tree of life - review
- Johnson et al (2019) A universal probe set for targeted sequencing of 353 nuclear genes from any flowering plant designed using k-medoids clustering
- courses in Evolution and genomics, Český Krumlov, Czech Republic
- Phylogenomics and Population Genomics: Inference and Applications, University of Barcelona - two week course, yearly
This workflow is discussed briefly in six sections below but consists of two major steps:
- step 1: assembly of gene sequences from samples
- step 2: organising sequences into gene sets, aligning the genes and using the alignments to estimate a species tree
It is assumed that target capture data is being used, available in the form of Illumina short paired-end reads in fastq format. At the end of this section there are links to more detailed existing documentation on how to run the programs. Following this section there are some details on computing requirements and a section that summarises the commands required for a complete analysis.
- The HybPiper example
- PAFTOL dataset e.g. one representative sample of each order in the angiosperms
- adaptor trimming
- quality trimming
- trimmomatic tool
- FASTQC
Outputs (and Interpretation)
- Trimmomatic output - expected % of retained paired-end and single-end reads; single-end reads can also be used in HybPiper (can be ~7% of total reads). An example command is sketched below.
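A minimal sketch of a Trimmomatic paired-end run followed by a FastQC check. This assumes a Bioconda install (which provides a trimmomatic wrapper script); file names and trimming thresholds are illustrative placeholders:

# Adaptor and quality trimming of paired-end reads with Trimmomatic
# (adjust the path to the TruSeq3-PE.fa adaptor file for your install)
trimmomatic PE -threads 4 -phred33 \
  sample_R1.fastq.gz sample_R2.fastq.gz \
  sample_R1_paired.fastq.gz sample_R1_unpaired.fastq.gz \
  sample_R2_paired.fastq.gz sample_R2_unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:36

# Check the quality of the trimmed reads with FastQC
fastqc sample_R1_paired.fastq.gz sample_R2_paired.fastq.gz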
Go further
- HybPiper (example commands sketched after this list)
- organising sequences into gene sets
- output DNA sequences should be in frame
Outputs (and Interpretation)
Go further
- Captus software (?)
- Data mining
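A minimal sketch of HybPiper 2 gene recovery for one sample, assuming a Bioconda install and an Angiosperms353-style target file; all file names are placeholders:

# Assemble genes for one sample from trimmed paired-end reads (plus optional unpaired reads)
hybpiper assemble -t_dna targets.fasta -r sample_R1_paired.fastq.gz sample_R2_paired.fastq.gz \
  --unpaired sample_unpaired.fastq.gz --prefix sample --bwa

# Summarise recovery across the samples listed in namelist.txt,
# then write one FASTA file per gene for downstream alignment
hybpiper stats -t_dna targets.fasta gene namelist.txt
hybpiper retrieve_sequences dna -t_dna targets.fasta --sample_names namelist.txt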
Intro - various easy-to-use alignment programs exist - recommended ones: MAFFT, or UPP for large alignments (example commands after this list)
Outputs (and Interpretation)
- Assessment: AMAS summary
- Filtering tools: AMAS, Trimal
- Removal of long branches with TreeShrink
- visualisation of the alignments
Go further
- Concatenation --> species tree (also see two sections below)
- DNA aligned by protein sequence
- Rogue taxon programs
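Hedged examples of the alignment, assessment and filtering steps above, with placeholder file names:

# Align one gene with MAFFT (repeat per gene, or loop over all gene files)
mafft --auto --thread 4 gene001.fasta > gene001.aln.fasta

# Summary statistics (length, missing data, variable sites etc.) for all alignments with AMAS
AMAS.py summary -f fasta -d dna -i *.aln.fasta

# Remove poorly aligned columns with trimAl using the automated gappyout heuristic
trimal -in gene001.aln.fasta -out gene001.trimmed.fasta -gappyout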
Intro:
- Various easy-to-use maximum likelihood tree programs exist - recommended ones (in order of speed): FastTree, IQTREE, RAxML
- recommended program options e.g. the IQTREE ultrafast bootstrap (example command after this list)
- Virtues of each one: speed, ease of use, documentation
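A minimal IQTREE2 gene tree sketch using ModelFinder model selection and the ultrafast bootstrap; the input file name is a placeholder:

# Estimate a gene tree with model selection (-m MFP) and
# 1000 ultrafast bootstrap replicates (-B 1000); -T AUTO picks the thread count
iqtree2 -s gene001.trimmed.fasta -m MFP -B 1000 -T AUTO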
Outputs (and Interpretation)
- Newick file format and how to manipulate it
- Assessment: support values
- Adding taxonomy and visualisation
Go further
Intro - choice of program: ASTRAL (a minimal example is sketched below)
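A minimal ASTRAL-III sketch, assuming gene trees from the previous step; the jar name astral.5.7.8.jar is illustrative - adjust to your installed release. Note that ASTRAL writes its log to stderr:

# Collect the gene trees into a single file, then estimate the species tree
cat *.treefile > gene_trees.nwk
java -jar astral.5.7.8.jar -i gene_trees.nwk -o species_tree.nwk 2> astral.log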
Outputs (and Interpretation)
- Assessment of support values e.g. what do the pp1 (local posterior probability) scores mean? Explain roots - ASTRAL trees are unrooted; recommended confidence threshold: >0.95 (ref); normalised quartet score
- Visualisation
Go further
- e.g. concatenation; partitioning data with RAxML-NG; not as robust to ILS as the ASTRAL summary method but gives you real branch lengths, which might be of interest e.g. long branches that coincide with particular taxonomic levels (or not!) - see the sketch after this list
- intro to incongruence
- other programs to investigate e.g. ASTRAL-MP, ASTRAL-Pro
- Tanglegrams
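A hedged sketch of the concatenation route using AMAS and RAxML-NG; file names are placeholders, and the AMAS partition file may need reformatting into RAxML-NG's 'MODEL, name = range' style:

# Concatenate the per-gene alignments and write a partition file with AMAS
AMAS.py concat -f fasta -d dna -i *.trimmed.fasta -t concatenated.fasta -p partitions.txt

# Partitioned maximum likelihood search plus 100 bootstrap replicates with RAxML-NG
# (edit partitions.txt to RAxML-NG format first, e.g. "GTR+G, gene001 = 1-500")
raxml-ng --all --msa concatenated.fasta --model partitions.txt --bs-trees 100 --threads 4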
The following links to tutorials and workflows go into the practical details of processing read data and estimating trees. Some additional computing knowledge is required and is discussed in the next section.
- A practical workflow by Sidonie Bellot - details include:
- overview of target capture (bait design and making DNA libraries)
- data quality checks and cleaning
- use of HybPiper (version 1.3) and tips on running the program plus information on:
- formatting reference targets
- gene assembly strategy
- multiple sequence alignment, estimating gene trees
- estimating the species tree with ASTRAL-III
- PhyloFrame by Ben Kuhnhäuser - a short list of relevant commands covering the same areas as Sidonie's workflow above
- Training and bioinformatics resources from the GAP phylogenomics project - contains:
- links to their phylogenomic workflows
- recommendations on data analysis to estimate a species tree
- bioinformatics webinars and workshops (with materials to try out)
- Kew internal training course in phylogenomics (November 2019)
- teaches the steps required to estimate a species tree
- includes a summary of basic Linux commands, data transfer to and from computers, and an example of running analyses with the Slurm job scheduler
- PAFTOL Explorer workflow
- Tutorial on ASTRAL by Siavash Mirarab
- slides summarizing the ASTRAL process
- recommendations on:
- dealing with low bootstrap support values in gene trees
- filtering sequences from gene alignments
It is very likely that phylogenomic analysis will require the use of high performance computing (HPC) due to the hundreds of genes used, unless the number of samples being compared is very small (e.g. under 20). Therefore some HPC computing expertise is required.
The links below will get you started on various topics that will also enable you to get started with bioinformatics analyses in general.
- Kew Bioinformatics documentation
- includes information on getting connected to the KewHPC cluster
- provides brief instructions on installing or setting up software
- contains an overview of running analyses via the Slurm job scheduler
- Crop Diversity HPC documentation
- Kew employees can work on this cluster too
- includes information on getting connected
- links to a Linux command line tutorial
- installation of software is done via Bioconda (easy setup instructions provided)
- good overview of how to run your analysis via the Slurm job scheduler
- maximum number of CPUs per user is limited to 256 (as of March 2024)
- Course materials from Programming for Biology, Cold Spring Harbor
- comprehensive introductions to Unix, Git and Python
- A short Linux command line and Slurm primer from the course Methodological Advances in Reticulate Evolution (held at RBG Kew, November 2023). Includes an introduction to writing scripts in Bash, Python and Perl.
- Lots of other documentation and courses are available online
- e.g. https://app.datacamp.com/ is available to Kew employees. It includes courses on R, Python, SQL, AI
- Bioinformatics Training at University of Cambridge
The rest of this section summarizes handy commands for the following computing topics, the first three being important for getting started with analyses:
- Minimum commands - minimum commands required to take control of the command line
- Link to one or two more primers
- awk (one-liner examples below)
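As a taster, two awk one-liners that come up frequently when handling read and sequence files (file names are placeholders):

# Count the number of reads in a fastq file (4 lines per read)
awk 'END {print NR/4}' reads.fastq

# Print the length of each sequence in a single-line-per-sequence fasta file
awk '!/^>/ {print length($0)}' sequences.fasta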
It is essential to be able to run analysis work (jobs) via the Slurm job scheduler. The links at the top of this computing section contain plenty of instructions on how to use Slurm. The following link describes how to run jobs using relevant examples of target capture data analysis (using HybPiper version 1.3 or HybPiper version 2) and generating gene alignments and phylogenetic trees:
The remainder of this section summarizes useful Slurm example commands that can be copy and pasted onto the command line. Text or values that must be edited before using the command are shown between '<' and '>' characters.
sinfo -o "%16P %.5a %.10l %8p %.6D %.6t %N %m"
Output contains useful information you might want to consider before submitting an analysis (job) on a particular partition (queue)
- partition names (*=default)
- whether the partition is available
- time limit for the partition
- priority the job would have in a particular queue
- list of node names in the queue
- maximum amount of memory (RAM) available in each node
One way of submitting a job to Slurm is to use the --wrap flag (useful for practice and testing):
sbatch -J <jobName> -p long -c 1 -n 1 --mem=5000 -o <jobName>.log -e <jobName>.err --wrap "<your command(s) here>"
- By default memory (RAM) is specified in megabytes (MB)
- Note - if the following characters are used within the wrapped command, they need to be escaped with a backslash: \` \" \$
- Variables from outside the script e.g. $variableName must be used without the backslash! (see the example below)
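A hedged example of these escaping rules (job name and partition are placeholders): \$SLURM_JOB_ID is escaped so it is expanded inside the running job, while $USER is left unescaped so your shell expands it at submission time.

sbatch -J test -p long -c 1 -n 1 --mem=1000 -o test.log -e test.err --wrap "echo Job \$SLURM_JOB_ID submitted by $USER"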
squeue -o "%.22i %.13P %.25j %.8u %.8T %.10M %.9l %.6D %.R %.C %.m %.p"
Output can be used to keep track of queued and running jobs e.g.:
- the unique identifier of your job (JOBID)
- the node on which the job is running (NODELIST)
- the current state of your job (STATE) and, for pending jobs, the reason (NODELIST(REASON))
- the number of cpus (CPUS) you requested
- how much memory (MIN_MEMORY) you requested
- To see your jobs only add: -u $USER
If using Slurm arrays, to change the number of jobs to run in parallel:
scontrol update ArrayTaskThrottle=10 JobId=<jobId>
To change the amount of RAM for jobs still pending:
scontrol update MinMemoryNode=5000 JobId=<jobId>
- By default memory (RAM) is specified in megabytes (MB)
To change the number of cpus to use for jobs still pending:
scontrol update NumCPUs=10 JobId=<jobId>
- Note - you may also have to set CPUsPerTask to the value required
To change the time limit:
scontrol update TimeLimit=2-00:00:00 JobId=<jobId>
- TimeLimit example values: UNLIMITED, 1-00:00:00 (days-hours:minutes:seconds)
Other scontrol update variables:
- Partition
sacct -S 093023-09:00 --format=jobid,jobname%-25,start%16,end%16,Partition%13,elapsed,cputime,ncpus%3,ntasks%4,NodeList,AveVMSize,MaxRss%12,user%7,state,ExitCode | grep batch | grep "<jobId1>\|<jobId2>" | sort -k11n
Output provides a table of information on the outcome of a job:
- -S option (Start) restricts the output to jobs started from the specified date
- the Elapsed field contains the real time the job took
- the MaxRSS field is the maximum amount of RAM your job used. If the value is larger than the amount of RAM specified your job will fail
- the State field shows whether the last command executed in the job completed successfully. Values are: COMPLETED, FAILED, OUT_OF_ME+, TIMEOUT
- the ExitCode field shows the Linux exit code: 0:0 (successful), 1:0 (failed), >1:0 (some other issue)
- the final part of the command (| grep batch | grep ... | sort -k11n) is useful for large job arrays:
- restricts output to the job(s) of interest
- 'sort -k11n' sorts the output by the amount of memory consumed; replace with 'sort -k5' to sort by the time taken (Elapsed) (sorts correctly up to 9 days only!)
To cancel a job (array):
scancel <jobId>
To cancel all jobs pending for a user:
scancel -t PENDING -u <username>
To cancel a large number of jobs within a range jobId values:
scancel {<jobId1>..<jobIdN>}
- ... but you probably should have used a Slurm job array instead
srun --cpus-per-task=1 --mem-per-cpu=2G --partition=himem --pty bash
- logs you directly into a node where you can run jobs interactively
- To log into a specific node add the --nodelist flag
- e.g. --nodelist=n21-64-2048-buck
- Advice on organising software (sketches after this list)
- Installers
- Conda
- Installing HybPiper2 on a bad day (connection slow and dropping)
- Running a Python script - if you have to run a script with 'python' in front
- Installing Python modules
- Setting up a Python environment
- Setting up a script
- Java versions and JRE/JDK
- Running a Java script - setting memory
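A few hedged sketches for the topics above, assuming the Bioconda and conda-forge channels are set up; environment, package and file names are illustrative:

# Install HybPiper (and its dependencies) into its own conda environment
conda create -n hybpiper -c bioconda -c conda-forge hybpiper
conda activate hybpiper

# A self-contained Python environment for your own scripts and modules
python3 -m venv my_env
source my_env/bin/activate
pip install biopython

# Give a Java program (e.g. ASTRAL) more memory with the -Xmx flag
java -Xmx16g -jar astral.5.7.8.jar -i gene_trees.nwk -o species_tree.nwk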
This section is a very concise summary of the commands required for a complete analysis. It takes the example data set from HybPiper and works through read preparation, HybPiper gene recovery, gene alignment with MAFFT, gene tree estimation with IQTREE2 and species tree estimation with ASTRAL. A condensed sketch is given after the list below.
- To use the same headings as in the Workflow outline provided above.
- Suggest to follow HybPiper2 tutorial
- then more detail from the gene alignment step onwards [yet to outline]
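As a placeholder until this section is fleshed out, a hedged end-to-end sketch from the gene alignment step onwards, under the assumptions above (Bioconda installs of MAFFT, IQTREE2 and ASTRAL; per-gene FASTA files already recovered by HybPiper; gene and jar names are placeholders):

# Align each gene and estimate its gene tree
for gene in gene001 gene002 gene003; do
  mafft --auto ${gene}.fasta > ${gene}.aln.fasta
  iqtree2 -s ${gene}.aln.fasta -m MFP -B 1000 -T 2
done

# Pool the gene trees and estimate the species tree with ASTRAL
cat *.treefile > gene_trees.nwk
java -jar astral.5.7.8.jar -i gene_trees.nwk -o species_tree.nwk 2> astral.log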
- Paul Bailey
- Sidonie Bellot
- Ben Kuhnhäuser
- Need to think how the document(s) can be modified. Is it best to transfer agreed tracked changes from e.g. a Google doc?
- Can edit in GitHub online and leave a commit message with your name - there's even an extended message box to leave further info - might be sufficient
- The idea is that the content should be very succinct and to the point
- Maybe contributors should meet occasionally to assess structure and agree new content
- If the document grows significantly or new topics are included, could break out into several files
- What other topics could be covered by other people as well as estimating species trees?
- Open door sessions (?)
Copyright © 2024 The Board of Trustees of the Royal Botanic Gardens, Kew