-
Notifications
You must be signed in to change notification settings - Fork 86
StrainPhlAn 4.1
StrainPhlAn is a computational tool for tracking individual strains across a large set of samples. The input of StrainPhlAn is a set of metagenomic samples and for a given species of interes, the output is a multiple sequence alignment (MSA) file of the strains of the species of interest reconstructed from the samples. From this MSA, StrainPhlAn calls PhyloPhlAn to build the phylogenetic tree showing the phylogenetic relationship between strains identified in each sample.
For each sample, StrainPhlAn is able to identify strains of a species by merging and concatenating all reads mapped against species-specific MetaPhlAn markers.
As an introductory examples, let us start from a toy dataset with six HMP gut metagenomic samples (SRS055982-subjectID_638754422, SRS022137-subjectID_638754422, SRS019161-subjectID_763496533, SRS013951-subjectID_763496533, SRS014613-subjectID_763840445, SRS064276-subjectID_763840445) from three subjects (each was sampled at two time points) and one Bacteroides caccae genome G000273725. We would like to:
(1) extract the Bacteroides caccae (SGB1877) strains from these samples and compare them with the reference genome in a phylogenetic tree. (2) know how many SNPs between those strains and the reference genome.
By running StrainPhlAn on these samples, we will obtain the Bacteroides caccae (SGB1877) phylogenetic tree and its multiple sequence alignment in the following figure (produced with ete2 and Jalview):
We can see that the strains from the same subject are grouped together. The tree also highlights that the strains from each subject did not evolve between the two sampling time points. From the tree, we also know that the strains of subject "763496533" are closer to the reference genome than those of the others.
In addition, the table below shows the number of SNPs between the sample strains and the reference genome based on the strain alignment returned by StrainPhlAn.
In the next sections, we will illustrate step by step how to run StrainPhlAn on this toy example to reproduce the above figures.
StrainPhlAn is included in MetaPhlAn. You can get more info about MetaPhlAn installation here.
Let's reproduce the toy example result in the introduction section. The steps are as follows:
Step 1. Download the 6 HMP gut metagenomic samples, the metadata.txt file and one reference genome from the folder "fastqs" and "reference_genomes" in this link and put these folders under a folder named "strainer_tutorial".
Step 2. Obtain the sam files from these samples by mapping them against MetaPhlAn database:
This step will run MetaPhlAn: the metagenomic samples will be mapped against the MetaPhlAn marker database and then SAM files (*.sam.bz2) will produced. Each SAM file (in SAM format) contains the reads mapped against the marker database of MetaPhlAn. After activating your conda MetaPhlAn environment (check install instructions), the commands to run are:
mkdir -p sams
mkdir -p bowtie2
mkdir -p profiles
for f in fastq/SRS*
do
echo "Running MetaPhlAn on ${f}"
bn=$(basename ${f})
metaphlan ${f} --input_type fastq -s sams/${bn}.sam.bz2 --bowtie2out bowtie2/${bn}.bowtie2.bz2 -o profiles/${bn}_profiled.tsv
done
After this step, you will have a folder "sams" containing the SAM files (*.sam.bz2) and other MetaPhlAn output files in the "bowtie2" and "profiles" folders. If you want to skip this step, you can download the SAM files from the folder "sams" in this link.
Step 3. Produce the consensus-marker files which are the input for StrainPhlAn:
For each sample, this step will reconstruct all species strains found in it and store them in a json file (*.json). Those strains are referred as sample-reconstructed strains. The commands to run are:
mkdir -p consensus_markers
sample2markers.py -i sams/*.sam.bz2 -o consensus_markers -n 8
The result is the same if you want to run several sample2markers.py
scripts in parallel with each run for a sample (this may be useful for some cluster-system settings).
After this step, you will have a folder consensus_markers
containing all sample-marker files (*.json).
This step cannot be skipped, since the consensus marker files from the folder "consensus_markers" in this link use a deprecated format (.pkl instead of .json).
Step 4. Extract the markers of Bacteroides_caccae (SGB1877) from MetaPhlAn database (to add its reference genome later):
This step will extract the markers of Bacteroides_caccae (SGB1877) in the database and then StrainPhlAn will identify the sequences in the reference genomes that are closest to them (in the next step by using blast). Those will be concatenated and referred to as reference-genome-reconstructed strains. The commands to run are:
mkdir -p db_markers
extract_markers.py -c t__SGB1877 -o db_markers/
After this step, you should have one file in folder "db_markers": "t__SGB1877.fna" containing the markers of Bacteroides caccae (SGB1877). Those markers can be found in the folder "db_markers" in this link.
Step 5. Build the multiple sequence alignment and the phylogenetic tree:
This step will filter the selected clade markers based on their presence in the sample-reconstructed strains (stored in the marker files produced in step 3) and reference-genomes (if specified). Also, the sample-reconstructed strains and reference-genomes will be filtered based on the presence of the selected clade markers. From this filtered markers and samples, StrainPhlAn will call PhyloPhlAn to produce a multiple sequence alignment (MSA) to then build the phylogenetic tree.
The commands to run are:
mkdir -p output
strainphlan -s consensus_markers/*.json -m db_markers/t__SGB1877.fna -r reference_genomes/G000273725.fna.bz2 -o output -n 8 -c t__SGB1877 --mutation_rates
After this step, you will find the tree "output/RAxML_bestTree.t__SGB1877.StrainPhlAn4.tre". All the output files can be found in the folder "output" in this link. You can view it by Archaeopteryx or any other viewers.
In order to add the metadata, we provide a script called add_metadata_tree.py which can be used as follows:
add_metadata_tree.py -t output/RAxML_bestTree.t__SGB1877.StrainPhlAn4.tre -f fastq/metadata.txt -m subjectID --string_to_remove .fastq.bz2
The script "add_metadata_tree.py" can accept multiple metadata files (space-separated, wild card can also be used) and multiple trees. A metadata file is a tab-separated file where the first row is the meta-headers, and the following rows contain the metadata for each sample. Multiple metadata files are used in the case where your samples come from more than one dataset and you do not want to merge the metadata files.
For more details of using "add_metadata_tree.py", please see its help (with option "-h"). An example of a metadata file is the "fastqs/metadata.txt" file with the below content:
sampleID subjectID
SRS055982 638754422
SRS022137 638754422
SRS019161 763496533
SRS013951 763496533
SRS014613 763840445
SRS064276 763840445
G000273725 ReferenceGenomes
Note that "sampleID" is a compulsory field.
After adding the metadata, you will obtain the tree files "*.tre.metadata" with metadata and view them by Archaeopteryx as in the previous step.
If you have installed graphlan, you can plot the tree with the plot_tree_graphlan.py script:
plot_tree_graphlan.py -t output/RAxML_bestTree.t__SGB1877.StrainPhlAn4.tre.metadata -m subjectID
and obtain the following figure (output/RAxML_bestTree.t__SGB1877.StrainPhlAn4.tre.metadata.png):
usage: strainphlan [-h] [-d DATABASE] [-m CLADE_MARKERS] [-s SAMPLES [SAMPLES ...]] [-r REFERENCES [REFERENCES ...]] [-c CLADE] -o OUTPUT_DIR [-n NPROCS] [--trim_sequences TRIM_SEQUENCES]
[--sample_with_n_markers SAMPLE_WITH_N_MARKERS] [--sample_with_n_markers_perc SAMPLE_WITH_N_MARKERS_PERC] [--marker_in_n_samples_perc MARKER_IN_N_SAMPLES_PERC]
[--sample_with_n_markers_after_filt SAMPLE_WITH_N_MARKERS_AFTER_FILT] [--sample_with_n_markers_after_filt_perc SAMPLE_WITH_N_MARKERS_AFTER_FILT_PERC] [--breadth_thres BREADTH_THRES]
[--phylophlan_mode {accurate,fast}] [--phylophlan_configuration PHYLOPHLAN_CONFIGURATION] [--tmp TMP] [--mutation_rates] [--phylophlan_params PHYLOPHLAN_PARAMS] [--print_clades_only]
[--non_interactive] [--treeshrink] [--debug] [-v]
options:
-h, --help show this help message and exit
-d DATABASE, --database DATABASE
The input MetaPhlAn 4.1.0 database (default: latest)
-m CLADE_MARKERS, --clade_markers CLADE_MARKERS
The clade markers as FASTA file (default: None)
-s SAMPLES [SAMPLES ...], --samples SAMPLES [SAMPLES ...]
The reconstructed markers for each sample (default: None)
-r REFERENCES [REFERENCES ...], --references REFERENCES [REFERENCES ...]
The reference genomes (default: [])
-c CLADE, --clade CLADE
The clade to investigate (default: None)
-o OUTPUT_DIR, --output_dir OUTPUT_DIR
The output directory (default: None)
-n NPROCS, --nprocs NPROCS
The number of threads to use (default: 1)
--trim_sequences TRIM_SEQUENCES
The number of bases to remove from both ends when trimming markers (default: 50)
--sample_with_n_markers SAMPLE_WITH_N_MARKERS
Threshold defining the minimum absolute number of markers for a sample to be primary. This rule is combined with AND logic with --sample_with_n_markers_perc (default: 20)
--sample_with_n_markers_perc SAMPLE_WITH_N_MARKERS_PERC
Threshold defining the minimum percentage of markers to for a sample to be primary. This rule is combined with AND logic with --sample_with_n_markers (default: 25)
--marker_in_n_samples_perc MARKER_IN_N_SAMPLES_PERC
Threshold defining the minimum percentage of primary samples to keep a marker (default: 50)
--sample_with_n_markers_after_filt SAMPLE_WITH_N_MARKERS_AFTER_FILT
Threshold defining the minimum absolute number of markers after filtering to keep a sample. This rule is combined with AND logic with --sample_with_n_markers_after_filt_perc (default:
20)
--sample_with_n_markers_after_filt_perc SAMPLE_WITH_N_MARKERS_AFTER_FILT_PERC
Threshold defining the minimum percentage of markers kept after filtering to keep a sample. This rule is combined with AND logic with --sample_with_n_markers_after_filt (default: 25)
--breadth_thres BREADTH_THRES
Threshold defining the minimum breadth of coverage for the markers (default: 80)
--phylophlan_mode {accurate,fast}
The presets for fast or accurate phylogenetic analysis (default: fast)
--phylophlan_configuration PHYLOPHLAN_CONFIGURATION
The PhyloPhlAn configuration file (default: None)
--tmp TMP If specified, the directory where to store the temporary files. (default: None)
--mutation_rates If specified, StrainPhlAn will produce a mutation rates table for each of the aligned markers and a summary table for the concatenated MSA. This operation can take long time to finish
(default: False)
--phylophlan_params PHYLOPHLAN_PARAMS
Additional phylophlan parameters (default: None)
--print_clades_only If specified, StrainPhlAn will only print the potential clades and stop the execution (default: False)
--non_interactive If specified, StrainPhlAn will select the first SGB available when the clade is specified at the species level (default: False)
--treeshrink If specified, StrainPhlAn will execute TreeShrink after building the tree (default: False)
--debug If specified, StrainPhlAn will not remove the temporary folders (default: False)
-v, --version Shows this help message and exit (default: False)
In the output folder, you can find the following files:
- clade_name.info: this file shows the general information like the total length of the concatenated markers (full sequence length), number of used markers, etc.
- clade_name.polymorphic: this file shows the statistics on the polymorphic site, where "sample" is the sample name, "percentage_of_polymorphic_sites" is the percentage of sites that are suspected to be polymorphic, "avg_freq" is the average frequency of the dominant alleles on all polymorphic sites, "avg_coverage" is the average coverage at all polymorphic sites.
- clade_name.StrainPhlAn4_concatenated.aln: the alignment file of all metagenomic strains.
- clade_name.mutation: this file contains a mutation rates summary table for the concatenated MSA. This file will be generated if --mutation_rates param is specified.
- clade_name_mutation_rates: this folder contains the mutation rates table for each of the aligned markers. This folder will be generated if --mutation_rates param is specified.