-
Notifications
You must be signed in to change notification settings - Fork 86
MetaPhlAn 4.1
MetaPhlAn is a computational tool for species-level microbial profiling (bacteria, archaea, eukaryotes, and viruses) from metagenomic shotgun sequencing data. StrainPhlAn (available within MetaPhlAn) allows strain-level microbial population genomics.
MetaPhlAn 4.1, as of the database vJun23, relies on ~7.3M unique clade-specific marker genes (the latest marker information file can be found here) identified from >1.1M microbial genomes (>150k isolate genomes and >950k metagenome-assembled genomes) spanning 36,822 species-level genome bins (SGBs, http://segatalab.cibio.unitn.it/data/Pasolli_et_al.html), 11,062 of them taxonomically unidentified at the species level (the full list of the species included in the latest database can be found here).
MetaPhlAn 4.1 allows:
- unambiguous taxonomic assignments;
- accurate estimation of organismal relative abundance;
- SGB-level resolution for bacteria, archaea and eukaryotes;
- quantification of metagenomic reads matching viruses;
- metagenomic strain-level population genomics and strain tracking (with StrainPhlAn 4.1).
While the documentation below provides an overview on MetaPhlAn 4.1 basic usage, more usage examples are available in the MetaPhlAn 4.1 tutorial page.
MetaPhlAn requires python 3 or newer with numpy, and Biopython libraries installed. Python libraries are automatically installed by pip
.
MetaPhlAn relies on BowTie2 (version 2.3 or higher) to map reads against marker genes. Check that bowtie2
is present in the system path with execute and read permissions.
If MetaPhlAn is installed using conda, no pre-requisites are needed.
MetaPhlAn is integrated with advanced heatmap plotting with hclust2 and cladogram visualization with GraPhlAn. If you use such visualization tools please refer to their prerequisites.
The best way to install MetaPhlAn is through conda via the Bioconda channel. If you have not configured your Anaconda installation to fetch packages from Bioconda, please follow these steps to set up the channels.
You can install MetaPhlAn by running
$ conda install -c bioconda metaphlan
It is recommended to create an isolated conda environment and install MetaPhlAn into it.
$ conda create --name mpa -c bioconda python=3.7 metaphlan
If during the installation you encounter an incompatibility error with the glibc package, we suggest you to add the conda-forge
channel to conda or run one of the following commands.
$ conda install -c conda-forge -c bioconda metaphlan
$ conda create --name mpa -c conda-forge -c bioconda python=3.7 metaphlan
This allows having the correct version of all the dependencies isolated from the system's python installation.
Before using MetaPhlAn, you should activate the mpa
environment:
$ conda activate mpa
MetaPhlAn is also available in PyPi
$ pip install metaphlan
Alternatively, you can manually download from GitHub or clone the repository using the following command
$ git clone https://github.com/biobakery/MetaPhlAn.git
and install MetaPhlAn by running
$ pip install .
If you choose this way, you'll need to install manually some dependencies!
MetaPhlAn needs the clade markers and the database to be downloaded locally. To obtain them:
$ metaphlan --install
Important! The MetaPhlAn 4 database has been substantially increased in comparison with the previous 3.1 version. Thus, for running MetaPhlAn 4, a minimum of 25GB of memory is needed.
If you have installed MetaPhlAn using Anaconda, it is advised to install the database in a folder outside the Conda environment. To do this, run
$ metaphlan --install --bowtie2db <database folder>
If you install the database in a different location, remember to run MetaPhlAn using --bowtie2db <database folder>
!
By default, the latest MetaPhlAn database is downloaded and built. You can download a specific version with the --index
parameter
$ metaphlan --install --index mpa_vJan21_CHOCOPhlAnSGB_202103 --bowtie2db <database folder>
When --index
is specified, MetaPhlAn skips the check for the latest database version and run the analysis using the database version provided by --index
located in --bowtie2db
.
This option is recommended when MetaPhlAn is run on HPC clusters or containerized
If you have issues in downloading the database, you can get it from:
Just download the .tar, .md5, and the mpa_latest files and place them in the metaphlan_databases folder.
$ metaphlan metagenome.fastq --input_type fastq -o profiled_metagenome.txt
It is highly recommended to save the intermediate BowTie2 output for re-running MetaPhlAn extremely quickly (--bowtie2out
), and use multiple CPUs (--nproc
) if available:
$ metaphlan metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq -o profiled_metagenome.txt
If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you can obtain the results in few seconds by using the previously saved --bowtie2out
file and specifying the input (--input_type bowtie2out
):
$ metaphlan metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out -o profiled_metagenome.txt
bowtie2out
files generated with MetaPhlAn versions below 3.0 are not compatible. Starting from MetaPhlAn 3, the BowTie2 output now includes the size of the profiled metagenome.
You can also provide an externally BowTie2-mapped SAM if you specify this format with --input_type
.
Two steps here: first map your metagenome with BowTie2 and then feed MetaPhlAn with the obtained SAM:
$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x metaphlan_databases/mpa_vJun23_CHOCOPhlAnSGB_202307 -U metagenome.fastq
$ metaphlan metagenome.sam --input_type sam -o profiled_metagenome.txt
MetaPhlAn can also natively handle paired-end metagenomes (but does not use the paired-end information), and, more generally, metagenomes stored in multiple files (but you need to specify the --bowtie2out parameter):
$ metaphlan metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq -o profiled_metagenome.txt
Starting from version 3, MetaPhlAn can estimate the unclassified fraction of the metagenome. The relative abundance profile is scaled according to the percentage of reads mapping to a clade in the database.
$ metaphlan metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq --unclassified_estimation -o profiled_metagenome.txt
If you want to estimate the unknown fraction of a metagenome and your input file is a SAM file, remember to specify the metagenome size using --nreads
. You can easily get the metagenome size from a SAM file if you have run bowtie2 without the --no-unal
parameter by running
$ wc -l metagenome.sam
Otherwise, read_fastx.py
is your choice: this will print on the standard error the metagenome size.
$ read_fastx.py metagenome.fastq > /dev/null
It is possible to subsample the reads before the MetaPhlAn run by passing the number of reads to use (which must be < than the total number of reads of the sample) to --subsampling
. In the following example, subsampling to 10,000 reads:
$ metaphlan metagenome.fastq --input_type fastq --subsampling 10000 -o profiled_metagenome_subsampled_10000.txt
Since MetaPhlAn 4.1.1, it is possible to use paired-end information during subsampling (above, paired-end reads would be treated as single-end, i.e., independent). For that, use --subsampling_paired
instead:
metaphlan --subsampling_paired <N_PAIRED_READS> -1 <R1_FASTQ> -2 <R2_FASTQ> --input_type fastq --subsampling_out <SUBSAMPLED_READS_OUTPUT> -o <METAPHLAN_OUTPUT> --bowtie2out <BOWTIE2OUT>
You can provide the specific database version with --index
.
By default MetaPhlAn is run with --index latest
: the latest version of the database is used; if it is not available, MetaPhlAn will try to download it.
When --index
is specified, MetaPhlAn skips the check for the latest database version and run the analysis using the database version provided by --index
located in --bowtie2db
.
For advanced options and other analysis types (such as strain tracking) please refer to the full command-line options metaphlan --help
.
MetaPhlAn's repository features a few utility scripts to aid in the manipulation of sample output and its visualization. These scripts can be found under the utils
folder in the MetaPhlAn directory.
The script merge_metaphlan_tables.py allows to combine MetaPhlAn output from several samples to be merged into one table Bugs (rows) vs Samples (columns) with the table enlisting the relative normalized abundances per sample per bug.
To merge multiple output files, run the script as below
$ merge_metaphlan_tables.py metaphlan_output1.txt metaphlan_output2.txt > metaphlan_output3.txt output/merged_abundance_table.txt
Wildcards can be used as needed:
$ merge_metaphlan_tables.py metaphlan_output*.txt > output/merged_abundance_table.txt
Output files can be merged only if the profiling was performed with the same version of the MetaPhlAn database.
There is no limit to how many files you can merge.
The script sgb_to_gtdb_profile.py allows to convert a SGB-based MetaPhlAn 4 output into a GTDB-taxonomy-based profile.
To do so, run the script as below
$ sgb_to_gtdb_profile.py -i metaphlan_output.txt -o metaphlan_output_gtdb.txt
The script calculate_diversity.R allows to compute alpha and/or beta diversity, with different metrics of choice, starting from a merged MetaPhlAn table. Available alpha-diversity metrics are richness, shannon, simpson, and gini. Available beta-diversity distance functions are bray-curtis, jaccard, weighted-unifrac, unweighted-unifrac, centered log-ratio, and aitchison. For example, to generate a beta diversity distance matrix with bray-curtis, you need to run the script as below:
Rscript calculate_diversity.R -f merged_mpa4_profiles.tsv -d beta -m bray-curtis
To compute UniFrac distances, the SGB tree in the Newick format (available here) must be provided.
For the full list of options, please run:
Rscript calculate_diversity.R
The hclust2 script generates a hierarchically-clustered heatmap from MetaPhlAn abundance profiles. To generate the heatmap for a merged MetaPhlAn output table (as described above), you need to run the script as below:
hclust2.py \
-i HMP.species.txt \
-o HMP.sqrt_scale.png \
--skip_rows 1 \
--ftop 50 \
--f_dist_f correlation \
--s_dist_f braycurtis \
--cell_aspect_ratio 9 \
-s --fperc 99 \
--flabel_size 4 \
--metadata_rows 2,3,4 \
--legend_file HMP.sqrt_scale.legend.png \
--max_flabel_len 100 \
--metadata_height 0.075 \
--minv 0.01 \
--no_slabels \
--dpi 300 \
--slinkage complete
The tutorial of using GraPhlAn can be found from the GraPhlAn wiki.
In order to add a marker to the database, the user needs the following steps:
- Reconstruct the marker sequences (in fasta format) from the MetaPhlAn BowTie2 database by:
bowtie2-inspect metaphlan_databases/mpa_vJun23_CHOCOPhlAnSGB_202307 > metaphlan_databases/mpa_vJun23_CHOCOPhlAnSGB_202307_markers.fasta
- Add the marker sequence stored in a file new_marker.fasta to the marker set:
cat new_marker.fasta >> metaphlan_databases/mpa_vJun23_CHOCOPhlAnSGB_202307_markers.fasta
- Rebuild the bowtie2 database:
bowtie2-build metaphlan_databases/mpa_vJun23_CHOCOPhlAnSGB_202307_markers.fasta metaphlan_databases/mpa_vJun23_CHOCOPhlAnSGB_NEW
- Assume that the new marker was extracted from genome1, genome2. Update the taxonomy file from the Python console as follows:
import pickle
import bz2
db = pickle.load(bz2.open('metaphlan_databases/mpa_vJun23_CHOCOPhlAnSGB_202307.pkl', 'r'))
# Add the taxonomy of the new genomes
db['taxonomy']['7-levels taxonomy with clade names of genome1'] = ('7-levels NCBI taxonomy id of genome1', length of genome1)
db['taxonomy']['7-levels taxonomy with clade names of genome2'] = ('7-levels NCBI taxonomy id of genome1', length of genome2)
# Add the information of the new marker as the other markers
db['markers'][new_marker_name] = {
'clade': the clade that the marker belongs to,
'ext': {the GCA of the first external genome where the marker appears,
the GCA of the second external genome where the marker appears,
},
'len': length of the marker,
'taxon': the taxon of the marker
}
To see an example, try to print the first marker information:
print list(db['markers'].items())[0]
# Save the new mpa_pkl file
with bz2.BZ2File('metaphlan_databases/mpa_vJun23_CHOCOPhlAnSGB_NEW.pkl', 'w') as ofile:
pickle.dump(db, ofile, pickle.HIGHEST_PROTOCOL)
- To use the new database, remember to run metaphlan.py with the "--index mpa_vJun23_CHOCOPhlAnSGB_NEW" parameter.