kishori82 edited this page Nov 8, 2016 · 3 revisions

MetaPathways v3.0 Wiki

MetaPathways Logo

Welcome to the MetaPathways v3.0 wiki! Here we have a knowledge base detailing installation, usage, example use cases, known issues, and academic references to learn more details about the methods used.

Note: If you find any inconsistencies with this document, have questions, or have found additional installation and usage bugs, please raise an issue on GitHub.

Overview

MetaPathways v3.0 is a meta'omic analysis pipeline for the annotation and analysis of environmental sequence information. This release has a number of improvements over our previous pipeline, including:

  • a graphical user interface (GUI) for easier setup and process monitoring
  • interactive data visualization and data query via a custom Knowledge Engine data structure
  • compute grid tasks via a master-worker model of worker grids in an ad hoc, asynchronous, distributed network (available on command line)
  • refinements and bug fixes to the underlying Python code
  • automated calculation of reads per kilobase per million (RPKM) for every predicted open reading frame (ORF)

Technical details of the master-worker algorithm and Knowledge Engine data structures can be found in its associated publication. If using MetaPathways for academic work, please cite either:

Pipeline Overview

MetaPathways is composed of five general stages, encompassing a number of analytical or data handling steps (Figure 1):

  1. QC and ORF Prediction: Here MetaPathways performs basic quality control (QC), including removing duplicate sequences and trimming sequences. Open Reading Frame (ORF) prediction is then performed on the QC'ed sequences using Prodigal [2] or GeneMark [3]. The final translated ORFs are then also filtered according to a user-defined minimum length.
    • MetaPathways steps: PREPROCESS INPUT, ORF PREDICTION, and FILTER AMINOS
  2. Functional and Taxonomic Annotation: Using seed-and-extend homology search algorithms (B)LAST [4,5], MetaPathways can be used to conduct searches against functional and taxonomic databases.
    • MetaPathways steps: FUNC SEARCH, PARSE FUNC SEARCH, SCAN rRNA, and ANNOTATE ORFS
  3. Analyses: After sequence annotation, MetaPathways performs further taxonomic analyses including the Lowest Common Ancestor (LCA) algorithm [6] and tRNA Scan [7], and prepares detected annotations for environmental Pathway/Genome database (ePGDB) creation via Pathway Tools.
    • MetaPathways Steps: PATHOLOGIC INPUT, CREATE ANNOT REPORTS, and COMPUTE RPKM.
  4. ePGDB Creation: MetaPathways then predicts MetaCyc pathways [8] using the Pathway Tools software [9] and its pathway prediction algorithm PathoLogic [10], resulting in the creation of an environmental Pathway/Genome Database (ePGDB), an integrative data structure of sequences, genes, pathways, and literature annotations for integrative interpretation.
    • MetaPathways Steps: BUILD ePGDB
  5. Pathway Export: Here MetaCyc pathways or reactions are exported in a tabular format for downstream analysis. As of the v2.5 release, MetaPathways will perform this step automatically.
    • MetaPathways Steps: BUILD ePGDB

MetaPathways Overview

Figure 1. MetaPathways Overview. The pipeline consists of five general steps: Quality Control (QC) & Open Reading Frame (ORF)/Gene prediction, functional annotation, taxonomic analyses, environmental Pathway/Genome Database (ePGDB) creation, and Pathway Export.

Setup, Installation, and Configuration

We have endeavored to keep the system prerequisites minimal; however, MetaPathways has the following dependencies:

  • Unix-based operating system (e.g., OSX, Ubuntu)
  • Python 2.7
  • Pathway Tools - Academic users can apply for a free licence.

Further, the MetaPathways v2.5 download has two parts:

  1. The MetaPathways v2.5.1 GitHub Release: containing the Python source code, as well as compiled binaries and GUI executables for Mac OSX and Ubuntu
  2. The MetaPathwaysDBs.zip archive: containing the MetaCyc and COG sequence databases and support files for the various functional hierarchies (e.g., KEGG, COG, MetaCyc, CAZy, etc.)

Download and extract the above, and note where you install the metapathways2.5 and MetaPathwaysDBs directories on your system, as you’ll need these locations in a follow-up configuration step.

Mac OS X
  • Mount the MetaPathways2.dmg found in the GitHub repository and transfer the MetaPathways2 executable to the Applications folder.
  • Launch MetaPathways2 through the OSX Finder to begin.
Linux-based systems (Ubuntu)
  • Extract the MetaPathways2.Ubuntu.zip file (e.g., unzip MetaPathways2.Ubuntu.zip)
  • Run the MetaPathways2 executable to start the GUI with ./MetaPathways2.Ubuntu/MetaPathways2. You may have to change permissions (e.g., chmod 700 MetaPathways2.Ubuntu/MetaPathways2).
Windows Systems
  • Download the machine disk image MetaPathways_2_5.vmdk.zip (Approximately 5.0 GB).
  • Start the image with virtualization software such as VirtualBox
  • Run the MetaPathways2 executable to start the GUI with ./MetaPathways2.Ubuntu/MetaPathways2
  • You still have to obtain a licence and install Pathway Tools
  • If using VirtualBox, Ubuntu 14.04 runs poorly unless 'Enable 3D Acceleration' is checked in the Settings menu (Settings > Display Tab > Enable 3D Acceleration)

Notes:

  • Many of the reference protein (e.g., RefSeq, KEGG, SEED, etc.) or taxonomic (e.g., Silva, GreenGenes) databases are not included in the MetaPathways downloads due to their ever-growing size. See the section Obtaining Protein Sequence Databases for tips on obtaining the latest copies of protein sequence databases, including RefSeq [11].
  • MetaPathways also requires a variety of executables in order to run. If you have a system other than Ubuntu or OSX, it is possible that these will not work for you. In this case you’ll have to compile these from source for your system. The source code of these binaries can be found in executables/source of the GitHub repo. See the section on Compiling Custom Executables for more details.

Configuration

The MetaPathways GUI will attempt to populate a machine-specific configuration file (config/template_config.txt) to inform itself of the location of the executables. Occasionally this automatic generation can fail, so it should be double-checked for accuracy if errors related to missing files or directories are thrown. See the section Running MetaPathways from the Command Line for more details.

Installing Reference Databases

MetaPathways is designed around the use of the COG, KEGG, MetaCyc, CAZy, and RefSeq protein databases for ORF annotations and the GreenGenes and Silva taxonomic databases. These databases are not distributed with the software due to their ever-growing size; however, a tutorial detailing how to obtain them can be found in the Obtaining Protein Sequence Databases section. The MetaPathwaysDBs folder has the following structure:

  • MetaPathwaysDBs/
    • functional/ : location for functional protein databases (amino acid)
    • taxonomic/ : location for taxonomic databases (nucleotide)
    • functional_categories/ : functional hierarchy files for GUI
    • ncbi_tree/ : flat file of the NCBI taxonomic database and translation file

Raw fasta files of database sequences can be added to the functional/ and taxonomic/ directories. MetaPathways will format a sequence database for each of the fasta files, putting the formatted databases in the formatted sub-directory. Additional custom protein databases can be added by placing them in their respective folders. These additional sequences will be searched against; however, since many of the analytical features of MetaPathways are built around specific databases, not all downstream analytical steps can be applied to them.
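As a quick check of what MetaPathways is likely to find in a MetaPathwaysDBs folder, the layout above can be inspected with a few lines of standalone Python 3. This is only a sketch and not part of the MetaPathways code base: the function name list_raw_databases and the set of fasta extensions are assumptions made for illustration, since this page does not state which file extensions the pipeline recognises.

```python
from pathlib import Path

# Assumed fasta extensions; adjust to match how your databases are named.
FASTA_SUFFIXES = {".fasta", ".fa", ".faa", ".fna"}


def list_raw_databases(db_root):
    """Return the raw fasta file names found in functional/ and taxonomic/."""
    found = {}
    for kind in ("functional", "taxonomic"):
        folder = Path(db_root) / kind
        names = []
        if folder.is_dir():
            for entry in sorted(folder.iterdir()):
                # Sub-directories (such as formatted/) and other files are skipped.
                if entry.is_file() and entry.suffix.lower() in FASTA_SUFFIXES:
                    names.append(entry.name)
        found[kind] = names
    return found
```

Calling list_raw_databases("MetaPathwaysDBs") returns a dictionary with one list of file names per database type, which can be compared against the databases offered in the GUI.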

Obtaining Protein Sequence Databases

Sequence homology searches via BLAST or LAST require curated taxonomic or functional reference databases to conduct functional and taxonomic annotation. Here we state a few tips for obtaining sequences for these databases.

Functional:

Taxonomic:

Running MetaPathways with the GUI

The MetaPathways GUI enables the setup, management, and inquiry of MetaPathways results. It is a C++ application written with the Qt framework, which allows it to compare hundreds of samples simultaneously. It consists of five tabs from which to navigate MetaPathways:

  • Setup: Configures MetaPathways for a particular system. Here the installed locations of Python, Pathway Tools, PGDB directory, Sequence Databases, and the MetaPathways Python code base are supplied for a particular system.
  • Parameters: Allows you to specify run parameters for a number of MetaPathways stages.
  • Stages: Specifies which stages of MetaPathways should be run
  • Run: Starts and Monitors the completion of a MetaPathways run
  • Results: Allows for the interactive query and export of results for downstream analysis.

Setup

MetaPathways needs to know the locations of certain resources and directories in order to be properly configured to run (Figure 2).

  • MetaPathways Directory: Location of the metapathways2/ GitHub code base folder (required)
  • Database Directory: Location of the MetaPathwaysDBs directory (required)
  • Directory for OS-specific executables: Specifies the set of executables found in the executables/ folder appropriate for your system
  • Python Executable: The location of the Python programming language executable, usually /usr/bin/python on OSX or Unix systems. Must be Python v2.7 or higher.
    • The command which python will check for it in your system PATH variable.
  • PGDB Folder Path: Location of Pathway Tools' ptools-local/pgdbs/user folder. By default this is in the home directory (e.g., ~/ptools-local/pgdbs/user/). Completed ePGDBs will be placed in this folder.
  • Pathway Tools Executable: Location of the Pathway Tools executable. By default this is ~/pathway-tools/pathway-tools

Green check marks appear beside resources and directories that MetaPathways believes to be correctly specified. When finished configuring the above, click the Validate button at the bottom of the page. This will generate a pop-up window identifying any foreseeable problems. If all configurations are validated, press the ‘Save’ button; this will update template_config.txt with the current settings.

Notes:

  • It's actually possible to use MetaPathways just to visualize previous runs without performing this configuration. Just click proceed and ignore the pop-up warning windows. See the section on Interacting with Results for more information.

MetaPathways v2.5 Setup

Figure 2: MetaPathways v2.5 Setup. MetaPathways needs certain system resources specified in order to run. These include the MetaPathways v2.5 Python code base (required), the Database directory (required), the system's Python executable, Pathway Tools' PGDB Folder Path, and the Pathway Tools Executable. Click the 'Validate' button when all the fields are specified and have green check marks.

Run Parameters

Many steps have parameters or thresholds that control their behavior. In general these thresholds specify the databases that will be searched against, or the permissiveness of quality control and annotation (Figure 3). Parameters are set up with sensible defaults based on our own experience and literature sources, but you should consider these settings in the context of your own research questions. Defaults can be quickly restored using the ‘Restore Defaults’ button.

Quality Control: These settings affect the quality control steps applied to the input nucleotide sequences.

  • Minimum Length: Specifies the shortest nucleotide sequence that MetaPathways will process (default: 180 nucleotides).
  • Delete Replicates: Exact duplicate sequences (both name and sequence) in the input fasta files will be removed (default: yes).
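
The effect of these two settings can be illustrated with a short standalone Python 3 sketch. It is not the MetaPathways implementation: the function name quality_filter and the representation of sequences as (name, sequence) pairs are assumptions for illustration, and the real PREPROCESS INPUT step applies further checks that are not modelled here.

```python
def quality_filter(records, min_length=180, delete_replicates=True):
    """Filter (name, sequence) pairs by length and drop exact duplicates."""
    seen = set()
    kept = []
    for name, seq in records:
        if len(seq) < min_length:
            continue  # shorter than the Minimum Length threshold
        key = (name, seq)
        if delete_replicates and key in seen:
            continue  # same name and same sequence were already kept
        seen.add(key)
        kept.append((name, seq))
    return kept
```

With the default thresholds, a 200-nucleotide record is kept once even if it appears twice in the input, while a 4-nucleotide record is dropped.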

ORF Prediction: These parameters affect open reading frame prediction via Prodigal or Genewise.

  • Prediction Algorithm: This selects the default ORF prediction algorithm (default: prodigal for prokaryotic sequences). genewise can be used for eukaryotic sequences.
  • Translation Table: Selects a nucleotide translation table to translate ORFs with (default: 11 --- Bacterial, Archaeal and Plant Plastid Code).
  • Minimum Length: Specifies the minimum length of predicted ORF to include in the database searches (default: 60 amino acids)

ORF Annotation Parameters: Here protein databases, search algorithms, and hit quality parameters are specified for ORF annotation.

  • Databases: Selects which protein databases to search against. This will list all available databases in the functional folder of the MetaPathwaysDBs directory.
  • Algorithm: Which homology search algorithm to use, LAST or BLAST (default: LAST).
    • Minimum BSR: Specifies the minimum blast-score ratio (BSR) for hits. This score is the ratio of a hit's bit-score to the bit-score of a perfect hit, i.e., the query searched against itself (default: 0.4). This normalizes the bit-score for sequence length.
    • Minimum Score: minimum acceptable bit-score (default 20).
    • Maximum Hits: maximum number of hits to report per sequence (default: 5).
    • Minimum Length: minimum match length for a hit (default: 60 aa)
    • Maximum E-value: maximum expectation value for a hit (default: 0.000001). This represents the number of matches at least as good as the one observed that would be expected by random chance. Highly dependent on database size.

Functional Bit-score ratio and E-value default thresholds were motivated by the following papers:

  • B. Rost, Twilight zone of protein sequence alignments. Protein Eng. 12, 85–94 (1999).
  • R. Carr, E. Borenstein, C. Gibas, Ed. Comparative Analysis of Functional Metagenomic Annotation and the Mappability of Short Reads. PLoS ONE 9, e105776 (2014).
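
To make the interaction of the ORF annotation thresholds concrete, the following standalone Python 3 sketch computes a blast-score ratio and applies the default cut-offs listed above. The function names and the dictionary keys (bit_score, length, evalue) are hypothetical and chosen for illustration only; they do not reflect the column layout of MetaPathways output files.

```python
def blast_score_ratio(bit_score, self_bit_score):
    """Bit-score of a hit divided by the bit-score of the query against itself."""
    if self_bit_score <= 0:
        raise ValueError("self bit-score must be positive")
    return bit_score / self_bit_score


def passes_thresholds(hit, self_bit_score, min_bsr=0.4, min_score=20,
                      min_length=60, max_evalue=1e-6):
    """Return True if a hit satisfies all four quality thresholds."""
    return (
        blast_score_ratio(hit["bit_score"], self_bit_score) >= min_bsr
        and hit["bit_score"] >= min_score
        and hit["length"] >= min_length
        and hit["evalue"] <= max_evalue
    )
```

For example, a hit with a bit-score of 120 against a query whose self-hit scores 200 has a BSR of 0.6 and passes the default minimum of 0.4, whereas the same hit for a query whose self-hit scores 400 has a BSR of 0.3 and is rejected.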

rRNA Annotation Parameters: analogous to the above ORF Annotation parameters, these specify search thresholds for taxonomic rRNA annotation.

  • Databases: Selects which taxonomic rRNA databases to search against. This will list all available databases in the taxonomic/ folder of the MetaPathwaysDBs directory.
    • Minimum Identity: Specifies the minimum percent sequence identity for a match (default: 20%)
    • Maximum E-value: Analogous to the functional search parameter above, this specifies the maximum E-value for taxonomic hits (default: 0.000001).
    • Minimum Bit-score: Specifies the minimum bit-score for taxonomic hits (default: 50).

Pathway Tools Settings: Here settings related to ePGDB construction via the Pathway Tools software are set.

  • Taxonomic Pruning: Specifies whether the PGDB should be built with taxonomic pruning enabled (default: no).

For more information on taxonomic pruning and the PathoLogic algorithm and interpreting MetaCyc predicted pathways in meta'omic samples see:

  • P. D. Karp, M. Latendresse, R. Caspi, The pathway tools pathway prediction algorithm. Stand Genomic Sci 5, 424–429 (2011).
  • R. Caspi, K. Dreher, P. D. Karp, The challenge of constructing, classifying, and representing metabolic pathways. FEMS Microbiol Lett 345, 85–93 (2013).
  • T. Altman, M. Travers, A. Kothari, R. Caspi, P. D. Karp, A systematic comparison of the MetaCyc and KEGG pathway databases. BMC Bioinformatics 14, 112 (2013).
  • N. W. Hanson et al., Metabolic pathways for the whole community. BMC Genomics 15, 619 (2014).

MetaPathways Run Parameters

Figure 3: MetaPathways Run Parameters. The Parameters tab allows you to specify different parameters for a run, controlling permissiveness of sequence quality control and annotation, as well as database selection and other settings.

Stages

This section specifies input and output directories, as well as which analytical stages to perform for a given run of MetaPathways (Figure 4).

Input/Output Folders: Here input and output directories are specified. The input directory should contain the input files, and the output directory is where the results folders will be created. One output folder will be created for each input file.

  • File Inputs: Specifies the directory containing the input files. MetaPathways will detect whether each input file is a nucleotide or protein fasta file, an annotated GenBank file (gbk-annotated), or an unannotated GenBank file (gbk-unannotated) and proceed accordingly.
  • Output Folder: Specifies an output directory to place the processed results. Sub-directories will be created for each input file.
  • Select Samples: Specifies which samples in the input directory to run against.

Note: Input file names must begin with a letter and may not contain any period characters (.). If a file's name does not follow these rules, it will not be possible to select that sample until the file has been renamed.
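
Input folders can be checked against this naming rule before launching the GUI. The standalone Python 3 sketch below is one possible check, not MetaPathways code. It assumes that the rule applies to the sample name, meaning the file name without its final extension (such as .fasta); that interpretation and the function name is_valid_sample_file are assumptions made for illustration.

```python
import os
import re


def is_valid_sample_file(file_name):
    """Check that the sample name starts with a letter and contains no periods."""
    # Assumption: the final extension (e.g. .fasta) is not part of the sample name.
    sample_name, _extension = os.path.splitext(os.path.basename(file_name))
    return re.fullmatch(r"[A-Za-z][^.]*", sample_name) is not None
```

Under this reading, my_sample.fasta is accepted, while 1sample.fasta (starts with a digit) and my.sample.fasta (extra period) would need to be renamed.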

Pipeline Execution Steps: This specifies which analytical steps will be run. Each stage has one of three basic settings:

  • Run - runs the step if no existing output is detected
  • Skip - skips the step
  • Redo - runs the step and overwrites existing output if present
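
The three settings reduce to a simple decision for each step, sketched below in standalone Python 3. The function name should_execute is hypothetical, and how MetaPathways actually detects existing output is not modelled here; the output_exists flag stands in for that check.

```python
def should_execute(setting, output_exists):
    """Decide whether a pipeline step runs, given its setting and existing output."""
    setting = setting.lower()
    if setting == "skip":
        return False
    if setting == "redo":
        return True  # always runs and overwrites existing output
    if setting == "run":
        return not output_exists  # runs only when no output is detected
    raise ValueError("unknown setting: %r" % setting)
```

For instance, should_execute("Run", output_exists=True) returns False, which is why a step left on Run is not repeated when a run is resumed.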

Note:

  • MetaPathways can resume after an incomplete run, but users should be wary of incomplete outputs; it's best to redo the last step that ran when restarting.

We now describe each of the pipeline steps in some detail.

  • PREPROCESS_INPUT: This step filters the input nucleotide sequences by minimum length, presence of ambiguous nucleotide positions, and duplication, according to the specified parameters. Filtered sequences are placed in the preprocessed output folder.
  • ORF_PREDICTION: Genes are predicted using the specified algorithm on the quality-controlled input. The gene prediction algorithm produces a nucleotide fasta file and a general feature format (gff) file of predicted ORFs in the orf_prediction output folder.
  • FILTER_AMINOS: Resulting GFF files produced by the gene prediction algorithm are translated into the amino acid sequences via the selected translation table in the orf_prediction folder (.faa). The amino acid ORF sequences are then filtered by the specified length parameter (.qced.faa).
  • FUNC_SEARCH: The quality-controlled ORFs are first searched against themselves using the specified search algorithm (BLAST or LAST) to obtain a perfect bit-score, which will be used in the calculation of the bit-score ratio (BSR). The resulting bit-score files are placed in the blast_results folder (.refscore.[algorithm]). Next, the quality-controlled ORFs are searched against the functional reference database sequences. This step can be computationally expensive and can be performed on multiple grids using a compatible batch processing system if configured. Otherwise, the searches will be performed locally. The results are put in tabular form in the blast_results folder, with individual files for each of the databases ([samplename].[dbname].[algorithm]out, e.g., mysample.refseq_2013.BLASTout).
  • PARSE_FUNC_SEARCH: Functional search results from the blast_results folder are parsed and filtered by search quality criteria (e.g., BSR, E-value), resulting in a parsed results file ([dbname].[algorithm]out.parsed.txt, e.g., refseq.BLASTout.parsed.txt).
  • SCAN rRNA: Pre-processed nucleotide sequences in the preprocessed folder are searched with BLAST against the selected rRNA databases, such as Silva and GreenGenes. The hits passing the specified quality thresholds (i.e., E-value, length, identity, bit-score) are put in the results/rRNA/ folder.
  • SCAN tRNA: The filtered nucleotide sequences from the preprocessed folder are scanned with tRNA-Scan to find the tRNAs. Results are placed in the results/tRNA folder.
  • ANNOTATE ORFS: Annotations found by (B)LAST in the FUNC SEARCH step are captured and placed into the .gff file. The annotations in the .parsed.txt files are scanned, and for each ORF, all annotations are stored in a .gff file in the orf_prediction folder. This .gff file is used to create downstream reports and inputs in subsequent steps.
  • PATHOLOGIC INPUT: The functional annotations of the ORFs stored in the .gff file are reformatted into the pathologic format and placed in the ptools directory.
  • CREATE ANNOT REPORTS: This is a large step that summarizes the ORF annotations on the KEGG, COG, and MetaCyc functional hierarchies, as well as the NCBI Taxonomy Tree via the LCA algorithm. This step can take some time to run, but uses out-of-core computing techniques to keep memory requirements to a minimum. The result files are placed in the results/annotation_table folder. Additionally, the functional annotations of the ORFs stored in the .gff file are reformatted into the GenBank (.gbk) format and placed in the genbank directory.
  • BUILD ePGDB: The pathway prediction algorithm PathoLogic in the Pathway Tools software is run with the ptools folder as the input. The result of this step is an ePGDB (environmental Pathway/Genome Database), which is placed in the ~/ptools-local/pgdbs/user folder. ePGDBs can be viewed using the Pathway Tools software.
  • COMPUTE RPKM: If paired-end fastq files (interleaved or separate forward and reverse read files) containing the original sequence reads are available, MetaPathways will use bwa [12] to recruit the short reads to the ORFs detected in the input contigs and compute the reads per kilobase per million mapped (RPKM) statistic. Alternatively, a SAM file can be supplied, and MetaPathways will use it to create RPKM values for each ORF. The results are placed in the results/rpkm folder and are used by the tool for reporting RPKM values. To use this feature:
    • create a subdirectory within your input file directory entitled reads
    • place either .fastq (interleaved or separate forward and reverse files) or .SAM files with the same name as the corresponding samples within the newly created reads directory. For example, if the input contig file is named my_samples_contigs.fasta, the corresponding file in the reads directory would be named my_samples_contigs.fastq or my_samples_contigs.SAM. MetaPathways will automatically detect these fastq or SAM files and calculate RPKM on the appropriate samples. It is also possible to include multiple fastq files.

The input reads (in the form of fastq files) for this step must be added to the reads subdirectory of the input folder (where the input fasta files are located). The read files are identified by their file names. For example, if the sample name is "abcd", then the following read files in the reads folder are associated with the sample abcd:

  1. abcd.fastq: non-paired reads
  2. abcd.b1.fastq: unpaired reads from batch b1
  3. abcd_1.fastq and abcd_2.fastq: paired reads for the sample
  4. abcd_1.fastq or abcd_2.fastq: only one end of a paired read
  5. abcd_1.b2.fastq and abcd_2.b2.fastq: paired reads from batch b2; note that batches are identified as bn, where n is a number
  6. abcd_1.b3.fastq or abcd_2.b3.fastq: only one end of a paired read from batch b3
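
The naming convention above can be expressed as a single regular expression. The standalone Python 3 sketch below classifies read files for a sample and summarizes each batch; it is an illustration of the convention as described on this page, not the parser MetaPathways itself uses, and the function names and the labels "paired", "unpaired", and "one end only" are hypothetical.

```python
import re


def classify_read_file(sample, file_name):
    """Return the read end (1, 2 or None) and batch number (or None) of a fastq file.

    Returns None if the file does not belong to the sample.
    """
    pattern = re.escape(sample) + r"(?:_(?P<end>[12]))?(?:\.b(?P<batch>\d+))?\.fastq"
    match = re.fullmatch(pattern, file_name)
    if match is None:
        return None
    end = int(match.group("end")) if match.group("end") else None
    batch = int(match.group("batch")) if match.group("batch") else None
    return {"end": end, "batch": batch}


def summarize_batches(sample, file_names):
    """Map each batch (None for files without a batch tag) to its read layout."""
    ends_by_batch = {}
    for name in file_names:
        info = classify_read_file(sample, name)
        if info is not None:
            ends_by_batch.setdefault(info["batch"], set()).add(info["end"])
    layout = {}
    for batch, ends in ends_by_batch.items():
        if {1, 2} <= ends:
            layout[batch] = "paired"
        elif None in ends:
            layout[batch] = "unpaired"
        else:
            layout[batch] = "one end only"
    return layout
```

For the sample abcd, the files abcd_1.fastq and abcd_2.fastq are reported as a paired set without a batch, abcd.b1.fastq as unpaired reads in batch 1, and a lone abcd_1.b3.fastq as one end only in batch 3.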

Note:

  • MetaPathways uses BWA Version: 0.6.1-r104 'MEM' to align reads. If you would prefer to use a different algorithm or version the alignment can be done outside of MetaPathways and the resulting .SAM file can be supplied in the reads subdirectory.
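
For readers who supply their own alignments, it can help to see the RPKM statistic written out. The standalone Python 3 sketch below uses the standard definition: reads mapped to an ORF, divided by the ORF length in kilobases and by the total number of mapped reads in millions. The function name rpkm is hypothetical, and whether MetaPathways normalizes by mapped reads or by all reads in the sample is an assumption here, based on the phrase "per million mapped" above.

```python
def rpkm(reads_on_orf, orf_length_nt, total_mapped_reads):
    """Reads per kilobase of ORF per million mapped reads."""
    if orf_length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("ORF length and total mapped reads must be positive")
    # Equivalent to reads / (length / 1e3) / (total / 1e6).
    return reads_on_orf * 1e9 / (orf_length_nt * total_mapped_reads)
```

As a worked example, 50 reads on a 1,000-nucleotide ORF in a sample with 1,000,000 mapped reads gives an RPKM of 50, and 10 reads on a 500-nucleotide ORF with 2,000,000 mapped reads gives an RPKM of 10.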

Stages

Figure 4: Stages. Here one can specify input and output directories, select samples, and decide which analytical stages to run.

Run

Here you can launch and monitor a MetaPathways run.

A run is launched via the 'Run' button, respecting the configuration, parameters, and run steps set in the other tabs. Clicking the Verbose Run Log button sets MetaPathways to run in verbose mode, which will print the full commands for each stage as it runs in the Run Log below (this is useful for debugging should something go wrong).

Monitoring a MetaPathways Run

The run and cancel buttons execute and cancel a MetaPathways run, respectively. The Execution Summary, Progress Log, Run Log, and Error Log show the status and progress of the sample currently selected in the drop-down menu (Figure 5). The progress of each sample can be viewed by changing the sample selected from the drop down menu in the upper right hand corner.

Sample Progress, Execution Summary, Run Log, and Error Log: MetaPathways will visualize the total percentage of the run's completion as a progress bar, and the result of each stage is displayed in the Execution Summary and Progress Log sections. Results for each stage are classified based on their current state of completion:

  • Green Checkmark: Results complete
  • Black Ellipses: Stage currently running
  • Orange Exclamation Mark: Results incomplete.
  • Black Question mark: Unknown or undetected results
  • Red Cross: The stage threw an error; check the Show Errors button or Command Log for more information

The Progress Log and Command Log show, in real time, the progress of the run and the Python commands being executed. The Error Log will display all warnings and errors encountered by MetaPathways in sequential format for each sample being run.

Note:

MetaPathways processes samples in blocks of stages for increased parallel efficiency when running on grids or cloud compute nodes:

  • Block 0 (PREPROCESS INPUT, ORF PREDICTION, and FILTER_AMINOS), then
  • Block 1 (FUNC SEARCH) and finally
  • Block 2 (PARSE_FUNC_SEARCH, SCAN rRNA, SCAN tRNA, ANNOTATE ORFS, PATHOLOGIC INPUT, CREATE ANNOT REPORTS, BUILD ePGDB, and COMPUTE RPKM).
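
One way to picture this block-wise processing is as a block-major loop: every sample finishes the stages of one block before any sample starts the next. The standalone Python 3 sketch below generates that order. It is an interpretation of the note above rather than the scheduler MetaPathways uses; the function name block_schedule and the per-sample ordering within a block are assumptions.

```python
# Stage names are copied from the block list above.
BLOCKS = [
    ["PREPROCESS INPUT", "ORF PREDICTION", "FILTER_AMINOS"],
    ["FUNC SEARCH"],
    ["PARSE_FUNC_SEARCH", "SCAN rRNA", "SCAN tRNA", "ANNOTATE ORFS",
     "PATHOLOGIC INPUT", "CREATE ANNOT REPORTS", "BUILD ePGDB", "COMPUTE RPKM"],
]


def block_schedule(samples, blocks=BLOCKS):
    """Yield (block_index, sample, stage) tuples in block-major order."""
    for block_index, stages in enumerate(blocks):
        for sample in samples:
            for stage in stages:
                yield block_index, sample, stage
```

With two samples, the first six entries of list(block_schedule(["s1", "s2"])) belong to Block 0, so the expensive FUNC SEARCH stage for either sample starts only after both have been preprocessed.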

Monitoring a MetaPathways Run

Figure 5: Monitoring a MetaPathways Run. The Run tab is where an active run is launched and monitored. A run is started with the Run button, and the Execution Summary, Progress Log, Run Log, and Error Log allow real-time progress to be monitored.

Results

This tab allows the interactive and comparative query of pipeline results (Figure 6).

MetaPathways uses a custom 'Knowledge Engine' data structure to drive multi-sample comparisons on the KEGG, COG, MetaCyc, SEED, and CAZy hierarchies. Enabling Comparative Mode will allow the user to view these comparisons in real-time:

  • Add and drop output directories: Using the '+' allows the user to browse to one or more directories that contain MetaPathways output results. The '-' allows the user to drop a previously specified directory. Since this tab only executes read commands from disk, it can be used to observe samples as they are completed by MetaPathways. Clicking the button while samples are running will update the reports with the current results.
  • Select Samples: This button brings up a window in which to select samples for comparative mode. The drop down menu allows you to select an individual sample for analysis.
  • Single-sample/Comparative Mode: This drop down menu allows the user to toggle between single-sample and comparative mode. Selecting multiple samples here will allow for comparison of functional annotations across the KEGG, COG, MetaCyc, SEED, and CAZy functional hierarchies if those databases were searched.

Viewing Results

Figure 6: Viewing Results. The Results Tab allows the interactive query, comparison and export of annotations obtained by MetaPathways.

Running MetaPathways from the Command Line

The MetaPathways GUI is essentially a front end for launching the pipeline from the command line. The basic structure of the command looks like this:

$ source MetaPathwaysrc
$ python MetaPathways.py -i input_folder/ -o output_folder/ -c template_config.txt -p template_param.txt [-v] [-r overlay] [-s samplename]

where,

  • -i: specifies the folder of input files
  • -o: specifies the output directory where results will be created
  • -c: specifies the configuration file template_config.txt
  • -p: specifies the parameter file template_param.txt
  • -v: runs MetaPathways in verbose mode printing all the commands as they are executed
  • -r: specifies the mode for MetaPathways to run
    • overlay: keeps existing results if present, only overwriting them if a step is set to redo (default)
    • overwrite: overwrite results in output folder (dangerous)
  • -s: specifies a specific sample name to run

Aside from the ability to specify specific samples to run with the -s option, usage is almost identical to the MetaPathways v1.0 pipeline.
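
When scripting many runs, it can be convenient to assemble this command line programmatically. The standalone Python 3 sketch below builds the argument list from the options documented above; the helper name build_command and its keyword arguments are hypothetical, and only the flags listed in this section are used.

```python
def build_command(input_dir, output_dir, config="template_config.txt",
                  params="template_param.txt", verbose=False,
                  run_type=None, sample=None, python="python"):
    """Assemble the MetaPathways.py argument list from the documented flags."""
    cmd = [python, "MetaPathways.py",
           "-i", input_dir, "-o", output_dir,
           "-c", config, "-p", params]
    if verbose:
        cmd.append("-v")
    if run_type is not None:
        if run_type not in ("overlay", "overwrite"):
            raise ValueError("run_type must be 'overlay' or 'overwrite'")
        cmd.extend(["-r", run_type])
    if sample is not None:
        cmd.extend(["-s", sample])
    return cmd
```

The resulting list can be passed to subprocess.call from a shell session in which MetaPathwaysrc has already been sourced, for example subprocess.call(build_command("input_folder/", "output_folder/", sample="abcd")).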

Interacting with Results

As briefly mentioned above, the Results section can be used to query and export annotation information from individual or multiple MetaPathways runs. To load results specified in the current output directory (see Stages tab), click the 'Load/Reload sample' button.

Individual Samples

To view an individual sample, select it from the ‘Select sample’ dropdown menu. A number of tabs showing different results should show up in the main window:

  • RUN STATS: Displays various sequence and annotation statistics for the currently loaded sample.
  • CONT LEN HIST: Density plot or histogram of post-QC reads/contigs
  • ORF LEN HIST: Density plot or histogram of post-QC open reading frames
  • Functional Hierarchies: MetaPathways populates a number of functional hierarchies with annotations. The annotations can be summarized as ORF count or RPKM statistics at various depths of the hierarchies. Displayed annotation tables can be exported as tab-delimited or comma-delimited text files, along with the underlying fasta sequences.
    • KEGG: Kyoto Encyclopedia of Genes and Genomes
    • COG: Clusters of Orthologous Genes
    • MetaCyc: The MetaCyc database of genes and pathways
    • SEED: The SEED subsystems functional hierarchy
    • CAZy: Carbohydrate Active Enzymes database
  • FUNC TAX: The Functional and Taxonomic table provides statistics and annotation status for all predicted ORFs, including annotation, LCA taxonomy, and EC number if available. The table has robust search, subsetting, and export capacity.
  • rRNA Annotation Tables: Annotations against the GreenGenes or Silva SSU or LSU databases are displayed as separate tabs.
  • tRNA Scan: The output table of tRNA-Scan is summarized

Sequence and Annotation Statistics (RUN STATS)

Provides the following statistics for the current sample:

  • Nucleotide Sequences: min, average, max, and total lengths of input nucleotide sequences before and after filtering.
  • Amino Acids (ORFs): min, average, max, and total lengths of predicted ORF amino acid sequences before and after filtering.
  • Total Protein Annotations: Total number of annotations found for each database (including multiple hits per ORF)
  • Protein Annotations: Number of annotated ORFs in each searched database.
  • Total Protein Annotations: Number of annotated ORFs across all databases (no double counting, same ORF could be annotated in multiple databases)
  • Taxonomic Hits: Number of annotations found in each taxonomic database.

Length Densities (CONT LEN HIST and ORF LEN HIST)

These are density plots of lengths for post-QC nucleotide and amino acid sequences for the current sample. The plot is interactive and can be scaled with the mouse. If the sample becomes too sparse, the density plot will automatically change to a histogram with a dynamically scaled bin width. Additionally, the currently displayed image can be exported as a PNG via the 'Export' button.

Functional Hierarchies (KEGG, COG, etc.)

The annotations of the current sample can be explored in the context of various functional hierarchies in a hierarchical table. This table has many features that allow you to subset and interact with the underlying annotations. A number of controls allow you to navigate a given hierarchy:

  • 'Level' Toggle: Allows one to specify the level of the current hierarchy by its depth.
  • 'Hide Zero Rows': Hides annotations that found no hits.
  • 'Show Hierarchy': shows the levels of the hierarchy stepwise on the left-hand-side. Useful for navigation at lower levels; not necessarily a useful form for export and downstream analysis.
  • Displayed Statistics Drop-down: This drop-down menu, next to the ‘Show Hierarchy’ checkbox, allows one to change the unit of the displayed counts. ORF Count is always available for a typical run of MetaPathways. If raw reads were provided along with assembled contigs, ‘RPKM’ counts (Reads Per Kilobase per Million mapped reads) can also be displayed. On the MetaCyc hierarchy, some specialized pathway statistics related to ‘base pathways’ become available:
    • 'BasePwy ORF': Pathways occur at various levels of the MetaCyc hierarchy. Since this is a useful level of comparison, this option flattens the hierarchy to the base pathway level.
    • 'BasePwy RPKM': Pathways occur at various levels of the MetaCyc hierarchy. Since this is a useful level of comparison, this option flattens the hierarchy to the base pathway level and displays the RPKM values (if computed).
    • 'BasePwy RxnTot': Specifies the number of reactions for each MetaCyc base pathway
    • 'BasePwy RxnCov': Specifies the number of reactions for each MetaCyc pathway covered by an annotation.
  • 'Export': Exports the current table (as a tab-delimited or comma-separated text file) and the underlying nucleotide contigs/reads or ORF sequences associated with the annotations (.fasta). Table columns can be selectively exported for convenience. Additionally, MEGAN-compatible files can also be exported.
  • 'Search': Allows current table to be filtered by keyword search and basic boolean (AND/OR) logic.

Count cells in a functional hierarchy have double-click and right-click features:

  • Double-click: Opens a new hierarchy window focusing on the underlying ORFs found in that cell. From here, these annotations can be compared against results from different functional hierarchies using the drop-down menu in the left-hand corner.
  • Right-click: Opens a new window with the Functional and Taxonomic table containing the underlying annotations of the selected cell. This is useful for taking a closer look at the underlying annotations or exporting the representative sequences for further downstream analysis (e.g., functional tree building, specialized or more intensive homology searches)

Notes:

  • The primary navigation method is the ‘Level’ toggle, which allows you to move up and down the hierarchy, increasing or decreasing the granularity of the annotations.
  • Many of the functional hierarchies place annotations in multiple positions, meaning that an annotated ORF can be mapped to multiple positions on the same tree.
  • As the CAZy hierarchy is large and complex and was specifically created for visualization in MetaPathways, it is not possible to show rows containing only zeros.
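
To make the ‘RPKM’ unit mentioned above concrete, here is a minimal sketch of the standard RPKM formula: reads mapped to an ORF, divided by the ORF length in kilobases and by the total number of mapped reads in millions. The function and parameter names are assumptions for this example, and how MetaPathways counts mapped reads internally is not described in this section.

```python
def rpkm(reads_mapped_to_orf, orf_length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of ORF per million mapped reads."""
    if orf_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("ORF length and total mapped reads must be positive")
    kilobases = orf_length_bp / 1_000
    millions_of_reads = total_mapped_reads / 1_000_000
    return reads_mapped_to_orf / (kilobases * millions_of_reads)


# Toy numbers: 50 reads on a 1,000 bp ORF in a sample with 2 million mapped reads.
print(rpkm(50, 1_000, 2_000_000))
```

Because the value is normalized by both ORF length and sequencing depth, RPKM counts are more comparable across ORFs and samples than raw ORF counts.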

Functional and Taxonomic Table (FUNC TAX)

This is a master table listing all predicted ORFs and their final annotation (as decided by sequence homology statistics and the entropy-based information criterion). The table scales to millions of annotations by effective memory cycling. Each ORF has a number of fields with physical statistics and annotations:

  • 'ORF_ID': an identifier for each predicted ORF in a sample, of the form <sequence#><orf#>, where <sequence#> uniquely identifies the contig or read sequence and <orf#> numbers the predicted ORFs in order from the 5’-end of the '+' strand
  • ORF_length: The ORF length in nucleotides
  • start: ORF starting position (relative to strand)
  • end: ORF ending position (relative to strand)
  • Contig_name: unique identifier of each read or contig <sequence#>
  • Contig_length: length of the read or contig in nucleotides
  • strand: forward (+) or backward (-) orientation of the ORF
  • ec: Enzyme Commission annotation (if present)
  • taxonomy: LCA taxonomy of the ORF annotation
  • product: annotation for the ORF based on best functional annotation across all searched databases

The Functional and Taxonomic table also has a number of analytical features:

  • ‘Search’: Similar to the search in the Functional Hierarchies, this button opens a window allowing basic boolean keyword search for subsetting annotations in particular columns.
  • ‘Export’: Similar to the Functional Hierarchies, a custom tab- or comma-separated text file can be exported for downstream analyses. Additionally, the underlying sequences for the reads/contigs and ORFs can be exported as fasta files.
  • ‘Function’: Launches a Functional Hierarchy window, projecting ORFs currently in the Functional and Taxonomic table onto a functional hierarchy of choice. This is particularly useful for isolating function based on keyword searches of particular functional and taxonomic annotations.
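
The same kind of keyword subsetting can be reproduced on an exported table outside the GUI. The sketch below assumes a tab-delimited export with a header row using the field names listed above (ORF_ID, taxonomy, product); the helper name `search_rows` and the toy rows are made up for illustration.

```python
import csv
import io


def search_rows(rows, column, keywords, mode="AND"):
    """Keep rows whose `column` contains all (AND) or any (OR) of the keywords."""
    mode = mode.upper()
    if mode not in ("AND", "OR"):
        raise ValueError("mode must be 'AND' or 'OR'")
    combine = all if mode == "AND" else any
    lowered = [k.lower() for k in keywords]
    return [row for row in rows
            if combine(k in row[column].lower() for k in lowered)]


# Toy table standing in for an exported Functional and Taxonomic table.
table_text = (
    "ORF_ID\ttaxonomy\tproduct\n"
    "0_0\tBacteria\tnitrate reductase alpha subunit\n"
    "0_1\tArchaea\tammonia monooxygenase\n"
    "1_0\tBacteria\tnitrite reductase\n"
)
rows = list(csv.DictReader(io.StringIO(table_text), delimiter="\t"))

hits_and = search_rows(rows, "product", ["nitrate", "reductase"], mode="AND")
hits_or = search_rows(rows, "product", ["nitrate", "ammonia"], mode="OR")
print([r["ORF_ID"] for r in hits_and])
print([r["ORF_ID"] for r in hits_or])
```

For a real export, replace `io.StringIO(table_text)` with an open file handle.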

rRNA Annotation Tables

These tables contain taxonomic rRNA annotations found in the Silva or GreenGenes databases and have similar analytical features to the Functional and Taxonomic table described above. The table contains the following fields:

  • sequence: read/contig sequence the annotation was found in
  • start: start nucleotide position of the annotation
  • end: end nucleotide position of the annotation
  • similarity: sequence similarity of the query and the database target sequence
  • evalue: BLAST expectation value of the hit
  • bitscore: BLAST bit-score value of the hit
  • taxonomy: taxonomic annotation

rRNA Annotation tables can be searched and exported in the same way as the Functional and Taxonomic table, via the Search and Export buttons. One difference to note is that the rRNA alignment sequences can be exported in fasta format, a feature helpful in taxonomic tree building.

tRNA-Scan

Represents a basic parsing of the output table from tRNA-Scan. It has the same basic export features as the Functional and Taxonomic and rRNA Annotation Tables discussed previously.

Comparative Mode

While the above features are great for querying into an individual sample, Comparative Mode allows multiple samples to be compared across functional hierarchies. Additionally this feature was designed with scale in mind, and will handle hundreds of meta’omic samples with millions of annotations simultaneously.

To start, select ‘Comparative Mode’ from the drop-down menu and then click the ‘Select Samples’ button. Many of the same features from the individual samples have simply been scaled to apply to multiple samples. Currently, multiple samples can be compared across the following tables:

  • Sequence and Annotation Statistics (RUN STATS): the same statistics for the individual sample, scaled across samples.
  • Functional Hierarchies (KEGG, COG, etc.)

Note:

There are some small differences in functionality when comparing multiple samples.

  • Double-clicking an individual cell to create a subset functional hierarchy pulls all annotations in that column instead of just those in the individual cell
  • Right-clicking still refers to an individual cell
  • Export now allows you to select which samples you want to export sequences from
  • A folder of exported sequences will be created for each sample

Downstream Analysis

Due to this section's length, we have decided to relocate our collection of examples, tutorials, and use-cases to its own page.

References

  1. N. W. Hanson, K. M. Konwar, S.-J. Wu, S. J. Hallam, MetaPathways v2.0: A master-worker model for environmental Pathway/Genome Database construction on grids and clouds. Computational Intelligence in Bioinformatics and Computational Biology, 2014 IEEE Conference on, 1–7 (2014).
  2. D. Hyatt et al., Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics 11, 119 (2010).
  3. D. Hyatt, P. F. LoCascio, L. J. Hauser, E. C. Uberbacher, Gene and translation initiation site prediction in metagenomic sequences. Bioinformatics 28, 2223–2230 (2012).
  4. S. F. Altschul, W. Gish, W. Miller, E. W. Myers, D. J. Lipman, Basic local alignment search tool. J Mol Biol 215, 403–410 (1990).
  5. S. M. Kiełbasa, R. Wan, K. Sato, P. Horton, M. C. Frith, Adaptive seeds tame genomic sequence comparison. Genome Res 21, 487–493 (2011).
  6. D. H. Huson, A. F. Auch, J. Qi, S. C. Schuster, MEGAN analysis of metagenomic data. Genome Res 17, 377–386 (2007).
  7. T. M. Lowe, S. R. Eddy, tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research 25, 955–964 (1997).
  8. R. Caspi et al., The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Research 38, D473–D479 (2009).
  9. P. D. Karp, S. Paley, P. Romero, The pathway tools software. Bioinformatics 18, S225–S232 (2002).
  10. P. D. Karp, M. Latendresse, R. Caspi, The pathway tools pathway prediction algorithm. Stand Genomic Sci 5, 424–429 (2011).
  11. K. D. Pruitt, T. Tatusova, D. R. Maglott, NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Research 35, D61–D65 (2007).
  12. H. Li, R. Durbin, Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26, 589–595 (2010).
  13. R. L. Tatusov et al., The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4, 41 (2003).
  14. M. Kanehisa, S. Goto, KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Research 28, 27–30 (2000).
  15. F. Meyer et al., The metagenomics RAST server - a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 9, 386 (2008).
  16. R. K. Aziz et al., SEED servers: high-performance access to the SEED genomes, annotations, and metabolic models. PLoS ONE 7, e48053 (2012).
  17. B. L. Cantarel et al., The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic Acids Research 37, D233–D238 (2009).

Frequently Asked Questions

Question 1: What should be the value of BSR in the Parameters tab?

Answer: It is usually recommended to set it to a value in the range 0.3 to 0.4. Smaller BSR cutoffs make the selection of valid hits more permissive, and larger cutoffs make it more stringent.

Question 2: Should I use LAST or BLAST?

Answer: Both LAST and BLAST are homology search tools for protein and nucleotide sequences, and either can be used. The LAST code used in MetaPathways has been modified to produce the same output fields as the tabular format of BLAST. However, due to some significant algorithmic improvements, LAST can run up to 100x faster than BLAST. For large target protein databases, such as the RefSeq protein database, LAST is recommended over BLAST.

Question 3: Does LAST/BLAST use the nucleotide or protein sequences?

Answer: For homology search of protein sequences, in the FUNC SEARCH step, the user can choose either LAST or BLAST. For homology search of the preprocessed nucleotide sequences (located in the preprocessed subfolder of a sample's output folder) against the rRNA databases configured in the Setup tab, BLAST is used. We plan to enable LAST for this step as well, as the rRNA gene databases are getting larger.

Question 4: How does MetaPathways determine taxonomy?

Answer: Taxonomic identification of sequences is done in two primary ways.

  • The preprocessed nucleotide sequences in the "preprocessed" folder are BLASTed against the rRNA gene databases. The resulting hit is used to identify the taxonomy.
  • Once the ORFs are B/LASTed against RefSeq, the taxonomy for an ORF is computed by taking the lowest common ancestor (LCA) of that ORF's hits in the RefSeq protein database.
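
The LCA rule in the second bullet can be sketched as follows. This is a simplification that assumes each hit has already been resolved to a root-to-leaf lineage given as a list of names; the function name and the example lineages are illustrative, and the actual pipeline may work against a taxonomy tree rather than plain lists.

```python
def lowest_common_ancestor(lineages):
    """Return the shared root-to-node prefix of several lineages.

    The last element of the result is the lowest common ancestor;
    an empty list means the lineages share nothing.
    """
    if not lineages:
        return []
    common = []
    # zip(*lineages) walks all lineages rank by rank, stopping at the shortest.
    for names_at_rank in zip(*lineages):
        if all(name == names_at_rank[0] for name in names_at_rank):
            common.append(names_at_rank[0])
        else:
            break
    return common


# Toy lineages for three hits of one ORF.
hit_lineages = [
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Escherichia"],
    ["Bacteria", "Proteobacteria", "Gammaproteobacteria", "Salmonella"],
    ["Bacteria", "Proteobacteria", "Alphaproteobacteria"],
]
lca_path = lowest_common_ancestor(hit_lineages)
print(lca_path[-1] if lca_path else "root")
```

The more the hits disagree, the higher in the taxonomy the assigned label ends up, which is why ORFs with hits across distant lineages receive coarse taxonomic annotations.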

Question 5: How should my inputs be formatted?

Answer: The MetaPathways code expects the input to be in FASTA or GenBank format. For FASTA-formatted inputs, the file name should be of the form "samplename.fasta" (or ".fas", ".fna"), and the "samplename" part should begin with an alphabetical character and should not contain spaces or symbols such as ".".
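
A quick pre-flight check of FASTA file names against the rules above might look like the sketch below. It only encodes the rules stated in this answer (allowed extensions, a leading alphabetical character, no "." or spaces in the sample name); the function name is an assumption, GenBank inputs are not covered, and MetaPathways itself may apply additional restrictions.

```python
import os

ALLOWED_EXTENSIONS = (".fasta", ".fas", ".fna")


def sample_name_from_file(file_name):
    """Return the sample name for a FASTA file name, or raise ValueError."""
    base = os.path.basename(file_name)
    stem, extension = os.path.splitext(base)
    if extension.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"{base}: extension must be one of {ALLOWED_EXTENSIONS}")
    if not stem or not stem[0].isalpha():
        raise ValueError(f"{base}: sample name must start with a letter")
    if "." in stem or " " in stem:
        raise ValueError(f"{base}: sample name must not contain '.' or spaces")
    return stem


for candidate in ["N1.fasta", "new-samplex.fasta", "1sample.fasta", "my.sample.fas"]:
    try:
        print(candidate, "->", sample_name_from_file(candidate))
    except ValueError as error:
        print("rejected:", error)
```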

Question 6: How are the .BLASTout.parsed.txt or .LASTout.parsed.txt files in the blast_results folder generated?

Answer: The .BLASTout.parsed.txt or .LASTout.parsed.txt files for a sample sample_name are located in the folder sample_name/blast_results/. These files are named sample_name.db_name.BLASTout.parsed.txt or sample_name.db_name.LASTout.parsed.txt. The db_name refers to the protein database (actually the file name in the functional folder) that the sample ORFs were searched against, such as COG, KEGG, RefSeq, etc. Initially, the functional search results from B/LAST, i.e., the files sample_name.db_name.BLASTout or sample_name.db_name.LASTout, are generated in the FUNC_SEARCH step.

Next, in the PARSE_FUNC_SEARCH step, for each individual db_name, the hits that pass the minimum BSR are used to create the file sample_name.db_name.BLASTout.parsed.txt or sample_name.db_name.LASTout.parsed.txt. Therefore, only the hits that passed the cutoff are added to the parsed.txt files, not all hits.

Question 7: How are pathways predicted by MetaPathways?

Answer: It uses the PathoLogic algorithm in Pathway Tools.

Question 8: How do I add a custom reference protein database for functional annotation?

Answer: MetaPathways reference databases are located in the folder path specified in the "Database Directory" textbox in the Setup tab. They are organized as:

  database-folder/    
              functional/
                   database-1 (COG/MetaCyc reference proteins in fasta fmt)
                   database-2
                   ................
                   ...............
                   MyCustomDatabase (place your custom reference database here in fasta fmt)
                   formatted/ (files in this folder are created automatically)
                      database-1.<suffix1> (suffix related to (B)LAST fmt)
                      database-1.<suffix2> (BLAST (pnr, phr), LAST (suf, tis))
                      .........................
                      database-2.<suffix1>
                      database-2.<suffix2>
                      .........................
                      MyCustomDatabase.<suffix1>
                      MyCustomDatabase.<suffix2>
                      .........................
              taxonomic/
                   rRNA-sequence-database-1 (rRNA genes fasta fmt)
                   rRNA-sequence-database-2 (e.g. Silva/Greengene)
                   ................
                   ...............
                   formatted/ (files in this folder are created automatically)
                      rRNA-sequence-database-1.<suffix1>
                      rRNA-sequence-database-1.<suffix2>
                      .........................
                      .........................
                      rRNA-sequence-database-2.<suffix1>
                      rRNA-sequence-database-2.<suffix2>
                      .........................
                      .........................

Suppose you want to add a custom database, MyCustomDatabase (in FASTA format), containing your new reference protein sequences. Simply put the file directly under the functional folder, rerun MetaPathways, and select the database in the Parameters tab. The tool will then automatically format "MyCustomDatabase" and put the formatted files under the formatted folder (see the listing above).

Question 9: MetaPathways found my custom database sequences and tried to format automatically but failed. What do I do now?

Answer: TBA

Question 9: Can I use a nucleotide database for functional annotation?

Answer: No, currently only reference protein databases can be used to functionally annotate the ORFs.

Question 10: Is it possible to process the samples through MetaPathways on a remote machine and view the results locally?

Answer: Yes. MetaPathways essentially consists of two major components:

1. The Python code base that drives the actual processing of data, which is usually located in the folder MetaPathways_Python.x.x.x.

2. The GUI component that can drive the Python code base to process the samples to create the data products, and then view the results.

There are a few ways of doing this:

  • One can process the samples using the Python code base on a remote machine while connected through a Linux/Unix shell, and then move the data products to another machine to view with the GUI component. For more information on how to run MetaPathways manually, please refer to Question 11 below.
  • Install both the GUI and the Python code base on a Linux machine and use ssh -Y for X11 forwarding to run MetaPathways through the GUI.

Question 11: I usually connect to my Linux/Unix machine through an ssh shell. How do I run MetaPathways from a command line?

Answer: Follow the steps below.

  1. Type

$bash

to use the bash shell.

  2. To create the input/output folders, type the following commands

    $mkdir myproject

    $mkdir myproject/input

    $mkdir myproject/output

  3. Put the input files into the myproject/input folder. For this example, we will assume you have added the sample files N1.fasta, N2.fasta, sample1.fasta, and new-samplex.fasta, corresponding to the samples N1, N2, sample1, and new-samplex. Note that the file extension (.fasta, .fas) is always dropped and is not considered part of the sample name; for example, N1.fasta corresponds to sample N1, N2.fasta to sample N2, sample1.fasta to sample sample1, etc.

  4. Type

$source <metapathways-folder>/MetaPathways_Python2.x.x/MetaPathwaysrc

   This step sets the right module paths. <metapathways-folder> is the path where you have MetaPathways_Python2.x.x, the Python code base for MetaPathways.

  5. You can set the run parameters in template_config.txt and template_param.txt, located under the folder <metapathways-folder>/MetaPathways_Python2.5.1/config. Usually, you need to configure template_config.txt only once, during installation. The parameter file template_param.txt holds the settings for the run. You can copy these files into the folder you are working in; this will save you from typing long file paths.

  6. The following are a few different scenarios for processing the samples, where the input folder is myproject/input/ and the output folder is myproject/output/

    (a) To process sample N1, type

    $python <metapathways-folder>/MetaPathways_Python2.5.1/MetaPathways.py -i myproject/input/ -o myproject/output/ -c config/template_config.txt -p config/template_param.txt -s N1

    (b) To process sample N1 and sample1, type

    $python <metapathways-folder>/MetaPathways_Python2.5.1/MetaPathways.py -i myproject/input/ -o myproject/output/ -c config/template_config.txt -p config/template_param.txt -s N1 -s sample1

    (c) To process all sample files in myproject/input, type

    $python <metapathways-folder>/MetaPathways_Python2.5.1/MetaPathways.py -i myproject/input/ -o myproject/output/ -c config/template_config.txt -p config/template_param.txt

    Note that you can process as many samples from the input folder as you want by giving one -s option per sample. Also, "template_config.txt" does not need to be changed between runs.
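
When launching many runs from a script, the command lines above can be assembled programmatically. The sketch below only builds the argument list using the -i, -o, -c, -p, and -s options shown in the examples; the helper name `build_command` and the install path `/opt/MetaPathways_Python2.5.1` are hypothetical.

```python
import os
import shlex


def build_command(metapathways_dir, input_dir, output_dir,
                  config_file, param_file, samples=()):
    """Assemble the MetaPathways.py command line; one -s option per sample."""
    command = [
        "python", os.path.join(metapathways_dir, "MetaPathways.py"),
        "-i", input_dir,
        "-o", output_dir,
        "-c", config_file,
        "-p", param_file,
    ]
    for sample in samples:
        command += ["-s", sample]
    return command


# Hypothetical install location; an empty `samples` processes the whole input folder.
cmd = build_command(
    "/opt/MetaPathways_Python2.5.1",
    "myproject/input/", "myproject/output/",
    "config/template_config.txt", "config/template_param.txt",
    samples=["N1", "sample1"],
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed directly to `subprocess.run(cmd, check=True)` once the MetaPathwaysrc environment from step 4 has been sourced in the calling shell.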

Question 12: How does MetaPathways annotate the ORFs (predicted genes)? And how are the reported statistics calculated?

Answer: Because ORFs are annotated against multiple databases, MetaPathways approaches ORF annotation from three different perspectives: (1) annotation against minimum thresholds, (2) annotation within individual databases, and (3) annotation for pathway prediction in Pathway Tools. In the first case, MetaPathways ensures that all annotations meet certain quality thresholds, and its default annotation parameters for (B)LAST are fairly conservative (Length: 180 bp (60 aa), E-value: 1e-6, Bit-score: 20, Bit-score ratio >0.4). Next, because sequence databases often have specialized annotation criteria and functional hierarchies, annotations within each database are preserved, and these annotations are used to generate the counts in the functional hierarchy tables of the GUI (e.g., COG, KEGG, and SEED). Finally, Pathway Tools maps annotations to pathways during Pathway/Genome Database creation based on E.C. number mapping and keyword annotation. Thus, if an ORF has high-quality hits from different databases, MetaPathways prefers annotations that have E.C. numbers and long, descriptive text. It does this by calculating an 'information-score' that gives points for informative words and extra points for successfully parsed E.C. numbers.

This annotation involves several stages of the pipeline, so here is a simple example using four ORFs, starting with the FILTER AMINOS step. We assume that the name of the sample is X and that we are using LAST for the functional annotation against the COG, KEGG, and SEED databases.

FUNC SEARCH:

After filtering out very short amino acid sequences, suppose we are left with four orfs (orf1, orf2, orf3, orf4).

Suppose we do a homology search for the ORFs against the COG, KEGG, and SEED reference databases. This results in output tables in the X/blast_results/ folder. These tables contain the hits for the ORFs in these databases (only relevant columns are shown):

  • X.COG.LASTout:

    queryId  subjectId  ...  evalue            bitscore
    orf1     cogA       ...  1e-8              70
    orf2     cogB       ...  1e-14             80
    orf3     cogC       ...  1e-9              100
    orf4     cogD       ...  1e-5 (too big)    85

  • X.KEGG.LASTout:

    queryId  subjectId  ...  evalue  bitscore
    orf1     keggA      ...  1e-9    75
    orf2     keggB      ...  1e-12   67
    orf3     keggC      ...  1e-16   90
    orf4     keggD      ...  1e-14   85

  • X.SEED.LASTout:

    queryId  subjectId  ...  evalue  bitscore
    orf1     seedA      ...  1e-20   75
    orf2     seedB      ...  1e-8    30
    orf3     seedC      ...  1e-9    105

PARSE FUNC SEARCH:

In this step the above tables are processed, but only hits that pass the maximum E-value, minimum bit-score, and minimum BSR (Bit-Score Ratio) cutoffs end up in the parsed file (stored in the same directory). For example, here we assume our maximum E-value is 1e-6, minimum bit-score is 50, and minimum BSR is 0.4. The BSR of a query and target sequence alignment is the ratio of the query-to-target bit-score to the query-to-query bit-score (called the refscore). The following table, in the folder X/blast_results/, shows the refscores for the ORF sequences:

  • X.refscores.LAST:

    queryId  bitscore
    orf1     80
    orf2     90
    orf3     120
    orf4     100

Now the resulting LASTout.parsed.txt files are generated by retaining only the hits, for each database, that pass the cutoffs:

  • X.COG.LASTout.parsed.txt:

    queryId  subjectId  ...  product/annotation
    orf1     cogA       ...  enzyme V
    orf2     cogB       ...  enzyme W
    orf3     cogC       ...  enzyme S

  • X.KEGG.LASTout.parsed.txt:

    queryId  subjectId  ...  product/annotation
    orf1     keggA      ...  enzyme V
    orf2     keggB      ...  enzyme W
    orf3     keggC      ...  enzyme Y
    orf4     keggD      ...  enzyme Z

  • X.SEED.LASTout.parsed.txt:

    queryId  subjectId  ...  product/annotation
    orf1     seedA      ...  enzyme V
    orf3     seedC      ...  enzyme K
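
The filtering in this step can be reproduced on the example data with a few lines of Python. The sketch below assumes a maximum E-value of 1e-6 (the default quoted earlier in this answer), a minimum bit-score of 50, and a minimum BSR of 0.4, and it treats the cutoffs as inclusive, which is an assumption. The names `REFSCORES`, `HITS`, and `passes_cutoffs` are illustrative and are not MetaPathways identifiers.

```python
# Query-to-query bit-scores from X.refscores.LAST.
REFSCORES = {"orf1": 80, "orf2": 90, "orf3": 120, "orf4": 100}

# (queryId, subjectId, evalue, bitscore) rows from the three LASTout tables.
HITS = {
    "COG": [("orf1", "cogA", 1e-8, 70), ("orf2", "cogB", 1e-14, 80),
            ("orf3", "cogC", 1e-9, 100), ("orf4", "cogD", 1e-5, 85)],
    "KEGG": [("orf1", "keggA", 1e-9, 75), ("orf2", "keggB", 1e-12, 67),
             ("orf3", "keggC", 1e-16, 90), ("orf4", "keggD", 1e-14, 85)],
    "SEED": [("orf1", "seedA", 1e-20, 75), ("orf2", "seedB", 1e-8, 30),
             ("orf3", "seedC", 1e-9, 105)],
}


def passes_cutoffs(query, evalue, bitscore,
                   max_evalue=1e-6, min_bitscore=50, min_bsr=0.4):
    """Apply the E-value, bit-score, and bit-score ratio (BSR) cutoffs to one hit."""
    bsr = bitscore / REFSCORES[query]  # query-to-target over query-to-query score
    return evalue <= max_evalue and bitscore >= min_bitscore and bsr >= min_bsr


parsed = {
    db: [query for query, _subject, evalue, bitscore in hits
         if passes_cutoffs(query, evalue, bitscore)]
    for db, hits in HITS.items()
}
for db, queries in parsed.items():
    print(db, queries)
```

The orf4 hit in COG is dropped because of its E-value, and the orf2 hit in SEED is dropped because its bit-score of 30 is below the minimum (its BSR of 30/90 is also below 0.4), which matches the parsed tables above.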

ANNOTATE ORFS

Now all the annotations for each ORF in the parsed.txt files from every database are aggregated in one table, X/results/annotation_table/X.2.txt. For each database there is an annotation column and an 'information-score' column (score). This score estimates how specific an annotation is for a given enzyme by counting informative words (after removing uninformative stop words as well as 'protein' and 'enzyme') and giving a +10 bonus for an E.C. number. The annotation with the highest score is selected, which favors annotations that have more informative words or E.C. numbers.

    ORFId  COG (annot)  COG (score)  KEGG (annot)  KEGG (score)  SEED (annot)  SEED (score)
    orf1   enzyme V     15           enzyme V      10            enzyme V      12
    orf2   enzyme W     15           enzyme W      15            -             -
    orf3   enzyme S     9            enzyme Y      7             enzyme K      8
    orf4   -            -            enzyme Z      9             -             -

In the above example orf1, orf2, orf3, and orf4 will be annotated as 'enzyme V', 'enzyme W', 'enzyme S', and 'enzyme Z', which are the final annotations that end up in X/results/annotation_table/ORF_annotation_table.txt.
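
To illustrate the idea behind the information-score, here is a hedged sketch of one way such a score could be computed and used to pick an annotation. The stop-word list, the E.C. number pattern, and the function names are assumptions made for this example; the scoring code in MetaPathways may differ in its details.

```python
import re

# Assumed stop words; the text above only states that stop words,
# 'protein', and 'enzyme' are ignored.
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "to", "for", "protein", "enzyme"}

# Matches complete four-part E.C. numbers such as "EC 1.7.99.4" or "1.7.99.4".
EC_PATTERN = re.compile(r"(?:\bEC[ :]?)?\b\d+\.\d+\.\d+\.\d+\b", re.IGNORECASE)


def information_score(annotation):
    """Count informative words and add 10 if an E.C. number is present."""
    has_ec = EC_PATTERN.search(annotation) is not None
    text_without_ec = EC_PATTERN.sub(" ", annotation)
    words = re.findall(r"[A-Za-z][A-Za-z0-9-]*", text_without_ec)
    informative = [word for word in words if word.lower() not in STOP_WORDS]
    return len(informative) + (10 if has_ec else 0)


def best_annotation(candidates):
    """Pick the annotation with the highest score from {database: annotation}."""
    return max(candidates.values(), key=information_score)


# Toy annotations for a single ORF from three databases.
candidates = {
    "COG": "nitrate reductase (EC 1.7.99.4)",
    "KEGG": "nitrate reductase",
    "SEED": "hypothetical protein",
}
for database, annotation in candidates.items():
    print(database, information_score(annotation), annotation)
print("selected:", best_annotation(candidates))
```

On ties, `max` keeps the first candidate it encounters, so the order in which databases are listed decides the outcome in this sketch.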

PATHOLOGIC INPUT

The annotations in the file ORF_annotation_table.txt are used to create the PathoLogic input for building the ePGDB, in which metabolic pathways from the MetaCyc pathway database are predicted.

Interpreting Annotation Statistics

Below we explain meanings of the sample statistics generated by the pipeline.

statsdiagram1

  • A - Total hits from the given database (in this example MetaCyc and COG) regardless of the user-defined thresholds (BSR, e-value or minimum score).
  • B - Total hits from the given database (in this example MetaCyc and COG) that met user-defined thresholds (BSR, E-value and minimum score). These hits end up in the output/blast_results folder with the name SampleName.DatabaseName.(B)LASTout.parsed.txt
  • C - Total number of ORFs for which a protein annotation was assigned (across all databases after filtering hits using the user-defined thresholds). This number will not necessarily be the same as the sum of the hits from all 'Annotations meeting user defined thresholds from Database' columns (B), as a single ORF may have been successfully annotated by more than one database.

functional_stats

  • D - Total number of hits meeting the user-defined thresholds in the selected database (in this example COG). Quite often, annotations map to multiple places in the same hierarchy because of multifunctionality and the difficulty of assigning certain enzymatic functions to a single metabolic task. This means that an ORF can be counted more than once (once for each sub-category it maps to), so functional hierarchy tables are not guaranteed to sum to their respective total hits (B). We have considered this double-counting issue, but have not settled on a method to correct for it, or on whether correction is useful in all cases of multifunctional enzymes.
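
The difference between the per-database counts (B) and the number of annotated ORFs (C) can be seen with a small sketch. It reuses the ORFs that passed the cutoffs in the COG, KEGG, and SEED example from Question 12 rather than the MetaCyc and COG example in the figure, and the variable names are illustrative.

```python
# ORFs with at least one hit passing the thresholds, per database.
passing_orfs = {
    "COG": ["orf1", "orf2", "orf3"],
    "KEGG": ["orf1", "orf2", "orf3", "orf4"],
    "SEED": ["orf1", "orf3"],
}

# B: hits meeting the thresholds, counted separately for each database.
hits_per_database = {db: len(orfs) for db, orfs in passing_orfs.items()}

# C: ORFs annotated by at least one database, each ORF counted once.
annotated_orfs = set()
for orfs in passing_orfs.values():
    annotated_orfs.update(orfs)

print("B per database:", hits_per_database)
print("sum of B:", sum(hits_per_database.values()))
print("C:", len(annotated_orfs))
```

Here the B columns add up to 9 while only 4 distinct ORFs are annotated, because orf1 and orf3 are each annotated by all three databases.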

Question 13: How do I manually run LAST or BLAST?

Answer:

  • LAST: You need two executables to use it: lastdb and lastal. The version of lastal distributed with MetaPathways has been modified to produce the tabular format supported by BLAST (produced by the BLAST option "-m 8"), with the e-value included in the table. The process involves two steps. The first step formats the reference database. Note that this step takes substantially longer than the corresponding BLAST formatting step, because it builds the suffix arrays for the reference sequences. For example, in the case of the NCBI RefSeq protein database (>20 million sequences) it can take up to a day. Fortunately, you do not need to reformat the reference database unless the set of sequences changes.

$lastdb -s 4000M -p -c <reference_seq> <reference_sequence_file_fasta>

-p is for protein sequences

Next we can do a homology search of a set of query sequences in FASTA format by running the command

$lastal -f 2 -o <output> <reference_seq> <query_sequence_file_fasta>

The -f 2 option produces BLAST-like tabular output. This step is substantially faster than BLAST. Note that <reference_seq> is the same in both steps. For example, if the reference sequences are in a file called reference.fasta, then the formatted database can be identified as "reference.fasta". The database name does not need to have the ".fasta" extension, as long as the sequences are in FASTA format.

  • BLAST: There are plenty of resources about BLAST on the Internet.

Question 14: I copied the MetaPathways_Python.x.x.x folder from another computer but it does not run.

Answer: Apart from checking the paths and following the correct instructions for MetaPathways installation and running, note that ".pyc" files copied from the source machine may confuse your system. ".pyc" files are compiled bytecode files that Python generates from the corresponding ".py" source files on the first run, to speed up subsequent runs of the code. If the source and destination machines differ in architecture, OS type, or Python version, the code may fail to run. The trick is to delete the ".pyc" files in the MetaPathways_Python.x.x.x folder and in the libs/ folder and its sub-folders at every depth. Try using "rm *.pyc"; "rm */*.pyc"; "rm */*/*.pyc"; "rm */*/*/*.pyc"; and so on until you reach the deepest sub-folder.
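
Instead of repeating the rm command for every folder depth, the clean-up can be done in one pass. The sketch below uses Python's pathlib to remove every ".pyc" file under a folder; the function name is an assumption, and the demonstration runs in a temporary directory with made-up sub-folder names so that nothing real is deleted.

```python
import tempfile
from pathlib import Path


def remove_pyc_files(root):
    """Delete every .pyc file under `root`, at any depth; return the removed paths."""
    removed = []
    # Collect the matches first so the tree is not modified while it is walked.
    for pyc_file in sorted(Path(root).rglob("*.pyc")):
        if pyc_file.is_file():
            pyc_file.unlink()
            removed.append(pyc_file)
    return removed


# Demonstration on a throwaway folder; the layout below is hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "libs" / "python_modules" / "utils"
    nested.mkdir(parents=True)
    (Path(tmp) / "MetaPathways.pyc").write_bytes(b"")
    (nested / "helper.pyc").write_bytes(b"")
    (nested / "helper.py").write_text("# source file stays in place\n")

    removed = remove_pyc_files(tmp)
    remaining = sorted(p.name for p in Path(tmp).rglob("*") if p.is_file())

print(len(removed), "files removed; remaining:", remaining)
```

To clean a real installation, call `remove_pyc_files` with the path to your MetaPathways_Python.x.x.x folder.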

Question 15: The last or blast executable does not work, are the binaries compatible?

Answer: The lastdb, lastal, and other binaries used by the pipeline are located in system-specific folders, such as macosx, ubuntu, etc., under the executables folder in MetaPathways_Python.x.x.x. However, it is often necessary to compile the binaries yourself. The source code for the binaries is located in a folder called sources under the executables folder. For example, to compile lastal and lastdb, cd into the sources folder, untar LAST.tar (tar -xvf LAST.tar), and then cd into the folder LAST. Now run make clean and then make. You will notice that the executables lastal and lastdb are created in the folder. Now copy lastal and lastdb into your system-specific executables folder, e.g., macosx, ubuntu, etc. Make sure you also give them execute permissions with chmod 711 lastal and chmod 711 lastdb.

Question 16: Which reference protein databases should I use? Where do I get these databases from?

Answer: MetaPathways displays the functional potential of the sample ORFs by tallying the counts against the SEED subsystems, KEGG categories, COG categories, MetaCyc pathways, and CAZY hierarchy. Therefore, to produce these results correctly, you need to provide the databases associated with these functional categories. Normally, one is expected to use the KEGG, COG, CAZY, RefSeq protein, MetaCyc, and SEED protein sequence databases.

  • NCBI Refseq: This is one of the larger protein databases, but it is essential. Using this reference database, MetaPathways can determine both function and taxonomy (using the LCA rule) for individual ORFs. To create the required protein sequence file you will need the NCBI fastacmd tool that comes with the NCBI BLAST executables (from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/). Note that fastacmd belongs to the legacy BLAST package; BLAST+ releases provide the equivalent functionality as blastdbcmd. The protein sequences for the NCBI Refseq protein database can be retrieved by following the steps below:

    • Download all the files

      refseq_protein.01.tar.gz

      refseq_protein.02.tar.gz

      refseq_protein.03.tar.gz

      .........................

      These are essentially parts of the BLAST formatted database (probably made using the makeblastdb tool).

    • Decompress each one of these files with the command

      $tar -zxvf refseq_protein.xx.tar.gz

      which will extract a few files such as

      refseq_protein.01.phr

      refseq_protein.01.psq, etc.

      All the files together constitute a formatted, BLAST-usable reference protein database.

    • Extract the protein sequences in the formatted database using the NCBI tool fastacmd, as

      $fastacmd -d refseq_protein -D 1 -o refseq-seq

      which will output the protein sequences in a FASTA file called "refseq-seq" and this can be dropped in the "functional" folder (see FAQ item Question 8) for MetaPathways to format automatically.

  • COG: The COG sequences can be downloaded in the form of a FASTA file from the link ftp://ftp.ncbi.nih.gov/pub/wolf/COGs/COG0508/

  • KEGG: The KEGG sequences, from KEGG: Kyoto Encyclopedia of Genes and Genomes, are no longer free. Please check if your lab has already made a purchase.

  • MetaCyc: TBA

  • SEED: The SEED sequences with peg IDs can be downloaded in FASTA format from the SEED Project website. More details TBA.

  • CAZY: This is a database of Carbohydrate-Active Enzymes from CAZY (http://www.cazy.org/). More details TBA.

Question 17: Is there a log file of the parameters used for ORF prediction and functional annotation?

Answer: There is no permanent log for this, because during each run the logs are updated. However, the template_param.txt file in the config folder has these settings.

Question 18: Is the pgdb folder or the ptools folder, located under the output folder for a sample, the input for MetaPathways? And does that input represent ALL the annotations (SEED, KEGG, COG, etc.) from MetaPathways?

Answer: The input to the results displayed on the MetaPathways GUI comes from the folder and files highlighted (by shading) below. The files in the results folder also contain the input to the GUI for SEED, KEGG, COG, etc.

  • Folders/files under <sample_name> folder

    • blast_results

    • genbank

    • mltreemap_calculations

    • ptools

    • bwa

    • orf_prediction

    • results

       annotation_table/ORF_annotation_table.txt
      
       annotation_table/functional_and_taxonomic_table.txt
      
    • preprocessed

    • run_statistics

    • errors_warnings_log.txt

    • metapathways_run_log.txt

    • metapathways_steps_log.txt
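Before copying a sample's output elsewhere, it can help to confirm that the layout listed above is complete. The following is a minimal Python sketch, assuming the entry names are exactly those listed above and that the two annotation tables sit under results/. The function name `missing_entries` is a hypothetical helper, not part of MetaPathways.

```python
from pathlib import Path

# Folders and files listed above for a <sample_name> output folder.
EXPECTED_ENTRIES = [
    "blast_results",
    "genbank",
    "mltreemap_calculations",
    "ptools",
    "bwa",
    "orf_prediction",
    "results",
    "preprocessed",
    "run_statistics",
    "errors_warnings_log.txt",
    "metapathways_run_log.txt",
    "metapathways_steps_log.txt",
]

# Annotation tables listed under the results folder.
ANNOTATION_TABLES = [
    "results/annotation_table/ORF_annotation_table.txt",
    "results/annotation_table/functional_and_taxonomic_table.txt",
]


def missing_entries(sample_dir):
    """Return the expected entries that are absent from sample_dir."""
    root = Path(sample_dir)
    expected = EXPECTED_ENTRIES + ANNOTATION_TABLES
    return [rel for rel in expected if not (root / rel).exists()]
```

An empty list means every listed folder and file was found; some steps may be skipped in a given run, so a missing entry is not necessarily an error.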

Question 19: We are visualizing the results of a MetaPathways analysis on a different computer that has Pathway Tools installed. What is the proper way to load the output from MetaPathways into Pathway Tools?

Answer: There are two input folders you need to move to the new machine to view the results, assuming that both Pathway Tools and the MetaPathways GUI (if not MetaPathways_Python) are installed on that machine.

  • Viewing the PGDB using Pathway Tools: By default, Pathway Tools looks for processed PGDBs in the folder <your_home>/ptools-local/pgdbs/user/. Each PGDB is stored on disk as a folder named after the sample (in lowercase) with the suffix "cyc". For example, if the sample name is XYZ, the PGDB is stored in the sub-folder "xyzcyc" under <your_home>/ptools-local/pgdbs/user/, so the full path is "<your_home>/ptools-local/pgdbs/user/xyzcyc". To view samples on another machine, copy the PGDB folders for the samples you want from "<your_home>/ptools-local/pgdbs/user/" to the corresponding folder on the other machine. Take care to zip these folders recursively, for example with zip -r xyzcyc.zip xyzcyc, then place the zip file under the other machine's "<your_home>/ptools-local/pgdbs/user/" and unpack it with unzip xyzcyc.zip. Once that is done, start or restart Pathway Tools for the PGDBs to show up.
  • Viewing the COG, KEGG, SEED, etc. results in MetaPathways: Simply copy the sample results folder to the new machine and view it with MetaPathways.
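The PGDB naming rule and the recursive zip step described above can be sketched in Python as follows. This is an illustration only: `pgdb_dir` and `zip_pgdb` are hypothetical helper names, and the folder layout is assumed to be exactly the default <your_home>/ptools-local/pgdbs/user/ location described above.

```python
import shutil
from pathlib import Path


def pgdb_dir(sample_name, home=None):
    """Return the default PGDB folder for a sample, e.g. XYZ -> .../user/xyzcyc."""
    home = Path(home) if home is not None else Path.home()
    return home / "ptools-local" / "pgdbs" / "user" / (sample_name.lower() + "cyc")


def zip_pgdb(sample_name, dest_dir, home=None):
    """Zip a sample's PGDB folder recursively and return the archive path.

    Same effect as running `zip -r xyzcyc.zip xyzcyc` from the pgdbs/user folder.
    """
    src = pgdb_dir(sample_name, home)
    if not src.is_dir():
        raise FileNotFoundError("PGDB folder not found: %s" % src)
    base = Path(dest_dir) / src.name
    return shutil.make_archive(
        str(base), "zip", root_dir=str(src.parent), base_dir=src.name
    )
```

The resulting xyzcyc.zip keeps "xyzcyc" as its top-level folder, so unzipping it inside the other machine's pgdbs/user/ folder recreates the expected path.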

Question 20: If I want to present the relative abundance of different reads in the KEGG/COG/SEED/CAZY/MetaCyc output, for instance, and I am looking at the RPKM display, can I use those read numbers, or should they still be normalized to the total reads to compare relative abundance between different data sets?

Answer: RPKM values are already normalized for the input size as well as for the length of the individual ORFs, so there is no need to normalize them further. More sophisticated statistical sampling techniques may exist, but we do not support them for now and instead use the mainstream methods.
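To illustrate why no further normalization is needed, here is a minimal Python sketch of the standard RPKM definition (reads per kilobase of ORF per million reads). The function name `rpkm` and its parameters are illustrative, and whether MetaPathways counts all reads or only mapped reads in the denominator is an assumption here, not something stated above.

```python
def rpkm(reads_mapped_to_orf, orf_length_bp, total_reads):
    """Standard RPKM: reads per kilobase of ORF per million reads.

    total_reads is assumed to be the number of mapped reads in the sample.
    """
    if orf_length_bp <= 0 or total_reads <= 0:
        raise ValueError("ORF length and total reads must be positive")
    # Dividing by (length / 1e3) and by (total_reads / 1e6) gives the 1e9 factor.
    return reads_mapped_to_orf * 1e9 / (orf_length_bp * total_reads)


# Two data sets of different sequencing depth, same relative abundance:
sample_a = rpkm(reads_mapped_to_orf=100, orf_length_bp=1000, total_reads=1_000_000)
sample_b = rpkm(reads_mapped_to_orf=200, orf_length_bp=1000, total_reads=2_000_000)
```

Both calls give the same RPKM even though the second sample has twice as many reads, which is why the values can be compared across data sets directly.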