Skip to content

Latest commit

 

History

History
370 lines (247 loc) · 30.2 KB

pasta-tutorial.md

File metadata and controls

370 lines (247 loc) · 30.2 KB

Introduction

PASTA estimates alignments and ML trees from unaligned sequences using an iterative approach. In each iteration, it first estimates a multiple sequence alignment using the current tree as a guide and then estimates an ML tree on (a masked version of) the alignment. By default, PASTA performs 3 iterations, but a host of options enable changing that behavior. In each iteration, a divide-and-conquer strategy is used for estimating the alignment. The set of sequences is divided into smaller subsets, each of which is aligned using an external alignment tool (default is MAFFT). These subset alignments are then pairwise merged (by default using Opal) and finally the pairwise merged alignments are merged into a final alignment using a transitivity merge technique. The division of the dataset into smaller subsets and selecting which alignments should be pairwise merged is guided by the tree from the previous iteration. The first step therefore needs an initial tree.

Acknowledgment: The current PASTA code is heavily based on the SATe code developed by Mark Holder's group at KU. Refer to sate-doc directory for documentation of the SATe code, including the list of authors, license, etc.


Using PASTA

Step 1: Running a test job

PASTA can be run using command-line or GUI.

From GUI:

  1. If you have installed PASTA from the code, you need to open a terminal (sorry!) and run:
run_pasta_gui.py

on some machines, this may not work, saying

This program needs access to the screen. Please run with a Framework build of python, and only when you are logged in on the main display of your Mac.

you may be able to fix this by running:

pythonw {PATH to pasta}/run_pasta_gui.py

note the use of pythonw instead of python

  1. On the VM, on the left hand side, there is a PASTA icon. Click that to open PASTA.

This will open up the PASTA GUI. To provide the input file, click on Sequence File button and select the input fasta file.

We have provided some test data files under the data directory. A good first start is small.fasta, which is a small simulated alignment and runs really fast. You can also use 16S.E.ALL.unaligned.fasta, which is much larger (1,900 sequences) and will take longer (depending on your machine, between 15 minutes and few hours). The 16S input file is downloaded from the Gutell Lab CRW website, but is trimmed to the first 990 sites so that a test run can finish in 15-20 minutes given a relatively new desktop machine.

Once you select the input file, PASTA will ask you if you would like it to "Red input data now?". Answer OK. At this point PASTA automatically picks up its algorithmic parameters. Now click on the Start button and PASTA will start running. If you used small.fasta as input, PASTA will finish after a short while. Once PASTA finishes, the location of the output files are given in the prompt box just below the start button.

From Command-line:

If your installation is successful, you should be able to run PASTA by running the following command from any location. Open up a terminal window and type:

run_pasta.py --help

Running PASTA with the --help option produces the list of options available in PASTA. PASTA automatically picks its algorithmic settings based on your input, so you can ignore most of these options (but -d is essential if you have anything other than DNA sequences). The basic command-line usage you need to know is:

run_pasta.py -i input_fasta_file

The -i option is used to specify the input sequence file. The input file needs to be in the relaxed FASTA format. This command will start PASTA and will run it on your input file.

For a test run, use the cd command to go to the data directory under your PASTA installation directory. From there, run

run_pasta.py -i small.fasta

This will start PASTA and will finish quickly (30 seconds to 5 minutes based on your machine). Read PASTA output and make sure it finishes without producing any errors. If PASTA runs successfully, it produces a multiple sequence alignment and a tree, which we will explore in the next step.

Step 2: Inspecting the output of PASTA:

The two main outputs of PASTA are an alignment and a tree. The tree is saved in a file called [jobname].tre and the alignment file is named [jobname].marker001.small.aln. The [jobname] is a prefix which is by default set to pastajob, but can be changed by the user (see option -j below). When you start PASTA, if your output directory (which is by default where your input sequences are) already contains some files with the pastajob prefix, then the pastajob1 prefix is used, and if that exists, pastajob2 is used, and so forth. Thus the existing files are never overwritten. The name of your job and therefore the prefix used for output files can be controlled using the - j argument for command-line or the "Job Name" field on the GUI.

We will first briefly examine the alignment and the tree produced by PASTA.

Tree viewing software:

There are many tools that can be used for viewing phylogenetic trees. Here are few that are used by many people:

  1. FigTree is probably the most widely used tool. It produces nice looking figures and works under Linux, Windows and MAC.
  2. Archaeopteryx is very good for viewing large trees and also works under all three operating systems.
  3. EvolView: online application. You don't even need to download.

There are many more.

For this tutorial, let's use the online viewer (EvolView) or any other tool you can manage to download and install it (already installed on virtual machine). Using either of these applications, open the pastajob.tre file. We will explore various tree viewing options.

Alignment viewing software:

Similarly, many applications exist for viewing multiple sequence alignments. Some options :

  1. Alignments can be viewed in any text editor. In many situations, this is sufficient. In Linux/Mac command-line, one can use less, vim, or any number of other text editor applications. Also GUI-enable text editors (e.g. TextEdit for MAC) would also work. Windows notepad is not a good option (neither is MS word or any other word processing tool), but notepad++ should work.
  2. Seaview is a relatively light-weight application that does a good job of visualizing alignments.
  3. JalView has many options, but is not necessarily light-weight.
  4. SuiteMSA is also a full-featured software for alignment viewing, manipulation, and more.

In this tutorial, we will use both a text editor and the Seaview. But feel free to use your own favorite tools. We will open pastajob.marker001.small.aln using a text editor and using SeaView and will look at them.

Note about support values:

PASTA does not perfom bootstrapping. The tree outputted by PASTA, depending on the options used, might include support values on the branches. These are not bootstrap support values. Instead, they are SH-like local support values computed by FastTree, and are generally believed to be not as reliable as bootstrap support values. In our experience they tend to over-estimate support. Thus, if you want to have support values that can be trusted, we suggest that you use the PASTA alignment and an external tool (e.g., RAxML) for bootstrapping. If your alignment is too big for bootstrapping using RAxML, you can always use FastTree for bootstapping.

Other PASTA output files:

PASTA also generates a few other "output" files. pastajob.out.txt and pastajob.err.txt will contain output and error messages respectively. pastajob.score.txt gives the likelihood score produced by PASTA's internal ML tool on the (masked -- see below) alignment.

Besides these, many "temporary files" are also produced. The most important temporary file is called pastajob_temp_pasta_config.txt. This file contains all the PASTA configurations used in the current run. You should 1) inspect this file to make sure the configurations are what you intended, and 2) keep this file for future reference (so that you know what exact options where used). We will see in the next steps how configuration files can be used for running PASTA with a reproducible set of configurations.

Finally, there are a bunch of temporary output files with the format pastajob_temp_iteration_X_tree.tre,pastajob_temp_iteration_X_seq_alignment.txt, and pastajob_temp_iteration_X_seq_unmasked_alignment.gz. These are alignments and trees produced by individual iterations of PASTA (more on this later). Alignments are saved in masked and unmasked format. These temporary alignments and trees use PASTA's internal "safe" names. The translation between PASTA's internal names and your original names is given in pastajob_temp_name_translation.txt. As we will see, PASTA comes with a tool called run_seqtools.py that can help with manipulating these files (see Step 6).

Comparing alignments:

When two alignments are generated on the same set of sequences, one can ask how similar they are. We have a tool called FastSP that compares two alignments and tells you how similar or different they are. FastSP does not require installation. You can just download it and run it (Java is required). We will use it to compare alignments obtained using PASTA and a reference alignment obtained from the CRW website, and also included as part of the data package. Assuming FastSP is located at ~/bin/, you can run:

   java -jar ~/bin/FastSP_1.6.0.jar -r 16S.E.ALL.referene.fasta -e pastajob.marker001.16S.E.ALL.unaligned.aln

This will compare the reference alignment and your estimated alignment and will print out various measures of their similarity and difference.

We can also try:

   java -jar ~/bin/FastSP_1.6.0.jar -r 16S.E.ALL.referene.fasta -e pastajob.marker001.16S.E.ALL.unaligned.aln -mlr

which discards lower case letters from the reference alignment (in this case, only SPFN is meaningful and SPFP is not).

Step 3: Understanding and using PASTA options:

GUI:

When GUI is used, a limited set of important options can be adjusted on the GUI.

  • External tools: You can choose the tool used for obtaining alignments on subsets (Aligner), for performing pairwise merges (Merger), tree estimation tool (Tree estimator), and finally the model used for tree estimation. The default values should work fine and you can simply use them. However, we know of a few cases when you might want to change these values. If you have small datasets (say <100 taxa), you can use RAxML instead of FastTree without the running time increasing dramatically. For AA datasets, if you have an idea of what substitution model is more appropriate, you can change the model (but note that FastTree supports only a few models and you might have have to switch to RAxML for more models). Finally, if you don't have Java on your machine, you can always use Muscle instead of Opal for merging pairs of alignments.

  • Job Settings: these are options that don't change algorithmic settings, but control what the job name should be, where the output files should be saved, how many CPUs should be run, and finally how much memory should be made available to OPAL. The OPAL memory has not been an issue for PASTA, but if you ever encounter a memory problem, you can increase the maximum memory that OPAL can use.

  • Sequences and Tree: You can choose the input sequence files here (the only real mandatory field). The datatype is automatically determined by the GUI, but if it is incorrectly determined, you can always fix that. Use for initial tree is a checkbox that is enabled only if your input is already aligned. When chosen, your alignment is used as the initial alignment (from which the starting tree is estimated). If not checked, the alignment in your input file is ignored (gaps are removed). The Tree file option can be used to provide a starting tree to PASTA. If not provided, PASTA estimates the starting tree.

  • PASTA Settings: these sets of options change algorithmic settings of PASTA. Max. Subproblem controls the subsets size (subset sizes more than 200 are discouraged, but smaller values could improve running time). Maximum subset size can be set as an absolute value or as a fraction of the dataset size. Time limit and Iteration limit can be used to control how many iterations PASTA uses. The default is 3 iterations, but note that most of the gains are usually obtained in the first iteration. Thus, if the running time is an issue, you can always reduce the iteration number to 1 or 2. You can also set a running time limit in hours. When the time is up, the current iteration continues, but further iterations are not performed. Return is also a legacy of PASTA which determines whether the tree from the final iteration should be returned as final tree, or if the tree that had the highest likelihood (and the corresponding alignment) should be used. Since PATA (unlike SATe) uses masking before running the ML tool, changing this option is not recommended.

  • Workflow Settings: This section gives you options that are not really related to PASTA. If you choose Two-Phase (not PASTA), PASTA will not be used. Instead the alignment method selected will be used to get an alignment, and the tree estimation tool will be used to obtain a tree. With Extra RAxML Search, PASTA will be used to obtain an alignment, and then RAxML will also be run on the final alignment; this is a technique you can use to ensure that you get a RAxML tree instead of a FastTree tree as your output. Note that RAxML takes more time than FastTree, so using RAxML for your final tree increases the running time but not by as much as using it in each iteration.

Command-line:

The command line allows you to alter the behavior of the algorithm using a variety of configuration options. Running PASTA with the -h option lists all the options that can be provided to the command-line (see below for the most important ones). In addition to the command-line itself, PASTA can read the options from one or more configuration files. The configuration files have the following format:

[commandline]
option-name = value

[sate]
option-name = value

Note that as mentioned before, with every run, PASTA saves the configuration file for that run as a temporary file called [jobname]_temp_pasta_config.txt in your output directory. You can view one of these files in a text editor to better understand the format of the configuration file.

PASTA can read multiple configuration. Configuration files are read in the order they occur as arguments (with values in later files replacing previously read values). Options specified in the command line are read last. Thus these values "overwrite" any settings from the configuration files.

The following is a list of important options used by PASTA. Note that by default PASTA picks these parameters for you, and thus you might not need to ever change these (with the important exception of the -d option):

  • Initial tree: As mentioned before, PASTA needs an initial tree for doing the first round of the alignment. Here is how the initial tree is picked.

    • If a starting tree is provided using the -t option, then that tree is used.
    • If the input sequence file is already aligned and --aligned option is provided, then PASTA computes an ML tree on the input alignment and uses that as the starting tree. * If the input sequences are not aligned (or if they are aligned and --aligned is not given), PASTA uses the following procedure for estimating the starting alignment and tree. It 1) randomly selects a subset of 100 sequences, 2) estimates an alignment on the subset using the subset alignment tool (default MAFFT-l-insi), 3) builds a HMMER model on this "backbone" alignment, 4) uses hmmalign to align the remaining sequences into the backbone alignment, 5) runs FastTree on the alignment obtained in the previous step.
  • Data type: PASTA does not automatically detect your data type. Unless your data is DNA, you need to set the data type using -d command. Your options are DNA, RNA, and PROTEIN.

  • Tree estimation tool: the default tool used for estimating the phylogenetic tree in PASTA is FastTree. The only other option currently available is RAxML. You can set the tree estimator to RAxML using the --tree-estimator option. However, Be aware that RAxML takes much longer than FastTree. If you really want to have a RAxML tree, we suggest obtaining one by running it on the final PASTA alignment. You can change the model used by FastTree (default: -nt -gtr -gamma for nt and -wag -gamma for aa) or RAxML (default GTRGAMMA for nt and PROTWAGCAT for AA) by updating the [model] parameter under [FastTree] or [RAxML] header in the input configuration file. The model cannot be currently updated in the command line directly as an option.

  • Subset alignment tool: the default tool used for aligning subsets is MAFFT, but you can change it using the --aligner option. We strongly suggest that the alignment subset size should always be at most 200 sequences, because for subsets that are larger than 200, the most accurate version of MAFFT (-linsi) is not used.

  • Pairwise merge tool: the default merger tool is OPAL. You can change it using the --merger option. If you have trouble with OPAL (java version, memory, etc.), using Muscle should solve your problem; in our experience, it doesn't really affect the accuracy by a large margin.

  • CPUs: PASTA tries to use all the available cpus by default. You can use --num_cpus to adjust the number of threads used.

  • Number of iterations: the simplest option that can be used to set the number of iterations is --iter-limit, which sets the number of iterations PASTA should run for. You can also set a time limit using --time-limit, in which case, PASTA runs until the time limit is reached, then continues to run until the current iteration is finished, and then stops. If both options are set, PASTA stops after the first limit is reached. The remaining options for setting iteration limits are legacies of SATe and are not recommended.

  • Masking: Since PASTA can produce very gappy alignments, it is a good idea to remove sites that are almost exclusively gaps before running the ML tree estimation. By default, PASTA removes sites that are more than 99.9% gaps. You can change that using the --mask-gappy-sites option. For example, using --mask-gappy-sites 10 would remove sites that are gaps for all sequences except for (at most) 10 sequences. Increasing the masking can make PASTA a bit faster and can potentially reduce the memory usage, but it could also have a small effect on the final tree. If unsure, leave the option unchanged. Note that the final alignment outputted by PASTA is NOT masked, but masked versions of the output are also saved as temporary files (see below).

  • Maximum subset size: two options are provided to set the maximum subset size: --max-subproblem-frac and --max-subproblem-size. The --max-subproblem-frac option is a number between 0 and 1 and sets the maximum subset size as a fraction of the entire dataset. The --max-subproblem-size option sets the maximum size as an absolute number. When both numbers are provided (in either a configuration file or the command line), the LARGER number is used. This is an unfortunate design (legacy of SATe) and can be quite confusing. Please always double check the actual subset size reported by PASTA and make sure it is the value intended. The default subset sizes should work just fine. In our limited experiments, we have noticed that reducing the maximum subset size from 200 to 100 for very large datasets increases speed with little or no effect on the final alignments.

  • Temporary files: PASTA creates many temporary files, and deletes most at the end. You can control the behavior of temporary files using few options: --temporaries sets the directory where temp files are created, -k instructs PASTA to keep temporary files, and --keepalignmenttemps will keep even more temporary files. Note that these are different from the temporary files created in the output directory (which are always kept).

  • Dry run: The --exportconfig option can be used to just create a config file and exit without actually running PASTA. This is useful for making sure the configurations are correct before actually running the job.

The remaining options available in PASTA are mostly legacies from SATe and are generally not useful for PASTA runs.

Step 4: An example run on AA data

To run PASTA on amino acid datasets, it is required that the data type be declared to PASTA. In the GUI version, this is automatically done, but you can adjust the data type too. On the command-line, this has to be set explicitly.

In the sample data directory, there is a sample AA dataset called BBA0067-half.input.fasta. Run PASTA on this input file using the following command:

run_pasta.py -i BBA0067-half.input.fasta

Note the error message that is produced. Now try:

run_pasta.py -i BBA0067-half.input.fasta -d protein

PASTA should now work fine. Note that this run is going to take longer to finish (5-20 minutes) because the input file is larger and also because PASTA is slower on AA compared to DNA. If you are not willing to wait that long, stop the run using Ctrl+C.

Step 5: using a starting tree

To use a starting tree with GUI, you can simply use the "Tree file" option and select the starting tree. To provide a starting tree using the command-line, run the following command:

run_pasta.py -i BBA0067-half.input.fasta -t BBA0067-half.startingtree.tre -d protein -j myjob --num-cpus 2

This run will use a starting tree that we already had (-t option) and will also set the name for the job to myjob using the -j option. The output files will now be prefixed by myjob instead of pastajob. Finally, --num-cpus will change the number of threads that PASTA uses.

Step 6: using run_seqtools.py

Bundled with PASTA is a small tool calll seqtools, which can be invoked by running:

run_seqtools.py -h

This tool (which is available only as command-line) can help you with some specific manipulations of alignment files. The current version of this tool has a few actions that it can perform:

  1. Read alignments in a couple of formats and write them out in a different format
  2. Mask gappy sites from an alignment
  3. Filter fragmentary sequences
  4. Rename sequences in a file

These operations are useful for PASTA for a few reasons:

  1. The temporary files outputted by PASTA (pastajob_temp*) use "safe" taxon names internal to PASTA. PASTA also outputs a mapping file that can map taxon names to the original values (pastajob_temp_name_translation.txt). If you ever want to use these temporary files, you might need to change taxon names.
  • Some of those temporary files are also in formats that are specific to PASTA. In particular, the unmasked alignment files are in a format called COMPACT3 which PASTA uses, but no other software understands. You can convert from CPMACT3 to other formats using run_seqtools.py.
  • You might want to convert your PASTA alignment file to the PHYLIP format so that you can run RAxML on them.
  • PASTA produces gappy alignments for large datasets. This is a feature of PASTA alignments and relates to the design of the transitivity merge. To deal with this abundance of gaps, internally, PASTA masks sites with more than 99.9% gaps (by default) before estimating a tree. For any downstream analyses that uses PASTA alignments, it might also be a good idea to also perform some form of masking of gappy sites.
  • PASTA is not designed for aligning fragmentary data (it is a global alignment tool). You might want to remove fragmentary sequences from your dataset before using PASTA, or use UPP (another alignment method we have developed, and available at https://github.com/smirarab/sepp).

The way run_seqtools.py works is that you give it an input alignment and a set of options. Based on the options given, the alignment is altered in one of several ways and the resulting alignment is outputted. The input/output format can also be adjusted. An example will best illustrate this.

Go to the place where you ran one of your test runs. There, you will find a file called pastajob_temp_iteration_0_seq_unmasked_alignment.gz. This is the unmasked alignment from the first iteration of PASTA in the COMPACT3 format and is also compressed. Use gunzip pastajob_temp_iteration_0_seq_unmasked_alignment.gz to unzip it. Now, run:

run_seqtools.py -infile pastajob_temp_iteration_0_seq_alignment.txt -outfile iter0.phylip -informat COMPACT3 -outformat PHYLIP -masksites 5 -rename pastajob_temp_name_translation.txt

The above command reads pastajob_temp_iteration_0_seq_alignment.txt alignment file in COMPACT3 format, and outputs the results in PHYLIP format to a file named iter0.phylip. It also removes all sites that have at most 5 non-gap characters. Finally, it maps the names from internal PASTA names to original sequence names using the temporary mapping file produced by PASTA.

Here is another useful command:

run_seqtools.py -infile pastajob.marker001.small.aln -informat FASTA -outfile pata-masked-20.fasta -outformat FASTA -maskmin 20

This reads your final PASTA alignment and masks out columns with at most 20 non-gap characters.

If there are features that you would like to see implemented run_seqtools.py, let us know, and we will try to add them.

Restart PASTA from the previous runs.

Let us assume you run PASTA and after 1 iteration finished, the job stopped. Now you want to run another iteration. What do you do?

You can use

zcat pastajob_temp_iteration_1_seq_unmasked_alignment.gz |run_seqtools.py -informat COMPACT3  -outformat FASTA -outfile iterate1.fasta -rename pastajob_temp_name_translation.txt 

to get a file called iterate1.fasta that you can give as input to PASTA for the next iteration. What is going on here is the following.

  1. zcat will uncompress the unmasked alignment file
  2. The unmasked file in the COMPACT3 format is translated to FASTA output
  3. The internal PASTA names are translated to original names.

The output file may be very large. For very large input, you may be OK with loosing some very gappy portions of the alignment between the two steps. If so, you can use

zcat pastajob_temp_iteration_1_seq_unmasked_alignment.gz |run_seqtools.py -informat COMPACT3  -outformat FASTA -outfile iterate1.fasta -rename pastajob_temp_name_translation.txt -masksitesp 0.0001

to remove sites that have have (for example) 99.99% or more gaps.

Step 7: Running PASTA using configuration files

As mentioned before, the configurations used for running PASTA are all saved to a configuration file, and also PASTA can be run using a configuration file. These configuration files are useful for multiple purposes - for example, if you want to reproduce a PASTA run, or if you want to report the exact configurations used. Always make sure to keep the produced configuration files for future reference. Note however, that configuration files can be used as input only using command-line.

Let's open myjob_temp_pasta_config.txt under the data directory and take a look at it. Notice that the options we referred to are all mentioned here.

Now imagine that we wanted to instruct PASTA to use the JTT model instead of WAG for a protein run. Here is how we can accomplish that. Copy the myjob_temp_pasta_config.txt file as a new file (e.g., cp myjob_temp_pasta_config.txt jtt_config.txt). Then open jtt_config.txt using a text editor of your choice. Find model = -wag -gamma -fastest under the [FastTree] header. Remove the -wag option and save the config file. Note that the default model in FastTree is JTT, and therefore, when the -wag is removed, it automatically switches to using JTT. To run PASTA using this new configuration file, run:

run_pasta.py  jtt_config.txt

Adding custom parameters to aligners: It is also possible to add custom parameters to alignment and merge tools. To do so, you need to use the config file. Under each alignment tool in the config file, you can add an args attribute and list all the attributes you want to pass to that tool. For example, to run Mafft with your choice of gap penalty value, edit the config file under the [mafft] heading to something like:

[mafft]
path = [there will be a path here to your pasta directory]/bin/mafft
args = --op 0.2 --ep 0.2

and use this config file to run PASTA.

Note that PASTA does not try to understand these extra parameters you pass to external tools. It simply appends these parameters to the end of the command it executes.

Step 8: Turning on Debugging information

If PASTA encounters an error, we ask that you report the error on the google user group. For us to be able to understand and debug the error, it would be helpful to have extra log files. To produce these extra debugging information, you need to set some environmental variables before running PASTA. From command-line run:

export PASTA_DEBUG=TRUE
export PASTA_LOGGING_LEVEL=DEBUG
export PASTA_LOGGING_FORMAT=RICH
run_pasta.py -i small.fasta  2>log.err 1>log.out

This will produce two log files log.err and log.out that would be very useful for our debugging. Please send us those in addition to your error report.

Also useful for debugging is keeping temporary files that can be inspected after the job finishes. To keep the temporary files, run:

run_pasta.py -i small.fasta -k --keepalignmenttemps 2>log.err 1>log.out

This will leave back temporary files, which could be useful for testing.

Step 9: Analyzing your own data

At this stage, if you have input files that you like to have analyzed, you can start doing that.


Contact

Email: [email protected] for all issues.