Skip to content

Latest commit

 

History

History
469 lines (358 loc) · 17.1 KB

README.rst

File metadata and controls

469 lines (358 loc) · 17.1 KB

Fammer

Utilities for curating a hierarchical set of sequence profiles representing a protein superfamily.

If you use this software in a publication, please cite our paper that describes it:

Talevich, E. & Kannan, N. (2013) Structural and evolutionary adaptation of rhoptry kinases and pseudokinases, a family of coccidian virulence factors. BMC Evolutionary Biology 13:117 doi:10.1186/1471-2148-13-117

Available at: http://www.biomedcentral.com/1471-2148/13/117

Freely distributed under the permissive BSD 2-clause license (see LICENSE).

Installation

The installation consists of a Python library, fammerlib, and two scripts, fammer.py and tmalign.py.

Download the .zip file and unpack it, or clone this Git repository, to get the source code.

To use all the features of Fammer, you'll need the following third-party programs installed:

If you're on a Debian-based Linux system, check your package manager for these first to save yourself some time:

sudo apt-get install mafft hmmer tm-align fasttree python-pip

Then, install the Python library dependencies and Fammer itself as follows.

Recommended:

Install the Python packaging system pip or setuptools. Then run the setup script, and all Python dependencies will be pulled in:

python setup.py build
python setup.py install

(You might need root privileges for the last step.)

Manual:

Install the Python libraries Biopython, biofrills, biocma and networkx. Then run the setup script as above.

Basic usage

Global options:

-h, --help
Show a help message and basic usage.
--quiet
Don't print status messages, only warnings and errors.

Sub-commands:

build
Build profiles from a given directory tree.
update-fasta
Replace original FASTA sequence sets with the ungapped sequences from the corresponding alignment (.aln) files, sorted by decreasing length.
scan
Scan and classify input sequences using a set of profiles.
add
Scan a target database with the given HMM profile set. Add hits that meet acceptance thresholds to the profile FASTA files.
refine
Leave-one-out validation of HMM profiles.
cluster
Split a sequence set into clusters (based on phylogeny).

Commands

build

Construct a profile database from a directory tree of family profile alignments.

Assume we have a directory tree set up under Superfamily/ as above. Next, run fammer.py build Superfamily to align all sequence files with MAFFT, and (recursively up) align the consensus sequences of each subfamily together:

Superfamily/
    Group1/
        subfam1_Group1.fasta
        subfam1_Group1.aln
        subfam2_Group1.fasta
        subfam2_Group1.aln
        subfam3_Group1.fasta
        subfam2_Group1.aln
        ...
        Group1-Unclassified.fasta
        Group1-Unclassified.aln
    Group1.aln
    ...
    Superfamily-Unclassified.fasta
    Superfamily-Unclassified.aln
Superfamily.aln

The alignments are in un-wrapped Clustal format.

You can manually adjust the alignments and rebuild, if desired, perhaps iteratively. Only the "parent" family alignments will be rebuilt as needed, e.g. if subfam1_Group1.aln is edited, then only Group1.aln and Superfamily.aln will be rebuilt the next time fammer.py build Superfamily is called because the consensus sequences that constitute those alignments may have changed. (It's like Make.)

Finally, use the option --hmmer to build profiles:

Superfamily/
    Group1/
        subfam1_Group1.fasta
        subfam1_Group1.aln
        subfam2_Group1.hmm
        ...
    Group1.aln
    Group1.hmm
    ...
Superfamily.aln
Superfamily.hmm
Superfamily_all.hmm     # concatenated profiles
Superfamily_all.hmm.{h3f,h3i,h3m,h3p}   # indexes from hmmpress

The --mapgaps option works similarly, if you have the necessary programs installed.

The --clean option can be included with any of the above commands to remove intermediate files.

If you have included PDB structures in your directory tree and have a structure alignment program installed, the --pdb option will first create a structural alignment of the PDBs in the directory, then use that alignment as the seed for higher-up alignments:

Superfamily/
    Group1/
        subfam1_Group1.fasta
        subfam1_Group1.aln
        1ATP.pdb
        1O6K.pdb
        3C4X.pdb
        ...
    Group1.pdb.seq  # Alignment of 1ATP, 1O6K, 3C4X
    Group1.aln
    ...
Superfamily.aln

In this example, the alignment generated by aligning the structures 1ATP, 1O6K and 3C4X is passed to MAFFT as a seed for Group1.aln, along with the unaligned consensus sequences of each subfamily of Group1 (subfam1, subfam2, ...). The seed sequences are removed from Group1.aln after the alignment of consensus sequences is completed. This can help correctly align the more divergent families and groups to each other.

For nested directory trees, the option --tree generates a Newick file representing the structure of the directory tree. A tree based on the above examples would look something like this (ignoring whitespace), created as Superfamily.nwk:

((subfam1_Group1, subfam2_Group1, subfam3_Group1,
  Group1-Unclassified)Group1,
 (subfam1_Group2, subfam2_Group2, subfam3_Group2,
  Group2-Unclassified)Group2,
 (subfam1_Group3, subfam2_Group3, subfam3_Group3,
  Group3-Unclassified)Group3,
 Superfamily-Unclassified)Superfamily;

This tree could be passed to RAxML as a constraint tree in an effort to identify deeper subfamilies, for example.

update-fasta

Convert the contents of the .aln sequence alignment files back to unaligned FASTA format, overwriting the corresponding .fasta files.

After initially building a tree of sequence alignments, you might edit the Clustal alignments, deleting spurious sequences or trimming the alignment to the edges of a conserved domain. With update-fasta, you update the contents of the unaligned sequence files to match the .aln files.

The next step is usually to either (a) do some sequence processing unrelated to fammer, e.g. clustering, or (b) realign everything. Since you've presumably removed some junk from the input sequences, the resulting alignments may be better.

scan

Scan/search a set of sequences (FASTA) with the HMM profile database and assign a classification to each hit.

This is essentially a set of wrappers to process the output of hmmsearch, simplifying the results for common use cases. The three output forms are:

summary (default):
Print two formated columns for each profile in the given HMM profile database that matched at least one hit: the name of the profile and the number of hits for which it was the best match.
table (--table):
For each sequence in the target sequence set that matched a profile in the HMM profile database, print the sequence ID/accession and the name of the best-matching profile, separated by a tab character.
sequence sets (--seqsets):
For each profile and matching sequence set (as they'd appear in summary output), write a file containing the matching sequences. The output filenames indicate the name of the source sequence file name and the matching HMM profile names.

Note that --table and --seqsets can be combined.

add

Scan a target database with the given HMM profile set and add hits that meet a series of acceptance thresholds to the profile FASTA files.

Once you've constructed profiles from a collection of carefully selected sequences representing each subfamily, you can use this command to scan another sequence set and automatically add strong hits to the corresponding profile sequence sets. The target database could be the *-Unclassified.fasta sequence sets, to catch any classifiable members that were not noticed initially, or a larger sequence database like refseq_proteins, if you're confident in your coverage of the superfamily and want to improve the sensitivity of your profiles.

refine

Leave-one-out validation of sequence profiles. Unlike the other commands, this is non-recursive.

Given a target subdirectory and the name of the subdirectory's *-Unclassified.fasta file (if not specified, it looks for dirname-Unclassified.fasta), scan each subfamily's sequence set (.fasta) with the corresponding HMM profile (.hmm), and also scan the -Unclassified.fasta file with all the HMMs to obtain scores for each sequence and each profile. Then, compare the scores of sequences in a subfamily, starting with the worst-scoring sequence, to the highest-scoring "unclassified" sequence by the same profile. If, for a given profile, a classified sequence scores worst than an unclassified one, mark the classified one for removal from the sequence profile.

Note that if a member of a known subfamily was mistakenly placed in -Unclassified.fasta (i.e. was missed by the initial classification), then many of the legitimate members of the subfamily profile could score worse than this high-scoring "unclassified" sequence and be erroneously marked for removal from the profile. This is easy enough to spot in the logged output. One way to avoid it is to first use the add command with -Unclassified.fasta as the target, to catch and classify such sequences beforehand.

cluster

Extract clusters from a sequence set based on phylogenetic relationships.

Uses FastTree to quickly build a tree with branch support values, then extracts well-supported clades from the tree and writes the corresponding sequence sets to FASTA files. Unclustered sequences are written to another "Unique" file. Also writes a phyloXML tree file (.xml) showing clusters as colorized clades.

Bundled modules

tasks

Serves the basic purpose of a build tool like Make or Rake: compare the time stamps of input and output files at each step of the build process, and only rebuild the outputs that are out of date. Also track intermediate files to clean up after the process successfully completes. See this blog post about it.

tmalign

Align multiple structures using TMalign for pairwise alignments and a minimum spanning tree constructed from the pairwise TM-scores to assemble the pairwise alignments into a multiple sequence alignment. This module can also be used as a command-line script.

How to curate profiles

Directory tree is the superfamily hierarchy

Begin by creating a directory tree with each subfamily's representative sequences in an unaligned FASTA file. The FASTA file names must end in .fasta.

A simple un-nested layout looks like:

Superfamily/
    subfam1.fasta
    subfam2.fasta
    subfam3.fasta
    ...
    Superfamily-Unclassified.fasta

Typically, a protein superfamily will have some members that cannot be cleanly classified into subfamilies.

Recursive nesting is also allowed (in fact, that's the point of this project). It looks like this:

Superfamily/
    Group1/
        subfam1_Group1.fasta
        subfam2_Group1.fasta
        subfam3_Group1.fasta
        ...
        Group1-Unclassified.fasta
    Group2/
        subfam1_Group2.fasta
        subfam2_Group2.fasta
        subfam3_Group2.fasta
        ...
        Group2-Unclassified.fasta
    Group3/
        subfam1_Group3.fasta
        subfam2_Group3.fasta
        subfam3_Group3.fasta
        ...
        Group3-Unclassified.fasta
    ...
    Superfamily-Unclassified.fasta

At least 2 sequences are needed to define each subfamily. The sequences are initially retrieved and organized manually. External databases can help; for example, KinBase provides a classification scheme and representative sequences for the protein kinase superfamily. The directory hierarchy defines the higher levels of organization of the superfamily.

Now build the initial alignments:

fammer.py build --clean Superfamily/

Trim the tails

For best results, sequences should all cover the same conserved domain region and align exactly at N- and C-termini. For example, the eukaryotic protein kinase domain begins 7 residues before the glycine-rich loop (GxGxxG motif) and ends 12 residues after a conserved arginine in subdomain XI (RP[TS] motif). The consensus function used in Fammer includes the flanking sequences before the first conserved block or after the last conserved block, regardless of "gappiness", so even one sequence in a set can be used to define domain boundaries for the whole set.

Edit the subfamily-level alignments (.aln) to trim the sequences to the domain boundaries. I use Vim's block-selection mode (Ctrl-v); you might prefer JalView. Do not edit higher-level .aln files; just look at them to see which subfamily-level alignments have extended tails, then edit those .aln files.

When you think you've trimmed all the .aln files to the conserved domain region, update the corresponding unaligned sequence files (.fasta) to match:

fammer.py update-fasta Superfamily/

Rebuild the alignments whose source sequence sets have changed:

fammer.py build --clean Superfamily/

MAFFT may create better alignments for sets where the unalignable regions have now been removed. When this step completes, examine the top-level alignment (e.g. Superfamily.aln) -- are the domain boundaries aligned cleanly across all sequence sets? Repeat the process if any subfamilies need to be trimmed further.

If you realize you've trimmed a profile too far, use your version control system (you are using Git or Mercurial to take snapshots of the tree, right?) to revert the .fasta file to an earlier version, then rebuild and try trimming again.

If you never had the full-length sequences for a subfamily to begin with, try to find one representative full-length sequence from a database like UniProt, add it to the .fasta file, realign, and trim that sequence to the region it should cover. This won't improve the HMM or MAPGAPS profile for that subfamily much, but it will help the higher-level alignments that include the consensus sequence of that subfamily.

PDB files for structural alignment

While MAFFT typically creates good alignments within a subfamily, for high-level profiles it may struggle to align the consensus sequences of very divergent families or groups to each other. In these cases, seeding with a structural alignment can help line up homologous regions.

Manually identify the high-quality solved crystal structures that correspond to families in your tree, and place those PDB files in the directory tree at the same level as the subfamily they represent.

Open the PDB file (.pdb) in a text editor and:

  • Remove the ATOM records corresponding to residues outside the conserved domain.
  • If multiple chains are present, choose the best, most complete chain and delete the others. (Otherwise, Fammer will take the first chain by default.)

To determine the domain boundaries, load a "reference" PDB (e.g. 1ATP for kinases) and the other PDB together in PyMOL and align using the command "cealign" or "fit". Visually find which residues correspond to the start and end of the conserved domain. If your PDB structure diverges from the reference structure drastically before or after a certain point (i.e. N- or C-terminal region of the domain is non-homologous -- excluding inserts), it may be best to truncate the PDB to remove the non-homologous portion as it cannot be aligned accurately.

Save the edited .pdb file, and rebuild the higher-level profiles:

fammer.py build --clean Superfamily/

Now that PDB files are present in the directory tree, Fammer will writes structural alignments in FASTA format to the .pdb.seq files.

Once complete, examine the top-level alignment (e.g. Superfamily.aln) for misalignments. To fix these, first edit the corresponding .pdb.seq file. Usually there is just one structure that was misaligned. Rebuild and re-edit as necessary. If a particular PDB is very poorly aligned to the others, it may be best to just remove it altogether -- it may have a different conformation from the others.