The MCscan download page can be accessed here.
This software provides a clustering module for viewing the relationship of colinear segments in multiple genomes (or heavily redundant genomes). It takes the predicted pairwise segments from dynamic programming (DAGchainer in particular) and then try to build consensus segments from a set of related, overlapping segments.
Certain part of this package (dagchainer.cc
) is based on the TIGR software DAGchainer. The program used this as an initial step to generate pairwise segments.
Along with the DAGchainer guidelines, all code is copiable, distributable, modifiable, and usable without any restrictions.
Note
MCscan currently will only run on linux or cygwin platform, as it is dependent on GNU function.
Simply put mcscan.tar.gz
in any directory:
$ tar zxf mcscan_version.tar.gz $ cd mcscan_version.tar.gz/ && make
the compiled codes are within the same directory as the source.
Then put copy of MCL executable within the same folder as MCscan (MCL program downloadable here).
MCscan reads in at least two sources of data: .blast
file and .bed
file. This may seem daunting at first, but these are very easy to retrieve. Have a look at the at_at.blast
, at_at.bed
in the folder. In the actual execution, MCL is used to generate mcl file (at_at.mcl
), which is used in multiple synteny construction.
Here is what can be used to genenerate the files.
The blast file is the following tab-delimited format:
gene1 gene2 e-value
easily genenerated from a m8 blast output format:
$ cut -f 1,2,11 xyz.m8 > xyz.blast.unfiltered
The first thing please ensure that for each gene pair, only one e-value is reported, the blast output normally would contain multiple HSPs, a convenience script is attached to filter all the redundant pairs:
$ python filter_blast.py xyz.blast.unfiltered xyz.blast
The .bed
file contains the following tab-delimited format (see bed format):
chromosome_id start stop gene_name
notice when you compare multiple genomes, formulate your molecule name carefully to avoid duplicated names. The .bed
file can usually be generated by parsing the gene annotation file provided by the sequencing group (usually the sequencing project ftp will provide a .gff3
file).
Once you have everything ready, put them in the same folder. We need to generate .mcl
file if this is the first run (also take a look at the example in run.sh
):
$ more xyz.blast | mcl - --abc --abc-neg-log -abc-tf 'mul(0.4343), ceil(200)' -o xyz.mcl
Some might encounter a problem exec the mcl
command, in which case the mcl
binary needs to be rebuilt from here. After the first time you run it (the mcl
file has been generated). You can simply use:
$ ./mcscan xyz
The help screen:
Usage: mcscan [OPTION...] prefix_fn MCSCAN -- multiple collinearity scan (compiled Jul 6 2010 15:52:03) Reference: Tang,H., Wang,X., Bowers,J.E., Ming,R., Alam,M., Paterson,A.H. Unraveling ancient hexaploidy through multiply-aligned angiosperm gene maps Genome Research (2008) 18, 1944-1954. -a only builds the pairwise blocks (.aligns file) -A use base pair dist instead of gene ranks -b limit within genome synteny (e.g. Vv-Vv) mapping -e, --e_value=E_VALUE alignment significance -g, --gap_score=GAP_SCORE gap penalty -k, --match_score=MATCH_SCORE final score=MATCH_SCORE+NUM_GAPS*GAP_SCORE -p, --pivot=PIVOT PIVOT is the reference genome, make it two letter prefix inyour .bed file, everything else will be aligned to the reference -s, --match_size=MATCH_SIZE number of genes required to call synteny -u, --unit_dist=UNIT_DIST average intergenic distance -?, --help Give this help list --usage Give a short usage message -V, --version Print program version Mandatory or optional arguments to long options are also mandatory or optional for any corresponding short options. Report bugs to <[email protected]>.
The default values are quite generic and should work for many instances. The following are more detailed information for users who wish to tune their results.
The pairwise synteny formula is roughly (Haas et al. 2004), note that DIST_X
, DIST_Y
is in base pair:
FINAL_SCORE = MATCH_SCORE*num_matches+max(DIST_X, DIST_Y)/UNIT_DIST*GAP_SCORE
The multiple synteny formula is roughly, now DIST_X
, DIST_Y
is the distance in the partial order graph (not in base unit, but in gene index unit):
FINAL_SCORE = MATCH_SCORE*num_matches+max(DIST_X, DIST_Y)*GAP_SCORE
Sometimes you may want to run just the pairwise synteny on .blast
and .bed
files, then you can try:
$ ./mcscan at_at -a
Note that to run this, .mcl
file is not required, the result is now slightly different, since MCscan uses the mcl file to filter the BLAST hits.
There are, by default at_vv
sets of files and os_sb
sets of files, which is basically two different projects.
First example, let us compare Os to Sb (rice to sorghum), just default settings, run:
$ ./mcscan os_sb
It takes about one minute to run, the result is best viewed in EXCEL. The first part of the file lists all the parameters of the program. The result is separated with a line like this:
## View 11: pivot Sb02
This is called a view, each view uses a different chromosome as the reference. Then the blocks following this line is the multiply aligned blocks. The first column is numerical identifier, the second column is the actual pivot. Then following columns are the regions that are aligned to the pivot. The alignments between rice and sorghum are in fact complicated by one or more shared WGDs, creating several columns but mostly are four regions matching each other.
For the second example, we wish to align Arabidopsis to grape, and use grape as the reference genome, but we need to do it a little differently. Unlike the first example, we are not interested in WGD in grape in this case, and we only wish to see the grape used as pivot. Therefore, we modify the pivot:
$ ./mcscan at_vv -p Vv -b
This trick -b
will limit any Vv-Vv matches (in fact this is an older duplication called gamma) in the output.
There are two outputs. .aligns
file and .blocks
file, corresponding to pairwise and multiple synteny respectively. You will find the .aligns
file very useful too, sometimes. But this is essentially similar to the output of DAGchainer (adding a few statistics and change the default paramters).
- May 12, 2007 (version <0.5) initial release.
- Aug 05, 2007 (version 0.5) add the option of of a reference genome
- Oct 13, 2007 (version 0.6) add convenience python script to streamline the process
- Mar 07, 2008 (version 0.7) implement statistical test for pairwise syntenic blocks
- Nov 13, 2008 (version 0.8) partial-order graph for alignment
Any questions, problems, bugs are welcome and should be dumped to
Haibao Tang : bao at uga dot edu
Plant Genome Mapping Laboratory, University of Georgia