Here are some pair-wise genome alignments made with LAST.
The 2023 directory has alignments of these genomes:
Code | Phylum | Animal | Scientific name | Genome |
---|---|---|---|---|
helRob | annelid | jawless leech | Helobdella robusta | GCF_000326865.1 |
hirMed | annelid | medicinal leech | Hirudo medicinalis | GCA_011800805.1 |
lamSat | annelid | Satsuma tubeworm | Lamellibrachia satsuma | GCA_022478865.1 |
oweFus | annelid | shingle tubeworm | Owenia fusiformis | GCA_903813345.2 |
armNas | arthropod | woodlouse | Armadillidium nasatum | GCA_009176605.1 |
cenScu | arthropod | scorpion | Centruroides sculpturatus | GCF_000671375.1 |
droMel | arthropod | fruit fly | Drosophila melanogaster | GCF_000001215.4 |
gloMae | arthropod | millipede | Glomeris maerens | GCA_023279145.1 |
homAme | arthropod | lobster | Homarus americanus | GCF_018991925.1 |
limPol | arthropod | horseshoe crab | Limulus polyphemus | GCF_000517525.1 |
macAtr | arthropod | robber fly | Machimus atricapillus | GCA_933228815.1 |
strMar | arthropod | centipede | Strigamia maritima | GCA_000239455.1 |
linAna | brachiopod | shamisen shell | Lingula anatina | GCF_001039355.2 |
asyLuc | chordate | Bahama lancelet | Asymmetron lucayanum | GCA_001663935.1 |
braFlo | chordate | lancelet | Branchiostoma floridae | GCF_000003815.2 |
calMil | chordate | chimaera | Callorhinchus milii | GCF_018977255.1 |
eptBur | chordate | hagfish | Eptatretus burgeri | GCA_024346535.1 |
homSap | chordate | human | Homo sapiens | hg38_no_alt_analysis_set |
petMar | chordate | lamprey | Petromyzon marinus | GCF_010993605.1 |
acrMil | cnidaria | stony coral | Acropora millepora | GCF_013753865.1 |
actTen | cnidaria | waratah anemone | Actinia tenebrosa | GCF_009602425.1 |
epiPla | cnidaria | zoanthid | Epizoanthus planus | GCA_025388665.1 |
nemVec | cnidaria | starlet sea anemone | Nematostella vectensis | GCF_932526225.1 |
horCal | ctenophore | sea gooseberry | Hormiphora californensis | GCA_020137815.1 |
mneLei | ctenophore | sea walnut | Mnemiopsis leidyi | GCA_000226015.1 |
apoJap | echinoderm | sea cucumber | Apostichopus japonicus | GCA_002754855.1 |
strPur | echinoderm | sea urchin | Strongylocentrotus purpuratus | GCF_000002235.5 |
ptyFla | hemichordate | Hawaiian acorn worm | Ptychodera flava | GCA_001465055.1 |
sacKow | hemichordate | acorn worm | Saccoglossus kowalevskii | GCF_000003605.2 |
aplCal | mollusc | sea hare | Aplysia californica | GCF_000002075.1 |
craGig | mollusc | oyster | Crassostrea gigas | GCF_902806645.1 |
halRuf | mollusc | abalone | Haliotis rufescens | GCF_023055435.1 |
limBul | mollusc | sea butterfly | Limacina bulimoides | GCA_009866985.1 |
mizYes | mollusc | scallop | Mizuhopecten yessoensis | GCF_002113885.1 |
octBim | mollusc | octopus | Octopus bimaculoides | GCF_001194135.2 |
phoLin | mollusc | top snail | Phorcus lineatus | GCA_921293015.1 |
watSci | mollusc | firefly squid | Watasenia scintillans | GCA_015471945.1 |
phoOva | phoronid | horseshoe worm | Phoronis ovalis | GCA_028565635.1 |
The alignments were made with LAST version 1453, like this:
lastdb -P8 -uMAM8 -c myDB genome1.fa
last-train -P8 --revsym -D1e9 --sample-number=5000 myDB genome2.fa > my.train
lastal -P8 -D1e9 -m100 --split-f=MAF+ -p my.train myDB genome2.fa > many-to-one.maf
last-split -r many-to-one.maf > one-to-one.maf
This is currently the recommended way to compare distantly-related genomes, where most of the DNA lacks similarity.
- `-P8` makes it faster by using 8 threads: adjust as suitable for your computer. This has no effect on the results.
- `-uMAM8` and `-m100` strive for high sensitivity, but use a lot of memory and run time. To go much faster, omit `-m100`. To halve the memory use and run time, change `MAM8` to `MAM4`.
- `--sample-number=5000` makes `last-train` use more samples of genome2, for fear that most of genome2 lacks similarity to genome1. For the same reason, `-D1e9` is used with `last-train`, to avoid weak chance similarities more strictly.
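As a rough illustration of the tuning knobs described above, the sketch below builds the four commands as text with the thread count and seed scheme pulled out into variables. The variable names `threads`, `seed` and `recipe` are made up for this example, and the values are only placeholders. Nothing is executed: the commands are printed so they can be inspected before running them for real.

```shell
threads=4    # hypothetical value: adjust for your computer
seed=MAM4    # MAM4 instead of MAM8: about half the memory and run time

# Build the four commands as text; nothing is run here.
recipe=$(cat <<EOF
lastdb -P$threads -u$seed -c myDB genome1.fa
last-train -P$threads --revsym -D1e9 --sample-number=5000 myDB genome2.fa > my.train
lastal -P$threads -D1e9 -m100 --split-f=MAF+ -p my.train myDB genome2.fa > many-to-one.maf
last-split -r many-to-one.maf > one-to-one.maf
EOF
)
printf '%s\n' "$recipe"
```

To go much faster at some cost in sensitivity, `-m100` could also be dropped from the `lastal` line, as noted above.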
The 2022 directory has various alignments of these genomes:
Genome name | Animal | Source | Assembly name (if different) |
---|---|---|---|
allMis28112v4 | alligator | NCBI | ASM28112v4 |
cerSim1 | rhinoceros | UCSC | |
chrPic3.0.3 | turtle | NCBI | Chrysemys_picta_bellii-3.0.3 |
equCab3 | horse | UCSC | |
hg38 | human | UCSC | hg38.analysisSet |
mOrnAna1.pri.v4 | platypus | NCBI | |
They were made with LAST version 1411, using the recipe below under "2021 alignments".
The 2021 directory has various alignments of these genomes:
Genome name | Animal | Source | Assembly name (if different) |
---|---|---|---|
allMis28112v4 | alligator | NCBI | ASM28112v4 |
Bfl_VNyyK | lancelet | NCBI | |
calMil1 | chimaera | UCSC | |
chrPic3.0.3 | turtle | NCBI | Chrysemys_picta_bellii-3.0.3 |
hg38 | human | UCSC | hg38.analysisSet |
kPetMar1 | lamprey | NCBI | kPetMar1.pri |
latCha1 | coelacanth | UCSC | |
lepOcu1 | gar | NCBI | LepOcu1 |
xenTro10 | frog | NCBI | UCB_Xtro_10.0 |
They can be replicated by running LAST version >= 1387 like this:
lastdb -P8 -uMAM8 myDB genome1.fa
last-train -P8 --revsym -D1e9 --sample-number=5000 myDB genome2.fa > my.train
lastal -P8 -D1e9 -m100 --split-f=MAF+ -p my.train myDB genome2.fa > many-to-one.maf
last-split -r many-to-one.maf | last-postmask > out.maf
- The `-P8` option makes it faster by using 8 threads: adjust as appropriate for your computer. This has no effect on the results.
- The `-uMAM8` and `-m100` options strive for high sensitivity, but make the `lastal` command use much time and memory, e.g. several days and hundreds of gigabytes.
- You can trade off multi-threading and memory use (with no effect on results), see here.
Warning: these recipes were for an older version of LAST.
- Since LAST version 1205, `-R01` has no effect and can be omitted (because it's the default).
- For LAST version >= 1180, it's best to add option `-fMAF+` to the first (many-to-one) `last-split`. (In older versions, `-fMAF+` was the default.)
- Since LAST version 983, `last-split` option `-m1` has no effect and can be omitted (because it's the default).
The human genome (hg38) was aligned to chimp (panTro5) and gorilla (gorGor5), as follows. This alignment recipe is very accurate but slow. A faster recipe would mask repeats during alignment, and/or omit `-m50`.
First, an "index" of the human genome was prepared, suitable for comparing it to highly-similar sequences:
lastdb -P0 -uNEAR -R01 hg38-NEAR hg38_no_alt_analysis_set.fa
Then, substitution and gap frequencies were determined:
last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 hg38-NEAR panTro5.fa > hg38-panTro5.mat
- Human-chimp parameters: hg38-panTro5.mat
- Human-gorilla parameters: hg38-gorGor5.mat
Next, many-to-one ape-to-human alignments were made:
lastal -m50 -E0.05 -C2 -p hg38-panTro5.mat hg38-NEAR panTro5.fa | last-split -m1 > hg38-panTro5-1.maf
The above command was the slowest step (3 CPU-weeks). You can "easily" parallelize it, by processing each sequence within panTro5.fa separately (in parallel). But each process uses quite a lot of memory, so take care that multiple parallel runs don't exceed your memory.
- Human-chimp many-to-one alignments: hg38-panTro5-1.maf.gz
- Human-gorilla many-to-one alignments: hg38-gorGor5-1.maf.gz
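The per-sequence parallelization mentioned above needs the query FASTA file split into one file per sequence. Here is a minimal sketch of that splitting step, run on a tiny made-up FASTA file rather than on panTro5.fa. The directory `workdir` and the file `toy.fa` are hypothetical names, and the sketch assumes sequence names contain no `/` characters.

```shell
# Make a toy two-sequence FASTA file in a scratch directory.
workdir=$(mktemp -d)
printf '>chr1\nACGT\nGGCC\n>chr2\nTTAA\n' > "$workdir/toy.fa"

# Write each sequence to its own file, named after the first word of its header line.
awk -v dir="$workdir" '
  /^>/ { if (out) close(out); out = dir "/" substr($1, 2) ".fa" }
  { print > out }
' "$workdir/toy.fa"

ls "$workdir"

# Each piece could then be aligned on its own, for example (not run here,
# since it needs LAST and the hg38-NEAR index):
# lastal -m50 -E0.05 -C2 -p hg38-panTro5.mat hg38-NEAR chr1.fa | last-split -m1 > chr1.maf
```

How many pieces to run at once depends on the memory each `lastal` process needs, so it is worth measuring one run before launching several.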
Next, one-to-one ape-to-human alignments were made:
maf-swap hg38-panTro5-1.maf |
awk '/^s/ {$2 = (++s % 2 ? "panTro5." : "hg38.") $2} 1' |
last-split -m1 |
maf-swap > hg38-panTro5-2.maf
The awk command prepends the assembly name to each chromosome name (e.g. chr7 -> hg38.chr7).
- Human-chimp one-to-one alignments: hg38-panTro5-2.maf.gz
- Human-gorilla one-to-one alignments: hg38-gorGor5-2.maf.gz
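To see what the awk renaming step does, it can be tried on a single hand-made alignment block. The coordinates and sequences below are invented for illustration. The block is written in the order that `maf-swap` would produce, with the ape sequence first and the human sequence second. The trick relies on every block having exactly two `s` lines, so that odd-numbered `s` lines are chimp and even-numbered ones are human.

```shell
# A toy MAF block after maf-swap: the first "s" line is panTro5, the second is hg38.
swapped=$(printf '%s\n' \
  'a score=100' \
  's chr7 100 10 + 1000 ACGTACGTAC' \
  's chr7 200 10 + 2000 ACGTACGTAC')

# Same awk program as in the recipe: prefix the sequence names alternately.
renamed=$(printf '%s\n' "$swapped" |
  awk '/^s/ {$2 = (++s % 2 ? "panTro5." : "hg38.") $2} 1')

printf '%s\n' "$renamed"
```

Because awk rebuilds each modified line, the fields of the `s` lines come out separated by single spaces, which is still valid MAF.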
Finally, simple-sequence alignments were discarded, the alignments were converted to tabular format, and alignments with error probability > 10^-5 were discarded:
last-postmask hg38-panTro5-2.maf |
maf-convert -n tab |
awk -F'=' '$2 <= 1e-5' > hg38-panTro5.tab
- Human-chimp tabular alignments: hg38-panTro5.tab.gz (dotplot)
- Human-gorilla tabular alignments: hg38-gorGor5.tab.gz (dotplot)
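The final filtering step can be checked on a couple of made-up tabular lines. This sketch assumes each line ends with a single `mismap=...` field and contains no other `=` characters, which is what splitting on `=` relies on. All the numbers are invented.

```shell
# Two toy tab-format alignments: one confident (mismap=1e-10), one doubtful (mismap=0.3).
tab=$(printf '1000\thg38.chr7\t100\t50\t+\t5000\tpanTro5.chr7\t200\t50\t+\t6000\t50\tmismap=1e-10\n30\thg38.chr7\t900\t50\t+\t5000\tpanTro5.chr7\t950\t50\t+\t6000\t50\tmismap=0.3\n')

# Split each line on "=", so field 2 is the error probability; keep lines where it is <= 1e-5.
kept=$(printf '%s\n' "$tab" | awk -F'=' '$2 <= 1e-5')

printf '%s\n' "$kept"
```

Only the first toy alignment survives. The threshold can be tightened or loosened by changing `1e-5` in the awk condition.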
The human genome (hg38) was aligned to mouse (mm10). This alignment recipe is even slower and more sensitive.
First, an "index" of the human genome was prepared, suitable for comparing it to less-similar sequences:
lastdb -P0 -uMAM4 -R01 hg38-MAM4 hg38_no_alt_analysis_set.fa
Then, substitution and gap frequencies were determined:
last-train -P0 --revsym --matsym --gapsym -E0.05 -C2 hg38-MAM4 mm10.fa > hg38-mm10.mat
- Human-mouse parameters: hg38-mm10.mat
- Human-dog parameters: hg38-canFam3.mat
Next, many-to-one mouse-to-human alignments were made:
lastal -m100 -E0.05 -C2 -p hg38-mm10.mat hg38-MAM4 mm10.fa | last-split -m1 > hg38-mm10-1.maf
- Human-mouse many-to-one alignments: hg38-mm10-1.maf.gz
Finally, one-to-one MAF alignments, and high-confidence tabular alignments, were made in the same way as above.
- Human-mouse one-to-one alignments: hg38-mm10-2.maf.gz
- Human-mouse tabular alignments: hg38-mm10.tab.gz (dotplot)