cpockrandt/genmap

GenMap: Ultra-fast Computation of Genome Mappability

Badges: BioConda Install | Github All Releases | Travis CI | BSD3 License

-- This project is not actively maintained anymore. --

GenMap computes the uniqueness of k-mers for each position in the genome while allowing for up to e mismatches. More formally, the uniqueness or (k,e)-mappability of a position is the reciprocal of the number of times the k-mer starting at that position occurs approximately in the genome, i.e., with up to e mismatches. Hence, a mappability value of 1 at position i indicates that the k-mer at position i occurs only once in the sequence with up to e errors. A low mappability value indicates that the k-mer belongs to a repetitive region. GenMap can be applied to single or multiple genomes and helps find regions that are unique or shared by many or all genomes.

Below you can see the (4,1)-mappability M and frequency F of the nucleotide sequence T = ATCTAGGCTAATCTA (positions are 0-based). The mappability value M[1] = 0.33 means that the 4-mer starting at position 1, T[1..4] = TCTA, occurs three times in the sequence with up to one mismatch: at positions 1 (TCTA), 6 (GCTA) and 11 (TCTA).

[Figure: example of mappability]
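The definition above can be checked on the example sequence with a brute-force sketch. This is not how GenMap computes mappability (it works on an index); it is a quadratic illustration that assumes only the given strand is searched and that only positions with a full-length k-mer are reported. The function names are hypothetical.

```python
from typing import List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def naive_mappability(seq: str, k: int, e: int) -> Tuple[List[int], List[float]]:
    """Brute-force frequency F and mappability M for all full-length k-mers.

    Quadratic in the number of k-mers, so only suitable for tiny inputs.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # F[i]: number of k-mers within Hamming distance e of the k-mer at position i.
    freq = [sum(hamming(q, other) <= e for other in kmers) for q in kmers]
    # M[i] = 1 / F[i]; F[i] >= 1 because every k-mer matches itself.
    mapp = [1.0 / f for f in freq]
    return freq, mapp


F, M = naive_mappability("ATCTAGGCTAATCTA", k=4, e=1)
print(F[1], round(M[1], 2))  # prints "3 0.33"
```

The three hits counted in F[1] are the occurrences at positions 1, 6 and 11 named in the text.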

The mappability can be exported in various formats that allow post-processing or display in genome browsers. A small example of how to run GenMap is given below; further details are on the GitHub Wiki pages. For questions or feature requests, feel free to open an issue on GitHub or send an e-mail to christopher.pockrandt [ÄT] fu-berlin.de.

Christopher Pockrandt, Mai Alzamel, Costas S. Iliopoulos, Knut Reinert. GenMap: Ultra-fast Computation of Genome Mappability. Bioinformatics, 2020.

$ conda install -c bioconda genmap

Your CPU must support the POPCNT instruction. If you have a modern CPU, you can use the optimized 64-bit version that additionally uses SSE4. This improves the running time by 10%. To verify whether your CPU supports these instruction sets, check the output of cat /proc/cpuinfo | grep -E "mmx|sse|popcnt" (Linux) or sysctl -a | grep -i -E "mmx|sse|popcnt" (Mac).
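On Linux, the same check can be scripted. The sketch below parses /proc/cpuinfo-style text and suggests which binary variant fits. The function names are hypothetical, and treating "SSE4" as requiring both the sse4_1 and sse4_2 flags is an assumption, not something the GenMap documentation states.

```python
from typing import Set


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Collect the flag names from the first 'flags' line of /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def pick_build(flags: Set[str]) -> str:
    """Return 'optimized', 'standard' or 'unsupported' for a set of CPU flags."""
    if "popcnt" not in flags:
        return "unsupported"  # POPCNT is mandatory
    # Assumption: "SSE4" means both sse4_1 and sse4_2 are present.
    if {"sse4_1", "sse4_2"} <= flags:
        return "optimized"
    return "standard"


# Made-up sample; on a real Linux machine use open("/proc/cpuinfo").read().
sample = "processor\t: 0\nflags\t\t: fpu mmx sse sse2 popcnt sse4_1 sse4_2\n"
print(pick_build(cpu_flags(sample)))
```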

Platform | Download | Version | Additional requirements
Linux | Linux 64 bit | 1.3.0 (2020-06-17) | -
Linux | Linux 64 bit optimized | 1.3.0 (2020-06-17) | requires SSE4
Mac | Mac 64 bit | 1.3.0 (2020-06-17) | -
Mac | Mac 64 bit optimized | 1.3.0 (2020-06-17) | requires SSE4

If you want to build GenMap from source, we recommend cloning the git repository as shown below. The tarballs on GitHub do not contain the git submodules (i.e., SeqAn). Please note that building from source can easily take 10 minutes or longer, depending on your machine and compiler.

$ git clone --recursive https://github.com/cpockrandt/genmap.git
$ mkdir genmap-build && cd genmap-build
$ cmake ../genmap -DCMAKE_BUILD_TYPE=Release
$ make genmap

You can install genmap as follows:

$ sudo make install
$ genmap

or run the binary directly:

$ ./genmap

If you are using a very old version of Git (< 1.6.5), the flag --recursive does not exist. In this case you need to fetch the submodules separately before you can run cmake:

$ git clone https://github.com/cpockrandt/genmap.git
$ cd genmap
$ git submodule update --init --recursive

Requirements

Operating System: GNU/Linux, Mac
Architecture: Intel/AMD platforms that support POPCNT
Compiler: GCC ≥ 4.9, LLVM/Clang ≥ 3.8
Build system: CMake ≥ 3.0
Language support: C++14

First, you have to build an index of the FASTA file(s) whose mappability you want to compute. This step only has to be performed once. You might want to check out the pre-built indices available for download.

$ ./genmap index -F /path/to/fasta.fasta -I /path/to/index/folder

A new folder /path/to/index/folder will be created to store the index and all associated files.

There are two algorithms that can be chosen for index construction: one uses RAM (divsufsort), the other uses secondary memory/disk space (skew). Depending on your quota and main memory limitations, you can choose the appropriate algorithm with -A divsufsort or -A skew. It is recommended to use divsufsort (the default setting). It needs about 6n space in main memory (or 10n for FASTA files >2GB), where n is the number of bases in your FASTA file(s). The actual value might be higher or lower depending on the number and length of the individual sequences. If you are running out of memory, you can try to reduce the memory consumption a bit by increasing -S, e.g., -S 20 (up to 64), although this will slow down the computation of the mappability.

Skew needs more space on disk, at least 25n. You can change the location of the temp directory via the TMPDIR environment variable (e.g., to choose a directory with more quota):

$ export TMPDIR=/somewhere/else/with/more/space
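To get a feel for the 6n, 10n and 25n figures above, they can be turned into a back-of-the-envelope calculation. The sketch assumes the factors are bytes per base and reads ">2GB" as more than 2 GiB of FASTA file size; both are interpretations, and the function names are hypothetical. Actual usage may differ, as noted above.

```python
TWO_GB = 2 * 1024**3  # assumption: "2GB" interpreted as 2 GiB


def count_bases(fasta_text: str) -> int:
    """Count sequence characters, skipping '>' header lines and whitespace."""
    return sum(len(line.strip()) for line in fasta_text.splitlines()
               if not line.startswith(">"))


def estimate_divsufsort_memory(n_bases: int, fasta_size_bytes: int) -> int:
    """Rough main-memory estimate: 6n, or 10n for FASTA files larger than 2 GB."""
    factor = 10 if fasta_size_bytes > TWO_GB else 6
    return factor * n_bases


def estimate_skew_disk(n_bases: int) -> int:
    """Rough lower bound on temporary disk space for skew: 25n."""
    return 25 * n_bases


n = count_bases(">chr1\nACGT\nAC\n>chr2\nGG\n")  # tiny made-up FASTA
print(n, estimate_divsufsort_memory(n, 30), estimate_skew_disk(n))  # prints "8 48 200"
```

For a genome of about 3 billion bases in a FASTA file larger than 2 GB, the same arithmetic gives roughly 30 GB of main memory for divsufsort and at least 75 GB of disk space for skew.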

To compute the (30,2)-mappability of the previously indexed genome, simply run:

$ ./genmap map -K 30 -E 2 -I /path/to/index/folder -O /path/to/output/folder -t -w -bg

This will create a text, wig and bedGraph file in /path/to/output/folder, storing the computed mappability in different formats. You can omit formats that are not required by removing the corresponding flags -t, -w or -bg.
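As an illustration of post-processing, the sketch below extracts the intervals with mappability 1 from bedGraph text. It assumes the standard bedGraph layout (chromosome, 0-based start, end, value, separated by whitespace); the exact contents of GenMap's output files are not described here, the sample data is invented, and the function names are hypothetical.

```python
from typing import Iterator, List, Tuple

Interval = Tuple[str, int, int, float]


def parse_bedgraph(text: str) -> Iterator[Interval]:
    """Yield (chrom, start, end, value) for each data line of bedGraph text."""
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, comments and track/browser header lines.
        if not fields or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        chrom, start, end, value = fields[:4]
        yield chrom, int(start), int(end), float(value)


def unique_intervals(text: str, threshold: float = 1.0) -> List[Interval]:
    """Return the intervals whose value is at least `threshold`."""
    return [iv for iv in parse_bedgraph(text) if iv[3] >= threshold]


# Invented sample data, not real GenMap output.
sample = """chr1 0 5 1
chr1 5 9 0.5
chr1 9 12 0.333333
chr2 0 7 1
"""

hits = unique_intervals(sample)
print(hits)
print(sum(end - start for _, start, end, _ in hits))  # total unique positions
```

On a real run, the text would come from reading the bedGraph file written to the output folder.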

To output the frequency instead of the mappability, add the flag -fl to the previous command.

A detailed list of arguments and explanations can be retrieved with --help:

$ ./genmap --help
$ ./genmap index --help
$ ./genmap map --help

More detailed examples can be found in the Wiki.

Building an index on a large genome takes some time and requires a lot of space. Hence, we provide indexed genomes for download. If you need other genomes indexed and do not have the computational resources, please send an e-mail to christopher.pockrandt [ÄT] fu-berlin.de. The genomes were built with a higher sampling value (-S 20) to reduce the index size. To increase speed when computing the mappability and outputting csv files, you can build your own index with a lower sampling value. The genomes do not contain alt scaffolds (i.e., they contain only chromosomes and unplaced/unlocalized fragments).

Genome | Index size (compressed) | Download
Human GRCh38 [1] | 5.4 GB | GRCh38 index
Human hs37-1kg [2] | 5.4 GB | hs37-1kg index
Mouse GRCm38 | 4.9 GB | GRCm38 index
D. melanogaster dm6 | 0.2 GB | dm6 index
C. elegans ce11 | 0.1 GB | ce11 index
Wheat T. aestivum ta45 [3] | 21.9 GB | ta45 index

[1] ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/seqs_for_alignment_pipelines.ucsc_ids/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna.gz
[2] ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/reference/human_g1k_v37.fasta.gz
[3] ftp://ftp.ensemblgenomes.org/pub/plants/release-45/fasta/triticum_aestivum/dna/Triticum_aestivum.IWGSC.dna.toplevel.fa.gz