- Introduction
- Prerequisites
- Getting started
- Detailed instructions
- How it works
- How to test
- Issues
- Changelog
- Licence
- Contact
KmerCamel🐫 is a tool for efficiently representing a set of k-mers a masked superstring.
It is based on the following paper:
Ondřej Sladký, Pavel Veselý, and Karel Břinda: Masked superstrings as a unified framework for textual k-mer set representations. bioRxiv 2023.02.01.526717, 2023. https://doi.org/10.1101/2023.02.01.526717
See supplementary materials of the aforementioned paper for experimental results with KmerCamel🐫.
The computation of masked superstring using KmerCamel🐫 is done in two steps - first a superstring is computed with its default mask and then its mask can be optimized.
The computation of the masked superstring works as follows. KmerCamel🐫 reads an input FASTA file (optionally gzip
ed), retrieves the associated k-mers (with supported
To compute masked superstrings takes about 4-6s / 1M k-mers, which means about 3h to compute masked superstrings for the human genome. The memory consumption on human genome is about 115 GB.
All algorithms can be used to either work in the unidirectional model or in the bidirectional model
(i.e. treat
Additionally, KmerCamel🐫 can optimize the mask of the superstring via the optimize
subcommand. The implemented mask optimization algorithms are the following:
- Minimize the number of 1s in the mask.
- Maximize the number of 1s in the mask.
- Minimize the number of runs of 1s in the mask.
- GCC
- Zlib
- GLPK (can be installed via
apt-get install libglpk-dev
on Ubuntu orbrew install glpk
on macOS)
Download and compile KmerCamel🐫 by running the following commands:
git clone --recursive https://github.com/OndrejSladky/kmercamel
cd kmercamel && make
Alternatively, you can install KmerCamel from Bioconda:
conda install kmercamel
./kmercamel -p yourfile.fa -k 31 -c > ms.fa # Compute MS with the default mask
cat ms.fa | tr acgt 0000 | tr ACGT 1111 > mask.txt # Extract mask
cat ms.fa | tr acgt ACGT > superstring.txt # Extract superstring
bzip2 --best mask.txt
xz -T1 -9 superstring.txt
For a super efficient compression of the superstring (often <2 bits / bp), you use some of the specialized tools based on statistical compression such as GeCo3 or Jarvis3.
Example with FMSI:
kmercamel -p yourfile.fa -k 31 -c > ms.fa # Compute MS with the default mask
kmercamel optimize -p ms.fa -k 31 -c -o ms-opt.fa # Maximize the number of 1s in the mask
fmsi index -p ms-opt.fa # Create a k-mer index
Computing masked superstrings:
./kmercamel -p ./spneumoniae.fa -k 31 -c # From a fasta file
./kmercamel -p - -k 31 -c # Read from stdin
./kmercamel -p ./spneumoniae.fa.gz -k 31 -c # From a gzipped fasta file
./kmercamel -p ./spneumoniae.fa -k 127 -c # Largest supported k
./kmercamel -p ./spneumoniae.fa -k 31 -a local -d 5 -c # Use local greedy
./kmercamel -p ./spneumoniae.fa -k 31 -c -o out.fa # Redirect output to a file
./🐫 -p ./spneumoniae.fa -k 31 -c # An alternative if your OS supports it
Optimizing masks:
./kmercamel optimize -p ./masked-superstring.fa -k 31 -a runs -c # Minimize the number of runs of 1s
./kmercamel optimize -p ./masked-superstring.fa -k 31 -a ones -c # Maximize the number of 1s
./kmercamel optimize -p ./masked-superstring.fa -k 31 -a zeros -c # Maximize the number of 0s
./kmercamel optimize -p ./masked-superstring.fa -k 31 -a runapprox -c # Approximately minimize the number of runs of 1s
Compute lower bound on the minimum possible superstring length of a k-mer set:
./kmercamel -l -p ./spneumoniae.fa -k 31
Additionally, KmerCamel🐫 experimentally implements both algorithms in their Aho-Corasick automaton versions. To use them, add AC
to the algorithm name.
Note that they are slower than the original versions, but they can handle arbitrarily large ks.
The program has the following arguments:
-p path_to_fasta
- the path to fasta file (can begzip
ed). This is a required argument.-k value_of_k
- the size of one k-mer (up to 127). This is a required argument.-a algorithm
- the algorithm which should be run. Eitherglobal
orglobalAC
for Global Greedy,local
orlocalAC
for Local Greedy. The versions with AC use Aho-Corasick automaton. Defaultglobal
.-o output_path
- the path to output file. If not specified, output is printed to stdout.-d value_of_d
- d_max used in Local Greedy. Default 5. Increasingd
beyondk
has no effect.-c
- treat k-mer and its reverse complement as equal.-l
- compute lower bound on the superstring length instead of the superstring.-m
- turn off memory optimizations forglobal
.-h
- print help.-v
- print version.
For mask optimization, run the subcommand optimize
with the following arguments:
p path_to_fasta
- the path to fasta file (can begzip
ed). This is a required argument.k k_value
- the size of one k-mer. This is a required argument.a algorithm
- the algorithm for mask optimization. Eitherones
for maximizing the number of 1s,runs
for minimizing the number of runs of 1s,runsapprox
for approximately minimizing the number of runs of 1s, orzeros
for maximizing the number of 0s. Defaultones
.o output_path
- the path to output file. If not specified, output is printed to stdout.c
- treat k-mer and its reverse complement as equal.h
- print help.v
- print version.
We provide a Python script for converting any masked superstring to a (r)SPSS representation.
Run ./convert_superstring.py < input.fa
. This runs a Python script which inputs a fasta file with masked-cased superstring and outputs the (r)SPSS representation.
For details about the algorithms and their implementation, see the Code README.
To ensure correctness of the results, KmerCamel🐫 has two levels of tests - unit tests and file-specific integration tests.
For integration tests install jellyfish (v2) and add it to PATH.
You can verify all the algorithms for 1 < k < 128
on a S. pneumoniae by running make verify
.
To run it on another dataset, see the verification script.
You can run the C++ unittests by make cpptest
.
To run all the test, simply run make test
.
Please use Github issues.
See Releases.