-
Notifications
You must be signed in to change notification settings - Fork 6
Sapling Manual
Sapling (Suffix Array Piecewise Linear INdex for Genomics) is a proof-of-concept project illustrating the power of learned data models for genomic read alignment. It build a light-weight data structure which augments a suffix array and accelerates queries by predicting where in the suffix array any given query string will occur. By using this prediction and pre-computed prediction error statistics across all k-mers in the genome, the standard binary search algorithm for suffix array queries can be performed over a much smaller range. This repository contains a number of components related to Sapling, including a substring search API, scripts for evaluating the runtime, accuracy, and other statistics, examples of usage, and even a bare-bones seed-and-extend aligner which uses Sapling.
The suffixarray subfolder contains the libdivsufsort library as a submodule, which can be used for pre-building the suffix array of a given genome. Since the input and output formats of libdivsufsort differ from those used by Sapling, the file suffixarray/refToSuffixArray.sh is a wrapper script which takes a fasta file and uses libdivsufsort plus a few auxiliary script to produce a suffix array file in the format used by Sapling (with a file name equal to the original fasta file with ".sa" appended to the end). It is recommended to use this for larger genomes to avoid the high runtime and memory overhead associated with Sapling's suffix array construction algorithm.
The file src/sapling_api.h contains an API for building and utilizing the data structure described above. Given a reference sequence, it filters the sequence, builds a suffix array, and then builds the Sapling data structure. The suffix array and Sapling data structures are both stored in files so that they can be optionally re-loaded when rerunning Sapling on the same genome. These data structures are stored in the following format:
- Suffix array file: The first 8 bytes (assuming a 64-bit architecture) make up an unsigned integer representing the length of the filtered genome in basepairs. Then, the next 8n bytes represent n unsigned integers giving the suffix array. Finally, the last 8n bytes represent n unsigned integers giving the LCP array corresponding to the suffix array.
- Sapling file: The first 4 bytes are a signed integer giving log (base 2) of the number of intervals the k-mer space is divided into. Then, the next 4 or 8 bytes (4 if the number of buckets is <= 30, and 8 otherwise) make up an integer n giving the number of points used in the piecewise linear p. The following 8n bytes are the x values of these points, and after that are 8n bytes giving the y-values. Finally, there are 5 signed 32-bit integers giving error statistics: the maximum over-prediction, the maximum under-prediction, the mean error, an error bound which captures most of the over-predictions, and an error bound which captures most of the under-predictions.
The file src/align.cpp, as well as the Makefile for compilation, contains a simple seed-and-extend aligner to demonstrate the usage of the Sapling data structure in genomic read alignment. This aligner takes as input a FASTA file containing the genome and a FASTQ file containing the reads to align, providing the alignments of all reads in SAM format. It uses a suffix array augmented with a piecewise linear data model to find the interval of matches in the suffix array for a number of evenly-spaced seeds throughout the read being aligned. Then, for a small fixed-size subset of these matches (to avoid high runtimes in the presence of seeds which occur very frequently), a Striped-Smith-Waterman dynamic programming algorithm is used to attempt to extend each of these seeds. The extension with the best match is retained, and if no high-quality matches are found, the read is marked as unaligned.
The file src/sapling_example.cpp contains an example of how to use Sapling for the substring search problem. It provides an intuitive command-line menu for adjusting the parameters of the data structure, validates the correctness of the suffix array queries, and provides runtime metrics for the substring search procedure in isolation. This file was used for many of our tests to obtain error statistics and runtimes across many genomes.
To compare out software to existing techniques, we implemented the binary algorithm proposed by Myers et. al. for searching through suffix arrays in O(n + log(m)), which is faster than the standard O(n*log(m)) algorithm. Similar to our Sapling example script, this script validates the correctness and reports runtime for the substring search procedure on its own.
The eval/ folder contains a number of scripts we used to evaluate Sapling, and figures which were generated by those scripts.