forked from PacificBiosciences/DAZZ_DB
-
Notifications
You must be signed in to change notification settings - Fork 0
flowers9/DAZZ_DB
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
/************************************************************************************\ * * * Copyright (c) 2014, Dr. Eugene W. Myers (EWM). All rights reserved. * * * * Redistribution and use in source and binary forms, with or without modification, * * are permitted provided that the following conditions are met: * * * * · Redistributions of source code must retain the above copyright notice, this * * list of conditions and the following disclaimer. * * * * · Redistributions in binary form must reproduce the above copyright notice, this * * list of conditions and the following disclaimer in the documentation and/or * * other materials provided with the distribution. * * * * · The name of EWM may not be used to endorse or promote products derived from * * this software without specific prior written permission. * * * * THIS SOFTWARE IS PROVIDED BY EWM ”AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, * * INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND * * FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL EWM BE LIABLE * * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES * * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS * * OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY * * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING * * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN * * IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. * * * * For any issues regarding this software and its use, contact EWM at: * * * * Eugene W. Myers Jr. * * Bautzner Str. 122e * * 01099 Dresden * * GERMANY * * Email: [email protected] * * * \************************************************************************************/ /************************************************************************************\ UPGRADE & DEVELOPER NOTES ! ! ! If you have already built a big database and don't want to rebuild it, but do want to use a more recent version of the software that entails a change to the data structures (currently the updates on Sept 25, 2014 and December 31, 2014), please note the routines DBupgrade.Sep.25.2014 and DBupgrade.Dec.31.2014. These take a DB, say X, as an argument, and produce a file .X.ndx which you should then replace .X.idx with. To update a very old DB to today's version you will need to run both in sequence. Both of the upgrade programs can be made with "make" but are not by default created when make is called without an argument. For those interested in the details, on September 25, the "beg" and "end" fields went from shorts to ints, and on December 31, the "beg" and "end" fields became "fpulse" and "rlen", respectively where fpulse = beg and rlen = end-beg. Unfortunately, the .dust track formats also changed on Dec.31.2014 and Jan.1.2015. To upgrade said use DUSTupgrade.Jan.1.2015. This program takes a DB, say X as an argument and produces .X.next.anno and .X.next.data which you should then replace .X.dust.* with. Of course, it may, if the DB is not too big, be easier and simply to just rerun DBdust. Developers should also note carefully that the calling conventions to Open_DB have changed and there are new utility routines Number_Digits and Check_Track. \************************************************************************************/ The Dazzler Database Library Author: Gene Myers First: July 17, 2013 Current: December 31, 2014 To facilitate the multiple phases of the dazzler assembler, we organize all the read data into what is effectively a "database" of the reads and their meta-information. The design goals for this data base are as follows: (1) The database stores the source Pacbio read information in such a way that it can recreate the original input data, thus permitting a user to remove the (effectively redundant) source files. This avoids duplicating the same data, once in the source file and once in the database. (2) The data base can be built up incrementally, that is new sequence data can be added to the data base over time. (3) The data base flexibly allows one to store any meta-data desired for reads. This is accomplished with the concept of *tracks* that implementors can add as they need them. (4) The data is held in a compressed form equivalent to the .dexta and .dexqv files of the data extraction module. Both the .fasta and .quiva information for each read is held in the data base and can be recreated from it. The .quiva information can be added separately and later on if desired. (5) To facilitate job parallel, cluster operation of the phases of our assembler, the data base has a concept of a *current partitioning* in which all the reads that are over a given length and optionally unique to a well, are divided up into *blocks* containing roughly a given number of bases, except possibly the last block which may have a short count. Often programs con be run on blocks or pairs of blocks and each such job is reasonably well balanced as the blocks are all the same size. One must be careful about changing the partition during an assembly as doing so can void the structural validity of any interim block-based results. A Dazzler DB consists of one named, *visible* file, e.g. FOO.db, and several *invisible* secondary files encoding various elements of the DB. The secondary files are "invisible" to the UNIX OS in the sense that they begin with a "." and hence are not listed by "ls" unless one specifies the -a flag. We chose to do this so that when a user lists the contents of a directory they just see a single name, e.g. FOO.db, that is the one used to refer to the DB in commands. The files associated with a database named, say FOO, are as follows: (a) "FOO.db": a text file containing (i) the list of input files added to the database so far, and (ii) how to partition the database into blocks (if the partition parameters have been set). (b) ".FOO.idx": a binary "index" of all the meta-data about each read allowing, for example, one to randomly access a read's sequence (in the store ".FOO.bps"). It is 28N + 88 bytes in size where N is the number of reads in the database. (c) ".FOO.bps": a binary compressed "store" of all the DNA sequences. It is M/4 bytes in size where M is the total number of base pairs in the database. (d) ".FOO.qvs": a binary compressed "store" of the 5 Pacbio quality value streams for the reads. Its size is roughly 5/3M bytes depending on the compression acheived. This file only exists if .quiva files have been added to the database. (e) ".FOO.<track>.anno": a *track* containing customized meta-data for each read. For ".FOO.<track>.data" example, the DBdust command annotates low complexity intervals of reads and records the intervals for each read in two files .FOO.dust.anno & .FOO.dust.data. Any kind of information about a read can be recorded, such as micro-sats, repeat intervals, corrected sequence, etc. Specific tracks will be described as modules that produce them are released. If one does not like the convention of the secondary files being invisible, then un-defining the constant HIDE_FILES in DB.h before compiling the library, creates commands that do not place a prefixing "." before secondary file names, e.g. FOO.idx instead of .FOO.idx. One then sees all the files realizing a DB when listing the contents of a directory with ls. While a Dazzler DB holds a collection of Pacbio reads, a Dazzler map DB or DAM holds a collection of contigs from a reference genome assembly. This special type of DB has been introduced in order to facilitate the mapping of reads to an assembly and has been given the suffix .dam to distinguish it from an ordinary DB. It is structurally identical to a .db except: (a) there is no concept of quality values, and hence no .FOO.qvs file. (b) every .fasta scaffold (a sequence with runs of N's between contigs estimating the length of the gap) is broken into a separate contig sequence in the DB and the header for each scaffold is retained in a new .FOO.hdr file. (c) the original and first and last pulse fields in the meta-data records held in .FOO.idx, hold instead the contig number and the interval of the contig within its original scaffold sequence. A map DB can equally well be the argument of many of the commands below that operate on normal DBs. In general, a .dam can be an argument anywhere a .db can, with the exception of routines or optioned calls to routines that involve quality values, or the special routines fasta2DAM and DAM2fasta that create a DAM and reverse said, just like the pair fasta2DB and DB2fasta do for a normal DB. So in general when we refer to a database we are referring to either a DB or a DAM. The command DBsplit sets or resets the current partition for a database which is determined by 3 parameters: (i) the total number of basepairs to place in each block, (ii) the minimum read length of reads to include within a block, and (iii) whether or not to only include the longest read from a given well or all reads from a well (NB: several reads of the same insert in a given well can be produced by the Pacbio instrument). Note that the length and uniqueness parameters effectively select a subset of the reads that contribute to the size of a block. We call this subset the *trimmed* data base. Some commands operate on the entire database, others on the trimmed database, and yet others have an option flag that permits them to operate on either at the users discretion. Therefore, one should note carefully to which version of the database a command refers to. This is especially important for any command that identifies reads by their index (ordinal position) in the database. Once the database has been split into blocks, the commands DBshow, DBstats, and DBdust below and commands yet to come, such as the local alignment finder dalign, can take a block or blocks as arguments. On the command line this is indicated by supplying the name of the DB followed by a period and then a block number, e.g. FOO.3.db or simply FOO.3, refers to the 3'rd block of DB FOO (assuming of course it has a current partition and said partition has a 3rd block). One should note carefully that a block is a contiguous range of reads such that once it is trimmed has a given size in base pairs (as set by DBsplit). Thus like an entire database, a block can be either untrimmed or trimmed and one needs to again be careful when giving a read index to a command such as DBshow. All programs add suffixes (e.g. .db) as needed. The commands of the database library are currently as follows: 1. fasta2DB [-v] <path:db> ( -f<file> | <input:fasta> ... ) Builds an initial data base, or adds to an existing database, the list of .fasta files following the database name argument, or if the -f option is used, the list of .fasta files in <file>. A given .fasta file can only be added once to the DB (this is checked by the command). The .fasta headers must be in the "Pacbio" format (i.e. the output of the Pacbio tools or our dextract program) and the well, pulse interval, and read quality are extracted from the header and kept with each read record. If the files are being added to an existing database, and the partition settings of the DB have already been set (see DBsplit below), then the partitioning of the database is updated to include the new data. 2. DB2fasta [-vU] [-w<int(80)>] <path:db> The set of .fasta files for the given DB are recreated from the DB exactly as they were input. That is, this is a perfect inversion, including the reconstitution of the proper .fasta headers. Because of this property, one can, if desired, delete the .fasta source files once they are in the DB as they can always be recreated from it. By default the output sequences are in lower case and 80 chars per line. The -U option specifies upper case should be used, and the characters per line, or line width, can be set to any positive value with the -w option. 3. quiva2DB [-vl] <path:db> ( -f<file> | <input:quiva> ... ) Adds the given .quiva files on the command line or in the file specified by the -f option to an existing DB "path". The input files must be added in the same order as the .fasta files were and have the same root names, e.g. FOO.fasta and FOO.quiva. The files can be added incrementally but must be added in the same order as the .fasta files. This is enforced by the program. With the -l option set the compression scheme is a bit lossy to get more compression (see the description of dexqv in the DEXTRACTOR module). 4. DB2quiva [-vU] <path:db> The set of .quiva files within the given DB are recreated from the DB exactly as they were input. That is, this is a perfect inversion, including the reconstitution of the proper .quiva headers. Because of this property, one can, if desired, delete the .quiva source files once they are in the DB as they can always be recreated from it. By .fastq convention each QV vector is output as a line without new-lines, and by default the Deletion Tag entry is in lower case letters. The -U option specifies upper case letters should be used instead. 5. fasta2DAM [-v] <path:dam> ( -f<file> | <input:fasta> ... ) Builds a map DB or DAM from the list of .fasta files following the map database name argument, or if the -f option is used, the list of .fasta files in <file>. Any .fasta entry that has a run of N's in it will be split into separate "contig" entries and the interval of the contig in the original entry recorded. The header for each .fasta entry is saved with the contigs created from it. 6. DAM2fasta [-vU] [-w<int(80)>] <path:dam> The set of .fasta files for the given map DB or DAM are recreated from the DAM exactly as they were input. That is, this is a perfect inversion, including the reconstitution of the proper .fasta headers and the concatenation of contigs with the proper number of N's between them. By default the output sequences are in lower case and 80 chars per line. The -U option specifies upper case should be used, and the characters per line, or line width, can be set to any positive value with the -w option. 7. DBsplit [-a] [-x<int>] [-s<int(200)>] <path:db|dam> Divide the database <path>.db or <path>.dam conceptually into a series of blocks referable to on the command line as <path>.1, <path>.2, ... If the -x option is set then all reads less than the given length are ignored, and if the -a option is not set then secondary reads from a given well are also ignored. The remaining reads, constituting what we call the trimmed DB, are split amongst the blocks so that each block is of size -s * 1Mbp except for the last which necessarily contains a smaller residual. The default value for -s is 200Mbp because blocks of this size can be compared by our "overlapper" dalign in roughly 16Gb of memory. The blocks are very space efficient in that their sub-index of the master .idx is computed on the fly when loaded, and the .bps and .qvs files (if a .db) of base pairs and quality values, respectively, is shared with the master DB. Any relevant portions of tracks associated with the DB are also computed on the fly when loading a database block. 8. DBdust [-b] [-w<int(64)>] [-t<double(2.)>] [-m<int(10)>] <path:db|dam> Runs the symmetric DUST algorithm over the reads in the untrimmed DB <path>.db or <path>.dam producing a track .<path>.dust[.anno,.data] that marks all intervals of low complexity sequence, where the scan window is of size -w, the threshold for being a low-complexity interval is -t, and only perfect intervals of size greater than -m are recorded. If the -b option is set then the definition of low complexity takes into account the frequency of a given base. The command is incremental if given a DB to which new data has been added since it was last run on the DB, then it will extend the track to include the new reads. It is important to set this flag for genomes with a strong AT/GC bias, albeit the code is a tad slower. The dust track, if present, is understood and used by DBshow, DBstats, and dalign. DBdust can also be run over an untriimmed DB block in which case it outputs a track encoding where the trace file names contain the block number, e.g. .FOO.3.dust.anno and .FOO.3.dust.data, given FOO.3 on the command line. We call this a *block track*. This permits job parallelism in block-sized chunks, and the resulting sequence of block tracks can then be merged into a track for the entire untrimmed DB with Catrack. 9. Catrack [-v] <path:db|dam> <track:name> Find all block tracks of the form .<path>.#.<track>... and merge them into a single track, .<path>.<track>..., for the given DB or DAM. The block track files must all encode the same kind of track data (this is checked), and the files must exist for block 1, 2, 3, ... up to the last block number. 10. DBshow [-unqUQ] [-w<int(80)>] [-m<track>]+ <path:db|dam> [ <reads:FILE> | <reads:range> ... ] Displays the requested reads in the database <path>.db or <path>.dam. By default the command applies to the trimmed database, but if -u is set then the entire DB is used. If no read arguments are given then every read in the database or database block is displayed. Otherwise the input file or the list of supplied integer ranges give the ordinal positions in the actively loaded portion of the db. In the case of a file, it should simply contain a read index, one per line. In the other case, a read range is either a lone integer or the symbol $, in which case the read range consists of just that read (the last read in the database if $). One may also give two positive integers separated by a dash to indicate a range of integers, where again a $ represents the index of the last read in the actively loaded db. For example, 1 3-5 $ displays reads 1, 3, 4, 5, and the last read in the active db. As another example, 1-$ displays every read in the active db (the default). By default a .fasta file of the read sequences is displayed. If the -q option is set, then the QV streams are also displayed in a non-standard modification of the fasta format. If the -n option is set then the DNA sequence is *not* displayed. If the -Q option is set then a .quiva file is displayed and in this case the -n and -m options mayt not be set (and the -q and -w options have no effect). If one or more masks are set with the -m option then the track intervals are also displayed in an additional header line and the bases within an interval are displayed in the case opposite that used for all the other bases. By default the output sequences are in lower case and 80 chars per line. The -U option specifies upper case should be used, and the characters per line, or line width, can be set to any positive value with the -w option. The .fasta or .quiva files that are output can be converted into a DB by fasta2DB and quiva2DB (if the -q and -n options are not set and no -m options are set), giving one a simple way to make a DB of a subset of the reads for testing purposes. 11. DBstats [-nu] [-b<int(1000)] [-m<track>]+ <path:db|dam> Show overview statistics for all the reads in the trimmed data base <path>.db or <path>.dam, including a histogram of read lengths where the bucket size is set with the -b option (default 1000). If the -u option is given then the untrimmed database is summarized. If the -n option is given then the histogran of read lengths is not displayed. Any track such as a "dust" track that gives a seried of intervals along the read can be specified with the -m option in which case a summary and a histogram of the interval lengths is displayed. 12. DBrm <path:db|dam> ... Delete all the files for the given data bases. Do not use rm to remove a database, as there are at least two and often several secondary files for each DB including track files, and all of these are removed by DBrm. 13. simulator <genlen:double> [-c<double(20.)>] [-b<double(.5)] [-r<int>] [-m<int(10000)>] [-s<int(2000)>] [-x<int(4000)>] [-e<double(.15)>] [-M<file>] In addition to the DB commands we include here, somewhat tangentially, a simple simulator that generates synthetic reads for a random genome. simulator first generates a fake genome of size genlen*1Mb long, that has an AT-bias of -b. It then generates sample reads of mean length -m from a log-normal length distribution with standard deviation -s, but ignores reads of length less than -x. It collects enough reads to cover the genome -c times and introduces -e fraction errors into each read where the ratio of insertions, deletions, and substitutions are set by defined constants INS_RATE (default 73%) and DEL_RATE (default 20%) within generate.c. One can also control the rate at which reads are picked from the forward and reverse strands by setting the defined constant FLIP_RATE (default 50/50). The -r option seeds the random number generator for the generation of the genome so that one can reproducibly generate the same underlying genome to sample from. If this parameter is missing, then the job id of the invocation seeds the random number generator. The output is sent to the standard output (i.e. it is a UNIX pipe). The output is in Pacbio .fasta format suitable as input to fasta2DB. Finally, the -M option requests that the coordinates from which each read has been sampled are written to the indicated file, one line per read, ASCII encoded. This "map" file essentially tells one where every read belongs in an assembly and is very useful for debugging and testing purposes. If a read pair is say b,e then if b < e the read was sampled from [b,e] in the forward direction, and if b > e from [e,b] in the reverse direction. Example: A small complete example of most of the commands above. > simulator 1.0 -c20. >G.fasta // Generate a 20x data sets of a 1Mb genome > fasta2DB G G.fasta // Create a compressed data base of the reads, G.db > rm G.fasta // Redundant, recreate any time with "DB2fasta G" > DBsplit -s11 G // Split G into 2 parts of size ~ 11MB each > DBdust G.1 // Produce a "dust" track on each part > DBdust G.2 > Catrack G dust // Create one track for all of the DB > rm .G.*.dust.* // Clean up the sub-tracks > DBstats -mdust G // Take a look at the statistics for the database Statistics for all reads in the data set 1,836 reads out of 1,836 (100.0%) 20,007,090 base pairs out of 20,007,090 (100.0%) 10,897 average read length 2,192 standard deviation Base composition: 0.250(A) 0.250(C) 0.250(G) 0.250(T) Distribution of Read Lengths (Bin size = 1,000) Bin: Count % Reads % Bases Average 22,000: 1 0.1 0.1 22654 21,000: 0 0.1 0.1 22654 20,000: 1 0.1 0.2 21355 19,000: 0 0.1 0.2 21355 18,000: 4 0.3 0.6 19489 17,000: 8 0.8 1.3 18374 16,000: 19 1.8 2.8 17231 15,000: 43 4.1 6.2 16253 14,000: 81 8.6 12.0 15341 13,000: 146 16.5 21.9 14428 12,000: 200 27.4 34.4 13664 11,000: 315 44.6 52.4 12824 10,000: 357 64.0 71.2 12126 9,000: 306 80.7 85.8 11586 8,000: 211 92.2 94.8 11208 7,000: 95 97.3 98.4 11017 6,000: 43 99.7 99.8 10914 5,000: 6 100.0 100.0 10897 Statistics for dust-track There are 158 intervals totaling 1,820 bases (0.0% of all data) Distribution of dust intervals (Bin size = 1,000) Bin: Count % Intervals % Bases Average 0: 158 100.0 100.0 11 > ls -al total 66518744 drwxr-xr-x+ 177 myersg staff 6018 Mar 2 13:28 . drwxr-xr-x+ 20 myersg staff 680 Feb 26 19:52 .. -rw-r--r--+ 1 myersg staff 5002464 Mar 2 13:28 .G.bps -rw-r--r--+ 1 myersg staff 14704 Mar 2 13:28 .G.dust.anno -rw-r--r--+ 1 myersg staff 1264 Mar 2 13:28 .G.dust.data -rw-r--r--+ 1 myersg staff 73552 Mar 2 13:28 .G.idx -rw-r--r--+ 1 myersg staff 162 Mar 2 13:28 G.db > cat G.db files = 1 1836 G Sim blocks = 2 size = 11 cutoff = 0 all = 0 0 0 1011 1011 1836 1836
About
The Dazzler Data Base
Resources
Stars
Watchers
Forks
Packages 0
No packages published
Languages
- C 99.0%
- Makefile 1.0%