diff --git a/docs/.doctrees/algos/compseq.doctree b/docs/.doctrees/algos/compseq.doctree new file mode 100644 index 00000000..18c69f17 Binary files /dev/null and b/docs/.doctrees/algos/compseq.doctree differ diff --git a/docs/.doctrees/algos/index.doctree b/docs/.doctrees/algos/index.doctree index 2ab2a891..dff68b6b 100644 Binary files a/docs/.doctrees/algos/index.doctree and b/docs/.doctrees/algos/index.doctree differ diff --git a/docs/.doctrees/api/seismicrna.clust.doctree b/docs/.doctrees/api/seismicrna.clust.doctree index a9302d23..13a86170 100644 Binary files a/docs/.doctrees/api/seismicrna.clust.doctree and b/docs/.doctrees/api/seismicrna.clust.doctree differ diff --git a/docs/.doctrees/api/seismicrna.core.batch.doctree b/docs/.doctrees/api/seismicrna.core.batch.doctree index d251699f..43383446 100644 Binary files a/docs/.doctrees/api/seismicrna.core.batch.doctree and b/docs/.doctrees/api/seismicrna.core.batch.doctree differ diff --git a/docs/.doctrees/api/seismicrna.core.doctree b/docs/.doctrees/api/seismicrna.core.doctree index fbb016f6..279037e4 100644 Binary files a/docs/.doctrees/api/seismicrna.core.doctree and b/docs/.doctrees/api/seismicrna.core.doctree differ diff --git a/docs/.doctrees/api/seismicrna.core.seq.doctree b/docs/.doctrees/api/seismicrna.core.seq.doctree index a10fb2a1..b0f8f439 100644 Binary files a/docs/.doctrees/api/seismicrna.core.seq.doctree and b/docs/.doctrees/api/seismicrna.core.seq.doctree differ diff --git a/docs/.doctrees/api/seismicrna.core.seq.tests.doctree b/docs/.doctrees/api/seismicrna.core.seq.tests.doctree index 5f17e88e..bba06ef6 100644 Binary files a/docs/.doctrees/api/seismicrna.core.seq.tests.doctree and b/docs/.doctrees/api/seismicrna.core.seq.tests.doctree differ diff --git a/docs/.doctrees/api/seismicrna.table.doctree b/docs/.doctrees/api/seismicrna.table.doctree index 5dd0dd24..9b4f5e2a 100644 Binary files a/docs/.doctrees/api/seismicrna.table.doctree and b/docs/.doctrees/api/seismicrna.table.doctree differ diff --git a/docs/.doctrees/api/seismicrna.workflow.doctree b/docs/.doctrees/api/seismicrna.workflow.doctree index 26691a61..b33709a9 100644 Binary files a/docs/.doctrees/api/seismicrna.workflow.doctree and b/docs/.doctrees/api/seismicrna.workflow.doctree differ diff --git a/docs/.doctrees/cli.doctree b/docs/.doctrees/cli.doctree index 978569b6..d43bbc8e 100644 Binary files a/docs/.doctrees/cli.doctree and b/docs/.doctrees/cli.doctree differ diff --git a/docs/.doctrees/data/index.doctree b/docs/.doctrees/data/index.doctree index acf27098..035927ca 100644 Binary files a/docs/.doctrees/data/index.doctree and b/docs/.doctrees/data/index.doctree differ diff --git a/docs/.doctrees/data/relate.doctree b/docs/.doctrees/data/relate.doctree deleted file mode 100644 index d850ced3..00000000 Binary files a/docs/.doctrees/data/relate.doctree and /dev/null differ diff --git a/docs/.doctrees/data/relate/codes.doctree b/docs/.doctrees/data/relate/codes.doctree new file mode 100644 index 00000000..c0350117 Binary files /dev/null and b/docs/.doctrees/data/relate/codes.doctree differ diff --git a/docs/.doctrees/data/relate/index.doctree b/docs/.doctrees/data/relate/index.doctree new file mode 100644 index 00000000..65c56fc6 Binary files /dev/null and b/docs/.doctrees/data/relate/index.doctree differ diff --git a/docs/.doctrees/data/relate/qnames.doctree b/docs/.doctrees/data/relate/qnames.doctree new file mode 100644 index 00000000..9b2b653d Binary files /dev/null and b/docs/.doctrees/data/relate/qnames.doctree differ diff --git a/docs/.doctrees/data/relate/refseq.doctree b/docs/.doctrees/data/relate/refseq.doctree new file mode 100644 index 00000000..2b299867 Binary files /dev/null and b/docs/.doctrees/data/relate/refseq.doctree differ diff --git a/docs/.doctrees/data/relate/relate.doctree b/docs/.doctrees/data/relate/relate.doctree new file mode 100644 index 00000000..5a0e6f3a Binary files /dev/null and b/docs/.doctrees/data/relate/relate.doctree differ diff --git a/docs/.doctrees/environment.pickle b/docs/.doctrees/environment.pickle index 8ed028ff..a4103d74 100644 Binary files a/docs/.doctrees/environment.pickle and b/docs/.doctrees/environment.pickle differ diff --git a/docs/.doctrees/formats/data/fastq.doctree b/docs/.doctrees/formats/data/fastq.doctree index 9fdeab70..6b7acd91 100644 Binary files a/docs/.doctrees/formats/data/fastq.doctree and b/docs/.doctrees/formats/data/fastq.doctree differ diff --git a/docs/.doctrees/manuals/index.doctree b/docs/.doctrees/manuals/index.doctree index 1e0f8634..52f0fbca 100644 Binary files a/docs/.doctrees/manuals/index.doctree and b/docs/.doctrees/manuals/index.doctree differ diff --git a/docs/.doctrees/manuals/parallel.doctree b/docs/.doctrees/manuals/parallel.doctree new file mode 100644 index 00000000..a1866d27 Binary files /dev/null and b/docs/.doctrees/manuals/parallel.doctree differ diff --git a/docs/.doctrees/manuals/workflow.doctree b/docs/.doctrees/manuals/workflow.doctree index b2036cae..88b1846f 100644 Binary files a/docs/.doctrees/manuals/workflow.doctree and b/docs/.doctrees/manuals/workflow.doctree differ diff --git a/docs/.doctrees/writeme.doctree b/docs/.doctrees/writeme.doctree index ea77a4b6..5ec41993 100644 Binary files a/docs/.doctrees/writeme.doctree and b/docs/.doctrees/writeme.doctree differ diff --git a/docs/_sources/algos/compseq.rst.txt b/docs/_sources/algos/compseq.rst.txt new file mode 100644 index 00000000..09a3d904 --- /dev/null +++ b/docs/_sources/algos/compseq.rst.txt @@ -0,0 +1,122 @@ + +Algorithm for Sequence Compression/Decompression +======================================================================== + +Background on Sequence Compression/Decompression +------------------------------------------------------------------------ + +In a nucleic acid sequence, each base is represented by one character in +`ASCII code`_ and hence occupies 1 byte (8 bits). +But because each base has only 4 possibilities, each base requires only +log\ :sub:`2`\ (4) = 2 bits, shown in the following table: + +==== ================= +Base 2-Bit Binary Code +==== ================= +A ``00`` +C ``01`` +G ``10`` +T/U ``11`` +==== ================= + +For efficient storage of (long) nucleic acid sequences, the following +algorithm compresses 4 bases (= 8 bits/byte ÷ 2 bits/nt) in each byte, +and then restores the original sequence when needed. + +Algorithm for Sequence Compression +------------------------------------------------------------------------ + +Algorithm for Sequence Compression: Procedure +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +First, attributes of the nucleic acid sequence are recorded: + +- rna (``bool``): whether the sequence is RNA +- length (``int``): number of bases in the sequence +- ns (``tuple[int, ...]``): positions of ``N`` bases, if any + +The type and the length are scalar values with constant sizes regardless +of the length of the sequence. +The 0-indexed positions of ambiguous bases is an array that at worst (if +all bases are ``N``) scales linearly with the length of the sequence and +at best (if no bases are ``N``) is a constant (small) size. +Because most reference sequences contain no or very few bases that are +``N``, recording the ``N`` positions generally requires little space. + +Then, each non-overlapping segment of four bases is encoded as one byte +by concatenating the 2-bit codes (above) of the bases in the segment, in +reverse order. +Ambiguous (``N``) bases are arbitrarily encoded as ``00``. +Because this step requires the length of the sequence to be a multiple +of 4, the sequence is padded on its 3' side with ``A`` (an arbitrary +choice) until its length is a multiple of 4. + +Algorithm for Sequence Compression: Example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Given the DNA sequence ``CAGNTTCGAN``, the attributes are extracted: + +- rna: ``False`` +- length: ``10`` +- ns: ``(3, 9)`` (note: these positions are 0-indexed) + +The sequence is then padded with ``A`` at the 3' end until its length +becomes a multiple of 4 (in this case, length 12): + +``CAGNTTCGANAA`` + +Then, each 4-base segment is transformed into one byte by encoding each +base and concatenating the codes in order from code 4 to code 1: + +====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ======== +Number Sequence Base 1 Base 2 Base 3 Base 4 Code 4 Code 3 Code 2 Code 1 Byte +====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ======== +1 CAGN C A G N 00 10 00 01 00100001 +2 TTCG T T C G 10 01 11 11 10011111 +3 ANAA A N A A 00 00 00 00 00000000 +====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ======== + +Thus, the compressed byte string is ``[00100001, 10011111, 00000000]``. +Note that this string is only 3 bytes, compared to 10 for the original. +As a ``bytes`` object, the representation is ``b'!\x9f\x00'``. + +Algorithm for Sequence Decompression +------------------------------------------------------------------------ + +Algorithm for Sequence Decompression: Procedure +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Beginning with a compressed sequence of bytes, each byte is transformed +into four bases by decoding each 2-bit chunk to a base using the above +table, then reversing the order of the four bases. +The code ``11`` is decoded to ``U`` if the `rna` attribute is ``True`` +and to ``T`` if ``False``. +The sequence at this point must have a number of bases divisible by 4. +It is cut to the correct number of bases using the `length` attribute. +Finally, every position in the attribute `ns` is masked to ``N``. + +Algorithm for Sequence Decompression: Example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Suppose that a compressed sequence has the following attributes: + +- compressed byte string: ``[00100001, 10011111, 00000000]`` +- rna: ``False`` +- length: ``10`` +- ns: ``(3, 9)`` + +To decompress the sequence, each byte is split into four 2-bit segments, +decoded, reversed, and reassembled: + +====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ======== +Number Byte Code 1 Code 2 Code 3 Code 4 Base 4 Base 3 Base 2 Base 1 Sequence +====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ======== +1 00100001 00 10 00 01 C A G A CAGA +2 10011111 10 01 11 11 T T C G TTCG +3 00000000 00 00 00 00 A A A A AAAA +====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ======== + +The resulting sequence, ``CAGATTCGAAAA``, is trimmed to 10 nt: ``CAGATTCGAA``. +Finally, (0-indexed) positions 3 and 9 are replaced with ``N``: ``CAGNTTCGAN``. + +.. _`ASCII code`: https://en.wikipedia.org/wiki/ASCII diff --git a/docs/_sources/algos/index.rst.txt b/docs/_sources/algos/index.rst.txt index 59e21ac4..fb935511 100644 --- a/docs/_sources/algos/index.rst.txt +++ b/docs/_sources/algos/index.rst.txt @@ -1,3 +1,8 @@ ************************************************************************ Algorithms ************************************************************************ + +.. toctree:: + :maxdepth: 2 + + compseq diff --git a/docs/_sources/data/index.rst.txt b/docs/_sources/data/index.rst.txt index de902f1f..3f5855ae 100644 --- a/docs/_sources/data/index.rst.txt +++ b/docs/_sources/data/index.rst.txt @@ -6,4 +6,4 @@ Data Structures .. toctree:: :maxdepth: 2 - relate + relate/index diff --git a/docs/_sources/data/relate.rst.txt b/docs/_sources/data/relate/codes.rst.txt similarity index 98% rename from docs/_sources/data/relate.rst.txt rename to docs/_sources/data/relate/codes.rst.txt index 04ab01c5..aa9d24ba 100644 --- a/docs/_sources/data/relate.rst.txt +++ b/docs/_sources/data/relate/codes.rst.txt @@ -61,7 +61,9 @@ for some bases to define the relationship: - low-quality base calls - ambiguous insertions and deletions -Low-quality base calls +.. _relate_low_qual: + +Encoding low-quality base calls """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" A low-quality base call is defined as having a `Phred quality score`_ @@ -115,7 +117,7 @@ four columns "Read: A/C/G/T?". than once towards the total number of matches or mutations. To learn how mutations in relation vectors are counted, see [REF]. -Ambiguous insertions and deletions +Encoding ambiguous insertions and deletions """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" Insertions and deletions (collectively, "indels") in the read can cause @@ -183,7 +185,7 @@ is covered in one mate but not in the other, one mate's Phred score is above the threshold and the other's is below, or (more rarely) the base calls themselves differ. -Consensus relationships +Encoding consensus relationships """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" When finding the consensus of two mates, information in one mate should @@ -219,7 +221,7 @@ position 4 is a match, but it is low quality at position 5, so even the consensus byte is ambiguous. Neither mate covers position 6, so the consensus byte is blank. -Irreconcilable relationships +Encoding irreconcilable relationships """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" It is possible, although rare, for mates 1 and 2 to share no bits. For diff --git a/docs/_sources/data/relate/index.rst.txt b/docs/_sources/data/relate/index.rst.txt new file mode 100644 index 00000000..04b366c6 --- /dev/null +++ b/docs/_sources/data/relate/index.rst.txt @@ -0,0 +1,11 @@ + +Relate Data Structures +======================================================================== + +.. toctree:: + :maxdepth: 2 + + codes + relate + qnames + refseq diff --git a/docs/_sources/data/relate/qnames.rst.txt b/docs/_sources/data/relate/qnames.rst.txt new file mode 100644 index 00000000..b82c9052 --- /dev/null +++ b/docs/_sources/data/relate/qnames.rst.txt @@ -0,0 +1,39 @@ + +Read Names Batch +------------------------------------------------------------------------ + +Each batch of relation vectors is a ``QnamesBatchIO`` object. + +Read names batch: Structure +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The read names are in the attribute ``names``, a ``numpy.ndarray`` with +data type ``str``. +The name of the read with number ``i`` (see :ref:`relate_read_nums` for +more information) is at index ``i`` of the ``names`` array. + +Read names batch: Example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Suppose a XAM file contains five reads: + +=========== =========================================== +Read Number Read Name +=========== =========================================== +0 ``VL00355:66:AACYHM7M5:1:2106:54095:30496`` +1 ``VL00355:66:AACYHM7M5:1:2303:34554:55259`` +2 ``VL00355:66:AACYHM7M5:1:2411:30558:38352`` +3 ``VL00355:66:AACYHM7M5:1:1206:41825:52949`` +4 ``VL00355:66:AACYHM7M5:1:2506:10903:12605`` +=========== =========================================== + +The relate step would write a batch with the ``names`` attribute :: + + ["VL00355:66:AACYHM7M5:1:2106:54095:30496", + "VL00355:66:AACYHM7M5:1:2303:34554:55259", + "VL00355:66:AACYHM7M5:1:2411:30558:38352", + "VL00355:66:AACYHM7M5:1:1206:41825:52949", + "VL00355:66:AACYHM7M5:1:2506:10903:12605"] + +Note that the attribute is shown as a ``list`` for visual simplicity, +but would really be a ``numpy.ndarray``. diff --git a/docs/_sources/data/relate/refseq.rst.txt b/docs/_sources/data/relate/refseq.rst.txt new file mode 100644 index 00000000..59165633 --- /dev/null +++ b/docs/_sources/data/relate/refseq.rst.txt @@ -0,0 +1,31 @@ + +Reference Sequence +------------------------------------------------------------------------ + +The reference sequence is saved as a ``RefseqIO`` object. + +Reference sequence: Structure +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +A ``RefseqIO`` object stores a reference sequence as a space-efficient +``CompressedSeq`` object in the private attribute ``_s``. +(The original sequence is available via the cached property ``refseq``.) +A ``CompressedSeq`` object has the following attributes: + +- ``b`` (``bytes``): sequence of bytes, each byte encoding 4 nucleotides +- ``s`` (``int``): length of the sequence (number of nucleotides) +- ``n`` (``tuple[int, ...]``): 0-indexed position of each ``N`` +- ``r`` (``bool``): ``True`` if the sequence is RNA, ``False`` if DNA + +Reference sequence: Example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Suppose that a DNA sequence is ``CAGNTTCGAN``. +This sequence would be compressed into :: + + b = b'!\x9f\x00' + s = 10 + n = (3, 9) + r = False + +For details on the algorithm, see :doc:`../../algos/compseq`. diff --git a/docs/_sources/data/relate/relate.rst.txt b/docs/_sources/data/relate/relate.rst.txt new file mode 100644 index 00000000..89cab714 --- /dev/null +++ b/docs/_sources/data/relate/relate.rst.txt @@ -0,0 +1,125 @@ + +Relate Batch +------------------------------------------------------------------------ + +Each batch of relation vectors is a ``RelateBatchIO`` object. + +Relate batch: Structure +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The following attributes encode the relationships between each read and +each position in the reference sequence: + +========= ============================================ ==================================================================== +Attribute Data Type Description +========= ============================================ ==================================================================== +``end5s`` ``numpy.ndarray[int]`` array of the first position of the most upstream mate in each read +``mid5s`` ``numpy.ndarray[int]`` array of the first position of the most downstream mate in each read +``mid3s`` ``numpy.ndarray[int]`` array of the last position of the most upstream mate in each read +``end3s`` ``numpy.ndarray[int]`` array of the last position of the most downstream mate in each read +``muts`` ``dict[int, dict[int, numpy.ndarray[int]]]`` array of the reads with each type of mutation at each position +========= ============================================ ==================================================================== + +.. note:: + The positions of the first and last bases in the reference sequence + are defined to be 1 and the length of the sequence, respectively. + +.. _relate_read_nums: + +Relate batch: Structure of read numbers +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +Each read, or pair of paired-end reads, is labeled with a non-negative +integer: 0 for the first read in each batch, and incrementing by 1 for +each subsequent read. +Within one batch, all read numbers are unique. +However, two different batches can have reads that share numbers. + +Relate batch: Structure of 5' and 3' end positions +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +- ``end5s``, ``mid5s``, ``mid3s``, and ``end3s`` are all 1-dimensional + ``numpy.ndarray`` objects. +- For any relate batch, ``end5s``, ``mid5s``, ``mid3s``, and ``end3s`` + all have the same length (which may be any integer ≥ 0). +- A read with index ``i`` corresponds to the ``i``\ th values of + ``end5s``, ``mid5s``, ``mid3s``, and ``end3s``; denoted (respectively) + ``end5s[i]``, ``mid5s[i]``, ``mid3s[i]``, and ``end3s[i]``. +- For every read ``i``: + + - 1 ≤ ``end5s[i]`` ≤ ``end3s[i]`` ≤ length of reference sequence + - If paired-end and there is a gap of ≥ 1 nt between mates 1 and 2: + + - ``end5s[i]`` ≤ ``mid3s[i]`` < ``mid5s[i]`` ≤ ``end3s[i]`` + + - Otherwise: + + - ``end5s[i]`` = ``mid5s[i]`` ≤ ``mid3s[i]`` = ``end3s[i]`` + +Relate batch: Structure of mutations +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +``muts`` is a ``dict`` wherein + +- each key is a position in the reference sequence (``int``) +- each value is a ``dict`` wherein + + - each key is a type of mutation (``int``, see :doc:`./codes` for more + information) + - each value is an array of the numbers of the reads that have the + given type of mutation at the given position (``numpy.ndarray``) + +Relate batch: Example +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +For example, suppose that the reference sequence is ``TCAGAACC`` and a +batch contains five paired-end reads, numbered 0 to 4: + +==== ==== ============ +Read Mate Alignment +==== ==== ============ +0 1 ``_CAG____`` +0 2 ``_____AGC`` +1 1 ``___GTA__`` +1 2 ``TCT_____`` +2 1 ``____AAC_`` +2 2 ``_CA_____`` +3 1 ``TAAGT___`` +3 2 ``______CC`` +4 1 ``__AGA___`` +4 2 ``___GA-C_`` +Ref ``TCAGAACC`` +==== ==== ============ + +The positions, reads, and relationships can be shown explicitly as a +matrix (see :doc:`./codes` for information on the relationship codes): + +==== === === === === === === === === +Read 1 2 3 4 5 6 7 8 +==== === === === === === === === === +0 255 1 1 1 255 1 64 1 +1 1 1 128 1 128 1 255 255 +2 255 1 1 255 1 1 1 255 +3 1 16 1 1 128 255 1 1 +4 255 255 1 1 3 3 1 255 +==== === === === === === === === === + +In a relate batch, they would be encoded as follows: + +- ``end5s``: ``[2, 1, 2, 1, 3]`` +- ``mid5s``: ``[4, 1, 3, 5, 3]`` +- ``mid3s``: ``[6, 6, 5, 7, 7]`` +- ``end3s``: ``[8, 6, 7, 8, 7]`` +- ``muts``:: + + {1: {}, + 2: {16: [3]}, + 3: {128: [1]}, + 4: {}, + 5: {3: [4], 128: [1, 3]}, + 6: {3: [4]}, + 7: {64: [0]}, + 8: {}} + + Note that the numbers are shown here for visual simplicity as ``list`` + objects, but would really be ``numpy.ndarray`` objects. diff --git a/docs/_sources/formats/data/fastq.rst.txt b/docs/_sources/formats/data/fastq.rst.txt index b91570d4..d81c6b4a 100644 --- a/docs/_sources/formats/data/fastq.rst.txt +++ b/docs/_sources/formats/data/fastq.rst.txt @@ -33,10 +33,10 @@ file was called correctly during sequencing. The probability *p* that a base was called incorrectly is 10 raised to the power of the quotient of the Phred score *s* and -10: -*p* = 10 :sup:`-s/10` +*p* = 10\ :sup:`-s/10` For example, if a base call has a Phred score of 30, the probability -that the base call is incorrect is 10 :sup:`-30/10` = 0.001. +that the base call is incorrect is 10\ :sup:`-30/10` = 0.001. In FASTQ files, each phred quality score (a non-negative integer) is encoded as one character of text by adding another integer *N* to the diff --git a/docs/_sources/manuals/index.rst.txt b/docs/_sources/manuals/index.rst.txt index 6103830c..f684acf2 100644 --- a/docs/_sources/manuals/index.rst.txt +++ b/docs/_sources/manuals/index.rst.txt @@ -18,4 +18,5 @@ For a complete list of commands and their parameters, see :doc:`../cli`. workflow faclean inputs + parallel logging diff --git a/docs/_sources/manuals/parallel.rst.txt b/docs/_sources/manuals/parallel.rst.txt new file mode 100644 index 00000000..3e398788 --- /dev/null +++ b/docs/_sources/manuals/parallel.rst.txt @@ -0,0 +1,18 @@ + +Parallelizing Tasks +======================================================================== + + +.. _batches: + +Batches and Benefits +------------------------------------------------------------------------ + +SEISMIC-RNA can divide up data processing into multiple batches. +The benefits of batching depend on whether the batches are processed in +parallel or in series: + +- In parallel, multiple batches can be processed simultaneously, which + speeds up processing the entire dataset. +- In series, only the data for one batch must be stored in memory at one + time, which reduces the memory requirement. diff --git a/docs/_sources/manuals/workflow.rst.txt b/docs/_sources/manuals/workflow.rst.txt index 144f6275..008e84e8 100644 --- a/docs/_sources/manuals/workflow.rst.txt +++ b/docs/_sources/manuals/workflow.rst.txt @@ -10,8 +10,8 @@ There are two points from which you can begin the main workflow: or CRAM format), then begin at the step :ref:`wf_relate`. .. note:: - You can pass either (or both) types of input files into the command - ``seismic all``, which runs the entire workflow. + The command ``seismic all`` accepts both types of inputs and runs + the entire workflow. See :ref:`wf_all` for more details. @@ -175,10 +175,10 @@ There are three ways to align multiple FASTQ files (or pairs thereof): Thus, the given directory can have deeply nested subdirectories, and SEISMIC-RNA will still find and process any FASTQ files within them. -Align: Options +Align: Settings ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Align option: Phred score encoding +Align setting: Phred score encoding """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" SEISMIC-RNA defaults to using Phred+33 encoding for FASTQ files, which @@ -204,7 +204,7 @@ in the "Basic Statisics" section: - Otherwise, you will need to search elsewhere for your encoding scheme to determine the Phred score offset. -Align option: Quality assessment with FastQC +Align setting: Quality assessment with FastQC """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" By default, each FASTQ file is processed with `FastQC`_, both before and @@ -213,7 +213,7 @@ FastQC can be disabled with the flag ``--no-fastqc``. To enable automatic extraction of the zipped output files from FastQC, add the flag ``--qc-extract``. -Align option: Trimming reads with Cutadapt +Align setting: Trimming reads with Cutadapt """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" By default, each FASTQ file and pair of mated FASTQ files is trimmed for @@ -246,7 +246,7 @@ How to trim low-quality base calls Base calls on either end of a read that fall below a minimum Phred score quality are trimmed with Cutadapt. The default minimum quality is 25, which corresponds to a probability of -1 - 10 :sup:`-2.5` = 0.997 that the base call is correct. +1 - 10\ :sup:`-2.5` = 0.997 that the base call is correct. (See :ref:`phred_encodings` for more details). To change the quality threshold, use the option ``--min-phred``. @@ -276,7 +276,7 @@ your FASTQ files outside of SEISMIC-RNA, then perform alignment within SEISMIC-RNA, using the option ``--no-cut`` to disable additional adapter trimming. -Align option: Mapping reads with Bowtie 2 +Align setting: Mapping reads with Bowtie 2 """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" How to pre-build a Bowtie 2 index (optional) @@ -385,12 +385,26 @@ on setting this parameter. the read aligns with a high score to exactly one location, low quality if it aligns with similar scores to multiple locations in the reference. The default minimum quality is 25, which corresponds to a confidence of -1 - 10 :sup:`-2.5` = 0.997 that the read has aligned correctly. +1 - 10\ :sup:`-2.5` = 0.997 that the read has aligned correctly. To change the quality threshold, use the option ``--min-mapq``. For those searching for this option in Bowtie 2, you will not find it. Instead, reads with insufficient mapping quality are filtered out after alignment using the `view command in Samtools`_. +How to filter by number of aligned reads +'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' + +In general, alignment maps containing very few reads are not useful for +mutational profiling, due to their inherently low coverage per position. +Worse, if aligning to a very large number of references (e.g. an entire +transcriptome), most of the references would likely receive insufficient +reads, so most of the (many) output XAM files would be useless clutter. + +To remedy this inconvenience, after alignment has finished, XAM files +with fewer than a minimum number of reads are automatically deleted. +The default is 1000, which can be set using the option ``--min-reads``. +Setting ``--min-reads`` to 0 disables automatically deleting XAM files. + How to further customize alignment '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' @@ -402,10 +416,9 @@ If you require a more customized alignment workflow, you can align your your FASTQ files outside of SEISMIC-RNA, then pass the resulting XAM files into SEISMIC-RNA at the step :ref:`wf_relate`. - .. _bam_vs_cram: -Align option: Format of alignment maps +Align setting: Format of alignment maps """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" SEISMIC-RNA can output alignment map files in either BAM or CRAM format. @@ -450,6 +463,10 @@ that accompanies them). Align: output files ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ +All output files, except FastQC reports, are written into the directory +``{out}/{sample}/align``, where ``{out}`` is the output directory and +``{sample}`` is the name of the sample. + Align output file: FastQC reports """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" @@ -478,11 +495,9 @@ each read aligned, as well as the Phred quality scores, mapping quality, and mutated positions. SEISMIC-RNA outputs alignment maps where every read aligns to the same reference (although this is not a restriction outside of SEISMIC-RNA). -Each alignment map is written to ``{out}/{sample}/align/{ref}.{xam}``, -where ``{out}`` is the output directory (``--out-dir``), ``{sample}`` is -the name of the sample from which the reads came, ``{ref}`` is the name -of the reference to which the reads aligned, and ``{xam}`` is the file -extension (depending on the selected format). +Each alignment map is written to ``{ref}.{xam}``, where ``{ref}`` is the +name of the reference to which the reads aligned, and ``{xam}`` is the +file extension (depending on the selected format). SEISMIC-RNA can output alignment maps in either BAM or CRAM format. For a comparison of these formats, see :ref:`bam_vs_cram`. @@ -509,7 +524,7 @@ Align output file: Unaligned reads """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" In addition to the alignment maps, SEISMIC-RNA outputs FASTQ file(s) of -reads that Bowtie 2 could not align to ``{out}/{sample}/align``: +reads that Bowtie 2 could not align: - Each whole-sample FASTQ file of single-end (``-z``) or interleaved (``-y``) reads yields one file: ``unaligned.fq.gz`` @@ -526,26 +541,12 @@ where ``{ref}`` is the reference for demultiplexed FASTQ files. Outputting these files of unaligned reads can be disabled using the option ``--bt2-no-un``. -Align output file: Align report +Align output file: Report """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" -In addition to the alignment maps, SEISMIC-RNA outputs FASTQ file(s) of -reads that Bowtie 2 could not align to ``{out}/{sample}/align``: - -- Each whole-sample FASTQ file of single-end (``-z``) or interleaved - (``-y``) reads yields one file: ``unaligned.fq.gz`` -- Each pair of whole-sample FASTQ files of 1st and 2nd mates (``-x``) - yields two files: ``unaligned.fq.1.gz`` and ``unaligned.fq.2.gz`` -- Each demultiplexed FASTQ file of single-end (``-Z``) or interleaved - (``-Y``) reads yields one file: ``{ref}__unaligned.fq.gz`` -- Each pair of demultiplexed FASTQ files of 1st and 2nd mates (``-X``) - yields two files: - ``{ref}__unaligned.fq.1.gz`` and ``{ref}__unaligned.fq.2.gz`` - -where ``{ref}`` is the reference for demultiplexed FASTQ files. - -Outputting these files of unaligned reads can be disabled using the -option ``--bt2-no-un``. +A report file, ``align-report.json``, is also written that records the +settings used to run alignment and summarizes the results of alignment. +See :doc:`../formats/report/align` for more information. Align: Troubleshooting ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -587,14 +588,6 @@ DNA sequences (only A, C, G, T, and N) in FASTA format. If needed, you may clean the FASTA file with the :doc:`./faclean` tool. See :doc:`../formats/data/fasta` for details. -.. note:: - The references in the FASTA file must match those to which the reads - were aligned to produce the XAM file(s). - A XAM file whose reference is absent from the FASTA will be skipped. - A XAM file whose reference sequence in the FASTA differs from the - sequence to which the XAM file was actually aligned can can yield - erroneous relation vectors or fail to yield any output. - Relate input file: Alignment maps """""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" @@ -602,13 +595,14 @@ Relate requires one or more alignment map files, each of which must be in SAM, BAM, or CRAM format (collectively, "XAM" format). See :doc:`../formats/data/xam` for details. -Alignment map file requirements -'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' - -Refer to - -How to relate one or more alignment maps -'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' +.. note:: + The references in the FASTA file must match those to which the reads + in the alignment map were aligned. + Discrepancies can cause the ``relate`` command to fail or produce + erroneous relation vectors. + This problem will not occur if you use the same (unaltered) FASTA + file for both the ``align`` and ``relate`` commands, or run both + at once using the command ``seismic all``. List every alignment map file after the FASTA file. Refer to :doc:`./inputs` for details on how to list multiple files. @@ -620,6 +614,108 @@ aligned to reference ``ref-1``, use the following command:: where ``{refs.fa}`` is the path to the file of reference sequences. +Relate: Settings +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Relate options shared with alignment +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +Because this workflow can be started from the ``align`` or ``relate`` +commands, the latter duplicates some of the options of the former: +``--phred-enc``, ``--min-mapq``, ``--min-reads``, and ``--out-dir`` have +the same functions in ``relate`` and ``align``. + +Relate setting: Minimum Phred score +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +Base calls with Phred scores below ``--min-phred`` are labeled ambiguous +matches or substitutions, as if they were ``N``\s. +For example, if the minimum Phred score is 25 (the default) and a base +``T`` is called as a match with a Phred score of 20, then it would be +marked as possibly a match and possibly a subsitution to A, C, or G. +See :ref:`relate_low_qual` for more information. + +Relate setting: Ambiguous insertions and deletions +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +The most tricky problem in computing relation vectors is that insertions +and deletions ("indels") in repetitive regions cause ambiguities. +SEISMIC-RNA introduces a new algorithm for identifying ambiguous indels +(see :doc:`../algos/ambrel` for more information). +This algorithm is enabled by default. +If it is not necessary to identify ambiguous indels, then the algorithm +can be disabled with ``--no-ambrel``, which will speed up ``relate`` at +the cost of reducing its accuracy on indels. + +Relate setting: Batch size +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +For an explanation of batching and how to use it, see :ref:`batches`. + +The dataset is partitioned into batches by the ``relate`` command. +The option ``--batch-size`` sets a target amount of data for each batch, +in millions of base calls (megabases). +This calculation considers the total number of relationships per read, +which equals the length of the reference sequence. +Thus, the number of base calls *B* is the product of the number of reads +*N* and the length of the reference sequence *L*: + +*B* = *NL* + +Since *L* is known and ``--batch-size`` specifies a target size for *B*, +*N* can be solved for: + +*N* = *B*/*L* + +SEISMIC-RNA will aim to put exactly *N* reads in each batch but the last +(the last batch can be smaller because it has just the leftover reads). +If the reads are single-ended or if alignment was not run in mixed mode, +then every batch but the last will contain exactly *N* reads. +If mixed mode was used, then batches may contain more than *N* reads, up +to a maximum of 2\ *N* in the extreme case that every read in the batch +belonged to a pair in which the other mate did not align. + +Relate: Output files +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +All output files go into the directory ``{out}/{sample}/relate/{ref}``, +where ``{out}`` is the output directory, ``{sample}`` is the sample, and +``{ref}`` is the name of the reference. + +Relate output file: Batch of relation vectors +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +The data of relationships is written in batches. +Each batch contains a ``RelateBatchIO`` object and is saved to the file +``relate-batch-{num}.brickle``, where ``{num}`` is the batch number. +See :doc:`../data/relate/relate` for details on the data structure. +See :doc:`../formats/data/brickle` for details on brickle files. + +Relate output file: Batch of read names +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +For each batch, the relate step assigns an index (a nonnegative integer) +to each read and writes a file mapping the indexes to the read names. +Each batch contains a ``QnamesBatchIO`` object and is saved to the file +``qnames-batch-{num}.brickle``, where ``{num}`` is the batch number. +See :doc:`../data/relate/qnames` for details on the data structure. +See :doc:`../formats/data/brickle` for details on brickle files. + +Relate output file: Reference sequence +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +The relate step writes the reference sequence as a ``RefseqIO`` object +to the file ``refseq.brickle``. +See :doc:`../data/relate/refseq` for details on the data structure. +See :doc:`../formats/data/brickle` for details on brickle files. + +Relate output file: Report +"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" + +A report file is written that records the settings used to generate +relation vectors summarizes the results. +See :doc:`../formats/report/relate` for more information. + .. _wf_all: Run the entire workflow with one command diff --git a/docs/_sources/writeme.rst.txt b/docs/_sources/writeme.rst.txt index db100b72..9d55f67c 100644 --- a/docs/_sources/writeme.rst.txt +++ b/docs/_sources/writeme.rst.txt @@ -4,11 +4,11 @@ How to Write this Manual .. figure:: https://imgs.xkcd.com/comics/manuals.png :align: center - :alt: A comic strip from xkcd joking that the most unhelpful tool is + :alt: A comic strip of xkcd joking that the most unhelpful tool is one whose manual starts with "How to read this manual". Image credit: "Manuals" from xkcd by Randall Munroe (https://xkcd.com/1343/) -Rather than begin with "How to read this manual" (hopefully, this manual -is sufficiently self-evident in that regard), this manual concludes with -guidelines for writing and editing future versions. +Rather than begin with "How to read this manual" (hopefully, that task +is sufficiently self-evident), this documentation ends with guidelines +for writing future versions. diff --git a/docs/algos/compseq.html b/docs/algos/compseq.html new file mode 100644 index 00000000..fb47a808 --- /dev/null +++ b/docs/algos/compseq.html @@ -0,0 +1,376 @@ + + + + + + + Algorithm for Sequence Compression/Decompression — seismic-rna documentation + + + + + + + + + + + + + + + + + +
+ + +
+ +
+
+
+ +
+
+
+
+ +
+

Algorithm for Sequence Compression/Decompression

+
+

Background on Sequence Compression/Decompression

+

In a nucleic acid sequence, each base is represented by one character in +ASCII code and hence occupies 1 byte (8 bits). +But because each base has only 4 possibilities, each base requires only +log2(4) = 2 bits, shown in the following table:

+ ++++ + + + + + + + + + + + + + + + + + + + +

Base

2-Bit Binary Code

A

00

C

01

G

10

T/U

11

+

For efficient storage of (long) nucleic acid sequences, the following +algorithm compresses 4 bases (= 8 bits/byte ÷ 2 bits/nt) in each byte, +and then restores the original sequence when needed.

+
+
+

Algorithm for Sequence Compression

+
+

Algorithm for Sequence Compression: Procedure

+

First, attributes of the nucleic acid sequence are recorded:

+
    +
  • rna (bool): whether the sequence is RNA

  • +
  • length (int): number of bases in the sequence

  • +
  • ns (tuple[int, ...]): positions of N bases, if any

  • +
+

The type and the length are scalar values with constant sizes regardless +of the length of the sequence. +The 0-indexed positions of ambiguous bases is an array that at worst (if +all bases are N) scales linearly with the length of the sequence and +at best (if no bases are N) is a constant (small) size. +Because most reference sequences contain no or very few bases that are +N, recording the N positions generally requires little space.

+

Then, each non-overlapping segment of four bases is encoded as one byte +by concatenating the 2-bit codes (above) of the bases in the segment, in +reverse order. +Ambiguous (N) bases are arbitrarily encoded as 00. +Because this step requires the length of the sequence to be a multiple +of 4, the sequence is padded on its 3’ side with A (an arbitrary +choice) until its length is a multiple of 4.

+
+
+

Algorithm for Sequence Compression: Example

+

Given the DNA sequence CAGNTTCGAN, the attributes are extracted:

+
    +
  • rna: False

  • +
  • length: 10

  • +
  • ns: (3, 9) (note: these positions are 0-indexed)

  • +
+

The sequence is then padded with A at the 3’ end until its length +becomes a multiple of 4 (in this case, length 12):

+

CAGNTTCGANAA

+

Then, each 4-base segment is transformed into one byte by encoding each +base and concatenating the codes in order from code 4 to code 1:

+ +++++++++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Number

Sequence

Base 1

Base 2

Base 3

Base 4

Code 4

Code 3

Code 2

Code 1

Byte

1

CAGN

C

A

G

N

00

10

00

01

00100001

2

TTCG

T

T

C

G

10

01

11

11

10011111

3

ANAA

A

N

A

A

00

00

00

00

00000000

+

Thus, the compressed byte string is [00100001, 10011111, 00000000]. +Note that this string is only 3 bytes, compared to 10 for the original. +As a bytes object, the representation is b'!\x9f\x00'.

+
+
+
+

Algorithm for Sequence Decompression

+
+

Algorithm for Sequence Decompression: Procedure

+

Beginning with a compressed sequence of bytes, each byte is transformed +into four bases by decoding each 2-bit chunk to a base using the above +table, then reversing the order of the four bases. +The code 11 is decoded to U if the rna attribute is True +and to T if False. +The sequence at this point must have a number of bases divisible by 4. +It is cut to the correct number of bases using the length attribute. +Finally, every position in the attribute ns is masked to N.

+
+
+

Algorithm for Sequence Decompression: Example

+

Suppose that a compressed sequence has the following attributes:

+
    +
  • compressed byte string: [00100001, 10011111, 00000000]

  • +
  • rna: False

  • +
  • length: 10

  • +
  • ns: (3, 9)

  • +
+

To decompress the sequence, each byte is split into four 2-bit segments, +decoded, reversed, and reassembled:

+ +++++++++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

Number

Byte

Code 1

Code 2

Code 3

Code 4

Base 4

Base 3

Base 2

Base 1

Sequence

1

00100001

00

10

00

01

C

A

G

A

CAGA

2

10011111

10

01

11

11

T

T

C

G

TTCG

3

00000000

00

00

00

00

A

A

A

A

AAAA

+

The resulting sequence, CAGATTCGAAAA, is trimmed to 10 nt: CAGATTCGAA. +Finally, (0-indexed) positions 3 and 9 are replaced with N: CAGNTTCGAN.

+
+
+
+ + +
+
+ +
+
+
+
+ + + + \ No newline at end of file diff --git a/docs/algos/index.html b/docs/algos/index.html index 2e9af7a7..3c109ebf 100644 --- a/docs/algos/index.html +++ b/docs/algos/index.html @@ -19,8 +19,8 @@ - - + + @@ -52,7 +52,10 @@
  • seismicrna package
  • File Formats
  • Data Structures
  • -
  • Algorithms
  • +
  • Algorithms +
  • Bugs and Requests
  • How to Write this Manual
  • @@ -83,14 +86,24 @@

    Algorithms

    +
    + +