rouskinlab · matthewfallan · Nov 17, 2023 · Nov 15, 2023 · Nov 16, 2023 · Nov 16, 2023
diff --git a/docs/.doctrees/algos/compseq.doctree b/docs/.doctrees/algos/compseq.doctree
diff --git a/docs/.doctrees/algos/index.doctree b/docs/.doctrees/algos/index.doctree
diff --git a/docs/.doctrees/api/seismicrna.clust.doctree b/docs/.doctrees/api/seismicrna.clust.doctree
diff --git a/docs/.doctrees/api/seismicrna.core.batch.doctree b/docs/.doctrees/api/seismicrna.core.batch.doctree
diff --git a/docs/.doctrees/api/seismicrna.core.doctree b/docs/.doctrees/api/seismicrna.core.doctree
diff --git a/docs/.doctrees/api/seismicrna.core.seq.doctree b/docs/.doctrees/api/seismicrna.core.seq.doctree
diff --git a/docs/.doctrees/api/seismicrna.core.seq.tests.doctree b/docs/.doctrees/api/seismicrna.core.seq.tests.doctree
diff --git a/docs/.doctrees/api/seismicrna.table.doctree b/docs/.doctrees/api/seismicrna.table.doctree
diff --git a/docs/.doctrees/api/seismicrna.workflow.doctree b/docs/.doctrees/api/seismicrna.workflow.doctree
diff --git a/docs/.doctrees/cli.doctree b/docs/.doctrees/cli.doctree
diff --git a/docs/.doctrees/data/index.doctree b/docs/.doctrees/data/index.doctree
diff --git a/docs/.doctrees/data/relate.doctree b/docs/.doctrees/data/relate.doctree
diff --git a/docs/.doctrees/data/relate/codes.doctree b/docs/.doctrees/data/relate/codes.doctree
diff --git a/docs/.doctrees/data/relate/index.doctree b/docs/.doctrees/data/relate/index.doctree
diff --git a/docs/.doctrees/data/relate/qnames.doctree b/docs/.doctrees/data/relate/qnames.doctree
diff --git a/docs/.doctrees/data/relate/refseq.doctree b/docs/.doctrees/data/relate/refseq.doctree
diff --git a/docs/.doctrees/data/relate/relate.doctree b/docs/.doctrees/data/relate/relate.doctree
diff --git a/docs/.doctrees/environment.pickle b/docs/.doctrees/environment.pickle
diff --git a/docs/.doctrees/formats/data/fastq.doctree b/docs/.doctrees/formats/data/fastq.doctree
diff --git a/docs/.doctrees/manuals/index.doctree b/docs/.doctrees/manuals/index.doctree
diff --git a/docs/.doctrees/manuals/parallel.doctree b/docs/.doctrees/manuals/parallel.doctree
diff --git a/docs/.doctrees/manuals/workflow.doctree b/docs/.doctrees/manuals/workflow.doctree
diff --git a/docs/.doctrees/writeme.doctree b/docs/.doctrees/writeme.doctree
diff --git a/docs/_sources/algos/compseq.rst.txt b/docs/_sources/algos/compseq.rst.txt
@@ -0,0 +1,122 @@
+
+Algorithm for Sequence Compression/Decompression
+========================================================================
+
+Background on Sequence Compression/Decompression
+------------------------------------------------------------------------
+
+In a nucleic acid sequence, each base is represented by one character in
+`ASCII code`_ and hence occupies 1 byte (8 bits).
+But because each base has only 4 possibilities, each base requires only
+log\ :sub:`2`\ (4) = 2 bits, shown in the following table:
+
+==== =================
+Base 2-Bit Binary Code
+==== =================
+A    ``00``
+C    ``01``
+G    ``10``
+T/U  ``11``
+==== =================
+
+For efficient storage of (long) nucleic acid sequences, the following
+algorithm compresses 4 bases (= 8 bits/byte ÷ 2 bits/nt) in each byte,
+and then restores the original sequence when needed.
+
+Algorithm for Sequence Compression
+------------------------------------------------------------------------
+
+Algorithm for Sequence Compression: Procedure
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+First, attributes of the nucleic acid sequence are recorded:
+
+- rna (``bool``): whether the sequence is RNA
+- length (``int``): number of bases in the sequence
+- ns (``tuple[int, ...]``): positions of ``N`` bases, if any
+
+The type and the length are scalar values with constant sizes regardless
+of the length of the sequence.
+The 0-indexed positions of ambiguous bases is an array that at worst (if
+all bases are ``N``) scales linearly with the length of the sequence and
+at best (if no bases are ``N``) is a constant (small) size.
+Because most reference sequences contain no or very few bases that are
+``N``, recording the ``N`` positions generally requires little space.
+
+Then, each non-overlapping segment of four bases is encoded as one byte
+by concatenating the 2-bit codes (above) of the bases in the segment, in
+reverse order.
+Ambiguous (``N``) bases are arbitrarily encoded as ``00``.
+Because this step requires the length of the sequence to be a multiple
+of 4, the sequence is padded on its 3' side with ``A`` (an arbitrary
+choice) until its length is a multiple of 4.
+
+Algorithm for Sequence Compression: Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Given the DNA sequence ``CAGNTTCGAN``, the attributes are extracted:
+
+- rna: ``False``
+- length: ``10``
+- ns: ``(3, 9)`` (note: these positions are 0-indexed)
+
+The sequence is then padded with ``A`` at the 3' end until its length
+becomes a multiple of 4 (in this case, length 12):
+
+``CAGNTTCGANAA``
+
+Then, each 4-base segment is transformed into one byte by encoding each
+base and concatenating the codes in order from code 4 to code 1:
+
+====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
+Number Sequence Base 1 Base 2 Base 3 Base 4 Code 4 Code 3 Code 2 Code 1 Byte
+====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
+1      CAGN     C      A      G      N      00     10     00     01     00100001
+2      TTCG     T      T      C      G      10     01     11     11     10011111
+3      ANAA     A      N      A      A      00     00     00     00     00000000
+====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
+
+Thus, the compressed byte string is ``[00100001, 10011111, 00000000]``.
+Note that this string is only 3 bytes, compared to 10 for the original.
+As a ``bytes`` object, the representation is ``b'!\x9f\x00'``.
+
+Algorithm for Sequence Decompression
+------------------------------------------------------------------------
+
+Algorithm for Sequence Decompression: Procedure
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Beginning with a compressed sequence of bytes, each byte is transformed
+into four bases by decoding each 2-bit chunk to a base using the above
+table, then reversing the order of the four bases.
+The code ``11`` is decoded to ``U`` if the `rna` attribute is ``True``
+and to ``T`` if ``False``.
+The sequence at this point must have a number of bases divisible by 4.
+It is cut to the correct number of bases using the `length` attribute.
+Finally, every position in the attribute `ns` is masked to ``N``.
+
+Algorithm for Sequence Decompression: Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Suppose that a compressed sequence has the following attributes:
+
+- compressed byte string: ``[00100001, 10011111, 00000000]``
+- rna: ``False``
+- length: ``10``
+- ns: ``(3, 9)``
+
+To decompress the sequence, each byte is split into four 2-bit segments,
+decoded, reversed, and reassembled:
+
+====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
+Number Byte     Code 1 Code 2 Code 3 Code 4 Base 4 Base 3 Base 2 Base 1 Sequence
+====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
+1      00100001 00     10     00     01     C      A      G      A      CAGA
+2      10011111 10     01     11     11     T      T      C      G      TTCG
+3      00000000 00     00     00     00     A      A      A      A      AAAA
+====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
+
+The resulting sequence, ``CAGATTCGAAAA``, is trimmed to 10 nt: ``CAGATTCGAA``.
+Finally, (0-indexed) positions 3 and 9 are replaced with ``N``: ``CAGNTTCGAN``.
+
+.. _`ASCII code`: https://en.wikipedia.org/wiki/ASCII
diff --git a/docs/_sources/algos/index.rst.txt b/docs/_sources/algos/index.rst.txt
@@ -1,3 +1,8 @@
 ************************************************************************
 Algorithms
 ************************************************************************
+
+.. toctree::
+    :maxdepth: 2
+
+    compseq
diff --git a/docs/_sources/data/index.rst.txt b/docs/_sources/data/index.rst.txt
@@ -6,4 +6,4 @@ Data Structures
 .. toctree::
     :maxdepth: 2
 
-    relate
+    relate/index
diff --git a/docs/_sources/data/relate.rst.txt → docs/_sources/data/relate/codes.rst.txt b/docs/_sources/data/relate.rst.txt → docs/_sources/data/relate/codes.rst.txt
@@ -61,7 +61,9 @@ for some bases to define the relationship:
 - low-quality base calls
 - ambiguous insertions and deletions
 
-Low-quality base calls
+.. _relate_low_qual:
+
+Encoding low-quality base calls
 """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
 
 A low-quality base call is defined as having a `Phred quality score`_
@@ -115,7 +117,7 @@ four columns "Read: A/C/G/T?".
     than once towards the total number of matches or mutations. To learn
     how mutations in relation vectors are counted, see [REF].
 
-Ambiguous insertions and deletions
+Encoding ambiguous insertions and deletions
 """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
 
 Insertions and deletions (collectively, "indels") in the read can cause
@@ -183,7 +185,7 @@ is covered in one mate but not in the other, one mate's Phred score is
 above the threshold and the other's is below, or (more rarely) the base
 calls themselves differ.
 
-Consensus relationships
+Encoding consensus relationships
 """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
 
 When finding the consensus of two mates, information in one mate should
@@ -219,7 +221,7 @@ position 4 is a match, but it is low quality at position 5, so even the
 consensus byte is ambiguous. Neither mate covers position 6, so the
 consensus byte is blank.
 
-Irreconcilable relationships
+Encoding irreconcilable relationships
 """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
 
 It is possible, although rare, for mates 1 and 2 to share no bits. For

diff --git a/docs/_sources/data/relate/index.rst.txt b/docs/_sources/data/relate/index.rst.txt
@@ -0,0 +1,11 @@
+
+Relate Data Structures
+========================================================================
+
+.. toctree::
+    :maxdepth: 2
+
+    codes
+    relate
+    qnames
+    refseq
diff --git a/docs/_sources/data/relate/qnames.rst.txt b/docs/_sources/data/relate/qnames.rst.txt
@@ -0,0 +1,39 @@
+
+Read Names Batch
+------------------------------------------------------------------------
+
+Each batch of relation vectors is a ``QnamesBatchIO`` object.
+
+Read names batch: Structure
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The read names are in the attribute ``names``, a ``numpy.ndarray`` with
+data type ``str``.
+The name of the read with number ``i`` (see :ref:`relate_read_nums` for
+more information) is at index ``i`` of the ``names`` array.
+
+Read names batch: Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Suppose a XAM file contains five reads:
+
+=========== ===========================================
+Read Number Read Name
+=========== ===========================================
+0           ``VL00355:66:AACYHM7M5:1:2106:54095:30496``
+1           ``VL00355:66:AACYHM7M5:1:2303:34554:55259``
+2           ``VL00355:66:AACYHM7M5:1:2411:30558:38352``
+3           ``VL00355:66:AACYHM7M5:1:1206:41825:52949``
+4           ``VL00355:66:AACYHM7M5:1:2506:10903:12605``
+=========== ===========================================
+
+The relate step would write a batch with the ``names`` attribute ::
+
+    ["VL00355:66:AACYHM7M5:1:2106:54095:30496",
+     "VL00355:66:AACYHM7M5:1:2303:34554:55259",
+     "VL00355:66:AACYHM7M5:1:2411:30558:38352",
+     "VL00355:66:AACYHM7M5:1:1206:41825:52949",
+     "VL00355:66:AACYHM7M5:1:2506:10903:12605"]
+
+Note that the attribute is shown as a ``list`` for visual simplicity,
+but would really be a ``numpy.ndarray``.
diff --git a/docs/_sources/data/relate/refseq.rst.txt b/docs/_sources/data/relate/refseq.rst.txt
@@ -0,0 +1,31 @@
+
+Reference Sequence
+------------------------------------------------------------------------
+
+The reference sequence is saved as a ``RefseqIO`` object.
+
+Reference sequence: Structure
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A ``RefseqIO`` object stores a reference sequence as a space-efficient
+``CompressedSeq`` object in the private attribute ``_s``.
+(The original sequence is available via the cached property ``refseq``.)
+A ``CompressedSeq`` object has the following attributes:
+
+- ``b`` (``bytes``): sequence of bytes, each byte encoding 4 nucleotides
+- ``s`` (``int``): length of the sequence (number of nucleotides)
+- ``n`` (``tuple[int, ...]``): 0-indexed position of each ``N``
+- ``r`` (``bool``): ``True`` if the sequence is RNA, ``False`` if DNA
+
+Reference sequence: Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Suppose that a DNA sequence is ``CAGNTTCGAN``.
+This sequence would be compressed into ::
+
+    b = b'!\x9f\x00'
+    s = 10
+    n = (3, 9)
+    r = False
+
+For details on the algorithm, see :doc:`../../algos/compseq`.
diff --git a/docs/_sources/data/relate/relate.rst.txt b/docs/_sources/data/relate/relate.rst.txt
@@ -0,0 +1,125 @@
+
+Relate Batch
+------------------------------------------------------------------------
+
+Each batch of relation vectors is a ``RelateBatchIO`` object.
+
+Relate batch: Structure
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following attributes encode the relationships between each read and
+each position in the reference sequence:
+
+========= ============================================ ====================================================================
+Attribute Data Type                                    Description
+========= ============================================ ====================================================================
+``end5s`` ``numpy.ndarray[int]``                       array of the first position of the most upstream mate in each read
+``mid5s`` ``numpy.ndarray[int]``                       array of the first position of the most downstream mate in each read
+``mid3s`` ``numpy.ndarray[int]``                       array of the last position of the most upstream mate in each read
+``end3s`` ``numpy.ndarray[int]``                       array of the last position of the most downstream mate in each read
+``muts``  ``dict[int, dict[int, numpy.ndarray[int]]]`` array of the reads with each type of mutation at each position
+========= ============================================ ====================================================================
+
+.. note::
+    The positions of the first and last bases in the reference sequence
+    are defined to be 1 and the length of the sequence, respectively.
+
+.. _relate_read_nums:
+
+Relate batch: Structure of read numbers
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+Each read, or pair of paired-end reads, is labeled with a non-negative
+integer: 0 for the first read in each batch, and incrementing by 1 for
+each subsequent read.
+Within one batch, all read numbers are unique.
+However, two different batches can have reads that share numbers.
+
+Relate batch: Structure of 5' and 3' end positions
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+- ``end5s``, ``mid5s``, ``mid3s``, and ``end3s`` are all 1-dimensional
+  ``numpy.ndarray`` objects.
+- For any relate batch, ``end5s``, ``mid5s``, ``mid3s``, and ``end3s``
+  all have the same length (which may be any integer ≥ 0).
+- A read with index ``i`` corresponds to the ``i``\ th values of
+  ``end5s``, ``mid5s``, ``mid3s``, and ``end3s``; denoted (respectively)
+  ``end5s[i]``, ``mid5s[i]``, ``mid3s[i]``, and ``end3s[i]``.
+- For every read ``i``:
+
+  - 1 ≤ ``end5s[i]`` ≤ ``end3s[i]`` ≤ length of reference sequence
+  - If paired-end and there is a gap of ≥ 1 nt between mates 1 and 2:
+
+    - ``end5s[i]`` ≤ ``mid3s[i]`` < ``mid5s[i]`` ≤ ``end3s[i]``
+
+  - Otherwise:
+
+    - ``end5s[i]`` = ``mid5s[i]`` ≤ ``mid3s[i]`` = ``end3s[i]``
+
+Relate batch: Structure of mutations
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+``muts`` is a ``dict`` wherein
+
+- each key is a position in the reference sequence (``int``)
+- each value is a ``dict`` wherein
+
+  - each key is a type of mutation (``int``, see :doc:`./codes` for more
+    information)
+  - each value is an array of the numbers of the reads that have the
+    given type of mutation at the given position (``numpy.ndarray``)
+
+Relate batch: Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For example, suppose that the reference sequence is ``TCAGAACC`` and a
+batch contains five paired-end reads, numbered 0 to 4:
+
+==== ==== ============
+Read Mate Alignment
+==== ==== ============
+0    1    ``_CAG____``
+0    2    ``_____AGC``
+1    1    ``___GTA__``
+1    2    ``TCT_____``
+2    1    ``____AAC_``
+2    2    ``_CA_____``
+3    1    ``TAAGT___``
+3    2    ``______CC``
+4    1    ``__AGA___``
+4    2    ``___GA-C_``
+Ref       ``TCAGAACC``
+==== ==== ============
+
+The positions, reads, and relationships can be shown explicitly as a
+matrix (see :doc:`./codes` for information on the relationship codes):
+
+==== === === === === === === === ===
+Read 1   2   3   4   5   6   7   8
+==== === === === === === === === ===
+0    255 1   1   1   255 1   64  1
+1    1   1   128 1   128 1   255 255
+2    255 1   1   255 1   1   1   255
+3    1   16  1   1   128 255 1   1
+4    255 255 1   1   3   3   1   255
+==== === === === === === === === ===
+
+In a relate batch, they would be encoded as follows:
+
+- ``end5s``: ``[2, 1, 2, 1, 3]``
+- ``mid5s``: ``[4, 1, 3, 5, 3]``
+- ``mid3s``: ``[6, 6, 5, 7, 7]``
+- ``end3s``: ``[8, 6, 7, 8, 7]``
+- ``muts``::
+
+    {1: {},
+     2: {16: [3]},
+     3: {128: [1]},
+     4: {},
+     5: {3: [4], 128: [1, 3]},
+     6: {3: [4]},
+     7: {64: [0]},
+     8: {}}
+
+  Note that the numbers are shown here for visual simplicity as ``list``
+  objects, but would really be ``numpy.ndarray`` objects.
diff --git a/docs/_sources/formats/data/fastq.rst.txt b/docs/_sources/formats/data/fastq.rst.txt
@@ -33,10 +33,10 @@ file was called correctly during sequencing.
 The probability *p* that a base was called incorrectly is 10 raised to
 the power of the quotient of the Phred score *s* and -10:
 
-*p* = 10 :sup:`-s/10`
+*p* = 10\ :sup:`-s/10`
 
 For example, if a base call has a Phred score of 30, the probability
-that the base call is incorrect is 10 :sup:`-30/10` = 0.001.
+that the base call is incorrect is 10\ :sup:`-30/10` = 0.001.
 
 In FASTQ files, each phred quality score (a non-negative integer) is
 encoded as one character of text by adding another integer *N* to the