diff --git a/docs/.doctrees/algos/compseq.doctree b/docs/.doctrees/algos/compseq.doctree
new file mode 100644
index 00000000..18c69f17
Binary files /dev/null and b/docs/.doctrees/algos/compseq.doctree differ
diff --git a/docs/.doctrees/algos/index.doctree b/docs/.doctrees/algos/index.doctree
index 2ab2a891..dff68b6b 100644
Binary files a/docs/.doctrees/algos/index.doctree and b/docs/.doctrees/algos/index.doctree differ
diff --git a/docs/.doctrees/api/seismicrna.clust.doctree b/docs/.doctrees/api/seismicrna.clust.doctree
index a9302d23..13a86170 100644
Binary files a/docs/.doctrees/api/seismicrna.clust.doctree and b/docs/.doctrees/api/seismicrna.clust.doctree differ
diff --git a/docs/.doctrees/api/seismicrna.core.batch.doctree b/docs/.doctrees/api/seismicrna.core.batch.doctree
index d251699f..43383446 100644
Binary files a/docs/.doctrees/api/seismicrna.core.batch.doctree and b/docs/.doctrees/api/seismicrna.core.batch.doctree differ
diff --git a/docs/.doctrees/api/seismicrna.core.doctree b/docs/.doctrees/api/seismicrna.core.doctree
index fbb016f6..279037e4 100644
Binary files a/docs/.doctrees/api/seismicrna.core.doctree and b/docs/.doctrees/api/seismicrna.core.doctree differ
diff --git a/docs/.doctrees/api/seismicrna.core.seq.doctree b/docs/.doctrees/api/seismicrna.core.seq.doctree
index a10fb2a1..b0f8f439 100644
Binary files a/docs/.doctrees/api/seismicrna.core.seq.doctree and b/docs/.doctrees/api/seismicrna.core.seq.doctree differ
diff --git a/docs/.doctrees/api/seismicrna.core.seq.tests.doctree b/docs/.doctrees/api/seismicrna.core.seq.tests.doctree
index 5f17e88e..bba06ef6 100644
Binary files a/docs/.doctrees/api/seismicrna.core.seq.tests.doctree and b/docs/.doctrees/api/seismicrna.core.seq.tests.doctree differ
diff --git a/docs/.doctrees/api/seismicrna.table.doctree b/docs/.doctrees/api/seismicrna.table.doctree
index 5dd0dd24..9b4f5e2a 100644
Binary files a/docs/.doctrees/api/seismicrna.table.doctree and b/docs/.doctrees/api/seismicrna.table.doctree differ
diff --git a/docs/.doctrees/api/seismicrna.workflow.doctree b/docs/.doctrees/api/seismicrna.workflow.doctree
index 26691a61..b33709a9 100644
Binary files a/docs/.doctrees/api/seismicrna.workflow.doctree and b/docs/.doctrees/api/seismicrna.workflow.doctree differ
diff --git a/docs/.doctrees/cli.doctree b/docs/.doctrees/cli.doctree
index 978569b6..d43bbc8e 100644
Binary files a/docs/.doctrees/cli.doctree and b/docs/.doctrees/cli.doctree differ
diff --git a/docs/.doctrees/data/index.doctree b/docs/.doctrees/data/index.doctree
index acf27098..035927ca 100644
Binary files a/docs/.doctrees/data/index.doctree and b/docs/.doctrees/data/index.doctree differ
diff --git a/docs/.doctrees/data/relate.doctree b/docs/.doctrees/data/relate.doctree
deleted file mode 100644
index d850ced3..00000000
Binary files a/docs/.doctrees/data/relate.doctree and /dev/null differ
diff --git a/docs/.doctrees/data/relate/codes.doctree b/docs/.doctrees/data/relate/codes.doctree
new file mode 100644
index 00000000..c0350117
Binary files /dev/null and b/docs/.doctrees/data/relate/codes.doctree differ
diff --git a/docs/.doctrees/data/relate/index.doctree b/docs/.doctrees/data/relate/index.doctree
new file mode 100644
index 00000000..65c56fc6
Binary files /dev/null and b/docs/.doctrees/data/relate/index.doctree differ
diff --git a/docs/.doctrees/data/relate/qnames.doctree b/docs/.doctrees/data/relate/qnames.doctree
new file mode 100644
index 00000000..9b2b653d
Binary files /dev/null and b/docs/.doctrees/data/relate/qnames.doctree differ
diff --git a/docs/.doctrees/data/relate/refseq.doctree b/docs/.doctrees/data/relate/refseq.doctree
new file mode 100644
index 00000000..2b299867
Binary files /dev/null and b/docs/.doctrees/data/relate/refseq.doctree differ
diff --git a/docs/.doctrees/data/relate/relate.doctree b/docs/.doctrees/data/relate/relate.doctree
new file mode 100644
index 00000000..5a0e6f3a
Binary files /dev/null and b/docs/.doctrees/data/relate/relate.doctree differ
diff --git a/docs/.doctrees/environment.pickle b/docs/.doctrees/environment.pickle
index 8ed028ff..a4103d74 100644
Binary files a/docs/.doctrees/environment.pickle and b/docs/.doctrees/environment.pickle differ
diff --git a/docs/.doctrees/formats/data/fastq.doctree b/docs/.doctrees/formats/data/fastq.doctree
index 9fdeab70..6b7acd91 100644
Binary files a/docs/.doctrees/formats/data/fastq.doctree and b/docs/.doctrees/formats/data/fastq.doctree differ
diff --git a/docs/.doctrees/manuals/index.doctree b/docs/.doctrees/manuals/index.doctree
index 1e0f8634..52f0fbca 100644
Binary files a/docs/.doctrees/manuals/index.doctree and b/docs/.doctrees/manuals/index.doctree differ
diff --git a/docs/.doctrees/manuals/parallel.doctree b/docs/.doctrees/manuals/parallel.doctree
new file mode 100644
index 00000000..a1866d27
Binary files /dev/null and b/docs/.doctrees/manuals/parallel.doctree differ
diff --git a/docs/.doctrees/manuals/workflow.doctree b/docs/.doctrees/manuals/workflow.doctree
index b2036cae..88b1846f 100644
Binary files a/docs/.doctrees/manuals/workflow.doctree and b/docs/.doctrees/manuals/workflow.doctree differ
diff --git a/docs/.doctrees/writeme.doctree b/docs/.doctrees/writeme.doctree
index ea77a4b6..5ec41993 100644
Binary files a/docs/.doctrees/writeme.doctree and b/docs/.doctrees/writeme.doctree differ
diff --git a/docs/_sources/algos/compseq.rst.txt b/docs/_sources/algos/compseq.rst.txt
new file mode 100644
index 00000000..09a3d904
--- /dev/null
+++ b/docs/_sources/algos/compseq.rst.txt
@@ -0,0 +1,122 @@
+
+Algorithm for Sequence Compression/Decompression
+========================================================================
+
+Background on Sequence Compression/Decompression
+------------------------------------------------------------------------
+
+In a nucleic acid sequence, each base is represented by one character in
+`ASCII code`_ and hence occupies 1 byte (8 bits).
+But because each base has only 4 possibilities, each base requires only
+log\ :sub:`2`\ (4) = 2 bits, shown in the following table:
+
+==== =================
+Base 2-Bit Binary Code
+==== =================
+A ``00``
+C ``01``
+G ``10``
+T/U ``11``
+==== =================
+
+For efficient storage of (long) nucleic acid sequences, the following
+algorithm compresses 4 bases (= 8 bits/byte ÷ 2 bits/nt) in each byte,
+and then restores the original sequence when needed.
+
+Algorithm for Sequence Compression
+------------------------------------------------------------------------
+
+Algorithm for Sequence Compression: Procedure
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+First, attributes of the nucleic acid sequence are recorded:
+
+- rna (``bool``): whether the sequence is RNA
+- length (``int``): number of bases in the sequence
+- ns (``tuple[int, ...]``): positions of ``N`` bases, if any
+
+The type and the length are scalar values with constant sizes regardless
+of the length of the sequence.
+The 0-indexed positions of ambiguous bases is an array that at worst (if
+all bases are ``N``) scales linearly with the length of the sequence and
+at best (if no bases are ``N``) is a constant (small) size.
+Because most reference sequences contain no or very few bases that are
+``N``, recording the ``N`` positions generally requires little space.
+
+Then, each non-overlapping segment of four bases is encoded as one byte
+by concatenating the 2-bit codes (above) of the bases in the segment, in
+reverse order.
+Ambiguous (``N``) bases are arbitrarily encoded as ``00``.
+Because this step requires the length of the sequence to be a multiple
+of 4, the sequence is padded on its 3' side with ``A`` (an arbitrary
+choice) until its length is a multiple of 4.
+
+Algorithm for Sequence Compression: Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Given the DNA sequence ``CAGNTTCGAN``, the attributes are extracted:
+
+- rna: ``False``
+- length: ``10``
+- ns: ``(3, 9)`` (note: these positions are 0-indexed)
+
+The sequence is then padded with ``A`` at the 3' end until its length
+becomes a multiple of 4 (in this case, length 12):
+
+``CAGNTTCGANAA``
+
+Then, each 4-base segment is transformed into one byte by encoding each
+base and concatenating the codes in order from code 4 to code 1:
+
+====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
+Number Sequence Base 1 Base 2 Base 3 Base 4 Code 4 Code 3 Code 2 Code 1 Byte
+====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
+1 CAGN C A G N 00 10 00 01 00100001
+2 TTCG T T C G 10 01 11 11 10011111
+3 ANAA A N A A 00 00 00 00 00000000
+====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
+
+Thus, the compressed byte string is ``[00100001, 10011111, 00000000]``.
+Note that this string is only 3 bytes, compared to 10 for the original.
+As a ``bytes`` object, the representation is ``b'!\x9f\x00'``.
+
+Algorithm for Sequence Decompression
+------------------------------------------------------------------------
+
+Algorithm for Sequence Decompression: Procedure
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Beginning with a compressed sequence of bytes, each byte is transformed
+into four bases by decoding each 2-bit chunk to a base using the above
+table, then reversing the order of the four bases.
+The code ``11`` is decoded to ``U`` if the `rna` attribute is ``True``
+and to ``T`` if ``False``.
+The sequence at this point must have a number of bases divisible by 4.
+It is cut to the correct number of bases using the `length` attribute.
+Finally, every position in the attribute `ns` is masked to ``N``.
+
+Algorithm for Sequence Decompression: Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Suppose that a compressed sequence has the following attributes:
+
+- compressed byte string: ``[00100001, 10011111, 00000000]``
+- rna: ``False``
+- length: ``10``
+- ns: ``(3, 9)``
+
+To decompress the sequence, each byte is split into four 2-bit segments,
+decoded, reversed, and reassembled:
+
+====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
+Number Byte Code 1 Code 2 Code 3 Code 4 Base 4 Base 3 Base 2 Base 1 Sequence
+====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
+1 00100001 00 10 00 01 C A G A CAGA
+2 10011111 10 01 11 11 T T C G TTCG
+3 00000000 00 00 00 00 A A A A AAAA
+====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
+
+The resulting sequence, ``CAGATTCGAAAA``, is trimmed to 10 nt: ``CAGATTCGAA``.
+Finally, (0-indexed) positions 3 and 9 are replaced with ``N``: ``CAGNTTCGAN``.
+
+.. _`ASCII code`: https://en.wikipedia.org/wiki/ASCII
diff --git a/docs/_sources/algos/index.rst.txt b/docs/_sources/algos/index.rst.txt
index 59e21ac4..fb935511 100644
--- a/docs/_sources/algos/index.rst.txt
+++ b/docs/_sources/algos/index.rst.txt
@@ -1,3 +1,8 @@
************************************************************************
Algorithms
************************************************************************
+
+.. toctree::
+ :maxdepth: 2
+
+ compseq
diff --git a/docs/_sources/data/index.rst.txt b/docs/_sources/data/index.rst.txt
index de902f1f..3f5855ae 100644
--- a/docs/_sources/data/index.rst.txt
+++ b/docs/_sources/data/index.rst.txt
@@ -6,4 +6,4 @@ Data Structures
.. toctree::
:maxdepth: 2
- relate
+ relate/index
diff --git a/docs/_sources/data/relate.rst.txt b/docs/_sources/data/relate/codes.rst.txt
similarity index 98%
rename from docs/_sources/data/relate.rst.txt
rename to docs/_sources/data/relate/codes.rst.txt
index 04ab01c5..aa9d24ba 100644
--- a/docs/_sources/data/relate.rst.txt
+++ b/docs/_sources/data/relate/codes.rst.txt
@@ -61,7 +61,9 @@ for some bases to define the relationship:
- low-quality base calls
- ambiguous insertions and deletions
-Low-quality base calls
+.. _relate_low_qual:
+
+Encoding low-quality base calls
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
A low-quality base call is defined as having a `Phred quality score`_
@@ -115,7 +117,7 @@ four columns "Read: A/C/G/T?".
than once towards the total number of matches or mutations. To learn
how mutations in relation vectors are counted, see [REF].
-Ambiguous insertions and deletions
+Encoding ambiguous insertions and deletions
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Insertions and deletions (collectively, "indels") in the read can cause
@@ -183,7 +185,7 @@ is covered in one mate but not in the other, one mate's Phred score is
above the threshold and the other's is below, or (more rarely) the base
calls themselves differ.
-Consensus relationships
+Encoding consensus relationships
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
When finding the consensus of two mates, information in one mate should
@@ -219,7 +221,7 @@ position 4 is a match, but it is low quality at position 5, so even the
consensus byte is ambiguous. Neither mate covers position 6, so the
consensus byte is blank.
-Irreconcilable relationships
+Encoding irreconcilable relationships
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
It is possible, although rare, for mates 1 and 2 to share no bits. For
diff --git a/docs/_sources/data/relate/index.rst.txt b/docs/_sources/data/relate/index.rst.txt
new file mode 100644
index 00000000..04b366c6
--- /dev/null
+++ b/docs/_sources/data/relate/index.rst.txt
@@ -0,0 +1,11 @@
+
+Relate Data Structures
+========================================================================
+
+.. toctree::
+ :maxdepth: 2
+
+ codes
+ relate
+ qnames
+ refseq
diff --git a/docs/_sources/data/relate/qnames.rst.txt b/docs/_sources/data/relate/qnames.rst.txt
new file mode 100644
index 00000000..b82c9052
--- /dev/null
+++ b/docs/_sources/data/relate/qnames.rst.txt
@@ -0,0 +1,39 @@
+
+Read Names Batch
+------------------------------------------------------------------------
+
+Each batch of relation vectors is a ``QnamesBatchIO`` object.
+
+Read names batch: Structure
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The read names are in the attribute ``names``, a ``numpy.ndarray`` with
+data type ``str``.
+The name of the read with number ``i`` (see :ref:`relate_read_nums` for
+more information) is at index ``i`` of the ``names`` array.
+
+Read names batch: Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Suppose a XAM file contains five reads:
+
+=========== ===========================================
+Read Number Read Name
+=========== ===========================================
+0 ``VL00355:66:AACYHM7M5:1:2106:54095:30496``
+1 ``VL00355:66:AACYHM7M5:1:2303:34554:55259``
+2 ``VL00355:66:AACYHM7M5:1:2411:30558:38352``
+3 ``VL00355:66:AACYHM7M5:1:1206:41825:52949``
+4 ``VL00355:66:AACYHM7M5:1:2506:10903:12605``
+=========== ===========================================
+
+The relate step would write a batch with the ``names`` attribute ::
+
+ ["VL00355:66:AACYHM7M5:1:2106:54095:30496",
+ "VL00355:66:AACYHM7M5:1:2303:34554:55259",
+ "VL00355:66:AACYHM7M5:1:2411:30558:38352",
+ "VL00355:66:AACYHM7M5:1:1206:41825:52949",
+ "VL00355:66:AACYHM7M5:1:2506:10903:12605"]
+
+Note that the attribute is shown as a ``list`` for visual simplicity,
+but would really be a ``numpy.ndarray``.
diff --git a/docs/_sources/data/relate/refseq.rst.txt b/docs/_sources/data/relate/refseq.rst.txt
new file mode 100644
index 00000000..59165633
--- /dev/null
+++ b/docs/_sources/data/relate/refseq.rst.txt
@@ -0,0 +1,31 @@
+
+Reference Sequence
+------------------------------------------------------------------------
+
+The reference sequence is saved as a ``RefseqIO`` object.
+
+Reference sequence: Structure
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+A ``RefseqIO`` object stores a reference sequence as a space-efficient
+``CompressedSeq`` object in the private attribute ``_s``.
+(The original sequence is available via the cached property ``refseq``.)
+A ``CompressedSeq`` object has the following attributes:
+
+- ``b`` (``bytes``): sequence of bytes, each byte encoding 4 nucleotides
+- ``s`` (``int``): length of the sequence (number of nucleotides)
+- ``n`` (``tuple[int, ...]``): 0-indexed position of each ``N``
+- ``r`` (``bool``): ``True`` if the sequence is RNA, ``False`` if DNA
+
+Reference sequence: Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Suppose that a DNA sequence is ``CAGNTTCGAN``.
+This sequence would be compressed into ::
+
+ b = b'!\x9f\x00'
+ s = 10
+ n = (3, 9)
+ r = False
+
+For details on the algorithm, see :doc:`../../algos/compseq`.
diff --git a/docs/_sources/data/relate/relate.rst.txt b/docs/_sources/data/relate/relate.rst.txt
new file mode 100644
index 00000000..89cab714
--- /dev/null
+++ b/docs/_sources/data/relate/relate.rst.txt
@@ -0,0 +1,125 @@
+
+Relate Batch
+------------------------------------------------------------------------
+
+Each batch of relation vectors is a ``RelateBatchIO`` object.
+
+Relate batch: Structure
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The following attributes encode the relationships between each read and
+each position in the reference sequence:
+
+========= ============================================ ====================================================================
+Attribute Data Type Description
+========= ============================================ ====================================================================
+``end5s`` ``numpy.ndarray[int]`` array of the first position of the most upstream mate in each read
+``mid5s`` ``numpy.ndarray[int]`` array of the first position of the most downstream mate in each read
+``mid3s`` ``numpy.ndarray[int]`` array of the last position of the most upstream mate in each read
+``end3s`` ``numpy.ndarray[int]`` array of the last position of the most downstream mate in each read
+``muts`` ``dict[int, dict[int, numpy.ndarray[int]]]`` array of the reads with each type of mutation at each position
+========= ============================================ ====================================================================
+
+.. note::
+ The positions of the first and last bases in the reference sequence
+ are defined to be 1 and the length of the sequence, respectively.
+
+.. _relate_read_nums:
+
+Relate batch: Structure of read numbers
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+Each read, or pair of paired-end reads, is labeled with a non-negative
+integer: 0 for the first read in each batch, and incrementing by 1 for
+each subsequent read.
+Within one batch, all read numbers are unique.
+However, two different batches can have reads that share numbers.
+
+Relate batch: Structure of 5' and 3' end positions
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+- ``end5s``, ``mid5s``, ``mid3s``, and ``end3s`` are all 1-dimensional
+ ``numpy.ndarray`` objects.
+- For any relate batch, ``end5s``, ``mid5s``, ``mid3s``, and ``end3s``
+ all have the same length (which may be any integer ≥ 0).
+- A read with index ``i`` corresponds to the ``i``\ th values of
+ ``end5s``, ``mid5s``, ``mid3s``, and ``end3s``; denoted (respectively)
+ ``end5s[i]``, ``mid5s[i]``, ``mid3s[i]``, and ``end3s[i]``.
+- For every read ``i``:
+
+ - 1 ≤ ``end5s[i]`` ≤ ``end3s[i]`` ≤ length of reference sequence
+ - If paired-end and there is a gap of ≥ 1 nt between mates 1 and 2:
+
+ - ``end5s[i]`` ≤ ``mid3s[i]`` < ``mid5s[i]`` ≤ ``end3s[i]``
+
+ - Otherwise:
+
+ - ``end5s[i]`` = ``mid5s[i]`` ≤ ``mid3s[i]`` = ``end3s[i]``
+
+Relate batch: Structure of mutations
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+``muts`` is a ``dict`` wherein
+
+- each key is a position in the reference sequence (``int``)
+- each value is a ``dict`` wherein
+
+ - each key is a type of mutation (``int``, see :doc:`./codes` for more
+ information)
+ - each value is an array of the numbers of the reads that have the
+ given type of mutation at the given position (``numpy.ndarray``)
+
+Relate batch: Example
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+For example, suppose that the reference sequence is ``TCAGAACC`` and a
+batch contains five paired-end reads, numbered 0 to 4:
+
+==== ==== ============
+Read Mate Alignment
+==== ==== ============
+0 1 ``_CAG____``
+0 2 ``_____AGC``
+1 1 ``___GTA__``
+1 2 ``TCT_____``
+2 1 ``____AAC_``
+2 2 ``_CA_____``
+3 1 ``TAAGT___``
+3 2 ``______CC``
+4 1 ``__AGA___``
+4 2 ``___GA-C_``
+Ref ``TCAGAACC``
+==== ==== ============
+
+The positions, reads, and relationships can be shown explicitly as a
+matrix (see :doc:`./codes` for information on the relationship codes):
+
+==== === === === === === === === ===
+Read 1 2 3 4 5 6 7 8
+==== === === === === === === === ===
+0 255 1 1 1 255 1 64 1
+1 1 1 128 1 128 1 255 255
+2 255 1 1 255 1 1 1 255
+3 1 16 1 1 128 255 1 1
+4 255 255 1 1 3 3 1 255
+==== === === === === === === === ===
+
+In a relate batch, they would be encoded as follows:
+
+- ``end5s``: ``[2, 1, 2, 1, 3]``
+- ``mid5s``: ``[4, 1, 3, 5, 3]``
+- ``mid3s``: ``[6, 6, 5, 7, 7]``
+- ``end3s``: ``[8, 6, 7, 8, 7]``
+- ``muts``::
+
+ {1: {},
+ 2: {16: [3]},
+ 3: {128: [1]},
+ 4: {},
+ 5: {3: [4], 128: [1, 3]},
+ 6: {3: [4]},
+ 7: {64: [0]},
+ 8: {}}
+
+ Note that the numbers are shown here for visual simplicity as ``list``
+ objects, but would really be ``numpy.ndarray`` objects.
diff --git a/docs/_sources/formats/data/fastq.rst.txt b/docs/_sources/formats/data/fastq.rst.txt
index b91570d4..d81c6b4a 100644
--- a/docs/_sources/formats/data/fastq.rst.txt
+++ b/docs/_sources/formats/data/fastq.rst.txt
@@ -33,10 +33,10 @@ file was called correctly during sequencing.
The probability *p* that a base was called incorrectly is 10 raised to
the power of the quotient of the Phred score *s* and -10:
-*p* = 10 :sup:`-s/10`
+*p* = 10\ :sup:`-s/10`
For example, if a base call has a Phred score of 30, the probability
-that the base call is incorrect is 10 :sup:`-30/10` = 0.001.
+that the base call is incorrect is 10\ :sup:`-30/10` = 0.001.
In FASTQ files, each phred quality score (a non-negative integer) is
encoded as one character of text by adding another integer *N* to the
diff --git a/docs/_sources/manuals/index.rst.txt b/docs/_sources/manuals/index.rst.txt
index 6103830c..f684acf2 100644
--- a/docs/_sources/manuals/index.rst.txt
+++ b/docs/_sources/manuals/index.rst.txt
@@ -18,4 +18,5 @@ For a complete list of commands and their parameters, see :doc:`../cli`.
workflow
faclean
inputs
+ parallel
logging
diff --git a/docs/_sources/manuals/parallel.rst.txt b/docs/_sources/manuals/parallel.rst.txt
new file mode 100644
index 00000000..3e398788
--- /dev/null
+++ b/docs/_sources/manuals/parallel.rst.txt
@@ -0,0 +1,18 @@
+
+Parallelizing Tasks
+========================================================================
+
+
+.. _batches:
+
+Batches and Benefits
+------------------------------------------------------------------------
+
+SEISMIC-RNA can divide up data processing into multiple batches.
+The benefits of batching depend on whether the batches are processed in
+parallel or in series:
+
+- In parallel, multiple batches can be processed simultaneously, which
+ speeds up processing the entire dataset.
+- In series, only the data for one batch must be stored in memory at one
+ time, which reduces the memory requirement.
diff --git a/docs/_sources/manuals/workflow.rst.txt b/docs/_sources/manuals/workflow.rst.txt
index 144f6275..008e84e8 100644
--- a/docs/_sources/manuals/workflow.rst.txt
+++ b/docs/_sources/manuals/workflow.rst.txt
@@ -10,8 +10,8 @@ There are two points from which you can begin the main workflow:
or CRAM format), then begin at the step :ref:`wf_relate`.
.. note::
- You can pass either (or both) types of input files into the command
- ``seismic all``, which runs the entire workflow.
+ The command ``seismic all`` accepts both types of inputs and runs
+ the entire workflow.
See :ref:`wf_all` for more details.
@@ -175,10 +175,10 @@ There are three ways to align multiple FASTQ files (or pairs thereof):
Thus, the given directory can have deeply nested subdirectories, and
SEISMIC-RNA will still find and process any FASTQ files within them.
-Align: Options
+Align: Settings
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-Align option: Phred score encoding
+Align setting: Phred score encoding
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
SEISMIC-RNA defaults to using Phred+33 encoding for FASTQ files, which
@@ -204,7 +204,7 @@ in the "Basic Statisics" section:
- Otherwise, you will need to search elsewhere for your encoding scheme
to determine the Phred score offset.
-Align option: Quality assessment with FastQC
+Align setting: Quality assessment with FastQC
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
By default, each FASTQ file is processed with `FastQC`_, both before and
@@ -213,7 +213,7 @@ FastQC can be disabled with the flag ``--no-fastqc``.
To enable automatic extraction of the zipped output files from FastQC,
add the flag ``--qc-extract``.
-Align option: Trimming reads with Cutadapt
+Align setting: Trimming reads with Cutadapt
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
By default, each FASTQ file and pair of mated FASTQ files is trimmed for
@@ -246,7 +246,7 @@ How to trim low-quality base calls
Base calls on either end of a read that fall below a minimum Phred score
quality are trimmed with Cutadapt.
The default minimum quality is 25, which corresponds to a probability of
-1 - 10 :sup:`-2.5` = 0.997 that the base call is correct.
+1 - 10\ :sup:`-2.5` = 0.997 that the base call is correct.
(See :ref:`phred_encodings` for more details).
To change the quality threshold, use the option ``--min-phred``.
@@ -276,7 +276,7 @@ your FASTQ files outside of SEISMIC-RNA, then perform alignment within
SEISMIC-RNA, using the option ``--no-cut`` to disable additional adapter
trimming.
-Align option: Mapping reads with Bowtie 2
+Align setting: Mapping reads with Bowtie 2
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
How to pre-build a Bowtie 2 index (optional)
@@ -385,12 +385,26 @@ on setting this parameter.
the read aligns with a high score to exactly one location, low quality
if it aligns with similar scores to multiple locations in the reference.
The default minimum quality is 25, which corresponds to a confidence of
-1 - 10 :sup:`-2.5` = 0.997 that the read has aligned correctly.
+1 - 10\ :sup:`-2.5` = 0.997 that the read has aligned correctly.
To change the quality threshold, use the option ``--min-mapq``.
For those searching for this option in Bowtie 2, you will not find it.
Instead, reads with insufficient mapping quality are filtered out after
alignment using the `view command in Samtools`_.
+How to filter by number of aligned reads
+''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+
+In general, alignment maps containing very few reads are not useful for
+mutational profiling, due to their inherently low coverage per position.
+Worse, if aligning to a very large number of references (e.g. an entire
+transcriptome), most of the references would likely receive insufficient
+reads, so most of the (many) output XAM files would be useless clutter.
+
+To remedy this inconvenience, after alignment has finished, XAM files
+with fewer than a minimum number of reads are automatically deleted.
+The default is 1000, which can be set using the option ``--min-reads``.
+Setting ``--min-reads`` to 0 disables automatically deleting XAM files.
+
How to further customize alignment
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
@@ -402,10 +416,9 @@ If you require a more customized alignment workflow, you can align your
your FASTQ files outside of SEISMIC-RNA, then pass the resulting XAM
files into SEISMIC-RNA at the step :ref:`wf_relate`.
-
.. _bam_vs_cram:
-Align option: Format of alignment maps
+Align setting: Format of alignment maps
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
SEISMIC-RNA can output alignment map files in either BAM or CRAM format.
@@ -450,6 +463,10 @@ that accompanies them).
Align: output files
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+All output files, except FastQC reports, are written into the directory
+``{out}/{sample}/align``, where ``{out}`` is the output directory and
+``{sample}`` is the name of the sample.
+
Align output file: FastQC reports
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
@@ -478,11 +495,9 @@ each read aligned, as well as the Phred quality scores, mapping quality,
and mutated positions.
SEISMIC-RNA outputs alignment maps where every read aligns to the same
reference (although this is not a restriction outside of SEISMIC-RNA).
-Each alignment map is written to ``{out}/{sample}/align/{ref}.{xam}``,
-where ``{out}`` is the output directory (``--out-dir``), ``{sample}`` is
-the name of the sample from which the reads came, ``{ref}`` is the name
-of the reference to which the reads aligned, and ``{xam}`` is the file
-extension (depending on the selected format).
+Each alignment map is written to ``{ref}.{xam}``, where ``{ref}`` is the
+name of the reference to which the reads aligned, and ``{xam}`` is the
+file extension (depending on the selected format).
SEISMIC-RNA can output alignment maps in either BAM or CRAM format.
For a comparison of these formats, see :ref:`bam_vs_cram`.
@@ -509,7 +524,7 @@ Align output file: Unaligned reads
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
In addition to the alignment maps, SEISMIC-RNA outputs FASTQ file(s) of
-reads that Bowtie 2 could not align to ``{out}/{sample}/align``:
+reads that Bowtie 2 could not align:
- Each whole-sample FASTQ file of single-end (``-z``) or interleaved
(``-y``) reads yields one file: ``unaligned.fq.gz``
@@ -526,26 +541,12 @@ where ``{ref}`` is the reference for demultiplexed FASTQ files.
Outputting these files of unaligned reads can be disabled using the
option ``--bt2-no-un``.
-Align output file: Align report
+Align output file: Report
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-In addition to the alignment maps, SEISMIC-RNA outputs FASTQ file(s) of
-reads that Bowtie 2 could not align to ``{out}/{sample}/align``:
-
-- Each whole-sample FASTQ file of single-end (``-z``) or interleaved
- (``-y``) reads yields one file: ``unaligned.fq.gz``
-- Each pair of whole-sample FASTQ files of 1st and 2nd mates (``-x``)
- yields two files: ``unaligned.fq.1.gz`` and ``unaligned.fq.2.gz``
-- Each demultiplexed FASTQ file of single-end (``-Z``) or interleaved
- (``-Y``) reads yields one file: ``{ref}__unaligned.fq.gz``
-- Each pair of demultiplexed FASTQ files of 1st and 2nd mates (``-X``)
- yields two files:
- ``{ref}__unaligned.fq.1.gz`` and ``{ref}__unaligned.fq.2.gz``
-
-where ``{ref}`` is the reference for demultiplexed FASTQ files.
-
-Outputting these files of unaligned reads can be disabled using the
-option ``--bt2-no-un``.
+A report file, ``align-report.json``, is also written that records the
+settings used to run alignment and summarizes the results of alignment.
+See :doc:`../formats/report/align` for more information.
Align: Troubleshooting
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
@@ -587,14 +588,6 @@ DNA sequences (only A, C, G, T, and N) in FASTA format.
If needed, you may clean the FASTA file with the :doc:`./faclean` tool.
See :doc:`../formats/data/fasta` for details.
-.. note::
- The references in the FASTA file must match those to which the reads
- were aligned to produce the XAM file(s).
- A XAM file whose reference is absent from the FASTA will be skipped.
- A XAM file whose reference sequence in the FASTA differs from the
- sequence to which the XAM file was actually aligned can can yield
- erroneous relation vectors or fail to yield any output.
-
Relate input file: Alignment maps
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
@@ -602,13 +595,14 @@ Relate requires one or more alignment map files, each of which must be
in SAM, BAM, or CRAM format (collectively, "XAM" format).
See :doc:`../formats/data/xam` for details.
-Alignment map file requirements
-''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
-
-Refer to
-
-How to relate one or more alignment maps
-''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
+.. note::
+ The references in the FASTA file must match those to which the reads
+ in the alignment map were aligned.
+ Discrepancies can cause the ``relate`` command to fail or produce
+ erroneous relation vectors.
+ This problem will not occur if you use the same (unaltered) FASTA
+ file for both the ``align`` and ``relate`` commands, or run both
+ at once using the command ``seismic all``.
List every alignment map file after the FASTA file.
Refer to :doc:`./inputs` for details on how to list multiple files.
@@ -620,6 +614,108 @@ aligned to reference ``ref-1``, use the following command::
where ``{refs.fa}`` is the path to the file of reference sequences.
+Relate: Settings
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Relate options shared with alignment
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+Because this workflow can be started from the ``align`` or ``relate``
+commands, the latter duplicates some of the options of the former:
+``--phred-enc``, ``--min-mapq``, ``--min-reads``, and ``--out-dir`` have
+the same functions in ``relate`` and ``align``.
+
+Relate setting: Minimum Phred score
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+Base calls with Phred scores below ``--min-phred`` are labeled ambiguous
+matches or substitutions, as if they were ``N``\s.
+For example, if the minimum Phred score is 25 (the default) and a base
+``T`` is called as a match with a Phred score of 20, then it would be
+marked as possibly a match and possibly a subsitution to A, C, or G.
+See :ref:`relate_low_qual` for more information.
+
+Relate setting: Ambiguous insertions and deletions
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+The most tricky problem in computing relation vectors is that insertions
+and deletions ("indels") in repetitive regions cause ambiguities.
+SEISMIC-RNA introduces a new algorithm for identifying ambiguous indels
+(see :doc:`../algos/ambrel` for more information).
+This algorithm is enabled by default.
+If it is not necessary to identify ambiguous indels, then the algorithm
+can be disabled with ``--no-ambrel``, which will speed up ``relate`` at
+the cost of reducing its accuracy on indels.
+
+Relate setting: Batch size
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+For an explanation of batching and how to use it, see :ref:`batches`.
+
+The dataset is partitioned into batches by the ``relate`` command.
+The option ``--batch-size`` sets a target amount of data for each batch,
+in millions of base calls (megabases).
+This calculation considers the total number of relationships per read,
+which equals the length of the reference sequence.
+Thus, the number of base calls *B* is the product of the number of reads
+*N* and the length of the reference sequence *L*:
+
+*B* = *NL*
+
+Since *L* is known and ``--batch-size`` specifies a target size for *B*,
+*N* can be solved for:
+
+*N* = *B*/*L*
+
+SEISMIC-RNA will aim to put exactly *N* reads in each batch but the last
+(the last batch can be smaller because it has just the leftover reads).
+If the reads are single-ended or if alignment was not run in mixed mode,
+then every batch but the last will contain exactly *N* reads.
+If mixed mode was used, then batches may contain more than *N* reads, up
+to a maximum of 2\ *N* in the extreme case that every read in the batch
+belonged to a pair in which the other mate did not align.
+
+Relate: Output files
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+All output files go into the directory ``{out}/{sample}/relate/{ref}``,
+where ``{out}`` is the output directory, ``{sample}`` is the sample, and
+``{ref}`` is the name of the reference.
+
+Relate output file: Batch of relation vectors
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+The data of relationships is written in batches.
+Each batch contains a ``RelateBatchIO`` object and is saved to the file
+``relate-batch-{num}.brickle``, where ``{num}`` is the batch number.
+See :doc:`../data/relate/relate` for details on the data structure.
+See :doc:`../formats/data/brickle` for details on brickle files.
+
+Relate output file: Batch of read names
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+For each batch, the relate step assigns an index (a nonnegative integer)
+to each read and writes a file mapping the indexes to the read names.
+Each batch contains a ``QnamesBatchIO`` object and is saved to the file
+``qnames-batch-{num}.brickle``, where ``{num}`` is the batch number.
+See :doc:`../data/relate/qnames` for details on the data structure.
+See :doc:`../formats/data/brickle` for details on brickle files.
+
+Relate output file: Reference sequence
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+The relate step writes the reference sequence as a ``RefseqIO`` object
+to the file ``refseq.brickle``.
+See :doc:`../data/relate/refseq` for details on the data structure.
+See :doc:`../formats/data/brickle` for details on brickle files.
+
+Relate output file: Report
+""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
+
+A report file is written that records the settings used to generate
+relation vectors summarizes the results.
+See :doc:`../formats/report/relate` for more information.
+
.. _wf_all:
Run the entire workflow with one command
diff --git a/docs/_sources/writeme.rst.txt b/docs/_sources/writeme.rst.txt
index db100b72..9d55f67c 100644
--- a/docs/_sources/writeme.rst.txt
+++ b/docs/_sources/writeme.rst.txt
@@ -4,11 +4,11 @@ How to Write this Manual
.. figure:: https://imgs.xkcd.com/comics/manuals.png
:align: center
- :alt: A comic strip from xkcd joking that the most unhelpful tool is
+ :alt: A comic strip of xkcd joking that the most unhelpful tool is
one whose manual starts with "How to read this manual".
Image credit: "Manuals" from xkcd by Randall Munroe (https://xkcd.com/1343/)
-Rather than begin with "How to read this manual" (hopefully, this manual
-is sufficiently self-evident in that regard), this manual concludes with
-guidelines for writing and editing future versions.
+Rather than begin with "How to read this manual" (hopefully, that task
+is sufficiently self-evident), this documentation ends with guidelines
+for writing future versions.
diff --git a/docs/algos/compseq.html b/docs/algos/compseq.html
new file mode 100644
index 00000000..fb47a808
--- /dev/null
+++ b/docs/algos/compseq.html
@@ -0,0 +1,376 @@
+
+
+
In a nucleic acid sequence, each base is represented by one character in
+ASCII code and hence occupies 1 byte (8 bits).
+But because each base has only 4 possibilities, each base requires only
+log2(4) = 2 bits, shown in the following table:
+
+
+
+
+
+
+
Base
+
2-Bit Binary Code
+
+
+
+
A
+
00
+
+
C
+
01
+
+
G
+
10
+
+
T/U
+
11
+
+
+
+
For efficient storage of (long) nucleic acid sequences, the following
+algorithm compresses 4 bases (= 8 bits/byte ÷ 2 bits/nt) in each byte,
+and then restores the original sequence when needed.
First, attributes of the nucleic acid sequence are recorded:
+
+
rna (bool): whether the sequence is RNA
+
length (int): number of bases in the sequence
+
ns (tuple[int,...]): positions of N bases, if any
+
+
The type and the length are scalar values with constant sizes regardless
+of the length of the sequence.
+The 0-indexed positions of ambiguous bases is an array that at worst (if
+all bases are N) scales linearly with the length of the sequence and
+at best (if no bases are N) is a constant (small) size.
+Because most reference sequences contain no or very few bases that are
+N, recording the N positions generally requires little space.
+
Then, each non-overlapping segment of four bases is encoded as one byte
+by concatenating the 2-bit codes (above) of the bases in the segment, in
+reverse order.
+Ambiguous (N) bases are arbitrarily encoded as 00.
+Because this step requires the length of the sequence to be a multiple
+of 4, the sequence is padded on its 3’ side with A (an arbitrary
+choice) until its length is a multiple of 4.
Given the DNA sequence CAGNTTCGAN, the attributes are extracted:
+
+
rna: False
+
length: 10
+
ns: (3,9) (note: these positions are 0-indexed)
+
+
The sequence is then padded with A at the 3’ end until its length
+becomes a multiple of 4 (in this case, length 12):
+
CAGNTTCGANAA
+
Then, each 4-base segment is transformed into one byte by encoding each
+base and concatenating the codes in order from code 4 to code 1:
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
Number
+
Sequence
+
Base 1
+
Base 2
+
Base 3
+
Base 4
+
Code 4
+
Code 3
+
Code 2
+
Code 1
+
Byte
+
+
+
+
1
+
CAGN
+
C
+
A
+
G
+
N
+
00
+
10
+
00
+
01
+
00100001
+
+
2
+
TTCG
+
T
+
T
+
C
+
G
+
10
+
01
+
11
+
11
+
10011111
+
+
3
+
ANAA
+
A
+
N
+
A
+
A
+
00
+
00
+
00
+
00
+
00000000
+
+
+
+
Thus, the compressed byte string is [00100001,10011111,00000000].
+Note that this string is only 3 bytes, compared to 10 for the original.
+As a bytes object, the representation is b'!\x9f\x00'.
Beginning with a compressed sequence of bytes, each byte is transformed
+into four bases by decoding each 2-bit chunk to a base using the above
+table, then reversing the order of the four bases.
+The code 11 is decoded to U if the rna attribute is True
+and to T if False.
+The sequence at this point must have a number of bases divisible by 4.
+It is cut to the correct number of bases using the length attribute.
+Finally, every position in the attribute ns is masked to N.