Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

0.9.5 #7

Merged
merged 11 commits into from
Nov 17, 2023
Binary file added docs/.doctrees/algos/compseq.doctree
Binary file not shown.
Binary file modified docs/.doctrees/algos/index.doctree
Binary file not shown.
Binary file modified docs/.doctrees/api/seismicrna.clust.doctree
Binary file not shown.
Binary file modified docs/.doctrees/api/seismicrna.core.batch.doctree
Binary file not shown.
Binary file modified docs/.doctrees/api/seismicrna.core.doctree
Binary file not shown.
Binary file modified docs/.doctrees/api/seismicrna.core.seq.doctree
Binary file not shown.
Binary file modified docs/.doctrees/api/seismicrna.core.seq.tests.doctree
Binary file not shown.
Binary file modified docs/.doctrees/api/seismicrna.table.doctree
Binary file not shown.
Binary file modified docs/.doctrees/api/seismicrna.workflow.doctree
Binary file not shown.
Binary file modified docs/.doctrees/cli.doctree
Binary file not shown.
Binary file modified docs/.doctrees/data/index.doctree
Binary file not shown.
Binary file removed docs/.doctrees/data/relate.doctree
Binary file not shown.
Binary file added docs/.doctrees/data/relate/codes.doctree
Binary file not shown.
Binary file added docs/.doctrees/data/relate/index.doctree
Binary file not shown.
Binary file added docs/.doctrees/data/relate/qnames.doctree
Binary file not shown.
Binary file added docs/.doctrees/data/relate/refseq.doctree
Binary file not shown.
Binary file added docs/.doctrees/data/relate/relate.doctree
Binary file not shown.
Binary file modified docs/.doctrees/environment.pickle
Binary file not shown.
Binary file modified docs/.doctrees/formats/data/fastq.doctree
Binary file not shown.
Binary file modified docs/.doctrees/manuals/index.doctree
Binary file not shown.
Binary file added docs/.doctrees/manuals/parallel.doctree
Binary file not shown.
Binary file modified docs/.doctrees/manuals/workflow.doctree
Binary file not shown.
Binary file modified docs/.doctrees/writeme.doctree
Binary file not shown.
122 changes: 122 additions & 0 deletions docs/_sources/algos/compseq.rst.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,122 @@

Algorithm for Sequence Compression/Decompression
========================================================================

Background on Sequence Compression/Decompression
------------------------------------------------------------------------

In a nucleic acid sequence, each base is represented by one character in
`ASCII code`_ and hence occupies 1 byte (8 bits).
But because each base has only 4 possibilities, each base requires only
log\ :sub:`2`\ (4) = 2 bits, shown in the following table:

==== =================
Base 2-Bit Binary Code
==== =================
A ``00``
C ``01``
G ``10``
T/U ``11``
==== =================

For efficient storage of (long) nucleic acid sequences, the following
algorithm compresses 4 bases (= 8 bits/byte ÷ 2 bits/nt) in each byte,
and then restores the original sequence when needed.

Algorithm for Sequence Compression
------------------------------------------------------------------------

Algorithm for Sequence Compression: Procedure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First, attributes of the nucleic acid sequence are recorded:

- rna (``bool``): whether the sequence is RNA
- length (``int``): number of bases in the sequence
- ns (``tuple[int, ...]``): positions of ``N`` bases, if any

The type and the length are scalar values with constant sizes regardless
of the length of the sequence.
The 0-indexed positions of ambiguous bases is an array that at worst (if
all bases are ``N``) scales linearly with the length of the sequence and
at best (if no bases are ``N``) is a constant (small) size.
Because most reference sequences contain no or very few bases that are
``N``, recording the ``N`` positions generally requires little space.

Then, each non-overlapping segment of four bases is encoded as one byte
by concatenating the 2-bit codes (above) of the bases in the segment, in
reverse order.
Ambiguous (``N``) bases are arbitrarily encoded as ``00``.
Because this step requires the length of the sequence to be a multiple
of 4, the sequence is padded on its 3' side with ``A`` (an arbitrary
choice) until its length is a multiple of 4.

Algorithm for Sequence Compression: Example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Given the DNA sequence ``CAGNTTCGAN``, the attributes are extracted:

- rna: ``False``
- length: ``10``
- ns: ``(3, 9)`` (note: these positions are 0-indexed)

The sequence is then padded with ``A`` at the 3' end until its length
becomes a multiple of 4 (in this case, length 12):

``CAGNTTCGANAA``

Then, each 4-base segment is transformed into one byte by encoding each
base and concatenating the codes in order from code 4 to code 1:

====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
Number Sequence Base 1 Base 2 Base 3 Base 4 Code 4 Code 3 Code 2 Code 1 Byte
====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
1 CAGN C A G N 00 10 00 01 00100001
2 TTCG T T C G 10 01 11 11 10011111
3 ANAA A N A A 00 00 00 00 00000000
====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========

Thus, the compressed byte string is ``[00100001, 10011111, 00000000]``.
Note that this string is only 3 bytes, compared to 10 for the original.
As a ``bytes`` object, the representation is ``b'!\x9f\x00'``.

Algorithm for Sequence Decompression
------------------------------------------------------------------------

Algorithm for Sequence Decompression: Procedure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Beginning with a compressed sequence of bytes, each byte is transformed
into four bases by decoding each 2-bit chunk to a base using the above
table, then reversing the order of the four bases.
The code ``11`` is decoded to ``U`` if the `rna` attribute is ``True``
and to ``T`` if ``False``.
The sequence at this point must have a number of bases divisible by 4.
It is cut to the correct number of bases using the `length` attribute.
Finally, every position in the attribute `ns` is masked to ``N``.

Algorithm for Sequence Decompression: Example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Suppose that a compressed sequence has the following attributes:

- compressed byte string: ``[00100001, 10011111, 00000000]``
- rna: ``False``
- length: ``10``
- ns: ``(3, 9)``

To decompress the sequence, each byte is split into four 2-bit segments,
decoded, reversed, and reassembled:

====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
Number Byte Code 1 Code 2 Code 3 Code 4 Base 4 Base 3 Base 2 Base 1 Sequence
====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========
1 00100001 00 10 00 01 C A G A CAGA
2 10011111 10 01 11 11 T T C G TTCG
3 00000000 00 00 00 00 A A A A AAAA
====== ======== ====== ====== ====== ====== ====== ====== ====== ====== ========

The resulting sequence, ``CAGATTCGAAAA``, is trimmed to 10 nt: ``CAGATTCGAA``.
Finally, (0-indexed) positions 3 and 9 are replaced with ``N``: ``CAGNTTCGAN``.

.. _`ASCII code`: https://en.wikipedia.org/wiki/ASCII
5 changes: 5 additions & 0 deletions docs/_sources/algos/index.rst.txt
Original file line number Diff line number Diff line change
@@ -1,3 +1,8 @@
************************************************************************
Algorithms
************************************************************************

.. toctree::
:maxdepth: 2

compseq
2 changes: 1 addition & 1 deletion docs/_sources/data/index.rst.txt
Original file line number Diff line number Diff line change
Expand Up @@ -6,4 +6,4 @@ Data Structures
.. toctree::
:maxdepth: 2

relate
relate/index
Original file line number Diff line number Diff line change
Expand Up @@ -61,7 +61,9 @@ for some bases to define the relationship:
- low-quality base calls
- ambiguous insertions and deletions

Low-quality base calls
.. _relate_low_qual:

Encoding low-quality base calls
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

A low-quality base call is defined as having a `Phred quality score`_
Expand Down Expand Up @@ -115,7 +117,7 @@ four columns "Read: A/C/G/T?".
than once towards the total number of matches or mutations. To learn
how mutations in relation vectors are counted, see [REF].

Ambiguous insertions and deletions
Encoding ambiguous insertions and deletions
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Insertions and deletions (collectively, "indels") in the read can cause
Expand Down Expand Up @@ -183,7 +185,7 @@ is covered in one mate but not in the other, one mate's Phred score is
above the threshold and the other's is below, or (more rarely) the base
calls themselves differ.

Consensus relationships
Encoding consensus relationships
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

When finding the consensus of two mates, information in one mate should
Expand Down Expand Up @@ -219,7 +221,7 @@ position 4 is a match, but it is low quality at position 5, so even the
consensus byte is ambiguous. Neither mate covers position 6, so the
consensus byte is blank.

Irreconcilable relationships
Encoding irreconcilable relationships
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

It is possible, although rare, for mates 1 and 2 to share no bits. For
Expand Down
11 changes: 11 additions & 0 deletions docs/_sources/data/relate/index.rst.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@

Relate Data Structures
========================================================================

.. toctree::
:maxdepth: 2

codes
relate
qnames
refseq
39 changes: 39 additions & 0 deletions docs/_sources/data/relate/qnames.rst.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@

Read Names Batch
------------------------------------------------------------------------

Each batch of relation vectors is a ``QnamesBatchIO`` object.

Read names batch: Structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The read names are in the attribute ``names``, a ``numpy.ndarray`` with
data type ``str``.
The name of the read with number ``i`` (see :ref:`relate_read_nums` for
more information) is at index ``i`` of the ``names`` array.

Read names batch: Example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Suppose a XAM file contains five reads:

=========== ===========================================
Read Number Read Name
=========== ===========================================
0 ``VL00355:66:AACYHM7M5:1:2106:54095:30496``
1 ``VL00355:66:AACYHM7M5:1:2303:34554:55259``
2 ``VL00355:66:AACYHM7M5:1:2411:30558:38352``
3 ``VL00355:66:AACYHM7M5:1:1206:41825:52949``
4 ``VL00355:66:AACYHM7M5:1:2506:10903:12605``
=========== ===========================================

The relate step would write a batch with the ``names`` attribute ::

["VL00355:66:AACYHM7M5:1:2106:54095:30496",
"VL00355:66:AACYHM7M5:1:2303:34554:55259",
"VL00355:66:AACYHM7M5:1:2411:30558:38352",
"VL00355:66:AACYHM7M5:1:1206:41825:52949",
"VL00355:66:AACYHM7M5:1:2506:10903:12605"]

Note that the attribute is shown as a ``list`` for visual simplicity,
but would really be a ``numpy.ndarray``.
31 changes: 31 additions & 0 deletions docs/_sources/data/relate/refseq.rst.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@

Reference Sequence
------------------------------------------------------------------------

The reference sequence is saved as a ``RefseqIO`` object.

Reference sequence: Structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

A ``RefseqIO`` object stores a reference sequence as a space-efficient
``CompressedSeq`` object in the private attribute ``_s``.
(The original sequence is available via the cached property ``refseq``.)
A ``CompressedSeq`` object has the following attributes:

- ``b`` (``bytes``): sequence of bytes, each byte encoding 4 nucleotides
- ``s`` (``int``): length of the sequence (number of nucleotides)
- ``n`` (``tuple[int, ...]``): 0-indexed position of each ``N``
- ``r`` (``bool``): ``True`` if the sequence is RNA, ``False`` if DNA

Reference sequence: Example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Suppose that a DNA sequence is ``CAGNTTCGAN``.
This sequence would be compressed into ::

b = b'!\x9f\x00'
s = 10
n = (3, 9)
r = False

For details on the algorithm, see :doc:`../../algos/compseq`.
125 changes: 125 additions & 0 deletions docs/_sources/data/relate/relate.rst.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@

Relate Batch
------------------------------------------------------------------------

Each batch of relation vectors is a ``RelateBatchIO`` object.

Relate batch: Structure
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The following attributes encode the relationships between each read and
each position in the reference sequence:

========= ============================================ ====================================================================
Attribute Data Type Description
========= ============================================ ====================================================================
``end5s`` ``numpy.ndarray[int]`` array of the first position of the most upstream mate in each read
``mid5s`` ``numpy.ndarray[int]`` array of the first position of the most downstream mate in each read
``mid3s`` ``numpy.ndarray[int]`` array of the last position of the most upstream mate in each read
``end3s`` ``numpy.ndarray[int]`` array of the last position of the most downstream mate in each read
``muts`` ``dict[int, dict[int, numpy.ndarray[int]]]`` array of the reads with each type of mutation at each position
========= ============================================ ====================================================================

.. note::
The positions of the first and last bases in the reference sequence
are defined to be 1 and the length of the sequence, respectively.

.. _relate_read_nums:

Relate batch: Structure of read numbers
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Each read, or pair of paired-end reads, is labeled with a non-negative
integer: 0 for the first read in each batch, and incrementing by 1 for
each subsequent read.
Within one batch, all read numbers are unique.
However, two different batches can have reads that share numbers.

Relate batch: Structure of 5' and 3' end positions
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

- ``end5s``, ``mid5s``, ``mid3s``, and ``end3s`` are all 1-dimensional
``numpy.ndarray`` objects.
- For any relate batch, ``end5s``, ``mid5s``, ``mid3s``, and ``end3s``
all have the same length (which may be any integer ≥ 0).
- A read with index ``i`` corresponds to the ``i``\ th values of
``end5s``, ``mid5s``, ``mid3s``, and ``end3s``; denoted (respectively)
``end5s[i]``, ``mid5s[i]``, ``mid3s[i]``, and ``end3s[i]``.
- For every read ``i``:

- 1 ≤ ``end5s[i]`` ≤ ``end3s[i]`` ≤ length of reference sequence
- If paired-end and there is a gap of ≥ 1 nt between mates 1 and 2:

- ``end5s[i]`` ≤ ``mid3s[i]`` < ``mid5s[i]`` ≤ ``end3s[i]``

- Otherwise:

- ``end5s[i]`` = ``mid5s[i]`` ≤ ``mid3s[i]`` = ``end3s[i]``

Relate batch: Structure of mutations
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

``muts`` is a ``dict`` wherein

- each key is a position in the reference sequence (``int``)
- each value is a ``dict`` wherein

- each key is a type of mutation (``int``, see :doc:`./codes` for more
information)
- each value is an array of the numbers of the reads that have the
given type of mutation at the given position (``numpy.ndarray``)

Relate batch: Example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For example, suppose that the reference sequence is ``TCAGAACC`` and a
batch contains five paired-end reads, numbered 0 to 4:

==== ==== ============
Read Mate Alignment
==== ==== ============
0 1 ``_CAG____``
0 2 ``_____AGC``
1 1 ``___GTA__``
1 2 ``TCT_____``
2 1 ``____AAC_``
2 2 ``_CA_____``
3 1 ``TAAGT___``
3 2 ``______CC``
4 1 ``__AGA___``
4 2 ``___GA-C_``
Ref ``TCAGAACC``
==== ==== ============

The positions, reads, and relationships can be shown explicitly as a
matrix (see :doc:`./codes` for information on the relationship codes):

==== === === === === === === === ===
Read 1 2 3 4 5 6 7 8
==== === === === === === === === ===
0 255 1 1 1 255 1 64 1
1 1 1 128 1 128 1 255 255
2 255 1 1 255 1 1 1 255
3 1 16 1 1 128 255 1 1
4 255 255 1 1 3 3 1 255
==== === === === === === === === ===

In a relate batch, they would be encoded as follows:

- ``end5s``: ``[2, 1, 2, 1, 3]``
- ``mid5s``: ``[4, 1, 3, 5, 3]``
- ``mid3s``: ``[6, 6, 5, 7, 7]``
- ``end3s``: ``[8, 6, 7, 8, 7]``
- ``muts``::

{1: {},
2: {16: [3]},
3: {128: [1]},
4: {},
5: {3: [4], 128: [1, 3]},
6: {3: [4]},
7: {64: [0]},
8: {}}

Note that the numbers are shown here for visual simplicity as ``list``
objects, but would really be ``numpy.ndarray`` objects.
4 changes: 2 additions & 2 deletions docs/_sources/formats/data/fastq.rst.txt
Original file line number Diff line number Diff line change
Expand Up @@ -33,10 +33,10 @@ file was called correctly during sequencing.
The probability *p* that a base was called incorrectly is 10 raised to
the power of the quotient of the Phred score *s* and -10:

*p* = 10 :sup:`-s/10`
*p* = 10\ :sup:`-s/10`

For example, if a base call has a Phred score of 30, the probability
that the base call is incorrect is 10 :sup:`-30/10` = 0.001.
that the base call is incorrect is 10\ :sup:`-30/10` = 0.001.

In FASTQ files, each phred quality score (a non-negative integer) is
encoded as one character of text by adding another integer *N* to the
Expand Down
Loading