Merge pull request #26 from COMBINE-lab/dev
feat: speed up EM
rob-p authored Oct 10, 2023
2 parents 272ad05 + 878cf72 commit 2ad45bf
Showing 6 changed files with 3,586 additions and 35 deletions.
3,480 changes: 3,480 additions & 0 deletions docs/source/_static/simple_de_example.html

7 changes: 0 additions & 7 deletions docs/source/api.rst

This file was deleted.

2 changes: 2 additions & 0 deletions docs/source/conf.py
@@ -29,6 +29,8 @@

templates_path = ['_templates']

html_static_path = ['_static']

# -- Options for HTML output

html_theme = 'furo'
2 changes: 0 additions & 2 deletions docs/source/formats.rst
@@ -66,5 +66,3 @@ like to have provided, please let us know.

.. autosummary::
:toctree: generated

piscem-infer
68 changes: 54 additions & 14 deletions docs/source/usage.rst
@@ -3,8 +3,8 @@ Usage

.. _installation:

Installation
------------
Installation of ``piscem-infer``
--------------------------------

To install piscem-infer, use `cargo <https://github.com/rust-lang/cargo>`_:

@@ -24,6 +24,15 @@ or install it from source
An example run
--------------

Here, we demonstrate how to process some data that can then be used for a simple differential
expression analysis. We'll keep all of our analysis in a single directory that we'll create now.

.. code-block:: console

   $ mkdir piscem_tutorial
   $ cd piscem_tutorial

Obtaining the reference
~~~~~~~~~~~~~~~~~~~~~~~

@@ -37,7 +46,6 @@ of the human transcriptome.
Building the index
~~~~~~~~~~~~~~~~~~


Next, we'll build the index. You'll only have to do this once (or whenever you want to update the annotation you're using). To
build the index and map the reads, we'll need ``piscem``. You can either build it from source according to the instructions
on the `GitHub page <https://github.com/COMBINE-lab/piscem>`_, or you can install it from ``bioconda`` using ``conda install piscem``.
@@ -54,8 +62,11 @@ To obtain some sample read data, we'll use the excellent |fastqdl|_ tool that you can install
via either ``pip`` or bioconda (through ``conda`` or ``mamba``).

.. code-block:: console

   $ accessions=(SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521)
   $ for sra in ${accessions[@]}; do fastq-dl -a $sra; done
   $ fastq-dl -a SRR1039508

This will retrieve 8 accessions (16 files in total since each sample is a paired-end sequencing run).

Mapping the reads
~~~~~~~~~~~~~~~~~
@@ -65,34 +76,63 @@ a description of all the options):

.. code-block:: console

   $ piscem map-bulk -t 16 -i gencode_v44_idx -1 SRR1039508_1.fastq.gz -2 SRR1039508_2.fastq.gz -o SRR1039508_mapped
   $ mkdir mappings
   $ for acc in ${accessions[@]}; do
        piscem map-bulk -t 16 -i gencode_v44_idx -1 ${acc}_1.fastq.gz -2 ${acc}_2.fastq.gz -o mappings/${acc}
     done

Quantification with ``piscem-infer``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now that we've mapped the reads to produce a bulk-oriented ``RAD`` file, we're ready to quantify with ``piscem-infer``!
There exist other options we can pass (e.g. we can perform bootstrap sampling to produce inferential replicates, in which
case we can also request the use of multiple threads, but here we just invoke the most basic quantification process).
Here, in addition to performing the basic quantification, we will be creating inferential replicates (i.e. bootstrap
samples) for each sample we quantify. This is designated by the ``--num-bootstraps`` parameter. To perform the
bootstrapping in parallel, we'll make use of multiple threads (``--num-threads 16``).

.. code-block:: console

   $ piscem-infer quant -i SRR1039508_map -l IU -o quant/SRR1039508
   $ for acc in ${accessions[@]}; do
        piscem-infer quant --num-bootstraps 16 --num-threads 16 -i mappings/${acc} -l IU -o quant/${acc}
     done

Note that we pass to the ``-o`` flag a file *stem* prefixed with a path (in this case ``quant``). This is because ``piscem-infer``
will produce several output files. All of them will share the same *stem*. If we pass a stem that is prefixed with some path
(e.g. a directory) then this directory will be created if it doesn't exist. We also let ``piscem-infer`` know the library type
(i.e. how we expect the reads to map), where ``piscem-infer`` uses `salmon's library type specification <https://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype>`_.
Here we expect the library to be unstranded and the paired-end reads to map "inward" (i.e. facing each other).

If we look at the files generated with the stem corresponding to, say, the second sample (``SRR1039509``), we
see the following:

.. code-block:: console

   $ ls -la quant/${accessions[1]}*
   .rw-rw-r--@ 3.1k rob 5 Oct 15:12 quant/SRR1039509.fld.pq
   .rw-rw-r--@ 12M rob 5 Oct 15:15 quant/SRR1039509.infreps.pq
   .rw-rw-r--@ 1.1k rob 5 Oct 15:15 quant/SRR1039509.meta_info.json
   .rw-rw-r--@ 36M rob 5 Oct 15:12 quant/SRR1039509.quant

The file ``SRR1039509.quant`` contains the quantification estimates, and is very similar in format to, e.g., a ``salmon`` ``quant.sf`` file. The file format for the quantification result, as well as that of other outputs, is described in the :ref:`format section of this documentation<Quantification output>`. The file ``SRR1039509.meta_info.json`` contains
information about the quantification run. The files ``SRR1039509.fld.pq`` and ``SRR1039509.infreps.pq`` are `Apache Parquet <https://parquet.apache.org/>`_ format files and contain, respectively, information about the inferred fragment length distribution of the sample and the inferential replicates that we requested to be computed.
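If you'd like to take a quick look at these Parquet outputs outside of the downstream R workflow, a minimal Python sketch using ``pandas`` is shown below. This is only an illustration (not part of the tool itself), and it deliberately inspects just shapes and column names rather than assuming a particular schema.

.. code-block:: python

   # Minimal sketch: inspect the Parquet outputs produced by piscem-infer.
   # Requires pandas plus a Parquet engine such as pyarrow to be installed.
   import pandas as pd

   sample = "quant/SRR1039509"

   # Fragment length distribution inferred for this sample.
   fld = pd.read_parquet(f"{sample}.fld.pq")
   print(fld.shape, list(fld.columns))

   # Bootstrap (inferential) replicates requested via --num-bootstraps.
   infreps = pd.read_parquet(f"{sample}.infreps.pq")
   print(infreps.shape, list(infreps.columns))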


Subsequent differential analysis using ``tximport`` and ``DESeq2``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Next we'll show how to perform differential analysis (at the gene level) with the quantification
estimates we just computed using ``tximport`` and ``DESeq2``. First, we'll need
*just a little bit more information*. We'll need a file containing the run information about these
samples (which includes, e.g. the metadata about how they were treated), and a file containing the
transcript-to-gene mapping. To make this tutorial easier to follow, these can be obtained directly
using the following commands (we'll download them into our current working directory, where we will
also perform our differential analysis).

.. code-block:: console

   $ ls -la quant/
   .rw-rw-r--@ 3.1k rob 30 Sep 12:33 SRR1039508.fld.pq
   .rw-rw-r--@ 628 rob 30 Sep 12:33 SRR1039508.meta_info.json
   .rw-rw-r--@ 33M rob 30 Sep 12:33 SRR1039508.quant
   $ wget -O SraRunTable.txt -r --no-check-certificate 'https://drive.google.com/uc?export=download&id=1Qt93SG0rAI-GJ9LCmyl1gqM-9T5JlJBx'
   $ wget -O t2g.csv -r --no-check-certificate 'https://drive.google.com/uc?export=download&id=1fUpx-0HHI8msRaZm2UUKf-d5lDD0gYXZ'

The file ``SRR1039508.quant`` contains the quantification estimates, and is of a very similar format to e.g. a ``salmon`` ("quant.sf") format file. The file format for the quantification result, as well as that of other outputs, is described in the :ref:`format section of this documentation<Quantification output>`.
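Before running the full ``tximport``/``DESeq2`` workflow linked below, it may help to see roughly what the transcript-to-gene summarization step amounts to. The following Python sketch is only an illustration and makes two assumptions you should verify against your files: that each ``.quant`` file is a tab-separated table with ``Name`` and ``NumReads`` columns (as in ``salmon``'s ``quant.sf``), and that ``t2g.csv`` lists transcript IDs and gene IDs in its first two columns. The real analysis should use ``tximport``, which also applies effective-length offsets that this naive sum ignores.

.. code-block:: python

   # Rough, hypothetical illustration of transcript-to-gene count summarization.
   # Assumes salmon-like .quant columns ("Name", "NumReads") and a headerless,
   # two-column t2g.csv; verify both before using this beyond a sanity check.
   import pandas as pd

   accessions = ["SRR1039508", "SRR1039509", "SRR1039512", "SRR1039513",
                 "SRR1039516", "SRR1039517", "SRR1039520", "SRR1039521"]

   t2g = pd.read_csv("t2g.csv", header=None, names=["transcript", "gene"])

   gene_counts = {}
   for acc in accessions:
       quant = pd.read_csv(f"quant/{acc}.quant", sep="\t")
       merged = quant.merge(t2g, left_on="Name", right_on="transcript", how="inner")
       # Sum estimated transcript-level read counts within each gene.
       gene_counts[acc] = merged.groupby("gene")["NumReads"].sum()

   # DESeq2 expects integer counts, so round the summed estimates.
   count_matrix = pd.DataFrame(gene_counts).fillna(0.0).round().astype(int)
   print(count_matrix.shape)
   print(count_matrix.head())

The resulting genes-by-samples ``count_matrix`` is the kind of object you would pair with the ``SraRunTable.txt`` metadata when constructing a ``DESeqDataSet``; the linked Quarto document does this properly via ``tximport``.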
Now, we're ready to perform our DE analysis. That part of the tutorial can be found
in this `Quarto document <_static/simple_de_example.html>`_.


.. |fastqdl| replace:: ``fastq-dl``
62 changes: 50 additions & 12 deletions src/utils/em.rs
@@ -125,6 +125,11 @@ impl<'a> Iterator for PackedEqLabelIter<'a> {
None
}
}

fn size_hint(&self) -> (usize, Option<usize>) {
let rem = self.underlying_packed_map.len() - self.counter as usize;
(rem, Some(rem))
}
}

pub struct PackedEqEntryIter<'a> {
@@ -148,6 +153,12 @@
}
}

impl<'a> ExactSizeIterator for PackedEqEntryIter<'a> {
fn len(&self) -> usize {
self.underlying_packed_map.len() - self.counter as usize
}
}

impl EqMap {
/// Create a new equivalence class map, if
/// `contains_ori` is true, the equivalence class
@@ -280,22 +291,27 @@ fn m_step(
eq_map: &PackedEqMap,
eq_counts: &[usize],
prev_count: &[f64],
eff_lens: &[f64],
inv_eff_lens: &[f64],
curr_counts: &mut [f64],
) {
let mut weights: Vec<f64> = Vec::with_capacity(64);

for (k, v) in eq_map.iter_labels().zip(eq_counts.iter()) {
let mut denom = 0.0_f64;
let count = *v as f64;
for target_id in k {
denom += prev_count[*target_id as usize] / eff_lens[*target_id as usize];
}

let mut denom = 0.0_f64;
for e in k.iter() {
let w = prev_count[*e as usize] * inv_eff_lens[*e as usize];
weights.push(w);
denom += w;
}
if denom > 1e-8 {
for target_id in k {
curr_counts[*target_id as usize] += count
* ((prev_count[*target_id as usize] / eff_lens[*target_id as usize]) / denom);
let count_over_denom = count / denom;
for (target_id, w) in k.iter().zip(weights.iter()) {
curr_counts[*target_id as usize] += count_over_denom * w;
}
}
weights.clear();
}
}
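For context, ``m_step`` computes the standard EM update over equivalence classes; the change above only restructures how that update is evaluated, not the estimate itself. Writing $c_j$ for the read count of equivalence class $j$, $T_j$ for its set of candidate targets, $\hat{\alpha}_t$ for the previous abundance of target $t$, and $\tilde{\ell}_t$ for its effective length, the update is

$$
\alpha_t' = \sum_{j \,:\, t \in T_j} c_j \, \frac{\hat{\alpha}_t / \tilde{\ell}_t}{\sum_{u \in T_j} \hat{\alpha}_u / \tilde{\ell}_u}.
$$

The speedup comes from precomputing the reciprocals $1/\tilde{\ell}_t$ once per run (``inv_eff_lens``) and caching each class's weights $\hat{\alpha}_u / \tilde{\ell}_u$ in a small reusable buffer, so the numerator reuses the values already computed for the denominator and the inner loops multiply instead of repeatedly dividing.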

@@ -311,6 +327,17 @@ pub fn do_bootstrap(em_info: &EMInfo, num_boot: usize) -> Vec<Vec<f64>> {
let eq_map = em_info.eq_map;
let max_iter = em_info.max_iter;
let eff_lens = em_info.eff_lens;
let inv_eff_lens = eff_lens
.iter()
.map(|x| {
let y = 1.0_f64 / *x;
if y.is_finite() {
y
} else {
0_f64
}
})
.collect::<Vec<f64>>();
let total_weight = em_info.eq_map.counts.iter().sum::<usize>();
// init
let avg = (total_weight as f64) / (eff_lens.len() as f64);
@@ -336,7 +363,7 @@
eq_map,
&base_counts,
&prev_counts,
eff_lens,
&inv_eff_lens,
&mut curr_counts,
);

@@ -367,7 +394,7 @@ pub fn do_bootstrap(em_info: &EMInfo, num_boot: usize) -> Vec<Vec<f64>> {
eq_map,
&base_counts,
&prev_counts,
eff_lens,
&inv_eff_lens,
&mut curr_counts,
);

@@ -379,6 +406,17 @@
pub fn em(em_info: &EMInfo) -> Vec<f64> {
let eq_map = em_info.eq_map;
let eff_lens = em_info.eff_lens;
let inv_eff_lens = eff_lens
.iter()
.map(|x| {
let y = 1.0_f64 / *x;
if y.is_finite() {
y
} else {
0_f64
}
})
.collect::<Vec<f64>>();
let max_iter = em_info.max_iter;
let total_weight: f64 = eq_map.counts.iter().sum::<usize>() as f64;

@@ -395,7 +433,7 @@ pub fn em(em_info: &EMInfo) -> Vec<f64> {
eq_map,
&eq_map.counts,
&prev_counts,
eff_lens,
&inv_eff_lens,
&mut curr_counts,
);

@@ -429,7 +467,7 @@ pub fn em(em_info: &EMInfo) -> Vec<f64> {
eq_map,
&eq_map.counts,
&prev_counts,
eff_lens,
&inv_eff_lens,
&mut curr_counts,
);

