Merge pull request #26 from COMBINE-lab/dev
feat: speed up EM
rob-p authored Oct 10, 2023
2 parents 272ad05 + 878cf72 commit 2ad45bf
Showing 6 changed files with 3,586 additions and 35 deletions.
3,480 changes: 3,480 additions & 0 deletions docs/source/_static/simple_de_example.html

7 changes: 0 additions & 7 deletions docs/source/api.rst

This file was deleted.

2 changes: 2 additions & 0 deletions docs/source/conf.py
@@ -29,6 +29,8 @@

templates_path = ['_templates']

html_static_path = ['_static']

# -- Options for HTML output

html_theme = 'furo'
2 changes: 0 additions & 2 deletions docs/source/formats.rst
@@ -66,5 +66,3 @@ like to have provided, please let us know.

.. autosummary::
:toctree: generated

piscem-infer
68 changes: 54 additions & 14 deletions docs/source/usage.rst
@@ -3,8 +3,8 @@ Usage

.. _installation:

Installation
------------
Installation of ``piscem-infer``
--------------------------------

To install piscem-infer, use `cargo <https://github.com/rust-lang/cargo>`_:

@@ -24,6 +24,15 @@ or install it from source
An example run
--------------

Here, we demonstrate how to process some data that can then be used for a simple differential
expression analysis. We'll keep all of our analysis in a single directory that we'll create now.

.. code-block:: console

   $ mkdir piscem_tutorial
   $ cd piscem_tutorial

Obtaining the reference
~~~~~~~~~~~~~~~~~~~~~~~

@@ -37,7 +46,6 @@ of the human transcriptome.
Building the index
~~~~~~~~~~~~~~~~~~


Next, we'll build the index. You'll only have to do this once (or whenever you want to update the annotation you're using). To
build the index and map the reads, we'll need ``piscem``. You can either build it from source according to the instructions
on the `GitHub page <https://github.com/COMBINE-lab/piscem>`_, or you can install it from ``bioconda`` using ``conda install piscem``.
@@ -54,8 +62,11 @@ To obtain some sample read data, we'll use the excellent |fastqdl|_ tool that you can install
via either ``pip`` or bioconda (through ``conda`` or ``mamba``).

.. code-block:: console

   $ accessions=(SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521)
   $ for sra in ${accessions[@]}; do fastq-dl -a $sra; done
   $ fastq-dl -a SRR1039508

This will retrieve 8 accessions (16 files in total since each sample is a paired-end sequencing run).

Mapping the reads
~~~~~~~~~~~~~~~~~
@@ -65,34 +76,63 @@ a description of all the options):

.. code-block:: console

   $ piscem map-bulk -t 16 -i gencode_v44_idx -1 SRR1039508_1.fastq.gz -2 SRR1039508_2.fastq.gz -o SRR1039508_mapped
   $ mkdir mappings
   $ for acc in ${accessions[@]}; do
        piscem map-bulk -t 16 -i gencode_v44_idx -1 ${acc}_1.fastq.gz -2 ${acc}_2.fastq.gz -o mappings/${acc}
     done

Quantification with ``piscem-infer``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now that we've mapped the reads to produce a bulk-oriented ``RAD`` file, we're ready to quantify with ``piscem-infer``!
There exist other options we can pass (e.g. we can perform bootstrap sampling to produce inferential replicates, in which
case we can also request the use of multiple threads, but here we just invoke the most basic quantification process).
Here, in addition to performing the basic quantification, we will be creating inferential replicates (i.e. bootstrap
samples) for each sample we quantify. This is designated by the ``--num-bootstraps`` parameter. To perform the
bootstrapping in parallel, we'll make use of multiple threads (``--num-threads 16``).

.. code-block:: console

   $ piscem-infer quant -i SRR1039508_map -l IU -o quant/SRR1039508
   $ for acc in ${accessions[@]}; do
        piscem-infer quant --num-bootstraps 16 --num-threads 16 -i mappings/${acc} -l IU -o quant/${acc}
     done

Note that we pass to the ``-o`` flag a file *stem* prefixed with a path (in this case ``quant``). This is because ``piscem-infer``
will produce several output files. All of them will share the same *stem*. If we pass a stem that is prefixed with some path
(e.g. a directory) then this directory will be created if it doesn't exist. We also let ``piscem-infer`` know the library type
(i.e. how we expect the reads to map), where ``piscem-infer`` uses `salmon's library type specification <https://salmon.readthedocs.io/en/latest/salmon.html#what-s-this-libtype>`_.
Here we expect the library to be unstranded and the paired-end reads to map "inward" (i.e. facing each other).

If we look at the files generated with the stem corresponding to, say, the second sample (``SRR1039509``), we
see the following:

.. code-block:: console

   $ ls -la quant/${accessions[1]}*
   .rw-rw-r--@ 3.1k rob 5 Oct 15:12 quant/SRR1039509.fld.pq
   .rw-rw-r--@ 12M rob 5 Oct 15:15 quant/SRR1039509.infreps.pq
   .rw-rw-r--@ 1.1k rob 5 Oct 15:15 quant/SRR1039509.meta_info.json
   .rw-rw-r--@ 36M rob 5 Oct 15:12 quant/SRR1039509.quant

The file ``SRR1039509.quant`` contains the quantification estimates, and is very similar in format to, e.g., a ``salmon`` ``quant.sf`` file. The file format for the quantification result, as well as that of other outputs, is described in the :ref:`format section of this documentation<Quantification output>`. The file ``SRR1039509.meta_info.json`` contains
information about the quantification run. The files ``SRR1039509.fld.pq`` and ``SRR1039509.infreps.pq`` are `Apache Parquet <https://parquet.apache.org/>`_ format files and contain, respectively, information about the inferred fragment length distribution of the sample and the inferential replicates that we requested to be computed.
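If you'd like to take a quick look at these Parquet outputs outside of the downstream R workflow, a minimal Python sketch using ``pandas`` is shown below. This is only an illustration (not part of the tool itself), and it deliberately inspects just shapes and column names rather than assuming a particular schema.

.. code-block:: python

   # Minimal sketch: inspect the Parquet outputs produced by piscem-infer.
   # Requires pandas plus a Parquet engine such as pyarrow to be installed.
   import pandas as pd

   sample = "quant/SRR1039509"

   # Fragment length distribution inferred for this sample.
   fld = pd.read_parquet(f"{sample}.fld.pq")
   print(fld.shape, list(fld.columns))

   # Bootstrap (inferential) replicates requested via --num-bootstraps.
   infreps = pd.read_parquet(f"{sample}.infreps.pq")
   print(infreps.shape, list(infreps.columns))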


Subsequent differential analysis using ``tximport`` and ``DESeq2``
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Next we'll show how to perform differential analysis (at the gene level) with the quantification
estimates we just computed using ``tximport`` and ``DESeq2``. First, we'll need
*just a little bit more information*. We'll need a file containing the run information about these
samples (which includes, e.g. the metadata about how they were treated), and a file containing the
transcript-to-gene mapping. To make this tutorial easier to follow, these can be obtained directly
using the following commands (we'll download them into our current working directory, where we will
also perform our differential analysis).

.. code-block:: console

   $ ls -la quant/
   .rw-rw-r--@ 3.1k rob 30 Sep 12:33 SRR1039508.fld.pq
   .rw-rw-r--@ 628 rob 30 Sep 12:33 SRR1039508.meta_info.json
   .rw-rw-r--@ 33M rob 30 Sep 12:33 SRR1039508.quant
   $ wget -O SraRunTable.txt -r --no-check-certificate 'https://drive.google.com/uc?export=download&id=1Qt93SG0rAI-GJ9LCmyl1gqM-9T5JlJBx'
   $ wget -O t2g.csv -r --no-check-certificate 'https://drive.google.com/uc?export=download&id=1fUpx-0HHI8msRaZm2UUKf-d5lDD0gYXZ'

The file ``SRR1039508.quant`` contains the quantification estimates, and is of a very similar format to e.g. a ``salmon`` ("quant.sf") format file. The file format for the quantification result, as well as that of other outputs, is described in the :ref:`format section of this documentation<Quantification output>`.
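Before running the full ``tximport``/``DESeq2`` workflow linked below, it may help to see roughly what the transcript-to-gene summarization step amounts to. The following Python sketch is only an illustration and makes two assumptions you should verify against your files: that each ``.quant`` file is a tab-separated table with ``Name`` and ``NumReads`` columns (as in ``salmon``'s ``quant.sf``), and that ``t2g.csv`` lists transcript IDs and gene IDs in its first two columns. The real analysis should use ``tximport``, which also applies effective-length offsets that this naive sum ignores.

.. code-block:: python

   # Rough, hypothetical illustration of transcript-to-gene count summarization.
   # Assumes salmon-like .quant columns ("Name", "NumReads") and a headerless,
   # two-column t2g.csv; verify both before using this beyond a sanity check.
   import pandas as pd

   accessions = ["SRR1039508", "SRR1039509", "SRR1039512", "SRR1039513",
                 "SRR1039516", "SRR1039517", "SRR1039520", "SRR1039521"]

   t2g = pd.read_csv("t2g.csv", header=None, names=["transcript", "gene"])

   gene_counts = {}
   for acc in accessions:
       quant = pd.read_csv(f"quant/{acc}.quant", sep="\t")
       merged = quant.merge(t2g, left_on="Name", right_on="transcript", how="inner")
       # Sum estimated transcript-level read counts within each gene.
       gene_counts[acc] = merged.groupby("gene")["NumReads"].sum()

   # DESeq2 expects integer counts, so round the summed estimates.
   count_matrix = pd.DataFrame(gene_counts).fillna(0.0).round().astype(int)
   print(count_matrix.shape)
   print(count_matrix.head())

The resulting genes-by-samples ``count_matrix`` is the kind of object you would pair with the ``SraRunTable.txt`` metadata when constructing a ``DESeqDataSet``; the linked Quarto document does this properly via ``tximport``.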
Now, we're ready to perform our DE analysis. That part of the tutorial can be found
in this `Quarto document <_static/simple_de_example.html>`_.


.. |fastqdl| replace:: ``fastq-dl``
62 changes: 50 additions & 12 deletions src/utils/em.rs
@@ -125,6 +125,11 @@ impl<'a> Iterator for PackedEqLabelIter<'a> {
None
}
}

fn size_hint(&self) -> (usize, Option<usize>) {
let rem = self.underlying_packed_map.len() - self.counter as usize;
(rem, Some(rem))
}
}

pub struct PackedEqEntryIter<'a> {
@@ -148,6 +153,12 @@
}
}

impl<'a> ExactSizeIterator for PackedEqEntryIter<'a> {
fn len(&self) -> usize {
self.underlying_packed_map.len() - self.counter as usize
}
}

impl EqMap {
/// Create a new equivalence class map, if
/// `contains_ori` is true, the equivalence class
@@ -280,22 +291,27 @@ fn m_step(
eq_map: &PackedEqMap,
eq_counts: &[usize],
prev_count: &[f64],
eff_lens: &[f64],
inv_eff_lens: &[f64],
curr_counts: &mut [f64],
) {
let mut weights: Vec<f64> = Vec::with_capacity(64);

for (k, v) in eq_map.iter_labels().zip(eq_counts.iter()) {
let mut denom = 0.0_f64;
let count = *v as f64;
for target_id in k {
denom += prev_count[*target_id as usize] / eff_lens[*target_id as usize];
}

let mut denom = 0.0_f64;
for e in k.iter() {
let w = prev_count[*e as usize] * inv_eff_lens[*e as usize];
weights.push(w);
denom += w;
}
if denom > 1e-8 {
for target_id in k {
curr_counts[*target_id as usize] += count
* ((prev_count[*target_id as usize] / eff_lens[*target_id as usize]) / denom);
let count_over_denom = count / denom;
for (target_id, w) in k.iter().zip(weights.iter()) {
curr_counts[*target_id as usize] += count_over_denom * w;
}
}
weights.clear();
}
}
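For context, ``m_step`` computes the standard EM update over equivalence classes; the change above only restructures how that update is evaluated, not the estimate itself. Writing $c_j$ for the read count of equivalence class $j$, $T_j$ for its set of candidate targets, $\hat{\alpha}_t$ for the previous abundance of target $t$, and $\tilde{\ell}_t$ for its effective length, the update is

$$
\alpha_t' = \sum_{j \,:\, t \in T_j} c_j \, \frac{\hat{\alpha}_t / \tilde{\ell}_t}{\sum_{u \in T_j} \hat{\alpha}_u / \tilde{\ell}_u}.
$$

The speedup comes from precomputing the reciprocals $1/\tilde{\ell}_t$ once per run (``inv_eff_lens``) and caching each class's weights $\hat{\alpha}_u / \tilde{\ell}_u$ in a small reusable buffer, so the numerator reuses the values already computed for the denominator and the inner loops multiply instead of repeatedly dividing.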

@@ -311,6 +327,17 @@ pub fn do_bootstrap(em_info: &EMInfo, num_boot: usize) -> Vec<Vec<f64>> {
let eq_map = em_info.eq_map;
let max_iter = em_info.max_iter;
let eff_lens = em_info.eff_lens;
let inv_eff_lens = eff_lens
.iter()
.map(|x| {
let y = 1.0_f64 / *x;
if y.is_finite() {
y
} else {
0_f64
}
})
.collect::<Vec<f64>>();
let total_weight = em_info.eq_map.counts.iter().sum::<usize>();
// init
let avg = (total_weight as f64) / (eff_lens.len() as f64);
@@ -336,7 +363,7 @@
eq_map,
&base_counts,
&prev_counts,
eff_lens,
&inv_eff_lens,
&mut curr_counts,
);

@@ -367,7 +394,7 @@ pub fn do_bootstrap(em_info: &EMInfo, num_boot: usize) -> Vec<Vec<f64>> {
eq_map,
&base_counts,
&prev_counts,
eff_lens,
&inv_eff_lens,
&mut curr_counts,
);

@@ -379,6 +406,17 @@
pub fn em(em_info: &EMInfo) -> Vec<f64> {
let eq_map = em_info.eq_map;
let eff_lens = em_info.eff_lens;
let inv_eff_lens = eff_lens
.iter()
.map(|x| {
let y = 1.0_f64 / *x;
if y.is_finite() {
y
} else {
0_f64
}
})
.collect::<Vec<f64>>();
let max_iter = em_info.max_iter;
let total_weight: f64 = eq_map.counts.iter().sum::<usize>() as f64;

@@ -395,7 +433,7 @@ pub fn em(em_info: &EMInfo) -> Vec<f64> {
eq_map,
&eq_map.counts,
&prev_counts,
eff_lens,
&inv_eff_lens,
&mut curr_counts,
);

@@ -429,7 +467,7 @@ pub fn em(em_info: &EMInfo) -> Vec<f64> {
eq_map,
&eq_map.counts,
&prev_counts,
eff_lens,
&inv_eff_lens,
&mut curr_counts,
);

