scp_processing.html

<!DOCTYPE html>
<html lang="" xml:lang="">
  <head>
    <title>Data processing with scp</title>
    <meta charset="utf-8" />
    <meta name="author" content="Laurent Gatto and Christophe Vanderaa" />
    <script src="scp_processing_files/header-attrs-2.10/header-attrs.js"></script>
    <script src="scp_processing_files/xaringanExtra-webcam-0.0.1/webcam.js"></script>
    <script id="xaringanExtra-webcam-options" type="application/json">{"width":"200","height":"200","margin":"1em"}</script>
    <link href="scp_processing_files/tile-view-0.2.6/tile-view.css" rel="stylesheet" />
    <script src="scp_processing_files/tile-view-0.2.6/tile-view.js"></script>
    <script src="scp_processing_files/xaringanExtra_fit-screen-0.2.6/fit-screen.js"></script>
    <link href="scp_processing_files/xaringanExtra-extra-styles-0.2.6/xaringanExtra-extra-styles.css" rel="stylesheet" />
    <link href="scp_processing_files/panelset-0.2.6/panelset.css" rel="stylesheet" />
    <script src="scp_processing_files/panelset-0.2.6/panelset.js"></script>
    <link rel="stylesheet" href="xaringan-themer.css" type="text/css" />
    <link rel="stylesheet" href="custom.css" type="text/css" />
  </head>
  <body>
    <textarea id="source">
class: center, middle, inverse, title-slide

# Data processing with scp
### Laurent Gatto and Christophe Vanderaa
### 2021/08/10

---


class: middle
name: cc-by

### Get the slides at https://bit.ly/202108SCP

These slides are available under a **creative common
[CC-BY license](http://creativecommons.org/licenses/by/4.0/)**. You are
free to share (copy and redistribute the material in any medium or
format) and adapt (remix, transform, and build upon the material) for
any purpose, even commercially
&lt;img height="20px" alt="CC-BY" src="https://raw.githubusercontent.com/UCLouvain-CBIO/scp-teaching/main/img/cc1.jpg" /&gt;.

???
In this presentation, I will discuss about data processing and will
show you how to perform two important steps: quality control at
feature level and quality control at sample level. I will also briefly
mention and demonstrate how to perfor dimension reduction as an example
of downstream analysis.

The slides are available at the given link and are shared under CC-BY
license.

---

class: middle, inverse, center

# How to process single-cell proteomics data?

???
A burning question you might have once you acquired you single-cell
proteomics data is: how should I process single-cell proteomics data?

---

class: middle

## How to process single-cell proteomics data?

Overview of the workflow suggested in the SCoPE2 seminal paper [1].

&lt;img src="./figs/scp_processing_workflow.svg" width="50%" style="display: block; margin: auto;" /&gt;

A full reproduction of the workflow using `scp` and `QFeatures` is
available from our
[preprint](http://dx.doi.org/10.1101/2021.04.12.439408) [2] and
[replication vignette](https://uclouvain-cbio.github.io/SCP.replication/articles/SCoPE2.html).

&lt;p style="color:grey;font-size:0.75em;"&gt;
[1] Specht, Harrison, Edward Emmott, Aleksandra A. Petelski, R. Gray
Huffman, David H. Perlman, Marco Serra, Peter Kharchenko, Antonius
Koller, and Nikolai Slavov. 2021. “Single-Cell Proteomic and
Transcriptomic Analysis of Macrophage Heterogeneity Using SCoPE2.”
Genome Biology 22 (1): 50.
&lt;br&gt;
[2] Vanderaa, Christophe, and Laurent Gatto. 2021. “Utilizing Scp for
the Analysis and Replication of Single-Cell Proteomics Data.” bioRxiv.
&lt;/p&gt;

???
Answering this question is not trivial. While further research is
required to provide principled guidelines about single-cell proteomics
data processing, we can already rely on existing processing workflows.
For instance, I show here the workflow from the SCoPE2 seminal paper
published by Nikolai's lab.

There are many different steps. Steps shown in yellow are steps commonly
applied to bulk proteomics and are therefore available from the `QFeatures`
package. However, some steps are specific to single cells and the functions
to performs them are available in the `scp` package.

Due to time constrains, I won't cover all steps, but you can find a full
reproduction of this workflow in our replication vignette and a discussion
in our preprint.

---

class: middle, center, inverse

# How to perform PSM quality control?

???
Let's start with one of the steps and let me show you how to perform
PSM quality control

---

class:

## PSM quality control

.panelset[
.panel[.panel-name[Description]

&lt;br&gt; PSM quality control removes low-quality PSMs to **improve the
reliability** of downstream analysis results.

&lt;br&gt;
**Example**: sample to carrier ratio

The sample to carrier ratio is the signal from samples divided by the
signal of the carrier (tens to hundreds of cell equivalent).

PSMs with high ratios indicate issues during acquisition and/or
quantification and need to be removed.

]
.panel[.panel-name[Plot]

Distribution of the sample to carrier ratio computed for all PSMs.

&lt;img src="scp_processing_files/figure-html/unnamed-chunk-2-1.png" width="80%" style="display: block; margin: auto;" /&gt;


]
.panel[.panel-name[Code]

Code for computing signal to carrier ratio for each PSM:


```r
scope2Data &lt;- computeSCR(scope2Data, i = 1:3, colvar = "SampleType",
                         samplePattern = "Macrophage|Monocyte|Blank",
                         carrierPattern = "Carrier", rowDataName = "SCR")
```

Minimal code for plotting:


```r
rd &lt;- rbindRowData(scope2Data, i = 1:3)
ggplot(data.frame(rd)) +
    aes(x = SCR) +
    geom_histogram() +
    scale_x_log10()
```

Code for filtering:


```r
scope2Data &lt;- filterFeatures(scope2Data, ~ SCR &lt; 0.1)
```

]
]

???
### Description

The objective of PSM quality control is to identify and remove
low-quality PSMs in order to improve the reliability of the downstream
analysis results.

Let me use an example of quality control: the sample to carrier
ratio which is the signal from samples divided by the signal from
carriers. Carriers can contain tens to hundreds of cell equivalent. The logic
is: when a PSM that exhibits a high ratio, meaning the signal in samples
becomes close to the signal in carriers or even exceeds it, then this
means that an issues has occurred during acquisition and/or
quantification and we need to remove those PSMs.

### Plot

Let's see an example distribution of the sample to carrier ratios.
Most ratios are close to 1% as expected by the experimental design,
but you may have noticed there is a trail with much higher ratios.
Those indicate an issue and will be removed.

### Code

Let's quikcly see how to do that. More details about the code are
provided in the replication vignette. First, we need to compute the the sample to carrier ratio
using `computeSCR()`. It takes a `QFeatures` object and we tell how the
samples and how the carriers are defined. The results are stored as
feature annotation, here called `SCR`.

Then, the second code shunk plots the distribution. We extract the feature
annotations and provide it to the `ggplot()` function for plotting.
This will create the plot I just showed.

The last command is how you can apply the filter. The `filterFeatures()`
function takes the `QFeatures` object and a filtering statement, in this
case we keep only features that have a ratio lower than 10\%.

---

class: middle, center, inverse

# How to perform single-cell quality control?

???
Similarly to what we just saw, I will now demonstrate the quality
control on single-cell samples.

---

class:

## Single-cell quality control

.panelset[
.panel[.panel-name[Description]

&lt;br&gt; Single-cell quality control removes failed cell to **improve the
reliability** of downstream analysis results.

&lt;br&gt;
**Example**: median coefficient of variation

The coefficient of variation measures the robustness of quantification
for a protein in a sample. Taking the median across a single cell
provides an estimate of the robustness of quantification within that
single cell.

Single cells with a high median coefficient of variation indicate
issues during acquisition and need to be removed.

]
.panel[.panel-name[Plot]

Distribution of the median coefficient of variation computed for all
single-cell and blank samples.

&lt;img src="scp_processing_files/figure-html/unnamed-chunk-6-1.png" width="70%" style="display: block; margin: auto;" /&gt;

]
.panel[.panel-name[Code]

Code for computing the median coefficient of variation:


```r
scope2Data &lt;- medianCVperCell(scope2Data, i = 1:3, groupBy = "Leading.razor.protein",
                              colDataName = "medianCV", nobs = 6,  norm = "SCoPE2")
```

Code for plotting:


```r
cd &lt;- colData(scope2Data)
ggplot(data.frame(cd)) +
    aes(x = medianCV, fill = SampleType) +
    geom_histogram() +
    scale_x_log10() +
    geom_vline(xintercept = 0.5, lty = "dashed")
```

Code for filtering:


```r
scope2Data &lt;- subsetByColData(scope2Data, scope2Data$medianCV &lt; 0.5)
```

]
]

???
### Description

The objective of single-cell quality control is to identify and remove
low-quality cells in order to improve the reliability of the downstream
analysis results, just like for feature quality control.

Again, let me use an example: The coefficient of variation measures
the robustness of quantification for a protein in a sample. Taking the
median across a single cell provides an estimate of the robustness of
quantification within that single cell. Single cells with a high
median coefficient of variation indicate issues during acquisition and
need to be removed.

### Plot

We can plot the distribution of the median coefficient of variation
computed for all single-cells. Including blank samples, that should
theoretically contain mostly noise signal, helps to define a threshold
where single cells are considered too noisy. Here you can see we set
the threshold at 0.5 to exclude all blank samples but this excludes
also a few low-quality single cells.

### Code

I here show you the code to perform the quality control, again split
in three chunks. The first chunk computes the median coefficient of
variation using the `medianCVperCell` function on a `QFeatures` object.
Various arguments allow to control how the coefficients are computed.
The computed coefficients are directly stored in the `QFeatures`
object as sample annotation.

The second chunk shows how to create the plot I just showed. We retrieve
the sample annotation and provide it to the `ggplot` function to
create the histogram.

The last chunk applies the filtering. `subsetByColData()` means that
we take a subset of the samples based on the sample annotation. Here, we
select all samples that have a median coefficient of variation lower
than 0.5, the threshold we defined based on the blanks.

And so this is how to perform two of the many steps of the workflow. Again,
we a comprehensive demonstration is provided in the replication vignette.

---

class: middle, inverse, center

# What's next after data processing?

???
You may ask yourself: what can I do after data processing?

---

class:

## What's next after data processing?

.panelset[
.panel[.panel-name[Description]

Downstream analysis is were the fun begins!

Common analyses: dimension reduction, cluster analysis, cluster
annotation, differential protein abundance analysis, ...

Many methods are readily applicable to the data thanks to the
**Bioconductor**
[`SingleCellExperiment`](https://www.bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html)
container [1]

**Example**: dimension reduction using the `scater` R package.

&lt;br&gt;
&lt;p style="color:grey;font-size:0.75em;"&gt;
[1] Amezquita, Robert A., Aaron T. L. Lun, Etienne Becht, Vince J. Carey, Lindsay N. Carpp, Ludwig Geistlinger, Federico Martini, et al. 2019. “Orchestrating Single-Cell Analysis with Bioconductor.” Nature Methods, December, 1–9.
&lt;/p&gt;

]
.panel[.panel-name[Plot]

t-SNE plot of the protein data.

&lt;img src="scp_processing_files/figure-html/unnamed-chunk-10-1.png" width="80%" style="display: block; margin: auto;" /&gt;

]
.panel[.panel-name[Code]

Code for getting the protein data with sample annotations:


```r
prots &lt;- getWithColData(scope2Data, "proteins")
```

Code for computing the tSNE (after imputation by zero):


```r
prots &lt;- impute(prots, method = "zero")
prots &lt;- runTSNE(prots, exprs_values = 1, perplexity = 5,  name = "TSNE")
```

Code for plotting:


```r
plotTSNE(prots, colour_by = "Set")
```

]
]

???
### Description

Once your data is processed, you are ready to perform downstream
analyses and that's were the fun begins! Common downstream analyses are
for instance dimension reduction, cluster analysis, cluster annotation,
differential protein abundance analysis and many more.

Most of the methods to perform single-cell analyses are actually
already available. We can directly apply those methods
to our data because our data is contained in `SingleCellExperiment`
objects within the `QFeatures` object.

As an example, let me demonstrate how to apply dimension reduction
using the `scater` package.

### Plot

Here is a tSNE plot generated from an example protein data obtained by
aggregating the PSMs to proteins without further processing. You can
see on this plot that the main source of variability is the acquisition
batch, and this is often not desirable. So, a thorough data processing is critical for principled data
analysis in order to generate robust biological knowledge. If we do not correctly
model the data, we may pass through interesting information or focus
on confounding effects.

### Code

Again, here is a bit of code to perform dimension reduction. First, we extract
the data of interest from the `QFeatures` object. In this case, we
want the protein data using the `getWithColData(à` function that will
also transfer all the available sample annotations that we use for
colouring the plot.

Next, we compute the tSNE after a little imputation of missing data by
zero values. `runTSNE()` is a function from the `scater` package.

Finally, we generate the tSNE plot and colour it by the acquisition
set. Again, `plotTSNE()` is a function from `scater`.

I hope this presentation could help you understand how to perform
data processing using our software, namely quality control, and what
you could do with your data after processing.

---

class: middle, inverse, center

# Exercise

???
Let's know test your understanding with a small exercise.

---

class: middle

#### Given the distribution of the median coefficient of variation per cell, what command would you use to remove low-quality cells?

&lt;img src="./figs/scp_processing_exercise.svg" width="50%" style="display: block; margin: auto;" /&gt;


```r
1. subsetByColData(scope2Data, scope2Data$medianCV &lt; 0.25)
2. subsetByColData(scope2Data, scope2Data$medianCV &gt; 0.25)
3. subsetByColData(scope2Data, scope2Data$medianCV &lt; 0.32)
4. subsetByColData(scope2Data, scope2Data$medianCV &gt; 0.32)
5. subsetByColData(scope2Data, scope2Data$medianCV &lt; 0.45)
6. subsetByColData(scope2Data, scope2Data$medianCV &gt; 0.45)
```

Connect to **http://www.wooclap.com/YYRQRM**

???
Given the distribution of the median coefficient of variation per cell,
what command would you use to remove low-quality cells?

Command 1, 2, 3, 4, 5 or 6? You can pause the video to think about it...
The solution is command 3. Setting a threshold at 0.32 best separates
between blank samples and the single-cells and we are interested to
keep samples with a low median coefficient of variation.

---

class: middle

### Further information

* Check out our vignette where we fully replicated the SCoPE2 workflow:
  https://uclouvain-cbio.github.io/SCP.replication/articles/SCoPE2.html
* Check out our prepint where we discuss the current challenges in
  single-cell proteomics data anylysis:
  http://dx.doi.org/10.1101/2021.04.12.439408)
* Try it yourself in this online workshop:
  https://lgatto.github.io/QFeaturesScpWorkshop2021/

### Funding

Fonds de la Recherche Scientifique (FNRS), Belgium

???
I hope you found the correct solution. If you're looking for more
detailed information, you can have a look at our replication vignette
and our preprint. If your interested to get your hands on our software,
check out our online workshop! Thank you very much for watching and
see you in another video.
    </textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "github",
"highlightLines": true,
"ratio": "16:9",
"countIncrementalSlides": true
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
  window.dispatchEvent(new Event('resize'));
});
(function(d) {
  var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
  if (!r) return;
  s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
  d.head.appendChild(s);
})(document);

(function(d) {
  var el = d.getElementsByClassName("remark-slides-area");
  if (!el) return;
  var slide, slides = slideshow.getSlides(), els = el[0].children;
  for (var i = 1; i < slides.length; i++) {
    slide = slides[i];
    if (slide.properties.continued === "true" || slide.properties.count === "false") {
      els[i - 1].className += ' has-continuation';
    }
  }
  var s = d.createElement("style");
  s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
  d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
  var deleted = false;
  slideshow.on('beforeShowSlide', function(slide) {
    if (deleted) return;
    var sheets = document.styleSheets, node;
    for (var i = 0; i < sheets.length; i++) {
      node = sheets[i].ownerNode;
      if (node.dataset["target"] !== "print-only") continue;
      node.parentNode.removeChild(node);
    }
    deleted = true;
  });
})();
(function() {
  "use strict"
  // Replace <script> tags in slides area to make them executable
  var scripts = document.querySelectorAll(
    '.remark-slides-area .remark-slide-container script'
  );
  if (!scripts.length) return;
  for (var i = 0; i < scripts.length; i++) {
    var s = document.createElement('script');
    var code = document.createTextNode(scripts[i].textContent);
    s.appendChild(code);
    var scriptAttrs = scripts[i].attributes;
    for (var j = 0; j < scriptAttrs.length; j++) {
      s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
    }
    scripts[i].parentElement.replaceChild(s, scripts[i]);
  }
})();
(function() {
  var links = document.getElementsByTagName('a');
  for (var i = 0; i < links.length; i++) {
    if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
      links[i].target = '_blank';
    }
  }
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
  const hlines = d.querySelectorAll('.remark-code-line-highlighted');
  const preParents = [];
  const findPreParent = function(line, p = 0) {
    if (p > 1) return null; // traverse up no further than grandparent
    const el = line.parentElement;
    return el.tagName === "PRE" ? el : findPreParent(el, ++p);
  };

  for (let line of hlines) {
    let pre = findPreParent(line);
    if (pre && !preParents.includes(pre)) preParents.push(pre);
  }
  preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>

<script>
slideshow._releaseMath = function(el) {
  var i, text, code, codes = el.getElementsByTagName('code');
  for (i = 0; i < codes.length;) {
    code = codes[i];
    if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
      text = code.textContent;
      if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
          /^\$\$(.|\s)+\$\$$/.test(text) ||
          /^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
        code.outerHTML = code.innerHTML;  // remove <code></code>
        continue;
      }
    }
    i++;
  }
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.src  = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
  if (location.protocol !== 'file:' && /^https?:/.test(script.src))
    script.src  = script.src.replace(/^https?:/, '');
  document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
  </body>
</html>