Batch_effects.Rmd

---
title: 'Bioinformatics for Big Omics Data: The danger of batch effects!'
author: "Raphael Gottardo"
date: "February 2, 2015"
output:
  ioslides_presentation:
    fig_caption: yes
    fig_retina: 1
    keep_md: yes
    smaller: yes
---


## Setting up some options

Let's first turn on the cache for increased performance and improved styling
```{r, cache=FALSE}
# Set some global knitr options
library("knitr")
opts_chunk$set(tidy=TRUE, tidy.opts=list(blank=FALSE, width.cutoff=60), cache=TRUE, messages=FALSE)
```


## Reading

Before we start, you should read the following papers:

1. Leek, J. T. & Storey, J. D. Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis. PLoS Genet 3, e161 (2007).
2. Leek, J. T. & Storey, J. D. A general framework for multiple testing dependence. Proc. Natl. Acad. Sci. U.S.A. 105, 18718–18723 (2008).
3. Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Reviews Genetics 11, 733-739 (2010).
4. Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118-127 (2007).
5. Gagnon-Bartsch, J. A. & Speed, T. P. Using control genes to correct for unwanted variation in microarray data. Biostatistics 13, 539-552 (2012).

## Motivation

Batch effects are technical sources of variation that have been added to the samples during handling.
Example of batch variables: lot number, technician, instrument settings, etc. 

If not adjusted for, these batch variables can have a substantial effects on downstream analysis


## Normalization and batch effects

Unfortunately, normalization will not always correct for batch effects. Technical variation due to batch effects might only affect a subset of the genes.

<img src="http://www.nature.com/nrg/journal/v11/n10/images/nrg2825-f1.jpg" width=300>

"For a published bladder cancer microarray data set obtained using an Affymetrix platform, we obtained the raw data for only the normal samples. Here, green and orange represent two different processing dates."

## Adjusting for batch effects

**Two scenarios:**

1. You have information about the batch variable
    - Use your batch effect as a covariate in your analysis (e.g. limma)
2. You suspect a batch effect, but you don't know where it is coming from
    - The batch effect needs to be estimated first and then corrected for, by adding the estimated variables as co-variates


## Singular value decomposition

Let $X$ be a matrix of size $m\times n$ ($m \ge n$) and rank $r\le n$
then we can decompose $X$ as 

$$X=USV^T$$

- U is the matrix of left singular vectors (eigenassays)
- V is the matrix of right singular vectors (eigengenes)
- S is the matrix of singular values (eigenvalues)

$U^TU=VV^T=I$ (orthogonal vectors)


$S=diag(s_1, \dots, s_n)$ where $s_l\ge 0$ and $s_{r+1}=\dots=s_n=0$

$X_i=\sum_j u_{ij}s_j\mathbf{v}_j$, which can be interpreted as a change of coordinate

## Relationship to principale component analysis

$X=USV^T$, and we have $X^TX=VSU^TUSV^T=VS^2V^T$

What happens if the rows of X are scaled?


## Surrogate variable analysis

Let $X_{m\times n}=(x_1,..,x_m)^T$ be the matrix of normalized expression values, with $n$ arrays and $m$ genes. 
Let $y=(y_1,..,y_n)^T$ be a vector of length $n$ representing the primary variable of interest (e.g covariates, vector of unknown coefficients). Without loss of generality model $x_{ij}=\mu_i+f_i( y_j) + e_{ij}$, where $\mu_i$ is the baseline level of expression, $f_i(y_j)=\mathbb{E}(x_{ij} | y_j)-\mu_i$ gives the relationship between measured variable of interest and gene $i$, and $e_{ij}$ is random noise with mean zero.

Suppose there are $L$ biologically meaningful unmodeled factors, such as age, environmental exposure, genotype,
etc. Let $g_l = (g_{l1},...,g_{ln})$ be an _arbitrarily complicated function_ of
the lth factor across all $n$ arrays, for $l=1,2,...,L$. Our model becomes:

$$x_{ij}=\mu_i + f_i(y_j) +\sum_{l=1}^L \gamma_{l_i}g_{l_j} + e^*_{ij}$$

and if factor $l$ does not influence the expression of gene $i$, we have $\gamma_{l_i}=0$.

## Surrogate variable analysis

In practice it is impossible to estimate $\sum_{l=1}^L \gamma_{l_i}g_{l_j}$, so Leek and Storey propose to use singular value decomposition to approximate the matrix $(\sum_{l=1}^L \gamma_{l_i}g_{l_j})_{ij}$ by its singular value decomposition. Computationally, this is done in two steps:


1. Detect unmodeled factors
2. Construct surrogate variables

## Detect unmodel factor
  
The main idea is as follows:

- Compute the residual matrix $r_{ij} = x_{ij}- \hat{\mu}_i - \hat{f}_i(y_j)$
- Perform the SVD of $R=(r_{ij})$
- Permute the rows of the matrix $R$ to obtain $R^*$. Regress  $r^*_{ij} = x_{ij}- \hat{\mu}_i - \hat{f}_i(y_j)$ to get residual matrix $R_0$, and perform the SVD of $R_0$. Repeat this many times to generate a null distribution for the residuals, given that $y$ is accounted for. 
- Compare the observed eigenvalues to those generated from the null distribution to obtain significance p-values
- Record the $K$ significant variables

## Construct surrogate variables 

1. Compute the residual matrix $r_{ij} = x_{ij} - \hat{\mu}_i - \hat{f}_i(y_j)$
2. Perform the SVD of $R=(r_{ij})$

Let $e_k=(e_{k1},...,e_{kn})^T$ be the $k$-th column of $V$ (for $k=1,...,n$). These $e_k$ are the residual eigengenes and represent orthogonal residual signals independent of the signal due to the primary variable.

2. Regress $e_k$ on the $x_i$ to access the significance of the $k$-th factor on gene $i$
3. Use the selected genes to form a reduced expression matrix and repeat 1. The estimated factor will form the basis for the surrogate variables
4. In any subsequent analysis include these factors in your model

## Using the sva package

We're going to look at the dataset used in:

Nakaya, H. I., Wrammert, J., Lee, E. K., Racioppi, L., Marie-Kunze, S., Haining, W. N., et al. (2011). Systems biology of vaccination for seasonal influenza in humans. Nature Immunology, 12(8), 786795. doi:10.1038/ni.2067

```{r query-GEO, cache = TRUE}
library(GEOquery)
# Download the mapping information and processed data
gds <- getGEO("GSE29619", destdir="Data/GEO/") 
#main serie #gds[[1]] = LAIV/TIV 0809, gds[[2]] = FACS, gds[[3]] = TIV 0708
```

but before we can use this, we need to clean up the pData a bit (see code in .Rpres file).

```{r sanitize-pdata, cache=TRUE, echo=FALSE}
### Sanitize data and metadata
gds_new <- gds
sanitize_pdata <- function(pd){
keepCols <- c(
  "characteristics_ch1.1", "characteristics_ch1.2",
  "description", 
  "supplementary_file")
pd <- pd[, keepCols]
colnames(pd) <- c("ptid", "time", "description", "filename")
pd$ptid <- gsub(".*: ", "", pd$ptid)
pd$time <- gsub(".*: ", "", pd$time)
pd$time<-gsub("Day", "D", pd$time)
pd$description<-gsub("(-\\w*){2}$", "", pd$description)
pd$filename<-basename(as.character(pd$filename))
pd$filename<-gsub(".CEL.gz", "", pd$filename)
pd
}

pData(gds_new[[1]]) <- sanitize_pdata(pData(gds_new[[1]]))
pData(gds_new[[2]]) <- sanitize_pdata(pData(gds_new[[2]]))
pData(gds_new[[3]]) <- sanitize_pdata(pData(gds_new[[3]]))
```

## Using the sva package 

Let us estimate the (surrogate) factors as follows:
```{r}
library(sva)

TIV_08 <- gds_new[[1]][ , grepl("2008-TIV", pData(gds_new[[1]])$description)]
mm_TIV_08 <- model.matrix(~ptid+time, TIV_08)
mm0_TIV_08 <- model.matrix(~ptid, TIV_08)

# Estimate the surrogate variables
sv_TIV_08 <- sva(exprs(TIV_08), mm_TIV_08, mod0 = mm0_TIV_08)
```

## Using the limma with estimated SVs

Then we can use these variables in limma, as follows:
```{r}
library(limma)
# Add the surrogate variables to the design matrix
modSv <- cbind(mm_TIV_08, sv_TIV_08$sv)

# Use the new design matrix
fit_TIV_08 <- lmFit(TIV_08, modSv)
ebay_TIV_08 <- eBayes(fit_TIV_08)
topT7_sv <- topTable(ebay_TIV_08, coef="timeD7", number=Inf)

# Compare to the old analysis
fit_TIV_08 <- lmFit(TIV_08, mm_TIV_08)
ebay_TIV_08 <- eBayes(fit_TIV_08)
topT7 <- topTable(ebay_TIV_08, coef="timeD7", number=Inf)
```

## Using the limma with estimated SVs

The result from the adjusted analysis:

```{r}
topT7_sv[1:10, c("ID","Gene.Symbol")]
```

## Using the limma with estimated SVs

The result from the un-adjusted analysis:

```{r}
topT7[1:10, c("ID","Gene.Symbol")]
```

What do you think?