diff --git a/RNA-seq/02-gastric_cancer_tximeta.nb.html b/RNA-seq/02-gastric_cancer_tximeta.nb.html index 960eadc4..ac9605f5 100644 --- a/RNA-seq/02-gastric_cancer_tximeta.nb.html +++ b/RNA-seq/02-gastric_cancer_tximeta.nb.html @@ -3298,8 +3298,8 @@

Summarize to gene

# Summarize to the gene level
 gene_summarized <- summarizeToGene(txi_data) 
- -
loading existing EnsDb created: 2022-09-13 21:36:34
+ +
loading existing EnsDb created: 2022-10-05 12:54:35
obtaining transcript-to-gene mapping from database
diff --git a/RNA-seq/05-nb_cell_line_DESeq2.nb.html b/RNA-seq/05-nb_cell_line_DESeq2.nb.html index f5090b10..0f67f9d2 100644 --- a/RNA-seq/05-nb_cell_line_DESeq2.nb.html +++ b/RNA-seq/05-nb_cell_line_DESeq2.nb.html @@ -3576,7 +3576,7 @@

Shrinking log2 fold change estimates

@@ -3623,7 +3623,7 @@

Making a Volcano Plot

theme(legend.position = "bottom") -

+

diff --git a/RNA-seq/06-openpbta_heatmap.nb.html b/RNA-seq/06-openpbta_heatmap.nb.html index 12f70dbd..c166a53a 100644 --- a/RNA-seq/06-openpbta_heatmap.nb.html +++ b/RNA-seq/06-openpbta_heatmap.nb.html @@ -3514,7 +3514,7 @@

Heatmap itself!

Set `ht_opt$message = FALSE` to turn off this message. -

+

diff --git a/intro-to-R-tidyverse/01-intro_to_base_R-live.Rmd b/intro-to-R-tidyverse/01-intro_to_base_R-live.Rmd index 25303b96..4baa4fa1 100644 --- a/intro-to-R-tidyverse/01-intro_to_base_R-live.Rmd +++ b/intro-to-R-tidyverse/01-intro_to_base_R-live.Rmd @@ -361,7 +361,7 @@ mean(values_1_to_20) We have learned functions such as `c`, `length`, `sum`, and etc. Imagine defining a variable called `c`: This will work, but it will lead to a -lot of unintended bugs, so its best to avoid this. +lot of unintended bugs, so it's best to avoid this. ### The `%in%` logical operator diff --git a/intro-to-R-tidyverse/01-intro_to_base_R.nb.html b/intro-to-R-tidyverse/01-intro_to_base_R.nb.html index 66fe91fd..f07bdbb9 100644 --- a/intro-to-R-tidyverse/01-intro_to_base_R.nb.html +++ b/intro-to-R-tidyverse/01-intro_to_base_R.nb.html @@ -3586,7 +3586,7 @@

A note on variable naming

We have learned functions such as c, length, sum, and etc. Imagine defining a variable called c: This will work, but it will lead to a -lot of unintended bugs, so its best to avoid this.

+lot of unintended bugs, so it’s best to avoid this.

The %in% logical operator

@@ -3958,7 +3958,7 @@

Session Info

-
---
title: "Introduction to R and RStudio"
author: Originally authored by Stephanie J. Spielman,<br>adapted by CCDL for ALSF
date: 2021
output:
  html_notebook:
    toc: true
    toc_float: true
---

## Objectives

This notebook will demonstrate how to:  

- Navigate the RStudio environment  
- Use R for simple calculations, both mathematical and logical  
- Define and use variables in base R  
- Understand and apply base R functions   
- Understand, define, and use R data types, including vector manipulation and indexing  
- Understand the anatomy of a data frame  

---

#### *More resources for learning R* 

- [Swirl, an interactive tutorial](https://swirlstats.com/)  
- [_R for Data Science_ book](https://r4ds.had.co.nz/)  
- [Tutorial on R, RStudio and R Markdown](https://ismayc.github.io/rbasics-book/)  
- [Handy R cheatsheets](https://www.rstudio.com/resources/cheatsheets/)  
- [R Markdown website](https://rmarkdown.rstudio.com)  
- [_R Markdown: The Definitive Guide_](https://bookdown.org/yihui/rmarkdown/)  

## What is R?

**R** is a statistical computing language that is _open source_, meaning the underlying code for the language is freely available to anyone. 
You do not need a special license or set of permissions to use and develop code in R. 

R itself is an _interpreted computer language_, and the functionality bundled with the language itself is known as **"base R"**.
But there is also rich additional functionality provided by **external packages**, or libraries of code that assist in accomplishing certain tasks and can be freely downloaded and loaded for use. 

In the next notebook and subsequent modules, we will be using a suite of packages collectively known as [**The Tidyverse**](https://tidyverse.org). 
The `tidyverse` is geared towards intuitive data science applications that follow a shared data philosophy.
But there are still many core features of base R which are important to be aware of, and we will be using concepts from both base R and the tidyverse in our analyses, as well as task-specific packages for analyses such as gene expression. 

### What is RStudio?

RStudio is a _graphical environment_ ("integrated development environment" or IDE) for writing and developing R code. RStudio is NOT a separate programming language - it is an interface we use to facilitate R programming. 
In other words, you can program in R without RStudio, but you can't use the RStudio environment without R.

For more information about RStudio than you ever wanted to know, see this [RStudio IDE Cheatsheet (pdf)](https://github.com/rstudio/cheatsheets/raw/main/rstudio-ide.pdf).

## The RStudio Environment

The RStudio environment has four main **panes**, each of which may have a number of tabs that display different information or functionality (their specific locations can be changed under Tools -> Global Options -> Pane Layout).
![RStudio Appearance](screenshots/rstudio-panes.png) 

1. The **Editor** pane is where you can write R scripts and other documents. Each tab here is its own document.
This is your _text editor_, which will allow you to save your R code for future use. 
Note that code you write or change here will not run until you explicitly run it. 

2. The **Console** pane is where you can _interactively_ run R code. 
  + There is also a **Terminal** tab here which can be used for running programs outside R on your computer
  
3. The **Environment** pane primarily displays the variables, sometimes known as _objects_, that are defined during a given R session, and what data or values they might hold.

4. The **Help viewer** pane has several tabs, all of which are pretty important:
    + The **Files** tab shows the structure and contents of files and folders (also known as directories) on your computer.
    + The **Plots** tab will reveal plots when you make them
    + The **Packages** tab shows which installed packages have been loaded into your R session
    + The **Help** tab will show the help page when you look up a function
    + The **Viewer** tab will reveal compiled R Markdown documents 

## Basic Calculations

### Mathematical operators

The most basic use of R is as a regular calculator:

| Operation | Symbol |
|-----------|--------|
| Add  | `+` | 
| Subtract  | `-` | 
| Multiply  | `*` | 
| Divide  | `/` | 
| Exponentiate | `^` or `**` | 

For example, we can do some simple multiplication like this. 
When you execute code within the notebook, the results appear beneath the code. 
Try executing this chunk by clicking the *Run* button within the chunk or by 
placing your cursor inside it and pressing *Cmd+Shift+Enter*. 

```{r calculator}
5 * 6
```

Use the console to calculate other expressions. Standard order of operations applies (mostly), and you can use parentheses `()` as you might expect (but not brackets `[]` or braces `{}`, which have special meanings). Note, however, that you must **always** specify multiplication with `*`; implicit multiplication such as `10(3 + 4)` or `10x` will not work and will generate an error, or worse.

```{r expressions, live = TRUE}
10 * (3 + 4)^2
```


### Defining and using variables 

To define a variable, we use the _assignment operator_, which looks like an arrow: `<-`. For example, `x <- 7` takes the value on the right-hand side of the operator and assigns it to the variable name on the left-hand side. 

```{r var-define, live = TRUE}
# Define a variable x to equal 7, and print out the value of x
x <- 7

# We can have R repeat back to us what `x` is by just using `x`
x
```

Some features of variables, considering the example `x <- 7`:
Every variable has a **name**, a **value**, and a **type**. 
This variable's name is `x`, its value is `7`, and its type is `numeric` (7 is a number!).
Re-defining a variable will overwrite the value.

```{r var-redefine}
x <- 5.5

x
```

We can modify an existing variable by reassigning it to its same name. 
Here we'll add `2` to `x` and reassign the result back to `x`. 

```{r var-modify, live = TRUE}
x <- x + 2

x
```

### Variable naming note:
As best you can, it is a good idea to make your variable names informative (e.g. `x` doesn't mean anything, but `sandwich_price` is meaningful... if we're talking about the cost of sandwiches, that is..). 

### Comments

Arguably the __most important__ aspect of your coding is comments: Small pieces of explanatory text you leave in your code to explain what the code is doing and/or leave notes to yourself or others. 
Comments are invaluable for communicating your code to others, but they are most important for **Future You**. 
Future You comes into existence about one second after you write code, and has no idea what on earth Past You was thinking. 

Comments in R code are indicated with pound signs (*aka* hashtags, octothorpes). R will _ignore_ any text in a line after the pound sign, so you can put whatever text you like there.

```{r comments}
22/7 # not quite pi

# If we need a better approximation of pi, we can use Euler's formula
# This uses atan(), which calculates arctangent.
20 * atan(1/7) + 8 * atan(3/79) 
```

Help out Future You by adding lots of comments! 
Future You next week thinks Today You is an idiot, and the only way you can convince Future You that Today You is reasonably competent is by adding comments in your code explaining why Today You is actually not so bad.

## Functions
We can use pre-built computation methods called "functions" for other operations. 
Functions have the following format, where the _argument_ is the information we are providing to the function for it to run. 
An example of this was the `atan()` function used above.

```r
function_name(argument)
```

To learn about functions, we'll examine one called `log()` first. 

To know what a function does and how to use it, use the question mark which will reveal documentation in the **help pane**: `?log`
![rhelp](screenshots/rhelp-log.png) 

The documentation tells us that `log()` is derived from `{base}`, meaning it is a function that is part of base R. 
It provides a brief description of what the function does and shows several examples of how to use it.

In particular, the documentation tells us about what argument(s) to provide:

+ The first _required_ argument is the value we'd like to take the log of; by default, the function calculates the _natural log_.
+ The second _optional_ argument can specify a different base rather than the default `e`.

Functions also _return_ values for us to use. 
In the case of `log()`, the returned value is the log'd value the function computed.

```{r log}
log(73)
```

Here we can specify an _argument_ of `base` to calculate log base 3. 

```{r log3}
log(81, base = 3)
```

If we don't specify the _argument_ names, R assumes they are in the order that `log()` defines them. 
See `?log` to see more about its arguments. 

```{r log2, live = TRUE}
log(8, 2)
```

We can switch the order if we specify the argument names. 

```{r log-order}
log(base = 10, x = 4342)
```

We can also provide variables as arguments in the same way as the raw values. 

```{r log-variable}
meaning <- 42
log(meaning)
```

## Working with variables

### Variable Types

Variable types in R can sometimes be _coerced_ (converted) from one type to another.

```{r}
# Define a variable with a number
x <- 15
```

The function `class()` will tell us the variable's type.

```{r}
class(x)
```

Let's coerce it to a character. 

```{r}
x <- as.character(x)
class(x)
```

See it now has quotes around it? It's now a character and will behave as such.

```{r}
x
```

Use this chunk to try to perform calculations with `x`, now that it is a character. What happens? 

```{r live = TRUE}
# Try to perform calculations on `x`
```

But we can't coerce everything:

```{r}
# Let's create a character variable
x <- "look at my character variable"
```

Let's try making this a numeric variable:

```{r coerce-char, error=TRUE}
x <- as.numeric(x)
```

Print out `x`.

```{r}
x
```

R is telling us it doesn't know how to convert this to a numeric variable, so it has returned `NA` instead.

For reference, here's a summary of some of the most important variable types. 

| Variable Type | Definition | Examples | Coercion |
|---------------|------------|----------| --------|
| `numeric`       | Any number value | `5`<br>`7.5` <br>`-1`| `as.numeric()`
| `integer`       | Any _whole_ number value (no decimals) | `5` <br> `-100` | `as.integer()`
|`character`      | Any collection of characters defined within _quotation marks_. Also known as a "string". | `"a"` (a single letter) <br>`"stringofletters"` (a whole bunch of characters put together as one) <br> `"string of letters and spaces"` <br> `"5"` <br> `'single quotes are also good'` | `as.character()`
|`logical`      | A value of `TRUE`, `FALSE`, or `NA` | `TRUE` <br> `FALSE` <br> `NA` (not defined) | `as.logical()` 
|`factor`       | A special type of variable that denotes specific categories of a categorical variable | (stay tuned..) | `as.factor()`

### Vectors

You will have noticed that all your computations tend to pop up with a `[1]` preceding them in R's output. 
This is because, in fact, all (ok mostly all) variables are _by default_  vectors, and our answers are the first (in these cases only) value in the vector. 
As vectors get longer, new index indicators will appear at the start of new lines. 

```{r}
# This is actually a vector that has one item in it.
x <- 7
```

```{r vector-length}
# The length() function tells us how long a vector is:
length(x)
```

We can define vectors with the function `c()`, which stands for "combine". 
This function takes a comma-separated set of values to place in the vector, and returns the vector itself:

```{r make-vector}
my_numeric_vector <- c(1, 1, 2, 3, 5, 8, 13, 21)
my_numeric_vector
```

We can build on vectors in place by redefining them:

```{r fibbonacci, live = TRUE}
# add the next two Fibonacci numbers to the series.
my_numeric_vector <- c(my_numeric_vector, 34, 55)
my_numeric_vector
```

We can pull out specific items from a vector using a process called _indexing_, which uses brackets `[]` to specify the position of an item. 

```{r subset1}
# Grab the fourth value from my_numeric_vector
# This gives us a vector of length 1 
my_numeric_vector[4]
```

Colons are also a nice way to quickly make ordered numeric vectors.
Use a colon to specify an inclusive range of indices.
For example, `2:5` creates a vector with 2, 3, 4, and 5, so the chunk below returns the second through fifth values of `my_numeric_vector`.

```{r subset-many}
my_numeric_vector[2:5]
```
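
As a small sketch of indexing, the example below uses a made-up vector named `example_vector` (not part of the original notebook) so that positions and values are easy to tell apart.
Indexing with a vector of positions, as in the last line, works the same way as indexing with a range.

```r
# A hypothetical vector where each value is 10 times its position
example_vector <- c(10, 20, 30, 40, 50, 60)

# A single position returns a vector of length 1
fourth_value <- example_vector[4]
fourth_value

# A range of positions returns the values at positions 2 through 5
middle_values <- example_vector[2:5]
middle_values

# Positions do not have to be consecutive: c() can list them explicitly
first_and_last <- example_vector[c(1, 6)]
first_and_last
```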

One major benefit of vectors is the concept of **vectorization**, where R by default performs operations on the _entire vector at once_. 
For example, we can get the log of all numbers 1-20 with a single, simple call, and more!

```{r vectorize}
values_1_to_20 <- 1:20
```


```{r vectorize-log, live = TRUE}
# calculate the log of values_1_to_20
log(values_1_to_20)
```

Finally, we can apply logical expressions to vectors, just as we can do for single values.
The output here is a logical vector telling us, for each value in `values_1_to_20`, whether the expression is `TRUE` or `FALSE`.

```{r vector-compare}
# Which values are <= 3?
values_1_to_20 <= 3
```
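
To make vectorization a little more concrete, here is a minimal sketch using a short, made-up vector called `small_values`.
It also relies on the fact that R treats `TRUE` as 1 and `FALSE` as 0 in arithmetic, so `sum()` on a logical vector counts the `TRUE` values.

```r
# A short hypothetical vector so the results are easy to check by eye
small_values <- 1:5

# Arithmetic is applied to every element at once
doubled <- small_values * 2
doubled

# A comparison returns one logical value per element
is_small <- small_values <= 3
is_small

# Summing a logical vector counts how many values are TRUE
number_small <- sum(is_small)
number_small
```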

There are several key functions which can be used on vectors containing numeric values, some of which are below.

+ `mean()`: The average value in the vector
+ `min()`: The minimum value in the vector
+ `max()`: The maximum value in the vector
+ `sum()`: The sum of all values in the vector

We can try out these functions on the vector `values_1_to_20` we've created. 

```{r vector-funcs}
mean(values_1_to_20)

# Try out some of the other functions we've listed above 

```

### A note on variable naming

We have learned functions such as `c`, `length`, `sum`, and etc. 
Imagine defining a variable called `c`: This will work, but it will lead to a 
lot of unintended bugs, so its best to avoid this. 

### The `%in%` logical operator 

`%in%` is useful for determining whether one or more items are in a vector.

```{r in-operator}
# is `7` in our vector? 
7 %in% values_1_to_20
```

```{r in2, live = TRUE}
# is `50` in our vector? 
50 %in% values_1_to_20
```

We can test a vector of values being within another vector of values. 

```{r vector-in, live = TRUE}
question_values <- c(1:3, 7, 50)
# Are these values in our vector?
question_values %in% values_1_to_20
```
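
One common use of `%in%` is to keep only the items that appear in a reference set.
The sketch below assumes two made-up character vectors, `observed_species` and `known_species`, and uses the logical result of `%in%` inside brackets to subset.

```r
# Hypothetical vectors for illustration only
observed_species <- c("Adelie", "Gentoo", "Emperor")
known_species <- c("Adelie", "Chinstrap", "Gentoo")

# One TRUE/FALSE per item on the left-hand side
is_known <- observed_species %in% known_species
is_known

# A logical vector inside brackets keeps only the TRUE positions
matched_species <- observed_species[is_known]
matched_species
```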

## Data frames

_Data frames are one of the most useful tools for data analysis in R._ 
They are tables which consist of rows and columns, much like a _spreadsheet_. 
Each column is a variable which behaves as a _vector_, and each row is an observation. 
We will begin our exploration with a dataset of measurements from three penguin species, which we can find in the [`palmerpenguins` package](https://allisonhorst.github.io/palmerpenguins/). 
We'll talk more about packages soon!
To use this dataset, we will load it from the `palmerpenguins` package using a `::` (more on this later) and assign it to a variable named `penguins` in our current environment.

```{r penguin-library}
penguins <- palmerpenguins::penguins
```

![drawings of penguin species](diagrams/lter_penguins.png) Artwork by [@allison_horst](https://twitter.com/allison_horst)

### Exploring data frames

The first step to using any data is to look at it!!! 
RStudio contains a special function `View()` which allows you to literally view a variable.
You can also click on the object in the environment pane to see its overall properties, or click the table icon on the object's row to automatically view the variable. 

Some useful functions for exploring our data frame include:

+ `head()` to see the first 6 rows of a data frame. Additional arguments supplied can change the number of rows.
+ `tail()` to see the last 6 rows of a data frame. Additional arguments supplied can change the number of rows.
+ `names()` to see the column names of the data frame.
+ `nrow()` to see how many rows are in the data frame
+ `ncol()` to see how many columns are in the data frame.

We can additionally explore _overall properties_ of the data frame with two different functions: `summary()` and `str()`.

This provides summary statistics for each column:

```{r penguins-summary}
summary(penguins)
```

This provides a short view of the **str**ucture and contents of the data frame.

```{r penguins-str}
str(penguins)
```

You'll notice that the column `species` is a _factor_: This is a special type of character variable that represents distinct categories known as "levels". 
We have learned here that there are three levels in the `species` column: Adelie, Chinstrap, and Gentoo.
We might want to explore individual columns of the data frame more in-depth. 
We can examine individual columns using the dollar sign `$` to select one by name:

```{r penguins-subset}
# Extract bill_length_mm as a vector
penguins$bill_length_mm

# indexing operators can be used on these vectors too
penguins$bill_length_mm[1:10]
```

We can perform our regular vector operations on columns directly.

```{r penguins-col-mean, live = TRUE}
# calculate the mean of the bill_length_mm column
mean(penguins$bill_length_mm,
     na.rm = TRUE) # remove missing values before calculating the mean
```
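
The same ideas can be tried on a tiny data frame built by hand, which may be easier to inspect than the full dataset.
This is only a sketch: `toy_penguins` and its values are invented for illustration and are not taken from `palmerpenguins`.

```r
# Build a small hypothetical data frame; each column is a vector
toy_penguins <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie"),
  bill_length_mm = c(39.1, NA, 40.3)
)

# Basic properties of the data frame
nrow(toy_penguins)
names(toy_penguins)

# With a missing value present, mean() returns NA by default
mean(toy_penguins$bill_length_mm)

# na.rm = TRUE drops missing values before calculating the mean
mean_bill <- mean(toy_penguins$bill_length_mm, na.rm = TRUE)
mean_bill
```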

We can also calculate the full summary statistics for a single column directly. 

```{r penguins-col-summary, live = TRUE}
# show a summary of the bill_length_mm column
summary(penguins$bill_length_mm)
```

Extract `species` as a vector and subset it to see a preview.

```{r penguins-col-subset, live = TRUE}
# get the first 10 values of the species column
penguins$species[1:10]
```

And view its _levels_ with the `levels()` function.

```{r penguin-levels}
levels(penguins$species)
```

## Files and directories

In many situations, we will be reading in tabular data from a file and using it as a data frame. 
To practice, we will read in a file we will be using in the next notebook as well, `gene_results_GSE44971.tsv`, in the `data` folder. 
File paths are relative to the location where this notebook file (.Rmd) is saved.

Here we will use a function, `read_tsv()` from the `readr` package.
Before we are able to use the function, we have to load the package using `library()`. 

```{r readr}
library(readr)
```

`file.path()` creates a properly formatted file path by adding a path separator (`/` on Mac and Linux operating systems, the latter of which is the operating system that our RStudio Server runs on) between separate folders or directories.
Because file path separators can differ between your computer and the computer of someone who wants to use your code, we use `file.path()` instead of typing out `"data/gene_results_GSE44971.tsv"`.
Each _argument_ to `file.path()` is a directory or file name.
You'll notice each argument is in quotes. We specify `data` first because the file, `gene_results_GSE44971.tsv`, is in the `data` folder. 

```{r file.path}
file.path("data", "gene_results_GSE44971.tsv")
```
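
`file.path()` accepts any number of arguments, so nested folders can be described one level at a time.
In the sketch below, the folder and file names (`data`, `raw`, `example.tsv`) are hypothetical; `basename()` and `dirname()` are base R functions that split a path back into its file and folder parts.

```r
# Build a path to a file inside nested folders (names are made up)
nested_path <- file.path("data", "raw", "example.tsv")
nested_path

# Pull the file name and the containing folder back out of the path
path_file <- basename(nested_path)
path_folder <- dirname(nested_path)
path_file
path_folder
```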

We can store this file path as a variable in our environment. 

```{r file.path-variable}
gene_file_path <- file.path("data", "gene_results_GSE44971.tsv")
```

Now we are ready to use `read_tsv()` to read the file into R.
The resulting data frame will be stored in a variable named `stats_df`.
Note the `<-` (assignment operator!) is responsible for saving this to our global environment. 

```{r read-stats}
# read in the file `gene_results_GSE44971.tsv` from the data directory
stats_df <- read_tsv(gene_file_path)
```

Take a look at your environment panel to see what `stats_df` looks like. 
We can also print out a preview of the `stats_df` data frame here. 

```{r show-stats, live = TRUE}
# display stats_df
stats_df
```

### Session Info

At the end of every notebook, you will see us print out `sessionInfo()`. 
This aids in the reproducibility of your code by showing exactly what packages 
and versions were being used the last time the notebook was run.

```{r}
sessionInfo()
```

+
---
title: "Introduction to R and RStudio"
author: Originally authored by Stephanie J. Spielman,<br>adapted by CCDL for ALSF
date: 2021
output:
  html_notebook:
    toc: true
    toc_float: true
---

## Objectives

This notebook will demonstrate how to:  

- Navigate the RStudio environment  
- Use R for simple calculations, both mathematical and logical  
- Define and use variables in base R  
- Understand and apply base R functions   
- Understand, define, and use R data types, including vector manipulation and indexing  
- Understand the anatomy of a data frame  

---

#### *More resources for learning R* 

- [Swirl, an interactive tutorial](https://swirlstats.com/)  
- [_R for Data Science_ book](https://r4ds.had.co.nz/)  
- [Tutorial on R, RStudio and R Markdown](https://ismayc.github.io/rbasics-book/)  
- [Handy R cheatsheets](https://www.rstudio.com/resources/cheatsheets/)  
- [R Markdown website](https://rmarkdown.rstudio.com)  
- [_R Markdown: The Definitive Guide_](https://bookdown.org/yihui/rmarkdown/)  

## What is R?

**R** is a statistical computing language that is _open source_, meaning the underlying code for the language is freely available to anyone. 
You do not need a special license or set of permissions to use and develop code in R. 

R itself is an _interpreted computer language_, and the functionality bundled with the language itself is known as **"base R"**.
But there is also rich additional functionality provided by **external packages**, or libraries of code that assist in accomplishing certain tasks and can be freely downloaded and loaded for use. 

In the next notebook and subsequent modules, we will be using a suite of packages collectively known as [**The Tidyverse**](https://tidyverse.org). 
The `tidyverse` is geared towards intuitive data science applications that follow a shared data philosophy.
But there are still many core features of base R which are important to be aware of, and we will be using concepts from both base R and the tidyverse in our analyses, as well as task-specific packages for analyses such as gene expression. 

### What is RStudio?

RStudio is a _graphical environment_ ("integrated development environment" or IDE) for writing and developing R code. RStudio is NOT a separate programming language - it is an interface we use to facilitate R programming. 
In other words, you can program in R without RStudio, but you can't use the RStudio environment without R.

For more information about RStudio than you ever wanted to know, see this [RStudio IDE Cheatsheet (pdf)](https://github.com/rstudio/cheatsheets/raw/main/rstudio-ide.pdf).

## The RStudio Environment

The RStudio environment has four main **panes**, each of which may have a number of tabs that display different information or functionality (their specific locations can be changed under Tools -> Global Options -> Pane Layout).
![RStudio Appearance](screenshots/rstudio-panes.png) 

1. The **Editor** pane is where you can write R scripts and other documents. Each tab here is its own document.
This is your _text editor_, which will allow you to save your R code for future use. 
Note that code you write or change here will not run until you explicitly run it. 

2. The **Console** pane is where you can _interactively_ run R code. 
  + There is also a **Terminal** tab here which can be used for running programs outside R on your computer
  
3. The **Environment** pane primarily displays the variables, sometimes known as _objects_, that are defined during a given R session, and what data or values they might hold.

4. The **Help viewer** pane has several tabs, all of which are pretty important:
    + The **Files** tab shows the structure and contents of files and folders (also known as directories) on your computer.
    + The **Plots** tab will reveal plots when you make them
    + The **Packages** tab shows which installed packages have been loaded into your R session
    + The **Help** tab will show the help page when you look up a function
    + The **Viewer** tab will reveal compiled R Markdown documents 

## Basic Calculations

### Mathematical operators

The most basic use of R is as a regular calculator:

| Operation | Symbol |
|-----------|--------|
| Add  | `+` | 
| Subtract  | `-` | 
| Multiply  | `*` | 
| Divide  | `/` | 
| Exponentiate | `^` or `**` | 

For example, we can do some simple multiplication like this. 
When you execute code within the notebook, the results appear beneath the code. 
Try executing this chunk by clicking the *Run* button within the chunk or by 
placing your cursor inside it and pressing *Cmd+Shift+Enter*. 

```{r calculator}
5 * 6
```

Use the console to calculate other expressions. Standard order of operations applies (mostly), and you can use parentheses `()` as you might expect (but not brackets `[]` or braces `{}`, which have special meanings). Note, however, that you must **always** specify multiplication with `*`; implicit multiplication such as `10(3 + 4)` or `10x` will not work and will generate an error, or worse.

```{r expressions, live = TRUE}
10 * (3 + 4)^2
```
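
As a quick sketch of how the order of operations plays out, the chunk below compares the same numbers with and without parentheses.
The variable names are only for illustration and are not used elsewhere in the notebook.

```r
# Parentheses are evaluated first, then the exponent, then multiplication
with_parentheses <- 10 * (3 + 4)^2
with_parentheses

# Without parentheses, the exponent applies only to 4
without_parentheses <- 10 * 3 + 4^2
without_parentheses

# `**` is accepted as another way to write `^`
two_cubed <- 2 ** 3
two_cubed
```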


### Defining and using variables 

To define a variable, we use the _assignment operator_, which looks like an arrow: `<-`. For example, `x <- 7` takes the value on the right-hand side of the operator and assigns it to the variable name on the left-hand side. 

```{r var-define, live = TRUE}
# Define a variable x to equal 7, and print out the value of x
x <- 7

# We can have R repeat back to us what `x` is by just using `x`
x
```

Some features of variables, considering the example `x <- 7`:
Every variable has a **name**, a **value**, and a **type**. 
This variable's name is `x`, its value is `7`, and its type is `numeric` (7 is a number!).
Re-defining a variable will overwrite the value.

```{r var-redefine}
x <- 5.5

x
```

We can modify an existing variable by reassigning it to its same name. 
Here we'll add `2` to `x` and reassign the result back to `x`. 

```{r var-modify, live = TRUE}
x <- x + 2

x
```

### Variable naming note:
As best you can, it is a good idea to make your variable names informative (e.g. `x` doesn't mean anything, but `sandwich_price` is meaningful... if we're talking about the cost of sandwiches, that is..). 
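
For instance, a short calculation might read as follows when the names describe their contents.
The names and values here (`sandwich_price`, `sandwich_count`, `total_cost`) are made up for illustration.

```r
# Informative names make the calculation readable without extra explanation
sandwich_price <- 7.50
sandwich_count <- 3

# The total is the price of one sandwich times the number of sandwiches
total_cost <- sandwich_price * sandwich_count
total_cost
```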

### Comments

Arguably the __most important__ aspect of your coding is comments: Small pieces of explanatory text you leave in your code to explain what the code is doing and/or leave notes to yourself or others. 
Comments are invaluable for communicating your code to others, but they are most important for **Future You**. 
Future You comes into existence about one second after you write code, and has no idea what on earth Past You was thinking. 

Comments in R code are indicated with pound signs (*aka* hashtags, octothorpes). R will _ignore_ any text in a line after the pound sign, so you can put whatever text you like there.

```{r comments}
22/7 # not quite pi

# If we need a better approximation of pi, we can use Euler's formula
# This uses atan(), which calculates arctangent.
20 * atan(1/7) + 8 * atan(3/79) 
```

Help out Future You by adding lots of comments! 
Future You next week thinks Today You is an idiot, and the only way you can convince Future You that Today You is reasonably competent is by adding comments in your code explaining why Today You is actually not so bad.

## Functions
We can use pre-built computation methods called "functions" for other operations. 
Functions have the following format, where the _argument_ is the information we are providing to the function for it to run. 
An example of this was the `atan()` function used above.

```r
function_name(argument)
```

To learn about functions, we'll examine one called `log()` first. 

To know what a function does and how to use it, use the question mark which will reveal documentation in the **help pane**: `?log`
![rhelp](screenshots/rhelp-log.png) 

The documentation tells us that `log()` is derived from `{base}`, meaning it is a function that is part of base R. 
It provides a brief description of what the function does and shows several examples of how to use it.

In particular, the documentation tells us about what argument(s) to provide:

+ The first _required_ argument is the value we'd like to take the log of; by default, the function calculates the _natural log_.
+ The second _optional_ argument can specify a different base rather than the default `e`.

Functions also _return_ values for us to use. 
In the case of `log()`, the returned value is the log'd value the function computed.

```{r log}
log(73)
```

Here we can specify an _argument_ of `base` to calculate log base 3. 

```{r log3}
log(81, base = 3)
```

If we don't specify the _argument_ names, R assumes they are in the order that `log()` defines them. 
See `?log` to see more about its arguments. 

```{r log2, live = TRUE}
log(8, 2)
```

We can switch the order if we specify the argument names. 

```{r log-order}
log(base = 10, x = 4342)
```
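
To see how positional and named arguments relate, the sketch below calls `log()` three equivalent ways and then shows what happens when unnamed arguments are swapped.
The variable names are illustrative only; `all.equal()` is used for comparison because results of floating-point calculations are best compared with a small tolerance.

```r
# Three ways of asking for log base 2 of 8
positional <- log(8, 2)
named <- log(x = 8, base = 2)
reordered <- log(base = 2, x = 8)

# All three calls give the same result
all.equal(positional, named)
all.equal(named, reordered)

# Swapping unnamed arguments changes the question:
# this is log base 8 of 2, not log base 2 of 8
swapped <- log(2, 8)
swapped
```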

We can also provide variables as arguments in the same way as the raw values. 

```{r log-variable}
meaning <- 42
log(meaning)
```

## Working with variables

### Variable Types

Variable types in R can sometimes be _coerced_ (converted) from one type to another.

```{r}
# Define a variable with a number
x <- 15
```

The function `class()` will tell us the variable's type.

```{r}
class(x)
```

Let's coerce it to a character. 

```{r}
x <- as.character(x)
class(x)
```

See it now has quotes around it? It's now a character and will behave as such.

```{r}
x
```

Use this chunk to try to perform calculations with `x`, now that it is a character. What happens? 

```{r live = TRUE}
# Try to perform calculations on `x`
```

But we can't coerce everything:

```{r}
# Let's create a character variable
x <- "look at my character variable"
```

Let's try making this a numeric variable:

```{r coerce-char, error=TRUE}
x <- as.numeric(x)
```

Print out `x`.

```{r}
x
```

R is telling us it doesn't know how to convert this to a numeric variable, so it has returned `NA` instead.
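
Whether coercion succeeds depends on the contents of the character value, not only on its type. 
Here is a minimal sketch with made-up values; `suppressWarnings()` is used only to keep the output tidy.

```{r}
# A character that looks like a number can be coerced back to numeric
numeric_from_text <- as.numeric("15")
numeric_from_text + 1

# A character that does not look like a number becomes NA (with a warning)
not_a_number <- suppressWarnings(as.numeric("fifteen"))

# is.na() tests whether a value is NA
is.na(not_a_number)
```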

For reference, here's a summary of some of the most important variable types. 

| Variable Type | Definition | Examples | Coercion |
|---------------|------------|----------| --------|
| `numeric`       | Any number value | `5`<br>`7.5` <br>`-1`| `as.numeric()`
| `integer`       | Any _whole_ number value (no decimals); typed with an `L` suffix | `5L` <br> `-100L` | `as.integer()`
|`character`      | Any collection of characters defined within _quotation marks_. Also known as a "string". | `"a"` (a single letter) <br>`"stringofletters"` (a whole bunch of characters put together as one) <br> `"string of letters and spaces"` <br> `"5"` <br> `'single quotes are also good'` | `as.character()`
|`logical`      | A value of `TRUE`, `FALSE`, or `NA` | `TRUE` <br> `FALSE` <br> `NA` (not defined) | `as.logical()` 
|`factor`       | A special type of variable that denotes specific categories of a categorical variable | (stay tuned..) | `as.factor()`

### Vectors

You will have noticed that all your computations tend to pop up with a `[1]` preceding them in R's output. 
This is because, in fact, all (ok mostly all) variables are _by default_  vectors, and our answers are the first (in these cases only) value in the vector. 
As vectors get longer, new index indicators will appear at the start of new lines. 

```{r}
# This is actually a vector that has one item in it.
x <- 7
```

```{r vector-length}
# The length() function tells us how long a vector is:
length(x)
```

We can define vectors with the function `c()`, which stands for "combine". 
This function takes a comma-separated set of values to place in the vector, and returns the vector itself:

```{r make-vector}
my_numeric_vector <- c(1, 1, 2, 3, 5, 8, 13, 21)
my_numeric_vector
```

We can build on vectors in place by redefining them:

```{r fibbonacci, live = TRUE}
# add the next two Fibonacci numbers to the series.
my_numeric_vector <- c(my_numeric_vector, 34, 55)
my_numeric_vector
```

We can pull out specific items from a vector using a process called _indexing_, which uses brackets `[]` to specify the position of an item. 

```{r subset1}
# Grab the fourth value from my_numeric_vector
# This gives us a vector of length 1 
my_numeric_vector[4]
```

Colons are also a nice way to quickly make ordered numeric vectors.
Use a colon to specify an inclusive range of indices.
Here, `2:5` creates the indices 2, 3, 4, and 5, so this returns a vector with the second through fifth values.

```{r subset-many}
my_numeric_vector[2:5]
```
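
Because the index is itself a vector, it does not have to be a continuous range. 
The sketch below uses its own small vector (`fib_values`, a name made up for this example) to show a few indexing patterns.

```{r}
# A small example vector (the first eight Fibonacci numbers)
fib_values <- c(1, 1, 2, 3, 5, 8, 13, 21)

# 2:5 is itself a vector of positions: 2 3 4 5
2:5

# Any vector of positions works as an index, not just a range
fib_values[c(1, 4, 8)]

# A negative index drops that position instead of selecting it
fib_values[-1]
```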

One major benefit of vectors is the concept of **vectorization**, where R by default performs operations on the _entire vector at once_. 
For example, we can get the log of all numbers 1-20 with a single, simple call, and more!

```{r vectorize}
values_1_to_20 <- 1:20
```


```{r vectorize-log, live = TRUE}
# calculate the log of values_1_to_20
log(values_1_to_20)
```

Finally, we can apply logical expressions to vectors, just as we can do for single values.
The output here is a logical vector telling us whether the expression is `TRUE` or `FALSE` for each value in `values_1_to_20`.

```{r vector-compare}
# Which values are <= 3?
values_1_to_20 <= 3
```
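
Logical vectors are useful beyond printing. 
As a small sketch (the variable names are invented for this example), we can count the `TRUE` values or use them to pull out the matching items.

```{r}
# Recreate the example vector so this chunk stands alone
small_values <- 1:20

# A logical vector: TRUE where the value is <= 3
is_small <- small_values <= 3

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
sum(is_small)

# Indexing with a logical vector keeps only the positions that are TRUE
small_values[is_small]
```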

There are several key functions which can be used on vectors containing numeric values, some of which are below.

+ `mean()`: The average value in the vector
+ `min()`: The minimum value in the vector
+ `max()`: The maximum value in the vector
+ `sum()`: The sum of all values in the vector

We can try out these functions on the vector `values_1_to_20` we've created. 

```{r vector-funcs}
mean(values_1_to_20)

# Try out some of the other functions we've listed above 

```

### A note on variable naming

We have learned functions such as `c()`, `length()`, `sum()`, and so on. 
Imagine defining a variable called `c`: This will work, but it will lead to a 
lot of unintended bugs, so it's best to avoid this. 

### The `%in%` logical operator 

`%in%` is useful for determining whether one or more items are in a vector.

```{r in-operator}
# is `7` in our vector? 
7 %in% values_1_to_20
```

```{r in2, live = TRUE}
# is `50` in our vector? 
50 %in% values_1_to_20
```

We can also test whether each value in one vector is found in another vector. 

```{r vector-in, live = TRUE}
question_values <- c(1:3, 7, 50)
# Are these values in our vector?
question_values %in% values_1_to_20
```
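
The result of `%in%` always has one value per item on the left-hand side, so the order of the two vectors matters. 
The sketch below uses made-up sample IDs to show this, and to show how the result can be used for indexing.

```{r}
# Hypothetical example vectors
sample_ids <- c("A1", "B2", "C3", "D4")
ids_of_interest <- c("B2", "D4", "Z9")

# One TRUE/FALSE per element of the vector on the left-hand side
sample_ids %in% ids_of_interest

# Swapping the sides asks a different question, so the result changes
ids_of_interest %in% sample_ids

# Use the logical vector to keep only the matching values
sample_ids[sample_ids %in% ids_of_interest]
```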

## Data frames

_Data frames are one of the most useful tools for data analysis in R._ 
They are tables which consist of rows and columns, much like a _spreadsheet_. 
Each column is a variable which behaves as a _vector_, and each row is an observation. 
We will begin our exploration with a dataset of measurements from three penguin species, which we can find in the [`palmerpenguins` package](https://allisonhorst.github.io/palmerpenguins/). 
We'll talk more about packages soon!
To use this dataset, we will load it from the `palmerpenguins` package using a `::` (more on this later) and assign it to a variable named `penguins` in our current environment.

```{r penguin-library}
penguins <- palmerpenguins::penguins
```

![drawings of penguin species](diagrams/lter_penguins.png) Artwork by [@allison_horst](https://twitter.com/allison_horst)

### Exploring data frames

The first step to using any data is to look at it!!! 
RStudio contains a special function `View()` which allows you to literally view a variable.
You can also click on the object in the environment pane to see its overall properties, or click the table icon on the object's row to automatically view the variable. 

Some useful functions for exploring our data frame include:

+ `head()` to see the first 6 rows of a data frame. Additional arguments supplied can change the number of rows.
+ `tail()` to see the last 6 rows of a data frame. Additional arguments supplied can change the number of rows.
+ `names()` to see the column names of the data frame.
+ `nrow()` to see how many rows are in the data frame.
+ `ncol()` to see how many columns are in the data frame.

We can additionally explore _overall properties_ of the data frame with two different functions: `summary()` and `str()`.

This provides summary statistics for each column:

```{r penguins-summary}
summary(penguins)
```

This provides a short view of the **str**ucture and contents of the data frame.

```{r penguins-str}
str(penguins)
```

You'll notice that the column `species` is a _factor_: This is a special type of character variable that represents distinct categories known as "levels". 
We have learned here that there are three levels in the `species` column: Adelie, Chinstrap, and Gentoo.
We might want to explore individual columns of the data frame more in-depth. 
We can examine individual columns using the dollar sign `$` to select one by name:

```{r penguins-subset}
# Extract bill_length_mm as a vector
penguins$bill_length_mm

# indexing operators can be used on these vectors too
penguins$bill_length_mm[1:10]
```

We can perform our regular vector operations on columns directly.

```{r penguins-col-mean, live = TRUE}
# calculate the mean of the bill_length_mm column
mean(penguins$bill_length_mm,
     na.rm = TRUE) # remove missing values before calculating the mean
```
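
To see why `na.rm = TRUE` matters, here is a minimal sketch with a made-up vector that contains one missing value.

```{r}
# A made-up vector of measurements with one missing value
bill_lengths <- c(39.1, 39.5, NA, 36.7)

# Without na.rm, a single NA makes the result NA
mean(bill_lengths)

# With na.rm = TRUE, the NA is dropped before the mean is calculated
mean(bill_lengths, na.rm = TRUE)
```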

We can also calculate the full summary statistics for a single column directly. 

```{r penguins-col-summary, live = TRUE}
# show a summary of the bill_length_mm column
summary(penguins$bill_length_mm)
```

Extract `species` as a vector and subset it to see a preview.

```{r penguins-col-subset, live = TRUE}
# get the first 10 values of the species column
penguins$species[1:10]
```

And view its _levels_ with the `levels()` function.

```{r penguin-levels}
levels(penguins$species)
```
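
Factors can also be created from character vectors. 
The sketch below uses a made-up vector of species names (not the `penguins` data) to show how levels are assigned and counted.

```{r}
# A made-up character vector of categories
species_names <- c("Gentoo", "Adelie", "Gentoo", "Chinstrap", "Adelie", "Gentoo")

# Coerce it to a factor; by default the levels are the sorted unique values
species_factor <- as.factor(species_names)
levels(species_factor)

# table() counts how many times each level occurs
table(species_factor)
```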

## Files and directories

In many situations, we will be reading in tabular data from a file and using it as a data frame. 
To practice, we will read in a file we will be using in the next notebook as well, `gene_results_GSE44971.tsv`, in the `data` folder. 
File paths are relative to the location where this notebook file (.Rmd) is saved.

Here we will use a function, `read_tsv()` from the `readr` package.
Before we are able to use the function, we have to load the package using `library()`. 

```{r readr}
library(readr)
```

`file.path()` creates a properly formatted file path by adding a path separator (`/` on Mac and Linux operating systems, the latter of which is the operating system that our RStudio Server runs on) between separate folders or directories.
Because file path separators can differ between your computer and the computer of someone who wants to use your code, we use `file.path()` instead of typing out `"data/gene_results_GSE44971.tsv"`.
Each _argument_ to `file.path()` is a directory or file name.
You'll notice each argument is in quotes. We specify `data` first because the file, `gene_results_GSE44971.tsv`, is in the `data` folder. 

```{r file.path}
file.path("data", "gene_results_GSE44971.tsv")
```
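
`file.path()` accepts as many pieces as the path needs. 
In this sketch, `results` and `figures` are hypothetical folder names used only for illustration.

```{r}
# Each argument becomes one piece of the path, joined by a separator
figure_path <- file.path("results", "figures", "volcano_plot.png")
figure_path

# basename() and dirname() split a path back into its file name and its folder
basename(figure_path)
dirname(figure_path)
```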

We can store this file path as a variable in our environment. 

```{r file.path-variable}
gene_file_path <- file.path("data", "gene_results_GSE44971.tsv")
```

Now we are ready to use `read_tsv()` to read the file into R.
The resulting data frame will be stored in a variable named `stats_df`.
Note the `<-` (assignment operator!) is responsible for saving this to our global environment. 

```{r read-stats}
# read in the file `gene_results_GSE44971.tsv` from the data directory
stats_df <- read_tsv(gene_file_path)
```

Take a look at your environment panel to see what `stats_df` looks like. 
We can also print out a preview of the `stats_df` data frame here. 

```{r show-stats, live = TRUE}
# display stats_df
stats_df
```

### Session Info

At the end of every notebook, you will see us print out `sessionInfo()`. 
This aids in the reproducibility of your code by showing exactly what packages 
and versions were being used the last time the notebook was run.

```{r}
sessionInfo()
```

diff --git a/intro-to-R-tidyverse/02-intro_to_ggplot2-live.Rmd b/intro-to-R-tidyverse/02-intro_to_ggplot2-live.Rmd index 923ccf8f..389346f8 100644 --- a/intro-to-R-tidyverse/02-intro_to_ggplot2-live.Rmd +++ b/intro-to-R-tidyverse/02-intro_to_ggplot2-live.Rmd @@ -163,6 +163,8 @@ Remember, the name of this package is `ggplot2`, but the function we use is call 1. `data`, which is the data frame that contains the data we want to plot. 2. `mapping`, which is a special list made with the `aes()` function to describe which values will be used for each **aes**thetic component of the plot, such as the x and y coordinates of each point. (If you find calling things like the x and y coordinates "aesthetics" confusing, don't worry, you are not alone.) +Specifically, the `aes()` function is used to specify that a given column (variable) in your data frame be mapped to a given aesthetic component of the plot. + ```{r ggplot-base} ggplot( @@ -346,11 +348,17 @@ volcano_plot <- ggplot( ``` When we are happy with our plot, we can save the plot using `ggsave`. +It's a good idea to also specify `width` and `height` arguments (units in inches) +to ensure the saved plot is always the same size every time you run this code. +Here, we'll save a 6"x6" plot. + ```{r ggsave} ggsave( plot = volcano_plot, - filename = file.path(plots_dir, "volcano_plot.png") + filename = file.path(plots_dir, "volcano_plot.png"), + width = 6, + height = 6 ) ``` diff --git a/intro-to-R-tidyverse/02-intro_to_ggplot2.nb.html b/intro-to-R-tidyverse/02-intro_to_ggplot2.nb.html index 59d873ca..26f963ae 100644 --- a/intro-to-R-tidyverse/02-intro_to_ggplot2.nb.html +++ b/intro-to-R-tidyverse/02-intro_to_ggplot2.nb.html @@ -3209,7 +3209,10 @@

Plotting this data

each aesthetic component of the plot, such as the x and y coordinates of each point. (If you find calling things like the x and y coordinates “aesthetics” confusing, don’t worry, you are not -alone.) +alone.)
+Specifically, the aes() function is used to specify that a +given column (variable) in your data frame be mapped to a given +aesthetic component of the plot. @@ -3507,18 +3510,20 @@

Adjust our ggplot

When we are happy with our plot, we can save the plot using -ggsave.

+ggsave. It’s a good idea to also specify width +and height arguments (units in inches) to ensure the saved +plot is always the same size every time you run this code. Here, we’ll +save a 6”x6” plot.

- +
ggsave(
   plot = volcano_plot,
-  filename = file.path(plots_dir, "volcano_plot.png")
+  filename = file.path(plots_dir, "volcano_plot.png"),
+  width = 6, 
+  height = 6 
 )
- -
Saving 7 x 5 in image
- @@ -3578,7 +3583,7 @@

Session Info

-
---
title: "Introduction to ggplot2"
author: "CCDL for ALSF" 
date: 2021
output:   
  html_notebook: 
    toc: true
    toc_float: true
---


## Objectives

This notebook will demonstrate how to: 

- Load and use R packages  
- Read in and perform simple manipulations of data frames 
- Use `ggplot2` to plot and visualize data
- Customize plots using features of `ggplot2`

---

We'll use a real gene expression dataset to get comfortable making visualizations using ggplot2. 
We've [performed differential expression analyses](./scripts/00-setup-intro-to-R.R) on a pre-processed [astrocytoma microarray dataset](https://www.refine.bio/experiments/GSE44971/gene-expression-data-from-pilocytic-astrocytoma-tumour-samples-and-normal-cerebellum-controls).
We'll start by making a volcano plot of differential gene expression results from this experiment.
We performed three sets of contrasts:  

1) `sex` category contrasting: `Male` vs `Female`  
2) `tissue` category contrasting: `Pilocytic astrocytoma tumor` samples vs `normal cerebellum` samples  
3) An interaction of both `sex` and `tissue`. 

**More ggplot2 resources:**  

- [ggplot2 website](https://ggplot2.tidyverse.org/)  
- [Handy cheatsheet for ggplot2 (pdf)](https://github.com/rstudio/cheatsheets/raw/main/data-visualization.pdf)
- [_Data Visualization, A practical introduction_](https://socviz.co/)
- [Data visualization chapter of _R for Data Science_](https://r4ds.had.co.nz/data-visualisation.html)  
- [ggplot2 online tutorial](http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html)  

## Set Up

We saved these results to a tab separated values (TSV) file called `gene_results_GSE44971.tsv`.
It's been saved to the `data` folder. 
File paths are relative to where this notebook file (.Rmd) is saved.
So we can reference it later, let's make a variable with our data directory name. 

```{r}
data_dir <- "data"
```

Let's declare our output folder name as its own variable. 

```{r}
plots_dir <- "plots"
```

We can also create a directory if it doesn't already exist. 

```{r createif}
# The if statement here tests whether the plot directory exists and
# only executes the expressions between the braces if it does not.
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}
```

In this notebook we will be using functions from the Tidyverse set of packages, so we need to load in those functions using `library()`.
We could load the individual packages we need one at a time, but it is convenient for now to load them all with the `tidyverse` "package," which groups many of them together as a shortcut.
Keep a look out for where we tell you which individual package different functions come from.

```{r tidyverse}
library(tidyverse)
```

## Read in the differential expression analysis results file

Here we are using a `tidyverse` function `read_tsv()` from the `readr` package.
Like we did in the previous notebook, we will store the resulting data frame as `stats_df`.

```{r read-stats}
# read in the file `gene_results_GSE44971.tsv` from the data directory
stats_df <- read_tsv(file.path(
  data_dir,
  "gene_results_GSE44971.tsv"
))
```

We can take a look at a column individually by using a `$`. 
Note we are using `head()` so the whole thing doesn't print out. 

```{r column}
head(stats_df$contrast)
```

If we want to see a specific set of values, we can use brackets with the indices of the values we'd like returned.

```{r}
stats_df$avg_expression[6:10]
```

Let's look at some basic statistics from the data set using `summary()`.

```{r stats-summary, live = TRUE}
# summary of stats_df
summary(stats_df)
```

The statistics for `contrast` are not very informative, so let's do that again with just the `contrast` column after converting it to a `factor`.

```{r factor-summary, live = TRUE}
# summary of `stats_df$contrast` as a factor
summary(as.factor(stats_df$contrast))
```

## Set up the dataset

Before we make our plot, we want to calculate a set of new values for each row: transformations of the raw statistics in our table.
To do this we will use a function from the `dplyr` package called `mutate()` to make a new column of -log10 p values.

```{r mutate}
# add a `neg_log10_p` column to the data frame
stats_df <- mutate(stats_df, # data frame we'd like to add a variable to
  neg_log10_p = -log10(p_value) # column name and values
)
```

Let's filter to only `male_female` contrast data. 
First let's try out a logical expression: 

```{r eval = FALSE}
stats_df$contrast == "male_female"
```

Now we can try out the `filter()` function.
Notice that we are not assigning the results to a variable, so this filtered dataset will not be saved to the environment.

```{r filter, live = TRUE}
# filter stats_df to "male_female" only
filter(stats_df, contrast == "male_female")
```

Now we can assign the results to a new data frame: `male_female_df`. 

```{r filter-save, live = TRUE}
# filter and save to male_female_df
male_female_df <- filter(stats_df, contrast == "male_female")
```

## Plotting this data

Let's make a volcano plot with this data. 
First let's take a look at only the tumor vs. normal comparison. 
Let's save this as a separate data frame by assigning it a new name. 

```{r filter-tumor}
tumor_normal_df <- filter(stats_df, contrast == "astrocytoma_normal")
```

To make this plot we will be using functions from the `ggplot2` package, the main plotting package of the tidyverse.
We use the first function, `ggplot()` to define the data that will be plotted.
Remember, the name of this package is `ggplot2`, but the function we use is called `ggplot()` without the `2`.
`ggplot()` takes two main arguments:  

1. `data`, which is the data frame that contains the data we want to plot.  
2. `mapping`, which is a special list made with the `aes()` function to describe which values will be used for each **aes**thetic component of the plot, such as the x and y coordinates of each point. 
(If you find calling things like the x and y coordinates "aesthetics" confusing, don't worry, you are not alone.)  

```{r ggplot-base}
ggplot(
  tumor_normal_df, # This first argument is the data frame with the data we want to plot
  aes(
    x = log_fold_change, # This is the column name of the values we want to use
    # for the x coordinates
    y = neg_log10_p
  ) # This is the column name of the data we want for the y-axis
)
```

You'll notice this plot doesn't have anything on it because we haven't 
specified a plot type yet.
To do that, we will add another ggplot layer with `+` which will specify exactly what we want to plot.
A volcano plot is a special kind of scatter plot, so to make that we will want to plot individual points, which we can do with `geom_point()`.

```{r ggplot-points, live = TRUE}
# This first part is the same as before
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p
  )
) +
  # Now we are adding on a layer to specify what kind of plot we want
  geom_point()
```

Here's a brief summary of ggplot2 structure. 
![ggplot2 structure](diagrams/ggplot_structure.png)

### Adjust our ggplot

Now that we have a base plot that shows our data, we can add layers on to it and adjust it.
We can adjust the color of points using the `color` aesthetic.

```{r ggplot-color, live = TRUE}
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  ) # We added this argument to color code the points!
) +
  geom_point()
```

Because we have so many points overlapping one another, we will want to adjust 
the transparency, which we can do with an `alpha` argument. 

```{r ggplot-alpha, live = TRUE}
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) # We are using the `alpha` argument to make our points transparent
```

Notice that we added the alpha within the `geom_point()` function, not to the `aes()`. 
We did this because we want all of the points to have the same level of transparency, and it will not vary depending on any variable in the data.
We can also change the background and appearance of the plot as a whole by adding a `theme`.

```{r ggplot-theme}
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) +
  theme_bw() # Add on this set of appearance presets to make it pretty
```

We are not limited to a single plotting layer. 
For example, if we want to add a horizontal line to indicate a significance cutoff, we can do that with `geom_hline()`.
For now, we will choose the value of 5.5 (that is close to a Bonferroni correction) and add that to the plot.

```{r ggplot-hline, live = TRUE}
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) + 
  geom_hline(yintercept = 5.5, color = "darkgreen") # we can specify colors by names here
```
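
To see where a cutoff near 5.5 could come from: a Bonferroni correction divides the significance level by the number of tests performed. 
The sketch below assumes a significance level of 0.05 and a hypothetical 16,000 genes tested; neither number is taken from this dataset.

```{r}
# Assumed values, for illustration only
alpha_level <- 0.05
n_tests <- 16000 # hypothetical number of genes tested

# Bonferroni-adjusted p value threshold
bonferroni_threshold <- alpha_level / n_tests

# The same threshold on the -log10 scale used for the y-axis
-log10(bonferroni_threshold)
```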

We can change the x and y labels using a few different strategies. 
One approach is to use functions `xlab()` and `ylab()` individually to set, respectively, the x-axis label and the y-axis label.


```{r ggplot-label-1}
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) +
  geom_hline(yintercept = 5.5, color = "darkgreen") +
  theme_bw() +
  # Add labels with separate functions:
  xlab("log2 Fold Change Tumor/Normal") + 
  ylab("-log10 p value") 
```


Alternatively, we can use the `ggplot2` function `labs()`, which takes individual arguments for each label we want to set. 
We can also include the argument `title` to add an overall plot title.

```{r ggplot-label-2, live = TRUE}
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) +
  geom_hline(yintercept = 5.5, color = "darkgreen") +
  theme_bw() +
  # Add x and y labels and overall plot title with arguments to labs():
  labs(
    x = "log2 Fold Change Tumor/Normal",
    y = "-log10 p value",
    title = "Astrocytoma Tumor vs Normal Cerebellum"
  )
  
```

Something great about the `labs()` function is you can also use it to specify labels for your *legends* derived from certain aesthetics. 
In this plot, our legend is derived from a *color aesthetic*, so we can specify the keyword "color" to update the legend title.

```{r ggplot-label-aes}
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) +
  geom_hline(yintercept = 5.5, color = "darkgreen") +
  theme_bw() +
  # Add x and y labels and overall plot title with arguments to labs():
  labs(
    x = "log2 Fold Change Tumor/Normal",
    y = "-log10 p value",
    title = "Astrocytoma Tumor vs Normal Cerebellum",
    # Use the color keyword to label the color legend
    color = "Average expression"
  )
  
```


Use this chunk to make the same kind of plot as the previous chunk, but instead plot the male-female contrast data, which is stored in `male_female_df`. 

```{r mf-volcano, live = TRUE}
# Use this chunk to make the same kind of volcano plot, but with the male-female contrast data.

```


It turns out we don't have to plot each contrast separately. Instead, we can use the original data frame that contains all three contrasts' data, `stats_df`, and add a `facet_wrap()` to make each contrast its own plot. 

```{r ggplot-facets}
ggplot(
  stats_df, # Switch to the bigger data frame with all three contrasts' data
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) +
  geom_hline(yintercept = 5.5, color = "darkgreen") +
  theme_bw() +
  facet_wrap(~contrast) +
  labs(
    x  = "log2 Fold Change", # Now that this includes the other contrasts,
                             # we'll make this label more general
    y = "-log10 p value",
    color = "Average expression"
  ) +
  coord_cartesian(xlim = c(-25, 25)) # zoom in on the x-axis
```

We can store the plot as an object in the global environment by using the `<-` operator. 
Here we will call this `volcano_plot`. 

```{r ggplot-store-object}
volcano_plot <- ggplot(
  stats_df, # We are calling this plot `volcano_plot`
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) +
  geom_hline(yintercept = 5.5, color = "darkgreen") +
  theme_bw() +
  facet_wrap(~contrast) +
  labs(
    x  = "log2 Fold Change",
    y = "-log10 p value",
    color = "Average expression"
  ) +
  coord_cartesian(xlim = c(-25, 25))
```

When we are happy with our plot, we can save the plot using `ggsave`. 

```{r ggsave}
ggsave(
  plot = volcano_plot,
  filename = file.path(plots_dir, "volcano_plot.png")
)
```

### Session Info

```{r}
# Print out the versions and packages we are using in this session
sessionInfo()
```

+
---
title: "Introduction to ggplot2"
author: "CCDL for ALSF" 
date: 2021
output:   
  html_notebook: 
    toc: true
    toc_float: true
---


## Objectives

This notebook will demonstrate how to: 

- Load and use R packages  
- Read in and perform simple manipulations of data frames 
- Use `ggplot2` to plot and visualize data
- Customize plots using features of `ggplot2`

---

We'll use a real gene expression dataset to get comfortable making visualizations using ggplot2. 
We've [performed differential expression analyses](./scripts/00-setup-intro-to-R.R) on a pre-processed [astrocytoma microarray dataset](https://www.refine.bio/experiments/GSE44971/gene-expression-data-from-pilocytic-astrocytoma-tumour-samples-and-normal-cerebellum-controls).
We'll start by making a volcano plot of differential gene expression results from this experiment.
We performed three sets of contrasts:  

1) `sex` category contrasting: `Male` vs `Female`  
2) `tissue` category contrasting: `Pilocytic astrocytoma tumor` samples vs `normal cerebellum` samples  
3) An interaction of both `sex` and `tissue`. 

**More ggplot2 resources:**  

- [ggplot2 website](https://ggplot2.tidyverse.org/)  
- [Handy cheatsheet for ggplot2 (pdf)](https://github.com/rstudio/cheatsheets/raw/main/data-visualization.pdf)
- [_Data Visualization, A practical introduction_](https://socviz.co/)
- [Data visualization chapter of _R for Data Science_](https://r4ds.had.co.nz/data-visualisation.html)  
- [ggplot2 online tutorial](http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html)  

## Set Up

We saved these results to a tab separated values (TSV) file called `gene_results_GSE44971.tsv`.
It's been saved to the `data` folder. 
File paths are relative to where this notebook file (.Rmd) is saved.
So we can reference it later, let's make a variable with our data directory name. 

```{r}
data_dir <- "data"
```

Let's declare our output folder name as its own variable. 

```{r}
plots_dir <- "plots"
```

We can also create a directory if it doesn't already exist. 

```{r createif}
# The if statement here tests whether the plot directory exists and
# only executes the expressions between the braces if it does not.
if (!dir.exists(plots_dir)) {
  dir.create(plots_dir)
}
```
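
The same pattern can be tried safely in a temporary folder. 
In this sketch, `example_dir` is a made-up name, and `tempdir()` is used so nothing is left behind in the project.

```{r}
# Build a path inside R's temporary folder for this session
example_dir <- file.path(tempdir(), "example_plots")

# If the directory does not exist yet, create it
if (!dir.exists(example_dir)) {
  dir.create(example_dir)
}

# The directory exists now
dir.exists(example_dir)

# Running the same test again does nothing, because the condition is FALSE
if (!dir.exists(example_dir)) {
  dir.create(example_dir)
}
```

Wrapping `dir.create()` in the `if` statement avoids the warning that R gives when asked to create a directory that is already there.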

In this notebook we will be using functions from the Tidyverse set of packages, so we need to load in those functions using `library()`.
We could load the individual packages we need one at a time, but it is convenient for now to load them all with the `tidyverse` "package," which groups many of them together as a shortcut.
Keep a look out for where we tell you which individual package different functions come from.

```{r tidyverse}
library(tidyverse)
```

## Read in the differential expression analysis results file

Here we are using a `tidyverse` function `read_tsv()` from the `readr` package.
Like we did in the previous notebook, we will store the resulting data frame as `stats_df`.

```{r read-stats}
# read in the file `gene_results_GSE44971.tsv` from the data directory
stats_df <- read_tsv(file.path(
  data_dir,
  "gene_results_GSE44971.tsv"
))
```

We can take a look at a column individually by using a `$`. 
Note we are using `head()` so the whole thing doesn't print out. 

```{r column}
head(stats_df$contrast)
```

If we want to see a specific set of values, we can use brackets with the indices of the values we'd like returned.

```{r}
stats_df$avg_expression[6:10]
```

Let's look at some basic statistics from the data set using `summary()`.

```{r stats-summary, live = TRUE}
# summary of stats_df
summary(stats_df)
```

The statistics for `contrast` are not very informative, so let's do that again with just the `contrast` column after converting it to a `factor`.

```{r factor-summary, live = TRUE}
# summary of `stats_df$contrast` as a factor
summary(as.factor(stats_df$contrast))
```

## Set up the dataset

Before we make our plot, we want to calculate a set of new values for each row: transformations of the raw statistics in our table.
To do this we will use a function from the `dplyr` package called `mutate()` to make a new column of -log10 p values.

```{r mutate}
# add a `neg_log10_p` column to the data frame
stats_df <- mutate(stats_df, # data frame we'd like to add a variable to
  neg_log10_p = -log10(p_value) # column name and values
)
```
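
To get a feel for the transformation itself, here is a small sketch with a few illustrative p values (not taken from the dataset). 
Smaller p values become larger numbers, which is why the most significant genes end up at the top of a volcano plot.

```{r}
# A few illustrative p values
example_p_values <- c(0.05, 0.001, 0.000001)

# -log10() is vectorized, so every value is transformed at once;
# smaller p values give larger results
-log10(example_p_values)
```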

Let's filter to only `male_female` contrast data. 
First let's try out a logical expression: 

```{r eval = FALSE}
stats_df$contrast == "male_female"
```
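
The sketch below tries the same kind of logical expression on a toy data frame (`toy_df` and its values are invented for this example). 
Keeping the rows where the expression is `TRUE` is, roughly, the row selection that a filtering step performs.

```{r}
# A toy data frame standing in for stats_df
toy_df <- data.frame(
  contrast = c("male_female", "astrocytoma_normal", "male_female"),
  p_value = c(0.2, 0.001, 0.04)
)

# One TRUE/FALSE per row
is_male_female <- toy_df$contrast == "male_female"
is_male_female

# Count the matching rows
sum(is_male_female)

# Base R row subsetting with the same logical vector
toy_df[is_male_female, ]
```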

Now we can try out the `filter()` function.
Notice that we are not assigning the results to a variable, so this filtered dataset will not be saved to the environment.

```{r filter, live = TRUE}
# filter stats_df to "male_female" only
filter(stats_df, contrast == "male_female")
```

Now we can assign the results to a new data frame: `male_female_df`. 

```{r filter-save, live = TRUE}
# filter and save to male_female_df
male_female_df <- filter(stats_df, contrast == "male_female")
```

## Plotting this data

Let's make a volcano plot with this data. 
First let's take a look at only the tumor vs. normal comparison. 
Let's save this as a separate data frame by assigning it a new name. 

```{r filter-tumor}
tumor_normal_df <- filter(stats_df, contrast == "astrocytoma_normal")
```

To make this plot we will be using functions from the `ggplot2` package, the main plotting package of the tidyverse.
We use the first function, `ggplot()` to define the data that will be plotted.
Remember, the name of this package is `ggplot2`, but the function we use is called `ggplot()` without the `2`.
`ggplot()` takes two main arguments:  

1. `data`, which is the data frame that contains the data we want to plot.  
2. `mapping`, which is a special list made with the `aes()` function to describe which values will be used for each **aes**thetic component of the plot, such as the x and y coordinates of each point. 
(If you find calling things like the x and y coordinates "aesthetics" confusing, don't worry, you are not alone.)  
Specifically, the `aes()` function is used to specify that a given column (variable) in your data frame be mapped to a given aesthetic component of the plot.


```{r ggplot-base}
ggplot(
  tumor_normal_df, # This first argument is the data frame with the data we want to plot
  aes(
    x = log_fold_change, # This is the column name of the values we want to use
    # for the x coordinates
    y = neg_log10_p # This is the column name of the data we want for the y-axis
  )
)
```

You'll notice this plot doesn't have anything on it because we haven't 
specified a plot type yet.
To do that, we will add another ggplot layer with `+` which will specify exactly what we want to plot.
A volcano plot is a special kind of scatter plot, so to make that we will want to plot individual points, which we can do with `geom_point()`.

```{r ggplot-points, live = TRUE}
# This first part is the same as before
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p
  )
) +
  # Now we are adding on a layer to specify what kind of plot we want
  geom_point()
```

Here's a brief summary of ggplot2 structure. 
![ggplot2 structure](diagrams/ggplot_structure.png)

### Adjust our ggplot

Now that we have a base plot that shows our data, we can add layers on to it and adjust it.
We can adjust the color of points using the `color` aesthetic.

```{r ggplot-color, live = TRUE}
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression # We added this argument to color code the points!
  )
) +
  geom_point()
```

Because we have so many points overlapping one another, we will want to adjust 
the transparency, which we can do with an `alpha` argument. 

```{r ggplot-alpha, live = TRUE}
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) # We are using the `alpha` argument to make our points transparent
```

Notice that we added the alpha within the `geom_point()` function, not to the `aes()`. 
We did this because we want all of the points to have the same level of transparency, rather than having it vary with a variable in the data.
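
The same distinction applies to other aesthetics such as color.
As a sketch on a made-up data frame (`toy_df` and its `group` column are hypothetical), a value given inside `aes()` is mapped from a column, while a value given directly to `geom_point()` is a constant applied to every point:

```r
library(ggplot2)

# A made-up data frame for illustration only
toy_df <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(2, 1, 4, 3),
  group = c("a", "a", "b", "b")
)

# Mapped: color varies with the `group` column and a legend is drawn
mapped_plot <- ggplot(toy_df, aes(x = x, y = y, color = group)) +
  geom_point()

# Set: every point gets the same color and transparency, and no legend is drawn
constant_plot <- ggplot(toy_df, aes(x = x, y = y)) +
  geom_point(color = "darkgreen", alpha = 0.5)

# layer_data() returns the values used to draw a layer;
# count the distinct point colors in each plot
n_mapped_colors <- length(unique(layer_data(mapped_plot)$colour))
n_constant_colors <- length(unique(layer_data(constant_plot)$colour))
```
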
We can also change the background and appearance of the plot as a whole by adding a `theme`.

```{r ggplot-theme}
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) +
  theme_bw() # Add on this set of appearance presets to make it pretty
```

We are not limited to a single plotting layer. 
For example, if we want to add a horizontal line to indicate a significance cutoff, we can do that with `geom_hline()`.
For now, we will choose a value of 5.5 on the -log10 scale (close to a Bonferroni-corrected threshold) and add that to the plot.

```{r ggplot-hline, live = TRUE}
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) + 
  geom_hline(yintercept = 5.5, color = "darkgreen") # we can specify colors by names here
```
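
To see where a value like 5.5 might come from: a Bonferroni correction divides the significance level by the number of tests, and the cutoff on the plot is the negative log10 of that value.
The number of genes tested is not stated here, so the sketch below assumes a hypothetical 15,000 tests and a significance level of 0.05.

```r
# Assumed values for illustration only
significance_level <- 0.05
n_tests <- 15000 # hypothetical number of genes tested

# Bonferroni-corrected p value threshold
bonferroni_p <- significance_level / n_tests

# The same threshold on the -log10 scale used for the y-axis
bonferroni_cutoff <- -log10(bonferroni_p)
round(bonferroni_cutoff, 2)
```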

We can change the x and y labels using a few different strategies. 
One approach is to use the functions `xlab()` and `ylab()` individually to set, respectively, the x-axis label and the y-axis label.


```{r ggplot-label-1}
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) +
  geom_hline(yintercept = 5.5, color = "darkgreen") +
  theme_bw() +
  # Add labels with separate functions:
  xlab("log2 Fold Change Tumor/Normal") + 
  ylab("-log10 p value") 
```


Alternatively, we can use the `ggplot2` function `labs()`, which takes individual arguments for each label we want to set. 
We can also include the argument `title` to add an overall plot title.

```{r ggplot-label-2, live = TRUE}
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) +
  geom_hline(yintercept = 5.5, color = "darkgreen") +
  theme_bw() +
  # Add x and y labels and overall plot title with arguments to labs():
  labs(
    x = "log2 Fold Change Tumor/Normal",
    y = "-log10 p value",
    title = "Astrocytoma Tumor vs Normal Cerebellum"
  )
  
```

Something great about the `labs()` function is you can also use it to specify labels for your *legends* derived from certain aesthetics. 
In this plot, our legend is derived from a *color aesthetic*, so we can specify the keyword "color" to update the legend title.

```{r ggplot-label-aes}
ggplot(
  tumor_normal_df,
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) +
  geom_hline(yintercept = 5.5, color = "darkgreen") +
  theme_bw() +
  # Add x and y labels and overall plot title with arguments to labs():
  labs(
    x = "log2 Fold Change Tumor/Normal",
    y = "-log10 p value",
    title = "Astrocytoma Tumor vs Normal Cerebellum",
    # Use the color keyword to label the color legend
    color = "Average expression"
  )
  
```


Use this chunk to make the same kind of plot as the previous chunk, but this time plot the male-female contrast data, which is stored in `male_female_df`. 

```{r mf-volcano, live = TRUE}
# Use this chunk to make the same kind of volcano plot, but with the male-female contrast data.

```


It turns out we don't have to plot each contrast separately. Instead, we can use the original data frame that contains all three contrasts' data, `stats_df`, and add `facet_wrap()` to make each contrast its own panel. 

```{r ggplot-facets}
ggplot(
  stats_df, # Switch to the bigger data frame with all three contrasts' data
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) +
  geom_hline(yintercept = 5.5, color = "darkgreen") +
  theme_bw() +
  facet_wrap(~contrast) +
  labs(
    x  = "log2 Fold Change", # Now that this includes the other contrasts,
                             # we'll make this label more general
    y = "-log10 p value",
    color = "Average expression"
  ) +
  coord_cartesian(xlim = c(-25, 25)) # zoom in on the x-axis
```
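
`coord_cartesian()` zooms the view without discarding data, whereas setting limits with `xlim()` treats points outside the range as missing before plotting.
A minimal sketch of the difference, using a made-up data frame (`toy_df` is hypothetical):

```r
library(ggplot2)

# A made-up data frame with two extreme x values
toy_df <- data.frame(
  log_fold_change = c(-40, -5, 0, 5, 40),
  neg_log10_p = c(1, 2, 3, 2, 1)
)

toy_plot <- ggplot(toy_df, aes(x = log_fold_change, y = neg_log10_p)) +
  geom_point()

# Zoom only: all five points are still part of the plot data
zoomed_plot <- toy_plot + coord_cartesian(xlim = c(-25, 25))

# Scale limits: the two points outside the range are treated as missing
limited_plot <- toy_plot + xlim(-25, 25)

# Count the points with usable x values in each version
n_zoomed <- sum(is.finite(layer_data(zoomed_plot)$x))
n_limited <- sum(is.finite(layer_data(limited_plot)$x))
```

For a scatter plot the two versions look the same, but the difference matters once a layer summarizes the data, because the summary would be computed without the excluded points.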

We can store the plot as an object in the global environment by using the `<-` operator. 
Here we will call this `volcano_plot`. 

```{r ggplot-store-object}
volcano_plot <- ggplot(
  stats_df, # We are calling this plot `volcano_plot`
  aes(
    x = log_fold_change,
    y = neg_log10_p,
    color = avg_expression
  )
) +
  geom_point(alpha = 0.2) +
  geom_hline(yintercept = 5.5, color = "darkgreen") +
  theme_bw() +
  facet_wrap(~contrast) +
  labs(
    x  = "log2 Fold Change",
    y = "-log10 p value",
    color = "Average expression"
  ) +
  coord_cartesian(xlim = c(-25, 25))
```
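
A stored plot can be printed by typing its name, and more layers or labels can be added to it with `+` without repeating the original code.
A short sketch with a made-up data frame (`toy_df`, `toy_plot`, and `titled_plot` are hypothetical names):

```r
library(ggplot2)

# A made-up data frame for illustration only
toy_df <- data.frame(
  log_fold_change = c(-2, -0.5, 0.3, 1.8),
  neg_log10_p = c(4.1, 0.7, 0.2, 3.5)
)

toy_plot <- ggplot(toy_df, aes(x = log_fold_change, y = neg_log10_p)) +
  geom_point()

# Typing the name of the object prints the plot
toy_plot

# Adding to the stored object makes a new plot; `toy_plot` itself is unchanged
titled_plot <- toy_plot + labs(title = "Toy volcano plot")
titled_plot
```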

When we are happy with our plot, we can save it using `ggsave()`. 
It's a good idea to also specify `width` and `height` arguments (units in inches)
to ensure the saved plot is always the same size every time you run this code.
Here, we'll save a 6"x6" plot.


```{r ggsave}
ggsave(
  plot = volcano_plot,
  filename = file.path(plots_dir, "volcano_plot.png"),
  width = 6, 
  height = 6 
)
```
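
`ggsave()` chooses the output format from the file extension, and its `units` argument can be used if you prefer to give the size in something other than inches.
The sketch below saves a made-up plot to a temporary file (`toy_plot` and the temporary path are for illustration only and are not part of this notebook's outputs):

```r
library(ggplot2)

# A made-up plot for illustration only
toy_plot <- ggplot(
  data.frame(x = c(1, 2, 3), y = c(3, 1, 2)),
  aes(x = x, y = y)
) +
  geom_point()

# A temporary file so this example does not write into the project
output_file <- tempfile(fileext = ".png")

ggsave(
  filename = output_file,
  plot = toy_plot,
  width = 15,
  height = 15,
  units = "cm" # one of "in", "cm", "mm", or "px"
)

# Confirm that the file was written
file.exists(output_file)
```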

### Session Info

```{r}
# Print out the versions and packages we are using in this session
sessionInfo()
```

diff --git a/scRNA-seq/04-dimension_reduction_scRNA.nb.html b/scRNA-seq/04-dimension_reduction_scRNA.nb.html index dd3c70e7..8ca700e4 100644 --- a/scRNA-seq/04-dimension_reduction_scRNA.nb.html +++ b/scRNA-seq/04-dimension_reduction_scRNA.nb.html @@ -3715,7 +3715,7 @@

UMAP experiments

UMAP_plot_wrapper(nn_param = 3) -

+

@@ -3788,7 +3788,7 @@

t-SNE comparison

plotReducedDim(normalized_sce, "TSNE", colour_by = "detected") -

+