reworking 07 (readr and pivoting)
Juan Caballero committed Apr 4, 2024
1 parent 91f5c8a commit b863f55
Showing 1 changed file with 121 additions and 54 deletions.
175 changes: 121 additions & 54 deletions qmd/07_DataImport.qmd
@@ -50,7 +50,23 @@ library(tidyverse)

- The code from some slides depends on the previous slides!

- You can execute each line individually using Command-Enter on Mac, Alt-Enter on Workbench.

## Our example dataset

Blackmore S, et al. *Influenza infection triggers disease in a genetic model of experimental autoimmune encephalomyelitis*. PNAS 2017, 114(30):E6107-E6116. PMID: 28696309

Design: Gender-matched, eight-week-old C57BL/6 mice were inoculated intranasally with saline or with Influenza A (Puerto Rico/8/34; PR8, 1.0 HAU), and transcriptomic changes in the cerebellum and spinal cord tissues were evaluated by RNA-seq (HiSeq 2500, 100 bp paired-end reads) at days 0 (non-infected), 4 and 8.

An extract of the data is in `https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/qmd/data/rnaseq.csv` (we'll take a closer look at it later).

The table is a combined table for gene expression (gene counts) and sample metadata, with columns:
"gene","sample","expression","organism","age","sex","infection","strain","time","tissue","mouse","ENTREZID","product","ensembl_gene_id","external_synonym","chromosome_name","gene_biotype","phenotype_description","hsapiens_homolog_associated_gene_name"

But for this session, we'll use only:

- `https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/qmd/data/rnaseq_counts_wide.csv`
- `https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/qmd/data/rnaseq_counts_long.csv`

## Data Import in the Tidyverse

@@ -62,50 +78,25 @@ Note:

- MS Excel files can be read and/or written but **should be avoided**

## Formats {auto-animate="true"}
## Table formats {auto-animate="true"}

CSV (*comma separated values*)

```
label1,label2,label3,"label num 4"<NL>
value1,value2,value3,"value num 4"<NL>
```

```
sample,sex,age,treatment,response
A001,M,8,KO,5200
A002,M,4,WT,4430
A003,F,4,KO,344
B001,F,6,WT,2328
```
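
As a rough sketch of how such a table is parsed, the sample CSV above can be handed to `read_csv()` as literal text. This assumes readr 2.0 or later, where wrapping a string in `I()` marks it as data rather than a file path; the object names `csv_text` and `samples` are only illustrative.

```{r}
#| echo: true
library(readr)

# Literal CSV text, identical to the example table above
csv_text = "sample,sex,age,treatment,response
A001,M,8,KO,5200
A002,M,4,WT,4430
A003,F,4,KO,344
B001,F,6,WT,2328"

# I() tells readr that this is data, not a file name (assumes readr >= 2.0)
samples = read_csv(I(csv_text), show_col_types = FALSE)

# The header line becomes the column names; each following line is one row
samples
```

The first line is used as the header by default, and readr guesses a type for each column from its values.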
## Formats {auto-animate="true"}

TSV (*tab separated values*)

```
label1<TAB>label2<TAB>label3<TAB>"label num 4"<NL>
value1<TAB>value2<TAB>value3<TAB>"value num 4"<NL>
```

```
sample\tsex\tage\ttreatment\tresponse
A001\tM\t8\tKO\t5200
A002\tM\t4\tWT\t4430
A003\tF\t4\tKO\t344
B001\tF\t6\tWT\t2328
```
## Formats {auto-animate="true"}

DELIM (*char separated values*)

```
label1;label2;label3;"label num 4"$value1;value2;value3;"value num 4"//
```

```
sample;sex;age;treatment;response$A001;M;8;KO;5200$A002;M;4;WT;4430$A003;F;4;KO;344$B001;F;6;WT;2328//
```
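
`read_delim()` takes the field separator through `delim`, but it expects each record to end with a newline. One possible way to read the single-line example above, sketched here with base R string functions, is to convert the `$` record terminator into newlines first. The names `raw_text`, `clean_text` and `samples` are illustrative, and `I()` again assumes readr 2.0 or later.

```{r}
#| echo: true
library(readr)

# The DELIM example from above: ";" between fields, "$" between records, "//" at the end
raw_text = "sample;sex;age;treatment;response$A001;M;8;KO;5200$A002;M;4;WT;4430$A003;F;4;KO;344$B001;F;6;WT;2328//"

# Drop the trailing "//" marker, then turn every "$" into a newline
clean_text = sub("//", "", raw_text, fixed = TRUE)
clean_text = gsub("$", "\n", clean_text, fixed = TRUE)

# Now the text is an ordinary semicolon-delimited table
samples = read_delim(I(clean_text), delim = ";", show_col_types = FALSE)
samples
```

Using `fixed = TRUE` matters here because `$` would otherwise be interpreted as a regular-expression anchor.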


## `readr::` {auto-animate="true"}

@@ -119,6 +110,7 @@ sample;sex;age;treatment;response$A001;M;8;KO;5200$A002;M;4;WT;4430$A003;F;4;KO;

## `readr::` in action {auto-animate="true"}

Example commands (do not run):

```{r}
#| echo: true
#| eval: false
@@ -135,9 +127,10 @@ dat = read_csv("https://server.com/region/file.csv")

## `readr::` in action {auto-animate="true"}

Loading the real data:

```{r}
#| echo: true
rnaseq_file = "https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/qmd/data/rnaseq.csv"
rnaseq_file = "https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/qmd/data/rnaseq_counts_wide.csv"
rna = read_csv(rnaseq_file)
@@ -154,70 +147,144 @@ What is this "*rna*"?
- Tibbles can be grouped
- You can see the types of each column in a tibble

## Quick exercise {auto-animate="true"}
## What is Tidy Data?

- In tidy data, each row is an observation and each column is a different variable (*long format*).
- In wide data, each row holds several observations, and the column names are themselves values of a variable, such as sample IDs (*wide format*).

![](images/tidy_data.png)

```
## Hands on

- Get the basic statistics for each sample in `rna`.
- Which sample has the highest mean expression?

Explore the RNAseq data as a tibble.
## `tidyr::` in action - `pivot_longer()` {auto-animate="true"}

To transform the data into long format we use `pivot_longer()`, which takes as inputs:

1. `data`: the data to be transformed;
2. `names_to`: the name of the new column to create and fill with the current column names;
3. `values_to`: the name of the new column to create and fill with the current values;
4. `cols`: the columns to pivot, i.e. those whose names go into `names_to` and whose values go into `values_to` (or, with `-`, the columns to leave out).

## `tidyr::` in action - `pivot_longer()` {auto-animate="true"}

```{r}
#| echo: true
head(rna)
rna_long = pivot_longer(
rna,
names_to = "sample",
values_to = "expression",
-gene)
rna_long
```
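
To see what these inputs do on something small enough to check by eye, here is a sketch on a toy table. The gene and sample names (`geneA`, `s1`, ...) and the values are made up and are not taken from the dataset.

```{r}
#| echo: true
library(tidyr)
library(tibble)

# Toy wide table: one row per gene, one column per sample (made-up values)
toy_wide = tibble(
  gene = c("geneA", "geneB"),
  s1 = c(10, 0),
  s2 = c(12, 3),
  s3 = c(9, 1))

# Pivot every column except `gene`
toy_long = pivot_longer(
  toy_wide,
  cols = -gene,
  names_to = "sample",
  values_to = "expression")

# 2 genes x 3 samples give 6 rows, one per gene-sample observation
toy_long
```

The number of rows in the long table is the number of rows in the wide table times the number of pivoted columns.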

## Quick exercise {auto-animate="true"}
## `tidyr::` in action - `pivot_longer()` {auto-animate="true"}

![](images/pivot_longer.png)

## `tidyr::` in action - `pivot_longer()` {auto-animate="true"}

Explore the RNAseq data as a tibble.
Column selection can be defined with patterns or ranges

```{r}
#| echo: true
tail(rna)
rna_long2 = pivot_longer(
rna,
names_to = "sample",
values_to = "expression",
cols = starts_with("GSM"))
```

## Quick exercise {auto-animate="true"}
## `tidyr::` in action - `pivot_longer()` {auto-animate="true"}

Explore the RNAseq data as a tibble.
Column selection can also be defined with a range of column names

```{r}
#| echo: true
dim(rna)
rna_long3 = pivot_longer(
rna,
names_to = "sample",
values_to = "expression",
GSM2545336:GSM2545380)
```
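
The three ways of selecting columns (exclusion, pattern, range) should give the same long table whenever they pick the same columns. A small sketch to check this, using a toy table with hypothetical sample names `GSM01` to `GSM03`:

```{r}
#| echo: true
library(tidyr)
library(tibble)

# Toy wide table with made-up sample IDs and values
toy_wide = tibble(
  gene = c("geneA", "geneB"),
  GSM01 = c(10, 0),
  GSM02 = c(12, 3),
  GSM03 = c(9, 1))

# 1. Everything except `gene`
by_exclusion = pivot_longer(toy_wide, cols = -gene,
                            names_to = "sample", values_to = "expression")
# 2. Columns whose name starts with "GSM"
by_pattern = pivot_longer(toy_wide, cols = starts_with("GSM"),
                          names_to = "sample", values_to = "expression")
# 3. A range from the first to the last sample column
by_range = pivot_longer(toy_wide, cols = GSM01:GSM03,
                        names_to = "sample", values_to = "expression")

# Compare the three results
all.equal(by_exclusion, by_pattern)
all.equal(by_exclusion, by_range)
```

A range depends on the column order in the file, so a pattern or an exclusion is usually the more robust choice when columns may be reordered.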

## `readxl::` {auto-animate="true"}
## `tidyr::` in action - `pivot_wider()` {auto-animate="true"}

- `readxl::` is the Tidyverse library for reading data from Excel formats.
- `read_excel()`, `read_xls()` and `read_xlsx()` are some of the functions provided
- The `excel_sheets()` function yields the names of the sheets in the Excel file. These names are passed to the `sheet` argument for the **readxl** functions
- The `read_lines()` function (from **readr**) reads a text file line by line; set `n_max` to peek at only the first few lines.
The inverse operation, `pivot_wider()`, transforms long format into wide format.

It takes three main arguments:

1. `data`: the data to be transformed

2. `names_from`: the column whose values will become the new column names

## `readxl::` in action {auto-animate="true"}
3. `values_from`: the column whose values will fill the new columns

## `tidyr::` in action - `pivot_wider()` {auto-animate="true"}

```{r}
#| echo: true
library(readxl)
full_url = "https://github.com/maxplanck-ie/Rintro/raw/2024.04/qmd/data/2010_bigfive_regents.xls"
rna_wide = pivot_wider(
rna_long,
names_from = sample,
values_from = expression)
download.file(url=full_url, destfile="2010_bigfive_regents.xls")
rna_wide
excel_sheets("2010_bigfive_regents.xls")
```
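
A minimal round trip on a toy table (made-up gene and sample names) may help to see that `pivot_wider()` undoes `pivot_longer()`:

```{r}
#| echo: true
library(tidyr)
library(tibble)

# Toy long table: one row per gene-sample pair (made-up values)
toy_long = tibble(
  gene = c("geneA", "geneA", "geneB", "geneB"),
  sample = c("s1", "s2", "s1", "s2"),
  expression = c(10, 12, 0, 3))

# Long to wide: the values of `sample` become column names
toy_wide = pivot_wider(
  toy_long,
  names_from = sample,
  values_from = expression)
toy_wide

# Wide back to long: the same table as before
toy_back = pivot_longer(
  toy_wide,
  cols = -gene,
  names_to = "sample",
  values_to = "expression")
all.equal(toy_long, toy_back)
```

Note that `names_from` and `values_from` refer to existing columns, so they are written without quotes, whereas `names_to` and `values_to` create new columns and take strings.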

`excel_sheets()` lists the sheets in the Excel file
## `tidyr::` in action - `pivot_wider()` {auto-animate="true"}

![](images/pivot_wider.png)

## `tidyr::` in action - `pivot_wider()` {auto-animate="true"}

By default, gene-sample combinations that are missing from the long table are filled with `NA`; we can change this with `values_fill`

```{r}
#| echo: true
rna_wide_noNAs = pivot_wider(
rna_long,
names_from = sample,
values_from = expression,
values_fill = 0)
## `readxl::` in action {auto-animate="true"}
rna_wide_noNAs
Loading the data in "Sheet1"
```
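
The effect of `values_fill` only shows up when a combination is actually absent from the long table. A sketch with a toy table (made-up names) in which `geneB` has no row for sample `s2`:

```{r}
#| echo: true
library(tidyr)
library(tibble)

# Toy long table: the pair (geneB, s2) is missing on purpose
toy_long = tibble(
  gene = c("geneA", "geneA", "geneB"),
  sample = c("s1", "s2", "s1"),
  expression = c(10, 12, 7))

# Default behaviour: the missing cell becomes NA
with_na = pivot_wider(toy_long, names_from = sample, values_from = expression)

# With values_fill: the missing cell becomes 0
with_zero = pivot_wider(toy_long, names_from = sample, values_from = expression,
                        values_fill = 0)

with_na
with_zero
```

Filling with `0` states that the gene had zero counts in that sample, which is not the same as "not measured", so the choice depends on what a missing row means in the data at hand.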

## `readr::` in action - `write_csv()` {auto-animate="true"}

Finally, we may need to save our data to a new file for later use or sharing; for this we can use `write_csv()`

```{r}
#| echo: true
sheet_one = read_xls("2010_bigfive_regents.xls", sheet = "Sheet1")
sheet_one
write_csv(rna_wide_noNAs, file = "rna_wide.csv")
```
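
To check what ends up on disk, one option is to write a small table to a temporary file and read it back. This is only a sketch: the table is made up, and `tempfile()` is used so that nothing is left in the working directory.

```{r}
#| echo: true
library(readr)
library(tibble)

# Toy table with made-up values
toy = tibble(
  gene = c("geneA", "geneB"),
  s1 = c(10, 0),
  s2 = c(12, 3))

# Write to a temporary file instead of the project folder
out_file = tempfile(fileext = ".csv")
write_csv(toy, file = out_file)

# Look at the raw text that was written
read_lines(out_file)

# Read it back as a tibble
toy_back = read_csv(out_file, show_col_types = FALSE)
toy_back
```

`write_csv()` writes the column names as the first line and does not add row names, so the file can be read back directly with `read_csv()`.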

## `readxl::` {auto-animate="true"}

- `read_xls()` is for xls files, `read_xlsx` for XML-based xlsx files
- you can use `read_excel()` if you are unsure of the format
- `readxl::` is the Tidyverse library for reading data from Excel formats.
- `read_excel()`, `read_xls()` and `read_xlsx()` are some of the functions provided
- The `excel_sheets()` function yields the names of the sheets in the Excel file. These names are passed to the `sheet` argument for the **readxl** functions
- The `read_lines()` function (from **readr**) reads a text file line by line; set `n_max` to peek at only the first few lines.
- But please, **don't use Excel**
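
If an Excel file cannot be avoided, the workflow could look like the sketch below. It uses `datasets.xlsx`, an example workbook that ships with the **readxl** package, located through `readxl_example()`; the object names are illustrative.

```{r}
#| echo: true
library(readxl)

# Path to an example workbook bundled with readxl
example_file = readxl_example("datasets.xlsx")

# List the sheet names in the workbook
excel_sheets(example_file)

# Read one sheet by name; read_excel() detects xls vs xlsx itself
mtcars_sheet = read_excel(example_file, sheet = "mtcars")
head(mtcars_sheet)
```

The result is a tibble, so everything shown for `read_csv()` output applies to it as well.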

## Any questions?
