reworking 07 (readr and pivoting)

maxplanck-ie · Apr 4, 2024 · b863f55
1 parent 91f5c8a
commit b863f55
Showing 1 changed file with 121 additions and 54 deletions.
diff --git a/qmd/07_DataImport.qmd b/qmd/07_DataImport.qmd
@@ -50,7 +50,23 @@ library(tidyverse)
 
 - The code from some slides depends on the previous slides! 
 
-- You can execute each line individually using Command-Enter on Mac, alt-Enter on Workbench. 
+- You can execute each line individually using Command-Enter on Mac, alt-Enter on Workbench.
+
+## Our example dataset
+
+Blackmore S, et al. *Influenza infection triggers disease in a genetic model of experimental autoimmune encephalomyelitis*. PNAS 2017, 114(30):E6107-E6116. PMID: 28696309
+
+Design: Gender-matched, eight-week-old C57BL/6 mice were inoculated with saline or with Influenza A (Puerto Rico/8/34; PR8, 1.0 HAU) by the intranasal route, and transcriptomic changes
+in the cerebellum and spinal cord tissues were evaluated by RNA-seq (HiSeq 2500, 100 bp paired-end reads) at days 0 (non-infected), 4 and 8.
+
+An extract of the data is in `https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/qmd/data/rnaseq.csv` (we'll take a closer look later)
+
+The table is a combined table for gene expression (gene counts) and sample metadata, with columns:
+"gene","sample","expression","organism","age","sex","infection","strain","time","tissue","mouse","ENTREZID","product","ensembl_gene_id","external_synonym","chromosome_name","gene_biotype","phenotype_description","hsapiens_homolog_associated_gene_name"
+
+But for this session, we'll use only:
+
+- `https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/qmd/data/rnaseq_counts_wide.csv`
+- `https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/qmd/data/rnaseq_counts_long.csv`
 
 ## Data Import in the Tidyverse
 
@@ -62,50 +78,25 @@ Note:
 
 - MS Excel files can be read and/or write but **should be avoided**
 
-## Formats {auto-animate="true"}
+## Table formats {auto-animate="true"}
 
 CSV (*comma separated values*)
-
-```
-label1,label2,label3,"label num 4"<NL>
-value1,value2,value3,"value num 4"<NL>
-```
-
 ```
 sample,sex,age,treatment,response
 A001,M,8,KO,5200
 A002,M,4,WT,4430
 A003,F,4,KO,344
 B001,F,6,WT,2328
 ```
-## Formats {auto-animate="true"}
 
 TSV (*tab separated values*)
-
-```
-label1<TAB>label2<TAB>label3<TAB>"label num 4"<NL>
-value1<TAB>value2<TAB>value3<TAB>"value num 4"<NL>
-```
-
 ```
 sample\tsex\tage\ttreatment\tresponse
 A001\tM\t8\tKO\t5200
 A002\tM\t4\tWT\t4430
 A003\tF\t4\tKO\t344
 B001\tF\t6t\tWT\t2328
 ```
-## Formats {auto-animate="true"}
-
-DELIM (*char separated values*)
-
-```
-label1;label2;label3;"label num 4"$value1;value2;value3;"value num 4"//
-```
-
-```
-sample;sex;age;treatment;response$A001;M;8;KO;5200$A002;M;4;WT;4430$A003;F;4;KO;344$B001;F;6;WT;2328//
-```
-
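+
+## Table formats {auto-animate="true"}
+
+A minimal sketch of how the two formats relate, assuming readr 2.0 or later (where `I()` marks a string as literal data). The text below is a shortened copy of the toy table above, typed in by hand; `csv_text`, `tsv_text`, `from_csv` and `from_tsv` are illustrative names, not course objects.
+
+```{r}
+#| echo: true
+library(tidyverse)  # already loaded at the top of these slides
+
+# Toy table typed in as literal text (not one of the course files)
+csv_text = "sample,sex,age,treatment,response\nA001,M,8,KO,5200\nA002,M,4,WT,4430"
+
+# Same content with tabs instead of commas
+tsv_text = gsub(",", "\t", csv_text)
+
+# I() tells readr that the string is the data itself, not a file path
+from_csv = read_csv(I(csv_text), show_col_types = FALSE)
+from_tsv = read_tsv(I(tsv_text), show_col_types = FALSE)
+
+# Both separators describe the same columns and values
+identical(names(from_csv), names(from_tsv))
+identical(from_csv$response, from_tsv$response)
+```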
 
 ## `readr::` {auto-animate="true"}
 
@@ -119,6 +110,7 @@ sample;sex;age;treatment;response$A001;M;8;KO;5200$A002;M;4;WT;4430$A003;F;4;KO;
 
 ## `readr::` in action {auto-animate="true"}
 
+Example commands, don't run
 ```{r}
 #| echo: true
 #| eval: false
@@ -135,9 +127,10 @@ dat = read_csv("https://server.com/region/file.csv")
 
 ## `readr::` in action {auto-animate="true"}
 
+Real data load
 ```{r}
 #| echo: true
-rnaseq_file = "https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/qmd/data/rnaseq.csv"
+rnaseq_file = "https://raw.githubusercontent.com/maxplanck-ie/Rintro/2024.04/qmd/data/rnaseq_counts_wide.csv"
 
 rna = read_csv(rnaseq_file)
 
@@ -154,70 +147,144 @@ What is this "*rna*"?
 -   Tibbles can be grouped
 -   You can see the types of each column in a tibble
 
-## Quick exercise {auto-animate="true"}
+## What is Tidy Data? 
+
+- In tidy data, each row is an observation and each column is a different variable (*long-format*). 
+- In wide data, each row contains several observations, and the column names themselves hold values of a variable (*wide-format*).
+
+![](images/tidy_data.png)
+
+
+## Hands on
+
+Get the basic statistics for each sample in `rna`
+
+Which sample has the highest mean expression?
 
-Explore the RNAseq data as a tibble.
+## Tidyr in action - `pivot_longer()` {auto-animate="true"}
+
+To transform the data into long format we use `pivot_longer()`. It takes as inputs:
+
+1. the `data` to be transformed;
+2. `names_to`: the name of the new column to create and populate with the current column names;
+3. `values_to`: the name of the new column to create and populate with the current values;
+4. `cols`: the columns to pivot into `names_to` and `values_to` (or, prefixed with `-`, the columns to leave out).
+
+## Tidyr in action - `pivot_longer()` {auto-animate="true"}
 
 ```{r}
 #| echo: true
-head(rna)
+
+rna_long = pivot_longer(
+                 rna,
+                 names_to = "sample",
+                 values_to = "expression",
+                 -gene)
+
+rna_long
+
 ```
 
-## Quick exercise {auto-animate="true"}
+## Tidyr in action - `pivot_longer()` {auto-animate="true"}
+
+![](images/pivot_longer.png)
+
+## Tidyr in action - `pivot_longer()` {auto-animate="true"}
 
-Explore the RNAseq data as a tibble.
+Columns can be selected with a pattern, for example `starts_with()`
 
 ```{r}
 #| echo: true
-tail(rna)
+
+rna_long2 = pivot_longer(
+                 rna,
+                 names_to = "sample",
+                 values_to = "expression",
+                 cols = starts_with("GSM"))
+
 ```
 
-## Quick exercise {auto-animate="true"}
+## Tidyr in action - `pivot_longer()` {auto-animate="true"}
 
-Explore the RNAseq data as a tibble.
+Columns can also be selected with a range, written as `first:last`
 
 ```{r}
 #| echo: true
-dim(rna)
+
+rna_long3 = pivot_longer(
+                 rna, 
+                 names_to = "sample",
+                 values_to = "expression",
+                 GSM2545336:GSM2545380)
+
 ```
 
-## `readxl::` {auto-animate="true"}
+## Tidyr in action - `pivot_wider()` {auto-animate="true"}
 
--   `readxl::` is the Tidyverse library for reading data from Excel formats.
--   `read_excel()`, `read_xls()` and `read_xlsx()` are some of the functions provided
--   The `excel_sheets()` function yields the names of the sheets in the Excel file. These names are passed to the `sheet` argument for the **readxl** functions
--   The `read_lines()` function shows the first few lines of a file in R.
+The inverse operation, `pivot_wider()`, transforms long format into wide format.
+
+It takes three main arguments:
+
+1. the `data` to be transformed
+
+2. `names_from`: the column whose values will become the new column names
 
-## `readxl::` in action {auto-animate="true"}
+3. `values_from`: the column whose values will fill the new columns
+
+## Tidyr in action - `pivot_wider()` {auto-animate="true"}
 
 ```{r}
 #| echo: true
-library(readxl)
 
-full_url = "https://github.com/maxplanck-ie/Rintro/raw/2024.04/qmd/data/2010_bigfive_regents.xls"
+rna_wide = pivot_wider(
+                rna_long,
+                names_from = sample,
+                values_from = expression)
 
-download.file(url=full_url, destfile="2010_bigfive_regents.xls")
+rna_wide
 
-excel_sheets("2010_bigfive_regents.xls")
 ```
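+
+## Tidyr in action - round trip {auto-animate="true"}
+
+A minimal sketch on a hypothetical toy table (`toy_wide`, `s1` and `s2` are made-up names, not part of the course data): pivoting longer and then wider again should give back the table we started with.
+
+```{r}
+#| echo: true
+library(tidyverse)  # already loaded at the top of these slides
+
+# Hypothetical toy table: one row per gene, one column per sample
+toy_wide = tibble(
+  gene = c("geneA", "geneB"),
+  s1 = c(10, 0),
+  s2 = c(25, 3))
+
+# Wide to long: sample names move into a column, counts into another
+toy_long = pivot_longer(
+  toy_wide,
+  cols = -gene,
+  names_to = "sample",
+  values_to = "expression")
+
+# Long to wide: the inverse operation
+toy_back = pivot_wider(
+  toy_long,
+  names_from = sample,
+  values_from = expression)
+
+# Compare the columns of the original and the round-tripped table
+identical(names(toy_wide), names(toy_back))
+identical(toy_wide$s2, toy_back$s2)
+```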
 
-`excel_sheets()` lists the sheets in the Excel file
+## Tidyr in action - `pivot_wider()` {auto-animate="true"}
+
+![](images/pivot_wider.png)
+
+## Tidyr in action - `pivot_wider()` {auto-animate="true"}
+
+By default, combinations that are missing from the long table are filled with `NA`; we can change this with `values_fill`.
+
+```{r}
+#| echo: true
+
+rna_wide_noNAs = pivot_wider(
+                      rna_long,
+                      names_from = sample,
+                      values_from = expression,
+                      values_fill = 0)
 
-## `readxl::` in action {auto-animate="true"}
+rna_wide_noNAs
 
-Loading the data in "Sheet1"
+```
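+
+## Tidyr in action - `values_fill` on a toy table {auto-animate="true"}
+
+`values_fill` only matters when a combination is actually missing from the long table. A minimal sketch on a hypothetical table (`toy_sparse` and its values are made up for illustration) in which `geneB` has no row for sample `s2`:
+
+```{r}
+#| echo: true
+library(tidyverse)  # already loaded at the top of these slides
+
+# Hypothetical long table: geneB was not recorded for sample s2
+toy_sparse = tibble(
+  gene = c("geneA", "geneA", "geneB"),
+  sample = c("s1", "s2", "s1"),
+  expression = c(10, 25, 3))
+
+# Default: the missing gene/sample combination becomes NA
+toy_na = pivot_wider(
+  toy_sparse,
+  names_from = sample,
+  values_from = expression)
+
+# With values_fill the same cell is set to 0 instead
+toy_zero = pivot_wider(
+  toy_sparse,
+  names_from = sample,
+  values_from = expression,
+  values_fill = 0)
+
+toy_na
+toy_zero
+```
+
+Filling with 0 claims that the gene had zero counts, which is not the same as "not measured", so choose the fill value with care.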
+
+## Readr in action - `write_csv()` {auto-animate="true"}
+
+Finally, we may need to save our data to a new file for later use or sharing. For that we can use `write_csv()`.
 
 ```{r}
 #| echo: true
-sheet_one = read_xls("2010_bigfive_regents.xls", sheet = "Sheet1")
 
-sheet_one
+write_csv(rna_wide_noNAs, file = "rna_wide.csv")
+
 ```
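+
+## Readr in action - checking the written file {auto-animate="true"}
+
+A quick way to check a saved file is to read it back. This is a minimal sketch on a hypothetical toy table written to a temporary file (`toy`, `toy_file` and `toy_reread` are illustrative names), so no course file is touched.
+
+```{r}
+#| echo: true
+library(tidyverse)  # already loaded at the top of these slides
+
+# Hypothetical toy table, not part of the course data
+toy = tibble(
+  gene = c("geneA", "geneB"),
+  s1 = c(10, 0),
+  s2 = c(25, 3))
+
+# Write to a temporary file so that nothing is overwritten
+toy_file = tempfile(fileext = ".csv")
+write_csv(toy, file = toy_file)
+
+# Read the file back and compare it with what was written
+toy_reread = read_csv(toy_file, show_col_types = FALSE)
+
+identical(names(toy), names(toy_reread))
+identical(toy$s2, toy_reread$s2)
+```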
 
 ## `readxl::` {auto-animate="true"}
 
-- `read_xls()` is for xls files, `read_xlsx` for XML-based xlsx files
-- you can use `read_excel()` if you are unsure of the format
+-   `readxl::` is the Tidyverse library for reading data from Excel formats.
+-   `read_excel()`, `read_xls()` and `read_xlsx()` are some of the functions provided
+-   The `excel_sheets()` function yields the names of the sheets in the Excel file. These names are passed to the `sheet` argument for the **readxl** functions
+-   The `read_lines()` function (from **readr**, not **readxl**) reads the lines of a text file; with `n_max` it returns only the first few.
+-   But please, **don't use Excel**
 
 ## Any questions?