lecture notes and final markdown

bihealth · Mar 15, 2024 · e87fce1
1 parent c472117
Showing 2 changed files with 174 additions and 0 deletions.
diff --git a/Scripts/courseRecap.Rmd b/Scripts/courseRecap.Rmd
@@ -0,0 +1,123 @@
+---
+title: "Exploring the iris data set"
+author: "MB"
+date: "2024-03-15"
+output: html_document
+editor_options: 
+  chunk_output_type: console
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+## Recap of the week
+
+ * how to use functions
+ * how to create matrices and data frames
+ * how to work with different data types
+
+ * how to import and view data
+ * how to clean data
+ * how to merge data
+ * how to plot graphs
+
+## Introducing the iris data set
+
+*taken from Wikipedia*
+
+The Iris flower data set or Fisher's Iris data set is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper *The use of multiple measurements in taxonomic problems* as an example of linear discriminant analysis.[1] It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species.[2] Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".[3]
+
+The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other. Fisher's paper was published in the Annals of Eugenics (today the Annals of Human Genetics).[1] 
+
+## Importing the data set
+
+Some information about importing data ...
+```{r echo=FALSE}
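+library(readxl)
+# first sheet (the default): the measurements
+# note: this object hides the built-in datasets::iris
+iris <- read_excel("Datasets/irisForMarkDown.xlsx")
+# second sheet: the metadata
+meta <- read_excel("Datasets/irisForMarkDown.xlsx", sheet = 2)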
+```
+
+
+## Cleaning data
+ * to get consistent column names (headers), load the janitor package and use `clean_names()`
+ * loading janitor masks `chisq.test()` and `fisher.test()` from stats
+
+```{r clean-data, message=FALSE, warning=FALSE}
+library(janitor)
+iris <- clean_names(iris) 
+meta <- clean_names(meta)
+
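+# metadata: use lower-case species names
+meta$species <- tolower(meta$species)
+
+# sepal_length: decimal comma -> decimal point, strip stray letters, then convert
+iris$sepal_length <- gsub(",", ".", iris$sepal_length)
+iris$sepal_length <- gsub("[[:alpha:]]", "", iris$sepal_length)
+iris$sepal_length <- as.numeric(iris$sepal_length)
+
+# petal_length: strip stray letters, then convert
+iris$petal_length <- gsub("[[:alpha:]]", "", iris$petal_length)
+iris$petal_length <- as.numeric(iris$petal_length)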
+```
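+
+The cleaning steps in the chunk above can be tried on a small vector first. This is only a sketch: the values in `raw` are made up for illustration and are not taken from the Excel file.
+
+```{r}
+# hypothetical raw values with a decimal comma and stray letters
+raw <- c("5,1", "4.9cm", "4,7cm")
+step1 <- gsub(",", ".", raw)             # decimal comma -> decimal point
+step2 <- gsub("[[:alpha:]]", "", step1)  # remove letters
+cleaned <- as.numeric(step2)             # convert the cleaned text to numbers
+cleaned
+anyNA(cleaned)                           # TRUE would flag values that failed to convert
+```
+
+Checking `anyNA()` after `as.numeric()` is a cheap way to spot entries that still contain unexpected characters.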
+
+## View data 
+
+```{r}
+iris
+```
+
+```{r}
+meta
+```
+
+
+## Merging data
+
+```{r}
+iris <- merge(iris, meta, by = "flower", all = TRUE)
+```
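+
+With `all = TRUE`, `merge()` keeps rows that have no partner in the other table and fills the missing columns with `NA`. A minimal sketch, assuming two toy tables (`toy_left` and `toy_right` are hypothetical; only the key name `flower` comes from the data above):
+
+```{r}
+toy_left  <- data.frame(flower = c(1, 2, 3), petal_length = c(1.4, 4.7, 6.0))
+toy_right <- data.frame(flower = c(2, 3, 4),
+                        species = c("versicolor", "virginica", "setosa"))
+# full outer join: keep every flower from both tables
+toy_both <- merge(toy_left, toy_right, by = "flower", all = TRUE)
+toy_both
+# number of rows with at least one NA (flowers 1 and 4 here)
+sum(!complete.cases(toy_both))
+```
+
+Counting incomplete rows after a merge helps to decide whether later steps need to handle missing values.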
+
+
+## Exploratory data analysis
+Summarizing data 
+
+```{r}
+library(writexl)
+# column-wise summary of the merged data
+sumIris <- colorDF::summary_colorDF(iris)
+# optional: export the summary to Excel
+#write_xlsx(sumIris, path = "sumIris.xlsx")
+
+DT::datatable(sumIris)
+```
+
+```{r}
+library(tidyverse)
+# scatter plot of petal length vs. petal width, colored by species
+iris |> 
+  ggplot(aes(x = petal_length, y = petal_width, color = species)) +
+  geom_point() +
+  # to annotate points with text, add geom_text()
+  theme_classic()
+```
+
+For publication-ready plots, have a look at the ggpubr package.
+
+```{r}
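+# PCA on the measurement columns (assumed to be columns 2 to 5 after the merge)
+pca <- prcomp(iris[, 2:5], scale. = TRUE)
+# combine the principal component scores with the species labels for plotting
+df <- data.frame(pca$x, species = iris$species)
+ggplot(df, aes(x = PC1, y = PC2, color = species)) + geom_point()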
+```
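+
+How much of the variation each principal component captures can be read from the standard deviations stored in the `prcomp` result. A minimal sketch, assuming the built-in `datasets::iris` (not the Excel file used above) so that the chunk runs on its own:
+
+```{r}
+# built-in iris data: columns 1 to 4 are the four measurements
+pca_demo <- prcomp(datasets::iris[, 1:4], scale. = TRUE)
+# proportion of variance explained by each component
+var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
+round(var_explained, 3)
+# the same information, together with cumulative proportions
+summary(pca_demo)$importance
+```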
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
diff --git a/Scripts/lecture05.R b/Scripts/lecture05.R
@@ -0,0 +1,51 @@
+library(tidyverse)
+## merging and matching ----
+set1 <- letters[1:5]
+set2 <- letters[3:7]
+
+# set theory
+intersect(set1, set2)
+union(set1, set2)
+
+# for data frames: concepts from databases
+# rows with IDs in both data frames -> inner join
+# all rows of the first data frame  -> left join
+# all rows of the second data frame -> right join
+# all rows of both (union)          -> full (outer) join
+
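+# two tables with 15 randomly drawn letter IDs each, so only some IDs overlap
+df1 <- tibble(ID=sample(letters, 15), value1=rnorm(15))
+df2 <- tibble(ID=sample(1:15, 15), 
+              color=sample(c("black", "red", "green", "blue"), 15, replace = TRUE))
+# keep the numeric ID as old_ID, then replace ID with random letters
+df2$old_ID <- df2$ID
+df2$ID <- sample(letters, 15)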
+
+# let's use tidyverse right away
+small <- inner_join(df1, df2)
+large <- full_join(df1, df2)
+View(small)
+View(large)
+leftDf <- left_join(df1, df2)
+View(leftDf)
+rightDf <- right_join(df1, df2)
+View(rightDf)
+left_join(df2, df1) # same rows as right_join(df1, df2) above, only the column order differs
+# the order of rows might change -> re-sorting might be necessary
+
+df1 <- tibble(ID = c("a", "a", "b", "c"), no = 1:4, value = rnorm(4))
+df2 <- tibble(ID = c("a", "a", "b", "c"), no = 1:4, value = runif(4))
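+inner_join(df1, df2) # joins by all shared columns (ID, no, value); the random values differ, so usually no rows match
+inner_join(df1, df2, by = c("ID")) # joins by ID only; the duplicated ID "a" matches 2 x 2 rows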
+inner_join(df1, df2, by = c("ID", "no"), suffix = c(".df1", ".df2"))
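+
+# Sketch (an addition, not part of the lecture code): because df1 and df2 above
+# are drawn at random, the join results differ on every run. The toy tables
+# toyA and toyB below are hypothetical and fixed, so the row counts are predictable.
+library(dplyr)
+toyA <- data.frame(ID = c("a", "b", "c"), x = 1:3)
+toyB <- data.frame(ID = c("b", "c", "d"), y = 4:6)
+nrow(inner_join(toyA, toyB, by = "ID")) # 2 rows: b, c
+nrow(left_join(toyA, toyB, by = "ID"))  # 3 rows: a, b, c
+nrow(right_join(toyA, toyB, by = "ID")) # 3 rows: b, c, d
+nrow(full_join(toyA, toyB, by = "ID"))  # 4 rows: a, b, c, d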
+
+
+
+
+
+
+
+
+
+
+
+
+