lecture notes and final markdown
mbenary committed Mar 15, 2024
1 parent c472117 commit e87fce1
Showing 2 changed files with 174 additions and 0 deletions.
123 changes: 123 additions & 0 deletions Scripts/courseRecap.Rmd
@@ -0,0 +1,123 @@
---
title: "Exploring the iris data set"
author: "MB"
date: "2024-03-15"
output: html_document
editor_options:
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

## Recap of the week

* how to use functions
* how to create matrices and data frames
* how to work with different data types

* how to import and view data
* how to clean data
* how to merge data
* how to plot graphs

## Introducing the iris data set

*taken from Wikipedia*

The Iris flower data set or Fisher's Iris data set is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems as an example of linear discriminant analysis.[1] It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species.[2] Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".[3]

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other. Fisher's paper was published in the Annals of Eugenics (today the Annals of Human Genetics).[1]

## Importing the data set

Some information about importing data ...
```{r echo=FALSE}
library(readxl)
iris <- read_excel("Datasets/irisForMarkDown.xlsx")
meta <- read_excel("Datasets/irisForMarkDown.xlsx", sheet = 2)
```
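
As a minimal sketch of how the `sheet` argument used above behaves, the block below writes a throwaway two-sheet workbook and reads the second sheet both by position and by name. The file content and the sheet names (`measurements`, `meta`) are made up for illustration and are not the course data.

```r
library(readxl)
library(writexl)

# Write a throwaway two-sheet workbook (hypothetical content) to a temp file
path <- tempfile(fileext = ".xlsx")
write_xlsx(
  list(
    measurements = data.frame(flower = 1:2, petal_length = c(1.4, 4.7)),
    meta = data.frame(flower = 1:2, species = c("setosa", "versicolor"))
  ),
  path = path
)

excel_sheets(path)                           # sheet names, in workbook order
by_position <- read_excel(path, sheet = 2)   # second sheet, selected by position
by_name <- read_excel(path, sheet = "meta")  # same sheet, selected by name
identical(by_position, by_name)
```

Selecting a sheet by name is usually more robust than by position if the workbook might be reordered later.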


## Cleaning data
* to get consistent column names/headers, load the "janitor" package and use `clean_names()`
* janitor masks functions from stats (`chisq.test()` and `fisher.test()`)

```{r clean-data, message=FALSE, warning=FALSE}
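library(janitor)
# standardize column names (lower-case snake_case)
iris <- clean_names(iris)
meta <- clean_names(meta)
# make species labels lower-case so that spelling is consistent
meta$species <- tolower(meta$species)
# sepal_length: decimal comma -> decimal point, drop stray letters, then convert
iris$sepal_length <- gsub(",", ".", iris$sepal_length)
iris$sepal_length <- gsub("[[:alpha:]]", "", iris$sepal_length)
iris$sepal_length <- as.numeric(iris$sepal_length)
# petal_length: drop stray letters, then convert
iris$petal_length <- gsub("[[:alpha:]]", "", iris$petal_length)
iris$petal_length <- as.numeric(iris$petal_length)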
```
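
The three cleaning steps can be tried on a small vector first. This is only a sketch: the entries in `raw` are hypothetical and merely mimic the kind of problems (decimal commas, stray letters) that the chunk above handles.

```r
# Hypothetical raw entries with decimal commas and stray letters
raw <- c("5,1", "4.9cm", "4,7cm", "5.0")

# 1. decimal comma -> decimal point
step1 <- gsub(",", ".", raw)
# 2. remove all letters
step2 <- gsub("[[:alpha:]]", "", step1)
# 3. convert the cleaned strings to numbers
cleaned <- as.numeric(step2)
cleaned
```

If `as.numeric()` still warns about NAs introduced by coercion, some entries contain characters that the two `gsub()` calls did not cover.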

## View data

```{r}
iris
```

```{r}
meta
```


## Merging data

```{r}
iris <- merge(iris, meta, by = "flower", all = TRUE)
```
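
The `all` argument decides which keys survive the merge. The sketch below uses two made-up tables (`measurements` and `species_info` are hypothetical names) to compare the default with `all.x = TRUE` and `all = TRUE`.

```r
# Hypothetical toy tables sharing the key column "flower"
measurements <- data.frame(flower = c(1, 2, 3), petal_length = c(1.4, 4.7, 6.0))
species_info <- data.frame(flower = c(2, 3, 4),
                           species = c("versicolor", "virginica", "setosa"))

# default: only keys present in both tables (inner join)
inner <- merge(measurements, species_info, by = "flower")
# all.x = TRUE: every row of the first table (left join)
left <- merge(measurements, species_info, by = "flower", all.x = TRUE)
# all = TRUE: keys from either table (full outer join), gaps filled with NA
full <- merge(measurements, species_info, by = "flower", all = TRUE)

nrow(inner)  # 2
nrow(left)   # 3
nrow(full)   # 4
```

With `all = TRUE`, flowers that are missing from one of the tables stay in the result with `NA` values, so it is worth checking for NAs after the merge.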


## Exploratory data analysis
Summarizing data

```{r}
library(writexl)
sumIris <- colorDF::summary_colorDF(iris)
#write_xlsx(sumIris, path = "sumIris.xlsx")
DT::datatable(sumIris)
```

```{r}
library(tidyverse)
iris |>
  ggplot(aes(x = petal_length, y = petal_width, color = species)) +
  geom_point() +
  # to annotate points with text, add geom_text()
  theme_classic()
```
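
A possible way to add the text annotation mentioned in the comment is sketched below. It assumes the built-in `datasets::iris` (with column names such as `Petal.Length`) rather than the imported and cleaned table, and the choice to label only the flower with the longest petal is just for illustration.

```r
library(ggplot2)

# Built-in copy of the data; its columns are named Petal.Length, Petal.Width, Species
flowers <- datasets::iris
# Label only one point (the longest petal) to keep the plot readable
longest <- flowers[which.max(flowers$Petal.Length), ]

p <- ggplot(flowers, aes(x = Petal.Length, y = Petal.Width, color = Species)) +
  geom_point() +
  # geom_text() inherits x, y and color; only the label mapping is added
  geom_text(data = longest, aes(label = Species), vjust = -1, show.legend = FALSE) +
  theme_classic()
p
```

Passing a filtered data frame to `geom_text()` avoids labelling all 150 points at once.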

For publication-ready plots, see the ggpubr package.

```{r}
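# PCA on the four measurement columns (assumed to be columns 2 to 5 after the merge)
pca <- prcomp(iris[, 2:5], scale. = TRUE)
# combine the principal component scores with the species labels
df <- data.frame(pca$x, species = iris$species)
ggplot(df, aes(x = PC1, y = PC2, color = species)) + geom_point()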
```
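
To judge how much information the PC1/PC2 plot keeps, the share of variance per component can be computed from the `sdev` element of the `prcomp()` result. The sketch below assumes the built-in `datasets::iris`, where the four measurements are columns 1 to 4, instead of the merged table.

```r
# Built-in copy of the data; columns 1 to 4 hold the measurements
flowers <- datasets::iris
pca_demo <- prcomp(flowers[, 1:4], scale. = TRUE)

# variance of each component = squared standard deviation
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
round(var_explained, 2)

# cumulative share, e.g. how much PC1 and PC2 cover together
cumsum(var_explained)
```

Note that `prcomp()` stops with an error if the input contains `NA` values, so rows with missing measurements would have to be removed first, for example with `na.omit()`.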
















51 changes: 51 additions & 0 deletions Scripts/lecture05.R
@@ -0,0 +1,51 @@
library(tidyverse)
## merging and matching ----
set1 <- letters[1:5]
set2 <- letters[3:7]

# set theory
intersect(set1, set2)
union(set1, set2)

# for data frames - concepts from databases
# the intersecting part -> inner join (IDs present in both data frames)
# all data belonging to set1 -> left join (keeps every row of the first data frame)
# all data belonging to set2 -> right join (keeps every row of the second data frame)
# union -> full (outer) join

df1 <- tibble(ID=sample(letters, 15), value1=rnorm(15))
df2 <- tibble(ID=sample(1:15, 15),
              color=sample(c("black", "red", "green", "blue"), 15, replace = TRUE))
df2$old_ID <- df2$ID
df2$ID <- sample(letters, 15)

# let's use tidyverse right away
small <- inner_join(df1, df2)
large <- full_join(df1, df2)
View(small)
View(large)
leftDf <- left_join(df1, df2)
View(leftDf)
rightDf <- right_join(df1, df2)
View(rightDf)
left_join(df2, df1) # same data as right_join(df1, df2) on line 29
# order of rows and columns might differ -> re-sorting might be necessary
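
# A minimal sketch of the re-sorting mentioned above, using two made-up tables
# (toyA and toyB are hypothetical names, not part of the lecture data).
# The dplyr:: prefixes only make the sketch independent of the library() call at the top.
toyA <- dplyr::tibble(ID = c("c", "a", "b"), x = 1:3)
toyB <- dplyr::tibble(ID = c("b", "d", "a"), y = c(10, 20, 30))
# both joins keep every row of toyB; sorting by the key aligns the row order
viaRight <- dplyr::right_join(toyA, toyB, by = "ID") |> dplyr::arrange(ID)
viaLeft <- dplyr::left_join(toyB, toyA, by = "ID") |> dplyr::arrange(ID)
# after also aligning the column order, the two results hold the same data
all.equal(as.data.frame(viaRight), as.data.frame(viaLeft[, names(viaRight)]))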

df1 <- tibble(ID = c("a", "a", "b", "c"), no = 1:4, value = rnorm(4))
df2 <- tibble(ID = c("a", "a", "b", "c"), no = 1:4, value = runif(4))
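inner_join(df1, df2) # joins by all shared columns (ID, no, value); value differs, so no rows match
inner_join(df1, df2, by = c("ID")) # joins only by ID; the duplicated "a" matches many-to-many
inner_join(df1, df2, by = c("ID", "no"), suffix = c(".df1", ".df2")) # joins by ID and no; suffixes tell the two value columns apart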












