Commit d028c38 — updated L03
january3 committed Sep 18, 2024 · 1 parent fcebc71

Showing 2 changed files with 4,023 additions and 3,871 deletions:

* `Lectures/lecture_03.html`: 3,920 additions & 3,841 deletions (large diff not rendered by default)
* `Lectures/lecture_03.rmd`: 103 additions & 30 deletions
[…]

Main data types you will encounter:
----------------------------  -----------------------------  ---------------------  --------------------------
Data type                     Function                       Package                Notes
----------------------------  -----------------------------  ---------------------  --------------------------
Columns separated by spaces   `read_table()`                 `readr`/`tidyverse`    one or more spaces
                                                                                    separate each column

TSV / TAB separated values    `read_tsv()`                   `readr`/`tidyverse`    Delimiter is tab (`\t`).

CSV / comma separated         `read_csv()`                   `readr`/`tidyverse`    Comma separated values

Any delimiter                 `read_delim()`                 `readr`/`tidyverse`    Customizable

XLS (old Excel)               `read_xls()`, `read_excel()`   `readxl`               Just don't use it.
                                                                                    From the […]
----------------------------  -----------------------------  ---------------------  --------------------------

[…]

tidyverse functions above are preferable.

* Downloading files from our git repository

## Exercise 3.1

Read and inspect the following files:

* `TB_ORD_Gambia_Sutherland_biochemicals.csv`
* `iris.csv`
* `meta_data_botched.xlsx`

Which functions would you use?

The function `readxl_example("deaths.xls")` returns a file name. Read this file:

```{r}
library(readxl)
fn <- readxl_example("deaths.xls")
data <- read_excel(fn)
```

How can you omit the lines at the top and at the bottom of the file?
(hint: `?read_excel`). How can you force the date columns to be interpreted
as dates and not numbers?
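One possible approach, as a sketch (the `skip` and `n_max` values below are what this particular file happens to need; always inspect the raw file and `?read_excel` first):

```{r}
library(readxl)
fn <- readxl_example("deaths.xls")
# skip the decorative lines at the top; n_max drops the trailing notes
data <- read_excel(fn, skip = 4, n_max = 10)
```

When the header row is skipped correctly, the date columns are also read as dates rather than as Excel day numbers.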


## Tibbles / readxl

Tibbles belong to the tidyverse. They are nice to work with and very
useful. Also, they are *mostly* identical to data frames.

One crucial difference between tibble and data frame is that `tibble[ , 1 ]`
returns a tibble, while `dataframe[ , 1]` returns a vector. The second
crucial difference is that it does not support row names (on purpose!).
* brace yourself for bad times
* get used to nasty remarks
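The first difference — indexing — can be seen directly (a minimal sketch with made-up data):

```{r}
library(tibble)
df <- data.frame(x = 1:3, y = 4:6)
tb <- as_tibble(df)
df[, 1]   # a plain vector: 1 2 3
tb[, 1]   # still a tibble, with one column
```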

## Standardising column names

Column names should be uniform.

* They should contain only alphanumeric characters and underscores
* Dots are allowed, but not recommended ("old style")
* They should not contain spaces
* They should start with a letter

You can use the janitor package to clean up column names:

```{r}
library(tidyverse)
library(janitor)
data <- read_csv("data.csv")
data <- clean_names(data)
```
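For example (the messy column names below are made up; `clean_names()` converts them to lowercase snake case):

```{r}
library(janitor)
messy <- data.frame("Sepal Length" = 1:2, "% infected" = 3:4,
                    check.names = FALSE)
clean_names(messy)
# column names become: sepal_length, percent_infected
```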

## Diagnosing problems

Potential problems:
[…]
* tidyverse reading functions provide a summary on the reading process,
e.g.:


```{r eval=TRUE,results="markdown"}
library(tidyverse)
myiris <- read_csv("../Datasets/iris.csv")
```

[…]

```{r eval=TRUE,results="markdown"}
summary(myiris$`Sepal Length`)
```

(we use the back ticks because the column name contains a space)

## Diagnosing problems

* The colorDF package provides a function called `summary_colorDF` which
can be used to diagnose problems with different flavors of data frames:

```{r eval=TRUE,results="markdown",R.options=list(width=100)}
library(colorDF)
summary_colorDF(myiris)
```

## Exercise 3.2: Diagnosing problems

* Read the data file `iris.csv` using the `read_csv` function. Spot the
problems. How can we deal with them?
* Read the data file `meta_data_botched.xlsx`. Spot
the errors. How can we deal with them?

## Mending problems

* Use logical vectors to substitute values which are incorrect
* Use logical vectors to filter out rows which are incorrect
* Enforce a data format with `as.numeric`, `as.character`, `as.factor` etc.
* Use regular expressions to search and replace

## Mending problems with logical vectors

Use logical vectors to substitute values which are incorrect:

```{r}
nas <- is.na(some_df$some_column)
some_df$some_column[nas] <- 0
```

* `is.na` returns a logical vector of the same length as the input vector
* `some_df$some_column[nas]` returns only the values which are `NA`
* `some_df$some_column[nas] <- 0` replaces the `NA` values by `0`
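The same idea as a minimal, self-contained sketch (made-up data):

```{r}
x <- c(1, NA, 3, NA)
nas <- is.na(x)
x[nas] <- 0
x   # 1 0 3 0
```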

## Mending problems with logical vectors

Use logical vectors to substitute values which are incorrect:


```{r}
to_replace <- some_df$some_column == "male"
some_df$some_column[to_replace] <- "M"
```

* `some_df$some_column == "male"` returns a logical vector with `TRUE` for
all values which are equal to "male"
* we then can replace them with a standardized value
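Again as a minimal, self-contained sketch:

```{r}
sex <- c("male", "M", "male", "F")
to_replace <- sex == "male"
sex[to_replace] <- "M"
sex   # "M" "M" "M" "F"
```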

## Mending problems through filtering

* Filtering the data: see tomorrow

## Mending problems by enforcing a data format

Use `as.numeric`, `as.character`, `as.factor` etc. to enforce a data format:

```{r}
some_df$some_column <- as.numeric(some_df$some_column)
```

* `as.numeric` converts a vector to numeric values
* `as.character` converts a vector to character values
* `as.factor` converts a vector to a factor

Note: dates are a special case. If you are in a pinch, take a look at the
`lubridate` package.
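Values which cannot be converted become `NA` (with a warning) — which is also a quick way to spot dirty entries. A small sketch:

```{r}
x <- c("12.5", "7", "n.d.")
as.numeric(x)
# 12.5 7.0 NA, plus a warning: "NAs introduced by coercion"
```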

# Search and replace with regular expressions

## Mending problems by search and replace

Regular expressions are a powerful tool to search and replace, not only for
mending / cleaning data, but also for data processing in general.

* find the incorrect values
* replace incorrect values (by search and replace)
* enforce a particular data type (e.g., converting strings to numbers)
* [Video: introduction to regular expressions, 15 minutes](https://youtu.be/ukN59iCo5wc)

## Using patterns to clean data
[…]

## Substitutions (search & replace)

* `gsub(pattern, replacement, text)` substitutes all occurrences of `pattern`
  by `replacement` in `text`
* `sub(...)` same, but only the first occurrence
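For example (note that matching is case-sensitive by default):

```{r}
x <- c("male", "Male", "male ")
gsub("male", "M", x)      # "M" "Male" "M "
sub("a", "A", "banana")   # "bAnana" -- only the first "a" is replaced
```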

[…]

```{r}
group <- c("Control", " control", "control ", "Control ")
group1 <- trimws(group)
group2 <- tolower(group1)
## or in one go, nesting the calls (like sin(log(tan(pi)))):
group <- tolower(trimws(group))
```


[…]

```{r}
vec <- gsub("^m.*", "Mouse", vec)
vec <- gsub("S*chicken", "Chicken", vec)
```
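To see what such a pattern does (a sketch with made-up values; `^m` anchors the match at the start of the string, `.*` swallows the rest):

```{r}
vec <- c("mouse", "mus musculus", "chicken")
vec <- gsub("^m.*", "Mouse", vec)
vec   # "Mouse" "Mouse" "chicken"
```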

## Exercise 3.3

* Use gsub to make the following uniform: `c("male", "Male ", "M", "F", "female", " Female")`
* Using `gsub` and the `toupper` function, clean up the gene names such
[…]
## Using regular expressions to clean tables


## Exercise 3.4

* Read the data file `iris.csv` using the `read_csv` function. Spot and correct errors.
* If you have time: Read the data file `meta_data_botched.xlsx`. Spot and correct errors.
[…]

removes existing row names on data frames.

## Wide and Long format (demonstration)

* https://youtu.be/NO1gaeJ7wtA
* https://youtu.be/v5Y_yrnkWIU
* https://youtu.be/jN0CI62WKs8

[…]

```{r}
pivot_wider(long, names_from="condition", values_from="measurement")
pivot_wider(long, id_cols="subject", names_from="condition", values_from="measurement")
```

## Exercise 3.5

Convert the following files to long format:

[…]
