diff --git a/Lectures/lecture_03.html b/Lectures/lecture_03.html index d63c1c5..45d4904 100644 --- a/Lectures/lecture_03.html +++ b/Lectures/lecture_03.html @@ -1,3841 +1,3920 @@ - - - - R crash course / 03 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

- -

-

2024-03-13

-
-
- -

Aims for today

- -
    -
  • Reading data
  • -
  • Cleaning data
  • -
  • (Regular expressions)
  • -
  • Searching, sorting and selecting
  • -
- -

Repetitorium of Day 2

- -

Reading and writing data

- -

Reading data

- -

Main data types you will encounter:

- - ----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Data typeFunctionPackageNotes
Columns separated by spacesread_table()readrone or more spaces separate each column
TSV / TAB separated valuesread_tsv()readrDelimiter is tab (\t).
CSV / comma separatedread_csv()readrComma separated values
Any delimiterread_delim()readrCustomizable
XLS (old Excel)read_xls() read_excel()readxlJust don’t use it. From the readxl package.
XLSX (new Excel)read_xlsx() read_excel()readxlFrom the readxl package. You need to provide the sheet number you wish to read. Note: returns a tibble, not a data frame!
- -

Note: there are also “base R” functions read.table, read.csv, read.tsv (there is no function for reading XLS[X] files in base R). The tidyverse functions above are preferable.

- -

Where are your files - absolute vs relative paths

- -
    -
  • absolute path start at root directory, e.g.
    C:\Users\mbenary_c\CUBI\teaching\202403_RCrashCourse\lecture01.R

  • -
  • start at the working directory lecture01.R

  • -
  • Where am I? - getwd()

  • -
  • Downloading files from our git-repository

  • -
- -

Excercise 3.1

- -

Read, inspect the following files:

- -
    -
  • TB_ORD_Gambia_Sutherland_biochemicals.csv
  • -
  • iris.csv
  • -
  • meta_data_botched.xlsx
  • -
- -
    -
  1. Which functions would you use?
  2. -
  3. What kind of issues can you detect?
  4. -
  5. Suggestions of solving the issues?
  6. -
- -

The function readxl_example("deaths.xls") returns a file name. Read this file. How can you omit the lines at the top and at the bottom of the file? (hint: ?read_excel). How can you force the date columns to be interpreted as dates and not numbers?

- -

Tibbles / readxl

- -

tibbles belong to the tidyverse. They are nice to work with and very useful, but we can stick to data frames for now. Therefore, do

- -
mydataframe <- as.data.frame(read_xlsx("file.xlsx"))
- -

One crucial difference between tibble and data frame is that tibble[ , 1 ] returns a tibble, while dataframe[ , 1] returns a vector. The second crucial difference is that it does not support row names (on purpose!).

- -

Video: tibbles vs data frames, 10 minutes -> for later

- -

Cleaning data

- -

A short note on organizing your data

- -
    -
  • If possible, use a specialized tool and not Excel (a database, specialized clinical data management system, REDCap etc.)
  • -
- -

And if you use Excel…

- -

If you use Excel, follow these rules:

- -
    -
  • NEVER modify original files
  • -
  • Versionize your files
  • -
  • Check data for consistency (e.g. with R!)
  • -
  • Use preselected values
  • -
  • Add a meta-data sheet describing the fields
  • -
  • do not color-code or format-code your information (e.g. denoting treatment in bold)
  • -
  • Add comments in separate fields
  • -
  • NEVER throw away or replace any identifiers, always keep the old identifiers
  • -
  • understand what you describe (patients? samples? assays?)
  • -
  • understand the difference between wide and long format
  • -
- -

And if you use Excel and don’t follow these rules

- -
    -
  • brace yourself for bad times
  • -
  • get used to nasty remarks
  • -
- -

Diagnosing problems

- -

Potential problems:

- -
    -
  • incorrect import (e.g. numeric values as strings)
  • -
  • typos (e.g. “x15” instead of “15”)
  • -
  • incongruent labels (e.g. mixing “M”, “male” and “männlich”)
  • -
  • mixing upper and lower case
  • -
  • mixing different types of decimal separator
  • -
  • special numeric values (e.g. “> 10”, “< 5”)
  • -
  • comments in the same column (“10 (unreliable)”)
  • -
  • dates (also, Excel converts certain strings to dates)
  • -
- -

Diagnosing problems

- -
    -
  • tidyverse reading functions provide a summary on the reading process, e.g.:
  • -
- -
library(tidyverse)
-myiris <- read_csv("../Datasets/iris.csv")
- -
    -
  • Use summary, class and typeof functions to find potential problems. This is what we expect:
  • -
- -
typeof(myiris$Petal.Width)
- -
## [1] "double"
- -
class(myiris$Petal.Width)
- -
## [1] "numeric"
- -
summary(myiris$Petal.Width)
- -
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-##   0.100   0.300   1.300   1.199   1.800   2.500
- -

Diagnosing problems

- -
typeof(myiris$Petal.Width)
-class(myiris$Petal.Width)
-summary(myiris$Petal.Width)
- -

Longer explanation: typeof returns the so called “storage type”, that is the very basic type of a value (string, logical, integer, list etc.). class can be different (more specific). For example, if mtx is a matrix of numbers, typeof(mtx) is double, and class(mtx) is matrix.

- -

Diagnosing problems

- -
    -
  • Use summary, class and typeof functions to find potential problems. This is not what we expect:
  • -
- -
 typeof(myiris$`Sepal Length`)
- -
## [1] "character"
- -
  class(myiris$`Sepal Length`)
- -
## [1] "character"
- -
summary(myiris$`Sepal Length`)
- -
##    Length     Class      Mode 
-##       150 character character
- -

(we use the back ticks because the column name contains a space)

- -

Diagnosing problems

- -
    -
  • The colorDF package provides a function called summary_colorDF which can be used to diagnose problems with different flavors of data frames:
  • -
- -
library(colorDF)
-summary_colorDF(myiris)
- -
## # Color data frame (class colorDF) 5 x 5:
-##  │Col         │Class│NAs  │unique│Summary                                                       
-## 1│Sepal Length│<chr>│    0│    37│5: 10, 5.1: 9, 6.3: 9, 5.7: 8, 6.7: 8, 5.5: 7, 5.8: 7, 6.4: 7…
-## 2│Sepal Width │<dbl>│    0│    25│ 2.00 [ 2.80 < 3.00>  3.38] 36.00                             
-## 3│Petal?Length│<chr>│    0│    44│1.4: 13, 1.5: 13, 4.5: 8, 1.3: 7, 1.6: 7, 5.1: 7, 5.6: 6, 4: …
-## 4│Petal.Width │<dbl>│    0│    22│0.1 [0.3 <1.3> 1.8] 2.5                                       
-## 5│Species     │<chr>│    0│     6│virginica: 46, setosa: 45, versicolor: 42, Versicolor: 8, Set…
- -

Mending problems

Using patterns to clean data

- -
    -
  • everything is a character (even numbers)
  • -
  • wildcards (jokers) allow to create patterns
  • -
  • grep(pattern, string) is the programming equivalent of Ctrl F in Word/Excel
  • -
  • teachers <- c("January", "Manuela")
  • -
  • grep("an", teachers) find the string “an” in teachers -> what’s the difference when using grepl
  • -
  • grep("^J", teachers) find the string which starts with a capital J
  • -
- -

Substitutions (search & replace)

- -
    -
  • gsub(string1, string2, text) substitute all occurences of string1 in by string2 in text
  • -
  • sub(...) same, but only the first occurence
  • -
- -
## Clearly broken!
-group <- c("ontrol", "Montrol", "Kontrol", "Kkkkkontrl", "hello Connnnnnntrol", 
-           "oh what a nice day it is, controooool", "control ", "Control, incoming")
-
-## replace upper-case "Control"
-group2 <- gsub("[ckCK]+on*tro*l$", "control", group)
-
-grepl("[ckCK]+on*tro*l$", group)
-
-group != group2
-
-## replace all spaces
-group <- gsub(" ", "", group)
- -

Substitutions (search & replace)

- -

Notes:

- -

The above operations are so common that there are “shortcut” functions defined in R:

- -
## get rid of the spaces
-trimws(group)
-
-## make everything lower case
-tolower(group)
-
-## make everything upper case
-toupper(group)
- -

Regular expressions

- -

Is there not a way to simplify things?

- -
group <- c("Control", " control", "control ", "Control   ")
-
-# one method
-gsub(" ", "", group) # but what happens if the groups are "group 1", "group 2" etc?
-
-# substitute at the end of the string
-gsub(" $", "", group)
-
-# at the beginning
-gsub("^ ", "", group)
-
-# any number of occurences
-gsub(" *", "", group)
- -
group <- c("Control", " control", "control ", "Control   ")
-group1 <- trimws(group)
-group2 <- tolower(group1)
-
-## sin(log(tan(pi)))
-group <- tolower(trimws(group))
- -

Regular expression cont.

- -

In regular expressions, some characters or strings have special meanings

- -
    -
  • Some mean what is supposed to be replaced (e.g. ., dot, means “any character”)

  • -
  • Some mean how many times something gets matched; e.g. * means “any number of matches of the previous character, so .* means any character any number of times (like, everything)

  • -
  • Some characters at certain position may also have a special meaning (like ^ means “match the beginning of a string” and $ at the end of a regular expression means “match at the end of a string”).

  • -
  • A string like [abc] will match only a, b and c, but not d. You can use ranges, e.g. [0-9] or [a-z]

  • -
  • https://rstudio.github.io/cheatsheets/regex.pdf

  • -
- -

Regular expression cont.

- -
foo <- c("a1_123", "a1_231", "a2_17", "B2_23", "c2_889")
-
-# substitute digits only
-gsub("[0-9]*", "", foo)
-
-# substitute letters only
-gsub("[a-z]*", "", foo)
-
-# same, but case insensitive
-gsub("[a-z]*", "", foo, ignore.case=TRUE)
-
-# substitute everything after "_"
-# dot means "any character"
-gsub("_.*", "", foo)
- -

Regular expression cont.

- -

Note: the function gsub takes three arguments:

- -
    -
  • a regular expression
  • -
  • a replacement
  • -
  • a character vector
  • -
- -

It then finds all occurences of the regular expression in each element of the character vector and replaces them by the replacement.

- -

Typical example

- -

Often, due to typos, when you read a data file, a numeric column becomes a character string, because for example someone entered “x15” instead of “15” or mixed German and English decimal separators (e.g. “1,5” and “1.5” in the same column).

- -

We can use the as.numeric function to convert the strings to numeric values, but this does not work as intended:

- -
foo <- c("xxxxx15", "15.5", "15,5")
-as.numeric(foo)
- -
## [1]   NA 15.5   NA
- -

We need to substitute the strings first:

- -
foo <- gsub("[a-zA-Z]*", "", foo) # remove letters
-foo <- gsub(",", ".", foo)        # convert decimal separator
-as.numeric(foo)
- -
## [1] 15.0 15.5 15.5
- -

Regexps cont.

- -

You can wrap gsub around another gsub

- -
vec <- c("mouse", " Mouse", "mus domesticus", "chicken", "Schicken", "Duck")
-vec <- gsub("S*chicken", "Chicken", gsub("^m.*", "Mouse", gsub("^ *", "", vec)))
-
-## same as
-
-vec <- gsub("^ *", "", vec)
-vec <- gsub("^m.*", "Mouse", vec)
-vec <- gsub("S*chicken", "Chicken", vec)
- -

Exercise 3.2

- -
    -
  • Use gsub to make the following uniform: c("male", "Male ", "M", "F", "female", " Female")
  • -
  • Using gsub and the toupper function, clean up the gene names such that they conform to the HGNC (all capital letters, no spaces, no dashes): c("ankrd22", "ifng", "Nf-kb", " Cxcl 5", "CCL 6.", "ANK.r.d. 12")
  • -
  • What regular expression matches all of the ankyrin repeat genes (but not other genes) in the following vector: c("ANKRD22", "ANKEN", "ank.rep.domain 12", "ifng-1", "ANKRD-33", " ankrd23", "MAPK)
  • -
- -

Using regular expressions to clean tables

- -

Exercise 3.3

- -
    -
  • Read the data file iris.csv using the read_csv function. Spot and correct errors.
  • -
  • If you have time: Read the data file meta_data_botched.xlsx. Spot and correct errors.
  • -
- -

Writing data

- -

Main data types you will encounter:

- - ----- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Data typeFunctionPackageNotes
TSV / TAB separated valueswrite.table()base RTab is \t. A very general and customizable function Warning: column names
write_tsv()readrNo rownames!
CSV / comma separatedwrite.csv()base Ra wrapper around write.table()
write_csv()readrNo rownames!
XLS (old Excel)Just don’t use it. No, seriously, don’t.
XLSX (new Excel)write_xlsx()writexlWarning: row names omitted!
- -

Row names – gory details.

- -

The default base R functions read.* assign (unless otherwise specified) row names to a data frame.

- -

The tidyverse read_* family of functions return not a data frame, but its variant, a tibble. These tibbles behave very much like data frame; however, tidyverse discourages the use of row names and sometimes even removes existing row names on data frames.

- -

Wide and Long format (demonstration)

Wide and Long format

- -

Long advantages:

- -
    -
  • easier to filter, process, visualize, do statistics with
  • -
  • focused on measurement (“patient ID” or equivalent is a covariate, and so is measurement type)
  • -
- -

Wide advantages:

- -
    -
  • groups data by a covariate (“patient ID”)
  • -
  • can be easier to manage (each column one measurement type)
  • -
- -

Converting from wide to long:

- -
wide <- read.table(header=TRUE, text='
- subject sex control cond1 cond2
-       1   M     7.9  12.3  10.7
-       2   F     6.3  10.6  11.1
-       3   F     9.5  13.1  13.8
-       4   M    11.5  13.4  12.9
-')
-pivot_longer(wide, cols=c("control", "cond1", "cond2"), 
-  names_to="condition", values_to="measurement")
- -

Converting from long to wide

- -
long <- read.table(header=TRUE, text='
- subject  sampleID sex condition measurement
-       1  ID000001 M   control         7.9
-       1  ID000002 M     cond1        12.3
-       1  ID000003 M     cond2        10.7
-       2  ID000004 F   control         6.3
-       2  ID000005 F     cond1        10.6
-       2  ID000006 F     cond2        11.1
-       3  ID000007 F   control         9.5
-       3  ID000008 F     cond1        13.1
-       3  ID000009 F     cond2        13.8
-')
- -

Converting from long to wide

- -
## not what we wanted!!! Why?
-pivot_wider(long, names_from="condition", values_from="measurement")
-
-## Instead: 
-pivot_wider(long, id_cols="subject", names_from="condition", values_from="measurement")
- -

Exercise 3.4

- -

Convert the following files to long format:

- -
    -
  • labresults_wide.csv
  • -
  • The iris data set (data(iris))
  • -
  • cars.xlsx (tricky!)
  • -
- -

Clean up and convert to long format (what seems to be the problem? How do we deal with that?):

- -
    -
  • mtcars_wide.csv
  • -
- - - - -
- - - - - - - - - + + + + R crash course / 03 + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+

+ +

+

2024-09-18

+
+
+ +

Aims for today

+ +
    +
  • Reading data
  • +
  • Cleaning data
  • +
  • (Regular expressions)
  • +
  • Searching, sorting and selecting
  • +
+ +

Recap of Day 2

+ +

Reading and writing data

+ +

Reading data

+ +

Main data types you will encounter:

+ + +++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Data typeFunctionPackageNotes
Columns separated by spacesread_table()readr/tidyverseone or more spaces separate each column
TSV / TAB separated valuesread_tsv()readr/tidyverseDelimiter is tab (\t).
CSV / comma separatedread_csv()readr/tidyverseComma separated values
Any delimiterread_delim()readr/tidyverseCustomizable
XLS (old Excel)read_xls() read_excel()readxlJust don’t use it. From the readxl package.
XLSX (new Excel)read_xlsx() read_excel()readxlFrom the readxl package. You need to provide the sheet number you wish to read. Note: returns a tibble, not a data frame!
+ +

Note: there are also “base R” functions read.table, read.csv, read.delim (there is no function for reading XLS[X] files in base R). The tidyverse functions above are preferable.

+ +

Where are your files - absolute vs relative paths

+ +
    +
  • absolute paths start at the root directory, e.g.
    C:\Users\mbenary_c\CUBI\teaching\202403_RCrashCourse\lecture01.R

  • +
  • relative paths start at the working directory, e.g. lecture01.R

  • +
  • Where am I? - getwd()

  • +
  • Downloading files from our git-repository

  • +
+ +

Exercise 3.1

+ +

Read, inspect the following files:

+ +
    +
  • TB_ORD_Gambia_Sutherland_biochemicals.csv
  • +
  • iris.csv
  • +
  • meta_data_botched.xlsx
  • +
+ +

Which functions would you use?

+ +

The function readxl_example("deaths.xls") returns a file name. Read this file:

+ +
fn <- readxl_example("deaths.xls")
+data <- read_excel(fn)
+ +

How can you omit the lines at the top and at the bottom of the file? (hint: ?read_excel). How can you force the date columns to be interpreted as dates and not numbers?

+ +

Tibbles / readxl

+ +

tibbles belong to the tidyverse. They are nice to work with and very useful. Also, they are mostly identical to data frames.

+ +

One crucial difference between a tibble and a data frame is that tibble[ , 1 ] returns a tibble, while dataframe[ , 1] returns a vector. The second crucial difference is that tibbles do not support row names (on purpose!).

+ +
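A minimal sketch of this difference, using a small made-up data frame:

```r
library(tibble)

df <- data.frame(x = 1:3, y = letters[1:3])
tb <- as_tibble(df)

class(df[, 1])   # "integer" -- single-column indexing drops to a vector
class(tb[, 1])   # "tbl_df" "tbl" "data.frame" -- still a tibble
```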

Video: tibbles vs data frames, 10 minutes -> for later

+ +

Cleaning data

+ +

A short note on organizing your data

+ +
    +
  • If possible, use a specialized tool and not Excel (a database, specialized clinical data management system, REDCap etc.)
  • +
+ +

And if you use Excel…

+ +

If you use Excel, follow these rules:

+ +
    +
  • NEVER modify original files
  • +
  • Keep versioned copies of your files
  • +
  • Check data for consistency (e.g. with R!)
  • +
  • Use preselected values
  • +
  • Add a meta-data sheet describing the fields
  • +
  • do not color-code or format-code your information (e.g. denoting treatment in bold)
  • +
  • Add comments in separate fields
  • +
  • NEVER throw away or replace any identifiers, always keep the old identifiers
  • +
  • understand what you describe (patients? samples? assays?)
  • +
  • understand the difference between wide and long format
  • +
+ +

And if you use Excel and don’t follow these rules

+ +
    +
  • brace yourself for bad times
  • +
  • get used to nasty remarks
  • +
+ +

Standardising column names

+ +

Column names should be uniform.

+ +
    +
  • They should contain only alphanumeric characters and underscores
  • +
  • Dots are allowed, but not recommended (“old style”)
  • +
  • They should not contain spaces.
  • +
  • They should start with a letter
  • +
+ +

You can use the janitor package to clean up column names:

+ +
library(janitor)
+data <- read_csv("data.csv")
+data <- clean_names(data)
+ +
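As a quick illustration (the column names here are made up), clean_names converts awkward names to lower-case snake case:

```r
library(janitor)

ugly <- data.frame(`Sepal Length` = 1:3, `Petal-Width` = 4:6,
                   check.names = FALSE)
names(clean_names(ugly))   # "sepal_length" "petal_width"
```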

Diagnosing problems

+ +

Potential problems:

+ +
    +
  • incorrect import (e.g. numeric values as strings)
  • +
  • typos (e.g. “x15” instead of “15”)
  • +
  • incongruent labels (e.g. mixing “M”, “male” and “männlich”)
  • +
  • mixing upper and lower case
  • +
  • mixing different types of decimal separator
  • +
  • special numeric values (e.g. “> 10”, “< 5”)
  • +
  • comments in the same column (“10 (unreliable)”)
  • +
  • dates (also, Excel converts certain strings to dates)
  • +
+ +

Diagnosing problems

+ +
    +
  • tidyverse reading functions provide a summary on the reading process, e.g.:
  • +
+ +
library(tidyverse)
+myiris <- read_csv("../Datasets/iris.csv")
+ +
    +
  • Use summary, class and typeof functions to find potential problems. This is what we expect:
  • +
+ +
typeof(myiris$Petal.Width)
+ +
## [1] "double"
+ +
class(myiris$Petal.Width)
+ +
## [1] "numeric"
+ +
summary(myiris$Petal.Width)
+ +
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
+##   0.100   0.300   1.300   1.199   1.800   2.500
+ +

Diagnosing problems

+ +
typeof(myiris$Petal.Width)
+class(myiris$Petal.Width)
+summary(myiris$Petal.Width)
+ +

Longer explanation: typeof returns the so-called “storage type”, that is, the very basic type of a value (character, logical, integer, list etc.). class can be different (more specific). For example, if mtx is a matrix of numbers, typeof(mtx) is double, and class(mtx) is matrix.

+ +

Diagnosing problems

+ +
    +
  • Use summary, class and typeof functions to find potential problems. This is not what we expect:
  • +
+ +
 typeof(myiris$`Sepal Length`)
+ +
## [1] "character"
+ +
  class(myiris$`Sepal Length`)
+ +
## [1] "character"
+ +
summary(myiris$`Sepal Length`)
+ +
##    Length     Class      Mode 
+##       150 character character
+ +

(we use the back ticks because the column name contains a space)

+ +

Diagnosing problems

+ +
    +
  • The colorDF package provides a function called summary_colorDF which can be used to diagnose problems with different flavors of data frames:
  • +
+ +
library(colorDF)
+summary_colorDF(myiris)
+ +
## # Color data frame (class colorDF) 5 x 5:
+##  │Col         │Class│NAs  │unique│Summary                                                       
+## 1│Sepal Length│<chr>│    0│    37│5: 10, 5.1: 9, 6.3: 9, 5.7: 8, 6.7: 8, 5.5: 7, 5.8: 7, 6.4: 7…
+## 2│Sepal Width │<dbl>│    0│    25│ 2.00 [ 2.80 < 3.00>  3.38] 36.00                             
+## 3│Petal?Length│<chr>│    0│    44│1.4: 13, 1.5: 13, 4.5: 8, 1.3: 7, 1.6: 7, 5.1: 7, 5.6: 6, 4: …
+## 4│Petal.Width │<dbl>│    0│    22│0.1 [0.3 <1.3> 1.8] 2.5                                       
+## 5│Species     │<chr>│    0│     6│virginica: 46, setosa: 45, versicolor: 42, Versicolor: 8, Set…
+ +

Exercise 3.2: Diagnosing problems

+ +
    +
  • Read the data file iris.csv using the read_csv function. Spot the problems. How can we deal with them?
  • +
  • Read the data file meta_data_botched.xlsx. Spot the errors. How can we deal with them?
  • +
+ +

Mending problems

+ +
    +
  • Use logical vectors to substitute values which are incorrect
  • +
  • Use logical vectors to filter out rows which are incorrect
  • +
  • Enforce a data format with as.numeric, as.character, as.factor etc.
  • +
  • Use regular expressions to search and replace
  • +
+ +

Mending problems with logical vectors

+ +

Use logical vectors to substitute values which are incorrect:

+ +
nas <- is.na(some_df$some_column)
+some_df$some_column[nas] <- 0
+ +
    +
  • is.na returns a logical vector of the same length as the input vector
  • +
  • some_df$some_column[nas] returns only the values which are NA
  • +
  • some_df$some_column[nas] <- 0 replaces the NA values by 0
  • +
+ +

Mending problems with logical vectors

+ +

Use logical vectors to substitute values which are incorrect:

+ +
to_replace <- some_df$some_column == "male"
+some_df$some_column[to_replace] <- "M"
+ +
    +
  • some_df$some_column == "male" returns a logical vector with TRUE for all values which are equal to “male”
  • +
  • we then can replace them with a standardized value
  • +
+ +

Mending problems through filtering

+ +
    +
  • Filtering the data: see tomorrow
  • +
+ +

Mending problems by enforcing a data format

+ +

Use as.numeric, as.character, as.factor etc. to enforce a data format:

+ +
some_df$some_column <- as.numeric(some_df$some_column)
+ +
    +
  • as.numeric converts a vector to numeric values
  • +
  • as.character converts a vector to character values
  • +
  • as.factor converts a vector to a factor
  • +
+ +

Note: dates are a special case. If you are in a pinch, take a look at the lubridate package.

+ +
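A minimal sketch of what lubridate offers: its dmy/mdy/ymd helpers parse dates in a fixed field order, regardless of the separator used.

```r
library(lubridate)

dmy("13.03.2024")    # day-month-year, dot separator -> Date "2024-03-13"
dmy("13-03-2024")    # same result with a dash separator
ymd("2024/03/13")    # year-month-day order
```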

Search and replace with regular expressions

+ +

Mending problems by search and replace

+ +

Regular expressions are a powerful tool to search and replace, not only for mending / cleaning data, but also for data processing in general.

+ + + +

Using patterns to clean data

+ +
    +
  • everything is a character (even numbers)
  • +
  • wildcards (jokers) allow you to create patterns
  • +
  • grep(pattern, string) is the programming equivalent of Ctrl-F in Word/Excel
  • +
  • teachers <- c("January", "Manuela")
  • +
  • grep("an", teachers) finds the string “an” in teachers -> what’s the difference when using grepl?
  • +
  • grep("^J", teachers) finds the strings which start with a capital J
  • +
+ +
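To see the difference between grep and grepl with the teachers vector above:

```r
teachers <- c("January", "Manuela")

grep("^J", teachers)                  # 1 -- the *indices* of the matches
grepl("^J", teachers)                 # TRUE FALSE -- one logical per element
grep("^J", teachers, value = TRUE)    # "January" -- the matches themselves
```

grepl is handy for selecting rows, because a logical vector can be used directly for subsetting.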

Substitutions (search & replace)

+ +
    +
  • gsub(pattern, replacement, text) substitutes all occurrences of pattern in text with replacement
  • +
  • sub(...) same, but replaces only the first occurrence
  • +
+ +
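A one-line illustration of the sub/gsub difference:

```r
sub("an", "AN", "banana")    # "bANana" -- only the first occurrence
gsub("an", "AN", "banana")   # "bANANa" -- all occurrences
```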
## Clearly broken!
+group <- c("ontrol", "Montrol", "Kontrol", "Kkkkkontrl", "hello Connnnnnntrol", 
+           "oh what a nice day it is, controooool", "control ", "Control, incoming")
+
+## replace upper-case "Control"
+group2 <- gsub("[ckCK]+on*tro*l$", "control", group)
+
+grepl("[ckCK]+on*tro*l$", group)
+
+group != group2
+
+## replace all spaces
+group <- gsub(" ", "", group)
+ +

Substitutions (search & replace)

+ +

Notes:

+ +

The above operations are so common that there are “shortcut” functions defined in R:

+ +
## get rid of the spaces
+trimws(group)
+
+## make everything lower case
+tolower(group)
+
+## make everything upper case
+toupper(group)
+ +

Regular expressions

+ +

Is there not a way to simplify things?

+ +
group <- c("Control", " control", "control ", "Control   ")
+
+# one method
+gsub(" ", "", group) # but what happens if the groups are "group 1", "group 2" etc?
+
+# substitute at the end of the string
+gsub(" $", "", group)
+
+# at the beginning
+gsub("^ ", "", group)
+
+# any number of occurences
+gsub(" *", "", group)
+ +
group <- c("Control", " control", "control ", "Control   ")
+group1 <- trimws(group)
+group2 <- tolower(group1)
+
+group <- tolower(trimws(group))
+ +

Regular expression cont.

+ +

In regular expressions, some characters or strings have special meanings

+ +
    +
  • Some mean what is supposed to be replaced (e.g. ., dot, means “any character”)

  • +
  • Some mean how many times something gets matched; e.g. * means “any number of matches of the previous character”, so .* means “any character, any number of times” (i.e., everything)

  • +
  • Some characters at certain positions may also have a special meaning (e.g. ^ means “match the beginning of a string” and $ at the end of a regular expression means “match the end of a string”).

  • +
  • A string like [abc] will match only a, b and c, but not d. You can use ranges, e.g. [0-9] or [a-z]

  • +
  • https://rstudio.github.io/cheatsheets/regex.pdf

  • +
+ +
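Putting these elements together (a small made-up example):

```r
ids <- c("A12", "B7", "7C", "Axx")

# ^      anchor at the start of the string
# [AB]   either "A" or "B"
# [0-9]+ one or more digits
# $      anchor at the end of the string
grepl("^[AB][0-9]+$", ids)   # TRUE TRUE FALSE FALSE
```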

Regular expression cont.

+ +
foo <- c("a1_123", "a1_231", "a2_17", "B2_23", "c2_889")
+
+# substitute digits only
+gsub("[0-9]*", "", foo)
+
+# substitute letters only
+gsub("[a-z]*", "", foo)
+
+# same, but case insensitive
+gsub("[a-z]*", "", foo, ignore.case=TRUE)
+
+# substitute everything after "_"
+# dot means "any character"
+gsub("_.*", "", foo)
+ +

Regular expression cont.

+ +

Note: the function gsub takes three arguments:

+ +
    +
  • a regular expression
  • +
  • a replacement
  • +
  • a character vector
  • +
+ +

It then finds all occurrences of the regular expression in each element of the character vector and replaces them with the replacement.

+ +

Typical example

+ +

Often, due to typos, when you read a data file, a numeric column becomes a character string, because for example someone entered “x15” instead of “15” or mixed German and English decimal separators (e.g. “1,5” and “1.5” in the same column).

+ +

We can use the as.numeric function to convert the strings to numeric values, but this does not work as intended:

+ +
foo <- c("xxxxx15", "15.5", "15,5")
+as.numeric(foo)
+ +
## [1]   NA 15.5   NA
+ +

We need to substitute the strings first:

+ +
foo <- gsub("[a-zA-Z]*", "", foo) # remove letters
+foo <- gsub(",", ".", foo)        # convert decimal separator
+as.numeric(foo)
+ +
## [1] 15.0 15.5 15.5
+ +

Regexps cont.

+ +

You can wrap gsub around another gsub

+ +
vec <- c("mouse", " Mouse", "mus domesticus", "chicken", "Schicken", "Duck")
+vec <- gsub("S*chicken", "Chicken", gsub("^m.*", "Mouse", gsub("^ *", "", vec)))
+
+## same as
+
+vec <- gsub("^ *", "", vec)
+vec <- gsub("^m.*", "Mouse", vec)
+vec <- gsub("S*chicken", "Chicken", vec)
+ +

Exercise 3.3

+ +
    +
  • Use gsub to make the following uniform: c("male", "Male ", "M", "F", "female", " Female")
  • +
  • Using gsub and the toupper function, clean up the gene names such that they conform to the HGNC (all capital letters, no spaces, no dashes): c("ankrd22", "ifng", "Nf-kb", " Cxcl 5", "CCL 6.", "ANK.r.d. 12")
  • +
  • What regular expression matches all of the ankyrin repeat genes (but not other genes) in the following vector: c("ANKRD22", "ANKEN", "ank.rep.domain 12", "ifng-1", "ANKRD-33", " ankrd23", "MAPK")
  • +
+ +

Using regular expressions to clean tables

+ +

Exercise 3.4

+ +
    +
  • Read the data file iris.csv using the read_csv function. Spot and correct errors.
  • +
  • If you have time: Read the data file meta_data_botched.xlsx. Spot and correct errors.
  • +
+ +

Writing data

+ +

Main data types you will encounter:

+ + +++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Data typeFunctionPackageNotes
TSV / TAB separated valueswrite.table()base RTab is \t. A very general and customizable function. Warning: column names can end up shifted when row names are written
write_tsv()readrNo rownames!
CSV / comma separatedwrite.csv()base Ra wrapper around write.table()
write_csv()readrNo rownames!
XLS (old Excel)Just don’t use it. No, seriously, don’t.
XLSX (new Excel)write_xlsx()writexlWarning: row names omitted!
+ +

Row names – gory details.

+ +

The default base R functions read.* assign (unless otherwise specified) row names to a data frame.

+ +

The tidyverse read_* family of functions returns not a data frame but its variant, a tibble. Tibbles behave very much like data frames; however, the tidyverse discourages the use of row names and sometimes even removes existing row names from data frames.

+ +
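You can watch the row names disappear when converting (a minimal sketch):

```r
library(tibble)

df <- data.frame(x = 1:3, row.names = c("a", "b", "c"))
rownames(df)     # "a" "b" "c"

tb <- as_tibble(df)
rownames(tb)     # "1" "2" "3" -- the original row names are gone

# to keep them, turn the row names into a proper column first
tb <- rownames_to_column(df, var = "id")
```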

Wide and Long format (demonstration)

Wide and Long format

+ +

Long advantages:

+ +
    +
  • easier to filter, process, visualize, do statistics with
  • +
  • focused on measurement (“patient ID” or equivalent is a covariate, and so is measurement type)
  • +
+ +

Wide advantages:

+ +
    +
  • groups data by a covariate (“patient ID”)
  • +
  • can be easier to manage (each column one measurement type)
  • +
+ +

Converting from wide to long:

+ +
wide <- read.table(header=TRUE, text='
+ subject sex control cond1 cond2
+       1   M     7.9  12.3  10.7
+       2   F     6.3  10.6  11.1
+       3   F     9.5  13.1  13.8
+       4   M    11.5  13.4  12.9
+')
+pivot_longer(wide, cols=c("control", "cond1", "cond2"), 
+  names_to="condition", values_to="measurement")
+ +

Converting from long to wide

+ +
long <- read.table(header=TRUE, text='
+ subject  sampleID sex condition measurement
+       1  ID000001 M   control         7.9
+       1  ID000002 M     cond1        12.3
+       1  ID000003 M     cond2        10.7
+       2  ID000004 F   control         6.3
+       2  ID000005 F     cond1        10.6
+       2  ID000006 F     cond2        11.1
+       3  ID000007 F   control         9.5
+       3  ID000008 F     cond1        13.1
+       3  ID000009 F     cond2        13.8
+')
+ +

Converting from long to wide

+ +
## not what we wanted!!! Why?
+pivot_wider(long, names_from="condition", values_from="measurement")
+
+## Instead: 
+pivot_wider(long, id_cols="subject", names_from="condition", values_from="measurement")
+ +

Exercise 3.5

+ +

Convert the following files to long format:

+ +
    +
  • labresults_wide.csv
  • +
  • The iris data set (data(iris))
  • +
  • cars.xlsx (tricky!)
  • +
+ +

Clean up and convert to long format (what seems to be the problem? How do we deal with that?):

+ +
    +
  • mtcars_wide.csv
  • +
+ + + + +
+ + + + + + + + + diff --git a/Lectures/lecture_03.rmd b/Lectures/lecture_03.rmd index f02e392..04bc14c 100644 --- a/Lectures/lecture_03.rmd +++ b/Lectures/lecture_03.rmd @@ -33,15 +33,15 @@ Main data types you will encounter: ---------------------------- ------------------------------- ------------------------------- -------------------------- Data type Function Package Notes ---------------------------- ------------------------------- ------------------------------- -------------------------- -Columns separated by spaces `read_table()` `readr` one or more +Columns separated by spaces `read_table()` `readr`/`tidyverse` one or more spaces separate each column -TSV / TAB separated values `read_tsv()` `readr` Delimiter is tab (`\t`). +TSV / TAB separated values `read_tsv()` `readr`/`tidyverse` Delimiter is tab (`\t`). -CSV / comma separated `read_csv()` `readr` Comma separated values +CSV / comma separated `read_csv()` `readr`/`tidyverse` Comma separated values -Any delimiter `read_delim()` `readr` Customizable +Any delimiter `read_delim()` `readr`/`tidyverse` Customizable XLS (old Excel) `read_xls()` `readxl` Just don't use it. `read_excel()` From the @@ -72,7 +72,7 @@ tidyverse functions above are preferable. * Downloading files from our git-repository -## Excercise 3.1 +## Exercise 3.1 Read, inspect the following files: @@ -80,11 +80,16 @@ Read, inspect the following files: * `iris.csv` * `meta_data_botched.xlsx` - 1. Which functions would you use? - 1. What kind of issues can you detect? - 1. Suggestions of solving the issues? +Which functions would you use? -The function `readxl_example("deaths.xls")` returns a file name. Read this file. How can you omit the lines at the top and at the bottom of the file? +The function `readxl_example("deaths.xls")` returns a file name. Read this file: + +```{r} +fn <- readxl_example("deaths.xls") +data <- read_excel(fn) +``` + +How can you omit the lines at the top and at the bottom of the file? (hint: `?read_excel`). 
 How can you force the date columns to be interpreted
 as dates and not numbers?

@@ -92,11 +97,7 @@ as dates and not numbers?

 ## Tibbles / readxl

 tibbles belong to the tidyverse. They are nice to work with and very
-useful, but we can stick to data frames for now. Therefore, do
-
-```{r}
-mydataframe <- as.data.frame(read_xlsx("file.xlsx"))
-```
+useful. Also, they are *mostly* identical to data frames.

 One crucial difference between tibble and data frame is that `tibble[ , 1 ]`
 returns a tibble, while `dataframe[ , 1]` returns a vector. The second
@@ -131,6 +132,23 @@ crucial difference is that it does not support row names (on purpose!).

 * brace yourself for bad times
 * get used to nasty remarks

+## Standardising column names
+
+Column names should be uniform.
+
+ * They should not contain characters other than alphanumerics and underscores
+ * Dots are allowed, but not recommended ("old style")
+ * They should not contain spaces.
+ * They should start with a letter
+
+You can use the janitor package to clean up column names:
+
+```{r}
+library(janitor)
+data <- read_csv("data.csv")
+data <- clean_names(data)
+```
+
 ## Diagnosing problems

 Potential problems:
@@ -149,7 +167,6 @@ Potential problems:

 * tidyverse reading functions provide a summary on the reading process,
   e.g.:

-
 ```{r eval=TRUE,results="markdown"}
 library(tidyverse)
 myiris <- read_csv("../Datasets/iris.csv")
@@ -191,28 +208,86 @@ summary(myiris$`Sepal Length`)

 (we use the back ticks because the column name contains a space)

-
-
-
 ## Diagnosing problems

 * The colorDF package provides a function called `summary_colorDF` which
   can be used to diagnose problems with different flavors of data frames:

-
-
-
 ```{r eval=TRUE,results="markdown",R.options=list(width=100)}
 library(colorDF)
 summary_colorDF(myiris)
 ```

+## Exercise 3.2: Diagnosing problems
+
+ * Read the data file `iris.csv` using the `read_csv` function. Spot the
+   problems. How can we deal with them?
+ * Read the data file `meta_data_botched.xlsx`. Spot
+   the errors. How can we deal with them?
+
 ## Mending problems

+ * Use logical vectors to substitute values which are incorrect
+ * Use logical vectors to filter out rows which are incorrect
+ * Enforce a data format with `as.numeric`, `as.character`, `as.factor` etc.
+ * Use regular expressions to search and replace
+
+## Mending problems with logical vectors
+
+Use logical vectors to substitute values which are incorrect:
+
+```{r}
+nas <- is.na(some_df$some_column)
+some_df$some_column[nas] <- 0
+```
+
+ * `is.na` returns a logical vector of the same length as the input vector
+ * `some_df$some_column[nas]` returns only the values which are `NA`
+ * `some_df$some_column[nas] <- 0` replaces the `NA` values by `0`
+
+## Mending problems with logical vectors
+
+Use logical vectors to substitute values which are incorrect:
+
+```{r}
+to_replace <- some_df$some_column == "male"
+some_df$some_column[to_replace] <- "M"
+```
+
+ * `some_df$some_column == "male"` returns a logical vector with `TRUE` for
+   all values which are equal to "male"
+ * we can then replace them with a standardized value
+
+## Mending problems through filtering
+
+ * Filtering the data: see tomorrow
+
+## Mending problems by enforcing a data format
+
+Use `as.numeric`, `as.character`, `as.factor` etc. to enforce a data format:
+
+```{r}
+some_df$some_column <- as.numeric(some_df$some_column)
+```
+
+ * `as.numeric` converts a vector to numeric values
+ * `as.character` converts a vector to character values
+ * `as.factor` converts a vector to a factor
+
+Note: dates are a special case. If you are in a pinch, take a look at the
+`lubridate` package.
+
+# Search and replace with regular expressions
+
+## Mending problems by search and replace
+
+Regular expressions are a powerful tool for search and replace, not only for
+mending / cleaning data, but also for data processing in general.
+
 * find the incorrect values
 * replace incorrect values (by search and replace)
- * enforcing a particular data type (e.g., converting strings to numbers)
 * [Video: introduction to regular expressions, 15 minutes](https://youtu.be/ukN59iCo5wc)

 ## Using patterns to clean data

@@ -226,8 +301,8 @@ summary_colorDF(myiris)

 ## Substitutions (search & replace)

- * `gsub(string1, string2, text)` substitute all occurences of `string1` in
-   by `string2` in `text`
+ * `gsub(pattern, string, text)` substitute all occurrences of `pattern`
+   by `string` in `text`
 * `sub(...)` same, but only the first occurence

 ```{r}
@@ -290,9 +365,7 @@ group <- c("Control", " control", "control ", "Control ")
 group1 <- trimws(group)
 group2 <- tolower(group1)

-## sin(log(tan(pi)))
 group <- tolower(trimws(group))
-
 ```

@@ -385,7 +458,7 @@ vec <- gsub("^m.*", "Mouse", vec)
 vec <- gsub("S*chicken", "Chicken", vec)
 ```

-## Exercise 3.2
+## Exercise 3.3

 * Use gsub to make the following uniform: `c("male", "Male ", "M", "F", "female", " Female")`
 * Using `gsub` and the `toupper` function, clean up the gene names such
@@ -397,7 +470,7 @@ vec <- gsub("S*chicken", "Chicken", vec)

 ## Using regular expressions to clean tables

-## Exercise 3.3
+## Exercise 3.4

 * Read the data file `iris.csv` using the `read_csv` function. Spot and correct errors.
 * If you have time: Read the data file `meta_data_botched.xlsx`. Spot and correct errors.

@@ -444,7 +517,7 @@ removes existing row names on data frames.

 ## Wide and Long format (demonstration)

- * https://youtu.be/NO1gaeJ7wtA`
+ * https://youtu.be/NO1gaeJ7wtA
 * https://youtu.be/v5Y_yrnkWIU
 * https://youtu.be/jN0CI62WKs8

@@ -506,7 +579,7 @@ pivot_wider(long, names_from="condition", values_from="measurement")
 pivot_wider(long, id_cols="subject", names_from="condition", values_from="measurement")
 ```

-## Exercise 3.4
+## Exercise 3.5

 Convert the following files to long format:
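
Reviewer note on the last hunk: the slide shows `pivot_wider` in isolation. A minimal round-trip sketch of the same wide/long conversion, with a small made-up data frame (the `wide` object and its values are illustrative, not from the lecture; column names `subject` / `condition` / `measurement` follow the slide's own example):

```{r}
library(tidyr)

# hypothetical wide-format data: one column per condition
wide <- data.frame(subject = c("A", "B"),
                   control = c(1.0, 1.2),
                   treated = c(2.1, 2.4))

# wide -> long: one row per (subject, condition) pair
long <- pivot_longer(wide, cols = c("control", "treated"),
                     names_to = "condition", values_to = "measurement")

# long -> wide again, matching the call in the slide above
wide2 <- pivot_wider(long, id_cols = "subject",
                     names_from = "condition", values_from = "measurement")
```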