- Reading data
- Cleaning data
- (Regular expressions)
- Searching, sorting and selecting
2024-09-18
Main data types you will encounter:

| Data type | Function | Package | Notes |
|---|---|---|---|
| Columns separated by spaces | `read_table()` | readr / tidyverse | One or more spaces separate each column |
| TSV / tab-separated values | `read_tsv()` | readr / tidyverse | Delimiter is tab (`\t`) |
| CSV / comma-separated values | `read_csv()` | readr / tidyverse | Delimiter is a comma |
| Any delimiter | `read_delim()` | readr / tidyverse | Customizable |
| XLS (old Excel) | `read_xls()`, `read_excel()` | readxl | Just don't use it. From the readxl package. |
| XLSX (new Excel) | `read_xlsx()`, `read_excel()` | readxl | From the readxl package. You need to provide the sheet number you wish to read. Note: returns a tibble, not a data frame! |
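For example, to read a specific sheet from an XLSX file (the file name and sheet number below are placeholders):

```r
library(readxl)

## file name and sheet number are made up -- adapt them to your own file
meta <- read_xlsx("meta_data.xlsx", sheet = 2)
```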
Note: there are also "base R" functions `read.table()`, `read.csv()` and `read.delim()` (there is no function for reading XLS[X] files in base R). The tidyverse functions above are preferable.
- Absolute paths start at the root directory, e.g. `C:\Users\mbenary_c\CUBI\teaching\202403_RCrashCourse\lecture01.R`
- Relative paths start at the working directory, e.g. `lecture01.R`
- Where am I? `getwd()`

Downloading files from our git repository
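A minimal sketch of downloading a file into the working directory and then reading it with a relative path (the URL below is a placeholder, use the address of the file in our repository):

```r
library(tidyverse)

url <- "https://example.org/RCrashCourse/Datasets/iris.csv"   # placeholder URL
download.file(url, destfile = "iris.csv")    # lands in the working directory
myiris <- read_csv("iris.csv")               # relative path
```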
Read and inspect the following files:

- `TB_ORD_Gambia_Sutherland_biochemicals.csv`
- `iris.csv`
- `meta_data_botched.xlsx`

Which functions would you use?
The function `readxl_example("deaths.xls")` returns a file name. Read this file:

```r
library(readxl)
fn <- readxl_example("deaths.xls")
data <- read_excel(fn)
```

How can you omit the lines at the top and at the bottom of the file? (Hint: `?read_excel`.) How can you force the date columns to be interpreted as dates and not numbers?
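A possible starting point (the values for `skip` and `n_max` are assumptions, inspect the file first):

```r
library(readxl)

fn <- readxl_example("deaths.xls")
deaths <- read_excel(fn, skip = 4, n_max = 10)   # assumed values: skip the junk rows at the top and bottom

## the col_types argument (a vector that may contain "date" entries) can force
## columns to be read as dates rather than numbers
```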
Tibbles belong to the tidyverse. They are nice to work with and very useful, and they are mostly identical to data frames.

One crucial difference between a tibble and a data frame is that `tibble[ , 1]` returns a tibble, while `dataframe[ , 1]` returns a vector. The second crucial difference is that a tibble does not support row names (on purpose!).
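A quick illustration of the difference, using a toy data frame:

```r
library(tibble)

df  <- data.frame(x = 1:3, y = c("a", "b", "c"))
tbl <- as_tibble(df)

df[ , 1]    # a plain vector: 1 2 3
tbl[ , 1]   # still a tibble with the single column x
```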
Video: tibbles vs data frames, 10 minutes -> for later
If you use Excel, follow these rules:

- Column names should be uniform.

You can use the janitor package to clean up column names:

```r
library(janitor)
data <- read_csv("data.csv")   # read_csv() comes from readr / tidyverse
data <- clean_names(data)
```
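A toy example of what `clean_names()` does (column names invented for illustration):

```r
library(janitor)
library(tibble)

tbl <- tibble(`Sepal Length` = c(5.1, 4.9), `Petal Width` = c(0.2, 0.2))
clean_names(tbl)   # column names become sepal_length and petal_width
```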
Potential problems:

```r
library(tidyverse)
myiris <- read_csv("../Datasets/iris.csv")
```
Use the `summary`, `class` and `typeof` functions to find potential problems. This is what we expect:

```r
typeof(myiris$Petal.Width)
## [1] "double"

class(myiris$Petal.Width)
## [1] "numeric"

summary(myiris$Petal.Width)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.100   0.300   1.300   1.199   1.800   2.500
```
Longer explanation: `typeof` returns the so-called "storage type", that is, the very basic type of a value (character, logical, integer, double, list etc.). `class` can be different (more specific). For example, if `mtx` is a matrix of numbers, `typeof(mtx)` is `double`, and `class(mtx)` is `matrix`.
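A quick check in the console (the class shown assumes R >= 4.0, where matrices also inherit from "array"):

```r
mtx <- matrix(rnorm(6), nrow = 2)   # a 2 x 3 matrix of random numbers
typeof(mtx)   # "double" -- the storage type
class(mtx)    # "matrix" "array" -- the more specific class
```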
Use the `summary`, `class` and `typeof` functions to find potential problems. This is not what we expect:

```r
typeof(myiris$`Sepal Length`)
## [1] "character"

class(myiris$`Sepal Length`)
## [1] "character"

summary(myiris$`Sepal Length`)
##    Length     Class      Mode
##       150 character character
```
(We use backticks because the column name contains a space.)

The colorDF package provides the function `summary_colorDF()`, which can be used to diagnose problems with different flavors of data frames:

```r
library(colorDF)
summary_colorDF(myiris)
## # Color data frame (class colorDF) 5 x 5:
##  │Col         │Class│NAs │unique│Summary
## 1│Sepal Length│<chr>│   0│    37│5: 10, 5.1: 9, 6.3: 9, 5.7: 8, 6.7: 8, 5.5: 7, 5.8: 7, 6.4: 7…
## 2│Sepal Width │<dbl>│   0│    25│ 2.00 [ 2.80 < 3.00>  3.38] 36.00
## 3│Petal?Length│<chr>│   0│    44│1.4: 13, 1.5: 13, 4.5: 8, 1.3: 7, 1.6: 7, 5.1: 7, 5.6: 6, 4: …
## 4│Petal.Width │<dbl>│   0│    22│0.1 [0.3 <1.3> 1.8] 2.5
## 5│Species     │<chr>│   0│     6│virginica: 46, setosa: 45, versicolor: 42, Versicolor: 8, Set…
```
- Read `iris.csv` using the `read_csv` function. Spot the problems. How can we deal with them?
- Read `meta_data_botched.xlsx`. Spot the errors. How can we deal with them?

To fix the problems by hand, you can use `as.numeric`, `as.character`, `as.factor` etc. to enforce column types, and logical vectors to replace incorrect values.

Use logical vectors to substitute values which are incorrect:

```r
nas <- is.na(some_df$some_column)
some_df$some_column[nas] <- 0
```

- `is.na` returns a logical vector of the same length as the input vector
- `some_df$some_column[nas]` returns only the values which are `NA`
- `some_df$some_column[nas] <- 0` replaces the `NA` values by 0

Use logical vectors to substitute values which are incorrect:

```r
to_replace <- some_df$some_column == "male"
some_df$some_column[to_replace] <- "M"
```

- `some_df$some_column == "male"` returns a logical vector with `TRUE` for all values which are equal to "male"

Use `as.numeric`, `as.character`, `as.factor` etc. to enforce a data format:

```r
some_df$some_column <- as.numeric(some_df$some_column)
```

- `as.numeric` converts a vector to numeric values
- `as.character` converts a vector to character values
- `as.factor` converts a vector to a factor

Note: dates are a special case. If you are in a pinch, take a look at the lubridate package.
+ +grep(pattern, string)
is the programming equivalent of Ctrl F
in Word/Excelteachers <- c("January", "Manuela")
grep("an", teachers)
find the string “an” in teachers -> what’s the difference when using grepl
grep("^J", teachers)
find the string which starts with a capital Jgsub(pattern, string, text)
substitute all occurences of pattern
in by string
in text
sub(...)
same, but only the first occurence## Clearly broken! +group <- c("ontrol", "Montrol", "Kontrol", "Kkkkkontrl", "hello Connnnnnntrol", + "oh what a nice day it is, controooool", "control ", "Control, incoming") + +## replace upper-case "Control" +group2 <- gsub("[ckCK]+on*tro*l$", "control", group) + +grepl("[ckCK]+on*tro*l$", group) + +group != group2 + +## replace all spaces +group <- gsub(" ", "", group)+ +
Notes:
+ +The above operations are so common that there are “shortcut” functions defined in R:
+ +## get rid of the spaces +trimws(group) + +## make everything lower case +tolower(group) + +## make everything upper case +toupper(group)+ +
Is there not a way to simplify things?
+ +group <- c("Control", " control", "control ", "Control ") + +# one method +gsub(" ", "", group) # but what happens if the groups are "group 1", "group 2" etc? + +# substitute at the end of the string +gsub(" $", "", group) + +# at the beginning +gsub("^ ", "", group) + +# any number of occurences +gsub(" *", "", group)+ +
group <- c("Control", " control", "control ", "Control ") +group1 <- trimws(group) +group2 <- tolower(group1) + +group <- tolower(trimws(group))+ +
In regular expressions, some characters or strings have special meanings
+ +Some mean what is supposed to be replaced (e.g. .
, dot, means “any character”)
Some mean how many times something gets matched; e.g. *
means “any number of matches of the previous character, so .*
means any character any number of times (like, everything)
Some characters at certain position may also have a special meaning (like ^
means “match the beginning of a string” and $
at the end of a regular expression means “match at the end of a string”).
A string like [abc]
will match only a, b and c, but not d. You can use ranges, e.g. [0-9]
or [a-z]
foo <- c("a1_123", "a1_231", "a2_17", "B2_23", "c2_889") + +# substitute digits only +gsub("[0-9]*", "", foo) + +# substitute letters only +gsub("[a-z]*", "", foo) + +# same, but case insensitive +gsub("[a-z]*", "", foo, ignore.case=TRUE) + +# substitute everything after "_" +# dot means "any character" +gsub("_.*", "", foo)+ +
Note: the function gsub
takes three arguments:
It then finds all occurences of the regular expression in each element of the character vector and replaces them by the replacement.
+ +Often, due to typos, when you read a data file, a numeric column becomes a character string, because for example someone entered “x15” instead of “15” or mixed German and English decimal separators (e.g. “1,5” and “1.5” in the same column).
+ +We can use the as.numeric
function to convert the strings to numeric values, but this does not work as intended:
foo <- c("xxxxx15", "15.5", "15,5") +as.numeric(foo)+ +
## [1] NA 15.5 NA+ +
We need to substitute the strings first:
+ +foo <- gsub("[a-zA-Z]*", "", foo) # remove letters +foo <- gsub(",", ".", foo) # convert decimal separator +as.numeric(foo)+ +
## [1] 15.0 15.5 15.5+ +
You can wrap gsub around another gsub
+ +vec <- c("mouse", " Mouse", "mus domesticus", "chicken", "Schicken", "Duck") +vec <- gsub("S*chicken", "Chicken", gsub("^m.*", "Mouse", gsub("^ *", "", vec))) + +## same as + +vec <- gsub("^ *", "", vec) +vec <- gsub("^m.*", "Mouse", vec) +vec <- gsub("S*chicken", "Chicken", vec)+ +
c("male", "Male ", "M", "F", "female", " Female")
gsub
and the toupper
function, clean up the gene names such that they conform to the HGNC (all capital letters, no spaces, no dashes): c("ankrd22", "ifng", "Nf-kb", " Cxcl 5", "CCL 6.", "ANK.r.d. 12")
c("ANKRD22", "ANKEN", "ank.rep.domain 12", "ifng-1", "ANKRD-33", " ankrd23", "MAPK)
iris.csv
using the read_csv
function. Spot and correct errors.meta_data_botched.xlsx
. Spot and correct errors.Main data types you will encounter:
+ +Data type | +Function | +Package | +Notes | +
---|---|---|---|
TSV / TAB separated values | +write.table() |
+base R |
+Tab is \t . A very general and customizable function Warning: column names |
+
+ | write_tsv() |
+readr |
+No rownames! | +
CSV / comma separated | +write.csv() |
+base R |
+a wrapper around write.table() |
+
+ | write_csv() |
+readr |
+No rownames! | +
XLS (old Excel) | ++ | + | Just don’t use it. No, seriously, don’t. | +
XLSX (new Excel) | +write_xlsx() |
+writexl |
+Warning: row names omitted! | +
The default base R functions read.*
assign (unless otherwise specified) row names to a data frame.
The tidyverse read_*
family of functions return not a data frame, but its variant, a tibble. These tibbles behave very much like data frame; however, tidyverse discourages the use of row names and sometimes even removes existing row names on data frames.
Long advantages:
+ +Wide advantages:
+ +wide <- read.table(header=TRUE, text=' + subject sex control cond1 cond2 + 1 M 7.9 12.3 10.7 + 2 F 6.3 10.6 11.1 + 3 F 9.5 13.1 13.8 + 4 M 11.5 13.4 12.9 +') +pivot_longer(wide, cols=c("control", "cond1", "cond2"), + names_to="condition", values_to="measurement")+ +
long <- read.table(header=TRUE, text=' + subject sampleID sex condition measurement + 1 ID000001 M control 7.9 + 1 ID000002 M cond1 12.3 + 1 ID000003 M cond2 10.7 + 2 ID000004 F control 6.3 + 2 ID000005 F cond1 10.6 + 2 ID000006 F cond2 11.1 + 3 ID000007 F control 9.5 + 3 ID000008 F cond1 13.1 + 3 ID000009 F cond2 13.8 +')+ +
## not what we wanted!!! Why? +pivot_wider(long, names_from="condition", values_from="measurement") + +## Instead: +pivot_wider(long, id_cols="subject", names_from="condition", values_from="measurement")+ +
Convert the following files to long format:
+ +labresults_wide.csv
data(iris)
)cars.xlsx
(tricky!)Clean up and convert to long format (what seems to be the problem? How do we deal with that?):
+ +mtcars_wide.csv