Aims for today
-
-
- https://youtu.be/NO1gaeJ7wtA -
- https://youtu.be/v5Y_yrnkWIU -
- https://youtu.be/jN0CI62WKs8 +
- Searching, sorting and selecting +
- Matching and merging data +
- Pipes - writing readable code +
- Wide and long format
Long advantages:
+ --
-
- easier to filter, process, visualize, do statistics with -
- focused on measurement (“patient ID” or equivalent is a covariate, and so is measurement type) -
Selecting columns
Wide advantages:
+Selecting columns of data frames
-
-
- groups data by a covariate (“patient ID”) -
- can be easier to manage (each column one measurement type) -
If we want the actual column, we use the $
operator:
Exercise 4.1
df <- data.frame(a=1:5, b=6:10, c=11:15, d=16:20) +df$a-
Convert the following files to long format:
+However, what if we want to select multiple columns?
--
-
labresults_wide.csv
-- The iris data set (
data(iris)
)
- cars.xlsx
(tricky!)
-
Selecting multiple columns
Discuss how to clean up and convert to long format (what seems to be the problem? How do we deal with that?):
+First, the old way:
--
-
mtcars_wide.csv
-
# select columns 1 to 2 +df2 <- df[ , 1:2] -
Aims for today
-
-
- Pipes - writing readable code -
- Searching, sorting and selecting -
- Matching and merging data -
- Visualization -
Pipes in R
This is very similar to what we did when dealing with matrices, and actually similar to how we select elements from a vector.
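A minimal sketch of the bracket notation, reusing the toy data frame `df` from above (the other variable names are made up for illustration). Selecting by name works the same way as selecting by position:

```r
# toy data frame, as defined earlier on the slides
df <- data.frame(a=1:5, b=6:10, c=11:15, d=16:20)

# select columns by position, as with a matrix
df2 <- df[ , 1:2]

# select columns by name; here the quotes are required
df3 <- df[ , c("a", "c")]

# the same idea for a vector: select elements by position
v <- c(10, 20, 30, 40)
v[1:2]

# caution: selecting a single column returns a vector, not a data frame
x <- df[ , 1]

# use drop=FALSE to keep a data frame
df4 <- df[ , 1, drop=FALSE]
```

The `drop=FALSE` behaviour applies to plain data frames; tibbles created by tidyverse functions always return a data frame from `[`.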
-Nested function calls vs piping
Selecting columns using tidyverse
# from Exercise 3.3 -iris$petal_length <- gsub("[a-z]", "", iris$petal_length) -iris$petal_length <- gsub(",", ".", iris$petal_length) -iris$petal_length <- as.numeric(iris$petal_length) ++Tidyverse has the
-iris$petal_length |> - str_remove("[a-z]", "") |> - str_replace(",", ".") |> - as.numeric()select
function, which is more explicit and readable. It also has extra features that make it easier to work with!
library(tidyverse) +# select columns a and c +df2 <- select(df, a, c) + +# select columns a to c +df2 <- select(df, a:c) + +# select anything but column b +df2 <- select(df, -b)+ +
Note: This only works with tidyverse functions!
+ +Tidyverse and quotes
select(df, a, c)+ +
Note the lack of quotes around a
and c
! This is a feature in tidyverse which has two effects:
-
+
- it is easier to type (you save the typing of
df$""
! Imagine how much time you save)
+ - it is confusing for beginners (“why are there no quotes?”, “when should I use quotes and when not?”, “how does it know that it is
df$a
and not some other a
?”)
+ - makes programming confusing (what if “a” holds the name of the column that you would like to sort by? - use
.data[[a]]
; or what if it is some other vector by which you wish to sort?)
+
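A short sketch of what to do when the column names are stored in a variable. The variables `wanted` and `sort_col` are hypothetical; `all_of()` and the `.data` pronoun come with dplyr, which is loaded by `library(tidyverse)`:

```r
library(dplyr)  # also loaded by library(tidyverse)

df <- data.frame(a=5:1, b=6:10, c=11:15, d=16:20)

# "wanted" is a hypothetical variable holding column names as strings
wanted <- c("a", "c")
df2 <- select(df, all_of(wanted))

# "sort_col" is a hypothetical variable holding a single column name;
# .data[[sort_col]] tells tidyverse to look up the column called "a"
sort_col <- "a"
df3 <- arrange(df, .data[[sort_col]])
```

This way it is explicit that the variable holds a name and is not itself a column of `df`.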
Exercise 4.1
-
+
- Read the file ‘Datasets/transcriptomics_results.csv’ +
- What columns are in the file? +
- Select only the columns ‘GeneName’, ‘Description’, ‘logFC.F.D1’ and ‘qval.F.D1’ +
- Rename the columns to ‘Gene’, ‘Description’, ‘LFC’ and ‘FDR’ +
Searching, sorting and selecting
Sorting and ordering
sort and order (base R - not covered in the course)
sort
directly sorts a vector:
v <- sample(1:10)/10 # randomize numbers 1-10 +# randomize numbers 0.1, 0.2, ... 1 +v <- sample(1:10)/10 sort(v) ## decreasing @@ -3349,24 +3367,51 @@ ## same as rev(sort(v))+
sort and order cont.
However, order
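is more useful. It returns the positions of the values in the original vector, listed in the order that sorts them: the first number is the position of the smallest value, the second that of the next smallest, and so on.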
order(v) -order(v, decreasing=TRUE)+
order(v)-
sort and order cont.
## [1] 2 4 3 1 6 8 9 7 10 5+ +
order(v, decreasing=TRUE)+ +
## [1] 5 10 7 9 8 6 1 3 4 2+ +
Think for a moment what happens here.
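To make this concrete, here is a small sketch with a fixed vector instead of a random one (the values are made up for illustration):

```r
v <- c(0.3, 0.1, 0.4, 0.2)

# sort returns the values themselves, in increasing order
s <- sort(v)
s
## [1] 0.1 0.2 0.3 0.4

# order returns the positions of these values in the original vector:
# the smallest value (0.1) is at position 2, the next (0.2) at position 4, etc.
o <- order(v)
o
## [1] 2 4 1 3
```

Indexing the original vector with the result of `order` puts the values in sorted order.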
+ +sort and order cont.
sort
and order
can be applied to character vectors as well:
l <- sample(letters, 10) -sort(l) -order(l, decreasing=TRUE)+sort(l) + +
## [1] "a" "b" "d" "g" "k" "n" "o" "q" "r" "t"+ +
order(l, decreasing=TRUE)+ +
## [1] 9 5 8 6 3 7 2 1 10 4
Note that numbers converted to a character vector are sorted alphabetically, not numerically, so the result may be unexpected:
v <- sample(1:200, 15) sort(as.character(v))+
## [1] "109" "119" "130" "133" "151" "185" "189" "197" "39" "56" "63" "71" "77" "81" "96"+ +
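One possible way around this, sketched with a small made-up vector, is to convert the values back to numbers before sorting:

```r
v <- c(109, 39, 56, 130)

# sorted as text: "109" and "130" come before "39", because "1" < "3"
s_chr <- sort(as.character(v))

# sorted as numbers: convert back with as.numeric first
s_num <- sort(as.numeric(as.character(v)))
s_num
## [1]  39  56 109 130
```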
Using order to sort the data
We can use the return value of order
to sort the vector:
v <- sample(1:10)/10 +v[ order(v) ]+ +
## [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0+ +
This is the same as sort(v)
, but has a huge advantage: we can use it to sort another vector, matrix, list, data frame etc.
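For instance, a sketch with two made-up vectors of the same length, where one is sorted by the values of the other:

```r
# hypothetical example data: one score per gene
scores <- c(0.7, 0.2, 0.9)
genes  <- c("geneA", "geneB", "geneC")

ord <- order(scores)

# the scores in increasing order
scores[ord]
## [1] 0.2 0.7 0.9

# the gene names, arranged by their scores
genes_sorted <- genes[ord]
```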
Sorting data frames (using order)
To sort a data frame according to one of its columns, we use order
and then select rows of a data frame based on that order. That is the “classic” way of sorting.
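A minimal sketch of the classic approach; the data frame and its column names `id` and `val` are made up for illustration:

```r
df <- data.frame(id=c("x", "y", "z", "w"),
                 val=c(0.5, -2, 1.5, -0.1))

# positions of the rows, ordered by increasing val
ord <- order(df$val)

# select the rows in that order (note the comma: rows come first)
df_sorted <- df[ord, ]

# largest absolute values first
df_abs <- df[order(abs(df$val), decreasing=TRUE), ]
```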
Sorting data frames with tidyverse
Sorting with tidyverse is easier.
+Sorting with tidyverse is easier (but comes at a cost - you need to know tidyverse functions):
arrange(df, val) @@ -3391,37 +3436,31 @@ ## largest absolute values first arrange(df, desc(abs(val)))-
Sorting data frames with tidyverse
Note: no quotes around column names!
-arrange(df, val)- -
Note the lack of quotes around val
! This is a feature in tidyverse which has two effects:
Why both?
-
-
- it is easier to type (you save the typing of
df$
! imagine how much time you have now)
- - it is confusing for beginners (“why are there no quotes?”, “when should I use quotes and when not?”, “how does it know that it is
df$val
and not some other val
?”)
- - makes programming confusing (what if “val” holds the name of the column that you would like to sort by? - use
.data[[val]]
; Or is some other vector by which you wish to sort?)
+ order
is more flexible and can be used for any type of data
+arrange
is easier to use and is more readable, but only works with data frames
You should know both!
+Example
## read the transcriptomic results data set res <- read_csv("Datasets/transcriptomics_results.csv") ## only a few interesting columns -res <- res[ , c(3, 5, 8:9) ] -colnames(res) <- c("Gene", "Description", "LFC", "p.value")+res <- select(res, GeneName, Description, logFC.F.D1, qval.F.D1) -
Example cont.
We can use sort, factor and level to find out more about our data set:
- -desc.sum <- summary(factor(res$Description)) -head(sort(desc.sum, decreasing=TRUE)) # using base R sorting+## use new column names +colnames(res) <- c("Gene", "Description", "LFC", "FDR")
Data from: Weiner, January, et al. “Characterization of potential biomarkers of reactogenicity of licensed antiviral vaccines: randomized controlled clinical trials conducted by the BIOVACSAFE consortium.” Scientific reports 9.1 (2019): 1-14.
-Example cont.
Example cont.
## order by decreasing absolute logFC res <- arrange(res, desc(abs(LFC))) @@ -3432,28 +3471,24 @@ # res <- res[ord, ] ## then, order by p-value -res <- arrange(res, p.value) +res <- arrange(res, FDR) plot(abs(res$LFC[1:250]), type="b") -plot(res$p.value[1:250], type="b", log="y")+plot(res$FDR[1:250], type="b", log="y") -
Side-note on plotting
Filtering and subsetting
Selecting / filtering of data frames
Filtering of data frames
There are two ways, both simple. In both of them, you need to have a logical vector that indicates which rows to keep and which to remove.
-keep <- res$p.value < 0.05 ++ +keep <- res$FDR < 0.05 res[ keep, ] -## or +## or, with tidyverse: -filter(res, p.value < 0.05) -## note that we don't have to type "res$p.value", -## see comment about tidyverse above+filter(res, FDR < 0.05)
Note: again, we don’t use quotes around column names!
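One difference between the two ways that may be worth knowing shows up with missing values. The sketch below uses a made-up data frame `res_toy`, not the real results:

```r
library(dplyr)  # also loaded by library(tidyverse)

# hypothetical toy data with one missing FDR value
res_toy <- data.frame(Gene=c("g1", "g2", "g3"),
                      FDR=c(0.01, NA, 0.2))

keep <- res_toy$FDR < 0.05   # TRUE, NA, FALSE

# base R: the NA in keep produces a row filled with NAs
by_bracket <- res_toy[keep, ]

# tidyverse: rows where the condition is NA are dropped
by_filter <- filter(res_toy, FDR < 0.05)

# base R, dropping the NAs explicitly with which()
by_which <- res_toy[which(keep), ]
```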
Exercise 4.2
sel <- res$p.value < 0.01 & res$LFC > 0 head(res[ sel, ])-
Note: for long data frames, head
shows only the first 6 rows.
Combining searches
Note: More on the filter()
function and other tidyverse functions later.
Filtering with multiple conditions
keep <- res$FDR < 0.05 & abs(res$LFC) > 1 +res[ keep, ] + +## or, with tidyverse: +filter(res, FDR < 0.05, abs(LFC) > 1) +filter(res, FDR < 0.05 & abs(LFC) > 1)+
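The same idea on a small made-up data frame (`res_toy` is hypothetical and only mimics the column names used above), including an "or" condition with `|`:

```r
library(dplyr)  # also loaded by library(tidyverse)

res_toy <- data.frame(Gene=c("g1", "g2", "g3", "g4"),
                      LFC=c(2.1, -1.5, 0.3, -0.2),
                      FDR=c(0.001, 0.03, 0.01, 0.5))

# both conditions must hold: combine logical vectors with &
keep <- res_toy$FDR < 0.05 & abs(res_toy$LFC) > 1
both_base <- res_toy[keep, ]

# in filter(), conditions separated by commas are combined with "and"
both_comma <- filter(res_toy, FDR < 0.05, abs(LFC) > 1)
both_amp   <- filter(res_toy, FDR < 0.05 & abs(LFC) > 1)

# at least one condition must hold: use |
either <- filter(res_toy, FDR < 0.05 | abs(LFC) > 1)
```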
Exercise 4.3
Continue with the data frame from exercise 4.2
@@ -3648,7 +3692,140 @@[ , c("ARM", "sex") ]
to select the desired columns from a data set.