Data Analysis Patterns
Vince Buffalo edited this page Jun 11, 2018
Suppose you have files with semantic names, like sampleA_rep01.tsv, sampleA_rep02.tsv, ..., sampleC_rep01.tsv. You want to load and combine all of the data, and extract the relevant metadata from the filenames into columns. How do you do this? Tidyverse to the rescue:
### example setup:
library(tidyverse) # provides walk(), write_tsv(), and the functions used below
DIR <- 'path/to/data' # change to directory you can write files to.
# filenames to make example work:
files <- c('sampleA_rep01.tsv', 'sampleA_rep02.tsv', 'sampleB_rep01.tsv',
           'sampleB_rep02.tsv', 'sampleC_rep01.tsv', 'sampleC_rep02.tsv')
# write test files for example (iris a bunch of times)
walk(files, ~ write_tsv(iris, file.path(DIR, .)))
### Pattern:
# grab all files programmatically:
input_files <- list.files(DIR,
                          pattern='sample.*\\.tsv', full.names=TRUE)
# data loading pattern:
all_data <- tibble(file=input_files) %>%
  # read data in (note: in general, best to pass
  # col_names and col_types to read_tsv via map)
  mutate(data=map(file, read_tsv)) %>%
  # get the file basename (no path); if
  # your metadata is in the path, change accordingly!
  mutate(basename=basename(file)) %>%
  # extract the metadata from the base filename
  extract(basename, into=c('sample', 'rep'),
          regex='sample([^_]+)_rep([^_]+)\\.tsv') %>%
  unnest(data) # optional, depends on what you need.
Before the unnest(), the data looks like:
# A tibble: 6 x 4
  file              data               sample rep
* <chr>             <list>             <chr>  <chr>
1 sampleA_rep01.tsv <tibble [150 × 5]> A      01
2 sampleA_rep02.tsv <tibble [150 × 5]> A      02
3 sampleB_rep01.tsv <tibble [150 × 5]> B      01
4 sampleB_rep02.tsv <tibble [150 × 5]> B      02
5 sampleC_rep01.tsv <tibble [150 × 5]> C      01
6 sampleC_rep02.tsv <tibble [150 × 5]> C      02
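The comment in the loading pattern recommends passing col_types when reading. A minimal sketch of that variant follows; the scratch directory, the two demo filenames, and the names demo_dir, demo_files, iris_spec and demo_data are made up for illustration, and the column spec assumes the files contain the iris columns written in the setup above.

```r
library(tidyverse)

# Hypothetical scratch directory and filenames, for illustration only.
demo_dir <- file.path(tempdir(), "pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_files <- c("sampleA_rep01.tsv", "sampleB_rep01.tsv")
walk(demo_files, ~ write_tsv(iris, file.path(demo_dir, .x)))

# Explicit column specification: four numeric columns, one character column.
iris_spec <- cols(
  Sepal.Length = col_double(),
  Sepal.Width  = col_double(),
  Petal.Length = col_double(),
  Petal.Width  = col_double(),
  Species      = col_character()
)

input <- list.files(demo_dir, pattern = "^sample.*\\.tsv$", full.names = TRUE)

demo_data <- tibble(file = input) %>%
  # every file is parsed with the same spec, so the types cannot drift
  mutate(data = map(file, ~ read_tsv(.x, col_types = iris_spec))) %>%
  mutate(basename = basename(file)) %>%
  tidyr::extract(basename, into = c("sample", "rep"),
                 regex = "sample([^_]+)_rep([^_]+)\\.tsv") %>%
  unnest(data)
```

Fixing the types up front means read_tsv does not have to guess them separately for each file, so a file whose first rows look different cannot end up with a column type that conflicts with the others when the data are unnested.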
A second pattern, spreading multiple summary statistics at once, is useful if you have data with multiple columns A, B, etc. that each need a lower/mean/upper summary statistic calculated on them, and you want as your end result the columns A_lower, A_mean, A_upper, B_lower, B_mean, B_upper, etc. The trick to spreading multiple columns like this is to realize you need to do a gather() + unite() first. This could probably be made more efficient, but here is a quick, readable version:
library(tidyverse)
iris <- as_tibble(iris)
iris %>%
  # long format: one row per (flower, variable) pair
  gather(var_type, val, Sepal.Length:Petal.Width) %>%
  group_by(Species, var_type) %>%
  summarize(lower=quantile(val, 0.25),
            mean=mean(val),
            upper=quantile(val, 0.75)) %>%
  # now, gather the summary statistics into long format
  gather(stat, val, lower:upper) %>%
  # now, unite to make the new column name (which will become
  # a column after spread)
  unite(col, var_type:stat) %>%
  spread(col, val)
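As one possible way to shorten the version above, here is a sketch that assumes tidyr >= 1.0.0 and dplyr >= 1.0.0, where pivot_longer() and pivot_wider() are available. It is an alternative formulation, not the original author's method; the names iris_tbl and wide are illustrative. pivot_wider() builds the combined column names through names_glue, so the second gather() and the unite() are not needed.

```r
library(tidyverse)

iris_tbl <- as_tibble(iris)

wide <- iris_tbl %>%
  # long format: one row per (flower, variable) pair
  pivot_longer(Sepal.Length:Petal.Width,
               names_to = "var_type", values_to = "val") %>%
  group_by(Species, var_type) %>%
  summarise(lower = quantile(val, 0.25, names = FALSE),
            mean  = mean(val),
            upper = quantile(val, 0.75, names = FALSE),
            .groups = "drop") %>%
  # spread all three statistics at once; names_glue produces
  # names such as Sepal.Length_lower and Sepal.Length_mean
  pivot_wider(names_from = var_type,
              values_from = c(lower, mean, upper),
              names_glue = "{var_type}_{.value}")
```

The result has one row per species and one column per variable/statistic combination, the same set of columns as the gather() + unite() + spread() version, though the column order may differ.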