
Data Analysis Patterns

Vince Buffalo edited this page May 2, 2018 · 9 revisions

Reading in multiple files

Suppose you have files with semantic names, like sampleA_rep01.tsv, sampleA_rep02.tsv, ..., sampleC_rep01.tsv. You want to load all of them, combine the data into a single data frame, and extract the metadata encoded in the filenames (sample and replicate) into columns. How do you do this? Tidyverse to the rescue:

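library(tidyverse)

# filenames to make example work:
files <- c('sampleA_rep01.tsv', 'sampleA_rep02.tsv','sampleB_rep01.tsv',
           'sampleB_rep02.tsv', 'sampleC_rep01.tsv', 'sampleC_rep02.tsv')

# write test files for example (iris a bunch of times)
# create the target directory first, since write_tsv will not create it
dir.create('path/to/data', recursive=TRUE, showWarnings=FALSE)
walk(files, ~ write_tsv(iris, file.path('path/to/data', .)))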

# normally you would do:
input_files <- list.files('path/to/data/', pattern='sample.*\\.tsv', full.names=TRUE)

# main pattern:
all_data <- tibble(file=input_files) %>% 
   # read data in (note: in general, best to pass col_names and col_types
   # through map to read_tsv, so every file is parsed the same way)
   mutate(data=map(file, read_tsv)) %>% 
   # get the file basename (no path); if your metadata is in the path, change accordingly!
   mutate(basename=basename(file)) %>% 
   # extract the metadata from the base filename (not the full path);
   # this drops basename and keeps file, so each row records its source
   extract(basename, into=c('sample', 'rep'), regex='sample([^_]+)_rep([^_]+)\\.tsv') %>% 
   # expand the list-column so there is one row per observation
   unnest(data)
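
Before reading any data, it can help to check that the regex pulls out the metadata you expect, and to spot files that match the list.files pattern but not the naming scheme. The following is a minimal sketch that works on filenames only, with no file I/O. The third filename (sample_notes.tsv) is a hypothetical stray file added for illustration, and convert=TRUE is an optional choice here that turns the replicate into an integer.

```r
library(tidyverse)

# example paths; the last one is a hypothetical file that matches
# 'sample.*\\.tsv' but does not follow the sampleX_repNN naming scheme
files <- c('path/to/data/sampleA_rep01.tsv',
           'path/to/data/sampleC_rep02.tsv',
           'path/to/data/sample_notes.tsv')

meta <- tibble(file=files) %>%
   mutate(basename=basename(file)) %>%
   # remove=FALSE keeps basename for inspection;
   # convert=TRUE lets tidyr convert column types (e.g. '01' to integer)
   extract(basename, into=c('sample', 'rep'),
           regex='sample([^_]+)_rep([^_]+)\\.tsv',
           remove=FALSE, convert=TRUE)

# files whose names did not match the regex get NA in the new columns
unmatched <- filter(meta, is.na(sample))
print(meta)
print(unmatched$file)
```

If unmatched is not empty, either tighten the list.files pattern or filter those rows out before mapping read_tsv over the files.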