Introduction to Unix and R

Date: 26th July 2018, 9am - 1pm (CET)
Course: TrainMalta Summer School 2018: Epigenomics
Location: Informatics Lab, University of Malta
Trainers:
- Luigi Grassi, Department of Haematology, University of Cambridge (UK) Email: Luigi.grassi [at] bioresource.nihr.ac.uk
- Daniel D'Andrea, Section of Inflammation and Signal Transduction, Imperial College London (UK) Email: d.dandrea [at] imperial.ac.uk
- Sergio Martínez Cuesta, Department of Chemistry and CRUK-CI, University of Cambridge (UK) Email: sermarcue [at] gmail.com
Today's plan
We will be using the etherpad to exchange questions
Materials for later sessions in the course: http://34.246.2.206:3000/

Introduction to Unix

Section based on:

Grassi's Introduction to Unix and R, TrainMalta Summer School, 2016
Pajon's Introduction to the Shell
Software Carpentry's The Unix Shell

Outline

The structure of Unix: files and directories
Exploring the command line
Exercise 1
Using Unix tools to explore files
Downloading files from the internet
Exercise 2
Coffee break (10:00 - 10:30am)
Compressing and archiving
Locating files
Text editors

The structure of Unix: files and directories

Unix is made up of files and directories.

An example of a basic directory structure:

It looks like a tree 🌲!

/: root directory
/bin: basic system commands
/data: storage of datasets
/Users: user directories. Who? imhotep, larry and nelle e.g. /Users/imhotep
/tmp: storage of temporary files

Examples of files: README, .bashrc and index.html

Case sensitive: README is a different file from readme
Names can be long (most file systems allow up to 255 characters)
Can contain any character except / (including whitespaces)

A path is a sequence of nested directories with a file or directory at the end, separated by the / character.

Absolute path e.g. /Users/nelle/README
Relative path e.g. nelle/README (with respect to the current directory /Users)
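To see the difference between the two kinds of path in practice, here is a minimal sketch. It builds a throwaway directory tree with mktemp; the sandbox location and the file contents are made up for illustration and are not part of the course materials.

```shell
# Create a temporary sandbox (hypothetical location chosen by mktemp)
sandbox=$(mktemp -d)
mkdir -p "$sandbox/Users/nelle"
echo "hello" > "$sandbox/Users/nelle/README"

# Absolute path: starts at the root (/) and works from any current directory
abs_content=$(cat "$sandbox/Users/nelle/README")

# Relative path: only resolves because we first move into the Users directory
cd "$sandbox/Users"
rel_content=$(cat nelle/README)

echo "absolute: $abs_content"
echo "relative: $rel_content"
```

Both commands read the same file; only the starting point used to describe its location differs.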

Exploring the command line and working with files

What is the command line? The tool used to execute commands, also known as instructions to tell the computer what to do.

In different contexts, the command line is often known as the terminal, shell, bash, console ...

What does the command line look like?

Open your command line:

$ indicates where you can start typing commands e.g. type the command pwd, then press Enter. This command helps you find where you are located in the directory structure. pwd displays the current directory ("print working directory").

Other useful basic commands are:

whoami: who am I? what's my username?
hostname: what is the name of the machine that I am using now?

How can I list other directories?

ls: lists the files and directories available in the current directory

Commands can often take options, which make them more specific. Options are introduced with a hyphen - or a double hyphen --.

ls -l: provides additional info on files and directories
ls -la: (options can be combined) this includes hidden files (.name)
ls -ltr: with additional info and most recent files at the end
man ls: open the manual about the command ls to look for more details about other options

How can I move up and down in the directory structure?

Use cd followed by the directory you want to go to, e.g. cd Desktop takes me to my Desktop (Hint: first run ls to find out which directories are available for you to visit).

If you want to move in the opposite direction:

cd ..: moves one directory up
cd ../..: moves two directories up (and so on)
cd or cd ~: takes you to your home directory

How to create, copy, move or remove files and directories?

Create:

touch test.txt: creates file test.txt
mkdir tmp: creates directory tmp

(Hint: execute ls -lh to see how test.txt and tmp have been created)

Copy and move:

cp test.txt tmp/: copies file test.txt inside directory tmp
mv test.txt tmp/: moves file test.txt inside directory tmp (and removes it from the current directory)

Remove:

rm for files and rm -r for directories (⚠️ With Great Power Comes Great Responsibility. When files or directories are deleted, there is no way back. They are totally gone forever.)

cd tmp/            # Go inside directory tmp/
rm test.txt        # Delete file test.txt
cd ..              # Move up one level in the directory tree
rm -r tmp/         # Delete directory tmp/ (rmdir also works here because the directory is now empty)

Exercise 1

Click on this link to download the example dataset that we will be using this morning. This is a small made-up dataset which is often used for training purposes and contains information about 100 lung cancer patients aged 42-44 from different states in the US.
The file patient-data-cleaned.csv is a comma-separated values (CSV) file, which can easily be opened using software like Excel
Open the command line and navigate to find the exact directory where you downloaded patient-data-cleaned.csv (Hint: use cd to check in your home directory or in directories such as Desktop/ or Downloads/)
Now go to your home directory and create a new directory called Unix_R
Copy the downloaded file patient-data-cleaned.csv from its current location to your newly created Unix_R directory

🎉 Congratulations! You did it! 👍

Other relevant commands and tricks:

history: trace back your recent history of commands
use the UP and DOWN arrows on your keyboard to navigate through your history of commands
wildcards: commands can use wildcards e.g. * to perform actions on more than one file at a time, e.g. ls -l *.txt lists all files whose names end with .txt
use the TAB key to autocomplete paths and file names
ctrl + a: cursor to beginning of command line
ctrl + e: cursor to end of command line
ctrl + c: stops the execution of a command
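The wildcard trick above can be tried safely in a scratch directory. This is a minimal sketch; the file names are invented for the example.

```shell
# Work in a temporary directory so that nothing else is affected
sandbox=$(mktemp -d)
cd "$sandbox"
touch notes.txt results.txt data.csv

# The shell expands *.txt into the matching names before ls runs
txt_files=$(ls *.txt)
echo "$txt_files"

# The same idea works for any other ending
csv_files=$(ls *.csv)
echo "$csv_files"
```

Because the shell does the expansion, the same pattern works with other commands too, such as cp or rm (use the latter with care).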

Using Unix tools to explore files

There are many Unix tools available; some are particularly useful for exploring files:

cd ~/Unix_R/

head patient-data-cleaned.csv       # Prints the top 10 lines
head -5 patient-data-cleaned.csv    # Prints the top 5 lines

tail patient-data-cleaned.csv       # Prints the bottom 10 lines
tail -2 patient-data-cleaned.csv    # Prints the bottom 2 lines

wc -l patient-data-cleaned.csv      # Counts the number of lines
wc -m patient-data-cleaned.csv      # Counts the number of characters

cat patient-data-cleaned.csv        # Prints all the contents in the file
less patient-data-cleaned.csv       # Browse through the contents in the file by pressing SPACE. Then press Q to quit.

sort -n patient-data-cleaned.csv    # Sorts lines in numerical order

We can use the symbol > to redirect the output of some commands into a file, instead of printing it to the screen.

wc -l patient-data-cleaned.csv > number_patients.txt
cat number_patients.txt             # check the output
rm number_patients.txt              # remove file

You can search within files using the tool grep:

grep "California" patient-data-cleaned.csv

The output obtained using some commands can be used as input in other commands. Almost all of the standard Unix tools can work this way: unless told to do otherwise, they read input, do something with what they've read, and write to output. To combine commands, we use the vertical bar also known as pipe |. It tells the shell that we want to use the output of the command on the left as the input to the command on the right. E.g. we can count how many of the patients are from California by combining grep and wc as follows:

grep "California" patient-data-cleaned.csv | wc -l

The output of the grep command is the input of wc -l.
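If you would like to experiment with redirection and pipes without touching the real dataset, the sketch below uses a tiny made-up file. The name mini.csv and its three rows are assumptions for illustration only.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# A made-up stand-in for the patient file: a header and three rows
printf 'ID,Name,State\n1,Ann,California\n2,Bob,Georgia\n3,Cat,California\n' > mini.csv

# Redirection: save the matching lines in a file instead of printing them
grep "California" mini.csv > california.txt

# Pipe: hand the matching lines straight to wc -l, no intermediate file needed
# (tr removes the padding spaces that wc prints on some systems)
n_california=$(grep "California" mini.csv | wc -l | tr -d ' ')
echo "Patients from California: $n_california"
```

The redirect leaves a file behind that you can inspect later, whereas the pipe is handy when you only care about the final number.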

A more advanced tool to search within files is awk, which is a programming language in its own right. The general usage is awk <options> '<code>' <files>. For example, we can extend the grep functionality shown above as follows:

awk -F "," '($9=="California") {print $0}' patient-data-cleaned.csv

<options>: -F "," (patient-data-cleaned.csv is a comma-separated value file)
'<code>': '($9=="California") {print $0}', filter lines containing California in the 9th column, then print the entire line ($0)
<files>: patient-data-cleaned.csv

awk -F "," '($9=="Louisiana") {print $5, $9, $11, $17}' patient-data-cleaned.csv
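To make the column numbering easier to follow, here is a small self-contained sketch on a made-up four-column file. The file mini.csv and its columns are assumptions and do not match the layout of patient-data-cleaned.csv. It also uses NR, awk's built-in counter for the current line number, which is not covered above.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# Made-up data: column 1 = ID, 2 = Name, 3 = State, 4 = Height
printf 'ID,Name,State,Height\n1,Ann,California,170\n2,Bob,Georgia,182\n3,Cat,California,165\n' > mini.csv

# $3 is the third comma-separated field; print only Name and Height for matching lines
awk -F "," '($3=="California") {print $2, $4}' mini.csv

# NR > 1 skips the header line before comparing the numeric Height column
tall=$(awk -F "," '(NR > 1 && $4 > 168) {print $2}' mini.csv)
echo "$tall"
```

Unlike grep, which matches the text anywhere in the line, the awk condition only looks at the chosen column.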

Downloading files from the internet

The wget utility is a convenient way to download files from the internet. It retrieves files using widely used protocols like HTTP, HTTPS and FTP, and is designed to cope with slow or unstable network connections: if a download fails because of a network problem, wget keeps retrying until the file has been retrieved completely, and it can resume a partial download where it left off. It can also download files recursively.

As an example on how to use wget, we are going to download a compressed raw sequencing file (FASTQ format) from a wheat RNA-seq experiment hosted at EMBL-EBI:

cd ~/Unix_R/
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR056/ERR056477/ERR056477.fastq.gz

You can use commands such as zcat and head to inspect the file:

zcat < ERR056477.fastq.gz | head
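A common follow-up question is how many reads a compressed FASTQ file contains. As a rough sketch, and assuming the usual layout of four lines per read, you can count the lines and divide by four. The toy file below is invented so that the example runs without downloading anything; gzip -dc is used here as an equivalent of zcat.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# Two made-up reads in FASTQ format (4 lines each); not real sequencing data
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > toy.fastq
gzip toy.fastq                      # produces toy.fastq.gz

# Decompress to the screen only; the .gz file on disk is left untouched
n_lines=$(gzip -dc toy.fastq.gz | wc -l | tr -d ' ')

# Shell arithmetic: four lines per read
n_reads=$((n_lines / 4))
echo "$n_reads reads"
```

The same pipeline should work on ERR056477.fastq.gz once it has been downloaded, although it will take longer on a file of that size.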

Exercise 2

Use pipes to find out how many patients from Florida are Male and how many are Female in the patient-data-cleaned.csv dataset
Download the sequencing data from a replicate RNA-seq experiment (ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR056/ERR056478/ERR056478.fastq.gz) and concatenate ERR056477.fastq.gz and ERR056478.fastq.gz to produce a new combined file (Hint: use cat)

🎉 Well done! 👍

Compressing and archiving

How much space do I use?

The command line tools du and df are useful to measure disk usage:

cd ~
du -h Unix_R            # Size of the directory

cd Unix_R
du -h *.fastq.gz        # Size of individual files

df -h /                 # Disk space used and available on the file system mounted at /

How to compress and uncompress files and directories?

Compressed files with extension .gz can be uncompressed with the tool gzip. Let's uncompress one of the fastq files:

cd ~/Unix_R
gzip -d ERR056477.fastq.gz    # The option -d uncompresses files
ls -lh
head ERR056477.fastq
gzip ERR056477.fastq          # Now we compress it back
ls -lh

Similarly, zipped files (extension .zip) can be uncompressed with the tool unzip. The tool tar is also widely used to build data archives and backups.
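As an illustration of tar, the sketch below bundles a small directory into a compressed archive, lists its contents and extracts it again. The directory and file names (project, restored) are made up for the example.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"

# A small made-up directory to archive
mkdir project
echo "sample data" > project/data.txt
echo "notes" > project/README

# c = create, z = compress with gzip, f = name of the archive file
tar -czf project.tar.gz project

# t = list the contents without extracting anything
listing=$(tar -tzf project.tar.gz)
echo "$listing"

# x = extract, -C = extract into another directory
mkdir restored
tar -xzf project.tar.gz -C restored
restored_content=$(cat restored/project/data.txt)
echo "$restored_content"
```

Unlike gzip, which compresses one file at a time, tar keeps a whole directory tree together in a single file, which is convenient for backups and for sharing data.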

Locating files

The tool find is useful to search for files of unknown location (they may not have been used for some time):

cd ~
find . -name "*.fastq.gz"
find . -name "patient-data*"
find . -name "README"
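The sketch below shows find at work on a small made-up directory tree, so you can check the results against files you know are there. All directory and file names are invented for illustration.

```shell
sandbox=$(mktemp -d)
mkdir -p "$sandbox/a/b" "$sandbox/c"
touch "$sandbox/a/b/sample1.fastq.gz" "$sandbox/c/sample2.fastq.gz" "$sandbox/c/README"

cd "$sandbox"

# Quote the pattern so that find, not the shell, interprets the wildcard
found=$(find . -name "*.fastq.gz" | sort)
echo "$found"

# -type f restricts the search to regular files (-type d would match directories)
readme=$(find . -type f -name "README")
echo "$readme"
```

find searches all nested directories below the starting point (here . for the current directory), so it can take a while when started from a large directory such as your home.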

Text editors

Graphical:

Recent:
- Atom: free and open source; macOS, Linux and Windows; lots of free extension packages built and maintained by the community; supports most programming languages; developed by GitHub
- PyCharm: free and open source; macOS, Linux and Windows; specifically for the Python programming language; professional edition with extra features released under a proprietary license; developed by JetBrains.
Classic:
- Gedit: free and open source; macOS, Linux and Windows; simple and easy to use
- Emacs: one of the oldest free open source projects still under development
- Kate: intended for software developers

Text-only:

nano
vi

Any questions? 💭 💭 💭

Introduction to R and ggplot2

Section based on:

R crash course developed by Mark Dunning and Laurent Gatto.
Introduction to R developed by semacu
Data visualisation with R and ggplot2 developed by semacu

Outline

Motivation
How can I find help?
Getting started
Variables and functions
Exercise 3
Vectors
Import and explore data
Subsetting
Exercise 4
Sort tables and export results
Basic plotting
Exercise 5
Advanced plotting using the ggplot2 library
Export graphics

Motivation

R is one of the most widely-used programming languages for data analysis, statistics and visualisation in academia and industry.
It is free, open source and available on all major platforms (macOS, Linux and Windows)
Supported by a broad community of software developers and researchers who contribute R packages and libraries to many fields of research
It facilitates reproducibility in research and integration of all your analyses in individual scripts
Easy to write documentation and code together using a free environment like RStudio

E.g. The New Zealand Tourism Dashboard uses R extensively to report statistics.

How can I find help?

Stack Overflow
The Comprehensive R Archive Network (CRAN)
CRAN Task Views
R-bloggers
Quick-R
Local R groups
Type ? followed by the name of the function that you'd like to use, e.g. ?mean
Bioconductor
Biostars

Getting started

Open RStudio and explore the different panels

The RStudio interface is composed of four panels, listed here anti-clockwise from the top left:
- Top-left: scripts panel
- Bottom-left: R console
- Bottom-right: plots, packages and help
- Top-right: log panel
The scripts panel is used to write commands whereas the R console below is used to interact with the programming language.
Click on File -> New File -> R Script to open a new script in which to record your R commands
Save it as e.g. myScript.R in your preferred scripts location

Variables and functions

You can use R as a calculator using the symbols +, -, * and /, or more advanced features such as statistical operations, logarithms, trigonometry ...

2 + 1
7 - 1
3 * 2
10 / 5

mean(1:5)
log(1)
pi
sin(pi/2)

To store your results for later, use variables. To create them, use the assignment operator <-:

x <- 25
x
y <- 16
y

You can perform multiple operations using variables:

sqrt(x)
x + y
x <- 36
x <- y
x <- x + 8

Functions in R take one or more arguments as input, which are captured using parentheses. Arguments can be named explicitly, otherwise they are meant to be used in the same order as described in the function definition. E.g. seq is a function for generating a numeric sequence from and to particular numbers. Type ?seq to get the help page for this function.

?seq
seq(from = 1, to = 10, by = 2)
seq(1, 10, 2)

Some functions have default values in some arguments:

seq(1, 10, 1)
seq(1, 10)

The default value for the by argument in the seq() function is 1.

An alternative method to obtain sequences of numbers spaced by one value is the : symbol:

z <- 1:5
z

Exercise 3

Create a sequence of numbers from 10 to 30 spaced by three values
How about decreasing sequences? Now try from 30 to 10 spaced by three values (hint: check ?seq)
Round the number pi down to 1 decimal place (hint: check ?round)

🎉 Congratulations! You did it! 👍

Vectors

The output we get from R functions such as seq() is called a vector, which is a collection of numbers or characters
To create vectors use the function c() (a.k.a. combine)
Use square brackets [ ] to indicate the position within the vector (the index) and extract elements

x <- c(5,6,7,8,9,10)
x
x[3]
x[1]
x[3:5]

Arithmetic operations in vectors occur element by element:

x <- c(2, 4, 5, 6, 7)
y <- x*2
y
x + y

A vector can also contain text; unlike numbers, text values need to be enclosed in quotation marks " ":

x <- c("a", "b", "b", "c", "c", "d")
x

x <- c(a, b, b, c, c, d) # without quotation marks, R looks for objects named a, b, c and d, and returns an error if they do not exist

To create subsets of our vectors, we can use comparison operators:

== equal
> greater than
< less than
!= not equal

x <- c("a", "b", "b", "c", "c", "d")
x == "b" # this is known as a logical or boolean vector, composed of TRUE or FALSE values only
x != "b"
x[x != "b"]

x <- c(2, 4, 5, 6, 7)
x > 4
x[x > 4]

Import and explore data

We will be using the dataset patient-data-cleaned.csv presented in the Introduction to Unix session earlier.

You will first need to find the path to the file patient-data-cleaned.csv. You can use the function file.choose() to open a dialogue box and browse through the directories to reach the file. The path will then be displayed in R:

file.choose()

e.g. for me the path to patient-data-cleaned.csv is /Users/martin03/Unix_R/patient-data-cleaned.csv. The file patient-data-cleaned.csv is a comma-separated values (CSV) file and in R, you can use the read.csv() function to import the dataset and create a data frame object using the path obtained above:

patient_data <- read.csv("/Users/martin03/Unix_R/patient-data-cleaned.csv") # copy here the path obtained when running file.choose()

Exploring rows and columns in the patient_data data frame:

# Dimensions
dim(patient_data)
ncol(patient_data)
nrow(patient_data)

# Viewing contents
head(patient_data)
View(patient_data)

# Names of columns
colnames(patient_data)

# Accessing data using column names
patient_data$Smokes
patient_data$Height
patient_data$State

# Summary of all data frame contents
summary(patient_data)
str(patient_data)

In a data frame, all the values in a column must be of the same type (i.e. all numbers or all characters/text).

You can apply functions to the columns of the data frame to ask various questions:

# What is the maximum height?
max(patient_data$Height)
# What is the minimum weight?
min(patient_data$Weight)
# What is the mean body mass index (BMI), rounded to one decimal place?
round(mean(patient_data$BMI), 1)

Subsetting

Just like when subsetting vectors, a selection of a data frame can be made using square brackets [ , ]. However, data frames are two-dimensional objects, so you'll need both row and column indexes:

patient_data[1 , 2]
patient_data[2 , 1]
patient_data[c(1,2,3) , 1]
patient_data[c(1,2,3) , c(1,2)]

If you'd like to see all the columns, or all the rows, you can omit the column or the row index respectively. But ... remember to keep the comma ;)

patient_data[2, ]
patient_data[, 2]
patient_data[, 1:4]

Rather than selecting rows based on indexes, you can also use comparison operators to give either a TRUE or FALSE result. When applied to subsetting, only rows with a TRUE result get returned.

# The vector of TRUE or FALSE results applied to subsetting data
patient_data$Height > 183

# Which patients are taller than 183cm?
patient_data[patient_data$Height > 183,]

# Which patients are smokers?
patient_data[patient_data$Smokes == "Smoker",]

# Which patients are taller than 183cm AND are smokers too?
patient_data$Height > 183 & patient_data$Smokes == "Smoker"
patient_data[patient_data$Height > 183 & patient_data$Smokes == "Smoker",]

# You can also select only specific columns using the column name, e.g. give me only the ID, Name, State and Disease Grade
patient_data[patient_data$Height > 183 & patient_data$Smokes == "Smoker", c("ID", "Name", "State", "Grade")]

The useful subsetting operators to bear in mind here are & (and), | (or) and %in% (in).

Exercise 4

Select patients that have a BMI greater than 30 or their weight is greater than 90kg. Calculate their average height.
Select female patients from California who are not overweight

🎉 Well done! 👍

Sort tables and export results

The function order() gives sorted indices, which can then be used to sort your data set:

# Sort patients by Disease Grade
order(patient_data$Grade)
patient_data[order(patient_data$Grade),] # from benign (1) to harmful (3)
patient_data[order(patient_data$Grade, decreasing = TRUE),] # from harmful (3) to benign (1)

# Sort patients by more than one condition: first Disease Grade, second Weight
patient_data[order(patient_data$Grade, patient_data$Weight, decreasing = TRUE),]

Once data processing is completed, you can export results out of R as follows:

# Which patients from California are non-smokers?
patient_data_california <- patient_data[patient_data$State == "California" & patient_data$Smokes == "Non-Smoker",]

# Export
write.csv(patient_data_california, file = "/Users/martin03/Unix_R/patient-data-cleaned-california.csv")

Basic plotting

Simple plotting functions are available in the base R distribution (histograms, barplots, boxplots, scatterplots ...). All that is required as input are vectors of data, e.g. columns in your data frame.

Histograms are often used to have an overview of the distribution of continuous data:

hist(patient_data$BMI)
hist(patient_data$Weight)

Barplots are useful when you have counts of categorical data:

barplot(table(patient_data$Race))
barplot(table(patient_data$Sex))
barplot(table(patient_data$Smokes))
barplot(table(patient_data$State), las=2, cex.names=0.7) # 'las=2' draws the axis labels perpendicular to the axes (vertical on the x-axis) and 'cex.names=0.7' reduces the size of the bar labels
barplot(table(patient_data$Grade))
barplot(table(patient_data$Overweight))

Boxplots are good for comparing distributions. Here the ~ symbol sets up a formula, the effect of which is to put the categorical variable on the x-axis and the continuous variable on the y-axis -> boxplot(y ~ x)

boxplot(patient_data$BMI ~ patient_data$Grade)
boxplot(patient_data$BMI ~ patient_data$Overweight)

boxplot(patient_data$Weight ~ patient_data$Overweight)

Scatter plots are useful when representing two continuous variables. Here -> plot(x, y):

plot(patient_data$Weight, patient_data$BMI)

To enhance the appearance of your plots, almost infinite ways of customisation are possible, e.g.:

Colours: col argument. To get a full list of possible colours type colours(), or check this online reference.
Point type: pch
Axis labels: xlab and ylab
Plot title: main
... and many others: see ?plot and ?par for more options

# linear regression
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
abline(lm(patient_data$BMI ~ patient_data$Weight), col="blue")

# polynomial regression
quadratic.model <- lm(patient_data$BMI ~ patient_data$Weight + I(patient_data$Weight^2))
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
lines(sort(patient_data$Weight), fitted(quadratic.model)[order(patient_data$Weight)], col = "darkgreen")

The arguments can also be used for other plotting functions!

boxplot(patient_data$BMI ~ patient_data$Overweight, col=c("red", "green"), xlab="Overweight patient?", ylab="BMI", main="US patient data")

To explore other types of plots using R standard functions, have a look here. There are dedicated R libraries e.g. ggplot2 to do more sophisticated plotting.

Exercise 5

Any differences of BMI between Smokers and Non-Smokers? (hint: try boxplot)
Visualise the relationship between the Height and Weight of the patients
A small trick: if you attach the data.frame patient_data as follows, then you will only need the column name without the '$' notation:

attach(patient_data)
plot(Weight, BMI)

Advanced plotting using the ggplot2 library

The ggplot2 library offers a powerful graphics language for creating elegant and complex plots. It is particularly useful when creating publication-quality graphics.

The key to understanding ggplot2 is thinking about a figure in layers (e.g. data points, axes and labels, legend). This idea may be familiar to you if you have used image editing programs like Photoshop, Illustrator or Inkscape, where you can ungroup the figure into its different components.

Load ggplot2

There are two ways to do this:

Click on the Packages tab in the bottom-right RStudio panel and search for ggplot2, then tick its box. If you can't find it, then click on Install and type ggplot2 inside the Packages box. Leaving the rest on default, click on Install. Once installed, then tick the box.
Run library(ggplot2) in the console. If you get a message like Error in library("ggplot2") : there is no package called ‘ggplot2’ then run install.packages("ggplot2") in the console. Once the installation is finished, run library(ggplot2) again.

Example

Let's begin with the scatterplot of Weight and Height.

First, load the ggplot2 library:

library("ggplot2")

The first "global" layer requires the definition of the dataset, and the x and y axes:

ggplot(data = patient_data, aes(x = Weight, y = Height))

In the second layer, we need to tell ggplot how we want to visually represent the data (scatterplot, boxplot, barplot ...). For a scatterplot, we need geom_point():

ggplot(data = patient_data, aes(x = Weight, y = Height)) +
geom_point()

Another aes (aesthetic) property we can modify is the point color, e.g. to change the color depending on the grade of the disease:

ggplot(data = patient_data, aes(x = Weight, y = Height, col = as.factor(Grade))) +
geom_point()

Export graphics

When running commands directly in the interactive console (bottom-left panel), plots can be exported using the Plots tab in RStudio (bottom-right panel). Click on Export -> Save as PDF ....

When plotting using R standard graphics, you can also save plots to a file calling the pdf() or png() functions before executing the code to create the plot:

pdf("/Users/martin03/Unix_R/BMIvsWeight.pdf")
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
abline(lm(patient_data$BMI ~ patient_data$Weight), col="blue")
dev.off()

The dev.off() line is important: it closes the file, and without it the plot is not written out completely, so you will not be able to view it.

If you use ggplot2, the syntax is a bit more concise, e.g.:

gg <- ggplot(data = patient_data, aes(x = Weight, y = Height, col = as.factor(Grade))) +
geom_point()
ggsave("/Users/martin03/Unix_R/HeightvsWeight.pdf", plot = gg) # naming the plot explicitly makes clear which one is saved

That's it! Enjoy R! 👍 🚀

Questions?

For any later feedback or questions about the course, please email Sergio (sermarcue [at] gmail.com)

Additional materials and resources

Unix:

Software Carpentry: The Unix Shell
Data Carpentry: Introduction to the command line
CRUK-CI Bioinformatics: Introduction to the Shell
Babraham: Introduction to Unix
Linux Cheat Sheet

R:

Software Carpentry: Programming with R
Software Carpentry: R for Reproducible Scientific Analysis
Data Carpentry: Data analysis and visualization in R
Babraham: Introduction to R
Torfs and Brauer: A (very) short introduction to R
R and RStudio Cheat Sheets

License

This work is distributed under a Creative Commons CC0 license. No rights reserved.
