- Date: 31st May 2018, 6 - 7pm
- Series: Wolfson College Skills for Academic Success
- Location: Gatsby Room, Wolfson College, University of Cambridge, UK
- Trainer: Sergio Martínez Cuesta
- Register here
This course provides a short beginner's introduction to data visualisation using R, a programming language and software environment for statistical computing and graphics. Sergio will demonstrate basic examples of how to import data, create different types of plots and export graphics using standard R functions and the ggplot2 library. Everybody is welcome; if you would like to follow along on your laptop, please download and install R and RStudio before the session.
- Motivation
- Getting started
- Import data into R
- Basic plotting
- Exercise 1
- Advanced plotting using the ggplot2 library
- Export graphics
For a basic introduction to R functionality, check out our basic R course.
These short courses are inspired by the R crash course developed by Mark Dunning, Laurent Gatto and others.
- R is one of the most widely used programming languages for data analysis, statistics and visualisation in academia and industry
- It is open-source and available on all major platforms (Mac, Linux and Windows)
- It is supported by a broad community of software developers and researchers who contribute R packages and libraries to many fields of research
- It facilitates reproducibility in research and the integration of all your analyses in individual scripts
- It is easy to write documentation and code together using a free environment like RStudio
- Open RStudio, e.g. go to Finder -> Applications and click on RStudio
To download today's workshop:
- Go to your web browser, e.g. Firefox, and type: https://tinyurl.com/2018-DataVisR-Wolfson
- Click on DataVisR.zip, then press Download and save the file in your preferred folder, e.g. your Desktop
- Go to the folder where you saved DataVisR.zip and uncompress it, e.g. on Mac just double-click on DataVisR.zip. Only then will the folder DataVisR appear.
- The folder DataVisR contains two files:
  - DataVisR.Rmd - the code for today's session
  - patient-data-cleaned.csv - the dataset that we will be exploring
Now, go back to RStudio:
- Click on File -> Open File and select DataVisR.Rmd
- You are all set to go now :)
Also, in case you are not familiar with RStudio, a quick recap:
- The RStudio interface is composed of four panels, listed here anti-clockwise:
  - Top-left: scripts panel
  - Bottom-left: R console
  - Bottom-right: plots, packages and help
  - Top-right: log panel
- You are now looking at the scripts panel. We will be using the R console below to interact with R during the workshop.
- Blocks of code in RStudio are often written in R Markdown, a format that allows plain text and R code to be mixed within the same document.
- Each line of R code inside a block can be executed by clicking on the line and pressing CMD + ENTER (Mac) or CTRL + ENTER (Windows and Linux), e.g.:
print("R is fun!")
Alternatively, to execute the entire block, click on the green arrow tip on the right-hand side of the block.
3 + 1
- You can add a new block of code by selecting R in the Insert menu or by typing the following syntax directly:
# R code goes in here
We will use a small made-up dataset which is often used for training purposes. It contains information about 100 lung cancer patients aged 42-44 from different states in the US. We have saved these data as a comma-separated values (CSV) file, patient-data-cleaned.csv, which can easily be opened using software like Excel. In R, use the read.csv() function to import the data:
patient_data <- read.csv("/Users/martin03/Desktop/DataVisR/patient-data-cleaned.csv") # copy here the path to the file
If you have trouble finding the exact path to patient-data-cleaned.csv, use the function file.choose() to open a dialogue box and browse through the directories to reach the file:
file.choose()
The path will then be displayed in R and you can copy it into the read.csv() command above.
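If you do not have the workshop file at hand, the import step can be tried on a throwaway CSV. The sketch below writes a tiny table to a temporary file and reads it back with read.csv(). The rows and the file name toy-patients.csv are invented for illustration and are not taken from patient-data-cleaned.csv.

```r
# Made-up rows for illustration only; this is not the workshop dataset
toy <- data.frame(
  ID     = c("P1", "P2", "P3"),
  Sex    = c("Male", "Female", "Female"),
  Weight = c(76.5, 71.2, 68.9)
)

# Build a path in a platform-independent way and write the CSV file
csv_path <- file.path(tempdir(), "toy-patients.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Import it again, in the same way as the workshop file
toy_imported <- read.csv(csv_path)
dim(toy_imported)     # number of rows and columns
names(toy_imported)   # column names taken from the header line
```

Building the path with file.path() avoids typing separators by hand, which differ between Windows and Mac/Linux.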
The object patient_data is known as a data frame in R. To explore its contents:
# Dimensions (rows and columns)
dim(patient_data)
# Viewing contents
View(patient_data)
# Structure of the data frame
str(patient_data)
# Summary of all data frame contents
summary(patient_data)
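A few more functions are handy when exploring a data frame column by column. The sketch below uses a small made-up data frame (the values are assumptions for illustration, not the workshop data) so that it can be run on its own:

```r
# Made-up data frame for illustration only
toy <- data.frame(
  Sex    = c("Male", "Female", "Female", "Male"),
  Weight = c(76.5, 71.2, 68.9, 80.1),
  Smokes = c(TRUE, FALSE, FALSE, TRUE)
)

head(toy, n = 2)                  # first two rows
names(toy)                        # column names
toy$Weight                        # a single column, returned as a vector
mean_weight <- mean(toy$Weight)   # summary statistic of one column
sex_counts  <- table(toy$Sex)     # number of rows per category
sex_counts
```

The '$' notation shown here is the same one used in the plotting commands of the next sections.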
Simple plotting functions are available in the base R distribution (histograms, barplots, boxplots, scatterplots ...). All that is required as input are vectors of data, e.g. columns in your data frame.
Histograms are often used to get an overview of the distribution of continuous data:
hist(patient_data$BMI)
hist(patient_data$Weight)
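The number of bins can change how a histogram looks. As a minimal sketch, assuming simulated values in place of the workshop data, the breaks argument suggests how many bins to use, and plot = FALSE returns the bin counts without drawing anything:

```r
# Simulated, made-up BMI values for illustration only
set.seed(1)
bmi <- rnorm(100, mean = 26, sd = 3)

# 'breaks' is a suggestion for the number of bins; R may adjust it slightly
hist(bmi, breaks = 20, xlab = "BMI", main = "Simulated BMI values")

# Compute the histogram without plotting to inspect the bins
h <- hist(bmi, plot = FALSE)
h$breaks        # bin boundaries
h$counts        # number of values in each bin
sum(h$counts)   # every value falls into exactly one bin
```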
Barplots are useful when you have counts of categorical data:
barplot(table(patient_data$Race))
barplot(table(patient_data$Sex))
barplot(table(patient_data$Smokes))
barplot(table(patient_data$State), las=2, cex.names=0.7) # 'las=2' rotates the axis labels to be perpendicular to the axis (vertical on the x-axis) and 'cex.names=0.7' reduces the size of the bar labels
barplot(table(patient_data$Grade))
barplot(table(patient_data$Overweight))
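In the commands above, table() does the counting and barplot() only draws the result. A minimal sketch with a made-up vector (an assumption for illustration) shows the two steps separately, and how to sort the bars by height:

```r
# Made-up categorical values for illustration only
smokes <- c("Non-Smoker", "Smoker", "Non-Smoker", "Non-Smoker", "Smoker")

# Step 1: count how often each category occurs
counts <- table(smokes)
counts

# Step 2: draw the counts, tallest bar first
barplot(sort(counts, decreasing = TRUE), ylab = "Number of patients")
```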
Boxplots are good for comparing distributions. Here the ~ symbol sets up a formula, the effect of which is to put the categorical variable on the x-axis and the continuous variable on the y-axis, i.e. boxplot(y ~ x):
boxplot(patient_data$BMI ~ patient_data$Grade)
boxplot(patient_data$BMI ~ patient_data$Overweight)
boxplot(patient_data$Weight ~ patient_data$Overweight)
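The formula interface also accepts a data argument, which saves repeating the data frame name on both sides of the ~ symbol. The sketch below uses a made-up data frame (values are assumptions for illustration) and also shows how to retrieve the numbers behind the boxes:

```r
# Made-up data frame for illustration only
toy <- data.frame(
  BMI        = c(22.1, 23.4, 24.0, 27.8, 29.3, 31.0),
  Overweight = factor(c("No", "No", "No", "Yes", "Yes", "Yes"))
)

# Same plot as boxplot(toy$BMI ~ toy$Overweight), with column names only
boxplot(BMI ~ Overweight, data = toy, ylab = "BMI")

# plot = FALSE returns the summary statistics instead of drawing
b <- boxplot(BMI ~ Overweight, data = toy, plot = FALSE)
b$names        # one entry per box
b$stats[3, ]   # third row holds the median of each group
```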
Scatter plots are useful for representing the relationship between two continuous variables. Here the syntax is plot(x, y):
plot(patient_data$Weight, patient_data$BMI)
To enhance the appearance of your plots, many different ways of customisation are possible:
- Colours: col argument. To get a full list of possible colours type colours(), or check this online reference.
- Point type: pch
- Axis labels: xlab and ylab
- Plot title: main
- ... and many others: see ?plot and ?par for more options
# linear regression
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
abline(lm(patient_data$BMI ~ patient_data$Weight), col="blue")
# polynomial regression
quadratic.model <- lm(patient_data$BMI ~ patient_data$Weight + I(patient_data$Weight^2))
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
lines(sort(patient_data$Weight), fitted(quadratic.model)[order(patient_data$Weight)], col = "darkgreen")
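An alternative way to draw the fitted curve is to fit the model with the data argument and then predict over an evenly spaced grid of x values, which avoids sorting the fitted values by hand. The sketch below is only an illustration: the data are simulated and the names toy and weight_grid are hypothetical.

```r
# Simulated, made-up data for illustration only
set.seed(42)
toy <- data.frame(Weight = runif(50, min = 60, max = 100))
toy$BMI <- 10 + 0.2 * toy$Weight + rnorm(50, sd = 1)

# Fit both models using column names and the 'data' argument
linear_model    <- lm(BMI ~ Weight, data = toy)
quadratic_model <- lm(BMI ~ Weight + I(Weight^2), data = toy)

# Evenly spaced weights covering the observed range
weight_grid <- data.frame(
  Weight = seq(min(toy$Weight), max(toy$Weight), length.out = 100)
)
quadratic_fit <- predict(quadratic_model, newdata = weight_grid)

plot(toy$Weight, toy$BMI, col = "red", pch = 16, xlab = "Weight (kg)", ylab = "BMI")
abline(linear_model, col = "blue")                             # straight line
lines(weight_grid$Weight, quadratic_fit, col = "darkgreen")   # fitted curve
```

Note that predict() with newdata only works when the model was fitted with column names and a data argument, as done here.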
The same arguments can also be used with other plotting functions:
boxplot(patient_data$BMI ~ patient_data$Overweight, col=c("red", "green"), xlab="Overweight patient?", ylab="BMI", main="US patient data")
To explore other types of plots using R standard functions, have a look here. There are also dedicated R libraries, e.g. ggplot2, for more sophisticated plotting.
- Are there any differences in BMI between Smokers and Non-Smokers? (hint: try boxplot)
- Visualise the relationship between the Height and Weight of the patients
- A small trick: if you attach the data frame patient_data as follows, then you will only need the column name without the '$' notation:
attach(patient_data)
plot(Weight, BMI)
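Attached data frames stay attached until detach() is called, and their column names can clash with other objects of the same name. One alternative is with(), which makes the columns visible for a single expression only. A minimal sketch with a made-up data frame (values are assumptions for illustration):

```r
# Made-up data frame for illustration only
toy <- data.frame(
  Weight = c(70, 82, 65),
  BMI    = c(23.5, 27.1, 22.0)
)

# Column names are available inside with(), and only there
with(toy, plot(Weight, BMI))
mean_bmi <- with(toy, mean(BMI))
mean_bmi
```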
The ggplot2 library offers a powerful graphics language for creating elegant and complex plots. It is particularly useful when creating publication-quality graphics.
The key to understanding ggplot2 is thinking about a figure in layers (e.g. data points, axes and labels, legend). This idea may be familiar to you if you have used image editing programs like Photoshop, Illustrator or Inkscape, where you can ungroup the figure into its different components.
There are two ways to install and load ggplot2:
- Click on the Packages tab in the bottom-right RStudio panel and search for ggplot2, then tick its box. If you can't find it, click on Install and type ggplot2 inside the Packages box. Leaving the rest on default, click on Install. Once installed, tick the box.
- Run library(ggplot2) in the console. If you get a message like Error in library("ggplot2") : there is no package called ‘ggplot2’ then run install.packages("ggplot2") in the console. Once the installation is finished, run library(ggplot2) again.
Let's begin with the scatterplot of Weight and Height.
First, load the ggplot2 library:
library("ggplot2")
The first "global" layer requires the definition of the dataset, and the x and y axes:
ggplot(data = patient_data, aes(x = Weight, y = Height))
In the second layer, we need to tell ggplot how we want to visually represent the data (scatterplot, boxplot, barplot ...). For a scatterplot, we need geom_point():
ggplot(data = patient_data, aes(x = Weight, y = Height)) +
geom_point()
Another aes (aesthetic) property we can modify is the point color, e.g. to change the color depending on the grade of the disease:
ggplot(data = patient_data, aes(x = Weight, y = Height, col = as.factor(Grade))) +
geom_point()
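Further layers are added in the same way, with one + per layer. As a sketch, assuming ggplot2 is installed and using a made-up data frame in place of the workshop data, the example below adds a linear trend line with geom_smooth() and custom labels with labs():

```r
library(ggplot2)

# Made-up data frame for illustration only
toy <- data.frame(
  Weight = c(62, 70, 75, 81, 88, 95),
  Height = c(160, 168, 171, 175, 180, 186),
  Grade  = c(1, 2, 3, 1, 2, 3)
)

p <- ggplot(data = toy, aes(x = Weight, y = Height)) +
  geom_point(aes(col = as.factor(Grade))) +    # layer 1: points coloured by grade
  geom_smooth(method = "lm", se = FALSE) +     # layer 2: linear trend line
  labs(x = "Weight (kg)", y = "Height", col = "Grade")   # axis and legend titles

p   # printing the object draws the plot
```

Here the colour is set inside geom_point() rather than in the global aes(), so the trend line is fitted to all points together and not once per grade.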
When running commands directly in the interactive console (bottom-left panel), plots can be exported using the Plots tab in RStudio (bottom-right panel). Click on Export -> Save as PDF ...
When plotting using standard R graphics, you can also save plots to a file by calling the pdf() or png() functions before executing the code that creates the plot:
pdf("/Users/martin03/Desktop/DataVisR/BMIvsWeight.pdf")
plot(patient_data$Weight, patient_data$BMI, col="red", pch=16, xlab="Weight (kg)", ylab="BMI", main="US patient data")
abline(lm(patient_data$BMI ~ patient_data$Weight), col="blue")
dev.off()
The dev.off() line is important; it closes the file, and without it you will not be able to view the plot you have created.
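The size of the exported figure can be set when opening the file. In the sketch below (made-up values, and a temporary file name chosen only for illustration), width and height are given in inches, and each plot drawn before dev.off() becomes one page of the PDF:

```r
# Temporary output file; replace with your own path
out_file <- file.path(tempdir(), "toy-plots.pdf")

# Open the PDF file: 6 x 4 inches per page
pdf(out_file, width = 6, height = 4)

# Page 1: scatter plot of made-up values
plot(c(60, 75, 90), c(21, 25, 30), pch = 16, xlab = "Weight (kg)", ylab = "BMI")

# Page 2: histogram of made-up values
hist(c(21, 22, 25, 25, 27, 30), xlab = "BMI", main = "Toy BMI values")

# Close the file so that it can be opened in a PDF viewer
dev.off()

file.exists(out_file)
```

png() works in the same way, except that each plot needs its own file and the default units for width and height are pixels.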
If you use ggplot2, the syntax is more concise:
gg <- ggplot(data = patient_data, aes(x = Weight, y = Height, col = as.factor(Grade))) +
  geom_point()
ggsave("/Users/martin03/Desktop/DataVisR/HeightvsWeight.pdf", plot = gg) # 'plot = gg' states explicitly which plot to save
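ggsave() also lets you control the size of the saved figure, and it picks the file type from the file extension. A minimal sketch, assuming ggplot2 is installed and using a made-up data frame and a temporary file name for illustration:

```r
library(ggplot2)

# Made-up data frame for illustration only
toy <- data.frame(
  Weight = c(62, 70, 75, 81, 88, 95),
  Height = c(160, 168, 171, 175, 180, 186)
)

gg <- ggplot(data = toy, aes(x = Weight, y = Height)) +
  geom_point()

# Temporary output file; replace with your own path
out_file <- file.path(tempdir(), "toy-height-vs-weight.pdf")

# Save a 12 x 8 cm figure; no dev.off() is needed with ggsave()
ggsave(out_file, plot = gg, width = 12, height = 8, units = "cm")

file.exists(out_file)
```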
That's it! Enjoy R!
Feedback / questions about the course, please email Sergio ([email protected]).
Blogs:
- Getting started with data visualization in R using ggplot2
- End-to-end visualization using ggplot2
- ggplot2 - Easy way to mix multiple graphs on the same page
- Rookie mistakes and how to fix them when making plots of data
- BBC Visual and Data Journalism cookbook for R graphics
Books:
- Cookbook for R
- R for Data Science
- Data Visualization for Social Science. A practical introduction with R and ggplot2
- ggplot2: Elegant Graphics for Data Analysis (Use R!)
- R packages
- plotly for R
- bookdown: Authoring Books and Technical Documents with R Markdown
Courses:
- Bernd Klaus teaching materials
- Modern Statistics for Modern Biology
- Statistical Inference via Data Science: A moderndive into R and the tidyverse
- CRUK-CI R crash course
- R for Reproducible Scientific Analysis
- Karl Broman's mini tutorials
- Basic statistics and data handling with R
- Scripting for data analysis (with R)
- An Introduction to Solving Biological Problems with R
- Data Analysis and Visualisation using R: including dplyr and ggplot2
- Babraham institute basic/advanced R and ggplot2 courses
- R object-oriented programming and package development, link1 and link2
- R course content for the CODATA-RDA Research Data Science Summer School
- Data carpentry course for biologists by Ethan White
- Cambridge's Data carpentry using R
Sergio is a University of Cambridge Data Champion funded by a Jisc research data fellowship to develop research data training activities for researchers. He does research in bioinformatics and computational biology within the Balasubramanian laboratories funded by the Wellcome Trust at the University of Cambridge.
This work is distributed under a Creative Commons CC0 license. No rights reserved.