Welcome to Introduction to R from fredhutch.io! This course introduces you to R by working through common tasks in data science: importing, manipulating, and visualizing data.
R is a statistical programming language widely used for a variety of applications. For more information about R and ways to use it at Fred Hutch, please see the R and RStudio entry for the Fred Hutch Biomedical Data Science Wiki.
Before proceeding with these training materials, please ensure you have installed both R and RStudio as described here.
By the end of this session, you should be able to:
- work within the RStudio interface to run and save R code in a project
- understand basic R syntax to use functions and assign objects
- create and manipulate vectors and understand how R deals with missing data
R is a statistical programming language, while RStudio is an integrated development environment (IDE) that allows you to code in R more easily. RStudio possesses many features that you may find useful in your work. We’ll highlight a few of the most common and useful parts for our introductory course.
The first time you open RStudio, you’ll see three panels, or windows.
- The panel on the left is the console, where you can run R code. The text printed in this panel is basic information about R and the version you’re running. You can test how the console can be used to run code by entering 3 + 4 and then pressing Enter. This instructs your computer to read, interpret, and execute the command, then print the result (7) to the Console, and show a right-facing arrow (>), indicating it is ready to accept additional code.
- The panel on the top right is the environment. It’s empty right now, but we’ll learn more about this later in this lesson.
- The panel on the lower right shows the files present in your working directory. Currently, that’s probably your Home directory, which includes folders like Documents and Downloads.
You may notice that some of the panels possess additional tabs. We’ll explore some of these features in this class, but for more information:
Help -> Cheatsheets -> RStudio IDE cheat sheet
This PDF includes an overview of each of the things you see in RStudio, as well as explanations of how you can use them. It may be intimidating right now, but will come in handy as you gain experience with R.
One of the ways that RStudio makes working in R easier is by allowing you to create R projects. You can think of a project as a discrete unit of work, such as a chapter of a thesis/dissertation, analysis for a manuscript, or a monthly report. We recommend organizing your code, data, and other associated files as projects, which allows you to keep all parts of an analysis together for easier access.
We’ll be creating a project to use for the duration of this course. Create a new project in RStudio:
File -> New Project
- Choose New Directory, then New Project
- Name your project intro_r and save it somewhere on your computer you’ll be able to find easily later (we recommend your Desktop or Documents)
- Click Create project
After your RStudio screen reloads, note two things:
- The file browser in the lower right panel will now show the contents of a new folder, intro_r, that was created as a part of your RStudio project.
- The console window will show the path, or location in your computer, for your project directory. This is important later in class, when this path will be required to locate data for analysis.
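If you would rather confirm the project path from code than read it off the top of the console window, base R can report it. This is a minimal sketch, assuming the intro_r project is open; the path printed will differ on every computer.

```r
# print the current working directory; with the intro_r project open,
# this should be the project folder (the exact path depends on your computer)
getwd()

# list the files R can see in that directory
list.files()
```

Both functions are part of base R and take no required arguments, so they are safe to run at any time.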
Now we’re ready to create a new R script:
File -> New File -> R Script
- Save the new file as class1.R. By default, RStudio will save this in your project directory.
This R script is a text file that we’ll use to save code we learn in this class. We’ll refer to this window as the script or source window. Remember to save this file periodically to retain the record of the work you’re doing, so you can re-execute the code later if necessary.
By convention, a script should include a title at the top, so type the following on the first line:
# Introduction to R: Class 1
Now that we have a project and new script set up, we’re ready to begin adding code. Skipping a line after the title, type the following on the next two lines:
# basic math
4 + 5
## [1] 9
The first of the two boxes above represents the code you execute. The second box (prefaced with ##) shows the output you should expect. The [1] in the second box means there is one item (in this case, 9) present in the output.
The first line in that example is a code comment. It is not interpreted by R, but is a human-readable explanation of the code that follows. This is also how we included a title in our script. In R, anything to the right of one or more # symbols represents a comment.
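As a small illustration of the comment rule described above, the sketch below shows a comment on its own line and a comment that follows code on the same line. Only the text to the right of the # is ignored.

```r
# a comment on its own line is skipped entirely
4 + 5   # a comment can also follow code on the same line
## [1] 9

## more than one # symbol still starts a comment
```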
The code above is the same mathematical operation we executed earlier. If we wanted to re-run this command, we have three options:
- Copy and paste the code into the Console
- Use the Run button at the top of the script window
- Use the keyboard shortcut: Ctrl + Enter
The third option is the most efficient, especially as your coding skills progress. With your cursor on the line with 4 + 5, hold down the Control key and press Enter. You’ll see the code and answer both appear in the Console. A few things to note about this keyboard shortcut:
- It doesn’t matter where your cursor is on the line of code; the entire line will be executed with the keyboard shortcut.
- If there isn’t code on the line where your cursor is located, RStudio will attempt to execute the lines that follow.
In practice, a script should represent code you are developing in R, and you should only save the code that you know works. For this class, we’ll be including notes about things we learn as comments.
Ctrl + Enter is the only keyboard shortcut we emphasize in this course, but there are many others available. You can view them on the second page of the cheat sheet linked above, or by going to Help -> Keyboard Shortcuts Help.
If you were looking carefully, you may have noticed that the + in the previous code example had spaces on either side, separating it from the numbers. You may wonder whether spaces matter in how the code is interpreted. As with many questions in coding, the easiest way to assess whether removing the spaces matters is to simply try it:
# same code as above, without spaces
4+5
## [1] 9
Given the output, we can conclude that spaces do not matter in how the code functions. In this case, however, spaces represent a common convention in formatting R code, as they make the code easier for human eyes to read. In general, you should attempt to replicate the code presented here as closely as possible, and we’ll do our best to note when something is required as opposed to convention.
Code convention and style don’t make or break the ability of your code to run, but they do affect whether other people can easily understand your code. A brief overview of common code style is available here, and more information is available in the tidyverse style guide.
So far, we’ve used R with mathematical symbols representing operations. R can perform much more complex tasks using functions, which are predefined sets of code that allow you to repeat particular actions.
R includes functions for other types of math:
# using a function: rounding numbers
round(3.14)
## [1] 3
In this case, round is the function, and 3.14 is the number (data) being manipulated by the function. A word followed by parentheses is a common format for functions in R.
Syntax refers to the rules that dictate how combinations of words and symbols are interpreted in a language (either programming or human).
Additional options for modifying functions are called arguments, and are included with the data between parentheses. For the round function, a common modification would be the number of decimal places in the output. You can change this detail by adding a comma and then an additional argument:
# using a function with more arguments
round(3.14, digits = 1)
## [1] 3.1
If you would like to learn more about how this function works, you can go to the bottom righthand panel and click on the Help tab. Enter the name of a function into the search box and hit Enter. Alternatively, execute the following in your console:
?round
This is a shortcut for performing the same task in the panel described above.
R help documentation tends to be formatted very consistently. At the very top, you’ll see the name of the function. Below that, a short title indicates the purpose of the function, along with a more verbose “Description”. “Usage” tells you how to use the function in code, and “Arguments” details each of the options in “Usage”. The rest of the subheadings should be self-explanatory.
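Two other base R helpers can complement the help page. They are not covered elsewhere in this lesson, so treat the following as an optional sketch: args() prints the arguments a function accepts, and example() runs the examples listed at the bottom of its help page.

```r
# print the arguments round() accepts, along with any default values
args(round)

# run the examples from the "Examples" section of the help page
example(round)
```

The output of args() mirrors the “Usage” section, which makes it a quick way to check argument names without leaving the console.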
In the example above, there is no label associated with 3.14. In reality, 3.14 represents x, so the command can actually be written as round(x = 3.14, digits = 1). Even if not explicitly stated, the computer assumes that 3.14 represents x if the number is the first thing that appears after the opening parenthesis.
If you define both arguments explicitly, you can switch the order in which they appear:
# can switch order of arguments
round(digits = 1, x = 3.14)
## [1] 3.1
If you remove the labels (round(1, 3.14)), the answer is different, because R is assuming you mean round(x = 1, digits = 3.14).
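The matching rules described above can be checked directly. The sketch below uses only round() and the base function identical(), which tests whether two results are exactly the same.

```r
# unlabeled arguments are matched by position: x first, then digits
round(3.14, 1)
## [1] 3.1

# the labeled and unlabeled calls give the same result
identical(round(3.14, 1), round(x = 3.14, digits = 1))
## [1] TRUE

# swapping the unlabeled values changes the meaning:
# 1 is the number being rounded and 3.14 is treated as digits
round(1, 3.14)
## [1] 1
```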
You may notice that boxes pop up as you type. These represent RStudio’s attempts to guess what you’re typing and share additional options.
What does the function hist do? What are its main arguments? How did you determine this?
So far, we’ve been performing tasks with R that require us to input the data manually. One of the strengths of using a programming language is the ability to assign data to objects, or variables.
Objects in R are referred to as variables in other programming languages. We’ll use these terms synonymously for this course, though in other contexts there may be differences between them. Please see the R documentation on objects for more information.
Like in math, a variable is a word used to represent a value (in this case, a number):
# assigning value to an object
weight_kg <- 55
In the code above, <- is the assignment operator: it instructs R to recognize weight_kg as representing the value 55. You can think of this code as referencing “55 goes into weight_kg.”
After executing the code above, you’ll see the object appear in the Environment panel on the upper right hand side of the RStudio screen. The name of the object will appear on the left, with the value assigned to it on the right.
The name you assign to objects can be arbitrary, but we recommend using names that are relatively short and meaningful in the context of the values they represent. It’s useful to also know other general limitations on object names:
- case sensitive
- cannot start with numbers
- avoid other common words in R (e.g., function names, like mean)
- avoid dots (underscores are a good alternative, such as the example above)
Extra information on object names is available in the tidyverse style guide.
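To see case sensitivity in action, the sketch below creates an object with a hypothetical name (sample_count is used only for this illustration) and asks R whether two different spellings exist, using the base function exists().

```r
# hypothetical object used only for this illustration
sample_count <- 12

# names are case sensitive: only the exact spelling refers to the object
exists("sample_count")
## [1] TRUE
exists("Sample_Count")
## [1] FALSE
```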
Now that the object has been assigned, we can reference that object by executing its name:
# recall object
weight_kg
## [1] 55
Thus, the value weight_kg represents is printed to the Console.
We can also perform operations on an object:
# multiply an object (convert kg to lb)
2.2 * weight_kg
## [1] 121
In that case, the answer is printed to the Console. You can also assign the output to a new object:
# assign weight conversion to object
weight_lb <- 2.2 * weight_kg
After executing that line of code, you’ll see weight_lb appear in the Environment panel, too.
Now let’s explore what happens if we assign a value to an existing object name:
# reassign new value to an object
weight_kg <- 100
Note that the value assigned to weight_kg as it appears in the Environment panel changes after executing the code above.
Has the value assigned to weight_lb also changed? You might expect this would be the case, since this value is derived from weight_kg. However, weight_lb remains the same as previously assigned. If you want the value for weight_lb to reflect the new value for weight_kg, you will need to again execute weight_lb <- 2.2 * weight_kg. This should help you understand an important concept in writing code: the order in which you execute lines of code matters! In the context of the material we cover in this class, we’ll continue saving code in scripts so we have a record of both the relevant commands and the appropriate order for execution.
You can think of the names of objects like sticky notes. You have the option to place the sticky note (name) on any value you choose. You can pick up the sticky note and place it on another value, but you need to explicitly tell R when you want values assigned to certain objects.
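The same behavior can be reproduced with any pair of objects. The sketch below uses hypothetical names (temp_c and temp_f) so that it does not disturb the weight objects in your environment.

```r
# hypothetical objects: a temperature in Celsius and its Fahrenheit conversion
temp_c <- 20
temp_f <- temp_c * 9 / 5 + 32
temp_f
## [1] 68

# reassigning temp_c does not update temp_f
temp_c <- 30
temp_f
## [1] 68

# temp_f only changes once the conversion is executed again
temp_f <- temp_c * 9 / 5 + 32
temp_f
## [1] 86
```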
At this point in the lesson, it’s common to have accidentally created an object with a typo in the name. If this has happened to you, it’s useful to know how to remove the object to keep your environment up to date. Here, we’ll practice removing an object with something everyone has available:
# remove object
remove(weight_lb)
This removes the specified object from the environment, which you can confirm by its absence in the Environment panel. You can also abbreviate this command to rm(weight_lb).
You can clear the entire environment using the button at the top of the Environment panel with a picture of a broom. This may seem extreme, but don’t worry! We can re-create all the work we’ve already done by executing each line of code again.
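You can also check the contents of the environment from code. The sketch below assumes the base function ls(), which is not covered elsewhere in this lesson and returns the names of the objects currently defined; scratch_value is a hypothetical object created only for the demonstration.

```r
# hypothetical object created only for this demonstration
scratch_value <- 1

# ls() returns the names of objects in the environment
"scratch_value" %in% ls()
## [1] TRUE

# after removal, the name is no longer listed
rm(scratch_value)
"scratch_value" %in% ls()
## [1] FALSE
```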
For the code chunk below, what is the value of each item at each step?
mass <- 47.5 # mass?
width <- 122 # width?
mass <- mass * 2.0 # mass?
width <- width - 20 # width?
mass_index <- mass/width # mass_index?
So far, we’ve worked with objects containing a single value. For most research purposes, however, it’s more realistic to work with a collection of values. We can do that in R by creating a vector with multiple values:
# assign vector
ages <- c(50, 55, 60, 65)
# recall vector
ages
## [1] 50 55 60 65
The c function used above stands for “combine,” meaning all of the values in parentheses after it are included in the object. This is reflected in the Console, where recalling the value shows all four values, and the Environment window, where multiple values are shown on the right side.
We can use functions to ask basic questions about our vector, including:
# how many things are in object?
length(ages)
## [1] 4
# what type of object?
class(ages)
## [1] "numeric"
# get overview of object
str(ages)
## num [1:4] 50 55 60 65
In the code above, we learn that there are four items (values) in our vector, and that the vector is composed of numeric data. str stands for “structure”, and shows us a general overview of the data, including a preview of the first few values (or all the values, as is the case in our small vector).
Even more useful is the ability to use functions to perform more complex tasks for us, such as statistical summaries:
# performing functions with vectors
mean(ages)
## [1] 57.5
range(ages)
## [1] 50 65
Although we’ve focused on numbers as data so far, it’s also possible for data to be words instead:
# vector of body parts
organs <- c("lung", "prostate", "breast")
In this case, each word is encased in quotation marks, indicating these are character data, rather than object names.
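The difference between a quoted word and an object name is easy to see side by side. In this sketch, tumor_stage is a hypothetical object created only for the comparison.

```r
# hypothetical object used only for this comparison
tumor_stage <- 2

# without quotes, R looks up the object and prints its value
tumor_stage
## [1] 2

# with quotes, R treats the text as character data
"tumor_stage"
## [1] "tumor_stage"
```

Forgetting the quotes around a word that is not an object name results in an "object not found" error.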
Please answer the following questions about organs:
- How many values are in organs?
- What type of data is organs?
- How can you see an overview of organs?
We’ve seen data as numbers and letters so far. In fact, R has all of the following basic data types:
- character: sometimes referred to as string data, tend to be surrounded by quotes
- numeric: real or decimal numbers, sometimes referred to as “double”
- integer: a subset of numeric in which numbers are stored as integers
- logical: Boolean data (TRUE and FALSE)
- complex: complex numbers with real and imaginary parts (e.g., 1 + 4i)
- raw: bytes of data (machine readable, but not human readable)
Character, numeric, and logical data are the focus of this class. R automatically interprets the type as you enter data. Most data analysis activities will not require you to understand specific details of the other data types.
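The class function introduced earlier reports each of these types. The sketch below passes one single value of each kind; the L suffix (which marks a number as an integer) and as.raw() are base R features not otherwise used in this lesson.

```r
class("lung")        # text in quotes
## [1] "character"
class(3.14)          # numbers are numeric by default
## [1] "numeric"
class(5L)            # the L suffix stores the number as an integer
## [1] "integer"
class(TRUE)
## [1] "logical"
class(1 + 4i)
## [1] "complex"
class(as.raw(255))   # as.raw() converts a number to a single byte
## [1] "raw"
```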
R tends to handle interpreting data types in the background of most operations. The following code is designed to cause some unexpected results in R. What is unusual about each of the following objects?
num_char <- c(1, 2, 3, "a")
num_logical <- c(1, 2, 3, TRUE)
char_logical <- c("a", "b", "c", TRUE)
tricky <- c(1, 2, 3, "4")
In the section above, we learned to create and assess vectors, and use functions to calculate statistics across the values. We can also modify a vector after it’s been created:
# add a value to end of vector
ages <- c(ages, 90)
The example above uses the same combine (c) function as when we initially created the vector. We can also use it to add values to the beginning of the vector:
# add value at the beginning
ages <- c(30, ages)
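To see the effect of each step without changing ages again, the same pattern can be tried on a small hypothetical vector (doses is used only for this illustration):

```r
# hypothetical vector used only for this illustration
doses <- c(10, 20)

# add a value to the end, then to the beginning
doses <- c(doses, 30)
doses <- c(5, doses)
doses
## [1]  5 10 20 30
```

Note that each line reassigns the object; running c(doses, 30) on its own would print the longer vector but leave doses unchanged.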
We can also extract, or subset, a portion of a vector:
# extracting second value
organs[2]
## [1] "prostate"
In general, square brackets ([ ]) in R refer to a part of an object. The number 2 indicates the second value in the vector.
The index position of a value is the number associated with its location in a collection. In the example above, note that R indexes (or counts) starting with 1. This is different from many other programming languages, like Python, which use 0-based indexing.
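A short sketch of 1-based indexing, using the organs vector from this lesson (recreated here so the block runs on its own):

```r
organs <- c("lung", "prostate", "breast")

# the first value is at position 1
organs[1]
## [1] "lung"

# the last value is at the position equal to the vector's length
organs[length(organs)]
## [1] "breast"

# position 0 does not hold a value, so nothing is returned
organs[0]
## character(0)
```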
In R, a minus sign (-) can be used to negate a value’s position, which excludes that value from the output:
# excluding second value
organs[-2]
## [1] "lung" "breast"
You may be tempted to try extracting multiple values at a time by separating the numbers with commas (e.g., organs[2,3]). This will result in a rather cryptic error, which we’ll talk more about next time. For now, remember that you can use the combine function to indicate multiple values for subsetting:
# extracting first and third values
organs[c(1, 3)]
## [1] "lung" "breast"
We’ll switch back to our numerical ages object to explore another common need when subsetting: extracting values based on a condition (or criterion). For numerical data, we’re often interested in extracting data that are in a certain range of values. It is tempting to try something like:
ages > 60
## [1] FALSE FALSE FALSE FALSE TRUE TRUE
The result, however, is less than satisfying: you receive either TRUE or FALSE for each data point, depending on whether it meets the condition or not.
While that information isn’t quite what we expected, we can combine it with the subsetting syntax we learned earlier:
# extracts values which meet condition
ages[ages > 60]
## [1] 65 90
If we read the code above from the inside out (a common strategy for R), the code above identifies which values meet the criteria, and the square brackets are used to extract this from the original vector.
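Reading from the inside out becomes easier to follow if the two steps are separated. In the sketch below, over_60 is a hypothetical intermediate object, and ages is recreated with the values it holds at this point in the lesson.

```r
ages <- c(30, 50, 55, 60, 65, 90)

# step 1 (inside): test each value against the condition
over_60 <- ages > 60
over_60
## [1] FALSE FALSE FALSE FALSE  TRUE  TRUE

# step 2 (outside): keep only the positions where the test is TRUE
ages[over_60]
## [1] 65 90
```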
If you want to extract items exactly equal to a specific value, you need to use two equal signs:
# extracts numerically equivalent values
ages[ages == 60]
## [1] 60
You can think of this as a way to differentiate mathematical equivalency from specification of parameters for arguments (such as digits = 1 for round(), as we learned earlier). R also allows you to use <= and >=.
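The other comparison operators follow the same pattern. The sketch below also includes != ("not equal to"), a base R operator not mentioned above; ages is recreated so the block runs on its own.

```r
ages <- c(30, 50, 55, 60, 65, 90)

# greater than or equal to
ages[ages >= 60]
## [1] 60 65 90

# less than or equal to
ages[ages <= 50]
## [1] 30 50

# not equal to
ages[ages != 60]
## [1] 30 50 55 65 90
```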
Finally, it’s common to need to combine conditions while subsetting. For example, you may be interested in only values less than 50 or greater than 60:
# ages less than 50 OR greater than 60
ages[ages < 50 | ages > 60]
## [1] 30 65 90
In the code above, the vertical pipe | is interpreted to mean “or,” so each data point can belong to either the category on the left of the pipe, the category on the right, or both. In other words, the vertical pipe means any single value being evaluated must meet one or both conditions.
You can also combine conditions with &, but this means any single value must meet both conditions:
# ages greater than 50 AND less than 60
ages[ages > 50 & ages < 60]
## [1] 55
Be careful when thinking about human language as opposed to programming languages. When speaking, it is reasonable to say “extract all values below 50 and above 60.” While this makes sense in context, it is mathematically impossible for a value to be both less than 50 AND greater than 60.
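Running the literal translation of that spoken request shows the problem: no value can satisfy both conditions, so R returns an empty vector. The sketch recreates ages so it runs on its own.

```r
ages <- c(30, 50, 55, 60, 65, 90)

# "below 50 AND above 60" is impossible for a single value
ages[ages < 50 & ages > 60]
## numeric(0)

# the spoken request actually corresponds to OR
ages[ages < 50 | ages > 60]
## [1] 30 65 90
```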
Why does the following code return the answer it does?
"four" > "five"
## [1] TRUE
Most datasets we encounter have some missing data. Programming languages interpret and handle missing data in different ways, so it’s worth taking time to dig into how R approaches this issue.
First, we’ll create a new vector with some values indicated as missing data:
# create a vector with missing data
heights <- c(2, 4, 4, NA, 6)
In the vector above, NA represents a value where data are missing. You may notice NA is not encased in quotation marks. This is because R interprets that set of characters specifically as missing data.
Next, let’s investigate how this vector responds to use in functions:
# calculate mean and max on vector with missing data
mean(heights)
## [1] NA
max(heights)
## [1] NA
The answer isn’t very satisfying; we’re told the answer is missing data because of the presence of a single missing value in the vector. This is a slightly frustrating default behavior for some common statistical functions in R, but we can add an argument to ignore missing data and calculate across the remaining values:
# add argument to remove NA
mean(heights, na.rm = TRUE)
## [1] 4
max(heights, na.rm = TRUE)
## [1] 6
In the code above, the na.rm parameter controls whether missing data are removed. The default (which you can also reference in the help documentation) is for missing values to be included (na.rm = FALSE). By switching to na.rm = TRUE, we’re instructing R to remove missing data.
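The na.rm argument is shared by several other base R summary functions. As a brief sketch, sum() and min() behave the same way as mean() and max() did above; heights is recreated so the block runs on its own.

```r
heights <- c(2, 4, 4, NA, 6)

# default behavior: a single NA makes the result NA
sum(heights)
## [1] NA

# with na.rm = TRUE, the calculation uses the remaining values
sum(heights, na.rm = TRUE)
## [1] 16
min(heights, na.rm = TRUE)
## [1] 2
```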
The example above ignores missing values during the calculation but leaves them in the vector. There are certainly cases in which you may want to specifically filter out the missing data from your dataset.
The function is.na allows you to ask whether elements in a dataset are missing:
# identify elements which are missing data
is.na(heights)
## [1] FALSE FALSE FALSE TRUE FALSE
If a resulting value is TRUE, the value is missing. If FALSE, the data point is present. We can invert the resulting logical data using an exclamation point:
# reverse the TRUE/FALSE
!is.na(heights)
## [1] TRUE TRUE TRUE FALSE TRUE
This means missing data are now listed as FALSE, with data present as TRUE.
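For a long vector, scanning a printed row of TRUE and FALSE values is impractical. Two base R helpers not covered elsewhere in this lesson can summarize the result; this sketch assumes the same heights vector, recreated here.

```r
heights <- c(2, 4, 4, NA, 6)

# does the vector contain any missing values at all?
anyNA(heights)
## [1] TRUE

# at which position(s) are the missing values located?
which(is.na(heights))
## [1] 4
```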
As with the conditional statements we learned earlier, we can combine these results with our square bracket subsetting syntax to extract only values that are present in the dataset:
# extract elements which are not missing values
heights[!is.na(heights)]
## [1] 2 4 4 6
Alternatively, you can use a function specifically designed for excluding (omitting) missing data:
# remove incomplete cases
na.omit(heights)
## [1] 2 4 4 6
## attr(,"na.action")
## [1] 4
## attr(,"class")
## [1] "omit"
You may notice that this output looks slightly different than the previous example. This is because na.omit includes output about attributes, or information about the data. The output vectors are the same for the last two code examples, even though the way they appear in the Console seems different.
If you aren’t sure how to interpret the output in your console, sometimes it helps to assign the output to an object. You can then inspect the data type, structure, etc., to ensure you’re getting the answer you expected.
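Applied to the na.omit example, that advice might look like the sketch below. The name heights_complete is hypothetical, and as.vector() (a base R function not otherwise used in this lesson) is included only to strip the extra attributes before comparing.

```r
heights <- c(2, 4, 4, NA, 6)

# assign the output to an object so it can be inspected
heights_complete <- na.omit(heights)

# the data are still numeric, with four values remaining
class(heights_complete)
## [1] "numeric"
length(heights_complete)
## [1] 4

# once attributes are dropped, the values match the square bracket approach
identical(as.vector(heights_complete), heights[!is.na(heights)])
## [1] TRUE
```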
Complete the following tasks after executing the code chunk below. (Note: there are multiple solutions):
- Remove NAs
- Calculate the median
- Identify how many elements in the vector are greater than 67 inches
- Visualize the data as a histogram (hint: function hist)
# create vector
more_heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
In this session, we spent some time getting to know the RStudio interface for writing and running R code, explored the basic principles of R syntax for functions and object assignment, and worked with vectors to understand how R handles missing data.
In the next session, we’ll learn to import spreadsheet-style data that are more similar to what you’d likely handle for a research project, and practice accessing different portions of the data.
When you are done working in RStudio, you should save any changes to your R script. When you close RStudio, you will see a pop-up box asking if you want to save your workspace image. We do not recommend saving your project in this way, as it creates extra (hidden) files on your computer that can be unwieldy in size and inadvertently retain sensitive data (if you’re working with PHI or other private data). If you’ve saved your R script, you can recreate all the work you’ve accomplished. For more information on this topic, please review this explanation. If you would like to prevent this box from popping up in the future, we recommend:
- Go to Tools -> Global Options (Global means for all projects; you can also change this for each project using Project Options)
- In the drop-down menu next to Save workspace to ~/.RData on exit, select Never.
If you need to reopen your project after closing RStudio, use File -> Open Project and navigate to the location of your project directory. Alternatively, using your operating system’s file browser, double click on the intro_r.Rproj file.
This document is written in R Markdown, which is a method of formatting text, code, and output to create documents that are sharable with other people. While this document is intended to serve as a reference for you to read while typing code into your own script, you may also be interested in modifying and running code in the original R Markdown file (class1.Rmd in the GitHub repository).
The course materials webpage is available here. Materials for all lessons in this course include:
- Class 1: R syntax, assigning objects, using functions
- Class 2: Data types and structures; slicing and subsetting data
- Class 3: Data manipulation with dplyr
- Class 4: Data visualization in ggplot2
Answers to all challenge exercises are available here.
- Create an object called agge that contains your age in years
- Reassign the object to a new object called age (i.e., correct the typo)
- Remove the previous object from your environment
- Calculate your age in days
- create an object representing a vector that contains the names of buildings on Fred Hutch’s campus: https://www.fredhutch.org/en/contact-us/visit-us.html
- add Seattle, Washington to the beginning of the vector, and Steam Plant to the end of the vector
- subset the vector to show only the building in which you work
The following vector represents the number of vacation days possessed by various employees:
vacation_days <- c(5, 7, 20, 1, 0, 0, 12, 4, 2, 2, 2, 4, 5, 6, 7, 10, 4)
- How many employees are represented in the vector?
- How many employees have at least one work week’s worth of vacation available to them?