---
title: "Session 2 -- Working with data in R"
---
> #### Learning objectives
>
> * Create and run a script file containing your R code
> * Extract values or subsets from vectors
> * Modify values within a vector
> * Perform vector arithmetic
> * Introduce more sophisticated data structures (lists and data frames)
> * Learn how to install and use packages that extend R's basic functionality
> * Read data in tabular format into R
> * Calculate summary statistics on your tabular data
> * Introduce the **`tibble`**, arguably the most important data structure you will use in R
> * Learn how R deals with missing values
# Getting started with data in R
In this course, we'll be mostly focusing on a set of R packages specifically
designed for the most useful and common operations for interacting with and
visualizing data, and particularly data in a tabular form. This collection of
packages is known as the **tidyverse**.
Learning the **tidyverse** is more than just learning about some helpful
functions to get certain tasks done. The tidyverse packages form a coherent
system for data manipulation, exploration and visualization that share a common
design philosophy. There is a certain elegance to code crafted using the
tidyverse approach that is in stark contrast to the strange and often cryptic
equivalent using traditional 'base R'. This old-timer only really became a true
R convert on being introduced to the tidyverse.
Sadly, as much as we'd like, we can't just cover the tidyverse alone and
ignore the fundamentals of the R language. So this week we will look at some
aspects of R that are crucial to understanding how R is handling your data,
even though we will come back to some of these concepts in later weeks to show
you how those same operations are more easily and elegantly carried out in
the tidyverse.
We will also start to look at the most important data structure you'll use
with your data, assuming it is in tabular form, the **data frame**, and its
superior tidyverse derivative, the **tibble**.
---
# Scripts
Up to now, we have mostly been typing code in the Console pane at the **`>`** prompt.
This is a very interactive way of working with R but what if you want to save
the commands you've typed for a future session working in R?
Instead we can create a script file containing our R commands that we can come
back to later. This is the way most R coding is done so let's have a go.
From the RStudio '**File**' menu, select '**New File**' and then '**R Script**'.
![](images/RStudio_new_file_menu.png){width=50%}
You should now have a new file at the top of the left-hand side of RStudio for
your new R script named 'Untitled1'. The Console window no longer occupies the
whole of the left-hand side.
![](images/RStudio_new_script.png){width=100%}
We can type code into this file just as we have done in the Console window at
the command prompt.
Type in some of the commands from last week's assignment. Do you notice that the
file name on the tab for this pane is now highlighted in red and has an
asterisk?
![](images/RStudio_modified_script.png){width=50%}
This tells us that we haven't yet saved our changes. There are various ways to
do so just like in Word or Excel or other applications you're familiar with, for
example using the '**Save**' option from the '**File**' menu or clicking on the
'**Save**' button.
My preference by far is to use a keyboard shortcut. On a Mac this would be
<kbd>cmd</kbd> + <kbd>S</kbd> (press the <kbd>cmd</kbd> key first and, while
holding it down, press the <kbd>S</kbd> key); on Windows you would do the
same thing using <kbd>Ctrl</kbd> + <kbd>S</kbd>.
If the file already exists it will be saved without any further ado. As this is
a new file, RStudio needs to know what you want to call it and in which folder
on your computer you want it to be saved. You can choose the file name and
location using the file dialog that appears. RStudio will add a '.R' suffix if
you don't specify one.
It is a good idea to keep your scripts and assignment files for this course
together in one folder or directory.
## Running scripts
Having typed an R command and hit the return key you'll notice that the
command isn't actually run like it was in the console window. That's because
you're writing your R code in an editor. To run a single line of code within
your script you can press the '**Run**' button at the top of the script.
![](images/RStudio_run_script_command.png)
This will run the line of code on which the cursor is flashing or the next line
of code if the cursor is on a blank or empty line.
The keyboard shortcut is more convenient in practice as you won't have to stop
typing at the keyboard to use your mouse. This is <kbd>cmd</kbd> +
<kbd>return</kbd> on a Mac and <kbd>Ctrl</kbd> + <kbd>enter</kbd> on Windows.
Running a line in your script will automatically move the cursor onto the
next command which can be very convenient as you'll be able to run successive
commands just by repeatedly clicking '**Run**' or using the keyboard shortcut.
You can also run the entire script by clicking on the '**Source**' button, a
little to the right of the '**Run**' button. More useful though is to run
'**Source with Echo**' from the Source drop-down menu as this will also display
your commands and the outputs from these in the Console window.
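The '**Source**' options have a counterpart in the R language itself, the
`source()` function. The following is a minimal sketch of running a script with
and without echo from the console; the script contents, the temporary file and
the variable names are all made up purely for illustration.

```{r}
# write a tiny two-line script to a temporary file (made up for this example)
script_file <- tempfile(fileext = ".R")
writeLines(c("script_values <- 1:5", "script_total <- sum(script_values)"), script_file)
# run the whole script without displaying the commands
source(script_file)
# run it again, this time echoing each command as it is run
source(script_file, echo = TRUE)
script_total
```

In practice you would pass the path of your own saved script rather than a
temporary file.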
## Adding comments to scripts
It is a very good idea to add comments to your code to explain what it's doing
or what you intended. This will help others to understand your code and more
than likely even yourself when you come to revisit it a few weeks or months
later.
Anything following a **`#`** symbol is a comment that R will ignore. Here's an
example of adding comments to our simple script.
![](images/RStudio_script_comments.png)
Comments usually appear at the beginning of lines but can appear at the end of
an R statement.
```{r}
days <- c(1, 2, 4, 6, 8, 12, 16) # didn't manage to get a measurement on day 10
```
It is also quite common when looking at R code to see lines of code commented
out, usually replaced by another line that does something similar or makes a
small change.
```{r}
# random_numbers <- rnorm(100, mean = 0, sd = 1)
random_numbers <- rnorm(100, mean = 0, sd = 0.5)
```
---
# Vectors
In Session 1 we introduced **vectors**, the simplest type of data structure in R.
An atomic vector is a collection of values or things of the same type in a given
order. We created some last session using **`c()`** and the **`:`** operator.
```{r}
some_numbers <- 1:10
days_of_the_week <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
```
In the first example, an integer vector was created with 10 values from 1 to 10.
`some_numbers` is a name that refers to the vector and can be used in other R
statements while `1:10` is the vector object itself.
```{block type = "rmdblock"}
**`:` operator**
The colon operator (**`:`**) returns every integer between two integers. These
can be in ascending or descending order and can include negative numbers.
`countdown <- 10:1`
`including_some_negative_numbers <- -4:3`
```
A single value is known as a **scalar**. An example from last session was the
number of samples in our experiment.
```{r}
samples <- 8
samples
```
But as we saw last session, R doesn't treat this single value any differently;
it is still a vector, just one that has a length of 1.
```{r}
length(days_of_the_week)
length(samples)
```
Almost every object in R is a vector of one kind or another, or is constructed
of vectors, so it's really important to understand these well.
```{block type = "rmdblock"}
**`length()`**
The `length()` function returns the number of elements in a vector.
`length(8:15)`
```
## Combining vectors
The other way we've encountered for creating a vector is to use **`c()`**. This
is actually a function and we can get help for it just as we can for any other
function.
```{r eval = FALSE}
?c
```
From the help page you can see that `c` stands for 'combine' (or perhaps
'concatenate' as both terms are used in the documentation).
One of the most useful things about the help pages for functions is the set of
examples they give -- you'll need to scroll down to the bottom of the help
page to see them. These can be really helpful in demonstrating how a function
works. You can very easily copy and paste these examples and run them in the
console window as a way of experimenting with and learning about the function.
Let's have a look at the first example from the help page for `c()`. It's
slightly more complicated than what we did last session.
```{r}
c(1, 7:9)
```
This is actually combining two vectors, the first with a single value `1`
and the second with values `7`, `8` and `9`. Here's another example:
```{r}
cats <- c("felix", "kitty", "tigger")
dogs <- c("spot", "snoopy")
cats_and_dogs <- c(cats, dogs)
cats_and_dogs
```
```{block type = "rmdblock"}
**`c()`**
The `c()` function is a generic function that combines its arguments, i.e. the
things you pass to the function by including these within the parentheses, `()`.
You can pass as many vectors as you like to `c()` and it will concatenate these
into a single vector.
Arguments will be coerced to a common type.
`c(1:5, 10.5, "next")`
```
## Coercion
Atomic vectors must contain values that are all of the same type. A bit later
on, we'll introduce another type of data structure that doesn't have this
restriction -- the list. First though, a look back at one of the exercises
from last session's assignment in which we tried to create vectors of things
that are of different types.
```{r}
integer_logical <- c(1:5, c(TRUE, FALSE, TRUE))
integer_logical
typeof(integer_logical)
```
Combining an integer vector, `1:5`, containing the numbers 1 through 5, with a
logical vector results in an integer vector. The logical values have been
*'coerced'* into integers. But why the logical values and not the integers to
produce a logical vector?
If you think about it, it makes more sense to convert logical values where
`TRUE` and `FALSE` are usually represented in computers by the bits `1` and `0`
respectively. `TRUE` and `FALSE` have natural and understandable equivalents in
the world of integers. Which logical value would you give to the number 5 for
example?
Similarly, integers get converted to doubles in this example:
```{r}
integer_double <- c(3.4, 7, 2.3, 6:-3)
integer_double
typeof(integer_double)
```
Again, this makes more sense than converting doubles (numbers with a decimal
point) to integers (whole numbers) and losing some of their precision.
Finally, it is really not obvious how to convert most character strings into
either logical or numeric values, so when combining vectors that contain
characters everything else gets 'coerced' into becoming characters.
```{r}
we_all_want_to_be_characters <- c(FALSE, 1:5, 23L, 3.23, 5.2e4, 1e-6, "matt")
we_all_want_to_be_characters
typeof(we_all_want_to_be_characters)
```
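The order in which types give way to each other can be checked directly. The
sketch below combines pairs of values of different types and also uses the
explicit conversion functions `as.integer()`, `as.numeric()` and
`as.logical()`, which are not otherwise covered in this session.

```{r}
# each combination takes on the more flexible of the two types
typeof(c(TRUE, 2L))    # logical and integer
typeof(c(2L, 3.5))     # integer and double
typeof(c(3.5, "a"))    # double and character
# the same conversions can be requested explicitly
logicals_as_integers <- as.integer(c(TRUE, FALSE, TRUE))
logicals_as_integers
as.numeric("3.14")
# going the other way, zero becomes FALSE and any other number becomes TRUE
numbers_as_logicals <- as.logical(c(0, 1, 5))
numbers_as_logicals
```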
## Extracting subsets
One of the operations we do frequently on our data is to select subsets that
are of particular interest. For example, we may be interested in the top 50
genes in a differential expression analysis for our RNA-seq experiment where
those genes of interest are the ones with a log fold change above a certain
value and with a *p*-value below 0.01.
Having a good understanding of how to select a subset of values from a vector
is going to be invaluable when we come to do the same for more complicated
data structures so let's take a look.
The main subsetting operator we'll use is the square bracket, **`[]`**. Here's an
example.
```{r}
log2fc <- c(2.3, -1, 0.48, 0.97, -0.02, 1.23)
log2fc[3]
```
We have a vector of six log~2~ fold change values and we've chosen to select
the third value.
If you're familiar with other programming languages you will notice that the
indexing scheme in R starts from 1, not 0. So the first element in the vector
is referred to using the index 1, i.e. `log2fc[1]`.
Multiple values can be extracted by providing a vector of indices, e.g.
```{r}
log2fc[c(2, 4, 5)]
```
You can also extract elements in a different order, e.g.
```{r}
log2fc[c(2, 5, 4)]
```
It is possible to exclude values instead by providing negative indices, e.g. to
exclude the second element:
```{r}
log2fc[-2]
```
Or to exclude multiple elements:
```{r}
log2fc[-c(2, 3)]
```
Finally, we can also subset our vector using a vector of logical values.
```{r}
log2fc[c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)]
```
```{block type = "rmdblock"}
**Parentheses `()` and brackets `[]`**
Remember to use **`()`** for **functions** and **`[]`** for **subsetting**.
`my_vector <- c(1, 7:9)`
`my_vector[2]`
```
### Conditional subsetting
You may be thinking that the last example, in which we extracted a subset using
logical values, seems very abstract and wondering how it could possibly be useful.
But actually, it is probably the most frequently used way of selecting values of
interest. To understand why, we'll need to introduce the concept of logical
operators.
Let's say we're interested in just the log~2~ fold changes that are above a
threshold of 0.5. We can test each of the values using the **`>`** logical
operator.
```{r}
log2fc > 0.5
```
This results in a logical vector containing `TRUE` and `FALSE` values for each
element. The values at positions 1, 4 and 6 in our vector are above the
threshold so result in `TRUE`, the others result in `FALSE`.
We can use this resulting vector to subset our original `log2fc` vector.
```{r}
above_threshold <- log2fc > 0.5
log2fc[above_threshold]
```
In practice, we wouldn't really create a variable containing our logical vector
signifying whether values are of interest. Instead we'd do this in a single
step.
```{r}
log2fc[log2fc > 0.5]
```
However, in a real R script, we might not want to hard-wire the threshold of 0.5
but instead let the user specify this each time the script is run, e.g. by
passing in the value as a command-line argument. If we have a variable storing
the desired threshold value, e.g. `log2fc_threshold`, then we would write the
above as follows.
```{r}
log2fc_threshold <- 0.5
log2fc_above_threshold <- log2fc[log2fc > log2fc_threshold]
log2fc_above_threshold
```
We also captured the result in another vector called `log2fc_above_threshold`
although we could have overwritten our original `log2fc` if we wanted to by
assigning the result back to `log2fc`.
```{r}
log2fc <- log2fc[log2fc > log2fc_threshold]
```
You can combine two or more conditions using **`&`** if you want both conditions
to be true or using **`|`** if either of the conditions holds.
```{r}
# reset our log2fc vector to how it was originally
log2fc <- c(2.3, -1, 0.48, 0.97, -0.02, 1.23)
# find small fold changes
log2fc[log2fc < 0.5 & log2fc > -0.5]
```
```{r}
# find large fold changes
log2fc[log2fc > 1 | log2fc < -1]
```
`&` and `|` are the R versions of the AND and OR operations in Boolean algebra
but applied to vectors.
### Logical operators
The following table lists the logical operators you can use in R.
| Operator | Description |
| -------- | ------------------------ |
| < | less than |
| <= | less than or equal to |
| > | greater than |
| >= | greater than or equal to |
| == | exactly equal to |
| != | not equal to |
| !x | NOT x |
| x \| y | x OR y |
| x & y | x AND y |
`x` and `y` in the last 3 of these operators are intended for logical values; if
you apply them to numeric values, these will be coerced to logicals, with zero
treated as `FALSE` and any other number as `TRUE`.
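As a quick illustration of the operators in the table, here is a sketch that
applies them to a small made-up vector of read counts (`read_counts` is purely
illustrative).

```{r}
read_counts <- c(12, 0, 7, 30, 0, 18)
read_counts == 0                        # which samples have no reads?
read_counts != 0                        # the opposite test
!(read_counts == 0)                     # NOT gives the same answer as !=
in_range <- read_counts >= 7 & read_counts <= 18   # both conditions must hold
in_range
read_counts < 7 | read_counts > 18      # either condition may hold
# conditions can be used directly for subsetting
non_zero_counts <- read_counts[read_counts != 0]
non_zero_counts
```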
## Modifying subsets
All subsetting operations can be combined with assignment. So we can modify or
overwrite the values at specified positions in our vector.
```{r}
some_numbers <- 1:10
some_numbers[c(2, 4, 8)] <- c(150, 34, -10)
some_numbers
```
And, as before, we could use a condition to decide which values to change. For
example, you may decide that log~2~ fold changes above 1 are somewhat unreliable
with your detection method and so you'd like to put a cap on any values above
this limit.
```{r}
log2fc[log2fc > 1] <- 1
log2fc
```
## Vector arithmetic
Many operations in R are **vectorized**, which means that the operation is
applied to an entire set of values at once. We've already seen lots of examples
of this, like the following:
```{r}
some_numbers <- 1:10
square_numbers <- some_numbers ^ 2
square_numbers
```
Here we used the **`^`** exponent operator to raise our numbers to the power of
2. This happened in a single operation, i.e. just one line of code. In other
computer languages we might have had to write what is known as a loop in order
to iterate over and perform the calculation for each value in turn.
Another way of writing the above statement to get the same result would be to
multiply our numbers by themselves.
```{r}
square_numbers <- some_numbers * some_numbers
square_numbers
```
What actually happened here was slightly different though. Two vectors
(actually the same vector in this case) of the same length were multiplied
together. R did this **_element-by-element_**, which means that the first
element from the first vector was multiplied by the first element of the second
vector. Likewise, the second elements from each vector were multiplied by each
other and so on.
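To make the element-by-element behaviour concrete, the following sketch spells
out the same kind of multiplication using an explicit loop, something we
haven't covered and which is shown here only for comparison. The names
`first_vector`, `second_vector` and `products` are made up for this example.

```{r}
first_vector <- c(2, 4, 6)
second_vector <- c(10, 100, 1000)
# the long way round: multiply each pair of elements in turn
products <- numeric(length(first_vector))
for (i in seq_along(first_vector)) {
  products[i] <- first_vector[i] * second_vector[i]
}
products
# the vectorized version gives the same result in a single step
identical(products, first_vector * second_vector)
```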
Here's another example that we will also show diagrammatically.
```{r}
a <- 1:6
b <- c(3.2, 0.4, 1.6, 0.5, 1.3, 0.1)
a * b
```
![](images/vector_arithmetic.png)
Usually vector arithmetic involves two vectors of the same length or involves
a vector and a scalar (a vector of length 1). One of the first examples from
last week was the second scenario involving a vector and a single value.
```{r}
1:36 * 2
```
### Vector recycling (advanced)
It is possible to perform calculations using two vectors of different sizes.
When R runs out of values to use from the shorter of the two vectors, it wraps
around to the beginning of that shorter one. For example, we can set every other
value in our set of numbers to be negative as follows:
```{r}
some_numbers * c(1, -1)
```
When R gets to the third element it has exhausted the shorter vector,
`c(1, -1)`, so it goes back to the beginning, i.e. back to the first value,
`1`. It uses the second vector five times in what is referred to as **_vector
recycling_**.
You will probably never have to do something like this (why would you?) but
without you knowing it you will carry out vector arithmetic using recycling
very frequently. This is because a very common operation is to carry out a
calculation on a vector using a single scalar value. For example, multiplying
all values by a constant.
```{r}
heights_in_metres <- c(1.86, 1.65, 1.72, 1.4, 1.79)
heights_in_centimetres <- 100 * heights_in_metres
heights_in_centimetres
```
The multiplier of 100 is effectively being recycled and so is equivalent
to us having written the following:
```{r}
heights_in_centimetres <- c(100, 100, 100, 100, 100) * heights_in_metres
```
R will give a warning if we carry out vector arithmetic on two vectors where the
length of one of those vectors is not an exact multiple of the length of the
other.
```{r}
1:7 * c(1, -1)
```
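One way to picture recycling is that R repeats the shorter vector until it
matches the length of the longer one. The sketch below makes this explicit with
the `rep()` function, which is not otherwise covered in this session;
`one_to_ten` and `signs` are names made up for the example.

```{r}
one_to_ten <- 1:10
signs <- c(1, -1)
# recycling carried out by R behind the scenes
recycled <- one_to_ten * signs
# the same calculation with the shorter vector repeated explicitly
repeated_signs <- rep(signs, times = 5)
explicit <- one_to_ten * repeated_signs
identical(recycled, explicit)
```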
---
# Combining data of different types
Vectors are all very well and good but your data are almost certainly more
complicated than an ordered set of values all of the same type. You've probably
been working with Excel spreadsheets that contain some columns that are
numerical while others contain names or character strings, e.g. the following
table of Star Wars characters.
```{r echo = FALSE, message = FALSE}
DT::datatable(dplyr::select(dplyr::starwars, name, height, mass, gender, species, homeworld), rownames = FALSE)
```
Note that the first column contains character type data, the second and third
columns contain numerical data (of type double) and the remaining columns
contain the special type we briefly touched on last week, factors, that look
like character types but have a limited set of values or categories.
## Lists
R's simplest structure that combines data of different types is a **list**.
A list is a collection of vectors. It is also a vector itself but is a step up
in complexity from the atomic vectors we've been looking at up until now.
Vectors in a list can be of different types and different lengths.
```{r}
my_first_list <- list(1:10, c("a", "b", "c"), c(TRUE, FALSE), 100, c(1.3, 2.2, 0.75, 3.8))
my_first_list
```
`my_first_list` has five elements and when printed out like this looks quite
strange at first sight. Note how each of the elements of a list is referred to
by an index within 2 sets of square brackets. This gives a clue to how you can
access individual elements in the list.
```{r}
my_first_list[[2]]
```
The line of code in which we created this list is a little difficult to read and
might be better written split across several lines.
```{r}
my_first_list <- list(
  1:10,
  c("a", "b", "c"),
  c(TRUE, FALSE),
  100,
  c(1.3, 2.2, 0.75, 3.8)
)
```
The editor in RStudio will indent code to help this look clearer. The R
interpreter is fully able to cope with code split across multiple lines; it will
assume this is what you're doing if it doesn't think the current line of code is
complete.
Elements in lists are normally named, e.g.
```{r}
genomics_instruments <- list(
  sequencers = c("NovaSeq 6000", "HiSeq 4000", "NextSeq 500", "MiSeq"),
  liquid_handling_robots = c("Mosquito HV", "Bravo")
)
genomics_instruments
```
We can still access the elements using the double square brackets but now we
can use either the index (position) or the name.
```{r}
genomics_instruments[[1]]
genomics_instruments[["sequencers"]]
```
Even more conveniently we can use the **`$`** operator.
```{r}
genomics_instruments$sequencers
```
You can see what the names of elements in your list are using the **`names()`**
function.
```{r}
names(genomics_instruments)
```
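It is worth noting the difference between single and double square brackets
when working with lists. The sketch below uses a small made-up list,
`freezer_contents`, to show that `[]` returns a smaller list whereas `[[]]` and
`$` return the element itself.

```{r}
freezer_contents <- list(
  samples = c("S1", "S2", "S3"),
  temperature = -80
)
# single brackets give back a list containing the selected element(s)
typeof(freezer_contents["samples"])
# double brackets and $ give back the element itself, here a character vector
typeof(freezer_contents[["samples"]])
typeof(freezer_contents$samples)
# so further subsetting works on the extracted vector
freezer_contents[["samples"]][2]
```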
### Modifying lists
You can modify lists either by adding additional elements or modifying existing
ones.
```{r}
genomics_instruments$dna_rna_quality_control <- c("Bioanalyzer 2100", "Tapestation 4200")
genomics_instruments
```
```{r}
genomics_instruments$sequencers[3] <- "NextSeq 550 (upgraded)"
genomics_instruments
```
## Statistical test results
Lists can be thought of as a ragbag collection of things without a very clear
structure. You probably won't find yourself creating list objects of the kind
we've seen above when analysing your own data. However, the list provides the
basic underlying structure to the data frame that we'll be using throughout the
rest of this course.
The other area where you'll come across lists is as the return value for many
of the statistical tests and procedures such as linear regression that you can
carry out in R.
To demonstrate, we'll run a t-test comparing two sets of samples drawn from
subtly different normal distributions. We've already come across the `rnorm()`
function for creating random numbers based on a normal distribution.
```{r}
sample1 <- rnorm(n = 50, mean = 1.0, sd = 0.1)
sample2 <- rnorm(n = 50, mean = 1.1, sd = 0.1)
t.test(sample1, sample2)
```
The output from running the `t.test()` function doesn't much look like a list.
That's because it is a special type of list with some additional behaviours
including knowing how to print itself in a human-friendly way. But we can check
it is a list and use some of the list operations we've just looked at.
```{r}
result <- t.test(sample1, sample2)
is.list(result)
```
```{r}
names(result)
```
```{r}
result$p.value
```
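Other components of the result can be pulled out in the same way. The following
is a sketch using freshly simulated data; `set.seed()` is used so that the
random numbers are the same each time the code is run, and the seed value and
the variable names are arbitrary choices made for this example.

```{r}
set.seed(123)
control <- rnorm(n = 50, mean = 1.0, sd = 0.1)
treated <- rnorm(n = 50, mean = 1.1, sd = 0.1)
comparison <- t.test(control, treated)
comparison$statistic   # the t statistic
comparison$conf.int    # confidence interval for the difference in means
comparison$estimate    # the mean of each sample
# a typical use: compare the p-value with a chosen significance level
comparison$p.value < 0.05
```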
## Data frames
A much more useful data structure and the one we will mostly be using for the
rest of the course is the **data frame**. This is actually a special type of
list in which all the elements are vectors of the same length. The data frame is
how R represents tabular data like the Star Wars table.
There are a number of example data frames lurking in the background just waiting
for you to call on them. Many of the examples for functions given in the help
pages make use of these. Two such data frames that are often used in example
code snippets are **`iris`** and **`mtcars`**. See, for example, the help page
for the `unique()` function in which `iris` appears in the last code example
without any explanation of what the mysterious `iris` is and potentially causing
some confusion to the uninitiated.
To bring one of these internal data sets to the fore, you can just start using it
by name.
```{r eval = FALSE}
iris
```
```{r echo = FALSE}
head(iris)
```
Here we've only displayed the first few rows. If you type `iris` into the
console pane you'll notice that it prints the entire table with row numbers
that indicate that the data frame contains measurements for 150 irises.
You can also get help for a data set such as `iris` in the usual way.
```{r eval = FALSE}
?iris
```
This reveals that `iris` is a rather famous old data set of measurements
collected by the botanist Edgar Anderson and analysed by the esteemed British
statistician and geneticist, Ronald Fisher (he of Fisher's exact test fame).
### Creating a data frame
A data frame can be created in a similar way to how we created a list. The only
restriction is that each of the vectors should be named and all must have the
same length.
```{r}
beatles <- data.frame(
  name = c("John", "Paul", "Ringo", "George"),
  birth_year = c(1940, 1942, 1940, 1943),
  instrument = c("guitar", "bass", "drums", "guitar")
)
beatles
```
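Since a data frame is a list of equal-length vectors, the functions we used on
lists work here too. The sketch below builds a small made-up data frame,
`planets`, and inspects it; `str()` is a function we haven't met yet that gives
a compact overview of an object's structure.

```{r}
planets <- data.frame(
  name = c("Mercury", "Venus", "Earth", "Mars"),
  moons = c(0, 0, 1, 2),
  has_rings = c(FALSE, FALSE, FALSE, FALSE)
)
is.list(planets)    # a data frame is a special kind of list
names(planets)      # the column names
length(planets)     # the number of columns, i.e. elements in the list
str(planets)        # compact overview of each column and its type
```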
### Extracting values from a data frame
A data frame is a special type of list so you can access its elements in the
same way as we saw previously for lists.
```{r}
names(iris)
iris$Petal.Width # or equivalently iris[["Petal.Width"]] or iris[[4]]
```
```{block type = "rmdblock"}
**`\$` operator**
Use `\$` to extract an element from a list or a column from a data frame by name.
`iris$Species`
```
In that last example we extracted the `Petal.Width` column which itself is a
vector. We can further subset the values in that column to, say, return the
first 10 values only.
```{r}
iris$Petal.Width[1:10]
```
We can also select a subset of columns as follows:
```{r eval = FALSE}
iris[c("Petal.Width", "Petal.Length", "Species")] # or equivalently iris[c(4, 3, 5)]
```
```{r echo = FALSE}
head(iris[c("Petal.Width", "Petal.Length", "Species")])
```
Data frames have rows and columns both of which have names that can be used to
extract subsets of our tabular data. You can get those names using
**`rownames()`** and **`colnames()`**.
```{r}
colnames(iris) # this is essentially the same as names()
rownames(iris)
```
In this case the row names are just numbers but did you notice that these
row numbers are all displayed in quotation marks? They are in fact character
strings.
```{r}
typeof(rownames(iris))
```
If we take a look at the `mtcars` data frame we can see that the row names are
models of cars.
```{r}
rownames(mtcars)
```
We could look up the row for a particular car using the square bracket notation
but in a slightly different and odd-looking way.
```{r}
mtcars["Ferrari Dino", ]
```
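The **`,`** separates the row selection from the column selection: whatever
comes before it selects rows and whatever comes after it selects columns.
Leaving the part after the comma empty, as we did here, returns all columns. If
you omit the comma, R will think you're referring to columns and will complain
because it can't find a column named "Ferrari Dino" (give it a go and see for
yourself).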
Similarly we can extract multiple rows by providing a vector of car names:
```{r}
mtcars[c("Ferrari Dino", "Maserati Bora"), ]
```
This way of accessing the data frame makes more sense when we look at how
we can access subsets of rows and columns at the same time, for example
selecting the first three rows and the first five columns.
```{r}
mtcars[1:3, 1:5] # equivalent to mtcars[c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710"), c("mpg", "cyl", "disp", "hp", "drat")]
```
We can extract just a single element in our table.
```{r}
mtcars[4, 3]
```
We can also use conditional subsetting to extract the rows that meet certain
conditions, e.g. all the cars with automatic transmission (those with `am` value
of 0).
```{r}
mtcars[mtcars$am == 0, ]
```
Here we have used the equality operator, **`==`**, which is not to be mistaken
for the assignment operator, **`=`**, used to specify arguments to functions.
`mtcars$am == 0` returns a logical vector with `TRUE` values for each car that
has automatic transmission (`am` equal to 0). We then use this to subset rows
(note the comma after the logical condition).
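Conditions can be combined here just as they were for vectors. The following
sketch selects cars with automatic transmission that also manage more than 20
miles per gallon (an arbitrary threshold). It refers to `datasets::mtcars`
explicitly so that it works on the original data set regardless of any changes
made to `mtcars` elsewhere; `original_mtcars` and `efficient_automatics` are
names made up for this example.

```{r}
original_mtcars <- datasets::mtcars
# rows where both conditions are TRUE, showing just two of the columns
efficient_automatics <- original_mtcars[
  original_mtcars$am == 0 & original_mtcars$mpg > 20,
  c("mpg", "am")
]
efficient_automatics
```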
Other useful functions for data frames are **`dim()`**, **`nrow()`** and
**`ncol()`** that let you know about the dimensions of your table.
```{r}
dim(mtcars)
nrow(mtcars)
ncol(mtcars)
```
```{block type = "rmdblock"}
**Subsetting data frames**
Get the first element in the first column.
`iris[1, 1]`
Get the first element from the fifth column.
`iris[1, 5]`
Get the fourth column as a vector.
`iris[, 4]`
Get the fourth column as a data frame.
`iris[4]`
Get the first 10 elements from the fourth column.
`iris[1:10, 4]`
Get the third row as a data frame.
`iris[3, ]`
Get the first 6 rows (equivalent to `head(iris)`).
`iris[1:6, ]`
Get a column by name as a vector.
`iris$Petal.Length`
Get several columns by name as a data frame.
`iris[c("Petal.Length", "Petal.Width", "Species")]`
Get specific rows and columns.
`mtcars[c("Ferrari Dino", "Maserati Bora"), c("mpg", "cyl", "hp")]`
```
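The distinction between getting a column 'as a vector' and 'as a data frame'
can be checked with the `is.data.frame()` function. A short sketch using the
built-in `iris` data, with variable names made up for the example:

```{r}
width_vector <- iris[, 4]   # comma form: a plain numeric vector
width_table <- iris[4]      # list-style form: a one-column data frame
is.data.frame(width_vector)
is.data.frame(width_table)
length(width_vector)        # one value per row of iris
dim(width_table)            # number of rows, then number of columns
# the values themselves are the same either way
identical(width_vector, width_table$Petal.Width)
```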
Some functions work just as well (or even better) with data frames as they do
with vectors. Remember the **`summary()`** function from last week? Let's give
that a go on the `iris` data frame.
```{r}
summary(iris)
```
Wow, that's amazing! One simple command to compute all those useful summary
statistics for our entire data set.
The summary for numerical columns contains the minimum and maximum values, the
median and mean, and the first and third quartiles (the bounds of the
interquartile range). The `Species` column contains
categorical data (stored as a special `factor` type in R) and `summary()` shows
how many observations there are for each type of iris.
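For a single column, the same numbers can be obtained with individual
functions. The sketch below does this for `Sepal.Length`; `quantile()` and
`table()` are functions not otherwise covered in this session.

```{r}
sepal_length <- iris$Sepal.Length
min(sepal_length)
quantile(sepal_length, probs = c(0.25, 0.75))   # first and third quartiles
median(sepal_length)
mean(sepal_length)
max(sepal_length)
# for a categorical column, count the observations in each category
species_counts <- table(iris$Species)
species_counts
```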
### Modifying data frames
We can use the subsetting operations for assigning values in order to modify or
update a data frame in a very similar way to what we saw earlier for vectors.
We can change a single value, such as the number of cylinders of the Ferrari
Dino.
```{r}
mtcars["Ferrari Dino", "cyl"] <- 8
mtcars["Ferrari Dino", ]
```
We can change multiple values, for example:
```{r}
mtcars[c(1, 4, 5), "gear"] <- c(6, 5, 5)
head(mtcars)
```
We could set these multiple values to a single value.
```{r}
mtcars[c(1, 4, 5), "gear"] <- 6
head(mtcars)
```
We can also create new columns, just like we did to create new elements in a
list, although with the additional constraint that the new column must have the
same length as all the other columns.
In the following, we add a column for kilometres per litre by multiplying the
miles per gallon column (`mpg`) by the appropriate scaling factor.
```{r}
mtcars$kpl <- mtcars$mpg * 0.425144
mtcars[1:6, c("cyl", "mpg", "kpl")]
```
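The scaling factor can be worked out from the definitions of the units. The
sketch below assumes US gallons, which is what the help page for `mtcars` gives
for the `mpg` column; the variable names are made up for this example.

```{r}
km_per_mile <- 1.609344
litres_per_us_gallon <- 3.785411784
# convert miles per gallon to kilometres per litre
mpg_to_kpl <- km_per_mile / litres_per_us_gallon
mpg_to_kpl
# apply it to a fuel economy of 21 miles per gallon
21 * mpg_to_kpl
```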
### Viewing data frames
One last aside before moving on to the more user-friendly tidyverse version of
the data frame, the tibble. Earlier we truncated the data frame when printing it
out because it was really a bit too long to digest in one go. Although we hid
this from view, we used the **`head()`** function.
```{r}
head(iris)
```
You can specify how many rows to return from the 'head' (top) of the data frame
-- have a look at the help page to see how. Also, the help page lets you know
about the equivalent function, **`tail()`**, for returning the last few rows.
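As a brief sketch of both functions, each takes an argument `n` for the number
of rows to return (`first_rows` and `last_rows` are names made up for the
example):

```{r}
first_rows <- head(iris, n = 3)   # the first 3 rows
last_rows <- tail(iris, n = 3)    # the last 3 rows
first_rows
last_rows
nrow(first_rows)
```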
Another way of inspecting the contents of a data frame in RStudio is to bring up
a spreadsheet-style data viewer using the `View()` function.