```{r, echo=FALSE}
library(knitr)
hook_output <- knit_hooks$get("output")
knit_hooks$set(output = function(x, options) {
  lines <- options$output.lines
  if (is.null(lines)) {
    return(hook_output(x, options)) # pass to default hook
  }
  x <- unlist(strsplit(x, "\n"))
  more <- "etc ..."
  if (length(lines) == 1) { # first n lines
    if (length(x) > lines) { # truncate the output, but add ....
      x <- c(head(x, lines), more)
    }
  } else {
    x <- c(more, x[lines], more)
  }
  # paste these lines together
  x <- paste(c(x, ""), collapse = "\n")
  hook_output(x, options)
})
```
```{r, include=FALSE}
suppressPackageStartupMessages({
  library(tidyverse)
})
```
# Data Structures {#DataStructures}
Introduction {-#intro-DataStructures}
------------
You can get pretty far in R just using vectors. That’s what
Chapter \@ref(SomeBasics) is all about. This chapter moves beyond vectors to
recipes for matrices, lists, factors, data frames, and tibbles (which are a special kind of data frame). If you have
preconceptions about data structures, we suggest you put them aside. R
does data structures differently than many other languages.
If you want to study the technical aspects of R’s data structures, we
suggest reading [*R in a
Nutshell*](http://oreilly.com/catalog/9780596801717) (O’Reilly) and the R
Language Definition. The notes here are more informal. These are things we
wish we’d known when we started using R.
### Vectors {-}
Here are some key properties of vectors:
Vectors are homogeneous
: All elements of a vector must have the same type or, in R
terminology, the same *mode*.
Vectors can be indexed by position
: So `v[2]` refers to the second element of `v`.
Vectors can be indexed by multiple positions, returning a subvector
: So `v[c(2,3)]` is a subvector of `v` that consists of the second and
third elements.
Vector elements can have names
: Vectors have a `names` property, the same length as the vector
itself, that gives names to the elements:
``` {r}
v <- c(10, 20, 30)
names(v) <- c("Moe", "Larry", "Curly")
print(v)
```
If vector elements have names, then you can select them by name
: Continuing the previous example:
``` {r}
v[["Larry"]]
```
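Homogeneity has a practical consequence worth seeing once: when you combine values of different types, R quietly coerces them to a common mode. Here is a minimal sketch; the variable names `mixed` and `flags` are ours, chosen only for illustration:

```r
# Mixing numbers and strings: every element becomes a character string
mixed <- c(1, "two", 3)
mode(mixed)     # "character"

# Mixing logicals and numbers: the logicals become numbers
flags <- c(TRUE, FALSE, 2)
mode(flags)     # "numeric"
flags           # 1 0 2
```

If a vector's mode is not what you expected, a stray element of another type is a likely cause.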
### Lists {-}
Lists are heterogeneous
: Lists can contain elements of different types; in R terminology,
list elements may have different modes. Lists can even contain other
structured objects, such as lists and data frames; this allows you
to create recursive data structures.
Lists can be indexed by position
: So `lst[[2]]` refers to the second element of `lst`. Note the double
square brackets. Double brackets mean that R returns the element itself, as whatever type it is.
Lists let you extract sublists
: So `lst[c(2,3)]` is a sublist of `lst` that consists of the second
and third elements. Note the single square brackets. Single brackets mean that R returns a list. If you pull a single element with single brackets, as in `lst[2]`, R returns a list of length 1 whose only item is the desired element.
List elements can have names
: Both `lst[["Moe"]]` and `lst$Moe` refer to the element named “Moe.”
Since lists are heterogeneous and since their elements can be retrieved
by name, a list is like a dictionary or hash or lookup table in other
programming languages (discussed in Recipe \@ref(recipe-id158), ["Building a Name/Value Association List"](#recipe-id158)).
What’s surprising (and cool) is that in R, unlike most of those other
programming languages, lists can also be indexed by position.
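The list properties described here can be tried out in a few lines. This is only a sketch; the list contents are made up for illustration:

```r
lst <- list(Moe = 10, Larry = "twenty", Curly = c(3, 0))

second  <- lst[[2]]       # the element itself: the string "twenty"
wrapped <- lst[2]         # a list of length 1 that contains "twenty"
sub     <- lst[c(2, 3)]   # a sublist holding the second and third elements

lst$Moe                   # selection by name: 10
lst[["Moe"]]              # the same element, different syntax
```

The same list supports both lookup by name and lookup by position.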
### Mode: Physical Type {-}
In R, every object has a mode, which indicates how it is stored in
memory: as a number, as a character string, as a list of pointers to
other objects, as a function, and so forth (see Table \@ref(tab:phystype)).
--------------------------------------------------------------------------------------
Object                        Example                                      Mode
----------------------------  -------------------------------------------  -----------
Number                        `3.1415`                                     numeric

Vector of numbers             `c(2.7182, 3.1415)`                          numeric

Character string              `"Moe"`                                      character

Vector of character strings   `c("Moe", "Larry", "Curly")`                 character

Factor                        `factor(c("NY", "CA", "IL"))`                numeric

List                          `list("Moe", "Larry", "Curly")`              list

Data frame                    `data.frame(x=1:3, y=c("NY", "CA", "IL"))`   list

Function                      `print`                                      function
--------------------------------------------------------------------------------------

: (\#tab:phystype) R object-mode mapping
The `mode` function gives us this information:
``` {r}
mode(3.1415) # Mode of a number
mode(c(2.7182, 3.1415)) # Mode of a vector of numbers
mode("Moe") # Mode of a character string
mode(list("Moe", "Larry", "Curly")) # Mode of a list
```
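The less obvious rows of the table can be checked the same way. In this short sketch the variable names are ours, and the objects are the ones from the table:

```r
f_mode  <- mode(factor(c("NY", "CA", "IL")))                    # "numeric"
df_mode <- mode(data.frame(x = 1:3, y = c("NY", "CA", "IL")))   # "list"
fn_mode <- mode(print)                                          # "function"
```

A factor reports "numeric" because it is stored as integer codes, and a data frame reports "list" because it is stored as a list of columns.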
A critical difference between a vector and a list can be summed up this
way:
- In a vector, all elements must have the same mode.
- In a list, the elements can have different modes.
### Class: Abstract Type {-}
In R, every object also has a class, which defines its abstract type.
The terminology is borrowed from object-oriented programming. A single
number could represent many different things: a distance, a point in
time, or a weight, for example. All those objects have a mode of “numeric” because they
are stored as a number, but they could have different classes to
indicate their interpretation.
For example, a `Date` object consists of a single number:
``` {r}
d <- as.Date("2010-03-15")
mode(d)
length(d)
```
But it has a class of `Date`, telling us how to interpret that number—namely, as the number of days since January 1, 1970:
``` {r}
class(d)
```
R uses an object’s class to decide how to process the object. For
example, the generic function `print` has specialized versions (called
*methods*) for printing objects according to their class: `data.frame`,
`Date`, `lm`, and so forth. When you print an object, R calls the
appropriate `print` function according to the object’s class.
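To see the split between mode and class, strip the class from a `Date` and look at what is left. This is a minimal sketch using the same date as above:

```r
d <- as.Date("2010-03-15")

days <- unclass(d)     # remove the class attribute
days                   # 14683: the days since January 1, 1970
class(days)            # "numeric": without its class, it is just a number

inherits(d, "Date")    # TRUE: the check R relies on when it picks a method
```

Printing `d` shows a formatted date, while printing `days` shows a bare number, because `print` dispatches on the class.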
### Scalars {-}
The quirky thing about scalars is their relationship to vectors. In some
software, scalars and vectors are two different things. In R, they are
the same thing: a scalar is simply a vector that contains exactly one
element. In this book we often use the term *scalar*, but that’s just
shorthand for “vector with one element.”
Consider the built-in constant `pi`. It is a scalar:
```{r}
pi
```
Since a scalar is a one-element vector, you can use vector functions on
`pi`:
```{r}
length(pi)
```
You can index it. The first (and only) element is $\pi$ of course:
```{r}
pi[1]
```
If you ask for the second element, there is none:
``` {r}
pi[2]
```
### Matrices {-}
In R, a matrix is just a vector that has dimensions. It may seem strange
at first, but you can transform a vector into a matrix simply by giving
it dimensions.
A vector has an attribute called `dim`, which is initially `NULL`, as
shown here:
``` {r}
A <- 1:6
dim(A)
print(A)
```
We give dimensions to the vector when we set its `dim` attribute. Watch
what happens when we set our vector dimensions to 2 × 3 and print it:
``` {r}
dim(A) <- c(2, 3)
print(A)
```
Voilà! The vector was reshaped into a 2 × 3 matrix.
A matrix can be created from a list, too. Like a vector, a list has a
`dim` attribute, which is initially `NULL`:
``` {r}
B <- list(1, 2, 3, 4, 5, 6)
dim(B)
```
If we set the `dim` attribute, it gives the list a shape:
``` {r}
dim(B) <- c(2, 3)
print(B)
```
Voilà! We have turned this list into a 2 × 3 matrix.
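In everyday code you would probably reach for the `matrix` function rather than set `dim` by hand. As far as we can tell, the two approaches give the same object, because `matrix` also fills column by column by default. A sketch for comparison:

```r
A <- 1:6
dim(A) <- c(2, 3)                      # reshape by assigning dimensions

M <- matrix(1:6, nrow = 2, ncol = 3)   # build the matrix directly

identical(A, M)                        # TRUE
attributes(A)                          # only one attribute: dim
```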
### Arrays {-}
The discussion of matrices can be generalized to three-dimensional or even
*n*-dimensional structures: just assign more dimensions to the
underlying vector (or list). The following example creates a
three-dimensional array with dimensions 2 × 3 × 2:
``` {r}
D <- 1:12
dim(D) <- c(2, 3, 2)
print(D)
```
Note that R prints one “slice” of the structure at a time, since it’s
not possible to print a three-dimensional structure on a two-dimensional
medium.
It strikes us as very odd that we can turn a list into a matrix just by
giving the list a `dim` attribute. But wait: it gets stranger.
Recall that a list can be heterogeneous (mixed modes). We can start with
a heterogeneous list, give it dimensions, and thus create a
heterogeneous matrix. This code snippet creates a matrix that is a mix
of numeric and character data:
``` {r}
C <- list(1, 2, 3, "X", "Y", "Z")
dim(C) <- c(2, 3)
print(C)
```
To us, this is strange because we ordinarily assume a matrix is purely
numeric, not mixed. R is not that restrictive.
The possibility of a heterogeneous matrix may seem powerful and
strangely fascinating. However, it creates problems when you are doing
normal, day-to-day stuff with matrices. For example, what happens when
the matrix `C` (from the previous example) is used in matrix multiplication? What happens if
it is converted to a data frame? The answer is that odd things happen.
In this book, we generally ignore the pathological case of a
heterogeneous matrix. We assume you’ve got simple, vanilla matrices. Some
recipes involving matrices may work oddly (or not at all) if your matrix
contains mixed data. Converting such a matrix to a vector or data frame,
for instance, can be problematic
(see Recipe \@ref(recipe-id074), ["Converting One Structured Data Type into Another"](#recipe-id074).)
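If you suspect you have one of these pathological matrices, a few checks will confirm it. This sketch reuses the matrix `C` from the earlier example; the `tryCatch` wrapper is there only so that the failed multiplication does not halt the script:

```r
C <- list(1, 2, 3, "X", "Y", "Z")
dim(C) <- c(2, 3)

is.matrix(C)     # TRUE: it has two dimensions
is.list(C)       # TRUE: underneath, it is still a list
is.numeric(C)    # FALSE: so numeric operations will not work

# Matrix multiplication needs numeric (or complex) data, so this fails
result <- tryCatch(C %*% t(C), error = function(e) "error")
result           # "error"
```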
### Factors {-}
A factor looks like a character vector, but it has special properties. R keeps
track of the unique values in a vector, and each unique value is called
a *level* of the associated factor. R uses a compact representation for
factors, which makes them efficient for storage in data frames. In other
programming languages, a factor would be represented by a vector of
enumerated values.
There are two key uses for factors:
Categorical variables
: A factor can represent a categorical variable. Categorical variables
are used in contingency tables, linear regression, analysis of
variance (ANOVA), logistic regression, and many other areas.
Grouping
: This is a technique for labeling or tagging your data items
according to their group. See Chapter \@ref(DataTransformations).
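The compact representation is easy to inspect: a factor stores its levels once, plus one integer code per element. A small sketch, with state abbreviations chosen for illustration:

```r
f <- factor(c("NY", "CA", "IL", "NY"))

lv    <- levels(f)       # "CA" "IL" "NY": the unique values, sorted
codes <- as.integer(f)   # 3 1 2 3: each element's position in the levels
```

Each code is an index into the levels, which is why a factor reports a mode of "numeric".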
### Data Frames {-}
A data frame is a powerful and flexible structure. Most serious R
applications involve data frames. A data frame is intended to mimic a
dataset, such as one you might encounter in SAS or SPSS, or a table in an SQL database.
A data frame is a tabular (rectangular) data structure, which means that
it has rows and columns. It is not implemented by a matrix, however.
Rather, a data frame is a list with the following characteristics:
- The elements of the list are vectors and/or factors.[^dataframe]
- Those vectors and factors are the columns of the data frame.
- The vectors and factors must all have the same length; in other
words, all columns must have the same height.
- The equal-height columns give a rectangular shape to the data frame.
- The columns must have names.
Because a data frame is both a list and a rectangular structure, R
provides two different paradigms for accessing its contents:
- You can use list operators to extract columns from a data frame,
such as `df[i]`, `df[[i]]`, or `df$name`.
- You can use matrix-like notation, such as `df[i,j]`, `df[i,]`,
or `df[,j]`.
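Both paradigms can be tried side by side. This is a sketch only; the data frame `dfrm` and its contents are made up for illustration:

```r
dfrm <- data.frame(x = 1:3, y = c("NY", "CA", "IL"))

# List-style access
dfrm[1]       # a one-column data frame
dfrm[[1]]     # the column itself, as a vector
dfrm$y        # the column named y

# Matrix-style access
dfrm[2, 1]    # row 2, column 1: the value 2
dfrm[2, ]     # row 2, as a one-row data frame
dfrm[, 2]     # column 2, as a vector
```

Note which forms return a data frame and which return a plain vector.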
Your perception of a data frame likely depends on your background:
To a statistician
: A data frame is a table of observations. Each row contains
one observation. Each observation must contain the same variables.
These variables are called columns, and you can refer to them
by name. You can also refer to the contents by row number and column
number, just as with a matrix.
To a SQL programmer
: A data frame is a table. The table resides entirely in memory, but
you can save it to a flat file and restore it later. You needn’t
declare the column types because R figures that out for you.
To an Excel user
: A data frame is like a worksheet, or perhaps a range within
a worksheet. It is more restrictive, however, in that each column
has a type.
To a SAS user
: A data frame is like a SAS dataset for which all the data resides
in memory. R can read and write the data frame to disk, but the data
frame must be in memory while R is processing it.
To an R programmer
: A data frame is a hybrid data structure, part matrix and part list.
A column can contain numbers, character strings, or factors, but not
a mix of them. You can index the data frame just like you index
a matrix. The data frame is also a list, where the list elements are
the columns, so you can access columns by using list operators.
To a computer scientist
: A data frame is a rectangular data structure. The columns are
typed, and each column must be numeric values, character
strings, or a factor. Columns must have labels; rows may
have labels. The table can be indexed by position, column name,
and/or row name. It can also be accessed by list operators, in which
case R treats the data frame as a list whose elements are the
columns of the data frame.
To a corporate executive
: You can put names and numbers into a data frame.
A data frame is like a little database.
Your staff will enjoy using data frames.
### Tibbles {-}
A *tibble* is a modern reimagining of the data frame, introduced by Hadley Wickham in the `tibble` package, which is a core package in the tidyverse.
Most of the common functions you would use with data frames also work with tibbles.
However, tibbles typically do less than data frames and complain more.
This idea of complaining and doing less may remind you of your least favorite coworker; however, we think tibbles will be one of your favorite data structures.
Doing less and complaining more can be a feature, not a bug.
Unlike data frames, tibbles:
* Do not give you row numbers by default.
* Do not give you strange, unexpected column names.
* Don't coerce your data into factors (unless you explicitly ask for that).
* Recycle vectors of length 1 but not other lengths.
In addition to basic data frame functionality, tibbles:
* Print only the first 10 rows of a long table, plus a bit of metadata, by default.
* Always return a tibble when subsetting.
* Never do partial matching: if you want a column from a tibble, you have to ask for it using its full name.
* Complain more by giving you more warnings and chatty messages to make sure you understand what the software is doing.
All these extras are designed to give you fewer surprises and help you make fewer mistakes.
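Two of these differences are easy to demonstrate. The sketch below assumes the `tibble` package is installed; the objects `dfrm` and `tbl` and their contents are made up for illustration:

```r
library(tibble)

dfrm <- data.frame(name = c("Moe", "Larry"), score = c(10, 20))
tbl  <- tibble(name = c("Moe", "Larry"), score = c(10, 20))

# Subsetting one column: the data frame drops to a vector, the tibble does not
df_col  <- dfrm[, "score"]
tbl_col <- tbl[, "score"]
class(df_col)          # "numeric"
is_tibble(tbl_col)     # TRUE

# Partial matching: the data frame guesses, the tibble refuses with a warning
df_partial  <- dfrm$sc                     # 10 20
tbl_partial <- suppressWarnings(tbl$sc)    # NULL
```

We wrapped the last line in `suppressWarnings` only to keep the example quiet; in real work the warning is the useful part.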
Appending Data to a Vector {#recipe-id048}
--------------------------
### Problem {-#problem-id048}
You want to append additional data items to a vector.
### Solution {-#solution-id048}
Use the vector constructor (`c`) to construct a vector with the
additional data items:
``` {r}
v <- c(1, 2, 3)
newItems <- c(6, 7, 8)
c(v, newItems)
```
For a single item, you can also assign the new item to the next vector element.
R will automatically extend the vector:
``` {r}
v <- c(1, 2, 3)
v[length(v) + 1] <- 42
v
```
### Discussion {-#discussion-id048}
If you ask us about appending a data item to a vector, we will likely suggest
that maybe you shouldn’t.
> **Warning**
>
> R works best when you think about entire vectors, not single data
> items. Are you repeatedly appending items to a vector? If so, then you
> are probably working inside a loop. That’s OK for small vectors, but
> for large vectors your program will run slowly. The memory management
> in R works poorly when you repeatedly extend a vector by one element.
> Try to replace that loop with vector-level operations. You’ll write
> less code, and R will run much faster.
Nonetheless, one does occasionally need to append data to vectors. Our
experiments show that the most efficient way of doing so is to create a new vector
using the vector constructor (`c`) to join the old and new data. This
works for appending single elements or multiple elements:
``` {r}
v <- c(1, 2, 3)
v <- c(v, 4) # Append a single value to v
v
w <- c(5, 6, 7, 8)
v <- c(v, w) # Append an entire vector to v
v
```
You can also append an item by assigning it to the position past the end
of the vector, as shown in the Solution. In fact, R is very liberal
about extending vectors. You can assign to any element and R will expand
the vector to accommodate your request:
``` {r}
v <- c(1, 2, 3) # Create a vector of three elements
v[10] <- 10 # Assign to the 10th element
v # R extends the vector automatically
```
Note that R did not complain about the out-of-bounds subscript. It just
extended the vector to the needed length, filling it with `NA`.
R includes an `append` function that creates a new vector by appending
items to an existing vector. However, our experiments show that this
function runs more slowly than both the vector constructor and the
element assignment.
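To make the warning concrete, here are three ways to build the same vector of squares. This is a sketch with names of our own choosing; preallocating with `numeric(n)` is a common alternative that we add here as an assumption, since it is not one of the approaches timed above:

```r
n <- 1000

# Element at a time: the vector grows on every iteration
squares_loop <- c()
for (i in 1:n) {
  squares_loop <- c(squares_loop, i^2)
}

# Preallocate once, then fill in place
squares_prealloc <- numeric(n)
for (i in 1:n) {
  squares_prealloc[i] <- i^2
}

# Vector-level operation: no loop at all
squares_vec <- (1:n)^2

all.equal(squares_loop, squares_vec)       # TRUE
all.equal(squares_prealloc, squares_vec)   # TRUE
```

All three give the same answer; the last is the shortest and the one we would try first.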
Inserting Data into a Vector {#recipe-id049}
----------------------------
### Problem {-#problem-id049}
You want to insert one or more data items into a vector.
### Solution {-#solution-id049}
Despite its name, the `append` function inserts data into a vector by
using the `after` parameter, which gives the insertion point for the new
item or items:
`append(`*vec*`, `*newvalues*`, after = `*n*`)`
### Discussion {-#discussion-id049}
The new items will be inserted at the position given by `after`. This
example inserts 99 into the middle of a sequence:
``` {r}
append(1:10, 99, after = 5)
```
The special value of `after=0` means insert the new items at the *head* of
the vector:
``` {r}
append(1:10, 99, after = 0)
```
The comments in Recipe \@ref(recipe-id048), ["Appending Data to a Vector"](#recipe-id048) apply here,
too. If you are inserting single items into a vector, you might be
working at the element level when working at the vector level would be
easier to code and faster to run.
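The `append` function accepts several new values at once, and when `after` equals the vector's length it behaves like a plain append. A brief sketch with illustrative values:

```r
inserted <- append(1:5, c(98, 99), after = 2)
inserted     # 1 2 98 99 3 4 5

v <- 1:5
appended <- append(v, 99, after = length(v))
appended     # 1 2 3 4 5 99
```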
Understanding the Recycling Rule {#recipe-id050}
--------------------------------
### Problem {-#problem-id050}
You want to understand the mysterious Recycling Rule that governs how R
handles vectors of unequal length.
### Discussion {-#discussion-id050}
When you do vector arithmetic, R performs element-by-element operations.
That works well when both vectors have the same length: R pairs the
elements of the vectors and applies the operation to those pairs.
But what happens when the vectors have unequal lengths?
In that case, R invokes the Recycling Rule. It processes the vector
elements in pairs, starting at the first elements of both vectors. At a
certain point, the shorter vector is exhausted while the longer vector
still has unprocessed elements. R returns to the beginning of the
shorter vector, “recycling” its elements; continues taking elements from
the longer vector; and completes the operation. It will recycle the
shorter-vector elements as often as necessary until the operation is
complete.
It’s useful to visualize the Recycling Rule. Here is a diagram of two
vectors, 1:6 and 1:3:
```{}
 1:6    1:3
-----  -----
  1      1
  2      2
  3      3
  4
  5
  6
```
Obviously, the 1:6 vector is longer than the 1:3 vector. If we try to
add the vectors using (1:6) + (1:3), it appears that 1:3 has too few
elements. However, R recycles the elements of 1:3, pairing the two
vectors like this and producing a six-element vector:
```{}
 1:6    1:3    (1:6) + (1:3)
-----  -----  ---------------
  1      1          2
  2      2          4
  3      3          6
  4      1          5
  5      2          7
  6      3          9
```
Here is what you see in the R console:
``` {r}
(1:6) + (1:3)
```
It’s not only vector operations that invoke the Recycling Rule;
functions can, too. The `cbind` function can create column vectors, such
as the following column vectors of 1:6 and 1:3. The two columns have
different heights, of course:
```{r}
cbind(1:6)
cbind(1:3)
```
If we try binding these column vectors together into a two-column
matrix, the lengths are mismatched. The 1:3 vector is too short, so
`cbind` invokes the Recycling Rule and recycles the elements of 1:3:
``` {r}
cbind(1:6, 1:3)
```
If the longer vector’s length is not a multiple of the shorter vector’s
length, R gives a warning. That’s good, since the operation is highly
suspect and there is likely a bug in your logic:
``` {r, warning=TRUE}
(1:6) + (1:5) # Oops! 1:5 is one element too short
```
Once you understand the Recycling Rule, you will realize that operations
between a vector and a scalar are simply applications of that rule. In
this example, the 10 is recycled repeatedly until the vector addition is
complete:
``` {r}
(1:6) + 10
```
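One way to convince yourself of what R is doing is to recycle the shorter vector by hand and compare. This sketch uses `rep_len` to stretch 1:3 to six elements; the variable names are ours:

```r
recycled <- rep_len(1:3, length.out = 6)
recycled                              # 1 2 3 1 2 3

implicit <- (1:6) + (1:3)             # R recycles for us
explicit <- (1:6) + recycled          # we recycle by hand

identical(implicit, explicit)         # TRUE
```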
Creating a Factor (Categorical Variable) {#recipe-id051}
----------------------------------------
### Problem {-#problem-id051}
You have a vector of character strings or integers. You want R to treat
them as a factor, which is R’s term for a categorical variable.
### Solution {-#solution-id051}
The `factor` function encodes your vector of discrete values into a factor:
```
f <- factor(v) # v can be a vector of strings or integers
```
If your vector contains only a subset of possible values and not the
entire universe, then include a second argument that gives the possible
levels of the factor:
```
f <- factor(v, levels)
```
### Discussion {-#discussion-id051}
In R, each possible value of a categorical variable is called a *level*.
A vector of levels is called a *factor*.
Factors fit very cleanly into the vector orientation of R,
and they are used in powerful ways for processing data and building statistical models.
Most of the time, converting your categorical data into a factor is a
simple matter of calling the `factor` function, which identifies the
distinct levels of the categorical data and packs them into a factor:
``` {r}
f <- factor(c("Win", "Win", "Lose", "Tie", "Win", "Lose"))
f
```
Notice that when we printed the factor, `f`, R did not put quotes around
the values. They are levels, not strings. Also notice that when we
printed the factor, R also displayed the distinct levels below the
factor.
If your vector contains only a subset of all the possible levels, then R
will have an incomplete picture of the possible levels. Suppose you have
a string-valued variable `wday` that gives the day of the week on which
your data was observed:
``` {r}
wday <- c("Wed", "Thu", "Mon", "Wed", "Thu",
"Thu", "Thu", "Tue", "Thu", "Tue")
f <- factor(wday)
f
```
R thinks that Monday, Thursday, Tuesday, and Wednesday are the only
possible levels. Friday is not listed. Apparently, the lab staff never
made observations on Friday, so R does not know that Friday is a
possible value. Hence, you need to list the possible levels of `wday`
explicitly:
``` {r}
f <- factor(wday, levels=c("Mon", "Tue", "Wed", "Thu", "Fri"))
f
```
Now R understands that `f` is a factor with five possible levels.
It knows their correct order, too.
It originally put Thursday before Tuesday because it assumes alphabetical order by default.
The explicit `levels` argument defines the correct order.
In many situations it is not necessary to call `factor` explicitly.
When an R function requires a factor, it usually converts your data to a
factor automatically.
The `table` function, for instance, works only on factors,
so it routinely converts its inputs to factors without asking.
You must explicitly create a factor variable when you want to specify
the full set of levels or when you want to control the ordering of levels.
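The payoff of listing all the levels shows up as soon as you tabulate. In this sketch, which reuses the `wday` data from above, the implicit factor omits Friday entirely, whereas the explicit one reports it with a count of zero:

```r
wday <- c("Wed", "Thu", "Mon", "Wed", "Thu",
          "Thu", "Thu", "Tue", "Thu", "Tue")

implicit_counts <- table(factor(wday))    # four levels, alphabetical order
implicit_counts

f <- factor(wday, levels = c("Mon", "Tue", "Wed", "Thu", "Fri"))
counts <- table(f)                        # five levels, in weekday order
counts
counts["Fri"]                             # 0: a known level, never observed
```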
### See Also {-#see_also-id051}
See Recipe \@ref(recipe-id137), ["Binning Your Data"](#recipe-id137), to create a factor from continuous data.
Combining Multiple Vectors into One Vector and a Factor {#recipe-id227}
-------------------------------------------------------
### Problem {-#problem-id227}
You have several groups of data, with one vector for each group. You
want to combine the vectors into one large vector and simultaneously
create a parallel factor that identifies each value’s original group.
### Solution {-#solution-id227}
Create a list that contains the vectors. Use the `stack` function to
combine the list into a two-column data frame:
```
comb <- stack(list(v1 = v1, v2 = v2, v3 = v3)) # Combine 3 vectors
```
The data frame’s columns are called `values` and `ind`. The first column
contains the data, and the second column contains the parallel factor.
### Discussion {-#discussion-id227}
Why in the world would you want to mash all your data into one big
vector and a parallel factor? The reason is that many important
statistical functions require the data in that format.
Suppose you survey freshmen, sophomores, and juniors regarding their
confidence level (“What percentage of the time do you feel confident in
school?”). Now you have three vectors, called `freshmen`, `sophomores`,
and `juniors`. You want to perform an ANOVA of the differences
between the groups. The ANOVA function, `aov`, requires one vector with
the survey results as well as a parallel factor that identifies the
group. You can combine the groups using the `stack` function:
``` {r}
freshmen <- c(1, 2, 1, 1, 5)
sophomores <- c(3, 2, 3, 3, 5)
juniors <- c(5, 3, 4, 3, 3)
comb <- stack(list(fresh = freshmen, soph = sophomores, jrs = juniors))
print(comb)
```
Now you can perform the ANOVA on the two columns:
```
aov(values ~ ind, data = comb)
```
When building the list we must provide tags for the list elements.
(The tags are `fresh`, `soph`, and `jrs` in this example.)
Those tags are required because `stack` uses them as the levels of the parallel factor.
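Once the data is stacked, any function that expects a values-plus-groups layout can use it. As a quick sketch beyond `aov`, here are the group means computed with `tapply`, using the survey data from above:

```r
freshmen   <- c(1, 2, 1, 1, 5)
sophomores <- c(3, 2, 3, 3, 5)
juniors    <- c(5, 3, 4, 3, 3)
comb <- stack(list(fresh = freshmen, soph = sophomores, jrs = juniors))

nlevels(comb$ind)                                    # 3: one level per tag
group_means <- tapply(comb$values, comb$ind, mean)   # mean of each group
group_means
```

The means come out as 2 for `fresh`, 3.2 for `soph`, and 3.6 for `jrs`.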
Creating a List {#recipe-id053}
-------------------------------
### Problem {-#problem-id053}
You want to create and populate a list.
### Solution {-#solution-id053}
To create a list from individual data items, use the `list` function:
```
lst <- list(x, y, z)
```
### Discussion {-#discussion-id053}
Lists can be quite simple, such as this list of three numbers:
``` {r}
lst <- list(0.5, 0.841, 0.977)
lst
```
When R prints the list, it identifies each list element by its position
(`[[1]]`, `[[2]]`, `[[3]]`) and prints the element’s value (e.g.,
`[1] 0.5`) under its position.
More usefully, lists can, unlike vectors, contain elements of different
modes (types). Here is an extreme example of a mongrel created from a
scalar, a character string, a vector, and a function:
``` {r}
lst <- list(3.14, "Moe", c(1, 1, 2, 3), mean)
lst
```
You can also build a list by creating an empty list and populating it.
Here is our “mongrel” example built in that way:
``` {r}
lst <- list()
lst[[1]] <- 3.14
lst[[2]] <- "Moe"
lst[[3]] <- c(1, 1, 2, 3)
lst[[4]] <- mean
lst
```
List elements can be named. The `list` function lets you supply a name
for every element:
``` {r}
lst <- list(mid = 0.5, right = 0.841, far.right = 0.977)
lst
```
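Names can be inspected with `names`, and assigning to a name that does not exist yet adds a new element. A small sketch; the element `far.left` and its value are hypothetical, added only for illustration:

```r
lst <- list(mid = 0.5, right = 0.841, far.right = 0.977)
names(lst)                    # "mid" "right" "far.right"

lst[["far.left"]] <- 0.023    # add a new named element
length(lst)                   # 4
```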
### See Also {-#see_also-id053}
See the [Introduction](#intro-DataStructures) to this chapter for more
about lists; see Recipe \@ref(recipe-id158), ["Building a Name/Value Association List"](#recipe-id158),
for more about building and using lists with named elements.
Selecting List Elements by Position {#recipe-id054}
---------------------------------------------------
### Problem {-#problem-id054}
You want to access list elements by position.
### Solution {-#solution-id054}
Use one of these ways. Here, `lst` is a list variable:
`lst[[`_n_`]]`
: Selects the *n*th element from the list.
`lst[c(`_n_~1~`, `_n_~2~`, ..., `_n_~k~`)]`
: Returns a list of elements, selected by their positions.
Note that the first form returns a single element and the second form returns
a list.
### Discussion {-#discussion-id054}
Suppose we have a list of four integers, called `years`:
``` {r}
years <- list(1960, 1964, 1976, 1994)
years
```
We can access single elements using the double-square-bracket syntax:
``` {r}
years[[1]]
```
We can extract sublists using the single-square-bracket syntax:
``` {r}
years[c(1, 2)]
```
This syntax can be confusing because of a subtlety:
there is an important difference between `lst[[`*n*`]]` and `lst[`*n*`]`.
They are not the same thing:
`lst[[`_n_`]]`
: This is an element, not a list. It is the *n*th element of `lst`.
`lst[`_n_`]`
: This is a list, not an element. The list contains one element, taken
from the *n*th element of `lst`.
(The second form is a special case of
`lst[c(`_n_~1~`, `_n_~2~`, ..., `_n_~k~`)]` in which we eliminated the `c(...)` construct
because there is only one *n*.)
The difference becomes apparent when we inspect the structure of the
result — one is a number and the other is a list:
``` {r}
class(years[[1]])
class(years[1])
```
The difference becomes annoyingly apparent when we `cat` the value.
Recall that `cat` can print atomic values or vectors but complains about
printing structured objects:
``` {r, error=TRUE}
cat(years[[1]], "\n")
cat(years[1], "\n")
```
We got lucky here because R alerted us to the problem. In other
contexts, you might work long and hard to figure out that you accessed a
sublist when you wanted an element, or vice versa.
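Arithmetic is another place where the difference bites. In this sketch the element supports addition, but the one-element sublist does not; the `tryCatch` wrapper is ours, added so that the error does not stop the script:

```r
years <- list(1960, 1964, 1976, 1994)

next_term <- years[[1]] + 4     # works: the element is a number
next_term                       # 1964

# Fails: a list, even a one-element list, is not a number
outcome <- tryCatch(years[1] + 4, error = function(e) "error")
outcome                         # "error"
```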
Selecting List Elements by Name {#recipe-id253}
-------------------------------
### Problem {-#problem-id253}
You want to access list elements by their names.
### Solution {-#solution-id253}
Use one of these forms. Here, `lst` is a list variable:
`lst[["`*name*`"]]`
: Selects the element called *name*.
Returns `NULL` if no element has that name.
`lst$`*name*
: Same as previous, just different syntax.
`lst[c(`*name*~1~`, `*name*~2~`, ..., `*name*~k~`)]`
: Returns a list built from the indicated elements of `lst`.
Note that the first two forms return an element, whereas the third form
returns a list.
### Discussion {-#discussion-id253}
Each element of a list can have a name. If named, the element can be
selected by its name. This assignment creates a list of four named
integers:
``` {r}
years <- list(Kennedy = 1960, Johnson = 1964,
Carter = 1976, Clinton = 1994)
```
These next two expressions return the same value—namely, the element
that is named “Kennedy”:
``` {r}
years[["Kennedy"]]
years$Kennedy
```
The following two expressions return sublists extracted from `years`:
``` {r}
years[c("Kennedy", "Johnson")]
years["Carter"]
```
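As the Solution notes, asking for a name that is not in the list returns `NULL` rather than raising an error. A short sketch; "Nixon" is simply a name that is absent from `years`:

```r
years <- list(Kennedy = 1960, Johnson = 1964,
              Carter = 1976, Clinton = 1994)

missing_bracket <- years[["Nixon"]]
missing_dollar  <- years$Nixon

is.null(missing_bracket)        # TRUE
is.null(missing_dollar)         # TRUE
"Nixon" %in% names(years)       # FALSE: check first if a silent NULL would hurt
```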
Just as with selecting list elements by position
(see Recipe \@ref(recipe-id054), ["Selecting List Elements by Position"](#recipe-id054)),
there is an important difference between `lst[["`*name*`"]]` and `lst["`*name*`"]`.
They are not the same:
`lst[["`*name*`"]]`
: This is an element, not a list.
`lst["`*name*`"]`