-
Notifications
You must be signed in to change notification settings - Fork 0
/
week03.qmd
282 lines (234 loc) · 14.1 KB
/
week03.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
---
title: 'Vectors, slicing, and map(ping)'
author: "Jeremy Van Cleve"
date: 10 09 2024
format:
html:
self-contained: true
---
# Outline for today
- Matrices and arrays
- Indexing and slicing
- Mapping and applying
# Matrices and arrays
Previously, you have seen vectors, which are just lists of objects of a single type. However, you often want a matrix of objects of a single (or multiple!) types or even a higher dimensions group of objects. The two dimensional version of a vector is called a **`matrix`** and the *n*-dimensional version is called an **`array`**.
You can think of a matrix as a vector,
```{r}
vec = 1:16
vec
```
except you've specified that there are a certain number of rows and columns:
```{r}
matx = matrix(vec, nrow = 4, ncol = 4)
matx
```
Notice how the vector "filled" each column of the matrix. This is because R is a "column-major" language (Fortran, MATLAB, and Julia are other column-major langauges). Some languages, such as C and Python, and have row-major order.
Since you often deal with matrices and matrix-like objects, R has two functions to give you the number of rows and columns. Also, the length of the matrix is just the rows times the columns.
```{r}
nrow(matx)
ncol(matx)
length(matx)
```
Likewise, you can convert the vector to a 4x4 array:
```{r}
arr = array(vec, dim = c(4,4))
arr
```
Two-dimensional arrays are exactly the same as matrices:
```{r}
str(matx)
str(arr)
```
In the above, the `dim` argument specifies the dimensionality of the array. Thus, you can convert the vector to a multidimensional array of 2x2x2x2 as well:
```{r}
arr = array(vec, dim = c(2,2,2,2))
arr
```
Note that this is a four-dimensional object, so you can't print it without having to "flatten" it in some way.
# Indexing and slicing
Indexing matrices works similarly to indexing vectors except that you give a list of the elements you want from each dimension. First, suppose that you roll a twenty-sided die 100 times (because you are a super nerd) and you collect the results in a 10x10 matrix:
```{r}
set.seed(100) # this gives us the same "random" matrix each time
rmatx = matrix(sample(1:20, 100, replace = TRUE), nrow = 10, ncol = 10)
rmatx
```
To obtain the element in the second row, eighth column,
```{r}
rmatx[2,8]
```
You can also get a *"slice"* of the matrix by using the colon operator:
```{r}
rmatx[2,1:5]
```
yields the first five elements of the second row. Note that within the entry for the column element, you actually used a **vector for the index**. Thus, you can give any list of indices in any order you choose. For example,
```{r}
rmatx[c(1,3,5,7,9),c(2,4,6,8,10)]
```
returns a slice of the matrix with only the odd rows and even columns. You can keep all the elements in a specific dimension by just leaving that spot blank. For example, to get the fourth row,
```{r}
rmatx[4,]
```
In general, slicing an array involves giving a list of indices for each dimension of the array.
**The magic of data wrangling and analysis with complex data comes in the many creative ways one can create these lists of indices and thus the slices that contain exactly the subset of the data that you want and often in an order that you specify.**
## Slicing lists
Recall that lists are like vectors but with potentially multiple types of objects. Thus, they are a little bit more complicated to slice. Take the following list,
```{r}
x = list(1:3, 4:6, 7:9)
x
```
which is a list of three vectors each three elements long. To get the first element of the list as if it were a vector, you would try
```{r}
x[1]
str(x[1])
```
but (using the `str` function) you can see that actually returned the first vector as a list of length one. Thus, the single brackets simply return another list that contains the elements requested. You can give the single brackets a vector of indices, so its natural that you should get a list back. This is actually no different for vectors; when you slice them, you get vectors back, and for R a single number is just a vector of length one (so everything is consistent).
To get the component of the list itself, you must use double brackets:
```{r}
str(x[[1]])
```
This will be useful too when we're dealing with `data.frame`s, which are basically just special lists. Another useful thing with lists (and other objects as we'll see next week) is that you can name the elements with strings (that satisfy R objects naming rules!). This allows you to access elements of the list using the name instead of the index. For example, if
```{r}
x = list(a = 1:3, b = 4:6, c = 7:9)
x
```
then we can access the first element using its name `a` and the `$` operator:
```{r}
x$a
```
This is actually equivalent to
```{r}
x[[1]]
x[["a"]]
x$"a"
x$`a`
```
Any valid R name doesn't need quotes or backticks for using the `$`, but you need them for fancy non-R friendly names:
```{r}
x = list(`1 fancy name`=1:3, b = 4:6, c = 7:9)
x$"1 fancy name"
x$`1 fancy name`
x[["1 fancy name"]]
```
## Other ways to slice
### Negative integers omit the specified positions:
```{r}
rmatx
rmatx[-c(1,3,5,7,9),c(2,4,6,8,10)]
```
gives the even rows and even columns.
### Logical vectors selects elements of the matrix where the index vector is TRUE.
**This is one of the most useful ways to index.**
If you want to get only the rolls of the die that were less than 8, then you create a matrix of logicals (`TRUE` or `FALSE`)
```{r}
rmatx < 8
```
and use it to index the matrix:
```{r}
rmatx[rmatx < 8]
```
Notice here that you get a vector back, not a matrix. This is because when you give a single index to a matrix, it treats the matrix like a vector with the first column first, then the second column, etc (i.e., column-major order). It also makes sense that you don't get a matrix back since the "< 10" condition could be met anywhere any number of times in the matrix (i.e., no guarantee that the result would be square like a matrix).
You could also just slice rows based on columns (or vice versa). This is the kind of slicing we'll often do on a data table since we will want all rows (say, results from different experiments) whose column (say, variable or factor in the experiment) matches a certain condition. For example, suppose we want to get all rows of the matrix `rmatx` with a fourth column whose roll is less than 10. We first slice the fourth column with `rmatx[,4]` and then compare it to 10,
```{r}
rmatx
rmatx[,4] < 8
```
Note that we get a vector of logical values indicating whether the element in that row of the fourth column is less than 10. To get only those rows of the matrix `rmatx`, we then use this vector to slice the matrix:
```{r}
rmatx[rmatx[,4] < 8,]
```
We can also then slice to get a subset of the columns. Say we only want the even columns
```{r}
rmatx[rmatx[,4] < 8, seq(2,10,2)]
```
The function `seq(start,end,increment)` is a handy generalization of the `start:end` colon operator that allows us to choose a value to increment or sequence of numbers by.
Let's recap this last way of slicing a matrix since its very common and has the same logic as our later slice operations on data tables. If we want all **rows** of the matrix `mat` where **column** `n` is equal to `x`, then our slice statement is `mat[ mat[,n]==x, ]`.
# Mapping or applying
Given that you can slice matrices now, you will at some point want to apply some function to each element of that slice or to each row or column of the matrix. This can be done with a `for` loop like we have already seen, but there are functions designed precisely for this task. Such functions are `apply` functions in R and `map` functions in Python, Julia, and Mathematica.
In R, there are actually many `apply` functions since there are vectors, lists, and other types of objects with multiple elements that one might to iterate over. To see them all, you type
```{r}
#| eval: false
??base::apply
```
which searches for all functions with "apply" in the description in the "base" package.
If you have a vector or list, the easiest apply function is `sapply`, which applies a function to each element of a vector or list and returns the output as a vector or matrix if possible. For example, you could sum each vector in the list you created above using the `sum` function.
```{r}
x
sapply(x, sum)
```
Note that `sapply` gave you back a vector but with each element named according to the names of the list elements (handy!). The `s` in `sapply` stands (I think...) for `simplify`. The function `lapply` works like `sapply` but returns a list (hence `l(ist)apply`),
```{r}
lapply(x, sum)
```
so `sapply` is like taking the output of `lapply` and converting it to a vector if possible.
If you want to use apply over a matrix, the `apply` function is required. To add all the elements in each row (dimension 1) or column (dimension 2) of the matrix `matx`, you can try
```{r}
rmatx
apply(rmatx, 1, sum)
apply(rmatx, 2, sum)
```
However, you can do more complicated things by making custom functions and "applying" them. For example, in order to get the number of die rolls less than 8 in each row of your die roll matrix, you could try
```{r}
rmatx
apply(rmatx, 1, \(x) sum(x < 8))
```
where we use the anonymous function syntax from last time that creates a function that gets a logical vector for each row, TRUE for < 8 and FALSE for >= 8, and sums that vector (TRUE = 1 and FALSE = 0). For the number of rolls less than 8 in each column, we do
```{r}
apply(rmatx, 2, \(x) sum(x < 8))
```
Finally, what if we wanted the number of rolls less than 8 in the whole matrix? This is actually **easier** than getting the answer for rows or columns. This is because if compare `rmat` to 8,
```{r}
rmatx < 8
```
we get a matrix of logicals. We can then just use `sum`, which sums over all the elements of the matrix (recall: a matrix is really just a vector where you line up the columns in a long list),
```{r}
sum(rmatx < 8)
```
which matches of course other ways of doing the same calculation, say first getting the number of elements less than 8 for each row and then summing,
```{r}
sum(apply(rmatx, 1, \(x) sum(x < 8)))
```
# Lab ![](assets/beaker.png)
### Data
Now let's slice some real **DATA**. We'll use data on respiratory infections including COVID-19 from the Centers for Disease Control (CDC). These data are split up between [hospitalizations](https://data.cdc.gov/Public-Health-Surveillance/Weekly-United-States-Hospitalization-Metrics-by-Ju/aemt-mg7g/about_data) and [deaths](https://data.cdc.gov/NCHS/Provisional-COVID-19-Death-Counts-by-Week-Ending-D/r8kw-7aab/about_data). The data are accessible from a CDC database online, and we load them in below and combine them (see the code for an example of some of the data wrangling necessary to do this).
```{r}
#| message: false
library(tidyverse)
library(RSocrata)
# Read in hospitalization and deaths
us_hosps = read.socrata("https://data.cdc.gov/api/odata/v4/aemt-mg7g") |> as_tibble()
us_deaths = read.socrata("https://data.cdc.gov/api/odata/v4/r8kw-7aab") |> as_tibble()
#us_hosps = read_csv("US_COVID19_Hosps_ByWeek_ByState_20240125.csv")
#us_deaths = read_csv("US_COVID19_Deaths_ByWeek_ByState_20240125.csv")
us_hosps_deaths =
us_deaths |>
rename(week_end_date = week_ending_date) |> # rename this column to match column in `us_hosps`
select(-c(`data_as_of`, `start_date`, `end_date`, group, year, month, mmwr_week, footnote)) |> # get rid of excess columns in deaths table
inner_join( # join the two tables together
us_hosps |>
rename(state_abbrv = jurisdiction) |> # `us_hosps` has states as abbreviations so we'll need to add full state names
inner_join(tibble(state_abbrv = state.abb, state = state.name) |>
add_row(state_abbrv = c("USA", "DC", "PR"), state = c("United States", "District of Columbia", "Puerto Rico")))) |>
filter(state != "United States")
```
You can get the first few rows with
```{r}
head(us_hosps_deaths)
```
The data are saved as a `tibble`, which is really a fancy `data.frame`, which is a special kind of list that we will discuss in more detail next week. For this lab, we will practice slicing the `data.frame` as if it were a matrix.
**Please include the code chunk above that loads in the data and creates the `us_hosps_deaths` data table as the first code chunk in the markdown file that you submit for this assignment.** That way, the table will be present when I run the code in your markdown file.
### Problems
Use `us_hosps_deaths` data table from the code above for each of the problems belove. You should be able to use a few lines of R code to obtain the answer directly (i.e., you shouldn't just scroll though the data table and give the answer by hand). Use the vector indexing operations discussed this week to do your slicing (even if you know another way using `dplyr` or other R packages).
This is a big table so use the `str(us_hosps_deaths)` function to get information on all the columns and `view(us_hosps_deaths)` will bring up the GUI viewer of the table. Also, make use of the links above in the "Data" section to the CDC website to see more info on what data each column contains.
1. Which **state** had the most **deaths** during one week in 2021? Which week was that?\
(hint: remember you can compare dates: e.g., "2021-01-02" > "2021-01-01" is `TRUE`, etc.)\
(hint: you'll need the boolean `&`, **single ampersand**, to combine multiple conditions when they involve vectors)\
(hint: your solution might use the `which.max` function)
2. How many COVID-19 deaths has Kentucky had during 2021?\
(hint: use the `sum` function)
3. How many COVID-19 deaths were there outside of California in the month of August 2021?
4. Which US state has the second fewest COVID-19 hospitalizations in 2021? Use the column `total_admissions_all_covid_confirmed`\
(hint: get a vector of all the (unique) states)\
(hint: use `sapply` on that vector with an anonymous function that sums the hospitalization column for the right state and dates)\
(hint: use `sort` to help you find the second fewest hospitalizations)
5. Which week in 2021 had the highest number of hospitalizations in the country (across all states and territories)? Use the column `total_admissions_all_covid_confirmed`\
(hint: use `sapply` and `sum` to get the totals for each date in a way similar to question 4. Then use the `order` function to give you the right order to sort the dates...)