-
Notifications
You must be signed in to change notification settings - Fork 1
/
03-moremanipulation.Rmd
312 lines (237 loc) · 12.6 KB
/
03-moremanipulation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
# More Data Manipulation in R
```{r package_setup2, echo=FALSE, message=FALSE, warning=FALSE}
library(tidyverse)
library(knitr)
```
This chapter extends some of the data cleaning and data wrangling skills from Chapter 2.
It uses survey data from the [RStudio Learning R Survey](https://github.com/rstudio/learning-r-survey). You can access this data, which includes all transformations and cleanings from Chapter 2, [here]("data/rsurvey.RData").
You can load the data as follows:
```{r}
load("data/rsurvey.RData")
```
## Wide and Long Data
Data typically comes in two formats: wide and long.
A **wide** dataset has each individual's data in separate columns. It is typically easy for humans to read. Here is a simple example:
```{r echo=FALSE}
tribble(
~country, ~avgtemp.1994, ~avgtemp.1995, ~avgtemp.1996,
"Sweden", 6, 5, 7,
"Denmark", 6, 8, 8,
"Norway", 3, 11, 7
) -> wide_example
wide_example %>% kable()
```
A **long** dataset, each variable has its own column, and each observation has its own row. In the wide dataset above, *country*, *year*, and *temperature* are three variables. In a long dataset, it would look like this:
```{r echo=FALSE}
tribble(
~country, ~year, ~avgtemp,
"Sweden", 1994, 6,
"Denmark", 1994, 6,
"Norway", 1994, 3,
"Sweden", 1995, 5,
"Denmark", 1995, 8,
"Norway", 1995, 11,
"Sweden", 1996, 7,
"Denmark", 1996, 8,
"Norway", 1996, 7
) -> long_example
long_example %>% kable()
```
Long data is often referred to as tabular data and is more machine-readable than human-readable. It is also very similar to [Tidy data][Introduction to the Tidyverse].
## Pivoting data from wide to long - pivot_longer()
We can easily transform data from wide to long with the `pivot_longer()` function from the `Tidyverse`. Here is a simple example:
```{r}
library(tidyverse)
wide_example %>%
pivot_longer(avgtemp.1994:avgtemp.1996, names_to = "year", values_to = "avg.temp")
```
> **What does the code mean?**
>
> * `wide_example %>%` - use the wide example data and
> * `pivot_longer(` - make it long
> * `avgtemp.1994:avgtemp.1996,` - the columns we want to change from wide to long
> * `names_to = "year",` - take the column names from the wide data and put them into a column called "year"
> * `values_to = "avg.temp")` - take the values from the wide data and put them into a column called "avg.temp"
### Pivoting survey data
[Recall that we had a check-all question][Splitting variables with split()] in our survey that was a little bit messy. It is a good example of wide data:
```{r}
rsurvey %>%
select(use_1:use_11) %>% #select only the check-all questions
head(n = 8)
```
You will notice each column has different data, and there are many `NA`s. This means some checked 1 or 2 items and therefore have 10 or 11 `NA`s while others checked more items and have less `NA`s. This is a perfect example of data that we can pivot from wide to long and better organize it.
```{r}
rsurvey_use_long <- rsurvey %>%
rownames_to_column("id") %>%
select(id, use_1:use_11) %>%
pivot_longer(use_1:use_11, names_to = "use_number", values_to="use") %>%
mutate(use = ifelse(use == "", NA, use),
use = str_trim(use))
rsurvey_use_long %>% head(n=10)
```
> **What does the code mean?**
>
> * `rsurvey_use_long <-` - For this example, we will pivot the data into an object called "rsurvey_use_long"
> * `rsurvey %>%` - use the rsurvey data and
> * `rownames_to_column("id") %>%` - In the wide dataset, each row is an individual, but we do not have a column of individual ids. This function >will make a column of row numbers called "id". The first row will be >"1", the second row will be "2", etc. Now, we have ids for each individual in the dataset. We will use this later/
> * `select(id, use_1:use_11) %>%` - keep only the id column and the columns to pivot. We are removing them for this example, but you wouldn't do this if you wanted to keep the data in the "rsurvey" data object.
> * `pivot_longer(` - change (pivot) it from wide to long
> * `use_1:use_11,` - select the columns to pivot
> * `names_to = "use_number"`, - take the multiple column names and put them in a single column called "use_number"
> * `values_to="use") %>%` - take the values and put them in a column called "use"
> * `mutate(use = ifelse(use == "", NA, use))` - make any empty text `NA` (some will be `NA` and others will just be "", which is treated as a character space)
> * `drop_na(use)` - drop any `NA` values in the "use" column
We could use this data to visualize and analyze, or we could include it back into the same format as the dataset by making it wide, with each column a dichotomous variable for each use, with the values `1` for "checked" and `0` for "not checked". We will do that in the next section.
## Pivoting data from long to wide - pivot_wider()
We can easily transform data from long to wide with the `pivot_wider()` function from the `Tidyverse`. Here is a simple example:
```{r}
library(tidyverse)
long_example %>%
pivot_wider(names_from = year, values_from=avgtemp)
```
> **What does the code mean?**
>
> * `long_example %>%` - use the wide example data and
> * `pivot_wide(` - make it long
> * `names_from = year,` - choose the column from which our names should come from
> * `values_from = avgtemp)` - choose the values for each new row
Before we transform the data, we need to note that there is a problem. If we count the unique values in the dataset, we will find there are 158 different options whereas, [according to the survey](https://github.com/rstudio/learning-r-survey/blob/master/2019/Learning%20R%20Survey%20-%20Google%20Forms.pdf), there should be 11 + an "other" category.
```{r}
rsurvey_use_long %>%
distinct(use) %>% # find only unique values
count() # count unique values
```
This will cause a big problem if we pivot it. It will cause 158 additional columns to be created. The easiest way to deal with this issue is to use *only* the original categories and then create an other category. Here is what we can do:
```{r}
rsurvey_use_long <- rsurvey_use_long %>%
mutate(use_new = case_when(
str_detect(use, "Statistical analysis") ~ "Statistical analysis",
str_detect(use, "Data transformation") ~ "Data transformation",
str_detect(use, "Modeling") ~ "Modeling",
str_detect(use, "Visualization") ~ "Visualization",
str_detect(use, "Machine learning") ~ "Machine learning",
str_detect(use, "Text processing") ~ "Text processing",
str_detect(use, "Genomics") ~ "Genomics",
str_detect(use, "Medicine") ~ "Medicine",
str_detect(use, "Survey analysis") ~ "Survey analysis",
str_detect(use, "Clinical trials") ~ "Clinical trials",
str_detect(use, "Financial analysis") ~ "Financial analysis",
is.na(use) ~ "NA",
TRUE ~ "Other"
),
use_new = ifelse(is.na(use), NA, use_new),
use_other = ifelse(use_new == "Other", use, NA)
)
rsurvey_use_long %>%
distinct(use_new) %>% # find only unqiue values
count()
```
```{r}
rsurvey_use_wide <- rsurvey_use_long %>%
group_by(id) %>%
count(use_new) %>%
ungroup %>%
pivot_wider(names_from = use_new, values_from = n,
values_fill = list(n=0)) %>%
janitor::clean_names()
```
We can check if this worked if the number of observations in rsurvey is the same as our wide data:
```{r}
# check if the are the same
# == means "equal to"
count(rsurvey) == count(rsurvey_use_wide)
```
> **What does the code mean?**
> * `rsurvey_use_wide <-` - create a new data object (for demonstration / testing purposes) called "rsurvey_use_wide"
> * `rsurvey_use_long %>%` - use the data from "rsurvey_use_long" and
> * `group_by(id) %>%` - group it by each individual's id and
> * `add_count(use_new) %>%` - count each use per individual - this will create a column of 1's, indicating they checked the item (recall that we deleted unchecked items)
> * `ungroup %>%` - ungroup the data
> * `pivot_wider(` - pivot the data wider
> * `names_from = use, ` - take the names of the new rows from the "use" column
> * `values_from = n,` - take the values (all 1s) from the "n" column
> * `values_fill = list(n=0)) %>%` - create 0s for any columns that do not have a 1
> + `values_fill` allows you to fill empty values
> + `list(n=0)` - states to make empty values from n into zeroes
> * `janitor::clean_names()` - use the janitor package to make the names lowercase and underlined for ease of typing later
Now that we have demonstrated how to fix the check-all questions, we can simplify the steps by *combining* `pivot_longer()` and `pivot_wider()` in the same script:
```{r}
rsurvey_check_all <- rsurvey %>%
rownames_to_column("id") %>%
select(id, use_1:use_11) %>%
pivot_longer(use_1:use_11, names_to = "use_number", values_to="use") %>%
mutate(use = ifelse(use == "", NA, use),
use = str_trim(use),
use_new = case_when(
str_detect(use, "Statistical analysis") ~ "Statistical analysis",
str_detect(use, "Data transformation") ~ "Data transformation",
str_detect(use, "Modeling") ~ "Modeling",
str_detect(use, "Visualization") ~ "Visualization",
str_detect(use, "Machine learning") ~ "Machine learning",
str_detect(use, "Text processing") ~ "Text processing",
str_detect(use, "Genomics") ~ "Genomics",
str_detect(use, "Medicine") ~ "Medicine",
str_detect(use, "Survey analysis") ~ "Survey analysis",
str_detect(use, "Clinical trials") ~ "Clinical trials",
str_detect(use, "Financial analysis") ~ "Financial analysis",
is.na(use) ~ "NA",
TRUE ~ "Other"
),
use_new = ifelse(is.na(use), NA, use_new),
use_other = ifelse(use_new == "Other", use, NA)
) %>%
group_by(id) %>%
count(use_new) %>%
ungroup %>%
pivot_wider(names_from = use_new, values_from = n,
values_fill = list(n=0)) %>%
janitor::clean_names()
head(rsurvey_check_all, n=10) %>%
kable()
```
## Joining multiple data objects
This section will explain the different types of joins you can use in R to connect data objects together. Joins are kinds of merges where you add columns and/or rows from one data frame to another data frame by matching them on some variable or sets of variables.
There are many different types of joins, including left joins, right joins, inner joins, semi joins, full joins, and anti joins. See [https://dplyr.tidyverse.org/reference/join.html](https://dplyr.tidyverse.org/reference/join.html) for more information.
### Left Join
Left joins typically connect rows in a second data frame to matching rows in the first data frame.
Here is a simple example.
**Main data frame**
```{r echo=FALSE, message=FALSE, warning=FALSE}
band_members
```
**Second data frame**
```{r}
band_instruments
```
We can connect the second data frame to the main (first) data frame by joining on the column they have in common, "name". This will create a single, merged data frame. Any unmatched data in the second data frame will *not* be in the merged data frame.
```{r}
band_members %>% left_join(band_instruments)
```
### Left Joining our Survey Data
We have created a data frame called "rsurvey_check_all," which contains all our check-all questions and whether someone checked it (indicated by 1) or did not check it (indicated by 0). We can combined this with our original "rsurvey" data like this:
```{r message=FALSE, warning=FALSE}
rsurvey_joined <- rsurvey %>%
rownames_to_column("id") %>%
left_join(rsurvey_check_all)
```
> **What does the code mean?**
> * `rsurvey_joined <- ` - save to a new data frame
> + Note: this new data frame is created for demonstration purposes so you can compare the original to the joined data object. In reality, we would simply overwrite our existing data object like this: `rsurvey <-`
> * `rsurvey %>% ` - use the "rsurvey" data and...
> * `rownames_to_column("id") %>%` - make a column called "id" that is simply the number of each row. This creates the column we will join by ("rsurvey_check_all" already has this)
> * `left_join(rsurvey_check_all)` - join "rsurvey" with "rsurvey_check_all".
> + Note: because both data frames have an "id" column, `R` automatically joins them by this column. You can also specify:
> + `left_join(rsurvey_check_all, by = "id)` or
> + `left_join(df_name, by = c("col1" = "col2"))`if the columns have different names
To complete the rsurvey cleaning process, let's also join the "other" column from the check-all data:
```{r message=FALSE, warning=FALSE, include=FALSE}
rsurvey <- rsurvey %>%
rownames_to_column("id") %>%
left_join(rsurvey_check_all)
```
```{r}
rsurvey<- rsurvey %>%
left_join(rsurvey_use_long %>%
select(id, use_other) %>%
drop_na(use_other))
```