---
title: "Lab 06 - Text Mining"
author: Amei Hao
output: github_document
---
```{r setup}
# Note: with these options, chunks are neither run nor shown when knitting.
# Adjust them (e.g., eval = TRUE) to include results in the rendered document.
knitr::opts_chunk$set(eval = FALSE, include = FALSE)
```
# Learning goals
- Use `unnest_tokens()` and `unnest_ngrams()` to extract tokens and ngrams from text (see the toy example below).
- Use `dplyr` and `ggplot2` to analyze text data.
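As a quick warm-up, here is a toy illustration of what these two functions return. The one-sentence input is made up for demonstration and is not part of the lab data:
```{r}
library(dplyr)
library(tibble)
library(tidytext)
toy <- tibble(id = 1, text = "the patient was given aspirin twice daily")
# One row per word
toy %>% unnest_tokens(output = token, input = text)
# One row per two-word sequence
toy %>% unnest_ngrams(output = bigram, input = text, n = 2)
```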
# Lab description
For this lab we will be working with a new dataset containing transcription samples from https://www.mtsamples.com/. A "fairly" cleaned version is available at https://raw.githubusercontent.com/USCbiostats/data-science-data/master/00_mtsamples/mtsamples.csv.
This markdown document should be rendered using `github_document`.
# Setup the Git project and the GitHub repository
1. Go to your documents (or wherever you are planning to store the data) in your computer, and create a folder for this project, for example, "PM566-labs"
2. In that folder, save [this template](https://raw.githubusercontent.com/USCbiostats/PM566/master/content/assignment/06-lab.Rmd) as "README.Rmd". This will be the markdown file where all the magic will happen.
3. Go to your GitHub account and create a new repository, hopefully of the same name that this folder has, i.e., "PM566-labs".
4. Initialize the Git project, add the "README.Rmd" file, and make your first commit.
5. Add the repo you just created on GitHub.com to the list of remotes, and push your commit to origin while setting the upstream (an R-based alternative is sketched below).
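If you prefer to do steps 3 through 5 from within R, the `usethis` package wraps the same workflow. This is a sketch, not a requirement: it assumes you have `usethis` installed and a GitHub token configured.
```{r, eval = FALSE}
# Initialize the Git repository and make the first commit
usethis::use_git()
# Create the GitHub repo, add it as the "origin" remote, and push
# the commit while setting the upstream
usethis::use_github()
```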
### Setup packages
You should load in `dplyr` (or `data.table` if you want to work that way), `ggplot2`, and `tidytext`.
If you don't already have `tidytext`, you can install it with:
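```{r, eval = FALSE}
install.packages("tidytext")
```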
### Read in Medical Transcriptions
Loading in reference transcription samples from https://www.mtsamples.com/.
```{r, warning=FALSE, message=FALSE}
library(readr)
library(dplyr)
mt_samples <- read_csv("https://raw.githubusercontent.com/USCbiostats/data-science-data/master/00_mtsamples/mtsamples.csv")
mt_samples <- mt_samples %>%
  select(description, medical_specialty, transcription)
head(mt_samples)
```
---
## Question 1: What specialties do we have?
We can use `count()` from `dplyr` to figure out how many different categories we have. Are these categories related? Overlapping? Evenly distributed?
```{r}
mt_samples %>%
count(medical_specialty, sort = TRUE)
```
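One quick way to judge how evenly the specialties are distributed (a sketch, not required by the question) is to compute each specialty's share of all samples:
```{r}
mt_samples %>%
  count(medical_specialty, sort = TRUE) %>%
  mutate(prop = n / sum(n)) # share of all transcriptions per specialty
```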
---
## Question 2
- Tokenize the words in the `transcription` column
- Count the number of times each token appears
- Visualize the top 20 most frequent words
Explain what we see from this result. Does it make sense? What insights (if any) do we get?
```{r}
library(tidytext)
library(ggplot2)
library(forcats)
mt_samples %>%
  unnest_tokens(output = token, input = transcription) %>%
  count(token, sort = TRUE) %>%
  top_n(n = 20, wt = n) %>%
  ggplot(aes(x = n, y = fct_reorder(token, n))) +
  geom_col()
```
---
## Question 3
- Redo the visualization, but remove stopwords first
- Bonus points if you remove numbers as well
What do we see now that we have removed stop words? Does it give us a better idea of what the text is about?
---
```{r}
mt_samples %>%
  unnest_tokens(word, transcription) %>%
  anti_join(tidytext::stop_words, by = "word") %>%
  count(word) %>%
  top_n(n = 20, wt = n) %>%
  ggplot(aes(x = n, y = fct_reorder(word, n))) +
  geom_col()
```
### Remove the numbers
```{r}
mt_samples %>%
  unnest_tokens(word, transcription) %>%
  anti_join(tidytext::stop_words, by = "word") %>%
  filter(!grepl("[0-9]", word)) %>% # drop any token containing a digit
  count(word, sort = TRUE) %>%
  top_n(n = 20, wt = n) %>%
  ggplot(aes(x = n, y = fct_reorder(word, n))) +
  geom_col()
```
## Question 4
Repeat question 2, but this time tokenize into bi-grams. How does the result change if you look at tri-grams?
```{r}
mt_samples %>%
  unnest_ngrams(output = token, input = transcription, n = 2) %>%
  count(token, sort = TRUE) %>%
  top_n(n = 20, wt = n) %>%
  ggplot(aes(x = n, y = fct_reorder(token, n))) +
  geom_col()
```
```{r}
mt_samples %>%
  unnest_ngrams(output = token, input = transcription, n = 3) %>%
  count(token, sort = TRUE) %>%
  top_n(n = 20, wt = n) %>%
  ggplot(aes(x = n, y = fct_reorder(token, n))) +
  geom_col()
```
---
## Question 5
Using the results you got from question 4, pick a word and count the words that appear before and after it.
```{r}
library(tidyr)
mt_bigrams <- mt_samples %>%
  unnest_ngrams(output = token, input = transcription, n = 2) %>%
  separate(col = token, into = c("word1", "word2"), sep = " ") %>%
  select(word1, word2)
# Words that appear before "blood"
mt_bigrams %>%
  filter(word2 == "blood") %>%
  count(word1, sort = TRUE)
# Words that appear after "blood"
mt_bigrams %>%
  filter(word1 == "blood") %>%
  count(word2, sort = TRUE)
# Most common bigrams overall
mt_bigrams %>%
  count(word1, word2, sort = TRUE)
```
```{r}
# The same bigram counts with stopwords removed from both positions
mt_bigrams %>%
  anti_join(tidytext::stop_words %>% select(word), by = c("word1" = "word")) %>%
  anti_join(tidytext::stop_words %>% select(word), by = c("word2" = "word")) %>%
  count(word1, word2, sort = TRUE)
```
---
## Question 6
Which words are most used in each of the specialties? You can use `group_by()` and `top_n()` from `dplyr` to have the calculations done within each specialty. Remember to remove stopwords. How about the 5 most used words?
```{r}
mt_samples %>%
  unnest_tokens(token, transcription) %>%
  anti_join(tidytext::stop_words, by = c("token" = "word")) %>%
  group_by(medical_specialty) %>%
  count(token) %>%
  top_n(5, n)
```
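Note that `top_n()` keeps ties, so some specialties may return more than 5 rows. If you want exactly 5 per specialty, one option (a sketch using `dplyr::slice_max()`, which requires dplyr 1.0 or later) is:
```{r}
mt_samples %>%
  unnest_tokens(token, transcription) %>%
  anti_join(tidytext::stop_words, by = c("token" = "word")) %>%
  count(medical_specialty, token) %>%
  group_by(medical_specialty) %>%
  slice_max(n, n = 5, with_ties = FALSE) # break ties to keep exactly 5 rows
```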
## Question 7 - extra
Find your own insight in the data:
Ideas:
- Interesting ngrams
- See if certain words are used more in some specialties than others (a TF-IDF sketch for this idea follows below)
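One possible starting point for the second idea, though not the only valid approach: `bind_tf_idf()` from `tidytext` scores words that are frequent in one specialty but rare in the others.
```{r}
mt_samples %>%
  unnest_tokens(word, transcription) %>%
  anti_join(tidytext::stop_words, by = "word") %>%
  count(medical_specialty, word) %>%
  bind_tf_idf(word, medical_specialty, n) %>%
  group_by(medical_specialty) %>%
  slice_max(tf_idf, n = 5) # 5 most characteristic words per specialty
```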