generated from kstreet13/assignment
-
Notifications
You must be signed in to change notification settings - Fork 0
/
submit.qmd
131 lines (106 loc) · 2.73 KB
/
submit.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
---
title: "Lab 6"
author: "Karisa Ke"
format: html
embed-resources: true
editor: visual
---
```{r}
library(readr)
library(dplyr)
mt_samples <- read_csv("https://raw.githubusercontent.com/USCbiostats/data-science-data/master/00_mtsamples/mtsamples.csv")
mt_samples <- mt_samples %>%
select(description, medical_specialty, transcription)
head(mt_samples)
```
## Question 1
```{r}
mt_samples %>%
count(medical_specialty, sort = TRUE)
```
Looks like these medical specialties are not related to each other, with surgery has the most count.
## Question 2
```{r}
library(tidytext)
mt_samples %>%
unnest_tokens(word, transcription)
```
```{r}
mt_samples %>%
unnest_tokens(word, transcription) %>%
count(word, sort = TRUE)
```
```{r}
library(dplyr)
library(ggplot2)
library(forcats)
mt_samples %>%
unnest_tokens(word, transcription) %>%
count(word, sort = TRUE) %>%
top_n(20, n) %>%
ggplot(aes(x = n, y = fct_reorder(word, n))) +
geom_col() +
labs(title = "Top 20 Most Frequent Words", x = "Frequency", y = "Word")
```
The top 20 most frequent words don't make any sense, they are very common words in English and we are not able to extract useful information about the original text.
## Question 3
```{r}
library(stopwords)
library(stringr)
mt_samples %>%
unnest_tokens(word, transcription) %>%
anti_join(stop_words) %>%
filter(!str_detect(word, "^\\d+$")) %>%
count(word, sort = TRUE) %>%
top_n(20, n) %>%
ggplot(aes(x = n, y = reorder(word, n))) +
geom_col() +
labs(title = "Top 20 Most Frequent Words (Stopwords and Numbers Removed)", x = "Frequency", y = "Word")
```
After removing stop words, we have a better idea about what the text is about, it's clearly related to medicals.
## Question 4
```{r}
mt_samples %>%
unnest_ngrams(ngram, transcription, n = 2) %>%
count(ngram, sort = TRUE)
```
```{r}
mt_samples %>%
unnest_ngrams(ngram, transcription, n = 3) %>%
count(ngram, sort = TRUE)
```
The bi-grams and tri-grams don't really make any differences.
## Question 5
```{r}
chosen_word <- "the patient"
mt_samples %>%
unnest_tokens(word, transcription) %>%
filter(word == chosen_word) %>%
select(-description) %>%
group_by(medical_specialty) %>%
arrange(word) %>%
summarise(
words_before = lag(word),
chosen_word = word,
words_after = lead(word)
)
```
## Question 6
```{r}
mt_samples %>%
unnest_tokens(word, transcription) %>%
anti_join(stop_words) %>%
filter(!str_detect(word, "^\\d+$")) %>%
group_by(medical_specialty) %>%
count(word, sort = TRUE) %>%
top_n(1, n)
```
```{r}
mt_samples %>%
unnest_tokens(word, transcription) %>%
anti_join(stop_words) %>%
filter(!str_detect(word, "^\\d+$")) %>%
group_by(medical_specialty) %>%
count(word, sort = TRUE) %>%
top_n(5, n)
```