---
title: "Lab 6 - Text Mining"
author: "Jayson De La O"
format:
  html:
    embed-resources: true
---
```{r}
# Packages: tidytext for tokenization; the rest for data wrangling and plotting
library(tidytext)
library(readr)
library(dplyr)
library(data.table)
library(ggplot2)
library(magrittr)
library(tidyverse)
```
```{r}
# Read the medical transcription samples and keep only the columns used in this lab
mt_samples <- read_csv("https://raw.githubusercontent.com/USCbiostats/data-science-data/master/00_mtsamples/mtsamples.csv")
mt_samples <- mt_samples %>%
  select(description, medical_specialty, transcription)
head(mt_samples)
```
QUESTION 1
There are 40 different specialty categories. They are not obviously related or overlapping, and they are not evenly distributed: a few specialties account for far more transcriptions than the rest.
```{r}
# Number of transcriptions per specialty
mt_samples %>%
  count(medical_specialty, sort = TRUE)
```
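To make the imbalance visible at a glance, here is a minimal sketch that plots the same counts as a bar chart; the `reorder()` call and axis labels are just for readability and are not part of the original assignment code.

```{r}
# Sketch: visualize how unevenly the specialties are represented
mt_samples %>%
  count(medical_specialty, sort = TRUE) %>%
  ggplot(aes(x = n, y = reorder(medical_specialty, n))) +
  geom_col() +
  labs(x = "Number of transcriptions", y = "Medical specialty")
```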
QUESTION 2
Looking at the most used words in the transcription column, a handful of words are used far more often than the rest. This is what we would expect: most of them are common everyday words, i.e. stop words, the words that hold sentences together. Still, the few non-stop words near the top are informative. The word "patient" is used about 22 thousand times, which strongly suggests the text is healthcare related.
```{r}
# Tokenize the transcriptions and list the 20 most frequent tokens
mt_samples %>%
  unnest_tokens(token, transcription) %>%
  count(token, sort = TRUE) %>%
  top_n(20, n)

# Same counts as a bar chart
mt_samples %>%
  unnest_tokens(token, transcription) %>%
  count(token, sort = TRUE) %>%
  top_n(20, n) %>%
  ggplot(aes(n, token)) +
  geom_col()
```
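As a quick check on the claim that the top tokens are mostly stop words, here is a hedged sketch that flags which of the 20 most frequent tokens appear in tidytext's `stop_words` lexicon; the `is_stop` column name is purely illustrative.

```{r}
# Sketch: which of the top 20 tokens are stop words?
mt_samples %>%
  unnest_tokens(token, transcription) %>%
  count(token, sort = TRUE) %>%
  top_n(20, n) %>%
  mutate(is_stop = token %in% stop_words$word)
```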
QUESTION 3
Removing the stop words (and numeric tokens) gives a much better idea of what the text is about, because the remaining frequent words now carry real context. The most common words are mostly healthcare-related terms such as patient, procedure, pain, mg, and blood.
```{r}
# Drop numeric tokens and stop words, then list the top 20 remaining tokens
mt_samples %>%
  unnest_tokens(token, transcription) %>%
  filter(!str_detect(token, "[0-9]")) %>%
  anti_join(stop_words, by = c("token" = "word")) %>%
  count(token, sort = TRUE) %>%
  top_n(20, n)

# Same counts as a bar chart
mt_samples %>%
  unnest_tokens(token, transcription) %>%
  filter(!str_detect(token, "[0-9]")) %>%
  anti_join(stop_words, by = c("token" = "word")) %>%
  count(token, sort = TRUE) %>%
  top_n(20, n) %>%
  ggplot(aes(n, token)) +
  geom_col()
```
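To quantify how much of the text this filtering removes, here is a minimal sketch that computes the share of tokens that are stop words or contain digits; the column names are just illustrative.

```{r}
# Sketch: what fraction of all tokens are stop words or numeric?
mt_samples %>%
  unnest_tokens(token, transcription) %>%
  summarise(
    total_tokens    = n(),
    stop_or_numeric = sum(token %in% stop_words$word | str_detect(token, "[0-9]")),
    share_removed   = stop_or_numeric / total_tokens
  )
```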
QUESTION 4
Bi-grams are pairs of consecutive words and tri-grams are runs of three. The stop words appear to be reintroduced because the `anti_join()` below compares each whole n-gram string to single stop words, so it never matches a multi-word n-gram; as a result the top bi-grams are mostly stop-word pairs. Tri-grams work much better here because the extra word gives enough context to tell what the transcription is about. (A sketch after the chunk shows one way to actually drop n-grams that contain stop words.)
```{r}
# Top 20 bi-grams (note: the anti_join has no effect on multi-word n-grams)
mt_samples %>%
  unnest_ngrams(ngram, transcription, n = 2) %>%
  anti_join(stop_words, by = c("ngram" = "word")) %>%
  count(ngram, sort = TRUE) %>%
  top_n(20, n)

# Top 20 tri-grams
mt_samples %>%
  unnest_ngrams(ngram, transcription, n = 3) %>%
  anti_join(stop_words, by = c("ngram" = "word")) %>%
  count(ngram, sort = TRUE) %>%
  top_n(20, n)

# Bar charts of the same bi-gram and tri-gram counts
mt_samples %>%
  unnest_ngrams(ngram, transcription, n = 2) %>%
  anti_join(stop_words, by = c("ngram" = "word")) %>%
  count(ngram, sort = TRUE) %>%
  top_n(20, n) %>%
  ggplot(aes(n, ngram)) +
  geom_col()

mt_samples %>%
  unnest_ngrams(ngram, transcription, n = 3) %>%
  anti_join(stop_words, by = c("ngram" = "word")) %>%
  count(ngram, sort = TRUE) %>%
  top_n(20, n) %>%
  ggplot(aes(n, ngram)) +
  geom_col()
```
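For completeness, here is a hedged sketch of one way to actually remove bi-grams containing stop words: split each bi-gram into its component words, filter, and re-unite them. The `word1`/`word2` column names are just illustrative.

```{r}
# Sketch: drop bi-grams in which either word is a stop word
mt_samples %>%
  unnest_ngrams(ngram, transcription, n = 2) %>%
  separate(ngram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  unite(ngram, word1, word2, sep = " ") %>%
  count(ngram, sort = TRUE) %>%
  top_n(20, n)
```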
QUESTION 5
I picked the word "patient". The code below takes every tri-gram whose middle word is "patient" and counts how often each combination of the preceding word (word1) and the following word (word3) occurs. A short sketch afterwards also tallies the preceding and following words separately.
```{r}
# Tri-grams with "patient" in the middle: count the word before and the word after
mt_samples %>%
  unnest_ngrams(ngram, transcription, n = 3) %>%
  separate(ngram, into = c("word1", "word2", "word3"), sep = " ") %>%
  select(word1, word2, word3) %>%
  filter(word2 == "patient") %>%
  count(word1, word3, sort = TRUE)
```
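Because the joint counts above mix the two positions, here is a minimal sketch that tallies the word immediately before "patient" and the word immediately after it separately; the `patient_trigrams` object name is just illustrative.

```{r}
# Sketch: separate counts of the word before and the word after "patient"
patient_trigrams <- mt_samples %>%
  unnest_ngrams(ngram, transcription, n = 3) %>%
  separate(ngram, into = c("word1", "word2", "word3"), sep = " ") %>%
  filter(word2 == "patient")

patient_trigrams %>% count(word1, sort = TRUE) %>% top_n(10, n)  # most common preceding words
patient_trigrams %>% count(word3, sort = TRUE) %>% top_n(10, n)  # most common following words
```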
QUESTION 6
The table below lists the top 5 non-stop, non-numeric words used in each specialty (ties can give a specialty more than 5 rows). A faceted plot of a few specialties follows as a sketch.
```{r}
# Count tokens within each specialty, then keep the top 5 per specialty
mt_samples %>%
  unnest_tokens(token, transcription) %>%
  filter(!str_detect(token, "[0-9]")) %>%
  anti_join(stop_words, by = c("token" = "word")) %>%
  count(medical_specialty, token, sort = TRUE) %>%
  group_by(medical_specialty) %>%
  top_n(5, n)
```
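Here is a hedged sketch that turns those per-specialty counts into faceted bar charts for the three most common specialties; `slice_max()`, `reorder_within()`, and `scale_y_reordered()` come from dplyr and tidytext, and the object names are just illustrative.

```{r}
# Sketch: top 5 words for the three most common specialties, as faceted bar charts
top_words_by_specialty <- mt_samples %>%
  unnest_tokens(token, transcription) %>%
  filter(!str_detect(token, "[0-9]")) %>%
  anti_join(stop_words, by = c("token" = "word")) %>%
  count(medical_specialty, token) %>%
  group_by(medical_specialty) %>%
  slice_max(n, n = 5, with_ties = FALSE) %>%
  ungroup()

# Pick the three most common specialties programmatically to keep the plot readable
common_specialties <- mt_samples %>%
  count(medical_specialty, sort = TRUE) %>%
  slice_head(n = 3) %>%
  pull(medical_specialty)

top_words_by_specialty %>%
  filter(medical_specialty %in% common_specialties) %>%
  ggplot(aes(n, reorder_within(token, n, medical_specialty))) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(~ medical_specialty, scales = "free_y") +
  labs(x = "Count", y = NULL)
```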
QUESTION 7
Check whether "patient" is used more in some specialties than in others. Raw counts mostly reflect how much text each specialty has, so a sketch after the chunk also scales by the total number of tokens per specialty.
```{r}
# Raw count of the token "patient" within each specialty
mt_samples %>%
  unnest_tokens(token, transcription) %>%
  filter(token == "patient") %>%
  count(medical_specialty, token, sort = TRUE)
```
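As hinted above, here is a minimal sketch that divides the count of "patient" by the total number of tokens in each specialty; the `patient_share` column name is just illustrative.

```{r}
# Sketch: share of all tokens that are "patient", by specialty
mt_samples %>%
  unnest_tokens(token, transcription) %>%
  count(medical_specialty, token) %>%
  group_by(medical_specialty) %>%
  mutate(patient_share = n / sum(n)) %>%
  ungroup() %>%
  filter(token == "patient") %>%
  arrange(desc(patient_share)) %>%
  select(medical_specialty, n, patient_share)
```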