---
title: "Lab 06 - Text Mining"
author: Amei Hao
output: github_document
---
```{r setup}
# Note: with these options, chunks are neither run nor shown when knitting.
# Adjust them (e.g., eval = TRUE) to include results in the rendered document.
knitr::opts_chunk$set(eval = FALSE, include = FALSE)
```
# Learning goals
- Use `unnest_tokens()` and `unnest_ngrams()` to extract tokens and ngrams from text (see the toy example below).
- Use `dplyr` and `ggplot2` to analyze text data.
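As a quick warm-up, here is a toy illustration of what these two functions return. The one-sentence input is made up for demonstration and is not part of the lab data:
```{r}
library(dplyr)
library(tibble)
library(tidytext)
toy <- tibble(id = 1, text = "the patient was given aspirin twice daily")
# One row per word
toy %>% unnest_tokens(output = token, input = text)
# One row per two-word sequence
toy %>% unnest_ngrams(output = bigram, input = text, n = 2)
```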
# Lab description
For this lab we will be working with a new dataset containing transcription samples from https://www.mtsamples.com/. A "fairly" cleaned version is available at https://raw.githubusercontent.com/USCbiostats/data-science-data/master/00_mtsamples/mtsamples.csv.
This markdown document should be rendered using `github_document`.
# Setup the Git project and the GitHub repository
1. Go to your documents (or wherever you are planning to store the data) in your computer, and create a folder for this project, for example, "PM566-labs"
2. In that folder, save [this template](https://raw.githubusercontent.com/USCbiostats/PM566/master/content/assignment/06-lab.Rmd) as "README.Rmd". This will be the markdown file where all the magic will happen.
3. Go to your GitHub account and create a new repository, hopefully of the same name that this folder has, i.e., "PM566-labs".
4. Initialize the Git project, add the "README.Rmd" file, and make your first commit.
5. Add the repo you just created on GitHub.com to the list of remotes, and push your commit to origin while setting the upstream (an R-based alternative is sketched below).
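If you prefer to do steps 3 through 5 from within R, the `usethis` package wraps the same workflow. This is a sketch, not a requirement: it assumes you have `usethis` installed and a GitHub token configured.
```{r, eval = FALSE}
# Initialize the Git repository and make the first commit
usethis::use_git()
# Create the GitHub repo, add it as the "origin" remote, and push
# the commit while setting the upstream
usethis::use_github()
```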
### Setup packages
You should load in `dplyr` (or `data.table` if you want to work that way), `ggplot2`, and `tidytext`.
If you don't already have `tidytext`, you can install it with:
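```{r, eval = FALSE}
install.packages("tidytext")
```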
### Read in Medical Transcriptions
Loading in reference transcription samples from https://www.mtsamples.com/.
```{r, warning=FALSE, message=FALSE}
library(readr)
library(dplyr)
mt_samples <- read_csv("https://raw.githubusercontent.com/USCbiostats/data-science-data/master/00_mtsamples/mtsamples.csv")
mt_samples <- mt_samples %>%
  select(description, medical_specialty, transcription)
head(mt_samples)
```
---
## Question 1: What specialties do we have?
We can use `count()` from `dplyr` to figure out how many different categories we have. Are these categories related? Overlapping? Evenly distributed?
```{r}
mt_samples %>%
count(medical_specialty, sort = TRUE)
```
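One quick way to judge how evenly the specialties are distributed (a sketch, not required by the question) is to compute each specialty's share of all samples:
```{r}
mt_samples %>%
  count(medical_specialty, sort = TRUE) %>%
  mutate(prop = n / sum(n)) # share of all transcriptions per specialty
```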
---
## Question 2
- Tokenize the words in the `transcription` column
- Count the number of times each token appears
- Visualize the top 20 most frequent words
Explain what we see from this result. Does it make sense? What insights (if any) do we get?
```{r}
library(tidytext)
library(ggplot2)
library(forcats)
mt_samples %>%
  unnest_tokens(output = token, input = transcription) %>%
  count(token, sort = TRUE) %>%
  top_n(n = 20, wt = n) %>%
  ggplot(aes(x = n, y = fct_reorder(token, n))) +
  geom_col()
```
---
## Question 3
- Redo the visualization, but remove stopwords first
- Bonus points if you remove numbers as well
What do we see now that we have removed stop words? Does it give us a better idea of what the text is about?
---
```{r}
mt_samples %>%
  unnest_tokens(word, transcription) %>%
  anti_join(tidytext::stop_words, by = "word") %>%
  count(word) %>%
  top_n(n = 20, wt = n) %>%
  ggplot(aes(x = n, y = fct_reorder(word, n))) +
  geom_col()
```
### Remove the numbers
```{r}
mt_samples %>%
  unnest_tokens(word, transcription) %>%
  anti_join(tidytext::stop_words, by = "word") %>%
  filter(!grepl("[0-9]", word)) %>% # drop any token containing a digit
  count(word, sort = TRUE) %>%
  top_n(n = 20, wt = n) %>%
  ggplot(aes(x = n, y = fct_reorder(word, n))) +
  geom_col()
```
## Question 4
Repeat question 2, but this time tokenize into bi-grams. How does the result change if you look at tri-grams?
```{r}
mt_samples %>%
  unnest_ngrams(output = token, input = transcription, n = 2) %>%
  count(token, sort = TRUE) %>%
  top_n(n = 20, wt = n) %>%
  ggplot(aes(x = n, y = fct_reorder(token, n))) +
  geom_col()
```
```{r}
mt_samples %>%
  unnest_ngrams(output = token, input = transcription, n = 3) %>%
  count(token, sort = TRUE) %>%
  top_n(n = 20, wt = n) %>%
  ggplot(aes(x = n, y = fct_reorder(token, n))) +
  geom_col()
```
---
## Question 5
Using the results you got from question 4, pick a word and count the words that appear before and after it.
```{r}
library(tidyr)
mt_bigrams <- mt_samples %>%
  unnest_ngrams(output = token, input = transcription, n = 2) %>%
  separate(col = token, into = c("word1", "word2"), sep = " ") %>%
  select(word1, word2)
# Words that appear before "blood"
mt_bigrams %>%
  filter(word2 == "blood") %>%
  count(word1, sort = TRUE)
# Words that appear after "blood"
mt_bigrams %>%
  filter(word1 == "blood") %>%
  count(word2, sort = TRUE)
# Most common bigrams overall
mt_bigrams %>%
  count(word1, word2, sort = TRUE)
```
```{r}
# The same bigram counts with stopwords removed from both positions
mt_bigrams %>%
  anti_join(tidytext::stop_words %>% select(word), by = c("word1" = "word")) %>%
  anti_join(tidytext::stop_words %>% select(word), by = c("word2" = "word")) %>%
  count(word1, word2, sort = TRUE)
```
---
## Question 6
Which words are most used in each of the specialties? You can use `group_by()` and `top_n()` from `dplyr` to have the calculations done within each specialty. Remember to remove stopwords. How about the 5 most used words?
```{r}
mt_samples %>%
  unnest_tokens(token, transcription) %>%
  anti_join(tidytext::stop_words, by = c("token" = "word")) %>%
  group_by(medical_specialty) %>%
  count(token) %>%
  top_n(5, n)
```
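Note that `top_n()` keeps ties, so some specialties may return more than 5 rows. If you want exactly 5 per specialty, one option (a sketch using `dplyr::slice_max()`, which requires dplyr 1.0 or later) is:
```{r}
mt_samples %>%
  unnest_tokens(token, transcription) %>%
  anti_join(tidytext::stop_words, by = c("token" = "word")) %>%
  count(medical_specialty, token) %>%
  group_by(medical_specialty) %>%
  slice_max(n, n = 5, with_ties = FALSE) # break ties to keep exactly 5 rows
```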
## Question 7 - extra
Find your own insight in the data:
Ideas:
- Interesting ngrams
- See if certain words are used more in some specialties than others (a TF-IDF sketch for this idea follows below)
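One possible starting point for the second idea, though not the only valid approach: `bind_tf_idf()` from `tidytext` scores words that are frequent in one specialty but rare in the others.
```{r}
mt_samples %>%
  unnest_tokens(word, transcription) %>%
  anti_join(tidytext::stop_words, by = "word") %>%
  count(medical_specialty, word) %>%
  bind_tf_idf(word, medical_specialty, n) %>%
  group_by(medical_specialty) %>%
  slice_max(tf_idf, n = 5) # 5 most characteristic words per specialty
```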