We can use count() from dplyr to find out how many different categories we have. Are these categories related? Overlapping? Evenly distributed?
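A minimal sketch of the counting step. The data frame `mt_samples`, its `medical_specialty` column, and the toy rows are assumptions made for illustration; the real dataset would be loaded in their place.

```r
library(dplyr)

# Toy stand-in for the real data (hypothetical names and rows)
mt_samples <- tibble(
  medical_specialty = c("Surgery", "Surgery", "Dentistry", "Allergy / Immunology"),
  transcription = c(
    "The patient was taken to the operating room.",
    "The patient tolerated the procedure well.",
    "The tooth was extracted without complication.",
    "The patient has a history of nasal allergies."
  )
)

# One row per category with its number of records, largest first
specialty_counts <- mt_samples %>%
  count(medical_specialty, sort = TRUE)

# The number of rows is the number of distinct categories
n_categories <- nrow(specialty_counts)
specialty_counts
```

Sorting by `n` makes it easy to see at a glance whether the categories are evenly distributed.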
The dataset contains 40 different medical specialties. All of them fall within the broader field of medicine, and some may focus on the same body systems or organs, but the categories themselves do not overlap. The cases are not evenly distributed: surgery has the most cases, followed by consult and cardiovascular/pulmonary.
Tokenize the words in the transcription column
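One way to tokenize is `unnest_tokens()` from tidytext, sketched here on a toy data frame. The name `mt_samples` and the two example sentences are assumptions; only the `transcription` column name comes from the text.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the real data (hypothetical rows)
mt_samples <- tibble(
  transcription = c(
    "The patient was taken to the operating room.",
    "The patient tolerated the procedure well."
  )
)

# One row per word; unnest_tokens() lowercases and strips punctuation by default
tokens <- mt_samples %>%
  unnest_tokens(output = word, input = transcription)

tokens
```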
Count the number of times each token appears
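Counting tokens can reuse `count()` on the tokenized column. This sketch assumes the same hypothetical `mt_samples` toy data and a token column named `word`.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the real data (hypothetical rows)
mt_samples <- tibble(
  transcription = c(
    "The patient was taken to the operating room.",
    "The patient tolerated the procedure well."
  )
)

# Tokenize, then count how often each token appears, most frequent first
word_counts <- mt_samples %>%
  unnest_tokens(word, transcription) %>%
  count(word, sort = TRUE)

word_counts
```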
Visualize the top 20 most frequent words
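A possible way to plot the result is a horizontal bar chart with ggplot2. The choice of ggplot2, the toy data, and the object names are assumptions; `with_ties = FALSE` is used so that at most 20 rows are kept even when counts are tied.

```r
library(dplyr)
library(tidytext)
library(ggplot2)

# Toy stand-in for the real data (hypothetical rows)
mt_samples <- tibble(
  transcription = c(
    "The patient was taken to the operating room.",
    "The patient tolerated the procedure well."
  )
)

# Keep the 20 most frequent tokens
top_words <- mt_samples %>%
  unnest_tokens(word, transcription) %>%
  count(word) %>%
  slice_max(n, n = 20, with_ties = FALSE)

# Horizontal bars, ordered so the most frequent word is at the top
p <- ggplot(top_words, aes(x = n, y = reorder(word, n))) +
  geom_col() +
  labs(x = "Count", y = "Word")

p
```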
Explain what we see from this result. Does it make sense? What insights (if any) do we get?
The most frequent words in the transcriptions are common ones such as “the,” “and,” “patient,” “she,” and “he.” This makes sense: most of them are function words (articles, conjunctions, pronouns, and prepositions) that appear in nearly every sentence, so they tell us little about the content. The word “patient” is also understandably prevalent, since physicians often refer to the patient in their notes.
What do we see now that we have removed stop words? Does it give us a better idea of what the text is about?
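Stop words can be dropped with an `anti_join()` against a stop-word list. This sketch assumes the `stop_words` lexicon shipped with tidytext and the same hypothetical toy data; the lab may have used a different list.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the real data (hypothetical rows)
mt_samples <- tibble(
  transcription = c(
    "The patient was taken to the operating room.",
    "The patient tolerated the procedure well."
  )
)

# Remove every token that appears in the stop-word lexicon, then count
clean_counts <- mt_samples %>%
  unnest_tokens(word, transcription) %>%
  anti_join(tidytext::stop_words, by = "word") %>%
  count(word, sort = TRUE)

clean_counts
```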
After removing stop words, the prominent words in the text are “patient,” “left,” “procedure,” and “pain,” among others. Many of the top 20 words are associated with surgical procedures, including terms like “anesthesia” and “incision.” This aligns with expectations, since surgery is the specialty with the most cases. These words give a much better idea of what the text is about: the procedures performed on patients.
Repeat question 2, but this time tokenize into bi-grams. How does the result change if you look at tri-grams?
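For n-grams, `unnest_tokens()` accepts `token = "ngrams"` together with `n`. The sketch below uses hypothetical toy data and output column names (`bigram`, `trigram`).

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the real data (hypothetical rows)
mt_samples <- tibble(
  transcription = c(
    "The patient was taken to the operating room.",
    "The patient tolerated the procedure well."
  )
)

# Bi-grams: every pair of consecutive words within a transcription
bigram_counts <- mt_samples %>%
  unnest_tokens(bigram, transcription, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

# Tri-grams: the same call with n = 3
trigram_counts <- mt_samples %>%
  unnest_tokens(trigram, transcription, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)

bigram_counts
trigram_counts
```

N-grams are built within each row, so phrases never span two transcriptions.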
For both bi-grams and tri-grams, the most frequent phrase involves “the patient.” Many phrases also describe locations. Tri-grams, which add one more word to each phrase, provide even more context, as expected.
Which words are most used in each of the specialties? You can use group_by() and top_n() from dplyr to have the calculations done within each specialty. Remember to remove stop words. What are the 5 most used words?
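A sketch of the per-specialty calculation with `group_by()` and `top_n()`. The `mt_samples` data frame, the `medical_specialty` column name, and the toy rows are assumptions. Note that `top_n()` keeps ties, so a group can return more than five rows.

```r
library(dplyr)
library(tidytext)

# Toy stand-in for the real data (hypothetical names and rows)
mt_samples <- tibble(
  medical_specialty = c("Surgery", "Surgery", "Dentistry", "Dentistry"),
  transcription = c(
    "The patient tolerated the procedure well.",
    "The patient was taken to the operating room.",
    "The tooth was extracted.",
    "The tooth and adjacent teeth were examined."
  )
)

top_words <- mt_samples %>%
  unnest_tokens(word, transcription) %>%
  anti_join(tidytext::stop_words, by = "word") %>%  # drop stop words
  count(medical_specialty, word) %>%                # counts within each specialty
  group_by(medical_specialty) %>%
  top_n(5, n) %>%                                   # top 5 per specialty, ties kept
  ungroup() %>%
  arrange(medical_specialty, desc(n))

top_words
```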
In allergy/immunology, the five most frequently used words are “history,” “noted,” “patient,” “allergies,” and “nasal” (tied with “past” for fifth place). In dentistry, the leading words are “patient,” “tooth,” “teeth,” “left,” and “procedure.”