---
title: "Twitter Text Mining Analysis"
author: "Qingqing (Larissa) Long"
date: "July 9, 2018"
output:
  html_document:
    keep_md: true
---
```{r setup, include=FALSE, comment=NA, echo=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, comment = NA, message = FALSE, fig.width = 9)
```
## Objectives
Twitter is an American online news and social networking service on which users post and interact with messages known as "tweets". The objectives of this study are:

* To take a deep dive into customer online reviews on Twitter
* To identify the overall sentiment toward the brand on Twitter
## Setup
The packages required for the analysis are installed (if needed) and loaded below.
```{r library}
# Install and load packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load(
  "SnowballC",  # word stemming
  "tm",         # text mining
  "twitteR",    # access to the Twitter API
  "syuzhet",    # sentiment analysis
  "httr",       # working with HTTP connections
  "base64enc",  # handling base64 encoding
  "tidyverse",  # data cleaning and manipulation
  "tidytext",   # text cleaning
  "lubridate",  # date-time handling across time zones
  "cowplot",    # arrange plots in a grid
  "wordcloud",  # create comparison word cloud
  "wordcloud2"  # create word cloud
)

# disable scientific notation in plots
options(scipen = 999)
```
## Data Exploration
We load the data to get a glimpse of the tweets. The data set contains 1,372 records with 16 variables, created between June 26 and July 6, 2018; 54% of them were original posts and 46% were retweets.
```{r}
# Load data
tweets <- readRDS(file = "data/Twitter/tweets_park_deleted.rds")
# Overview
str(tweets)
#summary(tweets)
```
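As a quick check on the split between original posts and retweets quoted above, a simple tally of the `isRetweet` flag (a minimal sketch) would look like this:
```{r}
# Sketch: share of original posts vs. retweets
tweets %>%
  count(isRetweet) %>%
  mutate(share = round(100 * n / sum(n), 1))
```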
A check for missing values shows that over 97% of the tweets do not have longitude and latitude information.
```{r}
# Check missing values
sapply(tweets, function(x) sum(is.na(x)))
```
### Where People Tweeted
Of the 29 tweets with geographic information, the majority were posted from the East Coast of the United States, while the rest came from the West Coast, the southern coast, and Southeast Asia.
```{r}
geo_tweets <- tweets %>%
  filter(!is.na(longitude) & !is.na(latitude)) %>%
  group_by(longitude, latitude) %>%
  summarise(count = n())

# map_data("world") requires the 'maps' package
ggplot() +
  geom_polygon(data = map_data("world"), aes(x = long, y = lat, group = group), colour = "white", fill = "grey") +
  geom_jitter(
    data = geo_tweets, position = position_jitter(width = 0.5, height = 0.5),
    aes(x = as.numeric(longitude), y = as.numeric(latitude), size = count), color = "orange"
  ) +
  ggtitle("Geographic Distribution of Users Whose Tweets Contain 'petco'") +
  labs(x = "Longitude", y = "Latitude")
```
### When People Tweeted
Below are all the tweets containing "petco" posted between June 26 and July 6, 2018.
```{r}
tweets %>%
  mutate(created_date = as.Date(created)) %>%
  group_by(created_date) %>%
  summarise(count = n()) %>%
  ggplot(aes(created_date, count, fill = created_date)) +
  geom_bar(stat = "identity", show.legend = F) +
  xlab("All Tweets") +
  ggtitle("No. of Tweets by Date")
```
Looking at the density over time, retweets peaked on June 29th between midnight and 8 a.m.
```{r}
# Create density plot of tweets by date
timeDist <- ggplot(tweets, aes(created)) +
  geom_density(aes(fill = isRetweet), alpha = .5) +
  scale_fill_discrete(guide = "none") +
  xlab("All Tweets") +
  ggtitle("When People Tweeted & Retweeted") +
  theme(axis.text.x = element_text(size = 8))

# Zoom into the peak day with the most retweets
peakDay <- filter(tweets, mday(created) == 29)
timeDistDayOf <- ggplot(peakDay, aes(created)) +
  geom_density(aes(fill = isRetweet), adjust = .25, alpha = .5) +
  theme(legend.justification = c(1, 1), legend.position = c(1, 1), axis.text.x = element_text(size = 8)) +
  xlab("Tweets on June 29th")

cowplot::plot_grid(timeDist, timeDistDayOf)
```
### The Hottest Retweet
The most retweeted tweet in the data set, shown below, was retweeted 139,609 times.
```{r}
# The hottest retweet found
tweets %>%
  arrange(desc(retweetCount)) %>%
  head(1) %>%
  .$text
```
### Entry Point
The entry points users came from are largely what we would expect: most tweets were posted from the Twitter app for iPhone, followed by Android and the web client.
```{r}
# extract the display name of the tweet source from the statusSource HTML
tweets$source <- substr(
  tweets$statusSource,
  regexpr(">", tweets$statusSource) + 1,
  regexpr("</a>", tweets$statusSource) - 1
)

# plot
tweets %>%
  group_by(source) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  head(10) %>%
  ggplot(aes(reorder(source, count), count, fill = source)) +
  geom_bar(stat = "identity", show.legend = F) +
  coord_flip() +
  labs(x = "Platforms", y = "Count", title = "Number of Tweets Posted by Entry Point")
```
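To illustrate what the `substr()`/`regexpr()` step above does, here is a made-up `statusSource` value in the HTML anchor format that `twitteR` typically returns; the indices pick out the text between the closing `>` of the opening tag and the `</a>`:
```{r}
# Hypothetical statusSource string, for illustration only
example_source <- '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
substr(
  example_source,
  regexpr(">", example_source) + 1,    # position right after the first ">"
  regexpr("</a>", example_source) - 1  # position right before "</a>"
)
# returns "Twitter for iPhone"
```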
### Popular Domains Shared
People use Twitter to share information, and much of the time they include a URL in their tweets. In this part, I extract the 506 unique URLs included in the tweets and use them to figure out which domains appear most often.
```{r}
reg <- "([^A-Za-z_\\d#@:/']|'(?![A-Za-z_\\d#@:/]))" # set up patterns by regular expression
urls_temp <- tweets %>%
unnest_tokens(word, text, token = "regex", pattern = reg, to_lower = FALSE) %>%
mutate(
word = str_replace_all(word, "https|//t|http|&|<|>", ""),
word = str_replace_all(word, "co/", "https://t.co/")
) %>%
select(word) %>%
filter(grepl("https://t.co/", word, fixed = TRUE)) %>%
count(word, sort = TRUE) %>%
mutate(word = reorder(word, n)) # 506 unique urls
# focus on urls that appeared at least twice
minimum_occurrences <- 2
urls_common <- urls_temp %>%
filter(n >= minimum_occurrences) %>%
mutate(source_url = as.character(word)) %>%
select(source_url, count = n) # 19 unique urls that appeared at least twice
# obtain target URLs and return HTTP status codes
url <- t(sapply(urls_common$source_url, GET)) %>%
as_tibble() %>%
select(url, status_code)
# bind results to original URLs and frequency count, then filter out those 404 status (page not found),
# and those pointing back to Twitter (like quoted tweets)
url_list <- cbind(urls_common, unnest(url)) %>%
as_tibble() %>%
select(url, count, status_code) %>%
filter(
status_code != 404,
url != "https://t.co/",
!grepl("https://twitter.com/", url)
)
# use the below function to extract domains from URLs
## split the URL by "//" and taking everything to the right of that, then splitting it by "/" and taking everything to the left of the first slash, then dropping "www." if present
extract_domain <- function(url) {
return(gsub("www.", "", unlist(strsplit(unlist(strsplit(as.character(url), "//"))[2], "/"))[1]))
}
```
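As a quick sanity check, applying `extract_domain()` to a hypothetical URL behaves as described:
```{r}
# Hypothetical URL, for illustration only
extract_domain("https://www.youtube.com/watch?v=abc123")
# returns "youtube.com"
```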
Then we can plot the most popular domains users point to as shown below:
```{r}
# count how often each domain occurs in the most frequent URL list
url_list %>%
  mutate(domain = mapply(extract_domain, url)) %>%
  group_by(domain) %>%
  summarize(domain_count = sum(count)) %>%
  arrange(desc(domain_count)) %>%
  mutate(domain = reorder(domain, domain_count)) %>%
  ggplot(aes(domain, domain_count, fill = domain)) +
  geom_bar(stat = "identity") +
  xlab(NULL) +
  ylab("Domain Count") +
  ggtitle('Top Domains in Tweets Containing "petco"') +
  theme(legend.position = "none") +
  coord_flip()
```
'Half the way home', an animal shelter and adoption organization, is the most common domain, followed by tunein, youtube, and the Petco website.
### Tweets
Let's take a look at some tweets.
```{r}
# View tweets
head(tweets$text, 5)
```
The original text contains many URLs, hashtags, and Twitter handles, so it is essential to tidy the data prior to text analysis.
## Text Cleaning
This snippet of code filters out retweets whose text begins with "RT", removes URLs and certain characters that signal something other than natural language text from the tweets, then tokenizes the tweets into words.
```{r}
tidy_tweets <- tweets %>%
  filter(!str_detect(text, "^RT")) %>% # remove retweets
  unnest_tokens(word, text, token = "regex", pattern = reg, to_lower = T) %>% # tokenize text into words
  mutate(
    word = str_replace_all(word, "https|//t|http|&|<|>", ""),
    word = str_replace_all(word, "https://t.co/[A-Za-z\\d]+https://t.co/[A-Za-z\\d]", ""),
    word = str_replace_all(word, "amp", "")
  ) %>%
  filter(
    !word %in% stop_words$word,
    word != "https",
    str_detect(word, "[a-z]")
  )
```
Tokenization means splitting the text on spaces, removing punctuation, converting all letters to lower case, and so on, then applying the tweet's metadata to each individual word. A tweet with 10 words is a single record in the tweets data frame, but after tokenizing it becomes 10 records in the new tidy_tweets data frame, each with a different word but identical metadata. After tokenizing, I filter out any records where the word is in a list of stop words (a, an, the, of, etc.) or where the "word" contains no letters (i.e., raw numbers).
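To make the one-record-per-word idea concrete, here is a minimal sketch on a single made-up tweet, using the default word tokenizer for brevity instead of the regex pattern above:
```{r}
# Toy example: one made-up tweet becomes one row per word, with metadata repeated
toy <- tibble(id = "example_id", text = "Great adoption event at the store this weekend")
toy %>%
  unnest_tokens(word, text) %>%       # splits on whitespace/punctuation and lower-cases
  anti_join(stop_words, by = "word")  # drops stop words such as "at", "the", "this"
```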
## Frequent topics
Frequent topics can be used to compare word usage across regions, markets, and user segments.
### Word frequency
What do people talk about when they talk about "Petco"? To find the most common words associated with "Petco" in this tweet corpus, I count occurrences of each unique word, keep the 50 most common words, and order them from most to least frequent.
```{r}
tidy_tweets %>%
  count(word, sort = TRUE) %>%
  filter(word != "petco") %>%
  head(50)

tidy_tweets %>%
  count(word, sort = TRUE) %>%
  filter(word != "petco") %>%
  head(50) %>%
  wordcloud2(.)
```
All hashtags and Twitter handles! Let's see what is most common if I **omit hashtags and handles**:
```{r}
tidy_tweets %>%
  count(word, sort = TRUE) %>%
  filter(
    word != "petco",
    substr(word, 1, 1) != "#", # omit hashtags
    substr(word, 1, 1) != "@"  # omit handles
  ) %>%
  mutate(word = reorder(word, n)) %>%
  head(20) %>%
  ggplot(aes(word, n, fill = word)) +
  geom_bar(stat = "identity") +
  ylab("Word count") +
  ggtitle("Top 20 Words in Tweets Containing 'petco'") +
  theme(legend.position = "none") +
  coord_flip()
```
### Term frequency
'TF-IDF' stands for Term Frequency, Inverse Document Frequency. It scores how important a word is to a particular document by combining how often the word appears in that document with how rare the word is across all documents in the collection.
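Concretely, for a term $t$ in a document $d$ (here, a single tweet) within a collection of $N$ documents, the usual formulation, which (as far as I can tell) matches what `tidytext::bind_tf_idf()` computes with a natural logarithm, is:

$$
\text{tf-idf}(t, d) = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \times \ln\!\left(\frac{N}{\lvert\{d' : t \in d'\}\rvert}\right)
$$

where $n_{t,d}$ is the number of times $t$ appears in $d$ and the denominator inside the logarithm counts how many documents contain $t$.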
Let's randomly select two tweets as examples:
```{r}
tmp <- tidy_tweets %>%
  select(id, word) %>%
  count(id, word) %>%
  bind_tf_idf(word, id, n) %>%
  group_by(id) %>%
  arrange(desc(tf_idf)) %>%
  slice(1:5) %>% # keep the top 5 words per tweet by tf-idf
  ungroup() %>%
  mutate(xOrder = n():1) %>%
  inner_join(select(tweets, text, id), by = "id")

plDF <- tmp %>%
  filter(id %in% c("1012025917953069056", "1012039377948758024")) %>% # select 2 tweet ids
  ungroup() %>%
  ggplot(aes(x = reorder(word, tf_idf), y = tf_idf, fill = id)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  facet_wrap(~ id, scales = "free", nrow = 1) +
  coord_flip() +
  labs(
    x = "Word",
    y = "TF-IDF",
    title = "Top TF-IDF Words per Tweet"
  ) +
  theme_bw()
plDF
```
Do the top TF-IDF words reflect what the original text is about? Let's take a look at the original tweets below:
```{r}
unique(
  tmp %>%
    filter(id %in% c("1012025917953069056", "1012039377948758024")) %>%
    .$text
)
```
They match quite well! To conclude, words with a high TF-IDF in a document tend to be words that are rare compared to other documents but not too rare overall, which makes them informative about what that document is about.
### Bigram
Oftentimes, looking at the most common bigrams (two-word phrases) is more instructive than looking at individual words. I follow the same logic to clean the text, then pair each word with the word that follows it.
```{r}
tidy_bigrams <- tweets %>%
  filter(!str_detect(text, "^RT")) %>% # filter out retweets
  mutate(text = str_replace_all(text, "https://t.co/[A-Za-z\\d]+|http://[A-Za-z\\d]+|&|<|>|RT|https|//t", "")) %>% # clean text before tokenization
  unnest_tokens(word, text, token = "regex", pattern = reg) %>%
  mutate(next_word = lead(word)) %>%
  filter(
    !word %in% stop_words$word, # remove stop words
    !next_word %in% stop_words$word,
    substr(word, 1, 1) != "@", # remove user handles to protect privacy
    substr(next_word, 1, 1) != "@",
    substr(word, 1, 1) != "#", # remove hashtags
    substr(next_word, 1, 1) != "#",
    str_detect(word, "[a-z]"), # remove words containing only numbers or symbols
    str_detect(next_word, "[a-z]")
  ) %>%
  filter(id == lead(id)) %>% # ensure bigrams do not cross from one tweet into the next
  unite(bigram, word, next_word, sep = " ") %>%
  select(bigram, id, created)

# collapse a plural variant into a single form
tidy_bigrams$bigram <- ifelse(tidy_bigrams$bigram == "adoption events", "adoption event", tidy_bigrams$bigram)

# plot
tidy_bigrams %>%
  count(bigram, sort = TRUE) %>%
  arrange(desc(n)) %>%
  head(10) %>%
  ggplot(aes(reorder(as.factor(bigram), n), n, fill = bigram)) +
  geom_bar(stat = "identity") +
  xlab("") +
  ylab("Word count") +
  ggtitle("Top 10 Bigrams in Tweets Containing 'petco'") +
  theme(legend.position = "none") +
  coord_flip()
```
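For reference, `tidytext` also has a built-in n-gram tokenizer that avoids the manual `lead()` pairing; a rough sketch of the same idea (without the hashtag and handle filtering applied above) might look like this:
```{r}
# Sketch: built-in bigram tokenizer (omits the hashtag/handle filtering used above)
tweets %>%
  filter(!str_detect(text, "^RT")) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  count(bigram, sort = TRUE) %>%
  head(10)
```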
## Sentiment Analysis
In recent years there has been a steady increase in interest from brands, companies, and researchers in sentiment analysis and its application to business analytics. From a text analytics point of view, sentiment analysis is essentially about understanding customers' attitude toward Petco and its polarity: whether it is positive, negative, or neutral.
Here I use the NRC sentiment dictionary to score each word for the presence of eight emotions and their corresponding positive/negative valence.
```{r}
sent_tweets <- tidy_tweets
# get sentiment score per word
sentText_nerc <- get_nrc_sentiment(sent_tweets$word)
sentText <- cbind(sent_tweets, sentText_nerc)
```
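To see the shape of what `get_nrc_sentiment()` returns, here is a small illustration on a couple of arbitrary words: one row per word, with counts for the eight NRC emotions plus `negative` and `positive` valence columns.
```{r}
# Illustration only: arbitrary example words
get_nrc_sentiment(c("love", "angry", "table"))
```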
Then we can see what sorts of emotions are dominant in the tweets:
```{r}
# plot sentiment
sentText %>%
  summarize(
    anger = sum(anger), anticipation = sum(anticipation), disgust = sum(disgust), fear = sum(fear),
    joy = sum(joy), sadness = sum(sadness), surprise = sum(surprise), trust = sum(trust),
    negative = sum(negative), positive = sum(positive)
  ) %>%
  gather(sentiment, count) %>%
  ggplot(aes(reorder(sentiment, -count), count, fill = sentiment)) +
  geom_bar(stat = "identity", show.legend = F) +
  labs(
    title = "Distribution of Emotion Categories for Tweets Containing 'petco'",
    x = "Sentiment",
    y = "Count"
  )
```
Mainly positive! This is good news for Petco.
Finally, let's see which words contribute to which emotions:
```{r}
# Comparison word cloud
# head(sentText, 10)
all <- c(
  paste(sentText$word[sentText$anger > 0], collapse = " "),
  paste(sentText$word[sentText$anticipation > 0], collapse = " "),
  paste(sentText$word[sentText$disgust > 0], collapse = " "),
  paste(sentText$word[sentText$fear > 0], collapse = " "),
  paste(sentText$word[sentText$joy > 0], collapse = " "),
  paste(sentText$word[sentText$sadness > 0], collapse = " "),
  paste(sentText$word[sentText$surprise > 0], collapse = " "),
  paste(sentText$word[sentText$trust > 0], collapse = " ")
)
all <- removeWords(all, stopwords("english"))

# create corpus
corpus <- Corpus(VectorSource(all))

# create term-document matrix and convert it to a matrix
tdm <- TermDocumentMatrix(corpus)
tdm <- as.matrix(tdm)

# keep words shorter than 11 characters so they fit nicely in the word cloud
tdm1 <- tdm[nchar(rownames(tdm)) < 11, ]

# add column names (one per emotion)
colnames(tdm1) <- c("anger", "anticipation", "disgust", "fear", "joy", "sadness", "surprise", "trust")

comparison.cloud(tdm1,
  random.order = FALSE,
  colors = c("#00B2FF", "red", "#FF0099", "#6600CC", "green", "orange", "blue", "brown"),
  title.size = 1, max.words = 100, scale = c(2, 0.4), rot.per = 0.4
)
```