---
title: "4 years of The Hacker News, in 5 charts"
output: html_document
---
## Introduction
[Hacker News](https://news.ycombinator.com/) is one of my favorite sites for catching up on technology and startup news, but navigating the minimalistic website can sometimes be tedious. In this post, I want to show how this social news site can be analyzed, in as non-technical a fashion as I can, present some initial results, and sketch where we might take the analysis next.
To avoid dealing with SQL, I downloaded the Hacker News dataset from [David Robinson's website](http://varianceexplained.org/); it includes one million Hacker News article titles from September 2013 to June 2017.
To begin, let's visualize the most common words in Hacker News titles.
```{r global_options, include=FALSE}
knitr::opts_chunk$set(echo=FALSE, warning=FALSE, message=FALSE)
```
```{r}
library(tidyverse)  # dplyr, readr, and friends
library(lubridate)  # round_date() for monthly bucketing
library(tidytext)   # unnest_tokens(), stop_words, get_sentiments()
library(stringr)    # string helpers
library(ggplot2)    # plotting (also attached by tidyverse)
library(ggthemes)   # theme_fivethirtyeight()
library(reshape2)   # acast() for the comparison cloud matrix
library(wordcloud)  # comparison.cloud()
library(tidyr)      # separate(), unite(), spread() (also attached by tidyverse)
library(igraph)     # graph_from_data_frame()
library(ggraph)     # network visualization
```
```{r}
hackernews <- read_csv("stories_1000000.csv.gz") %>%
  mutate(time = as.POSIXct(time, origin = "1970-01-01"),
         month = round_date(time, "month"))
```
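As a quick sanity check, here is a peek at the three fields used throughout this post (the title text plus the parsed timestamp and its monthly rounding):
```{r}
# Peek at the columns the rest of the analysis relies on.
hackernews %>%
  select(time, month, title) %>%
  head()
```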
## Some initial simple exploration
Before we get into the statistical analysis, the first step is to look at the most frequent words that appeared in Hacker News titles from September 2013 to June 2017.
```{r}
hackernews %>%
  unnest_tokens(word, title) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE) %>%
  filter(n > 8000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
  labs(title = "The Most Common Words in Hacker News Titles") +
  theme_fivethirtyeight()
```
For the most part, this is a fairly standard list of common words for a site like Hacker News. The top word is "hn" because "Ask HN" and "Show HN" are part of the site's posting conventions. The next most frequent words, such as "google", "data", "app", "web", and "startup", are all what we would expect from a social news site focused on technology.
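As a quick check (a sketch, not one of the five charts), we can count how many titles actually carry these prefixes; the exact numbers depend on the dataset snapshot:
```{r}
# Tally titles by their "Ask HN" / "Show HN" prefix, which explains
# why "hn" tops the single-word counts above.
hackernews %>%
  mutate(prefix = case_when(
    str_detect(title, regex("^ask hn", ignore_case = TRUE))  ~ "Ask HN",
    str_detect(title, regex("^show hn", ignore_case = TRUE)) ~ "Show HN",
    TRUE ~ "other"
  )) %>%
  count(prefix, sort = TRUE)
```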
```{r}
tidy_hacker <- hackernews %>%
  unnest_tokens(word, title) %>%
  anti_join(stop_words)
```
```{r}
tidy_hacker %>%
  count(word, sort = TRUE)
```
## Simple Sentiment Analysis
Let's address the topic of sentiment analysis. Sentiment analysis detects the sentiment of a body of text in terms of polarity (positive or negative). Used at scale, it can show how people feel about the topics that matter to you.
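For reference, the Bing lexicon that ships with tidytext is just a two-column table mapping English words to a positive or negative label. A quick peek (which entries appear depends on the lexicon version installed):
```{r}
# The Bing lexicon: word -> positive/negative.
get_sentiments("bing") %>%
  filter(word %in% c("free", "win", "death", "problem"))
```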
We can analyze the word counts that contribute to each sentiment: from the Hacker News titles, we find out how much each word contributed to positive or negative sentiment.
```{r}
bing_word_counts <- tidy_hacker %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  ungroup()
bing_word_counts
```
```{r}
bing_word_counts %>%
  group_by(sentiment) %>%
  top_n(10) %>%
  ungroup() %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, scales = "free_y") +
  labs(title = "The Most Common Positive and Negative Words in Hacker News Titles",
       y = "Contribution to sentiment", x = NULL) +
  coord_flip() +
  theme_fivethirtyeight()
```
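Because we rounded every timestamp to the month it falls in, the same join also lets us track net sentiment over time. Here is a sketch, kept as a table rather than a sixth chart:
```{r}
# Net sentiment per month: positive minus negative word counts.
tidy_hacker %>%
  inner_join(get_sentiments("bing")) %>%
  count(month, sentiment) %>%
  spread(sentiment, n, fill = 0) %>%
  mutate(net = positive - negative) %>%
  arrange(month)
```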
A word cloud is a good way to spot trends and patterns that would otherwise be hard to see in a tabular format. We can also compare the most frequent positive and negative words in a single comparison cloud.
```{r}
tidy_hacker %>%
  inner_join(get_sentiments("bing")) %>%
  count(word, sentiment, sort = TRUE) %>%
  acast(word ~ sentiment, value.var = "n", fill = 0) %>%
  comparison.cloud(colors = c("#F8766D", "#00BFC4"),
                   max.words = 100)
```
## Relationship between words
We often want to understand the relationships between words in a document. What sequences of words are common across the text? Given a sequence of words, what word is most likely to follow? Which words have the strongest relationship with each other? Many interesting text analyses are based on these relationships. When we examine pairs of two consecutive words, they are called "bigrams".
```{r}
hacker_bigrams <- hackernews %>%
  unnest_tokens(bigram, title, token = "ngrams", n = 2)
hacker_bigrams
```
```{r}
bigrams_separated <- hacker_bigrams %>%
  separate(bigram, c("word1", "word2"), sep = " ")
bigrams_filtered <- bigrams_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
bigram_counts <- bigrams_filtered %>%
  count(word1, word2, sort = TRUE)
bigram_counts
```
```{r}
bigrams_united <- bigrams_filtered %>%
  unite(bigram, word1, word2, sep = " ")
bigrams_united
```
```{r}
bigram_top_counts <- bigrams_united %>%
  count(bigram) %>%
  filter(n > 1000)
ggplot(bigram_top_counts, aes(x = reorder(bigram, n), y = n)) +
  geom_col() +
  labs(title = "The Most Common Bigrams in Hacker News") +
  coord_flip() +
  theme_fivethirtyeight()
```
The most common bigram in the Hacker News data is "machine learning", with "silicon valley" in second place.
The challenge in analyzing text data, as mentioned earlier, is in understanding what the words mean. The word "deep" means something different when paired with "water" than when paired with "learning". As a result, a simple summary of word counts will likely be misleading unless the analysis relates each word to the other words that appear alongside it, rather than assuming words are chosen independently.
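We can check this on our own data: filtering the separated bigrams for a first word of "deep" shows which second words dominate. A quick sketch; in this corpus we would expect "learning" to top the list:
```{r}
# What actually follows "deep" in Hacker News titles?
bigrams_separated %>%
  filter(word1 == "deep") %>%
  count(word2, sort = TRUE) %>%
  head(10)
```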
```{r}
bigram_graph <- bigram_counts %>%
  filter(n > 600) %>%
  graph_from_data_frame()
bigram_graph
```
## Networks of words
Word network analysis is one method for encoding the relationships between words in a text and constructing a network of the linked words. The technique rests on the assumption that language and knowledge can be modeled as networks of words and the relations between them.
For the Hacker News data, we can visualize some details of the text structure. For example, we can see pairs or triplets that form common short phrases ("social media network" or "neural networks").
```{r}
set.seed(2017)
ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point(color = "plum4", size = 5) +
  geom_node_text(aes(label = name), vjust = 1.8) +
  labs(title = "Word Network in Hacker News Dataset Titles") +
  theme_fivethirtyeight()
```
This type of network analysis mainly shows us the important nouns in a text and how they are related. The resulting graph gives a quick visual summary of the text and points to the most relevant excerpts.
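One way to go beyond eyeballing the graph is to rank nodes by degree, that is, by how many distinct bigram partners each word has. A sketch using igraph's degree():
```{r}
# The ten most connected words in the bigram graph.
sort(degree(bigram_graph), decreasing = TRUE)[1:10]
```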
Once we have the capability to automatically derive insights from text analytics, we can translate those insights into actions.
No structured survey data predicts customer behavior as well as the actual voice of the customer captured in text comments and messages!