-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathfestival-chatter-part-2-evaluating-band-popularity-from-bonnaroo-tweets.html
290 lines (245 loc) · 32.9 KB
/
festival-chatter-part-2-evaluating-band-popularity-from-bonnaroo-tweets.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Data Piques | Festival Chatter (Part 2) - Evaluating Band Popularity from Bonnaroo Tweets</title>
<link rel="shortcut icon" type="image/png" href="/favicon.png">
<link rel="shortcut icon" type="image/x-icon" href="/favicon.ico">
<link rel="stylesheet" href="/theme/css/screen.css" type="text/css" />
<link rel="stylesheet" href="/theme/css/pygments.css" type="text/css" />
<link rel="stylesheet" href="/theme/css/print.css" type="text/css" media="print" />
<meta name="generator" content="Pelican" />
</head>
<body>
<header>
<nav>
<ul>
<li><a href="">Home</a></li>
<li><a href="/archives.html">Archives</a></li>
<li><a href="http://www.ethanrosenthal.com">Home Page</a></li>
</ul>
</nav>
<div class="header_box">
<h1><a href="">Data Piques</a></h1>
</div>
</header>
<div id="wrapper">
<div id="content"> <h4 class="date">Sep 09, 2014</h4>
<article class="post">
<h2 class="title">
<a href="/festival-chatter-part-2-evaluating-band-popularity-from-bonnaroo-tweets.html" rel="bookmark" title="Permanent Link to "Festival Chatter (Part 2) - Evaluating Band Popularity from Bonnaroo Tweets"">Festival Chatter (Part 2) - Evaluating Band Popularity from Bonnaroo Tweets</a>
</h2>
<p>In my previous <a href="{{page.previous.url}}">post</a>, I wrote about how I collected tweets about the Bonnaroo Music and Arts Festival during the entirety of the festival. There are a wide range of questions that could be answered by this dataset, like</p>
<ul>
<li>Do people spell worse as they become more intoxicated throughout the night?</li>
<li>Does text <a href="http://en.wikipedia.org/wiki/Sentiment_analysis">sentiment</a> decline as people go more days without bathing?</li>
<li>Who in the world tweets from a laptop during a music festival?</li>
</ul>
<p>I would honestly love to answer the above questions (and plan to), but I will focus on the most obvious question for this post:</p>
<p><strong>Which band was most <a href="http://youtu.be/RNc45FTenhg">popular</a>?</strong></p>
<p>And while this question seems simple to answer, there are many reasons this blog post is so long. To start, we do not even have a decent definition of the question!</p>
<p>What does it mean for an artist to be the most popular as measured by tweets? For now, let's work off of the oldest rule of PR: "Any publicity is good publicity". In that case, we shall rank band popularity simply by the number of tweets that mention each artist. Let's try to do this and see what happens.</p>
<h2>Of Pythons and Pandas</h2>
<p>From the previous post, I have my dataset of Bonnaroo tweets sitting in a MongoDB database. I would like to get these tweets out of the database and into <a href="http://ipython.org/">IPython</a>, a software package for interactive computing in Python. I exported the database as a giant JSON file and then loaded it into IPython with a list comprehension.</p>
<p>The next step is to get the JSON record into <a href="http://pandas.pydata.org/">pandas</a>, a Python library used mainly for manipulating tabular data. The main object for dealing with tabular data, the DataFrame, nicely reads in such a JSON record.</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">pandas</span> <span class="kn">as</span> <span class="nn">pd</span>
<span class="kn">from</span> <span class="nn">pandas</span> <span class="kn">import</span> <span class="n">DataFrame</span><span class="p">,</span> <span class="n">Series</span>
<span class="n">path</span> <span class="o">=</span> <span class="s">'tweetCollection.json'</span>
<span class="n">record</span> <span class="o">=</span> <span class="p">[</span><span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">line</span><span class="p">)</span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="nb">open</span><span class="p">(</span><span class="n">path</span><span class="p">)]</span>
<span class="n">df</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="p">(</span><span class="n">record</span><span class="p">)</span>
</pre></div>
<p>If I type <code>df.count()</code> I see that, sure enough, all 157,600 tweets are present and that 8,656 of them contain location data.</p>
<div class="highlight"><pre><span class="n">_id</span> <span class="mi">157600</span>
<span class="n">created_at</span> <span class="mi">157600</span>
<span class="n">geo</span> <span class="mi">8656</span>
<span class="n">source</span> <span class="mi">157600</span>
<span class="n">text</span> <span class="mi">157600</span>
<span class="nl">dtype:</span> <span class="n">int64</span>
</pre></div>
<p>While watching the tweets stream in, I noticed that there are a lot of retweets. To me, these do not seem as "organic" as a bona-fide, original tweet. People and software can spam twitter all they want with retweets, but it is more difficult to spam with original tweets. So, I think a more robust measure of band popularity is <em>unique</em> tweets. Pandas allows us to easily investigate this:</p>
<div class="highlight"><pre><span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">]</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
</pre></div>
<div class="highlight"><pre><span class="n">count</span> <span class="mi">157600</span>
<span class="n">unique</span> <span class="mi">107773</span>
<span class="n">top</span> <span class="n">RT</span> <span class="err">@</span><span class="mi">502</span><span class="n">michael502</span><span class="o">:</span> <span class="n">Islam</span> <span class="n">The</span> <span class="n">Religion</span> <span class="n">Of</span> <span class="n">Truth</span><span class="p">...</span>
<span class="n">freq</span> <span class="mi">1938</span>
<span class="nl">Name:</span> <span class="n">text</span><span class="p">,</span> <span class="n">dtype</span><span class="o">:</span> <span class="n">object</span>
</pre></div>
<p>Of the 157,600 original tweets, only about two-thirds of them are unique tweets. And wait, what's that most popular tweet that is repeated 1,938 times?</p>
<div class="highlight"><pre><span class="k">print</span> <span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">]</span><span class="o">.</span><span class="n">describe</span><span class="p">()[</span><span class="s">'top'</span><span class="p">]</span>
</pre></div>
<div class="highlight"><pre><span class="n">RT</span> <span class="err">@</span><span class="mi">502</span><span class="n">michael502</span><span class="o">:</span> <span class="n">Islam</span> <span class="n">The</span> <span class="n">Religion</span> <span class="n">Of</span> <span class="n">Truth</span>
<span class="nl">http:</span><span class="c1">//t.co/BO7Sjw6pSl</span>
<span class="cp">#FathersDay #AFLDonsDees #Bonnaroo #Brasil2014 #WorldCup #Jewish #…</span>
</pre></div>
<p>Wow. Sure enough, if you go to the twitter page for <a href="https://twitter.com/502michael502">@502michael502</a>, you will see that random pro-Islam messages are tweeted and retweeted thousands of times with an assorted collection of hashtags containing trending and religious words. I guess Bonnaroo was popular enough to make it onto @502michael502's trending hashtags! And here I thought he was just a big jam band fan.</p>
<p>Ok, now we can try to remove retweets. We start by grabbing only unique tweets.</p>
<div class="highlight"><pre><span class="c"># Retain only unique text-unique tweets</span>
<span class="n">uniques</span> <span class="o">=</span> <span class="n">df</span><span class="o">.</span><span class="n">drop_duplicates</span><span class="p">(</span><span class="n">inplace</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span> <span class="n">cols</span><span class="o">=</span><span class="s">'text'</span><span class="p">)</span>
<span class="c"># I don't know how to make .startswith() case insensitve,</span>
<span class="c"># so check both cases:</span>
<span class="n">organics</span> <span class="o">=</span> <span class="n">uniques</span><span class="p">[</span> <span class="n">uniques</span><span class="p">[</span><span class="s">'text'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'RT'</span><span class="p">)</span><span class="o">==</span><span class="bp">False</span> <span class="p">]</span>
<span class="n">organics</span> <span class="o">=</span> <span class="n">organics</span><span class="p">[</span> <span class="n">organics</span><span class="p">[</span><span class="s">'text'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'rt'</span><span class="p">)</span><span class="o">==</span><span class="bp">False</span> <span class="p">]</span>
<span class="c"># In case RT was placed further in the text than the beginning.</span>
<span class="c"># Include spaces around ' RT ' to prevent grabbing words like start</span>
<span class="n">organics</span> <span class="o">=</span> <span class="n">organics</span><span class="p">[</span> <span class="n">organics</span><span class="p">[</span><span class="s">'text'</span><span class="p">]</span><span class="o">.</span><span class="n">str</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s">' RT '</span><span class="p">,</span> <span class="n">case</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><span class="o">==</span><span class="bp">False</span> <span class="p">]</span>
<span class="k">print</span> <span class="n">organics</span><span class="o">.</span><span class="n">count</span><span class="p">()</span>
</pre></div>
<div class="highlight"><pre><span class="n">_id</span> <span class="mi">93311</span>
<span class="n">created_at</span> <span class="mi">93311</span>
<span class="n">geo</span> <span class="mi">8537</span>
<span class="n">source</span> <span class="mi">93311</span>
<span class="n">text</span> <span class="mi">93311</span>
<span class="nl">dtype:</span> <span class="n">int64</span>
</pre></div>
<p>There we go: we have gone from 157,600 total tweets to 93,311 "organic" tweets. There is still more work that we could do to get more organic tweets. For example, I would argue that news media sources tweeting about artists at Bonnaroo are not a good measure of band popularity. Such tweets are more difficult to detect, though. One method could be to look at the source of the tweet - maybe tweets from cell phones are more likely to be individuals than media organizations? I will save this for another post because we still have a lot of work to do!</p>
<h2>Most BeautifulSoup in the <a href="http://youtu.be/5YIxpNPhAQE">Room</a></h2>
<p>Now that I have grouped together all of the tweets that we care to investigate, we must search for mentions of each Bonnaroo artist. But I am lazy. There are <em>189</em> different artists performing at Bonnaroo, and by no means do I feel like typing them all out.</p>
<p>Enter <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a>, a Python library for scraping websites. All I have to do is check out the band lineup on the Bonnaroo website, figure out which <code>div</code> elements correspond to the listed bands, and <em>BeautifulSoup</em> will grab the contents.</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">urllib2</span>
<span class="kn">from</span> <span class="nn">BeautifulSoup</span> <span class="kn">import</span> <span class="n">BeautifulSoup</span>
<span class="n">url</span><span class="o">=</span><span class="s">'http://lineup.bonnaroo.com/'</span>
<span class="n">ufile</span> <span class="o">=</span> <span class="n">urllib2</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">url</span><span class="p">)</span>
<span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">ufile</span><span class="p">)</span>
<span class="n">bandList</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'div'</span><span class="p">,{</span><span class="s">'class'</span><span class="p">:</span><span class="s">'ds-lineup ds-player'</span><span class="p">})</span><span class="o">.</span><span class="n">findAll</span><span class="p">(</span><span class="s">'a'</span><span class="p">)</span>
<span class="n">fout</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'bonnarooBandList.txt'</span><span class="p">,</span><span class="s">'w'</span><span class="p">)</span>
<span class="k">for</span> <span class="n">row</span> <span class="ow">in</span> <span class="n">bandList</span><span class="p">:</span>
<span class="n">band</span><span class="o">=</span><span class="n">row</span><span class="o">.</span><span class="n">renderContents</span><span class="p">()</span>
<span class="n">fout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">band</span> <span class="o">+</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="n">fout</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</pre></div>
<h2>With a Little Help from my (API) Friends</h2>
<p>When I wrote the above script, I thought I was done. Later on, I thought about the fact that people do not always call bands by their full name. For example, the Red Hot Chili Peppers are often abbreviated RHCP. I was amazed when I found that <a href="https://musicbrainz.org/">MusicBrainz</a>, an online music encyclopedia, not only keeps track of bands' aliases and mispellings, but MusicBrainz actually has an API for accessing this information. Even better, somebody created a <a href="http://python-musicbrainzngs.readthedocs.org/en/latest/">Python wrapper</a> for the API.</p>
<p>I also had to perform some "scrubbing" of the aliases that are retrieved from the MusicBrainz API. I consider that a band is "mentioned" in a tweet if all words in any of the band's aliases are present in the tweet text. For example, a match for both "arctic" and "monkeys" in the text would be a mention of "Arctic Monkeys". However, I do not want to miss a mention of "The Flaming Lips" because "the" is not included.</p>
<p>I ameliorated this issue by using <a href="http://www.nltk.org/">nltk</a>, a Natural Language Processing library. The library contains a list of English stopwords (common words like "the") which I used as a filter. <strong>Note</strong>: This may be an issue for bands like "The Head and the Heart" where the filter would leave behind "head" and "heart". Both these words could easily be in a tweet and not relate to the band.</p>
<p>The code below shows how I used the public API's and <em>nltk</em> in order to get searchable aliases.</p>
<div class="highlight"><pre><span class="kn">import</span> <span class="nn">musicbrainzngs</span> <span class="kn">as</span> <span class="nn">mbrainz</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">string</span>
<span class="kn">import</span> <span class="nn">re</span>
<span class="kn">import</span> <span class="nn">nltk</span>
<span class="kn">from</span> <span class="nn">nltk.corpus</span> <span class="kn">import</span> <span class="n">stopwords</span>
<span class="n">mbrainz</span><span class="o">.</span><span class="n">auth</span><span class="p">(</span><span class="n">username</span><span class="p">,</span><span class="n">password</span><span class="p">)</span> <span class="c"># Use a username and password</span>
<span class="n">mbrainz</span><span class="o">.</span><span class="n">set_useragent</span><span class="p">(</span><span class="n">program_version</span><span class="p">,</span><span class="n">email_address</span><span class="p">)</span> <span class="c"># Include name</span>
<span class="c"># of your program and email address</span>
<span class="c"># To be used for removing punctuation</span>
<span class="n">regex</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">'[</span><span class="si">%s</span><span class="s">]'</span> <span class="o">%</span> <span class="n">re</span><span class="o">.</span><span class="n">escape</span><span class="p">(</span><span class="n">string</span><span class="o">.</span><span class="n">punctuation</span><span class="p">))</span>
<span class="c">################################</span>
<span class="k">def</span> <span class="nf">clean_aliases</span><span class="p">(</span><span class="n">alias</span><span class="p">,</span> <span class="n">regex</span><span class="p">):</span>
<span class="sd">"""</span>
<span class="sd"> This function converts each alias to lowercase, removes</span>
<span class="sd"> punctuation, and removes stop words. The output is a</span>
<span class="sd"> list containing the remaining words.</span>
<span class="sd"> """</span>
<span class="n">alias</span> <span class="o">=</span> <span class="n">alias</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' &amp;'</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span> <span class="c"># Remove ampersands</span>
<span class="n">alias</span> <span class="o">=</span> <span class="n">regex</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="s">''</span><span class="p">,</span> <span class="n">alias</span><span class="p">)</span> <span class="c"># Remove punctuation</span>
<span class="n">alias_words</span> <span class="o">=</span> \
<span class="p">[</span><span class="n">w</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">alias</span><span class="o">.</span><span class="n">split</span><span class="p">()</span> <span class="k">if</span> <span class="n">w</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">stopwords</span><span class="o">.</span><span class="n">words</span><span class="p">(</span><span class="s">'english'</span><span class="p">)]</span>
<span class="k">return</span> <span class="n">alias_words</span>
<span class="c">################################</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'bonnarooBandList.txt'</span><span class="p">,</span><span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">:</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'bonnarooAliasList.json'</span><span class="p">,</span><span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
<span class="n">aliasDict</span><span class="o">=</span><span class="p">{}</span> <span class="c"># Initialize alias dictionary</span>
<span class="k">for</span> <span class="n">band</span> <span class="ow">in</span> <span class="n">fin</span><span class="p">:</span>
<span class="n">band</span> <span class="o">=</span> <span class="n">band</span><span class="o">.</span><span class="n">rstrip</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
<span class="c"># Remove ampersands</span>
<span class="n">band_query</span> <span class="o">=</span> <span class="n">band</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' &amp;'</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span>
<span class="c"># Remove punctuation</span>
<span class="n">band_query</span> <span class="o">=</span> <span class="n">regex</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="s">''</span><span class="p">,</span> <span class="n">band_query</span><span class="p">)</span>
<span class="c"># Only grab first result</span>
<span class="n">result</span> <span class="o">=</span> \
<span class="n">mbrainz</span><span class="o">.</span><span class="n">search_artists</span><span class="p">(</span><span class="n">artist</span><span class="o">=</span><span class="n">band_query</span><span class="p">,</span> <span class="n">limit</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c"># Remove stopwords</span>
<span class="n">band_query</span> <span class="o">=</span> <span class="n">clean_aliases</span><span class="p">(</span><span class="n">band_query</span><span class="p">,</span> <span class="n">regex</span><span class="p">)</span>
<span class="c">#Initialize with stripped version of name</span>
<span class="c"># listed on Bonnaroo website</span>
<span class="n">aliasList</span> <span class="o">=</span> <span class="p">[</span><span class="n">band_query</span><span class="p">]</span>
<span class="k">try</span><span class="p">:</span>
<span class="k">for</span> <span class="n">alias</span> <span class="ow">in</span> <span class="n">result</span><span class="p">[</span><span class="s">'artist-list'</span><span class="p">][</span><span class="mi">0</span><span class="p">][</span><span class="s">'alias-list'</span><span class="p">]:</span>
<span class="n">alias</span> <span class="o">=</span> <span class="n">clean_aliases</span><span class="p">(</span><span class="n">alias</span><span class="p">[</span><span class="s">'alias'</span><span class="p">],</span> <span class="n">regex</span><span class="p">)</span>
<span class="n">aliasList</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">alias</span><span class="p">)</span> <span class="c"># Build alias List</span>
<span class="k">except</span><span class="p">:</span> <span class="c"># Some artists do not return aliases</span>
<span class="k">pass</span> <span class="c"># So don't do anything!</span>
<span class="n">aliasDict</span><span class="p">[</span><span class="n">band</span><span class="p">]</span> <span class="o">=</span> <span class="n">aliasList</span>
<span class="n">json</span><span class="o">.</span><span class="n">dump</span><span class="p">(</span><span class="n">aliasDict</span><span class="p">,</span><span class="n">fout</span><span class="p">)</span>
<span class="n">fout</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="n">fin</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</pre></div>
<h2>The Final Histogram</h2>
<p>Ok, so we now have a dictionary of <em>easily searchable</em> aliases for all artists that performed at Bonnaroo. All we have to do now is go through each tweet and see if any of the aliases for any of the artists are mentioned. We can then build a histogram of "mentions" for each artist by adding up all of the mentions in all of the tweets for a given artist.</p>
<p>In the code below, I do just this. By running the function at the bottom, <code>get_bandPop</code>, we get a <em>pandas</em> Series in return that contains each artist and the number of times they were mentioned in all of the tweets.</p>
<div class="highlight"><pre><span class="c"># To be used for removing punctuation</span>
<span class="n">regex</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">compile</span><span class="p">(</span><span class="s">'[</span><span class="si">%s</span><span class="s">]'</span> <span class="o">%</span> <span class="n">re</span><span class="o">.</span><span class="n">escape</span><span class="p">(</span><span class="n">string</span><span class="o">.</span><span class="n">punctuation</span><span class="p">))</span>
<span class="k">def</span> <span class="nf">clean_sentence</span><span class="p">(</span><span class="n">sentence</span><span class="p">):</span>
<span class="sd">"""</span>
<span class="sd"> Converts each sentence to lowercase and removes</span>
<span class="sd"> punctuation.</span>
<span class="sd"> """</span>
<span class="n">sentence</span> <span class="o">=</span> <span class="n">sentence</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">' &amp;'</span><span class="p">,</span> <span class="s">''</span><span class="p">)</span> <span class="c"># Remove ampersands</span>
<span class="n">sentence</span> <span class="o">=</span> <span class="n">regex</span><span class="o">.</span><span class="n">sub</span><span class="p">(</span><span class="s">''</span><span class="p">,</span> <span class="n">sentence</span><span class="p">)</span> <span class="c"># Remove punctuation</span>
<span class="c"># sentence_words = [w for w in sentence.split() if w not in stopwords.words('english')]</span>
<span class="k">return</span> <span class="n">sentence</span>
<span class="k">def</span> <span class="nf">find_mention</span><span class="p">(</span><span class="n">sentence</span><span class="p">,</span> <span class="n">phrase_list</span><span class="p">):</span>
<span class="sd">"""</span>
<span class="sd"> Takes a phase_list, which is a list of phrases where</span>
<span class="sd"> each phrase corresponds to a list of the words in the phrase, and</span>
<span class="sd"> checks to see whether all the words of any of the phrases are</span>
<span class="sd"> present in "sentence".</span>
<span class="sd"> """</span>
<span class="k">for</span> <span class="n">words</span> <span class="ow">in</span> <span class="n">phrase_list</span><span class="p">:</span>
<span class="n">words</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">words</span><span class="p">)</span>
<span class="k">if</span> <span class="n">words</span><span class="o">.</span><span class="n">issubset</span><span class="p">(</span><span class="n">sentence</span><span class="p">):</span>
<span class="k">return</span> <span class="bp">True</span>
<span class="k">return</span> <span class="bp">False</span> <span class="c"># None of the word lists were subsets</span>
<span class="k">def</span> <span class="nf">check_each_alias</span><span class="p">(</span><span class="n">sentence</span><span class="p">,</span> <span class="n">alias_dict</span><span class="p">):</span>
<span class="sd">"""</span>
<span class="sd"> Checks to see whether any of the aliases for</span>
<span class="sd"> each band mentioned in alias_dict are mentioned</span>
<span class="sd"> in "sentence".</span>
<span class="sd"> band_bool is a dictionary that contains all band</span>
<span class="sd"> names as keys and True or False as values corresponding</span>
<span class="sd"> to whether or not the band was mentioned in the sentence.</span>
<span class="sd"> """</span>
<span class="n">band_bool</span><span class="o">=</span><span class="p">{}</span>
<span class="n">sentence</span> <span class="o">=</span> <span class="nb">set</span><span class="p">(</span><span class="n">clean_sentence</span><span class="p">(</span><span class="n">sentence</span><span class="p">)</span><span class="o">.</span><span class="n">split</span><span class="p">())</span>
<span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">alias_dict</span><span class="o">.</span><span class="n">iteritems</span><span class="p">():</span>
<span class="n">band_bool</span><span class="p">[</span><span class="n">k</span><span class="p">]</span> <span class="o">=</span> <span class="n">find_mention</span><span class="p">(</span><span class="n">sentence</span><span class="p">,</span> <span class="n">v</span><span class="p">)</span>
<span class="k">return</span> <span class="n">pd</span><span class="o">.</span><span class="n">Series</span><span class="p">(</span><span class="n">band_bool</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">build_apply_fun</span><span class="p">(</span><span class="n">alias_dict</span><span class="p">):</span>
<span class="sd">"""</span>
<span class="sd"> Turn check_each_alias into an anonymous function.</span>
<span class="sd"> """</span>
<span class="n">apply_fun</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span> <span class="p">:</span> <span class="n">check_each_alias</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">alias_dict</span><span class="p">)</span>
<span class="k">return</span> <span class="n">apply_fun</span>
<span class="k">def</span> <span class="nf">get_bandPop</span><span class="p">(</span><span class="n">df</span><span class="p">,</span> <span class="n">alias_dict</span><span class="p">):</span>
<span class="sd">"""</span>
<span class="sd"> For tweet DataFrame input "df", build histogram of of mentions</span>
<span class="sd"> for each band in alias_dict.</span>
<span class="sd"> """</span>
<span class="n">bandPop</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'text'</span><span class="p">]</span><span class="o">.</span><span class="n">apply</span><span class="p">(</span><span class="n">build_apply_fun</span><span class="p">(</span><span class="n">alias_dict</span><span class="p">),</span> <span class="n">alias_dict</span><span class="p">)</span>
<span class="n">bandPop</span> <span class="o">=</span> <span class="n">bandPop</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">bandPop</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">ascending</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
<span class="k">return</span> <span class="n">bandPop</span>
</pre></div>
<p>And now, finally, all we have to do is type <code>bandPop[:10].plot(kind='bar')</code> (and maybe fiddle around in <em>matplotlib</em> for an hour adjusting properties of the figure) and we get a histogram of mentions for the top ten most popular bands at Bonnaroo:</p>
<p><img alt="Top Ten Bands" src="/assets/img/bandPop_top10.png" /></p>
<p>And of course it's Kanye! Is anybody surprised?</p>
<p>Wow, that was a lot of work for one, measly histogram! However, we now have a bunch of data analytical machinery that we can use to delve deeper into this dataset. In my <a href="{{page.next.url}}">next post</a>, I'll do just that!</p>
<div class="clear"></div>
<div class="info">
<a href="/festival-chatter-part-2-evaluating-band-popularity-from-bonnaroo-tweets.html">posted at 00:00</a>
</div>
</article>
<div class="clear"></div>
<footer>
<p>
<a href="https://github.com/jody-frankowski/blue-penguin">Blue Penguin</a> Theme
·
Powered by <a href="http://getpelican.com">Pelican</a>
</footer>
</div>
<div class="clear"></div>
</div>
</body>
</html>