This is a project based on Quil and Unfolding for Processing for visualizing urban multilingualism through Twitter data. The main goal of this project is to collect a Twitter corpus that provides detailed georeference and language information for tweets located in urban areas. Alongside the dataset, we are releasing an application that allows visualizing the data on a map in different ways. What is novel about this dataset is that data collection was restricted to four selected urban areas: in alphabetical order, Amsterdam, Antwerp, Berlin and Brussels.
The general dataset consists of a collection of tweets retrieved directly from the Twitter Streaming API since December 2014. Given the nature of the project, only tweets with geolocation information were collected. According to Leetaru et al. 2012, only 1.6% of the Twitter stream actually ships with geolocation information. This is a heavy constraint on top of the general Twitter streaming limit of ~1%. Currently, the dataset has grown to over 2 million tweets, with approximately the following proportions (as of 7/6/2015).
City | Number of tweets |
---|---|
Amsterdam | 679205 |
Antwerp | 415813 |
Berlin | 691998 |
Brussels | 497667 |
For the Berlin dataset, non-exhaustive bot detection was performed semi-manually with the aid of Bot or Not? and a set of heuristics based on profile information. A preselection of candidates was made by sorting user ids by (i) total number of tweets in the database and (ii) total number of statuses. The rationale behind this strategy is twofold:
- First, bots are known to have a more productive tweeting behaviour than humans (Chu et al. 2010).
- Secondly, bots are known to have a more evenly distributed tweeting behaviour across time than humans. This means that during periods of the week with less human tweeting activity (nights and weekends), proportionally more bot-authored tweets are captured by the stream.
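The actual preselection was semi-manual, but the ranking step can be sketched as follows. This is a minimal illustration, assuming tweets are stored as records with a `user_id` field (the field name is an assumption, not the dataset's actual schema):

```python
from collections import Counter

def preselect_bot_candidates(tweets, top_n=50):
    """Rank user ids by number of collected tweets; the most prolific
    accounts are inspected first as bot candidates."""
    counts = Counter(t["user_id"] for t in tweets)
    return [uid for uid, _ in counts.most_common(top_n)]

# Toy example: account "b1" tweets far more often than the others.
tweets = [{"user_id": "b1"}] * 40 + [{"user_id": "h1"}] * 5 + [{"user_id": "h2"}] * 3
candidates = preselect_bot_candidates(tweets, top_n=2)
```

The same ranking can be repeated on the `statuses_count` profile field to cover criterion (ii).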
Once a sufficient number of known users had been collected, a parallel tweet collection method was applied. This consisted of selectively retrieving tweets for the known ids, using the RESTful API to mine user timelines.
Currently there are datasets available for Berlin, Brussels, Antwerp and Amsterdam.
Due to the Twitter API Terms of Service, only the tweet ids and the results of the language identification postprocessing are released.
To fetch the data, run this on your command line:
git clone https://github.com/emanjavacas/urban_dataset
You can use the REST API of Twitter to retrieve the actual tweets and locations. We plan to publish a script that does precisely this for you.
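Until that script is published, re-hydrating the tweets is straightforward to sketch: the REST lookup endpoint accepts at most 100 tweet ids per request, so the id list has to be split into batches. The batching below is runnable; the API call itself is only indicated in a comment, since it requires credentials and a client library of your choice:

```python
def chunks(ids, size=100):
    """Split the id list into batches of at most `size` ids, the
    per-request limit of the REST lookup endpoint."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

tweet_ids = list(range(250))  # stand-in for the released tweet ids
batches = chunks(tweet_ids)
# Each batch would then be passed to the REST API via your client
# library, e.g. something like api.statuses_lookup(batch).
```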
The visualization tool consists of a menu frame and the actual visualization frame. The menu frame allows selecting the specific settings in which the visualization will take place.
Settings are best navigated using the panel on the left-hand side of the menu.
The visualization doesn’t rely on the actual tweet data (which would be quite computationally heavy) but rather on a so-called grid file. This file is an aggregation of datapoints into cells of a given size. As a result, each cell contains a series of numbers indicating the amount of tweets written in each language. A number of precompiled grid files are shipped with the program, and a future version will provide functionality for creating such grid files from raw Twitter JSON data.
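The aggregation idea behind a grid file can be illustrated as follows. This is only a sketch of the concept: the application's actual grid file format, cell size and field names are not documented here.

```python
from collections import defaultdict

def build_grid(points, cell_size=0.01):
    """Aggregate (lon, lat, lang) datapoints into a grid: each cell,
    keyed by integer grid coordinates, maps language codes to counts."""
    grid = defaultdict(lambda: defaultdict(int))
    for lon, lat, lang in points:
        cell = (int(lon // cell_size), int(lat // cell_size))
        grid[cell][lang] += 1
    return grid

# Three nearby tweets around central Brussels fall into the same cell.
points = [(4.351, 50.846, "fr"), (4.352, 50.846, "nl"), (4.351, 50.847, "fr")]
grid = build_grid(points)
```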
A description of each grid file is shown inline at the bottom of the menu frame.
You can also select the width and height of the visualization frame. The two remaining options, loc? and filter, are somewhat more specific. If loc? is activated, the exact coordinates the mouse is pointing at will be shown on the map. filter influences the number of languages the user can choose from at run time: a higher value will prevent rare languages from appearing in the dropdown list. It doesn’t affect the application’s behaviour in monolingual mode.
There are three visualization modes: monolingual, bilingual and multilingual.
Visualization is carried out in a heat-map fashion. Color is mapped to the total number of tweets written in a given language, ranging from lighter to darker with an increasing number of tweets. A slider ALPHA controls the transparency. Another, RED, controls the amount of red being plotted; it can be used to shift the color range in which the heat map moves. A third and last slider, BETA, can be used to highlight and enhance the differences across cells (see the Sigmoid section, still to be written, for an explanation). Additionally, a dropdown list allows the user to select the current language.
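Pending the Sigmoid section, here is a rough illustration of how a parameter like BETA could enhance differences across cells: passing normalized counts through a logistic curve pushes mid-range values apart. The exact formula the application uses is not documented here, so this is an assumption about the general technique, not the implementation.

```python
import math

def enhance(value, beta):
    """Map a normalized cell value in [0, 1] through a logistic curve
    centered at 0.5; a larger beta exaggerates differences around the
    middle of the range."""
    return 1.0 / (1.0 + math.exp(-beta * (value - 0.5)))

# With a high beta, the nearby values 0.4 and 0.6 are pushed much
# further apart, making the corresponding cells easier to distinguish.
low, high = enhance(0.4, 10), enhance(0.6, 10)
```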
The purpose of the bilingual visualization mode is to gain insights into the relative proportion of one language with respect to a second one. Two dropdown lists allow the selection of language one and language two. A set of sliders, similar to those in the monolingual setting, is available. Language one will be mapped to the lighter colour, whereas language two will be displayed darker.
In the multilingual setting, a lighter colour is mapped to higher cell values. The meaning of each cell value can be tuned with the mode option, which is available both in the menu frame and at run time in the form of a dropdown list.
Once all settings are selected, the application can be run by clicking the init button.
Language detection was carried out following [Lui & Baldwin 2014], who found that a majority-vote approach combining langid.py, CLD2 and LangDetect consistently outperformed any individual system they considered (see the paper for more details).
Package | Coverage | Other |
---|---|---|
LDIG | 17 languages | Twitter-specific |
langid.py | 97 languages | |
CLD2 | > 80 languages | |
LangDetect | 53 languages | |
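The combination step can be sketched as a simple majority vote over the per-tweet predictions of the three systems. This is a minimal illustration of the voting idea, not Lui & Baldwin's exact procedure; tie-breaking here simply favours the first prediction seen:

```python
from collections import Counter

def majority_language(predictions):
    """Pick the language predicted by most identifiers for one tweet;
    ties are broken in favour of the earliest prediction in the list."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]

# Hypothetical outputs of langid.py, CLD2 and LangDetect for one tweet:
lang = majority_language(["nl", "nl", "fr"])
```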
Several libraries were employed. All of them are part of the JVM ecosystem and were assembled into uniform Clojure code by taking advantage of the Java-interop facilities that Clojure offers.
- Quil (depends on Processing)
- Unfolding Maps
- ControlP5
- Seesaw (based on Swing)
The easiest way to run the application is to download the executable jar. Make sure that you have at least version 7 of the JDK installed by entering this in your command line:
java -version
javac -version
Double-clicking the downloaded file should work; otherwise, try running it from the command line:
java -jar path/to/urban-tweeters..jar
If you want to build the app yourself, you are going to need a couple of things:
- A Clojure installation.
- The easiest way of running Clojure code is using Leiningen.
- Unfortunately, some of the dependencies are not available from Clojars and won’t be automatically pulled by Leiningen. The workaround is to use the lein-localrepo plugin.
- Download the jars for unfolding, controlp5, log4j, json4proc and glgraphics and install them locally following the lein-localrepo instructions.
The application has been reported to run on the vast majority of Mac OS versions and on Windows. More concretely, it has been tested on the following operating systems:
OS | Processor | Memory |
---|---|---|
OS X Yosemite | 2.7 GHz Intel Core i5 | 8 GB |
Ubuntu 14.04 | 3.1 GHz Intel Core i5 | 8 GB |
Windows 7 | 2.6 GHz Intel Core i5 | 8 GB |
If you have any trouble trying to run the application, I’d be happy to hear about it through an Issue.
There is a known bug that affects (at least some) computers running Ubuntu 15.04: the application starts, but any attempt to close the visualization frame results in a core dump, meaning that it won’t close. In any case, check that you have a JDK version no older than 7.
- Lui, M. & Baldwin, T. Accurate Language Identification of Twitter Messages. EACL 2014.
- Manjavacas, E. & Verhoeven, B. Mapping Urban Multilingualism through Twitter. DH Benelux 2015.
Copyright © 2015 Enrique Manjavacas