A Hadoop/R environment for analyzing the structure and dynamics of Twitter communities. It is currently used to study how the ideological communities that form around Spain's major political parties interact with each other and with the media.
Tweets are piped as JSON into Hadoop's filesystem (HDFS) via a custom Flume source, which uses Twitter4j to follow a number of accounts and track a list of keywords. Hive is set up to create daily partitioned tables for the collected tweets. At periodic intervals, HiveQL queries aggregate, summarize and extract the information required to build the social network of retweets and mentions, and to associate content with each node in that network. In R, the igraph package is then used to analyze the communities in the tweet network, to produce time series capturing the rise and fall of popular content (e.g. hashtags), and to analyze how communities and the media influence each other.
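The edge-extraction step described above can be illustrated with a minimal sketch. It is written in Python as a stand-in for the HiveQL aggregation, not as the project's actual queries. It assumes each collected record is one JSON object per line using the classic Twitter REST API field names (`user.screen_name`, `retweeted_status`, `entities.user_mentions`); the function names `extract_edges` and `build_edge_weights` are hypothetical. As a design choice of this sketch, a retweet contributes only a retweet edge, because its mention list normally repeats the original author.

```python
import json
from collections import Counter


def extract_edges(tweet):
    """Return a list of (source, target, kind) edges for one tweet dict.

    Field names are assumed to follow the classic Twitter JSON layout.
    """
    source = (tweet.get("user") or {}).get("screen_name")
    if not source:
        return []

    retweeted = tweet.get("retweeted_status")
    if retweeted:
        # Retweet: one edge from the retweeter to the original author.
        target = (retweeted.get("user") or {}).get("screen_name")
        return [(source, target, "retweet")] if target else []

    # Otherwise: one mention edge per mentioned account, skipping self-mentions.
    edges = []
    mentions = (tweet.get("entities") or {}).get("user_mentions") or []
    for mention in mentions:
        target = mention.get("screen_name")
        if target and target != source:
            edges.append((source, target, "mention"))
    return edges


def build_edge_weights(json_lines):
    """Count how often each (source, target, kind) edge occurs."""
    weights = Counter()
    for line in json_lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records instead of failing the batch
        if isinstance(tweet, dict):
            weights.update(extract_edges(tweet))
    return weights
```

The resulting weighted edge list could then be written out as a table and loaded in R, for example with igraph's `graph_from_data_frame`, before running community detection.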