Hot news aggregator based on tweets from leading news agencies
The tool outputs only important and breaking news.
The tool works as follows:
- Connects to the Twitter API and downloads the tweets posted in the last 10 minutes by a list of Twitter handles.
- Builds a dataframe from the 100 most recent tweets (or from the last hour, if there are not enough tweets).
- Preprocesses the tweets: tokenizes the text, parses dates and times, and calculates a retweets-per-second ratio.
- Runs TF-IDF on the tokenized tweet texts.
- Calculates a similarity matrix (cosine similarity).
- Clusters the tweets (Affinity Propagation by default; K-means and Spectral Clustering are the other options).
- Checks whether there are potential "hot news" clusters, based on whether a) the reporting agencies are diverse and plentiful, b) the tweets contain breaking-news keywords, and c) the retweet ratio is high enough.
- Chooses the best cluster and the best tweet based on retweet ratio.
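
The steps above can be illustrated with a minimal, standard-library-only sketch. It is not the tool's actual code: the function names, the keyword set, the thresholds, and the rule that a cluster must pass all three checks are assumptions made for illustration. Clustering is replaced by a simple greedy similarity threshold as a stand-in for Affinity Propagation, and the Twitter API download is replaced by a few hard-coded sample tweets.

```python
import math
from collections import Counter
from datetime import datetime, timedelta

BREAKING_KEYWORDS = {"breaking", "urgent"}  # assumed keyword list


def retweets_per_second(retweets, created_at, now):
    """Retweet count divided by tweet age in seconds (age floored at 1 s)."""
    age = max((now - created_at).total_seconds(), 1.0)
    return retweets / age


def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tweet."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def cluster_by_threshold(vectors, threshold=0.3):
    """Stand-in for Affinity Propagation: a tweet joins the first cluster
    whose first member is similar enough, otherwise it starts a new one."""
    clusters = []
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def pick_hot_tweet(tweets, clusters, min_agencies=3, min_ratio=0.5):
    """Return the top tweet of the best qualifying cluster, or None.
    Assumption: a cluster must pass all three checks to qualify."""
    best = None
    for cluster in clusters:
        members = [tweets[i] for i in cluster]
        agencies = {t["handle"] for t in members}
        has_keyword = any(BREAKING_KEYWORDS & set(t["tokens"]) for t in members)
        mean_ratio = sum(t["rt_per_sec"] for t in members) / len(members)
        if len(agencies) >= min_agencies and has_keyword and mean_ratio >= min_ratio:
            if best is None or mean_ratio > best[0]:
                best = (mean_ratio, max(members, key=lambda t: t["rt_per_sec"]))
    return best[1] if best else None


# Sample data: (handle, text, retweets, age in seconds); values are made up.
now = datetime(2024, 1, 1, 12, 0, 0)
raw = [
    ("Reuters", "breaking earthquake hits japan", 300, 300),
    ("AP", "breaking strong earthquake japan", 600, 200),
    ("BBCBreaking", "breaking earthquake japan tsunami", 100, 100),
    ("Reuters", "football cup final tonight", 50, 500),
]
tweets = [
    {"handle": handle, "tokens": text.split(),
     "rt_per_sec": retweets_per_second(rt, now - timedelta(seconds=age), now)}
    for handle, text, rt, age in raw
]
clusters = cluster_by_threshold(tfidf_vectors([t["tokens"] for t in tweets]))
hot = pick_hot_tweet(tweets, clusters)
print(hot["handle"], hot["rt_per_sec"])
```

In this sample, the three earthquake tweets come from three different handles and contain a breaking-news keyword, so their cluster qualifies, while the single football tweet does not. A real run would likely tune `threshold`, `min_agencies`, and `min_ratio` against observed data rather than use these placeholder values.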