Analysis on twitter sentiment analysis benchmark datasets as described in the paper Shubhanshu Mishra and Jana Diesner. 2018. Detecting the Correlation between Sentiment and User-level as well as Text-Level Meta-data from Benchmark Corpora. In Proceedings of the 29th on Hypertext and Social Media (HT '18). ACM, New York, NY, USA, 2-10. DOI: https://doi.org/10.1145/3209542.3209562
If you plan to use this analysis please cite the following items:
@inproceedings{Mishra2018,
doi = {10.1145/3209542.3209562},
url = {https://doi.org/10.1145/3209542.3209562},
year = {2018},
publisher = {{ACM} Press},
author = {Shubhanshu Mishra and Jana Diesner},
title = {Detecting the Correlation between Sentiment and User-level as well as Text-Level Meta-data from Benchmark Corpora},
booktitle = {Proceedings of the 29th on Hypertext and Social Media - {HT} {\textquotesingle}18}
}
@misc{shubhanshu_mishra_2018_1308462,
author = {Shubhanshu Mishra},
title = {Twitter sentiment benchmark data analysis},
month = jul,
year = 2018,
doi = {10.5281/zenodo.1308462},
url = {https://doi.org/10.5281/zenodo.1308462}
}
You can use the training, validation, and test splits data_with_train_dev_test_split.txt.gz
as used in the paper by downloading the data in the data folder:
$ ls -ltrh data/
total 11M
-rw-rw-r-- 1 smishra8 is-sailgroup 5.1M May 16 04:26 joined_data_all.txt.gz
-rw-rw-r-- 1 smishra8 is-sailgroup 5.1M May 16 04:48 data_with_train_dev_test_split.txt.gz
The file was created as follows:
cd data && gunzip joined_data_all.txt.gz
python create_data_splits.py
- SemEval - http://alt.qcri.org/semeval2017/task4/
- Airline - https://www.kaggle.com/crowdflower/twitter-airline-sentiment
- GOP Debate - https://www.kaggle.com/crowdflower/first-gop-debate-twitter-sentiment
- Clarin - https://www.clarin.si/repository/xmlui/handle/11356/1054
- HCR - https://bitbucket.org/speriosu/updown/wiki/Getting%20Started
- Obama - https://bitbucket.org/speriosu/updown/wiki/Getting%20Started
Detecting the correlation between sentiment and user-level as well as text-level meta-data from benchmark corpora
Code for this analysis will can be seen in following files:
- Prepare all data for analysis - Join_all_data.ipynb
- Analyze the original benchmark datasets - Empirical_Analysis.ipynb
- Models based on meta, text, and joint features - Text_models.ipynb
- Get user's timeline tweets - Get_user_timelines.ipynb
- Predict timeline tweets using Vader Sentiment - Vader_sentiment_prediction.ipynb
- Analyze the timeline data - Time_line_aggregates.ipynb
Code released under GNU General Public License v3.0