
Manually Annotated Twitter Sentiment Corpus

@@ If you cannot access the dataset, I would be super happy to help. Email me at: [email protected] @@

Update:

We have now released the tweets, and you can download them directly from this GitHub repo.

We provide the dataset used in our NaijaSenti paper. We will soon host a Zindi competition on sentiment classification for Nigerian languages; therefore, only the training and validation sets are released in train_eval_split.

How to download the dataset?

Twitter has a strict policy on the public distribution of user data. Below is an excerpt from the Twitter policy:

The best place to get Twitter Content is directly from Twitter. Consequently, we restrict the redistribution of Twitter Content to third parties. If you provide Twitter Content to third parties, including downloadable datasets or via an API, you may only distribute Tweet IDs, Direct Message IDs, and/or User IDs (except as described below). We also grant special permissions to academic researchers sharing Tweet IDs and User IDs for non-commercial research purposes.

As a result, we are unable to directly share the entire tweet text. Instead, we release the dataset with the following metadata for each language: tweet IDs and the annotation labels. Below is an example of the dataset:

tweetID               label
1329755580903415808   negative
1387857032523489280   negative
1177449493844787200   positive
1082503529007403008   neutral
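
As a quick illustration, here is a minimal Python sketch for loading one of the released ID/label files into Pandas. The filename ids_labels.tsv and the tab separator are assumptions; adjust them to the actual files in train_eval_split. Note that tweet IDs should be read as strings, since they can lose precision when parsed as numbers.

# A minimal sketch for loading a released ID/label file.
# The filename ids_labels.tsv and the tab separator are assumptions;
# adjust them to the actual files in train_eval_split.
import pandas as pd

labels = pd.read_csv(
    "ids_labels.tsv",           # hypothetical filename
    sep="\t",
    names=["tweetID", "label"],
    header=0,
    dtype={"tweetID": str},     # keep IDs as strings to avoid precision loss
)
print(labels["label"].value_counts())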

We provide Python and R code below for hydrating all the tweets in our dataset using valid Twitter API credentials. If you have any trouble, please send an email to [email protected] and I will gladly assist you in obtaining the dataset.

Hydrating Tweets using Tweet IDs

Our corpus was built using Twitter API v2, which allows access to historical Tweets from the entire archive of public conversation on Twitter, dating back to 2006 (using the full-archive search endpoint). However, this level of Twitter API v2 access is reserved for academic researchers, and you can apply here: academic research product track

Prerequisites

To crawl tweets, you will need a set of keys and tokens to authenticate your requests. You can generate these from the Twitter developer dashboard. See the following for more information on how to generate these keys:

  1. Getting your keys and bearer token from the developer dashboard
  2. How to get access to the Twitter API

Hydrate Tweets using Tweet IDs in Python

We will be using the twarc library in Python. More info on using twarc

# Open up a new terminal and install twarc v2
pip3 install --upgrade twarc

Once you've got your Twitter developer keys set up, you can tell twarc what they are with the configure command:

twarc2 configure

twarc's hydrate command will read a file of tweet identifiers and write out the tweet JSON for them using Twitter's tweets API endpoint:

twarc2 hydrate ids.txt tweets.jsonl

The input file, ids.txt, is expected to contain one tweet identifier per line, without quotes or a header, such as:

919505987303886849
919505982882844672
919505982602039297
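
If you prefer to stay in Python rather than the command line, twarc also exposes a Python client. Below is a minimal sketch, assuming your bearer token is stored in the TWITTER_BEARER_TOKEN environment variable:

# A minimal sketch using twarc's Python client instead of the CLI.
# Assumes the bearer token is in the TWITTER_BEARER_TOKEN environment variable.
import os
from twarc import Twarc2

client = Twarc2(bearer_token=os.environ["TWITTER_BEARER_TOKEN"])

with open("ids.txt") as f:
    ids = [line.strip() for line in f if line.strip()]

# tweet_lookup yields Twitter API v2 responses in pages of up to 100 tweets
for page in client.tweet_lookup(ids):
    for tweet in page.get("data", []):
        print(tweet["id"], tweet["text"])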

Processing data into common analysis formats

The resulting file (tweets.jsonl) is line-oriented JSON, a format which has many advantages for moving large amounts of structured data. However, we need to take some steps to transform the JSON into a form more common for data analysis. We can use the twarc-csv module to convert the line-oriented JSON to CSV, which is easier to use as DataFrames in tools like Pandas and R, as follows:

# Install twarc-csv
pip3 install --upgrade twarc-csv

# Convert to CSV
twarc2 csv tweets.jsonl tweets.csv

You can load the CSV into a Pandas DataFrame.
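
For example, here is a minimal sketch that loads the hydrated tweets and joins them back to the released labels (the ids_labels.tsv filename is carried over as an assumption from the earlier example):

# A minimal sketch: load the hydrated tweets and join them to the labels.
# ids_labels.tsv is the hypothetical label file from the earlier example.
import pandas as pd

tweets = pd.read_csv("tweets.csv", dtype={"id": str})
labels = pd.read_csv("ids_labels.tsv", sep="\t",
                     names=["tweetID", "label"], header=0,
                     dtype={"tweetID": str})

# Join on the tweet ID so each hydrated tweet carries its sentiment label
merged = tweets.merge(labels, left_on="id", right_on="tweetID")
print(merged[["id", "text", "label"]].head())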

Hydrate Tweets using Tweet IDs in R

For R users, you can use the academictwitteR package. All the code can be found here. Install academictwitteR first:

# Install the academictwitteR package
install.packages("academictwitteR")


# This will load the academicTwitteR package
library(academictwitteR)

# Set your own bearer token (replace the XXXXX with your own bearer token)
bearer_token <- "XXXXX"


hydrate_tweets(
  ids = c("919505987303886849", "919505982882844672", "919505982602039297"),
  bearer_token = bearer_token,
  data_path = "data",
  bind_tweets = TRUE,
  verbose = TRUE
)

Processing data into common analysis formats

You can use the bind_tweets function to bundle the JSON files into a data.frame object for analysis in R, as follows:

tweets <- bind_tweets(data_path = "data/")
users <- bind_tweets(data_path = "data/", user = TRUE)

# You can also bind JSONs into tidy format by specifying a tidy output format.
bind_tweets(data_path = "data", output_format = "tidy")

I cannot download the tweets, what can I do?

Please send an email to [email protected] and I will gladly assist you in obtaining the dataset.