**If you cannot access the dataset, I would be happy to help. Email me at: [email protected]**
We have now released the tweets, and you can download them directly from this GitHub repo.
We provide the dataset used in our NaijaSenti paper. We will soon host a Zindi competition on sentiment classification for Nigerian languages; therefore, only the training and validation sets are released (train_eval_split).
Twitter has a strict policy on the public distribution of user data. Below is an excerpt from the Twitter policy:
> The best place to get Twitter Content is directly from Twitter. Consequently, we restrict the redistribution of Twitter Content to third parties. If you provide Twitter Content to third parties, including downloadable datasets or via an API, you may only distribute Tweet IDs, Direct Message IDs, and/or User IDs (except as described below). We also grant special permissions to academic researchers sharing Tweet IDs and User IDs for non-commercial research purposes.
As a result, we are unable to share the full tweet text directly. Instead, we release the dataset with the following metadata for each language: tweet IDs and the annotation labels. Below is an example of the dataset.
| tweetIDs | label |
|---|---|
| 1329755580903415808 | negative |
| 1387857032523489280 | negative |
| 1177449493844787200 | positive |
| 1082503529007403008 | neutral |
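For convenience, here is a minimal Python sketch for extracting the tweet IDs from a released annotation file into the ids.txt file used in the hydration step below. The file name hausa_train.tsv is hypothetical; substitute the language split you downloaded, and adjust the delimiter if the file is comma-separated.
import csv
# Read a released annotation file (hypothetical name; assumed to be
# tab-separated with tweetIDs and label columns as in the example above)
with open("hausa_train.tsv", newline="", encoding="utf-8") as src, \
        open("ids.txt", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src, delimiter="\t"):
        dst.write(row["tweetIDs"] + "\n")  # one tweet ID per line, no quotes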
We provide Python and R code below for hydrating all the tweets in our dataset using valid Twitter API credentials.
Our corpus was built using Twitter API v2, which allows access to historical tweets from the entire archive of public conversation on Twitter, dating back to 2006 (via the full-archive search endpoint). However, this level of access is reserved for academic researchers; you can apply here: academic research product track
To crawl tweets, you will need a set of keys and tokens to authenticate your requests. See the following for more information on how to generate them (a sketch of authenticating a request directly with the bearer token follows the list):
- Getting your keys and bearer token from the developer dashboard
- How to get access to the Twitter API
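As an aside, if you prefer to call the Twitter API v2 tweet lookup endpoint directly with your bearer token rather than going through a client library, here is a minimal Python sketch. BEARER_TOKEN is a placeholder, and the endpoint accepts up to 100 comma-separated IDs per request.
import requests
BEARER_TOKEN = "XXXXX"  # placeholder: replace with your own bearer token
ids = "919505987303886849,919505982882844672,919505982602039297"
# Twitter API v2 tweet lookup endpoint; tweets come back in the "data" array
resp = requests.get(
    "https://api.twitter.com/2/tweets",
    params={"ids": ids},
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
)
resp.raise_for_status()
for tweet in resp.json().get("data", []):
    print(tweet["id"], tweet["text"])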
We will be using the twarc library in Python. More info on using twarc
# Open up a new terminal and install twarc v2
pip3 install --upgrade twarc
Once you've got your Twitter developer access set up, you can tell twarc what your keys are with the configure command:
twarc2 configure
twarc's hydrate command will read a file of tweet identifiers and write out the tweet JSON for them using Twitter's tweets API endpoint:
twarc2 hydrate ids.txt tweets.jsonl
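If you would rather hydrate from a Python script than from the command line, twarc also exposes a client class. A minimal sketch (the import and tweet_lookup call follow the twarc v2 documentation; the bearer token is a placeholder):
import json
from twarc import Twarc2
client = Twarc2(bearer_token="XXXXX")  # placeholder: your own bearer token
# Read one tweet ID per line, as in the ids.txt format described below
with open("ids.txt") as f:
    ids = [line.strip() for line in f if line.strip()]
# tweet_lookup yields one API response page per batch of IDs
with open("tweets.jsonl", "w", encoding="utf-8") as out:
    for page in client.tweet_lookup(ids):
        out.write(json.dumps(page) + "\n")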
The input file, ids.txt, is expected to contain one tweet identifier per line, without quotes or a header, such as:
919505987303886849
919505982882844672
919505982602039297
The resulting file (tweets.jsonl) is line-oriented JSON (JSONL), a format with many advantages for moving large amounts of structured data. However, we need to transform it into a form more common for data analysis. We can use the twarc-csv module to convert the line-oriented JSON to CSV, which is easier to work with as DataFrames in tools like Pandas and R, as follows:
# install twarc-csv
pip3 install --upgrade twarc-csv
# convert to CSV
twarc2 csv tweets.jsonl tweets.csv
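Alternatively, if you prefer to skip the CSV step, here is a short Python sketch for pulling the ID and text out of the JSONL directly. This assumes twarc2's page-per-line layout, where each line is one API response holding its tweets in a "data" array.
import json
# Each line of tweets.jsonl is one API response page;
# the tweets themselves sit in the page's "data" array
with open("tweets.jsonl", encoding="utf-8") as f:
    for line in f:
        for tweet in json.loads(line).get("data", []):
            print(tweet["id"], tweet["text"])
That said, the CSV route is usually the more convenient one for analysis.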
You can load the CSV into a Pandas DataFrame.
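For example (the id and text column names follow twarc-csv's flattened output, and the label file name is hypothetical; reading IDs as strings avoids precision and type-mismatch issues when merging):
import pandas as pd
# Load the hydrated tweets produced by twarc2 csv
tweets = pd.read_csv("tweets.csv", dtype={"id": str})
# Merge the tweet text back with the released annotation labels
# (hypothetical file name; tab-separated tweetIDs/label as above)
labels = pd.read_csv("hausa_train.tsv", sep="\t", dtype={"tweetIDs": str})
dataset = labels.merge(tweets[["id", "text"]], left_on="tweetIDs", right_on="id")
print(dataset[["tweetIDs", "text", "label"]].head())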
For R users, you can use the academictwitteR package. All the code can be found here. Install academictwitteR first:
# Install the academictwitteR package
install.packages("academictwitteR")
# This will load the academicTwitteR package
library(academictwitteR)
# Set your own bearer token (replace the XXXXX with your own bearer token)
bearer_token <- "XXXXX"
hydrate_tweets(
ids = c("919505987303886849", "919505982882844672", "919505982602039297"),
bearer_token = bearer_token,
data_path = "data",
bind_tweets = TRUE,
verbose = TRUE
)
You can use the bind_tweets function to bundle the JSONs into a data.frame object for analysis in R as follows:
tweets <- bind_tweets(data_path = "data/")
users <- bind_tweets(data_path = "data/", user = TRUE)
# You can also bind JSONs into a tidy format by specifying a tidy output format.
bind_tweets(data_path = "data", output_format = "tidy")
If you have any trouble, please send an email to [email protected] and I will gladly assist you in obtaining the dataset.