TREC 2013 API Specifications
The TREC 2013 microblog track is experimenting with the "track as a service" model. Instead of distributing the collection, the evaluation will be conducted by having everyone use a common API to access the collection. This page describes the specification of the API.
Here's a sample invocation of the command-line interface to the search API:
etc/run.sh cc.twittertools.search.retrieval.TrecSearchThriftClientCli \
-host [HOSTNAME] -port 9090 -qid MB01 -q 'BBC World Service staff cuts' \
-max_id 34952194402811905 -num_results 1000 -runtag lucene \
-group [GROUP] -token [TOKEN]
After you've cloned the twitter-tools repo and successfully built the project with ant, the above command should work. Note that you need to specify three fields:
- [HOSTNAME]: the hostname serving the API
- [GROUP]: your group id
- [TOKEN]: your authentication token
NOTE: We're still figuring out the details of how exactly to distribute this information... for now, bug Jimmy Lin.
In this example, we are searching for topic MB01 from TREC 2011 (assuming the service provides the Tweets2011 corpus). The other command-line parameters are:
- port: use 9090
- qid: the topic id
- q: the query
- max_id: return docids up to this value
- num_results: number of hits to return
- runtag: runtag to use (from trec_eval output format)
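For orientation, the runtag conventionally appears as the last column of each result line in the six-column TREC run format that trec_eval reads: topic id, the literal Q0, docid, rank, score, runtag. The sketch below formats one such line. The class name TrecRunLine, the helper formatLine, and the sample docid, rank, and score are hypothetical and chosen for illustration only; the exact output of the command-line client may differ.

```java
import java.util.Locale;

public class TrecRunLine {
  // Formats one hit in the conventional TREC run layout:
  //   qid Q0 docid rank score runtag
  // Hypothetical helper for illustration; not part of twitter-tools.
  public static String formatLine(String qid, long docid, int rank, double score, String runtag) {
    return String.format(Locale.ROOT, "%s Q0 %d %d %f %s", qid, docid, rank, score, runtag);
  }

  public static void main(String[] args) {
    // qid and runtag come from the sample invocation; docid, rank, and score are made up.
    System.out.println(formatLine("MB01", 34952194402811905L, 1, 4.5, "lucene"));
  }
}
```

Using Locale.ROOT keeps the decimal separator a period regardless of the machine's default locale, which matters if the file is later fed to trec_eval.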
This section describes the fields from Twitter’s JSON-represented statuses that the API indexes, stores, and exposes.
All fields are marked as Store.YES in the index, allowing users to access data from retrieved documents. Some fields are present in all statuses, while others only contain a value if the source JSON object contained a non-null entry in that slot. See the table below for details.
Field Name in API | Corresponding JSON element | Always Present | Data Type | Description |
id | status.id | yes | long | The unique identifier assigned to this document by Twitter |
screen_name | status.user.screen_name | yes | String | The Twitter screen name of the status author |
epoch | NA | yes | long | The unix epoch (in seconds) corresponding to the created_at JSON element |
text | status.text | yes | String | The text of the status |
retweet_count | status.retweet_count | yes | long | Number of times this status has been retweeted. Non-retweeted documents show 0 |
followers_count | status.user.followers_count | yes | int | The number of followers that the author of this status has |
statuses_count | status.user.statuses_count | yes | int | The number of statuses that the author of this status had at the time this status was created |
lang | status.lang | no | String | The two-character language of the status (not the user) as described by the Twitter language id system |
in_reply_to_status_id | status.in_reply_to_status_id | no | long | The unique identifier of the status that this document replies to |
in_reply_to_user_id | status.in_reply_to_user_id | no | long | The unique identifier of the user who posted the status that this document replies to |
retweeted_status_id | status.retweeted_status.id | no | long | The unique identifier of the tweet that this is a retweet of. |
retweeted_user_id | status.retweeted_status.user.id | no | long | The user ID of the person who posted the tweet that this is a retweet of. |
Consider the following JSON object:
{
created_at: "Fri Mar 29 11:42:34 +0000 2013",
id: 317602681340432400,
id_str: "317602681340432384",
text: "@hanahani3310 alhamdulillah :)",
source: "<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>",
truncated: false,
in_reply_to_status_id: 317602575803363300,
in_reply_to_status_id_str: "317602575803363329",
in_reply_to_user_id: 1054214132,
in_reply_to_user_id_str: "1054214132",
in_reply_to_screen_name: "hanahani3310",
user: {
id: 28106647,
id_str: "28106647",
name: "cheryna",
screen_name: "cheryna27",
location: "seremban-kl-seremban",
url: "http://cheryna.blogspot.com",
description: "I. A lot. Go away if you can't stand the smell.",
protected: false,
followers_count: 214,
friends_count: 174,
listed_count: 1,
created_at: "Wed Apr 01 13:39:16 +0000 2009",
favourites_count: 1825,
utc_offset: -32400,
time_zone: "Alaska",
geo_enabled: true,
verified: false,
... snip ...
geo: {
type: "Point",
coordinates: [
3.14489609,
101.69596372
]
},
... snip ...
contributors: null,
retweet_count: 0,
favorite_count: 0,
favorited: false,
retweeted: false,
lang: "id"
}
If we retrieve this document from the index into an org.apache.lucene.document.Document object called hit and then invoke:
// Print every stored field of the retrieved document.
for (IndexableField field : hit.getFields()) {
  System.out.println(field.toString());
}
we see the following output:
stored<id:317602681340432384>
stored<epoch:1364557354>
stored,indexed,tokenized<screen_name:cheryna27>
stored,indexed,tokenized<text:@hanahani3310 alhamdulillah :)>
stored<retweet_count:0>
stored<in_reply_to_status_id:317602575803363329>
stored<in_reply_to_user_id:1054214132>
stored,indexed,tokenized<lang:id>
stored<friends_count:180>
stored<followers_count:160>
stored<statuses_count:27700>
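The epoch field has no direct JSON counterpart; it is derived from created_at. As a rough sketch of that conversion (not necessarily how the indexer itself does it), the following parses the created_at string from the JSON object above and converts it to Unix seconds. The class name CreatedAtEpoch and the method toEpochSeconds are hypothetical.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

public class CreatedAtEpoch {
  // Pattern for created_at strings such as "Fri Mar 29 11:42:34 +0000 2013".
  // Locale.ENGLISH is needed so that "Fri" and "Mar" parse on any machine.
  private static final DateTimeFormatter CREATED_AT =
      DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH);

  // Hypothetical helper: converts a created_at string to Unix epoch seconds.
  public static long toEpochSeconds(String createdAt) {
    return OffsetDateTime.parse(createdAt, CREATED_AT).toEpochSecond();
  }

  public static void main(String[] args) {
    System.out.println(toEpochSeconds("Fri Mar 29 11:42:34 +0000 2013"));
  }
}
```

For the sample status this yields 1364557354, which agrees with the epoch value in the output above.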
Please note that all details here are open to change. Discussion can be found on the mailing list and in issue #23.
The tokenizer creates a new token whenever it encounters whitespace or one of the following characters:
] [ ! " # $ % & ( ) * + , . / : ; < = > ? @ \ ^ _ ` { | } ~ - … ¬ ·
There are a number of exceptions where characters listed above will not cause a new token to be created:
- A period (.) will not cause a new token if it is used as part of an acronym.
- An ampersand (&) will not cause a new token if the character on both sides is uppercase (such as M&S, AT&T or H&M).
- The characters @, # and _ will not cause a new token if used as part of a mention or hashtag.
- Valid URLs are not tokenized.
All text is converted to lowercase, with the exception of URLs, which are left untouched because of the prevalent use of URL shorteners in tweets, many of which use case-sensitive URLs.
The Porter stemmer implementation provided with Lucene 4.1 is applied to all tokens except mentions, hashtags, and URLs.
No stop word removal is performed.
In the examples below, each input line is followed by its tokenized output, in which every token is surrounded by vertical bars (|):
AT&T getting secret immunity from wiretapping laws for government surveillance http://vrge.co/ZP3Fx5
|att|get|secret|immun|from|wiretap|law|for|govern|surveil|http://vrge.co/ZP3Fx5|
want to see the @verge aston martin GT4 racer tear up long beach? http://theracersgroup.kinja.com/watch-an-aston-martin-vantage-gt4-tear-around-long-beac-479726219 …
|want|to|see|the|@verge|aston|martin|gt4|racer|tear|up|long|beach|http://theracersgroup.kinja.com/watch-an-aston-martin-vantage-gt4-tear-around-long-beac-479726219|
Incredibly good news! #Drupal users rally http://bit.ly/Z8ZoFe to ensure blind accessibility contributor gets to @DrupalCon #Opensource
|incred|good|new|#drupal|user|ralli|http://bit.ly/Z8ZoFe|to|ensur|blind|access|contributor|get|to|@drupalcon|#opensource|
We're entering the quiet hours at #amznhack. #Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
|were|enter|the|quiet|hour|at|#amznhack|#rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz|
The 2013 Social Event Detection Task (SED) at #mediaeval2013, http://bit.ly/16nITsf supported by @linkedtv @project_mmixer @socialsensor_ip
|the|2013|social|event|detect|task|sed|at|#mediaeval2013|http://bit.ly/16nITsf|support|by|@linkedtv|@project_mmixer|@socialsensor_ip|
U.S.A. U.K. U.K USA UK #US #UK #U.S.A #U.K ...A.B.C...D..E..F..A.LONG WORD
|usa|uk|uk|usa|uk|#us|#uk|#u|sa|#u|k|abc|d|e|f|a|long|word|
this is @a_valid_mention and this_is_multiple_words
|thi|is|@a_valid_mention|and|thi|is|multipl|word|
PLEASE BE LOWER CASE WHEN YOU COME OUT THE OTHER SIDE - ALSO A @VALID_VALID-INVALID
|pleas|be|lower|case|when|you|come|out|the|other|side|also|a|@valid_valid|invalid|
@reply @with #crazy ~#at
|@reply|@with|#crazy|#at|
:@valid testing(valid)#hashtags. RT:@meniton (the last @mention is #valid and so is this:@valid), however this is@invalid
|@valid|test|valid|#hashtags|rt|@meniton|the|last|@mention|is|#valid|and|so|is|thi|@valid|howev|thi|is|invalid|
this][is[lots[(of)words+with-lots=of-strange!characters?$in-fact=it&has&Every&Single:one;of<them>in_here_B&N_test_test?test\test^testing`testing{testing}testing…testing¬testing·testing what?
|thi|is|lot|of|word|with|lot|of|strang|charact|in|fact|it|ha|everi|singl|on|of|them|in|here|bn|test|test|test|test|test|test|test|test|test|test|test|what|
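To make the splitting rules more concrete, here is a deliberately simplified sketch of the delimiter-based splitting with the mention, hashtag, and URL exceptions and the lowercasing step. It is not the actual analyzer: the acronym and ampersand exceptions, apostrophe handling, URL validation (anything starting with http:// or https:// is treated as a URL), and Porter stemming are all left out, so its output will differ from the examples above wherever those rules apply. The class name SimpleTweetTokenizer and the regular expressions are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTweetTokenizer {
  // A run of characters that are neither whitespace nor one of the listed delimiters.
  private static final String NON_DELIMITER =
      "[^\\s\\]\\[!\"#$%&()*+,./:;<=>?@\\\\^_`{|}~\\-\u2026\u00AC\u00B7]+";

  // At each position, try in order: a URL, a mention or hashtag that does not
  // directly follow a word character, then a plain run of non-delimiters.
  private static final Pattern TOKEN = Pattern.compile(
      "(?<url>https?://\\S+)"
          + "|(?<tag>(?<![\\p{L}\\p{N}_])[@#][\\p{L}\\p{N}_]+)"
          + "|(?<word>" + NON_DELIMITER + ")");

  public static List<String> tokenize(String text) {
    List<String> tokens = new ArrayList<>();
    Matcher m = TOKEN.matcher(text);
    while (m.find()) {
      // URLs keep their case; everything else is lowercased.
      boolean isUrl = m.group("url") != null;
      tokens.add(isUrl ? m.group() : m.group().toLowerCase(Locale.ROOT));
    }
    return tokens;
  }

  public static void main(String[] args) {
    System.out.println(String.join("|", tokenize("@reply @with #crazy ~#at")));
  }
}
```

For the input "@reply @with #crazy ~#at" this sketch yields the tokens @reply, @with, #crazy, and #at, the same tokens as the corresponding example above. Inputs that depend on stemming or the acronym rule (for instance U.S.A.) will not match.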