TREC 2013 API Specifications

Lucene Analyzer

Please note that all details here are subject to change. Discussion can be found on the mailing list and in issue #23.

Tokenization

The tokenizer creates a new token whenever it encounters whitespace or one of the following characters:

_ - ? ! , ; : . ( ) [ ] @ # / \

Although the @ and # characters are used as delimiters, they are preserved when they precede a valid mention or hashtag.
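
The delimiter rule can be sketched in plain Java with a single regular expression. This is only an illustration, not the project's Lucene tokenizer: the class name DelimiterTokenizerSketch is hypothetical, and the rules it uses for recognising URLs (an http:// or https:// prefix) and "valid" mentions or hashtags (an @ or # that does not follow a letter or digit, followed by letters, digits, or underscores) are assumptions inferred from the examples further down.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not the analyzer shipped with the API.
final class DelimiterTokenizerSketch {

    // Alternatives are tried in this order at each position:
    //   1. a URL, kept whole (assumed to start with http:// or https://)
    //   2. '@' or '#' plus letters, digits, or underscores, accepted only
    //      when the preceding character is not a letter or digit
    //      (assumed rule for a "valid" mention or hashtag)
    //   3. a run of characters with no whitespace and none of the
    //      delimiters  _ - ? ! , ; : . ( ) [ ] @ # / \
    private static final Pattern TOKEN = Pattern.compile(
        "https?://\\S+"
        + "|(?<![\\p{L}\\p{N}])[@#][\\p{L}\\p{N}_]+"
        + "|[^\\s_\\-?!,;:.()\\[\\]@#/\\\\]+");

    // Returns the raw tokens in order; no lowercasing or stemming is applied.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

Calling DelimiterTokenizerSketch.tokenize("however this is@invalid") yields the tokens however, this, is, and invalid, because the @ directly follows a letter and is therefore treated as a plain delimiter. The sketch does not attempt the acronym and apostrophe handling visible in the examples (U.S.A. becoming usa, We're becoming were).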

Text Normalization

All text is converted to lowercase, with the exception of URLs, which are left untouched: URL shorteners are prevalent in tweets, and many of them use case-sensitive URLs.

Stemming

The Porter stemmer implementation provided with Lucene 4.1 is applied to all tokens except mentions, hashtags, and URLs.
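
The normalization and stemming rules together decide, per token, whether it is lowercased and whether it is stemmed. The following is a minimal sketch of that decision, not the actual analyzer code. The class TokenNormalizerSketch and its methods are hypothetical, the prefix checks used to recognise URLs, mentions, and hashtags are assumptions, and the stemmer is passed in as a function because the real Porter stemmer lives in Lucene.

```java
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical illustration of the per-token rules described above.
final class TokenNormalizerSketch {

    // Assumption: a token is a URL if it starts with http:// or https://.
    static boolean isUrl(String token) {
        return token.startsWith("http://") || token.startsWith("https://");
    }

    // Assumption: mentions and hashtags are marked by an ASCII '@' or '#'.
    static boolean isMentionOrHashtag(String token) {
        return token.startsWith("@") || token.startsWith("#");
    }

    // URLs are returned untouched; mentions and hashtags are lowercased
    // but not stemmed; every other token is lowercased and then stemmed.
    static String normalize(String token, UnaryOperator<String> stemmer) {
        if (isUrl(token)) {
            return token;
        }
        String lower = token.toLowerCase(Locale.ROOT);
        if (isMentionOrHashtag(lower)) {
            return lower;
        }
        return stemmer.apply(lower);
    }
}
```

With any stemmer plugged in, normalize("http://bit.ly/Z8ZoFe", stemmer) returns the URL unchanged and normalize("#Drupal", stemmer) returns #drupal, while an ordinary word is lowercased and handed to the stemmer.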

Stop Word Removal

No stop word removal is performed.

Examples

Please note that each token is surrounded by vertical bars: |

AT&T getting secret immunity from wiretapping laws for government surveillance http://vrge.co/ZP3Fx5
|att|get|secret|immun|from|wiretap|law|for|govern|surveil|http://vrge.co/ZP3Fx5|

want to see the @verge aston martin GT4 racer tear up long beach? http://theracersgroup.kinja.com/watch-an-aston-martin-vantage-gt4-tear-around-long-beac-479726219 …
|want|to|see|the|@verge|aston|martin|gt4|racer|tear|up|long|beach|http://theracersgroup.kinja.com/watch-an-aston-martin-vantage-gt4-tear-around-long-beac-479726219|

Incredibly good news! #Drupal users rally http://bit.ly/Z8ZoFe  to ensure blind accessibility contributor gets to @DrupalCon #Opensource
|incred|good|new|#drupal|user|ralli|http://bit.ly/Z8ZoFe|to|ensur|blind|access|contributor|get|to|@drupalcon|#opensource|

We're entering the quiet hours at #amznhack. #Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz
|were|enter|the|quiet|hour|at|#amznhack|#rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz|

The 2013 Social Event Detection Task (SED) at #mediaeval2013, http://bit.ly/16nITsf  supported by @linkedtv @project_mmixer @socialsensor_ip
|the|2013|social|event|detect|task|sed|at|#mediaeval2013|http://bit.ly/16nITsf|support|by|@linkedtv|@project_mmixer|@socialsensor_ip|

U.S.A. U.K. U.K USA UK #US #UK #U.S.A #U.K ...A.B.C...D..E..F..A.LONG WORD
|usa|uk|uk|usa|uk|#us|#uk|#u|sa|#u|k|abc|d|e|f|a|long|word|

this is @a_valid_mention and this_is_multiple_words
|thi|is|@a_valid_mention|and|thi|is|multipl|word|

PLEASE BE LOWER CASE WHEN YOU COME OUT THE OTHER SIDE - ALSO A @VALID_VALID-INVALID
|pleas|be|lower|case|when|you|come|out|the|other|side|also|a|@valid_valid|invalid|

＠reply @with #crazy ~＃at
|＠reply|@with|#crazy|＃at|

:@valid testing(valid)#hashtags. RT:@meniton (the last @mention is #valid and so is this:@valid), however this is@invalid
|@valid|test|valid|#hashtags|rt|@meniton|the|last|@mention|is|#valid|and|so|is|thi|@valid|howev|thi|is|invalid|

Indexing Details

This section describes the fields from Twitter’s JSON-represented statuses that the API indexes, stores, and exposes.

Indexed Fields

All fields are marked as Store.YES in the index, allowing users to access data from retrieved documents. Some fields are present in all statuses, while others only contain a value if the source JSON object contained a non-null entry in that slot. See the table below for details.

Field Name in API	Corresponding JSON element	Always Present	Data Type	Description
id	status.id	yes	long	The unique identifier assigned to this document by Twitter
screen_name	status.user.screen_name	yes	String	The Twitter screen name of the status author
epoch	NA	yes	long	The unix epoch (in seconds) corresponding to the created_at JSON element
text	status.text	yes	String	The text of the status
retweeted_count	status.retweet_count	yes	long	The number of times this status has been retweeted. Statuses that have not been retweeted show 0
followers_count	status.user.followers_count	yes	int	The number of followers that the author of this status has
statuses_count	status.user.statuses_count	yes	int	The number of statuses that the author of this status had posted at the time this status was created
lang	status.lang	no	String	The two-character language code of the status (not the user) as described by the Twitter language id system
in_reply_to_status_id	status.in_reply_to_status_id	no	long	The unique identifier of the status that this document replies to
in_reply_to_user_id	status.in_reply_to_user_id	no	long	The unique identifier of the user who posted the status that this document replies to
latitude	status.geo.coordinates[0]	no	double	The latitude of the location from which the status was posted
longitude	status.geo.coordinates[1]	no	double	The longitude of the location from which the status was posted

Example

Consider the following JSON object:

 {
created_at: "Fri Mar 29 11:42:34 +0000 2013",
id: 317602681340432400,
id_str: "317602681340432384",
text: "@hanahani3310 alhamdulillah :)",
source: "<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>",
truncated: false,
in_reply_to_status_id: 317602575803363300,
in_reply_to_status_id_str: "317602575803363329",
in_reply_to_user_id: 1054214132,
in_reply_to_user_id_str: "1054214132",
in_reply_to_screen_name: "hanahani3310",
user: {
id: 28106647,
id_str: "28106647",
name: "cheryna",
screen_name: "cheryna27",
location: "seremban-kl-seremban",
url: "http://cheryna.blogspot.com",
description: "I fart. A lot. Go away if you can't stand the smell.",
protected: false,
followers_count: 214,
friends_count: 174,
listed_count: 1,
created_at: "Wed Apr 01 13:39:16 +0000 2009",
favourites_count: 1825,
utc_offset: -32400,
time_zone: "Alaska",
geo_enabled: true,
verified: false,
... snip ...
geo: {
type: "Point",
coordinates: [
3.14489609,
101.69596372
]
},
... snip ...
contributors: null,
retweet_count: 0,
favorite_count: 0,
favorited: false,
retweeted: false,
lang: "id"
}

If we retrieve this document from the index into an org.apache.lucene.document.Document object called hit and then invoke:

    // Print every stored field of the retrieved document.
    for (IndexableField field : hit.getFields()) {
      System.out.println(field.toString());
    }

we see the following output:

stored<id:317602681340432384>
stored<epoch:1364557354>
stored,indexed,tokenized<screen_name:cheryna27>
stored,indexed,tokenized<text:@hanahani3310 alhamdulillah :)>
stored<retweet_count:0>
stored<in_reply_to_status_id:317602575803363329>
stored<in_reply_to_user_id:1054214132>
stored,indexed,tokenized<lang:id>
stored<friends_count:180>
stored<followers_count:160>
stored<statuses_count:27700>
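
The epoch value in this output corresponds to the created_at string of the JSON object. As a rough sketch of how such a conversion could be done with java.time (the class name CreatedAtEpochSketch and the date pattern are assumptions chosen to fit the created_at format shown above, not the API's actual code):

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper; the indexer may derive the epoch differently.
final class CreatedAtEpochSketch {

    // Matches strings such as "Fri Mar 29 11:42:34 +0000 2013".
    private static final DateTimeFormatter CREATED_AT =
        DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH);

    // Returns the Unix epoch in seconds for a created_at string.
    static long toEpochSeconds(String createdAt) {
        return OffsetDateTime.parse(createdAt, CREATED_AT).toEpochSecond();
    }
}
```

For the status above, toEpochSeconds("Fri Mar 29 11:42:34 +0000 2013") returns 1364557354, which matches the stored epoch field.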
