WeSearch_StarSem
Some notes on the StarSEM 2012 shared task. I've used annotation conventions similar to those in our previous work, with <> for cues, {} for scope and now [] for events. For papers, though, we should probably follow Morante et al.'s (2011) conventions of bold for cues, underline for scope and italics for events.
The training data contains 3,640 sentences with 989 instances of negation.
99 instances have no scope; 101 instances have a discontinuous scope that is not bridged by the cue. Of the remaining 796 instances, 512 (64.3%) are aligned with some constituent from the Collins parser output (after slackening scope for constituent initial/final punctuation).
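The alignment test described above can be sketched roughly as follows. This is only an illustration, not the script that produced the 512 figure: the half-open `(start, end)` token spans, the function names, and the choice to trim punctuation from both the scope and the constituent are assumptions, and the scope is assumed to have already been reduced to a single span.

```python
import string

def is_punct(token):
    """True if the token consists only of ASCII punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)

def trim_punct(span, tokens):
    """Shrink a half-open (start, end) span past leading/trailing punctuation."""
    start, end = span
    while start < end and is_punct(tokens[start]):
        start += 1
    while end > start and is_punct(tokens[end - 1]):
        end -= 1
    return start, end

def aligns_with_constituent(scope, constituents, tokens):
    """True if the scope matches some constituent once edge punctuation is ignored."""
    target = trim_punct(scope, tokens)
    return any(trim_punct(c, tokens) == target for c in constituents)

# Hypothetical constituent spans for illustration; real spans would come from the parser output.
tokens = ["It", "never", "recovered", "from", "the", "blow", ","]
constituents = [(0, 7), (3, 7), (4, 6)]
print(aligns_with_constituent((3, 6), constituents, tokens))  # True: (3, 7) minus the comma
print(aligns_with_constituent((2, 5), constituents, tokens))  # False
```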
371 instances have no event; 14 instances have discontinuous events. In 6 instances the event lies outside the scope; these seem to be annotation errors:
- ... only {an} <un>[ambitious] {one who abandons a London career for the country} ...
- ... {an} <un>[justifiable] {intrusion}, ...
- {It} <never> [recovered] {from the blow}, ...
- "But {I} [can]<'t> {forget them}, Miss Stapleton," said I.
- ... and means to [spare] <no> {pains or expense} to restore the grandeur of his family.
- Coming down with an <un>[signed] {warrant}.
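A check like the one behind this list can be run directly on strings written in the bracket notation used in these notes. The sketch below is a minimal, character-level illustration; the function names are made up here, and it assumes the text contains no literal `<`, `{` or `[` characters of its own.

```python
OPENERS = {"<": "cue", "{": "scope", "[": "event"}
CLOSERS = {">": "cue", "}": "scope", "]": "event"}

def label_characters(annotated):
    """Return (char, in_cue, in_scope, in_event) for every non-bracket character."""
    depth = {"cue": 0, "scope": 0, "event": 0}
    labelled = []
    for ch in annotated:
        if ch in OPENERS:
            depth[OPENERS[ch]] += 1
        elif ch in CLOSERS:
            depth[CLOSERS[ch]] -= 1
        else:
            labelled.append((ch, depth["cue"] > 0, depth["scope"] > 0, depth["event"] > 0))
    return labelled

def event_outside_scope(annotated):
    """True if any event character is not also inside a scope bracket."""
    return any(in_event and not in_scope
               for _, _, in_scope, in_event in label_characters(annotated))

print(event_outside_scope("{It} <never> [recovered] {from the blow}"))  # True
print(event_outside_scope("{I [can]}<'t> {get it clear yet}"))          # False
```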
Collins' coverage of the training data is 99.4%: 21 of the 3,640 sentences are not covered. Those 21 sentences contain 10 instances of negation, for example:
- "Know then that in the time of the Great Rebellion (the history of which by the learned Lord Clarendon I most earnestly commend to your attention) this Manor of Baskerville was held by Hugo of that name, nor <can> {[it] be gainsaid that he was a most wild, profane, and {[god]}<less> man}.
Collins can't lemmatise contractions, e.g. the lemma given for can't is <unknown>.
The files are provided in CoNLL format, with the first 7 columns corresponding to:
- (1) Book_Chapter
- (2) sentence number within chapter
- (3) token number within sentence
- (4) word
- (5) lemma
- (6) part-of-speech
- (7) syntax
If the sentence does not have negations:
- (8) ***
Otherwise there are three columns per negation:
- (8,11,14, ...) word (or part of word) that is part of the cue
- (9,12,15, ...) word (or part of word) that is part of the scope
- (10,13,16, ...) word (or part of word) that is part of the event
Lemmas, part-of-speech and syntax are automatically generated using the Shalmaneser (Erk and Padó, 2006) semantic parser, which in turn uses the Collins parser.
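The column layout above can be read with a few lines of Python. This is a sketch under assumptions: columns are taken to be whitespace-separated and sentences separated by blank lines, the `Token` type and function names are invented here, and the sample rows (including the `_` placeholder for empty cells and the `ExampleBook_ch1` label) are made up for illustration rather than copied from the data.

```python
from collections import namedtuple

Token = namedtuple("Token", "chapter sent_id tok_id word lemma pos syntax negations")

def parse_token_line(line):
    """Split one row into the 7 fixed columns plus (cue, scope, event) triples."""
    cols = line.split()
    fixed, rest = cols[:7], cols[7:]
    if rest == ["***"]:          # sentence without negation
        negations = []
    else:                        # three columns per negation instance
        negations = [tuple(rest[i:i + 3]) for i in range(0, len(rest), 3)]
    return Token(fixed[0], int(fixed[1]), int(fixed[2]), *fixed[3:7], negations)

def read_sentences(lines):
    """Yield lists of Token, one list per blank-line-separated sentence."""
    sentence = []
    for line in lines:
        if line.strip():
            sentence.append(parse_token_line(line))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence

# Made-up rows in the assumed layout, not real corpus lines.
sample = [
    "ExampleBook_ch1\t0\t0\tIt\tIt\tPRP\t(S(NP*)\t_\tIt\t_",
    "ExampleBook_ch1\t0\t1\tnever\tnever\tRB\t(ADVP*)\tnever\t_\t_",
    "ExampleBook_ch1\t0\t2\trecovered\trecover\tVBD\t(VP*))\t_\t_\trecovered",
    "",
    "ExampleBook_ch1\t1\t0\tYes\tyes\tUH\t(INTJ*)\t***",
]
sentences = list(read_sentences(sample))
print(len(sentences), sentences[0][1].negations)
```

Reading from a file works the same way, since an open file object iterates over lines: `read_sentences(open(path, encoding="utf-8"))`.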
Training | | Development | |
Freq. | Word | Freq. | Word |
35 | don't | 17 | 't |
11 | can't | 3 | don't |
7 | n't | | |
6 | isn't | | |
5 | didn't | | |
2 | couldn't | | |
Of the training data bigrams ending in n't there are:
- 4 do n't
- 1 did n't
- 1 had n't
- 1 wo n't
Of the development data bigrams ending in 't there are:
- 7 don 't
- 4 can 't
- 3 didn 't
- 1 couldn 't
- 1 shan 't
- 1 wasn 't
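Bigram counts like the two lists above can be produced with a short script. The sketch below assumes each sentence is already available as a list of word strings; the function name and the toy sentences are illustrative only, not corpus data.

```python
from collections import Counter

def contraction_bigrams(sentences, suffix):
    """Count (previous word, word) pairs whose second word equals `suffix`."""
    counts = Counter()
    for words in sentences:
        for prev, word in zip(words, words[1:]):
            if word == suffix:
                counts[(prev, word)] += 1
    return counts

# Toy input, not taken from the corpus.
toy = [
    ["I", "do", "n't", "know"],
    ["He", "do", "n't", "care"],
    ["I", "wo", "n't", "go"],
]
for (prev, word), freq in contraction_bigrams(toy, "n't").most_common():
    print(f"- {freq} {prev} {word}")
```

Calling it once with `"n't"` on the training sentences and once with `"'t"` on the development sentences would give the two tabulations separately.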
There is a full listing of tokens containing punctuation here: JimWhite/StarSemTokenTabulation.
HoundOfTheBaskervilles_ch1, s1: prefixed cue, weirdness
- Mr. Sherlock Holmes, who was usually very late in the mornings, save upon {those} not <in>{frequent occasions when he was up all night}, was seated at the breakfast table.
- Mr. Sherlock Holmes, who was usually very late in the mornings, save upon {those} <not> {infrequent occasions when he was up all night}, was seated at the breakfast table.
- Mr. Sherlock Holmes, {who was} usually {very late in the mornings,} <save> {upon those not infrequent occasions when he was up all night}, was seated at the breakfast table.
HoundOfTheBaskervilles_ch1, s12: prefixed cue
- Since {we have been so} <un>{[fortunate]} {as to miss him} and have no notion of his errand, this accidental souvenir becomes of importance.
HoundOfTheBaskervilles_ch1, s67: discontinuous scope
- If {he was} in the hospital and yet <not> {on the staff} he could only have been a house-surgeon or a house-physician: little more than a senior student.
HoundOfTheBaskervilles_ch1, s8: weirdness
- It is my experience that it is only an amiable man in this world who receives testimonials, only {an} <un>[ambitious] {one who abandons a London career for the country}, and only an absent-minded one who leaves his stick and not his visiting-card after waiting an hour in your room.
HoundOfTheBaskervilles_ch1, s89: discontinuous scope
- {The dog's jaw}, as shown in the space between these marks, {is} too broad in my opinion for a terrier and <not> {[broad] enough for a mastiff}.
HoundOfTheBaskervilles_ch3, s235: Multi-word cue, discontinuous scope
- Then, again, whom was he waiting for that night, and why was {he [waiting] for him} in the yew alley <rather than> {in his own house}?"
HoundOfTheBaskervilles_ch4, s154: contracted cue
- But as to my uncle's death: well, it all seems boiling up in my head, and {I [can]}<'t> {get it clear yet}.
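The distinction between a discontinuous scope and one that is merely bridged by the cue, as in the examples above, can be made mechanical once scope and cue are represented as sets of token indices. The sketch below is one possible formulation, with invented function names; a cue that is only part of a token (an affix) is assumed to contribute that token's index.

```python
def is_contiguous(indices):
    """True if the indices form one unbroken run of integers."""
    ordered = sorted(set(indices))
    return bool(ordered) and ordered[-1] - ordered[0] + 1 == len(ordered)

def classify_scope(scope, cue):
    """Label a scope as 'no scope', 'continuous', 'bridged by cue' or 'discontinuous'."""
    if not scope:
        return "no scope"
    if is_contiguous(scope):
        return "continuous"
    if is_contiguous(set(scope) | set(cue)):
        return "bridged by cue"
    return "discontinuous"

# ch4, s154: {I [can]}<'t> {get it clear yet}, indexed from "I" = 0
print(classify_scope({0, 1, 3, 4, 5, 6}, {2}))   # bridged by cue
# ch1, s67: If {he was} in the hospital and yet <not> {on the staff}, indexed from "If" = 0
print(classify_scope({1, 2, 9, 10, 11}, {8}))    # discontinuous
```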
Freq. | Cue | POS |
346 | not | RB |
137 | no | DT |
71 | un | JJ |
64 | no | UH |
58 | never | RB |
55 | nothing | NN |
36 | n't | RB |
24 | without | IN |
22 | less | JJ |
18 | no | RB |
17 | in | JJ |
16 | im | JJ |
12 | none | NN |
8 | n't | JJ |
6 | 't | RB |
6 | n't | VB |
5 | n't | NN |
5 | no | NNP |
5 | ir | JJ |
4 | nor | CC |
4 | un | RB |
4 | less | RB |
4 | in | RB |
3 | dis | NN |
3 | not | VB |
3 | less | NN |
2 | '<NULL>' | '<NULL>' |
2 | not | JJ |
2 | un | NN |
2 | not | NN |
2 | un | IN |
2 | nowhere | RB |
2 | by_no_means | IN_DT_NN |
2 | prevent | VB |
2 | n't | NNP |
2 | 't | NN |
2 | im | RB |
2 | on_the_contrary | IN_DT_NN |
1 | rather_than | RB_IN |
1 | nobody | NN |
1 | been | VBN |
1 | fail | VBP |
1 | neither_*_nor | CC_*_CC |
1 | absence | NN |
1 | other | JJ |
1 | nothing_at_all | NN_IN_DT |
1 | can | MD |
1 | neglected | VBN |
1 | ir | RB |
1 | un | VBG |
1 | refused | VBD |
1 | the | DT |
1 | yet | RB |
1 | never | NNP |
1 | save | VBP |
1 | not_for_the_world | RB_IN_DT_NN |
1 | un | VBN |
1 | signs | NNS |
1 | in | NNS |
1 | no | JJ |
1 | unusual | JJ |
1 | dis | VBN |
1 | neither_*_nor | DT_*_CC |
1 | by_no_means | IN_RB_VBZ |
1 | not_*_not | RB_*_RB |
1 | except | IN |
1 | dis | JJ |
The full list is here. There are 367 token/POS types.
Freq. | Word | POS |
51 | could | MD |
25 | can | RB |
19 | have | VBP |
14 | had | VBD |
12 | know | VB |
10 | know | VBP |
7 | able | JJ |
7 | seen | VBN |
6 | happy | JJ |
5 | pleasant | JJ |
5 | like | IN |
5 | sign | NN |
5 | say | VB |
5 | man | NN |
4 | likely | JJ |
4 | heard | VBN |
4 | saw | VBD |
4 | can | MD |
4 | possible | JJ |
4 | known | JJ |
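Frequency tables of this shape are straightforward to regenerate from (token, POS) pairs. The sketch below is illustrative rather than the script used for these notes: the function names are invented, and multi-word items are assumed to have been joined with `_` beforehand.

```python
from collections import Counter

def token_pos_frequencies(pairs):
    """Return ((token, pos), count) tuples, most frequent first."""
    return Counter(pairs).most_common()

def format_rows(freqs):
    """Render frequencies in the pipe-separated layout used in these notes."""
    return [f"{count} | {token} | {pos} |" for (token, pos), count in freqs]

# Toy input, not corpus counts.
pairs = [("not", "RB"), ("no", "DT"), ("not", "RB"), ("by_no_means", "IN_DT_NN")]
freqs = token_pos_frequencies(pairs)
print("\n".join(format_rows(freqs)))
print(len(freqs), "token/POS types")
```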