WeSearch_StarSem

JonathonRead edited this page Feb 19, 2012 · 27 revisions

Some notes on the StarSEM 2012 shared task. I've used annotation conventions similar to those in our previous work: <> for cues, {} for scope and, now, [] for events. For papers, though, we should probably follow the conventions of Morante et al. (2011): bold for cues, underline for scope and italics for events.
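
To make the bracket notation concrete, here is a minimal sketch that renders a token sequence in the <>, {}, [] style used on this page. The `render` and `kind` helpers and the role sets are hypothetical, introduced only for illustration. It works on whole tokens, so sub-token cues such as <un>[ambitious] are not handled.

```python
from itertools import groupby


def kind(roles):
    """Pick the bracket type for a token: a cue takes priority over scope."""
    if "cue" in roles:
        return "cue"
    if "scope" in roles:
        return "scope"
    return "other"


def render(tokens):
    """Render (word, roles) pairs, where roles is a set drawn from
    {"cue", "scope", "event"}, using <> for cues, {} for scope, [] for events."""
    pieces = []
    for k, group in groupby(tokens, key=lambda t: kind(t[1])):
        # Event tokens get square brackets unless they are part of the cue.
        words = [f"[{w}]" if ("event" in r and k != "cue") else w
                 for w, r in group]
        if k == "cue":
            pieces.append("<" + " ".join(words) + ">")
        elif k == "scope":
            pieces.append("{" + " ".join(words) + "}")
        else:
            pieces.extend(words)
    return " ".join(pieces)


example = [
    ("It", {"scope"}),
    ("never", {"cue"}),
    ("recovered", {"event"}),
    ("from", {"scope"}),
    ("the", {"scope"}),
    ("blow", {"scope"}),
]
print(render(example))  # {It} <never> [recovered] {from the blow}
```

Consecutive tokens with the same bracket type are merged into one bracketed run, so a multi-word cue comes out as a single <...> group.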

Training data

3,640 sentences with 989 instances of negation.

99 instances have no scope; 101 instances have a discontinuous scope that is not bridged by the cue. Of the remaining 796 instances, 512 (64.3%) are aligned with some constituent from the Collins parser output (after slackening scope for constituent initial/final punctuation).
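
A rough sketch of the alignment check, under one possible reading of "slackening": leading and trailing punctuation tokens are ignored on both the scope span and the constituent span before comparing. The function names, the half-open `(start, end)` span representation and the punctuation test are assumptions made for illustration, not the code used to produce the counts above.

```python
import string


def is_punct(token):
    """True if the token consists only of ASCII punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)


def trim_punct(tokens, start, end):
    """Shrink the half-open span [start, end) past leading and trailing
    punctuation tokens."""
    while start < end and is_punct(tokens[start]):
        start += 1
    while end > start and is_punct(tokens[end - 1]):
        end -= 1
    return start, end


def scope_aligns(tokens, scope, constituents):
    """Check whether the scope span matches any constituent span once
    punctuation at either edge is ignored."""
    target = trim_punct(tokens, *scope)
    return any(trim_punct(tokens, s, e) == target for s, e in constituents)


tokens = ["He", "did", "not", "go", ",", "alas"]
constituents = [(1, 5)]  # hypothetical constituent covering "did not go ,"
print(scope_aligns(tokens, (1, 4), constituents))
```

Extracting the constituent spans from the syntax column is left out here.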

371 instances have no event; 14 instances have discontinuous events. In 6 instances the event lies outside of the scope; these seem to be annotation errors:

  • ... only {an} <un>[ambitious] {one who abandons a London career for the country} ...

  • ... {an} <un>[justifiable] {intrusion}, ...

  • {It} <never> [recovered] {from the blow}, ...

  • "But {I} [can]<'t> {forget them}, Miss Stapleton," said I.

  • ... and means to [spare] <no> {pains or expense} to restore the grandeur of his family.

  • Coming down with an <un>[signed] {warrant}.

Collins' coverage of the training data is 99.4% (21 of the 3,640 sentences have no parse). Those 21 sentences contain 10 instances of negation, for example:

  • "Know then that in the time of the Great Rebellion (the history of which by the learned Lord Clarendon I most earnestly commend to your attention) this Manor of Baskerville was held by Hugo of that name, nor <can> {[it] be gainsaid that he was a most wild, profane, and {[god]}<less> man}.

Collins can't lemmatise contractions, e.g. can't = <unknown>

Pseudo-CoNLL format

The files are provided in CoNLL format, with the first 7 columns corresponding to:

  1. Book_Chapter
  2. Sentence number within chapter
  3. token number within sentence
  4. word
  5. lemma
  6. part-of-speech
  7. syntax

If the sentence does not have negations:

  • 8. ***

Otherwise there are three columns per negation:

  • (8,11,14, ...) word (or part of word) that is part of the cue
  • (9,12,15, ...) word (or part of word) that is part of the scope
  • (10,13,16, ...) word (or part of word) that is part of the event

Lemmas, part-of-speech and syntax are automatically generated using the Shalmaneser (Erk and Padó, 2006) semantic parser, which in turn uses the Collins parser.
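
The column layout above could be read with something like the following sketch. Several details are assumptions rather than facts from these notes: columns are taken to be whitespace-separated, sentences to be separated by blank lines, and cells that are not part of a cue, scope or event to hold the placeholder `_`. The names `Token`, `parse_token_line` and `read_sentences` are hypothetical.

```python
from collections import namedtuple

Token = namedtuple(
    "Token", "chapter sent_id tok_id word lemma pos syntax negations"
)

EMPTY = "_"  # assumed placeholder for "not part of the cue/scope/event"


def parse_token_line(line):
    """Split one token line into the 7 fixed columns plus a list of
    (cue, scope, event) triples, one triple per negation."""
    cols = line.split()
    fixed, rest = cols[:7], cols[7:]
    if rest == ["***"]:
        triples = []  # sentence without negation
    else:
        if len(rest) % 3 != 0:
            raise ValueError(f"expected 3 columns per negation: {line!r}")
        triples = [
            tuple(None if cell == EMPTY else cell for cell in rest[i:i + 3])
            for i in range(0, len(rest), 3)
        ]
    return Token(fixed[0], int(fixed[1]), int(fixed[2]), *fixed[3:7], triples)


def read_sentences(lines):
    """Group token lines into sentences (blank line = sentence boundary)."""
    sentence = []
    for line in lines:
        if line.strip():
            sentence.append(parse_token_line(line))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence


# Made-up sample rows in the assumed layout, not copied from the data files.
sample = [
    "ch1\t0\t0\tI\tI\tPRP\t(S(NP*)\t_\tI\t_",
    "ch1\t0\t1\tcan\tcan\tMD\t(VP*\t_\tcan\tcan",
    "ch1\t0\t2\t't\t't\tRB\t*\t't\t_\t_",
    "",
    "ch1\t1\t0\tYes\tyes\tUH\t(S*)\t***",
]
sentences = list(read_sentences(sample))
```

Each token keeps one triple per negation in the sentence, so the n-th triple across all tokens of a sentence describes the n-th negation instance.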

Tokenisation issues

Training            Development
Freq.  Word         Freq.  Word
35     don't        17     't
11     can't        3      don't
7      n't
6      isn't
5      didn't
2      couldn't

Of the training data bigrams ending in n't there are:

  • 4 do n't
  • 1 did n't
  • 1 had n't
  • 1 wo n't

Of the development data bigrams ending in 't there are:

  • 7 don 't
  • 4 can 't
  • 3 didn 't
  • 1 couldn 't
  • 1 shan 't
  • 1 wasn 't
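
Bigram counts like the ones above can be tabulated with a few lines of Python. This is only a sketch: `contraction_bigrams` is a hypothetical helper, and it assumes each sentence is already available as a list of word tokens.

```python
from collections import Counter


def contraction_bigrams(sentences, suffix):
    """Count (previous word, token) pairs in which the second token is
    exactly `suffix`, e.g. "n't" or "'t"."""
    counts = Counter()
    for words in sentences:
        for prev, tok in zip(words, words[1:]):
            if tok == suffix:
                counts[(prev, tok)] += 1
    return counts


# Toy input showing both tokenisation styles.
toy = [
    ["I", "do", "n't", "know"],
    ["I", "don", "'t", "know"],
    ["They", "do", "n't"],
]
nt_counts = contraction_bigrams(toy, "n't")
t_counts = contraction_bigrams(toy, "'t")
```

Running it once per data set with each suffix makes the difference between the two tokenisation styles easy to see.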

There is a full listing of tokens containing punctuation here: JimWhite/StarSemTokenTabulation.

Some examples

HoundOfTheBaskervilles_ch1, s1: prefixed cue, weirdness

  • Mr. Sherlock Holmes, who was usually very late in the mornings, save upon {those} not <in>{frequent occasions when he was up all night}, was seated at the breakfast table.

  • Mr. Sherlock Holmes, who was usually very late in the mornings, save upon {those} <not> {infrequent occasions when he was up all night}, was seated at the breakfast table.

  • Mr. Sherlock Holmes, {who was} usually {very late in the mornings,} <save> {upon those not infrequent occasions when he was up all night}, was seated at the breakfast table.

HoundOfTheBaskervilles_ch1, s12: prefixed cue

  • Since {we have been so} <un>{[fortunate]} {as to miss him} and have no notion of his errand, this accidental souvenir becomes of importance.

HoundOfTheBaskervilles_ch1, s67: discontinuous scope

  • If {he was} in the hospital and yet <not> {on the staff} he could only have been a house-surgeon or a house-physician: little more than a senior student.

HoundOfTheBaskervilles_ch1, s8: weirdness

  • It is my experience that it is only an amiable man in this world who receives testimonials, only {an} <un>[ambitious] {one who abandons a London career for the country}, and only an absent-minded one who leaves his stick and not his visiting-card after waiting an hour in your room.

HoundOfTheBaskervilles_ch1, s89: discontinuous scope

  • {The dog's jaw}, as shown in the space between these marks, {is} too broad in my opinion for a terrier and <not> {[broad] enough for a mastiff}.

HoundOfTheBaskervilles_ch3, s235: Multi-word cue, discontinuous scope

  • Then, again, whom was he waiting for that night, and why was {he [waiting] for him} in the yew alley <rather than> {in his own house}?"

HoundOfTheBaskervilles_ch4, s154: contracted cue

  • But as to my uncle's death: well, it all seems boiling up in my head, and {I [can]}<'t> {get it clear yet}.

Instances of cues

Freq. Cue POS
346 not RB
137 no DT
71 un JJ
64 no UH
58 never RB
55 nothing NN
36 n't RB
24 without IN
22 less JJ
18 no RB
17 in JJ
16 im JJ
12 none NN
8 n't JJ
6 't RB
6 n't VB
5 n't NN
5 no NNP
5 ir JJ
4 nor CC
4 un RB
4 less RB
4 in RB
3 dis NN
3 not VB
3 less NN
2 '<NULL>' '<NULL>'
2 not JJ
2 un NN
2 not NN
2 un IN
2 nowhere RB
2 by_no_means IN_DT_NN
2 prevent VB
2 n't NNP
2 't NN
2 im RB
2 on_the_contrary IN_DT_NN
1 rather_than RB_IN
1 nobody NN
1 been VBN
1 fail VBP
1 neither_*_nor CC_*_CC
1 absence NN
1 other JJ
1 nothing_at_all NN_IN_DT
1 can MD
1 neglected VBN
1 ir RB
1 un VBG
1 refused VBD
1 the DT
1 yet RB
1 never NNP
1 save VBP
1 not_for_the_world RB_IN_DT_NN
1 un VBN
1 signs NNS
1 in NNS
1 no JJ
1 unusual JJ
1 dis VBN
1 neither_*_nor DT_*_CC
1 by_no_means IN_RB_VBZ
1 not_*_not RB_*_RB
1 except IN
1 dis JJ

Top 20 most frequent tokens and POS tags for events (>= 4 instances)

The full list is here. There are 367 token/pos types.

Freq. Word POS
51 could MD
25 can RB
19 have VBP
14 had VBD
12 know VB
10 know VBP
7 able JJ
7 seen VBN
6 happy JJ
5 pleasant JJ
5 like IN
5 sign NN
5 say VB
5 man NN
4 likely JJ
4 heard VBN
4 saw VBD
4 can MD
4 possible JJ
4 known JJ