
Ontonotes data and experiments #32

Open
anoopsarkar opened this issue Jun 9, 2016 · 22 comments

@anoopsarkar
Member

Task: Convert the Ontonotes data into the CoNLL format.

The instructions for conversion are given here: http://cemantix.org/data/ontonotes.html

That page also provides the script to convert to CoNLL format for all three languages: English, Chinese, and Arabic.

@jerryljq

jerryljq commented Jun 9, 2016

Hi Anoop, I think this issue should also be assigned to me, right? Jiaqi.

@anoopsarkar
Member Author

@vista521 I added you to the development team on GitHub. Once you accept, I can add you as an assignee.

@jerryljq

jerryljq commented Jun 9, 2016

@anoopsarkar Oops, it seems I didn't get any notification to accept it. Something is probably wrong...

@anoopsarkar
Member Author

@vista521 you should have access now.

For all the assignees: please push only to a branch for the Ontonotes experiments, along with your collaborators. When the converters are written and the experiments are done, please send a pull request with any Ontonotes conversion code and the log files for the experiments.

@liushiqi9
Contributor

@anoopsarkar
So far we have converted only the English data, because we don't have the CoNLL-formatted data for Chinese and Arabic.

This is what English.conll looks like:

bn/abc/00/abc_0009   0    4    optimistic    JJ       (S(ADJP*        -    -   -   -       *   (C-ARG1*     -
bn/abc/00/abc_0009   0    5         about    IN           (PP*        -    -   -   -       *          *     -
bn/abc/00/abc_0009   0    6           the    DT        (NP(NP*        -    -   -   -       *          *     -
bn/abc/00/abc_0009   0    7        future    NN              *)   future   -   1   -       *          *     -
bn/abc/00/abc_0009   0    8            of    IN           (PP*        -    -   -   -       *          *     -
bn/abc/00/abc_0009   0    9           the    DT           (NP*        -    -   -   -       *          *   (12
bn/abc/00/abc_0009   0   10       Mideast   NNP        *)))))))       -    -   -   -    (LOC)         *)   12)
bn/abc/00/abc_0009   0   11             .     .             *))       -    -   -   -       *          *     -

bn/abc/00/abc_0009   0    0          That    DT   (TOP(SINV(SBAR(S(NP*)       -    -   -   -       *   (ARG1*     -
bn/abc/00/abc_0009   0    1            's   VBZ                   (VP*        be   -   1   -       *        *     -
bn/abc/00/abc_0009   0    2           the    DT                (NP(NP*        -    -   -   -       *        *     -
bn/abc/00/abc_0009   0    3    heartbreak    NN                      *)       -    -   -   -       *        *     -
bn/abc/00/abc_0009   0    4            of    IN                   (PP*        -    -   -   -       *        *     -
bn/abc/00/abc_0009   0    5          this    DT                   (NP*        -    -   -   -       *        *   (12
bn/abc/00/abc_0009   0    6        region    NN                 *))))))   region   -   3   -       *        *)   12)
bn/abc/00/abc_0009   0    7          says   VBZ                   (VP*)      say  01   1   -       *      (V*)    -
bn/abc/00/abc_0009   0    8           one    CD                   (NP*        -    -   -   -       *   (ARG0*     -
bn/abc/00/abc_0009   0    9         State   NNP                  (NML*        -    -   -   -   (ORG*        *     -
bn/abc/00/abc_0009   0   10    Department   NNP                      *)       -    -   -   -       *)       *     -
bn/abc/00/abc_0009   0   11      official    NN                      *)       -    -   -   -       *        *)    -
bn/abc/00/abc_0009   0   12             .     .                     *))       -    -   -   -       *        *     -

bn/abc/00/abc_0009   0    0    Whenever   WRB   (TOP(S(SBAR(WHADVP*)     -    -   -   -   *    (ARGM-TMP*)     *   (ARGM-TMP*      *             *   -
bn/abc/00/abc_0009   0    1         you   PRP                (S(NP*)     -    -   -   -   *        (ARG0*)     *            *      *             *   -
bn/abc/00/abc_0009   0    2        take   VBP                  (VP*    take  01   1   -   *           (V*)     *            *      *             *   -
bn/abc/00/abc_0009   0    3           a    DT                  (NP*      -    -   -   -   *        (ARG1*      *            *      *             *   -
bn/abc/00/abc_0009   0    4        step    NN                     *)   step   -   1   -   *             *)     *            *      *             *   -
bn/abc/00/abc_0009   0    5     forward    RB             (ADVP*))))     -    -   -   -   *    (ARGM-DIR*)     *            *)     *             *   -
bn/abc/00/abc_0009   0    6           ,     ,                     *      -    -   -   -   *             *      *            *      *             *   -
bn/abc/00/abc_0009   0    7         you   PRP                  (NP*)     -    -   -   -   *             *      *            *      *        (ARG1*)  -
bn/abc/00/abc_0009   0    8         are   VBP                  (VP*      be  03   -   -   *             *    (V*)           *      *             *   -
bn/abc/00/abc_0009   0    9       bound   VBN                  (VP*    bind  02   -   -   *             *      *          (V*)     *             *   -
bn/abc/00/abc_0009   0   10          to    TO                (S(VP*      -    -   -   -   *             *      *       (ARG1*      *             *   -
bn/abc/00/abc_0009   0   11          be    VB                  (VP*      be  03   -   -   *             *      *            *    (V*)            *   -
bn/abc/00/abc_0009   0   12      pushed   VBN                  (VP*    push  01   1   -   *             *      *            *      *           (V*)  -
bn/abc/00/abc_0009   0   13         way    RB                (ADVP*)     -    -   -   -   *             *      *            *      *    (ARGM-EXT*)  -
bn/abc/00/abc_0009   0   14        back    RB          (ADVP*)))))))     -    -   -   -   *             *      *            *)     *        (ARG2*)  -
bn/abc/00/abc_0009   0   15           .     .                    *))     -    -   -   -   *             *      *            *      *             *   -

bn/abc/00/abc_0009   0   0        Martha   NNP  (TOP(FRAG(NP*   -   -   -   -   (PERSON*   -
bn/abc/00/abc_0009   0   1       Raddatz   NNP              *)  -   -   -   -          *)  -
bn/abc/00/abc_0009   0   2             ,     ,              *   -   -   -   -          *   -
bn/abc/00/abc_0009   0   3           ABC   NNP           (NP*   -   -   -   -      (ORG*   -
bn/abc/00/abc_0009   0   4          News   NNP              *)  -   -   -   -          *)  -
bn/abc/00/abc_0009   0   5             ,     ,              *   -   -   -   -          *   -
bn/abc/00/abc_0009   0   6           the    DT           (NP*   -   -   -   -      (FAC*   -
bn/abc/00/abc_0009   0   7         State   NNP              *   -   -   -   -          *   -
bn/abc/00/abc_0009   0   8    Department   NNP              *)  -   -   -   -          *)  -
bn/abc/00/abc_0009   0   9             .     .             *))  -   -   -   -          *   -
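
For later reference, here is a minimal sketch of how these rows could be loaded, assuming the standard CoNLL-2012 column order (document id, part number, word index, word, POS, parse bit, predicate lemma, frameset id, word sense, speaker, named entity, one column per predicate, coreference last); the file name is just a placeholder.

def read_conll(path):
    """Group CoNLL-2012 rows into sentences (sentences are separated by blank lines)."""
    sentences, tokens = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):  # blank line ends a sentence; skip #begin/#end markers
                if tokens:
                    sentences.append(tokens)
                    tokens = []
                continue
            cols = line.split()
            tokens.append({
                "doc_id": cols[0],
                "part": cols[1],
                "index": int(cols[2]),
                "word": cols[3],
                "pos": cols[4],
                "parse": cols[5],
                "lemma": cols[6],
                "frameset": cols[7],
                "sense": cols[8],
                "speaker": cols[9],
                "ne": cols[10],
                "pred_args": cols[11:-1],  # one column per predicate in the sentence
                "coref": cols[-1],
            })
    if tokens:
        sentences.append(tokens)
    return sentences

# e.g. [(t["word"], t["pos"]) for t in read_conll("english.conll")[0]]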

@anoopsarkar
Member Author

According to the website http://cemantix.org/data/ontonotes.html, the skeleton2conll.sh script should work on all three Ontonotes languages. Does it do something strange when run on Chinese and Arabic?

@anoopsarkar
Member Author

Try the script on this page:
http://conll.cemantix.org/2012/data.html

@jerryljq

I have downloaded the new script and format files and ran the script with them. It seems the new files work for Chinese and Arabic.

@anoopsarkar
Member Author

Did you also repeat the conversion for English with the new script?

@jerryljq

@anoopsarkar Yes, I think so. The new script comes with a new data set, including all three languages. I just saved all those converted files in a new folder.

@anoopsarkar
Member Author

OK. The next step will be to create a new format file and config files for the new data. Then the experiments can be run to train on Ontonotes for each language and measure UAS on the dev data.

@anoopsarkar
Member Author

English Ontonotes skel files are available at this location:

https://github.com/ontonotes/conll-formatted-ontonotes-5.0

(just for future reference)

@jerryljq

@anoopsarkar Hi Anoop, I read the meeting notes from last week. We should start training and testing on the dev sets. Since the data does not include dependency trees, should we just run pos_tagger.py to train and test the POSTAG only?

@anoopsarkar
Member Author

@vista521 the plan was to use Penn2Malt for English and Chinese (we have the head rules for these two languages) to convert the constituency trees into dependency format.

@jerryljq

@anoopsarkar I have extracted the data to use as input to Penn2Malt, but I have a problem. When I run the tool with the head rules provided on its website, it reports "could not find category" when it tries to match labels such as TOP, NML, and so on. These labels are not in the head-rule file. I wonder whether there is a newer version of the head-rule file, since the one provided on the website dates back to 2003 while our data is from 2012. I also did not find any related files locally. Could you help with this problem?
The Penn2Malt website is: http://stp.lingfil.uu.se/~nivre/research/Penn2Malt.html
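
To illustrate where the error comes from: Penn2Malt-style conversion picks a head child for each constituent by looking up the constituent's label in a rule table, so any label missing from the 2003-era rule file (TOP, NML, ...) has nothing to match against. Below is a toy sketch of that lookup; the rule table is made up for illustration and is not the real Penn2Malt head-rule file.

# Toy head-rule lookup; HEAD_RULES is illustrative, not the real Penn2Malt table.
HEAD_RULES = {
    "S":  ("right", ["VP", "S"]),
    "VP": ("left",  ["VBZ", "VBP", "VBD", "VB", "VP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
}

def head_child(label, child_labels):
    """Return the index of the head child for a constituent with the given label."""
    if label not in HEAD_RULES:
        # This is the situation behind "could not find category":
        # newer labels such as TOP or NML have no entry in the old rule file.
        raise KeyError("could not find category: " + label)
    direction, preferred = HEAD_RULES[label]
    indices = list(range(len(child_labels)))
    if direction == "right":          # search children right-to-left
        indices.reverse()
    for cand in preferred:            # try the preferred labels in priority order
        for i in indices:
            if child_labels[i] == cand:
                return i
    return indices[0]                 # fall back to the first child in search order

# head_child("NP", ["DT", "NN"]) -> 1; head_child("NML", ["NNP", "NNP"]) raises KeyError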

@anoopsarkar
Member Author

@vista521 was this for English or Chinese?

@jerryljq

@anoopsarkar It's for both English and Chinese.

@jerryljq

@anoopsarkar Hi Anoop, are there any updates or ideas on the issue above?

@anoopsarkar
Member Author

Have a look at the English head rules given in this presentation:
http://nlp.mathcs.emory.edu/doc/tlt-2010-choi-slides.pdf

@anoopsarkar
Member Author

An implementation of the head rules from the above presentation seems to be here:
https://github.com/clir/clearnlp/blob/master/src/main/java/edu/emory/clir/clearnlp/conversion/EnglishC2DConverter.java

You may have to install the entire clearNLP toolkit:

https://github.com/clir/clearnlp

@jerryljq

jerryljq commented Aug 2, 2016

@anoopsarkar I searched the whole project you pointed to above and finally found a txt file that contains the organized head rules. Now the Penn2Malt tool works. However, the head rules only cover English, so it seems Chinese and Arabic cannot be handled.

@kalryoma

kalryoma commented Aug 3, 2016

I've tested the converted data for English. The results are as follows (5 iterations, 75,000+ sentences per iteration):

Total Training Time:  29926.6495328
Interface object FirstOrderFeatureGenerator detected
     with interface get_local_vector
Evaluating...
Unlabeled accuracy: 0.874648393964
Unlabeled attachment accuracy: 0.881284276603
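
For reference, the unlabeled attachment accuracy above is just the fraction of tokens whose predicted head matches the gold head. A minimal sketch (function and variable names are illustrative, not the project's actual evaluation code):

def uas(gold_heads, pred_heads):
    """Unlabeled attachment score: proportion of tokens with the correct head index."""
    assert len(gold_heads) == len(pred_heads)
    if not gold_heads:
        return 0.0
    correct = sum(1 for g, p in zip(gold_heads, pred_heads) if g == p)
    return correct / float(len(gold_heads))

# uas([2, 0, 2], [2, 0, 1]) -> 0.666...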
