
Ontonotes data and experiments #32

Open
anoopsarkar opened this issue Jun 9, 2016 · 22 comments

@anoopsarkar
Member

Task: Convert the Ontonotes data into the CoNLL format.

The instructions for conversion are given here: http://cemantix.org/data/ontonotes.html

That page also provides the script to convert to CoNLL format for all three languages: English, Chinese, and Arabic.

@jerryljq

jerryljq commented Jun 9, 2016

Hi Anoop, I think this issue should also be assigned to me, right? Jiaqi.

@anoopsarkar
Member Author

@vista521 I added you to the development team on GitHub. Once you accept, I can add you as an assignee.

@jerryljq

jerryljq commented Jun 9, 2016

@anoopsarkar Oops, it seems I didn't get any notification to accept it. Something is probably wrong...

@anoopsarkar
Member Author

@vista521 you should have access now.

For all the assignees: please push only to a branch for the Ontonotes experiments, along with your collaborators. When the converters are written and the experiments are done, please send a pull request with any Ontonotes conversion code and the log files for the experiments.

@liushiqi9
Contributor

@anoopsarkar
So far we have converted only the English data, because we don't have the CoNLL-formatted data for Chinese and Arabic.

This is what English.conll looks like:

bn/abc/00/abc_0009   0    4    optimistic    JJ       (S(ADJP*        -    -   -   -       *   (C-ARG1*     -
bn/abc/00/abc_0009   0    5         about    IN           (PP*        -    -   -   -       *          *     -
bn/abc/00/abc_0009   0    6           the    DT        (NP(NP*        -    -   -   -       *          *     -
bn/abc/00/abc_0009   0    7        future    NN              *)   future   -   1   -       *          *     -
bn/abc/00/abc_0009   0    8            of    IN           (PP*        -    -   -   -       *          *     -
bn/abc/00/abc_0009   0    9           the    DT           (NP*        -    -   -   -       *          *   (12
bn/abc/00/abc_0009   0   10       Mideast   NNP        *)))))))       -    -   -   -    (LOC)         *)   12)
bn/abc/00/abc_0009   0   11             .     .             *))       -    -   -   -       *          *     -

bn/abc/00/abc_0009   0    0          That    DT   (TOP(SINV(SBAR(S(NP*)       -    -   -   -       *   (ARG1*     -
bn/abc/00/abc_0009   0    1            's   VBZ                   (VP*        be   -   1   -       *        *     -
bn/abc/00/abc_0009   0    2           the    DT                (NP(NP*        -    -   -   -       *        *     -
bn/abc/00/abc_0009   0    3    heartbreak    NN                      *)       -    -   -   -       *        *     -
bn/abc/00/abc_0009   0    4            of    IN                   (PP*        -    -   -   -       *        *     -
bn/abc/00/abc_0009   0    5          this    DT                   (NP*        -    -   -   -       *        *   (12
bn/abc/00/abc_0009   0    6        region    NN                 *))))))   region   -   3   -       *        *)   12)
bn/abc/00/abc_0009   0    7          says   VBZ                   (VP*)      say  01   1   -       *      (V*)    -
bn/abc/00/abc_0009   0    8           one    CD                   (NP*        -    -   -   -       *   (ARG0*     -
bn/abc/00/abc_0009   0    9         State   NNP                  (NML*        -    -   -   -   (ORG*        *     -
bn/abc/00/abc_0009   0   10    Department   NNP                      *)       -    -   -   -       *)       *     -
bn/abc/00/abc_0009   0   11      official    NN                      *)       -    -   -   -       *        *)    -
bn/abc/00/abc_0009   0   12             .     .                     *))       -    -   -   -       *        *     -

bn/abc/00/abc_0009   0    0    Whenever   WRB   (TOP(S(SBAR(WHADVP*)     -    -   -   -   *    (ARGM-TMP*)     *   (ARGM-TMP*      *             *   -
bn/abc/00/abc_0009   0    1         you   PRP                (S(NP*)     -    -   -   -   *        (ARG0*)     *            *      *             *   -
bn/abc/00/abc_0009   0    2        take   VBP                  (VP*    take  01   1   -   *           (V*)     *            *      *             *   -
bn/abc/00/abc_0009   0    3           a    DT                  (NP*      -    -   -   -   *        (ARG1*      *            *      *             *   -
bn/abc/00/abc_0009   0    4        step    NN                     *)   step   -   1   -   *             *)     *            *      *             *   -
bn/abc/00/abc_0009   0    5     forward    RB             (ADVP*))))     -    -   -   -   *    (ARGM-DIR*)     *            *)     *             *   -
bn/abc/00/abc_0009   0    6           ,     ,                     *      -    -   -   -   *             *      *            *      *             *   -
bn/abc/00/abc_0009   0    7         you   PRP                  (NP*)     -    -   -   -   *             *      *            *      *        (ARG1*)  -
bn/abc/00/abc_0009   0    8         are   VBP                  (VP*      be  03   -   -   *             *    (V*)           *      *             *   -
bn/abc/00/abc_0009   0    9       bound   VBN                  (VP*    bind  02   -   -   *             *      *          (V*)     *             *   -
bn/abc/00/abc_0009   0   10          to    TO                (S(VP*      -    -   -   -   *             *      *       (ARG1*      *             *   -
bn/abc/00/abc_0009   0   11          be    VB                  (VP*      be  03   -   -   *             *      *            *    (V*)            *   -
bn/abc/00/abc_0009   0   12      pushed   VBN                  (VP*    push  01   1   -   *             *      *            *      *           (V*)  -
bn/abc/00/abc_0009   0   13         way    RB                (ADVP*)     -    -   -   -   *             *      *            *      *    (ARGM-EXT*)  -
bn/abc/00/abc_0009   0   14        back    RB          (ADVP*)))))))     -    -   -   -   *             *      *            *)     *        (ARG2*)  -
bn/abc/00/abc_0009   0   15           .     .                    *))     -    -   -   -   *             *      *            *      *             *   -

bn/abc/00/abc_0009   0   0        Martha   NNP  (TOP(FRAG(NP*   -   -   -   -   (PERSON*   -
bn/abc/00/abc_0009   0   1       Raddatz   NNP              *)  -   -   -   -          *)  -
bn/abc/00/abc_0009   0   2             ,     ,              *   -   -   -   -          *   -
bn/abc/00/abc_0009   0   3           ABC   NNP           (NP*   -   -   -   -      (ORG*   -
bn/abc/00/abc_0009   0   4          News   NNP              *)  -   -   -   -          *)  -
bn/abc/00/abc_0009   0   5             ,     ,              *   -   -   -   -          *   -
bn/abc/00/abc_0009   0   6           the    DT           (NP*   -   -   -   -      (FAC*   -
bn/abc/00/abc_0009   0   7         State   NNP              *   -   -   -   -          *   -
bn/abc/00/abc_0009   0   8    Department   NNP              *)  -   -   -   -          *)  -
bn/abc/00/abc_0009   0   9             .     .             *))  -   -   -   -          *   -
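
For later reference, here is a minimal sketch of how these rows could be loaded, assuming the standard CoNLL-2012 column order (document id, part number, word index, word, POS, parse bit, predicate lemma, frameset id, word sense, speaker, named entity, one column per predicate, coreference last); the file name is just a placeholder.

def read_conll(path):
    """Group CoNLL-2012 rows into sentences (sentences are separated by blank lines)."""
    sentences, tokens = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):  # blank line ends a sentence; skip #begin/#end markers
                if tokens:
                    sentences.append(tokens)
                    tokens = []
                continue
            cols = line.split()
            tokens.append({
                "doc_id": cols[0],
                "part": cols[1],
                "index": int(cols[2]),
                "word": cols[3],
                "pos": cols[4],
                "parse": cols[5],
                "lemma": cols[6],
                "frameset": cols[7],
                "sense": cols[8],
                "speaker": cols[9],
                "ne": cols[10],
                "pred_args": cols[11:-1],  # one column per predicate in the sentence
                "coref": cols[-1],
            })
    if tokens:
        sentences.append(tokens)
    return sentences

# e.g. [(t["word"], t["pos"]) for t in read_conll("english.conll")[0]]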

@anoopsarkar
Member Author

According to the website http://cemantix.org/data/ontonotes.html, the skeleton2conll.sh script should work on all three Ontonotes languages. Does it do something strange when run on Chinese and Arabic?

@anoopsarkar
Member Author

Try the script on this page:
http://conll.cemantix.org/2012/data.html

@jerryljq

I have downloaded the new script and format files and ran the script with them. It seems the new files work for Chinese and Arabic.

@anoopsarkar
Member Author

Did you also repeat the conversion for English with the new script?

@jerryljq

@anoopsarkar Yes, I think so. The new script comes with a new data set, including all three languages. I just saved all those converted files in a new folder.

@anoopsarkar
Member Author

OK. The next step will be to create a new format file and config files for the new data. Then the experiments can be run to train on Ontonotes for each language and measure UAS on the dev data.

@anoopsarkar
Member Author

English Ontonotes skel files are available at this location:

https://github.com/ontonotes/conll-formatted-ontonotes-5.0

(just for future reference)

@jerryljq

@anoopsarkar Hi Anoop, I read the meeting notes from last week. We should start training and testing on the dev sets. Since the data does not include dependency trees, should we just run pos_tagger.py to train and test the POSTAG only?

@anoopsarkar
Member Author

@vista521 the plan was to use Penn2Malt for English and Chinese (we have the head rules for these two languages) to convert the constituency trees into dependency format.

@jerryljq

@anoopsarkar I have extracted the data to use as input to Penn2Malt, but I have a problem. When I run the tool with the head rules provided on its website, it reports "could not find category" when it tries to match labels such as TOP, NML, and so on. These labels are not in the head-rule file. I wonder whether there is a newer version of the head-rule file, since the one provided on the website dates back to 2003 while our data is from 2012. I also did not find any related files locally. Could you help with this problem?
The Penn2Malt website is: http://stp.lingfil.uu.se/~nivre/research/Penn2Malt.html
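
To illustrate where the error comes from: Penn2Malt-style conversion picks a head child for each constituent by looking up the constituent's label in a rule table, so any label missing from the 2003-era rule file (TOP, NML, ...) has nothing to match against. Below is a toy sketch of that lookup; the rule table is made up for illustration and is not the real Penn2Malt head-rule file.

# Toy head-rule lookup; HEAD_RULES is illustrative, not the real Penn2Malt table.
HEAD_RULES = {
    "S":  ("right", ["VP", "S"]),
    "VP": ("left",  ["VBZ", "VBP", "VBD", "VB", "VP"]),
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),
}

def head_child(label, child_labels):
    """Return the index of the head child for a constituent with the given label."""
    if label not in HEAD_RULES:
        # This is the situation behind "could not find category":
        # newer labels such as TOP or NML have no entry in the old rule file.
        raise KeyError("could not find category: " + label)
    direction, preferred = HEAD_RULES[label]
    indices = list(range(len(child_labels)))
    if direction == "right":          # search children right-to-left
        indices.reverse()
    for cand in preferred:            # try the preferred labels in priority order
        for i in indices:
            if child_labels[i] == cand:
                return i
    return indices[0]                 # fall back to the first child in search order

# head_child("NP", ["DT", "NN"]) -> 1; head_child("NML", ["NNP", "NNP"]) raises KeyError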

@anoopsarkar
Member Author

@vista521 was this for English or Chinese?

@jerryljq

@anoopsarkar It's for both English and Chinese.

@jerryljq

@anoopsarkar Hi Anoop, are there any updates or ideas on the issue above?

@anoopsarkar
Member Author

Have a look at the English head rules given in this presentation:
http://nlp.mathcs.emory.edu/doc/tlt-2010-choi-slides.pdf

@anoopsarkar
Member Author

An implementation of the head rules from the above presentation seems to be here:
https://github.com/clir/clearnlp/blob/master/src/main/java/edu/emory/clir/clearnlp/conversion/EnglishC2DConverter.java

You may have to install the entire clearNLP toolkit:

https://github.com/clir/clearnlp

@jerryljq

jerryljq commented Aug 2, 2016

@anoopsarkar I searched the whole project you pointed to above and finally found a txt file that contains the organized head rules. Now the Penn2Malt tool works. However, the head rules only cover English, so it seems Chinese and Arabic cannot be handled.

@kalryoma

kalryoma commented Aug 3, 2016

I've tested the converted data for English. The results are as follows (5 iterations, 75,000+ sentences per iteration):

Total Training Time:  29926.6495328
Interface object FirstOrderFeatureGenerator detected
     with interface get_local_vector
Evaluating...
Unlabeled accuracy: 0.874648393964
Unlabeled attachment accuracy: 0.881284276603
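
For reference, the unlabeled attachment accuracy above is just the fraction of tokens whose predicted head matches the gold head. A minimal sketch (function and variable names are illustrative, not the project's actual evaluation code):

def uas(gold_heads, pred_heads):
    """Unlabeled attachment score: proportion of tokens with the correct head index."""
    assert len(gold_heads) == len(pred_heads)
    if not gold_heads:
        return 0.0
    correct = sum(1 for g, p in zip(gold_heads, pred_heads) if g == p)
    return correct / float(len(gold_heads))

# uas([2, 0, 2], [2, 0, 1]) -> 0.666...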
