Ontonotes data and experiments #32
Comments
Hi Anoop, I think this issue should also be assigned to me, right? Jiaqi.
@vista521 I added you to the development team on GitHub. Once you accept, I can add you as an assignee.
@anoopsarkar Oops, it seems I didn't get any notification to let me accept it. Probably something is wrong...
@vista521 You should have access now. For all the assignees, please push only to a branch for the OntoNotes experiments along with your collaborators. When the converters are written and the experiments are done, please send a pull request with any OntoNotes conversion code and the log files for the experiments.
@anoopsarkar This is what it looks like for English.conll:
[screenshot of the converted English.conll output, not rendered]
According to the website http://cemantix.org/data/ontonotes.html, the script skeleton2conll.sh should work on all three OntoNotes languages. Does it do strange things when run on Chinese and Arabic?
Try the script on this page:
I have downloaded the new script and format files and run the script on the new files. It seems the new files work for Chinese and Arabic.
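As a quick sanity check that the conversion produced output for all three languages, a small script like the one below could count the converted files per language. The folder layout and the `_conll` suffix are assumptions based on the CoNLL-2012 release naming; adjust the paths to the actual output folder.

```python
import os

# Hypothetical path to the folder holding the converted files; adjust as needed.
CONVERTED_ROOT = "conll-formatted-ontonotes"

# The CoNLL-2012 release is organized by language; the *_conll suffix is an
# assumption based on that release's naming convention.
for lang in ("english", "chinese", "arabic"):
    lang_dir = os.path.join(CONVERTED_ROOT, lang)
    count = 0
    for dirpath, _, filenames in os.walk(lang_dir):
        count += sum(1 for f in filenames if f.endswith("_conll"))
    print(f"{lang}: {count} converted files")
```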
Did you also repeat the conversion for English with the new script?
@anoopsarkar Yes, I think so. The new script comes with a new data set that includes all three languages. I just saved all the converted files in a new folder.
OK. The next step will be to create a new format file and config files for the new data. Then the experiments can be run to train on OntoNotes for each language and measure UAS on the dev data.
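For the UAS measurement on dev data, a minimal sketch along these lines could be used, assuming CoNLL-X-style gold and predicted files with the head index in column 7; the column position is an assumption and should be checked against whatever format the converter produces.

```python
def read_heads(conll_path):
    """Read head indices per sentence from a CoNLL-X-style file.

    Assumes one token per line, blank lines between sentences, and the head
    index in column 7 (CoNLL-X layout); adjust the column if the converted
    files use a different layout.
    """
    sentences, current = [], []
    with open(conll_path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                if current:
                    sentences.append(current)
                    current = []
                continue
            cols = line.split("\t")
            current.append(int(cols[6]))
    if current:
        sentences.append(current)
    return sentences


def uas(gold_path, pred_path):
    """Unlabeled attachment score: fraction of tokens with the correct head."""
    correct = total = 0
    for gold, pred in zip(read_heads(gold_path), read_heads(pred_path)):
        correct += sum(1 for g, p in zip(gold, pred) if g == p)
        total += len(gold)
    return correct / total
```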
English OntoNotes skel files are available at this location: https://github.com/ontonotes/conll-formatted-ontonotes-5.0 (just for future reference)
@anoopsarkar Hi Anoop, I read the meeting notes from last week. We should start training and testing on the dev sets. Since the data does not include the dependency trees, should we just run pos_tagger.py to train and test on the POS tags (POSTAG) only?
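If we do train the tagger alone first, the word and POS columns can be pulled out of the converted `*_conll` files roughly like this; column positions 4 and 5 follow the CoNLL-2011/2012 layout, which is an assumption to verify against the actual files.

```python
def read_tagged_sentences(conll_path):
    """Yield (word, pos) sequences from a CoNLL-2012-style *_conll file.

    Assumes whitespace-separated columns with the word in column 4 and the
    POS tag in column 5, comment lines starting with '#', and blank lines
    between sentences -- verify against the converted files.
    """
    sentence = []
    with open(conll_path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                if sentence:
                    yield sentence
                    sentence = []
                continue
            cols = line.split()
            sentence.append((cols[3], cols[4]))
    if sentence:
        yield sentence
```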
@vista521 The plan was to use Penn2Malt for English and Chinese (we have the head rules for these two languages) to convert to dependency format.
@anoopsarkar I have extracted the data to use as input to Penn2Malt, but I ran into a problem. When I run the tool with the head rules provided on its website, it reports "could not find category" when it tries to match keywords like TOP, NML, and so on; these categories are not in the head-rule file. I suspect there is a newer version of the head-rule file, since the one provided on the website dates back to 2003 while our data is from 2012. I also did not find any related files locally. Could you help with this problem?
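One way to pin down exactly which categories are missing is to compare the constituent labels in the extracted trees against the categories listed in the head-rule file. A rough sketch follows; the file names are placeholders, and treating the first whitespace-separated field of each head-rule line as its category is an assumption about the head-rule file format.

```python
import re


def tree_labels(treebank_path):
    """Collect all labels that open a bracket in the trees.

    Note this also picks up POS tags (preterminals), not just phrase labels.
    """
    labels = set()
    with open(treebank_path) as fh:
        for line in fh:
            # A label is whatever follows an opening parenthesis, up to
            # the next whitespace or parenthesis.
            labels.update(re.findall(r"\(([^\s()]+)", line))
    return labels


def headrule_categories(headrule_path):
    """Assume the category is the first whitespace-separated field per line."""
    cats = set()
    with open(headrule_path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                cats.add(line.split()[0])
    return cats


# Hypothetical file names for illustration only.
missing = tree_labels("english_trees.mrg") - headrule_categories("headrules.txt")
print("Categories with no head rule:", sorted(missing))
```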
@vista521 Was this for English or Chinese?
@anoopsarkar It's for both English and Chinese.
@anoopsarkar Hi Anoop, are there any updates or ideas on the issue above?
Have a look at the English head rules given in this presentation:
The implementation of the above presentation seems to be here: You may have to install the entire ClearNLP toolkit:
@anoopsarkar I have searched the whole project you pointed to above and finally found a txt file containing organized head rules. Now the Penn2Malt tool works. However, the head rules only cover English, so it seems Chinese and Arabic cannot be handled.
I've tested the converted data for English. The results are as follows (5 iterations, 75,000+ sentences per iteration):
[screenshot of the English results, not rendered]
Task: Convert the Ontonotes data into the CoNLL format.
The instructions for conversion are given here: http://cemantix.org/data/ontonotes.html
It also contains the script to convert to CoNLL format for all three languages: English, Chinese and Arabic.
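For reference, here is a sketch of the column layout the converted `*_conll` files are expected to have, based on the CoNLL-2011/2012 shared task description; this is an assumption worth double-checking against the actual output of skeleton2conll.sh.

```python
# Column layout of the *_conll files as described for the CoNLL-2011/2012
# shared tasks; treat this as an assumption and verify against the actual
# output of skeleton2conll.sh.
CONLL_COLUMNS = [
    "document_id",            # 1
    "part_number",            # 2
    "word_number",            # 3
    "word",                   # 4
    "pos",                    # 5
    "parse_bit",              # 6
    "predicate_lemma",        # 7
    "predicate_frameset_id",  # 8
    "word_sense",             # 9
    "speaker",                # 10
    "named_entities",         # 11
    # Columns 12..N-1: one predicate-argument column per predicate.
    # Last column: coreference chains.
]
```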