How to document script used for the data in treebank? #1032
Comments
ISO 15924 provides codes suitable for such a metadata item. There are probably finer distinctions that could be made about the spelling rules in the treebank, but those would be difficult to capture systematically, and ISO script codes would be an improvement over no information (the current status).
I can make a change in the |
I did a quick check of the treebank comparison pages (those linked in home page) |
This is true at the moment as far as I know, but there are other languages that could use multiple writing systems, so it is definitely a property of the treebank rather than the language.
I will raise this at a future meeting of the core group. Assuming there won't be objections, these are the next steps:
|
Another candidate is UD_Egyptian, which uses Schenkel transcription rather than hieroglyphs or Gardiner codes, either of which would be conceivable for Egyptian.
If this should be automated, it can be tricky to find code for it (since "script" is ambiguous as a search term). Here are some existing solutions (last one by me, optimized for speed, not RAM): https://github.com/cisnlp/GlotScript
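For a quick audit without extra dependencies, something along these lines might be enough as a first pass. This is only a rough sketch, not how GlotScript or any UD tool works: it guesses the script from the first word of each letter's Unicode character name, which is a heuristic rather than the real Unicode Script property. The mapping table and the function names `guess_scripts` and `dominant_script` are made up for illustration and cover only a handful of scripts.

```python
import unicodedata
from collections import Counter

# Hypothetical, hand-picked mapping from the first word of a Unicode
# character name to an ISO 15924 code. Extend as needed; anything not
# listed here is counted as "unknown".
NAME_PREFIX_TO_ISO15924 = {
    "LATIN": "Latn",
    "DEVANAGARI": "Deva",
    "CYRILLIC": "Cyrl",
    "GREEK": "Grek",
    "ARABIC": "Arab",
}


def guess_scripts(text):
    """Count letters per guessed script (heuristic based on character names)."""
    counts = Counter()
    for ch in text:
        # Only letters are counted; digits, punctuation, whitespace and
        # combining marks are skipped because they are shared across scripts.
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        prefix = name.split(" ", 1)[0] if name else ""
        counts[NAME_PREFIX_TO_ISO15924.get(prefix, "unknown")] += 1
    return counts


def dominant_script(text):
    """Return the most frequent guessed script code, or None if no letters."""
    counts = guess_scripts(text)
    return counts.most_common(1)[0][0] if counts else None
```

Running `dominant_script` over the concatenated word forms of a treebank would give one candidate code per treebank, and `guess_scripts` would show whether a second script shows up in non-trivial amounts. A dedicated library is probably the better choice for anything beyond a rough check.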
This is a case I came across when using the UD Sanskrit treebanks (https://universaldependencies.org/sa/index.html).
The two treebanks use two different scripts: UFAL uses Devanagari, while Vedic uses Latin.
I suspect this may be true for some other languages as well (I have yet to do an audit). I am also assuming that script mixing within a treebank is not a case we have to worry about here.
Currently, the treebank page does not provide any explicit information about the script, although it can be inferred from the examples in the morphology overview section.
I think it would be nice to have that information surfaced more clearly on the treebank page, since the script can be an important treebank characteristic to keep in mind for certain uses.
My preliminary proposal would be to add it as part of the description, similar to Genre, with a hyperlink to a scholarly source on scripts (ScriptSource?).