Some tools and resources for natural language processing of Scottish Gaelic.
This repository holds tools for the Universal Dependencies treebank version of the Annotated Reference Corpus of Scottish Gaelic (ARCOSG), which is kept at https://github.com/UniversalDependencies/UD_Scottish_Gaelic-ARCOSG/
ARCOSG itself is available from http://datashare.is.ed.ac.uk/handle/10283/2011 (original version) and from https://github.com/Gaelic-Algorithmic-Research-Group/ARCOSG (latest version).
This is written up in:
- Colin Batchelor, 2019. Universal dependencies for Scottish Gaelic: syntax, in Proceedings of CLTW2019 at Machine Translation Summit XVII, Dublin, August.
brown_gd_to_conll.py performs a rudimentary conversion of ARCOSG to CoNLL-U format.
In practice I have postprocessed the results with the following Python 3 scripts:
- fix_feats.py fills out the feature set.
- fix_text.py adds "text" annotations.
- fix_whitespace.py adds SpaceAfter=No to the relevant parts of the tree.
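To illustrate how the "text" annotation and SpaceAfter=No relate in CoNLL-U, here is a minimal sketch that rebuilds a sentence's surface text from its token lines. It is not the code in fix_text.py or fix_whitespace.py; the function name `sentence_text` and the example sentence are assumptions made for illustration, and the sketch ignores multiword-token ranges.

```python
def sentence_text(conllu_lines):
    """Rebuild the surface text of one sentence from its CoNLL-U token lines."""
    pieces = []
    for line in conllu_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.rstrip("\n").split("\t")
        token_id, form, misc = cols[0], cols[1], cols[9]
        if "-" in token_id or "." in token_id:
            continue  # limitation: multiword-token ranges and empty nodes are ignored
        pieces.append(form)
        # A token is followed by a space unless MISC contains SpaceAfter=No.
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()


# Invented example sentence; only ID, FORM, UPOS and MISC are filled in.
example = [
    "1\tTha\t_\tVERB\t_\t_\t_\t_\t_\t_",
    "2\tmi\t_\tPRON\t_\t_\t_\t_\t_\t_",
    "3\tsgìth\t_\tADJ\t_\t_\t_\t_\t_\tSpaceAfter=No",
    "4\t.\t_\tPUNCT\t_\t_\t_\t_\t_\t_",
]
print("# text = " + sentence_text(example))
```

Without SpaceAfter=No on the third token, the rebuilt text would have a stray space before the full stop.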
There is one small test treebank in ud: gd_iomasgladh-ud-test.conllu is a hand-built corpus from 2014 which has been converted to UD.
The lemmatiser, code to convert ARCOSG parts of speech to UD features and categorial grammar code are now in the https://github.com/colinbatchelor/gd_tools repository.
Contains a categorial grammar generated from ARCOSG in dotccg format.
Contains an earlier, smaller, hand-built corpus in CoNLL-U format.
The corpus is annotated in CoNLL-U format, with the categorial annotations in column 6.
Each sentence has three lines beginning with hashes preceding it. These are an ID for the sentence, some versioning information, and the source.
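A minimal sketch of reading one such sentence follows. It assumes that "column 6" is counted from 1 (index 5 when counting from 0) and that the three hash lines appear in the order ID, versioning information, source. The function name `read_sentence`, the comment contents and the category strings are all invented for illustration and are not taken from the corpus.

```python
def read_sentence(block):
    """Split one CoNLL-U sentence block into comment lines and token rows."""
    comments, rows = [], []
    for line in block.splitlines():
        if line.startswith("#"):
            comments.append(line.lstrip("#").strip())
        elif line.strip():
            rows.append(line.split("\t"))
    return comments, rows


# Invented sentence: comment text and categories are placeholders only.
sample = "\n".join([
    "# example-001",
    "# version 0.1",
    "# source: invented example",
    "1\tTha\t_\t_\t_\tCAT_A\t_\t_\t_\t_",
    "2\tmi\t_\t_\t_\tCAT_B\t_\t_\t_\t_",
    "3\tsgìth\t_\t_\t_\tCAT_C\t_\t_\t_\t_",
])

comments, rows = read_sentence(sample)
sentence_id, version, source = comments  # assumed order of the three hash lines
# Column 6 counted from 1 is index 5 counted from 0.
categories = [(row[1], row[5]) for row in rows]
print(sentence_id, categories)
```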
The guidelines used for the construction of the corpus in LaTeX format. Currently no special packages are used for it.
- brown_gd_to_dot_ccg.py takes a Brown-format corpus assuming ARCOSG tags and outputs a .ccg file.
- mend_xml.py fixes the output of OpenCCG's ccg2xml.
- prepareARCOSG.py takes a local installation of the Annotated Reference Corpus of Scottish Gaelic (ARCOSG), replaces spaces within tokens with underscores and puts the results in arcosg.pkl.
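The space-to-underscore step and the pickling can be sketched as below. This is not the code in prepareARCOSG.py: the function name `join_token_spaces`, the (token, tag) representation and the placeholder tags are assumptions, and the sketch writes to a temporary directory rather than to a real arcosg.pkl.

```python
import pickle
import tempfile
from pathlib import Path


def join_token_spaces(tagged_tokens):
    """Replace spaces inside each token with underscores, keeping the tag."""
    return [(token.replace(" ", "_"), tag) for token, tag in tagged_tokens]


# Invented (token, tag) pairs; TAG1 and TAG2 are placeholders, not ARCOSG tags.
tagged = [("ann an", "TAG1"), ("taigh", "TAG2")]
cleaned = join_token_spaces(tagged)

# Round-trip through a pickle file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "arcosg.pkl"
    with path.open("wb") as handle:
        pickle.dump(cleaned, handle)
    with path.open("rb") as handle:
        restored = pickle.load(handle)

print(restored)
```

Removing the spaces means that later whitespace-based tokenisation treats each multiword token as a single unit.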
In Python 3. An in-progress grammar checker based largely on Richard Cox's Gearr-Ghràmar na Gàidhlig (2018). It does not run from the command line yet, but test_checker.py shows how the methods work.
The blog is at http://www.tantallon.org.uk/cggblog/
The citation for the files in conll is:
@InProceedings{batchelor:2014:CLTW14,
  author    = {Batchelor, Colin},
  title     = {gdbank: The beginnings of a corpus of dependency structures and type-logical grammar in Scottish Gaelic},
  booktitle = {Proceedings of the First Celtic Language Technology Workshop},
  month     = {August},
  year      = {2014},
  address   = {Dublin, Ireland},
  publisher = {Association for Computational Linguistics and Dublin City University},
  pages     = {60--65},
  url       = {http://www.aclweb.org/anthology/W14-4609}
}
The citation for the material in ccg and gramaran is:
@InProceedings{batchelor:2016:CLTW,
  author    = {Batchelor, Colin},
  title     = {Automatic derivation of categorial grammar from a part-of-speech-tagged corpus in Scottish Gaelic},
  booktitle = {Actes de la conf\'erence conjointe JEP-TALN-RECITAL 2016, volume 6 : CLTW},
  month     = {July},
  year      = {2016},
  address   = {Paris, France},
  pages     = 1,
  url       = {https://jep-taln2016.limsi.fr/actes/Actes%20JTR-2016/V06-CLTW.pdf}
}
Colin Batchelor
2024-02-07