ParlaMint scripts

This directory contains scripts used to validate ParlaMint corpora or convert them to other formats. Most scripts explain how to run them in comments at the start of the script. Usage examples are also given in the repository Makefile.

Validation

Conversion

  • parlamint-tei2text.xsl: transforms a ParlaMint corpus component file to plain text
  • parlamint2conllu.pl: runs the parlamint2conllu XSLT script and then runs the UD validator on the resulting files. Note that this directory is assumed to contain the UD validator (gitignored), which is installed with git clone git@github.com:UniversalDependencies/tools.git
  • parlamint2conllu.xsl: converts the linguistically annotated TEI corpus component to CoNLL-U format. It expects the TEI root corpus file as the value of the $meta parameter.
  • parlamint2xmlvert.xsl: converts the linguistically annotated TEI corpus component to vertical format for the CQP line of concordancers. It expects the TEI root corpus file as the value of the hdr parameter. Note that the produced file is still XML; to convert it to "proper" vertical format, use parlamint-xml2vert.pl.
  • corpus2sample.xsl: takes a root corpus file as input and writes a sample to the output directory given by the $outDir parameter. The script retains the first and last component files of the corpus and, within each of them, the first and last $Range utterances.
  • classlisize.py: takes a 'plain text' ParlaMint TEI component file as input, processes it with the classla-stanfordnlp pipeline, and outputs the linguistically annotated TEI file.
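
The selection rule described for corpus2sample.xsl (keep the first and last component files, then the first and last $Range utterances in each) can be illustrated with a small Python sketch. This is not the XSLT implementation: the function names, the dictionary-based corpus representation, and the handling of components that have fewer than 2 × $Range utterances (here: keep them all, without duplicates) are assumptions made for illustration only.

```python
def first_and_last(items, n):
    """Return the first n and last n items in order.

    Assumption: if the sequence has at most 2*n items, all of it is kept
    once, so that no item is duplicated.
    """
    if n <= 0:
        return []
    if len(items) <= 2 * n:
        return list(items)
    return list(items[:n]) + list(items[-n:])


def sample_corpus(components, utterance_range):
    """Sketch of the sampling rule.

    components: dict mapping a (hypothetical) component file name to its
    list of utterance ids, in corpus order.
    utterance_range: plays the role of the $Range parameter.
    """
    names = list(components)
    kept_names = first_and_last(names, 1)  # first and last component file
    return {
        name: first_and_last(components[name], utterance_range)
        for name in kept_names
    }


if __name__ == "__main__":
    # Hypothetical corpus with three component files.
    corpus = {
        "a.xml": ["u1", "u2", "u3", "u4", "u5"],
        "b.xml": ["u6"],
        "c.xml": ["u7", "u8", "u9"],
    }
    print(sample_corpus(corpus, 2))
```

In this sketch the middle component is dropped entirely, and a component shorter than twice the range is kept whole. The actual script may treat such edge cases differently.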