Lexicons for the Multilingual UCREL Semantic Analysis System
The UCREL semantic analysis system (USAS) is a framework for undertaking the automatic semantic analysis of text. The framework has been designed and used across a number of research projects since 1990.
The USAS framework, initially developed for English, is being extended to other languages. This repository houses the lexicons and tagsets for the non-English versions of the USAS tagger.
For more details about the USAS tagger, see our website: http://ucrel.lancs.ac.uk/usas/. Others collaborating on multilingual lexicon development are listed on this site.
This section details the USAS lexicon resources we have in this repository per language. In addition, we have a machine readable JSON file, ./language_resources.json, that contains all of the relevant metadata to allow for easier processing of these lexicon resources.
The table summarising the lexicon data statistics can be found in the ./lexicon_statistics.md markdown file. Each row in the table reports the statistics for one lexicon file in this repository, and each language can have multiple lexicon files. A lexicon file can be either a single word or a Multi Word Expression (MWE) lexicon file; the formats of these lexicon files are explained in more detail in the Lexicon File Format section below.
The table within ./lexicon_statistics.md markdown file was automatically generated using the following command within the CI GitHub Action:
python lexicon_statistics.py > ./lexicon_statistics.md
The ./language_resources.json is a JSON file that contains metadata on what each lexicon resource file contains in this repository per language. The structure of the JSON file is the following:
{
"Language one BCP 47 code": {
"resources": [
{
"data type": "single",
"file path": "FILE PATH"
},
{
"data type": "pos",
"file path": "FILE PATH"
}
],
"language data": {
"description": "LANANGUAGE NAME",
"macrolanguage": "Macrolanguage code",
"script": "ISO 15924 script code"
}
},
"Language two BCP 47 code": : {
"resources": [
{
"data type": "mwe",
"file path": "FILE PATH"
}
],
"language data": {
"description": "LANANGUAGE NAME",
"macrolanguage": "Macrolanguage code",
"script": "ISO 15924 script code"
}
},
...
}
- The key for each language is the BCP 47 code of the language; the BCP47 language subtag lookup tool is a great tool to use to find a BCP 47 code for a language.
- `resources` - this is a list of resource files that are associated with the given language. There is no limit on the number of resource files associated with a language. Each resource contains two keys:
  - `data type` - value can be 1 of 3 values:
    - `single` - The `file path` value has to be of the single word lexicon file format as described in the Lexicon File Format section.
    - `mwe` - The `file path` value has to be of the Multi Word Expression lexicon file format as described in the Lexicon File Format section.
    - `pos` - The `file path` value has to be of the POS tagset file format as described in the POS Tagset File Format section.
  - `file path` - the value is always relative to the root of this repository, and the file is in the format specified by the associated `data type`.
- `language data` - this is data that is associated with the `BCP 47` language code. To some degree this is redundant, as we can look this data up through the `BCP 47` code; however, we thought it better to have it in the metadata for easy lookup. All of this data can be easily found by looking up the `BCP 47` language code in the BCP47 language subtag lookup tool.
  - `description` - The description of the language code.
  - `macrolanguage` - The macrolanguage tag; note, if this does not exist then give the primary language tag, which could be the same as the whole `BCP 47` code. The `macrolanguage` tag could be useful in future for grouping languages.
  - `script` - The ISO 15924 script code of the language code. The `BCP 47` code by default does not always include the script of the language, as the default script for that language is assumed; therefore this data is here to make the default more explicit.
Below is an extract of ./language_resources.json, given as an example of this JSON structure:
{
"arb": {
"resources": [
{
"data type": "single",
"file path": "./Arabic/semantic_lexicon_arabic.tsv"
}
],
"language data": {
"description": "Standard Arabic",
"macrolanguage": "ar",
"script": "Arab"
}
},
"cmn": {
"resources":[
{
"data type": "single",
"file path": "./Chinese/semantic_lexicon_chi.tsv"
},
{
"data type": "mwe",
"file path": "./Chinese/mwe-chi.tsv"
},
{
"data type": "pos",
"file path": "./Chinese/simplified-pos-tagset-chi.txt"
}
],
"language data": {
"description": "Mandarin Chinese",
"macrolanguage": "zh",
"script": "Hani"
}
},
...
}
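As an illustration, the metadata file can be consumed with Python's standard json module. The sketch below is an illustrative example (assuming it is run from the repository root), not a script from this repository:

```python
import json

# Load the repository metadata file (path relative to the repository root).
with open("./language_resources.json", encoding="utf-8") as metadata_file:
    language_resources = json.load(metadata_file)

# Iterate over each language and its associated lexicon resource files.
for bcp_47_code, language in language_resources.items():
    language_name = language["language data"]["description"]
    for resource in language["resources"]:
        # `data type` is one of: "single", "mwe", or "pos".
        print(f"{bcp_47_code} ({language_name}): "
              f"{resource['data type']} -> {resource['file path']}")
```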
All lexicon files are `tsv` formatted. There are two main types of file format: the single word and the Multi Word Expression (MWE) lexicons. These two file formats can be easily distinguished in two ways:

- The MWE files will always have the word mwe within the file name, e.g. the Welsh MWE lexicon file is called `mwe-welsh.tsv` and can be found at ./Welsh/mwe-welsh.tsv, whereas the single word lexicon files will never have the word mwe within their file names.
- The MWE files, compared to the single word files, typically contain only 2 headers: `mwe_template` and `semantic_tags`.
These lexicons on each line contain at minimum a lemma and a list of semantic tags associated with that lemma in rank order, whereby the most likely semantic tag is the first tag in the whitespace-separated list. An example of a line can be seen below:
Austin Z1 Z2
In the example above we can see that for the lemma `Austin` the most likely semantic tag is `Z1`.
A full list of valid TSV headers and their expected values:

| Header name | Required | Value | Example |
|---|---|---|---|
| `lemma` | ✔️ | The base/dictionary form of the `token`. See the Manning, Raghavan, and Schütze IR book for more details on lemmatization. | `car` |
| `semantic_tags` | ✔️ | A list of semantic/USAS tags separated by whitespace, whereby the most likely semantic tag is the first tag in the list. | `Z0 Z3` |
| `pos` | ❌ | Part Of Speech (POS) tag associated with the `lemma` and `token`. | `Noun` |
| `token` | ❌ | The full word/token form of the `lemma`. | `cars` |
Example single word lexicon file:
lemma token pos semantic_tags
Austin Austin Noun Z1 Z2
car cars Noun Z0 Z3
If you are writing a lexicon which contains POS information, but you come across a token/lemma that can be a member of any POS tagset, e.g. the POS information would not help in disambiguating it, then assign it the POS value `*`, which represents the wildcard.
Example of a single word lexicon with this implicit POS information:
lemma token pos semantic_tags
Austin Austin Noun Z1 Z2
car cars Noun Z0 Z3
computer computer * Z1
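To illustrate how these single word lexicon files can be consumed, below is a minimal sketch using Python's standard csv module; `read_single_word_lexicon` is an illustrative helper written for this example, not a script in this repository:

```python
import csv

def read_single_word_lexicon(lexicon_path):
    """Yield (lemma, pos, ranked_tags) from a single word lexicon TSV file.

    `pos` is None when the optional column is absent; a `*` POS value is kept
    as-is, since it is the wildcard that matches any POS tag.
    """
    with open(lexicon_path, encoding="utf-8", newline="") as lexicon_file:
        for row in csv.DictReader(lexicon_file, delimiter="\t"):
            # Tags are whitespace separated and in rank order,
            # so the first tag is the most likely one.
            ranked_tags = row["semantic_tags"].split()
            yield row["lemma"], row.get("pos"), ranked_tags

# Example usage with a lexicon file from this repository.
for lemma, pos, tags in read_single_word_lexicon("Welsh/semantic_lexicon_cy.tsv"):
    print(f"{lemma} ({pos}): most likely tag {tags[0]}")
```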
These lexicons on each line will contain only a value for the `mwe_template` and the `semantic_tags`. The `semantic_tags` will contain the same information as the `semantic_tags` header for the single word lexicon data. The `mwe_template`, which is best described in The UCREL Semantic Analysis System paper (see Fig 3, called multiword templates in the paper), is a simplified pattern matching code, like a regular expression, that is used to capture MWEs that have a similar structure. For example, `*_* Ocean_N*1` will capture `Pacific Ocean`, `Atlantic Ocean`, etc. The templates not only match continuous MWEs, but also discontinuous ones: MWE templates allow other words to be embedded within them. For example, the set phrase `turn on` may occur as `turn it on`, `turn the light on`, `turn the TV on`, etc. Using the template `turn*_* {N*/P*/R*} on_RP` we can identify this set phrase in various contexts.
You will have noticed that these `mwe_template` values have the pattern matching structure `{token}_{pos} {token}_{pos}`, etc., in which each token and/or POS can have a wildcard applied to it; the wildcard means zero or more additional characters.
A full list of valid TSV headers and their expected values:

| Header name | Required | Value | Example |
|---|---|---|---|
| `mwe_template` | ✔️ | See the description in the paragraphs above. | `*_* Ocean_N*1` |
| `semantic_tags` | ✔️ | A list of semantic/USAS tags separated by whitespace, whereby the most likely semantic tag is the first tag in the list. | `Z2 Z0` |
Example multi word expression lexicon file:
mwe_template semantic_tags
turn*_* {N*/P*/R*} on_RP A1 A1.6 W2
*_* Ocean_N*1 Z2 Z0
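To make the template semantics concrete, below is a minimal sketch that compiles an MWE template into a regular expression over a `token_POS` tagged string. It is an illustrative simplification written for this README (for instance, it assumes a curly brace slot matches exactly one embedded word), not the matching algorithm used by the USAS tagger:

```python
import re

def template_to_regex(mwe_template):
    """Compile an MWE template into a regex over a 'token_POS token_POS' string.

    A sketch under simplifying assumptions: `*` means zero or more characters,
    and a curly brace slot such as {N*/P*/R*} matches exactly one embedded
    word whose POS matches any of the listed alternatives.
    """
    parts = []
    for slot in mwe_template.split():
        if slot.startswith("{") and slot.endswith("}"):
            # e.g. {N*/P*/R*} -> one word tagged with any of these POS tags.
            pos_alternatives = slot[1:-1].split("/")
            pos_pattern = "|".join(
                re.escape(pos).replace(r"\*", r"\S*") for pos in pos_alternatives)
            parts.append(rf"\S+_(?:{pos_pattern})")
        else:
            token, _, pos = slot.rpartition("_")
            token_pattern = re.escape(token).replace(r"\*", r"\S*")
            pos_pattern = re.escape(pos).replace(r"\*", r"\S*")
            parts.append(rf"{token_pattern}_{pos_pattern}")
    return re.compile(r"\s+".join(parts))

pattern = template_to_regex("*_* Ocean_N*1")
print(bool(pattern.search("the_AT Pacific_NP1 Ocean_NNL1")))  # True

pattern = template_to_regex("turn*_* {N*/P*/R*} on_RP")
print(bool(pattern.search("turn_VV0 it_PPH1 on_RP")))  # True
# False: the one-word assumption of this sketch cannot bridge two embedded words.
print(bool(pattern.search("turn_VV0 the_AT light_NN1 on_RP")))
```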
This is used within the CI GitHub action:
- Converts all lexicon files (single and MWE) from text file format to TSV. The lexicon files are found through the meta data file (language_resources.json).
- Checks that the TSV files are formatted correctly:
- The minimum header names exist,
- All fields/columns have a header name,
- All lines contain the minimum information e.g. no comment lines exist in the middle of the file.
The test_all_collections.py script uses the test_collection.py script, explained in the section below, to test all of the single and multi word expression lexicon files within this repository, ensuring that they conform to the file format specified in the Lexicon File Format section. The script takes no arguments, as it uses ./language_resources.json, which is explained in the USAS Lexicon Meta Data section.
python test_all_collections.py
To test that a lexicon collection conforms to the file format specified in the Lexicon File Format section you can use the test_collection.py python script. The script takes two arguments:
- The path to the lexicon file you want to check.
- Whether the lexicon file is a single word lexicon (`single`) or a Multi Word Expression lexicon (`mwe`).
Example for a single word lexicon:
python test_collection.py Welsh/semantic_lexicon_cy.tsv single
Example for a Multi Word Expression lexicon:
python test_collection.py Welsh/mwe-welsh.tsv mwe
The script tests the following:
- The minimum header names exist.
- All fields/columns have a header name.
- All lines contain the minimum information e.g. no comment lines exist in the middle of the file.
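For illustration, the following sketch shows the kind of checks listed above; it is written for this example and is not the actual implementation of test_collection.py:

```python
import csv

# Minimum headers per lexicon type, as described in the Lexicon File Format section.
MINIMUM_HEADERS = {"single": {"lemma", "semantic_tags"},
                   "mwe": {"mwe_template", "semantic_tags"}}

def check_lexicon(lexicon_path, data_type):
    """Sketch of the checks listed above; not the real test_collection.py."""
    with open(lexicon_path, encoding="utf-8", newline="") as lexicon_file:
        reader = csv.reader(lexicon_file, delimiter="\t")
        headers = next(reader)
        missing_headers = MINIMUM_HEADERS[data_type] - set(headers)
        if missing_headers:
            raise ValueError(f"Missing required headers: {missing_headers}")
        for line_number, row in enumerate(reader, start=2):
            # Every field must fall under a header and every line must be a
            # data line (e.g. no comment lines in the middle of the file).
            if len(row) != len(headers):
                raise ValueError(f"Malformed line {line_number}: {row}")

check_lexicon("Welsh/semantic_lexicon_cy.tsv", "single")
```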
This generates the lexicon statistics table, as shown in the USAS Lexicon Statistics section. It does not take any arguments, but uses the ./language_resources.json file; this metadata file is best explained in the USAS Lexicon Meta Data section.
Example:
python lexicon_statistics.py
Given a lexicon file path, this will remove the column with the given header name and save the rest of the data from the lexicon to the new lexicon file path. The script takes three arguments:
- The path to the existing lexicon file.
- The path to save the new lexicon file to; this lexicon will be the same as argument 1, but with the column with the given header name removed.
- The header name for the column that will be removed.
Example:
python remove_column.py Malay/semantic_lexicon_ms.tsv Malay/new.tsv pos
Tests, for single word lexicon files, whether the `token` and `lemma` values per row/line are equal. It will output to stdout a JSON object for each line that contains a different `token` and `lemma` value. An example of the JSON object is shown below:
{"token": "A.E", "lemma": "A.E.", "row index": 0}
Example:
python test_token_is_equal_to_lemma.py SINGLE_WORD_LEXICON_FILE_PATH
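The comparison the script performs can be sketched as follows (an illustration, not the script's actual code); SINGLE_WORD_LEXICON_FILE_PATH is a placeholder as above:

```python
import csv
import json

# Print a JSON object for every row whose token and lemma values differ.
with open("SINGLE_WORD_LEXICON_FILE_PATH", encoding="utf-8", newline="") as lexicon_file:
    for row_index, row in enumerate(csv.DictReader(lexicon_file, delimiter="\t")):
        if row["token"] != row["lemma"]:
            print(json.dumps({"token": row["token"], "lemma": row["lemma"],
                              "row index": row_index}))
```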
Given a header name and a lexicon file path, it will output to stdout all of the unique values and how often they occur from that header's column from the given lexicon file.
Examples:
python column_unique_values.py French/semantic_lexicon_fr.tsv pos
# Output:
Unique values for the header pos
Value: prep Count: 60
Value: noun Count: 1633
Value: adv Count: 147
Value: verb Count: 448
Value: adj Count: 264
Value: det Count: 86
Value: pron Count: 56
Value: conj Count: 20
Value: null Count: 2
Value: intj Count: 8
python column_unique_values.py Welsh/mwe-welsh.tsv semantic_tags
# Output:
Unique values for the header semantic_tags
Value: G1.1 Count: 1
Value: Z1 Count: 1
Value: M3/Q1.2 Count: 3
Value: Q2.1 Count: 1
Value: I2.1/T2+ Count: 1
Value: P1/G1.1 G1.2 Count: 2
Value: A9- Count: 2
Value: Z2 Count: 1
Value: Y2 Count: 1
Value: X5.1+ Count: 1
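The counting behaviour shown above can be sketched with Python's collections.Counter (an illustration, not the script's actual code):

```python
import csv
from collections import Counter

def column_value_counts(lexicon_path, header_name):
    # Count how often each unique value occurs in the named column.
    with open(lexicon_path, encoding="utf-8", newline="") as lexicon_file:
        return Counter(row[header_name]
                       for row in csv.DictReader(lexicon_file, delimiter="\t"))

for value, count in column_value_counts("French/semantic_lexicon_fr.tsv",
                                         "pos").items():
    print(f"Value: {value} Count: {count}")
```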
Write to stdout the following:
- Number of unique values in the column with `header_name_1` from `lexicon_file_path_1`.
- Number of unique values in the column with `header_name_2` from `lexicon_file_path_2`.
- Number of unique values in common between the two files.
Example:
python compare_headers_between_lexicons.py Russian/semantic_lexicon_rus.tsv Russian/semantic_lexicon_rus_names.tsv lemma lemma
# Output
Number of unique values in lexicon file 1 17396
Number of unique values in lexicon file 2 7637
Number of unique values in common between the two files:3169
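The comparison can be sketched with Python sets (an illustration, not the script's actual code):

```python
import csv

def unique_column_values(lexicon_path, header_name):
    # Collect the set of unique values in the named column.
    with open(lexicon_path, encoding="utf-8", newline="") as lexicon_file:
        return {row[header_name]
                for row in csv.DictReader(lexicon_file, delimiter="\t")}

values_1 = unique_column_values("Russian/semantic_lexicon_rus.tsv", "lemma")
values_2 = unique_column_values("Russian/semantic_lexicon_rus_names.tsv", "lemma")
print(f"Number of unique values in lexicon file 1 {len(values_1)}")
print(f"Number of unique values in lexicon file 2 {len(values_2)}")
print(f"Number of unique values in common between the two files: "
      f"{len(values_1 & values_2)}")
```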
Given a column/header name, a mapping file, and a lexicon file path, it will map the values within that column of the lexicon file using the mapping file. The resulting lexicon file will then be saved to the given output file path.
Example:
In this example we will map the values in the `pos` column of `Finnish/semantic_lexicon_fin.tsv` using the mapper specified as a dictionary within `pos_mappers/Finnish_pos_mapper.json`, and save the new lexicon file with these mapped values (all other columns and data remain the same) to the new lexicon file named `Finnish/pos_mapped_semantic_lexicon_fin.tsv`.
python map_column_values.py Finnish/semantic_lexicon_fin.tsv pos pos_mappers/Finnish_pos_mapper.json Finnish/pos_mapped_semantic_lexicon_fin.tsv
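The mapping step can be sketched as follows (an illustration, not the script's actual code; it assumes values missing from the mapper are kept unchanged, which may differ from the script's behaviour):

```python
import csv
import json

# Load the JSON POS mapper: a dictionary from old POS values to new ones.
with open("pos_mappers/Finnish_pos_mapper.json", encoding="utf-8") as mapper_file:
    pos_mapper = json.load(mapper_file)

with open("Finnish/semantic_lexicon_fin.tsv", encoding="utf-8",
          newline="") as input_file, \
     open("Finnish/pos_mapped_semantic_lexicon_fin.tsv", "w",
          encoding="utf-8", newline="") as output_file:
    reader = csv.DictReader(input_file, delimiter="\t")
    writer = csv.DictWriter(output_file, fieldnames=reader.fieldnames,
                            delimiter="\t")
    writer.writeheader()
    for row in reader:
        # Rewrite the pos column; all other columns are written unchanged.
        row["pos"] = pos_mapper.get(row["pos"], row["pos"])
        writer.writerow(row)
```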
This script assumes that the semantic tags in the INPUT_FILE have been separated by tabs rather than spaces. The script will reverse this process and output the spaced version into a new OUTPUT_FILE. Both files are expected to be in TSV format. This is useful when data entered for USAS tags has been tab separated rather than space separated, e.g.:
lemma\tsemantic_tags
test\tZ1\tZ2
After running this script it will be converted to:
lemma\tsemantic_tags
test\tZ1 Z2
To run the script:
python tabs_to_spaces.py INPUT_FILE_NAME OUTPUT_FILE_NAME
Example:
python tabs_to_spaces.py Finnish/semantic_lexicon_fin.tsv Finnish/new_semantic_lexicon_fin.tsv
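The conversion can be sketched as follows (an illustration, not the script's actual code; it assumes each line is a lemma column followed by tab separated semantic tags, as in the example above):

```python
# INPUT_FILE_NAME and OUTPUT_FILE_NAME are placeholders, as in the command above.
with open("INPUT_FILE_NAME", encoding="utf-8") as input_file, \
     open("OUTPUT_FILE_NAME", "w", encoding="utf-8") as output_file:
    for line in input_file:
        # Keep the first tab (lemma column) and re-join the tab separated
        # tags that follow with single spaces.
        lemma, *tags = line.rstrip("\n").split("\t")
        output_file.write(lemma + "\t" + " ".join(tags) + "\n")
```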
This script converts a file whose fields/columns are separated only by spaces into a file that separates fields/columns by tabs. For MWE files the optional POS tag argument is not used.
For single word lexicon files we expect a POS field; furthermore, if you provide a JSON formatted POS tagset file, where the object keys are the valid POS tags in the tagset, then the POS field values will be checked against the given POS tagset.
Example:
This converts an MWE file that contains only spaces (`idioms_utf8.c7`) to tab separated format by outputting it into the file `mwe-en.txt`.
python spaces_to_tabs.py idioms_utf8.c7 mwe-en.txt mwe
This converts a single word lexicon file that contains only spaces (`lexicon_utf8.c7`) to tab separated format by outputting it into the file `semantic_lexicon_en.txt`, while also ensuring that all POS tags in the POS field conform to the POS tagset defined by the keys of the dictionary object within the `./pos_mappers/c7_to_upos.json` file.
python spaces_to_tabs.py lexicon_utf8.c7 semantic_lexicon_en.txt single --pos-tagset-file ./pos_mappers/c7_to_upos.json
This script finds duplicate entries within either a single word or MWE lexicon file and displays how many duplicates there are.
Single word lexicon example:
python duplicate_entires.py English/semantic_lexicon_en.txt single
MWE lexicon example:
python duplicate_entires.py English/mwe-en.txt mwe
If you want to output the stdout data into a TSV file you can provide an optional `--output-file` argument to save the data to a given TSV file. In this example we save it to the `duplicate_single_word_lexicon_english_upos.tsv` file:
python duplicate_entires.py English/semantic_lexicon_en.txt single --output-file duplicate_single_word_lexicon_english_upos.tsv
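A sketch of how such duplicates could be counted (an illustration, not the script's actual code; it assumes a single word entry is identified by its lemma/token/pos fields and an MWE entry by its mwe_template, which may differ from the script's definition of a duplicate):

```python
import csv
from collections import Counter

def duplicate_entries(lexicon_path, data_type):
    """Return the entries that occur more than once, with their counts."""
    key_fields = {"single": ("lemma", "token", "pos"),
                  "mwe": ("mwe_template",)}[data_type]
    with open(lexicon_path, encoding="utf-8", newline="") as lexicon_file:
        counts = Counter(tuple(row.get(field, "") for field in key_fields)
                         for row in csv.DictReader(lexicon_file, delimiter="\t"))
    return {entry: count for entry, count in counts.items() if count > 1}

for entry, count in duplicate_entries("English/semantic_lexicon_en.txt",
                                      "single").items():
    print(entry, count)
```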
This script finds all of the unique POS values within an MWE lexicon file, including POS values that are part of a curly brace discontinuous MWE expression. By default the unique POS values are output to stdout.
If the optional `--output-file` argument is passed, the unique POS values are also saved to the output file in TSV format. If another optional argument, `--pos-mapper-file`, is given, it will try to map the unique POS values using the given JSON POS mapper file; any values it cannot map will be left blank, and a new column called `mapped` will be added with all the POS values it could map.
Example:
python extract_unique_pos_values_from_mwe_file.py English/mwe-en.txt
# Output
II22
N*2
NNO*
MC1
IF
UH*
NN1*
CSN
DA1
BCL21
DDQV
...
This example shows how to use the optional `--output-file` argument, whereby in this case the output will be printed to stdout as before, but also saved to the output file in TSV format:
python extract_unique_pos_values_from_mwe_file.py English/mwe-en.txt --output-file unique_pos_values_mwe_en.tsv
This example shows how to use the optional `--pos-mapper-file` argument:
python extract_unique_pos_values_from_mwe_file.py English/mwe-en.txt --output-file unique_pos_values_mwe_en.tsv --pos-mapper-file pos_mappers/c7_to_upos.json
# Output
Unique POS Values:
VBZ VERB
VVD VERB
NN132
RRT ADV
NP*
RT ADV
V*
...
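The extraction step can be sketched as follows (an illustration, not the script's actual code), handling both plain `token_pos` slots and curly brace slots:

```python
import csv

def unique_pos_values(mwe_lexicon_path):
    """Collect the unique POS values used in the mwe_template column."""
    pos_values = set()
    with open(mwe_lexicon_path, encoding="utf-8", newline="") as lexicon_file:
        for row in csv.DictReader(lexicon_file, delimiter="\t"):
            for slot in row["mwe_template"].split():
                if slot.startswith("{") and slot.endswith("}"):
                    # A curly brace slot, e.g. {N*/P*/R*} -> N*, P*, R*
                    pos_values.update(slot[1:-1].split("/"))
                else:
                    # A plain slot, e.g. Ocean_N*1 -> N*1
                    pos_values.add(slot.rpartition("_")[2])
    return pos_values

print(unique_pos_values("English/mwe-en.txt"))
```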
This maps POS values within an MWE lexicon file given a POS mapper; the mapped MWE lexicon file will be saved to the given output file.
python mwe_pos_mapping.py English/mwe-en.txt pos_mappers/mwe_c7_to_upos.json English/mapped-mwe-en.txt
This has been tested with Python >= `3.7`. To install the relevant Python requirements:
pip install -r requirements.txt
In order to reference this further development of the multilingual USAS tagger, please cite our paper at NAACL-HLT 2015, which described our bootstrapping approach:
APA Format:
Piao, S. S., Bianchi, F., Dayrell, C., D'Egidio, A., & Rayson, P. (2015). Development of the multilingual semantic annotation system. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1268-1274).
BibTeX Format:
@inproceedings{piao-etal-2015-development,
title = "Development of the Multilingual Semantic Annotation System",
author = "Piao, Scott and
Bianchi, Francesca and
Dayrell, Carmen and
D{'}Egidio, Angela and
Rayson, Paul",
booktitle = "Proceedings of the 2015 Conference of the North {A}merican Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = may # "{--}" # jun,
year = "2015",
address = "Denver, Colorado",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/N15-1137",
doi = "10.3115/v1/N15-1137",
pages = "1268--1274",
}
In 2015/16, we extended this initial approach to twelve languages and evaluated the coverage of these lexicons on multilingual corpora. Please cite our LREC-2016 paper:
APA Format:
Piao, S. S., Rayson, P., Archer, D., Bianchi, F., Dayrell, C., El-Haj, M., Jiménez, R-M., Knight, D., Křen, M., Lofberg, L., Nawab, R. M. A., Shafi, J., Teh, P. L., & Mudraya, O. (2016). Lexical coverage evaluation of large-scale multilingual semantic lexicons for twelve languages. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) (pp. 2614-2619).
BibTeX Format:
@inproceedings{piao-etal-2016-lexical,
title = "Lexical Coverage Evaluation of Large-scale Multilingual Semantic Lexicons for Twelve Languages",
author = {Piao, Scott and
Rayson, Paul and
Archer, Dawn and
Bianchi, Francesca and
Dayrell, Carmen and
El-Haj, Mahmoud and
Jim{\'e}nez, Ricardo-Mar{\'\i}a and
Knight, Dawn and
K{\v{r}}en, Michal and
L{\"o}fberg, Laura and
Nawab, Rao Muhammad Adeel and
Shafi, Jawad and
Teh, Phoey Lee and
Mudraya, Olga},
booktitle = "Proceedings of the Tenth International Conference on Language Resources and Evaluation ({LREC}'16)",
month = may,
year = "2016",
address = "Portoro{\v{z}}, Slovenia",
publisher = "European Language Resources Association (ELRA)",
url = "https://aclanthology.org/L16-1416",
pages = "2614--2619",
}
If you are interested in getting involved in creating lexicons for new languages or updating the existing ones then please get in touch with: Paul Rayson ([email protected]) and/or the UCREL Research Centre ([email protected]) at Lancaster University.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.