POS tagsets used in the lexicon files #4

Open
apmoore1 opened this issue Nov 9, 2021 · 11 comments

@apmoore1
Member

apmoore1 commented Nov 9, 2021

Some of the lexicons have POS tags. We want to establish whether they all share the same POS tagset or use different ones. Once this is established, we will document the POS tagset that each lexicon uses. This will also allow us to create a POS tag checker that ensures each lexicon file contains only POS tags that are valid for the tagset associated with that lexicon.
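A minimal sketch of what such a POS tag checker could look like. It assumes each lexicon line is tab-separated with the lemma in the first column and the POS tag in the second; the function name `find_invalid_pos_tags` and the example tagset are hypothetical and only for illustration.

```python
from typing import Iterable, List, Set, Tuple


def find_invalid_pos_tags(
    lines: Iterable[str], valid_tags: Set[str]
) -> List[Tuple[int, str, str]]:
    """Return (line number, lemma, POS tag) for each entry whose POS tag
    is not in valid_tags."""
    invalid = []
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # Skip blank or malformed lines that lack a POS column.
        lemma, pos = fields[0], fields[1]
        if pos not in valid_tags:
            invalid.append((line_number, lemma, pos))
    return invalid


# Assumed tagset for illustration only, not a documented lexicon tagset.
example_tagset = {"noun", "verb", "adj"}
example_lines = [
    "砰\tono\tW4 X3.2+ Q2.2 E3-",  # entry quoted later in this thread
    "lemma\tnoun\tZ99",  # hypothetical entry
]
invalid_entries = find_invalid_pos_tags(example_lines, example_tagset)
print(invalid_entries)
```

Returning the line number alongside the lemma and tag should make it easier to locate the offending entry in the lexicon file.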

@apmoore1
Member Author

In the repository, there are 4 POS tagset files:

  1. Chinese - Subset of the USAS core tagset, apart from four POS tags:
    1. loc which represents localiser
    2. msr which represents measure word (this might map to numerals, but this POS tagset does already have a POS tag for numerals)
    3. ono which represents onomatopoeia - only one lexicon entry is marked with this in the single semantic lexicon: 砰 ono W4 X3.2+ Q2.2 E3-
    4. mark which represents marker
  2. Spanish - Subset of the USAS core tagset, apart from two POS tags:
    1. port which represents Portmanteau word. This POS tag only occurs in the MWE semantic lexicon, e.g. acercarse_fw al_port sol_noun que_pron más_adv calienta_verb S7 X5.2
    2. sys which represents symbols. However, no entry in either the single or MWE lexicon uses this POS tag.
  3. Portuguese - Subset of the USAS core tagset
  4. Dutch - Subset of the USAS core tagset, apart from one POS tag, sent, which represents sentence marker. I think this tag should be removed, as it also does not feature in the Dutch semantic lexicon.
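As a rough illustration of the comparison described above, the sketch below pulls the POS tags out of an MWE template of the form `word_pos word_pos ...` (the Spanish example quoted above) and reports any tag that is not in the USAS core tagset. The lowercased core tag names and the helper name `mwe_pos_tags` are assumptions made for this example.

```python
from typing import List

# USAS core tagset as listed in this thread, lowercased here as an assumption
# about how the tags are written in the lexicon files.
USAS_CORE = frozenset({
    "noun", "verb", "adj", "adv", "num", "pnoun", "intj", "art", "part",
    "prep", "conj", "pron", "code", "punc", "fw", "abbrev", "lett", "xx",
    "det",
})


def mwe_pos_tags(template: str) -> List[str]:
    """Extract the POS tag from each `word_pos` token of an MWE template."""
    tags = []
    for token in template.split():
        # Split on the last underscore so words containing "_" still work.
        _word, separator, pos = token.rpartition("_")
        if separator:  # Ignore tokens that carry no POS tag.
            tags.append(pos)
    return tags


template = "acercarse_fw al_port sol_noun que_pron más_adv calienta_verb"
tags = mwe_pos_tags(template)
outside_core = sorted(set(tags) - USAS_CORE)
print(tags)
print(outside_core)
```

For this template the only tag outside the core set is `port`, which matches the observation made for the Spanish MWE lexicon.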

@apmoore1
Member Author

apmoore1 commented Nov 16, 2021

USAS Core POS Tagset

This tagset has come from Table 5 of the "Towards A Welsh Semantic Annotation System" paper.

  1. noun
  2. verb
  3. Adj – Adjective
  4. Adv – Adverb
  5. Num – Numerals
  6. Pnoun – Proper Noun
  7. Intj – Interjection
  8. Art – Article
  9. Part – Particle
  10. Prep – Preposition
  11. Conj – Conjunction
  12. Pron – Pronoun
  13. Code – Special code, e.g. Math symbols
  14. Punc – Punctuation
  15. Fw – Foreign Word
  16. Abbrev – Abbreviation
  17. Lett – Letter
  18. Xx – Unrecognized token
  19. DET

@perayson
Member

For Chinese, I will need to discuss this with Scott, and for Spanish with Antonio since it relates to his Grampal POS tagger I expect. Portuguese is fine if it is a subset as this is mappable. For Dutch, sent can be removed or replaced with the punctuation tag.

@apmoore1
Member Author

> For Chinese, I will need to discuss this with Scott, and for Spanish with Antonio since it relates to his Grampal POS tagger I expect. Portuguese is fine if it is a subset as this is mappable. For Dutch, sent can be removed or replaced with the punctuation tag.

I think we can safely remove sent for Dutch as it is not in any of the semantic lexicons.

@apmoore1
Member Author

USAS Core Tagset

This tagset has come from table 5 of the Towards A Welsh Semantic Annotation System paper

  1. noun
  2. verb
  3. Adj – Adjective
  4. Adv – Adverb
  5. Num – Numerals
  6. Pnoun – Proper Noun
  7. Intj – Interjection
  8. Art – Article
  9. Part – Particle
  10. Prep – Preposition
  11. Conj – Conjunction
  12. Pron – Pronoun
  13. Code – Special code, e.g. Math symbols
  14. Punc – Punctuation
  15. Fw – Foreign Word
  16. Abbrev – Abbreviation
  17. Lett – Letter
  18. Xx – Unrecognized token
  19. DET

I think the Lett POS tag can be removed from this list, as it does not occur in any of the semantic lexicons (single word or Multi Word Expression (MWE)).

@apmoore1
Member Author

Other POS tags that might be of interest, along with how often they occur:

Abbr occurs in:

  1. Spanish lexicon 2 times.
  2. Finnish lexicon 409 times.

Art occurs in:

  1. Spanish 75 times
  2. Italian 659 times
  3. Dutch 2 times.

Det occurs in:

  1. Spanish 1 time.
  2. Portuguese 289 times.
  3. French 86 times.
  4. Chinese 411 times.
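Counts like these could be produced with something along the following lines. This is only a sketch: it assumes tab-separated lexicon lines with the POS tag in the second column, and it lowercases the tags before counting, which is also an assumption. The entries are hypothetical placeholders, not real lexicon data.

```python
from collections import Counter
from typing import Iterable


def count_pos_tags(lines: Iterable[str]) -> Counter:
    """Count how often each POS tag occurs in the given lexicon lines."""
    counts: Counter = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2 and fields[1]:
            # Lowercasing is an assumption so that "Art" and "art" are merged.
            counts[fields[1].lower()] += 1
    return counts


# Hypothetical entries used only to exercise the function.
example_lines = [
    "lemma_a\tart\tZ5",
    "lemma_b\tArt\tZ5",
    "lemma_c\tdet\tZ5",
]
counts = count_pos_tags(example_lines)
print(counts.most_common())
```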

@perayson
Member

It may be in the English lexicon since ZZ1 occurs in the CLAWS C7 POS tagset. I wonder what UD POS recommends instead?

@apmoore1
Member Author

> It may be in the English lexicon since ZZ1 occurs in the CLAWS C7 POS tagset. I wonder what UD POS recommends instead?

UD POS has neither Lett nor Abbr/Abbrev. I think in UD it comes down to what the full form of the Lett or Abbr is, e.g. if it represents a person or company, I assume it would be assigned a noun tag. This is what I took away from reading the SYM POS tag notes from UD. Or perhaps they would be mapped to X? I am not an expert in POS tagging, so I will leave it to your much better judgement @perayson.

@perayson
Member

Ah, so the CLAWS C5 tagset has ZZ0 for alphabetical symbol (http://ucrel.lancs.ac.uk/claws5tags.html) and Lou Burnard maps this to SYM (https://github.com/COST-ELTeC/Scripts/blob/master/posPipe/udpMap.py). I'm trying to think of counterexamples, but I can't imagine that separating SYM from Letter would help distinguish items semantically (which is the main point of the POS column for our purposes here). We probably need a longer look at this across all the languages at some point.

@apmoore1
Member Author

Pull Request #12 contains the generated POS tagset for each language. The format of these generated POS tagsets, and where to find them, is best explained in the Create Pos Tagsets section of the PR's README.

@apmoore1
Member Author

apmoore1 commented Dec 5, 2021

Just a bit of a side note: in the PyMUSAS library I have changed the name of the tagset from UD to UPOS, to reflect that the Part Of Speech tagset used in the Universal Dependencies Treebank is the Universal Part Of Speech (UPOS) tagset. This is best seen in the POS mapping part of the PyMUSAS library:
https://ucrel.github.io/pymusas/api/pos_mapper
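
To make the idea of a POS mapping concrete, here is a small illustrative sketch of mapping UPOS tags onto USAS core tags. The dictionary below is an assumption written for this example and is not copied from PyMUSAS; the linked documentation is the place to check the mapping the library actually uses. The name `map_upos_to_core` is hypothetical.

```python
from typing import Dict, List

# Illustrative mapping only (an assumption, not the PyMUSAS mapping).
# A UPOS tag can map to more than one core tag, hence the list values.
UPOS_TO_CORE_SKETCH: Dict[str, List[str]] = {
    "NOUN": ["noun"],
    "PROPN": ["pnoun"],
    "VERB": ["verb"],
    "AUX": ["verb"],
    "ADJ": ["adj"],
    "ADV": ["adv"],
    "NUM": ["num"],
    "ADP": ["prep"],
    "CCONJ": ["conj"],
    "SCONJ": ["conj"],
    "PRON": ["pron"],
    "DET": ["det", "art"],
    "INTJ": ["intj"],
    "PART": ["part"],
    "PUNCT": ["punc"],
    "SYM": ["code"],
    "X": ["fw", "xx"],
}


def map_upos_to_core(upos_tag: str) -> List[str]:
    """Return the candidate core tags for a UPOS tag, or [] if unmapped."""
    return list(UPOS_TO_CORE_SKETCH.get(upos_tag.upper(), []))


print(map_upos_to_core("DET"))
print(map_upos_to_core("PROPN"))
```

Returning a list rather than a single tag leaves room for one-to-many cases, such as a tagger's determiner tag covering both the Det and Art entries discussed earlier in this thread.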
