POS tagsets used in the lexicon files #4
In the repository, there are 4 POS tagset files:
|
USAS Core POS Tagset: this tagset comes from Table 5 of the Towards A Welsh Semantic Annotation System paper
|
For Chinese, I will need to discuss this with Scott, and for Spanish with Antonio, since I expect it relates to his Grampal POS tagger. Portuguese is fine if it is a subset, as this is mappable. For Dutch, sent can be removed or replaced with the punctuation tag. |
I think we can safely remove |
I think the |
Other POS tags that might be of interest with regards to how often they occur: Abbr occurs in:
Art occurs in:
Det occurs in:
|
It may be in the English lexicon since ZZ1 occurs in the CLAWS C7 POS tagset. I wonder what UD POS recommends instead? |
UD POS has neither Lett nor Abbr/Abbrev. I think in UD it comes down to what the full form of the Lett or Abbr is, e.g. whether it represents a person or a company, in which case I assume it would be assigned a noun; this is what I took away from reading the UD notes on the SYM POS tag. Or perhaps they would be mapped to X? I am not an expert in POS tagging, so I will leave it to your much better judgement @perayson |
Ah, so CLAWS C5 tagset has ZZ0 for alphabetical symbol (http://ucrel.lancs.ac.uk/claws5tags.html) and Lou Burnard maps this to SYM (https://github.com/COST-ELTeC/Scripts/blob/master/posPipe/udpMap.py). I'm trying to think of counter examples, but I can't imagine separating SYM from Letter would help distinguish items semantically (which is the main point of the POS column for our purposes here). Probably need a longer look at this over all the languages at some point. |
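A minimal sketch of what such a mapping could look like, purely for discussion. Only the ZZ0 to SYM entry comes from the mapping mentioned above; the names `C5_TO_UPOS` and `map_to_upos`, and the choice of X as the fallback for unmapped tags, are assumptions for illustration and not a decision made in this thread.

```python
# Hypothetical mapping table: only ZZ0 -> SYM is taken from the discussion above.
C5_TO_UPOS = {"ZZ0": "SYM"}


def map_to_upos(tag: str, mapping: dict, fallback: str = "X") -> str:
    """Return the mapped tag, or `fallback` when the tag has no mapping.

    Falling back to X is an assumption, not an agreed convention.
    """
    return mapping.get(tag, fallback)


print(map_to_upos("ZZ0", C5_TO_UPOS))   # SYM
print(map_to_upos("Lett", C5_TO_UPOS))  # X
```

Keeping the mapping as plain data would make it easy to revisit once all the languages have been looked at.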
Pull Request #12 contains the generated POS tagset for each language. The format of these generated POS tagsets, and where to find them, is best explained in the Create Pos Tagsets section of the PR's README |
Just a bit of a side note, in the PyMUSAS library I have changed the name of the tagset from UD to UPOS to reflect that the Part Of Speech tagset used in the Universal Dependencies Treebank is the Universal Part Of Speech (UPOS) tagset. This can be seen best in the pos mapping part of the PyMUSAS library: |
Some of the lexicons have POS tags, and we want to establish whether they all share the same POS tagset or use different ones. Once this is established, we shall document the POS tagset that each lexicon uses. This will also allow us to create a POS tag checker to ensure that each lexicon file contains only POS tags that are valid for the tagset associated with that lexicon.
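As a rough sketch of what such a POS tag checker might look like: the function below counts every POS tag in a lexicon that is not part of the expected tagset. It assumes the lexicon is tab-separated with a header row containing a `pos` column; the column names, the example tagset and the example lexicon content are all hypothetical and only there to make the snippet runnable.

```python
import csv
import io
from collections import Counter
from typing import Iterable


def find_invalid_pos_tags(rows: Iterable[dict], valid_tags: set,
                          pos_field: str = "pos") -> Counter:
    """Count the POS tags in `rows` that are not in `valid_tags`.

    Rows with an empty or missing POS value are skipped.
    """
    invalid = Counter()
    for row in rows:
        tag = (row.get(pos_field) or "").strip()
        if tag and tag not in valid_tags:
            invalid[tag] += 1
    return invalid


# Hypothetical tagset and lexicon content, for illustration only.
EXAMPLE_TAGSET = {"noun", "verb", "adj", "punc"}
EXAMPLE_LEXICON = (
    "lemma\tpos\tsemantic_tags\n"
    "cat\tnoun\tL2\n"
    "run\tverb\tM1\n"
    ".\tsent\tPUNCT\n"
)

# QUOTE_NONE so that quote characters inside lemmas are read literally.
reader = csv.DictReader(io.StringIO(EXAMPLE_LEXICON), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
invalid_tags = find_invalid_pos_tags(reader, EXAMPLE_TAGSET)
print(dict(invalid_tags))  # {'sent': 1}
```

Returning counts rather than failing on the first invalid tag also gives the "how often does this tag occur" view used earlier in the thread, which could help when deciding whether a tag should be removed or mapped.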