Likely errors in XPOS, FEATS values and inconsistent/missing features #360

Closed
rhdunn opened this issue Oct 1, 2022 · 3 comments

Comments

@rhdunn
Contributor

rhdunn commented Oct 1, 2022

I've been experimenting with computing the UPenn XPOS tag when given the other fields using a set of simple mapping rules loosely based on https://universaldependencies.org/tagset-conversion/en-penn-uposf.html. Note: I'm aware that some things -- like existential there (EX) -- need other data, such as the DEPREL, to identify correctly.
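A minimal sketch of what such mapping rules could look like, assuming the CoNLL-U FEATS string format. The function names `parse_feats` and `guess_xpos` are hypothetical, and the rules cover only a small, unambiguous subset of tags; they are not the actual rule set used for this report.

```python
from typing import Dict, Optional


def parse_feats(feats: str) -> Dict[str, str]:
    """Parse a CoNLL-U FEATS string such as 'Number=Sing|Person=3' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def guess_xpos(upos: str, feats: str) -> Optional[str]:
    """Guess a Penn tag from UPOS and FEATS; return None when no rule applies."""
    f = parse_feats(feats)
    if upos == "NOUN":
        return {"Sing": "NN", "Plur": "NNS"}.get(f.get("Number"))
    if upos == "PROPN":
        return {"Sing": "NNP", "Plur": "NNPS"}.get(f.get("Number"))
    if upos == "VERB":
        form = f.get("VerbForm")
        if form == "Inf":
            return "VB"
        if form == "Ger":
            return "VBG"
        if form == "Part":
            return "VBN" if f.get("Tense") == "Past" else "VBG"
        if form == "Fin":
            if f.get("Mood") == "Imp":
                return "VB"
            if f.get("Tense") == "Past":
                return "VBD"
            if f.get("Tense") == "Pres":
                third_sing = f.get("Person") == "3" and f.get("Number") == "Sing"
                return "VBZ" if third_sing else "VBP"
    return None
```

Comparing the guessed tag with the XPOS column of each token, and listing the tokens where they differ, gives a candidate list for manual review. Tags that need other columns (such as EX, which needs the DEPREL) are deliberately left out of this sketch.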

This has identified some inconsistencies in the EWT treebank training set data:

Likely Errors for UPOS=VERB

  1. weblog-juancole.com_juancole_20051126063000_ENG_20051126_063000-0015 token 8 -- Number=Plur should be Number=Sing as the XPOS is VBZ, not VBP.

  2. email-enronsent01_02-0025 token 14.1 -- copy of 1, but missing the features from that word (VBZ features).

  3. answers-20111108103957AAcF3iZ_ans-0009 token 49.1 -- copy of 38, but missing the features from that word (VBZ features).

  4. answers-20111108110012AAK8Azy_ans-0032 token 10 -- rays should be a NOUN, not a VERB.

  5. reviews-292841-0005 token 15.1 -- copy of 2, but missing the features from that word (VBZ features).
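The VBZ items above (a wrong Number value, and copied tokens such as 14.1 with no features at all) could be found with a check along these lines. This is a sketch only: `check_vbz` and `EXPECTED_VBZ` are hypothetical names, and the expected feature set is an assumption about how VBZ is normally annotated, not a statement of the treebank's validation rules.

```python
# Features assumed to accompany XPOS=VBZ (an assumption for this sketch).
EXPECTED_VBZ = {"Number": "Sing", "Person": "3", "Tense": "Pres", "VerbForm": "Fin"}


def check_vbz(conllu: str):
    """Return (sent_id, token_id, feature, found_value) for each VBZ mismatch."""
    problems = []
    sent_id = None
    for line in conllu.splitlines():
        if line.startswith("# sent_id = "):
            sent_id = line[len("# sent_id = "):].strip()
            continue
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # CoNLL-U rows have 10 tab-separated columns; XPOS is column 5.
        if len(cols) != 10 or cols[4] != "VBZ":
            continue
        if cols[5] == "_":
            feats = {}
        else:
            feats = dict(kv.split("=", 1) for kv in cols[5].split("|"))
        for name, value in EXPECTED_VBZ.items():
            if feats.get(name) != value:
                problems.append((sent_id, cols[0], name, feats.get(name)))
    return problems
```

Because the token ID is kept as a string, empty nodes such as `14.1` are reported in the same way as ordinary tokens.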

Likely Errors for UPOS=NOUN

  1. weblog-blogspot.com_rigorousintuition_20050518101500_ENG_20050518_101500-0033 token 8 -- "smuggling" is labelled as NOUN+VBG; shouldn't this be NOUN+NN like other -ing nouns?

  2. weblog-blogspot.com_rigorousintuition_20060511134300_ENG_20060511_134300-0085 token 4 -- "marrying" is labelled as NOUN+VBG; shouldn't this be NOUN+NN like other -ing nouns?

  3. weblog-blogspot.com_alaindewitt_20060924104100_ENG_20060924_104100-0133 token 18 -- "chiefs" is labelled as NOUN+NN when it should be NOUN+NNS (the lemma and feats are correct)

  4. email-enronsent41_01-0100 token 7 -- "Counterparty" is labelled as NOUN+NNP where it should either be PROPN+NNP or NOUN+NN. Note that this word is mostly NOUN+NN in the treebank, but there are instances where it is tagged as PROPN+NNP.

  5. email-enronsent37_01-0090 token 15 -- "yrs." is labelled as NOUN+NN when it should be NOUN+NNS.

  6. answers-20111108084416AAoPgBv_ans-0004 token 10 -- NOUN+NN is missing the Number=Sing feature.

  7. reviews-070730-0001 token 31 -- NOUN+NN is missing the Number=Sing feature.

  8. reviews-342811-p0001 token 2 -- NOUN+NN is missing the Number=Sing feature.

  9. answers-20111108104551AAVAVQR_ans-0019 token 29 -- "fish" NOUN+NN has the feature Number=Plur. This looks like it is Number=Plur because the word before it is "two". Is that correct (in which case, should it be NOUN+NNS), or should it use Number=Sing? (It would be Plur from the additional semantic information of the word "two", but is Sing from a purely syntactic point of view as the word itself isn't using a plural form like "fishes".)

  10. reviews-190256-0006 token 11 -- "guest" here is labelled as NOUN+NN with Number=Plur. In this case, it has a CorrectForm=guests due to the context. Should this be NOUN+NNS to reflect the corrected form (and to match the corrected Number=Plur feature)?

Likely Errors for UPOS=PROPN

  1. email-enronsent17_01-0041 token 7 -- "July" is labelled as PROPN+NNP but is missing the Number=Sing feature.
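Most of the NOUN and PROPN items above come down to the XPOS tag and the Number feature disagreeing, or Number being absent. A small sketch of that comparison follows; `EXPECTED_NUMBER` and `number_mismatch` are hypothetical names, and the table assumes the usual NN/NNP = singular, NNS/NNPS = plural correspondence. Cases like "fish" or tokens with CorrectForm would still need a human decision.

```python
# Number value assumed for each nominal Penn tag (an assumption for this sketch).
EXPECTED_NUMBER = {"NN": "Sing", "NNS": "Plur", "NNP": "Sing", "NNPS": "Plur"}


def number_mismatch(xpos: str, feats: str):
    """Return (expected, found) when Number disagrees with XPOS, else None.

    `found` is None when the FEATS string has no Number feature.
    """
    expected = EXPECTED_NUMBER.get(xpos)
    if expected is None:
        return None  # not a nominal tag covered by the table
    found = None
    if feats != "_":
        for item in feats.split("|"):
            name, _, value = item.partition("=")
            if name == "Number":
                found = value
    return None if found == expected else (expected, found)
```

A result of `("Sing", None)` corresponds to the "missing Number=Sing" items, while `("Sing", "Plur")` corresponds to the NN tokens carrying Number=Plur.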

Missing Features for UPOS=PART

  1. The word "to" (PART+TO) is missing PartType=Inf. Note that the following verb is labelled as VerbForm=Inf. The https://universaldependencies.org/u/feat/PartType.html documentation lists "to" as an example for English in PartType=Inf.

  2. The lemma "'s" (PART+POS) is missing Poss=Yes.

  3. The lemma "not" (PART+RB) is missing Degree=Pos according to https://universaldependencies.org/tagset-conversion/en-penn-uposf.html, but that is applying to the UPOS=ADJ suggested for the word mapping (which lists "not" and "n't"). I'm not sure what should be done in this case.

Other Inconsistencies/Missing Annotations

  1. There are many DET/DT and DET/PDT tokens that are missing PronType annotations, despite the corresponding words (e.g. "some" and "all") listed in https://universaldependencies.org/en/feat/PronType.html as having that feature for "en".

  2. Inconsistent/unpredictable ADJ+NNP and ADJ+JJ annotations in noun phrases -- "American/ADJ+JJ forces/NOUN+NNS", "Iraqi/ADJ+JJ army/NOUN+NNS", "Iraqi/ADJ+JJ High/ADJ+NNP Electoral/ADJ+NNP Commission/PROPN+NNP", "Iraqi/ADJ+NNP National/ADJ+NNP Congress/PROPN+NNP", etc.

  3. Punctuation is missing PunctType annotations, so it is difficult to differentiate ' tokens.

  4. Several tokens (57) that are labelled with XPOS=PRP do not have any features. Should these be PronType=Prs?
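Counts such as the 57 feature-less PRP tokens can be gathered with a short pass over the CoNLL-U file. The sketch below assumes the standard 10-column CoNLL-U layout; `count_featureless` is a hypothetical helper name.

```python
from collections import Counter


def count_featureless(conllu: str) -> Counter:
    """Count tokens whose FEATS column is '_', keyed by (UPOS, XPOS)."""
    counts = Counter()
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        if "-" in cols[0]:
            continue  # multiword token ranges (e.g. 3-4) carry no tags or features
        if cols[5] == "_":
            counts[(cols[3], cols[4])] += 1
    return counts
```

Looking up `counts[("PRON", "PRP")]` in the result gives the number of PRP pronouns without features, and the same table shows which other UPOS/XPOS pairs (for example punctuation) routinely have none.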

@nschneid
Contributor

nschneid commented Oct 1, 2022

Thank you! We welcome pull requests to fix tag/feature issues. The NN vs. NNS/Number=Sing vs. Plur are probably the most straightforward.

Per PTB guidelines, NNP applies to all content words in proper names, including adjectives. UD diverges from that and uses ADJ even within proper names. So that explains the weird ADJ/NNP combinations.

PronType: see #230

PartType and PunctType are not universal features, at least according to https://universaldependencies.org/u/feat/.

rhdunn added 3 commits to rhdunn/UD_English-EWT that referenced this issue Oct 2, 2022
@nschneid
Contributor

nschneid commented Dec 4, 2022

Thanks @rhdunn for all your work! Are there any other items left here or can I close this?

@rhdunn
Contributor Author

rhdunn commented Dec 4, 2022

I'm happy for this to be closed now, thanks!

@nschneid nschneid closed this as completed Dec 4, 2022