Lemmas of English personal pronouns #517
Comments
For morphologically rich/normal languages, the lemma also serves as a point of disambiguation, together with its POS sibling. Since spelling normalization is being discussed, it might serve our purpose to provide a spelling[norm]=xxx in misc to cover the misspellings.
On 21 Dec 2017, at 2.42, Nathan Schneider ***@***.***> wrote:
It is not obvious how pronouns should be lemmatized (cf. #276 for Slavic). The UD_English corpus does the following:
Nominative (PRP):
I -> I
you -> you
he -> he
she -> she
it -> it
we -> we
they -> they
Accusative (PRP):
me -> I
you -> you
him -> he
her -> she
it -> it
us -> we
them -> they
Dependent possessive (PRP$):
my -> my (!)
your -> you
his -> he
her -> she
its -> its (!)
our -> we
your -> you
their -> they
The pattern here is that they are normalized to nominative case, except for "my" and "its", which should probably be "I" and "it", respectively.
Independent possessive (PRP, no morphological features): mine, yours, ours, theirs, etc.: no normalization
Reflexive (PRP): myself, yourself, ourselves, yourselves, themselves, etc.: no normalization
WH animate: who, whom, whoever, whomever: no normalization
I am not sure why whom, whomever, the independent possessives, and the reflexives aren't normalized to nominative as well.
There is one token where ’s in Let’s has been lemmatized as us (it should presumably be we for consistency).
That said, the simplest policy may be to use the lemma field only for spelling normalization (#513) and not perform case normalization at all. If the end user wants to map pronouns to nominative case, that is not hard to implement as postprocessing once spelling is consistent.
Thoughts?
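The postprocessing mentioned above could look roughly like the following minimal sketch. The names `NOMINATIVE` and `to_nominative` are hypothetical and not part of any UD tool, and the table covers only the accusative and dependent possessive forms listed in this post; reflexives, independent possessives, and WH forms are left untouched.

```python
# Hypothetical lookup table: spelling-normalized form -> nominative form.
# Only the accusative and dependent possessive forms from the lists above.
NOMINATIVE = {
    "me": "I", "my": "I",
    "him": "he", "his": "he",
    "her": "she",
    "us": "we", "our": "we",
    "them": "they", "their": "they",
    "your": "you",
    "its": "it",
}


def to_nominative(form: str) -> str:
    """Map a pronoun form to its nominative counterpart.

    Forms that are not in the table (nominatives, reflexives,
    independent possessives, WH pronouns) are returned unchanged.
    """
    return NOMINATIVE.get(form.lower(), form)


if __name__ == "__main__":
    for word in ["me", "Their", "its", "mine", "I"]:
        print(word, "->", to_nominative(word))
```

Because the lookup is keyed on the lowercased form, it only behaves predictably once spelling is consistent, which is the precondition stated above.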
|
Case normalization in lemmas is expected in languages where Case plays a more important role than in English and I would expect it in English as well. |
I guess I am not sure what the guiding principles are/should be for pronoun
normalization. It is clear that English nouns should be normalized by
number and verbs by number, person, and tense. So why are the pronouns
normalized by case but not person or number? If the goal is to remove all
inflectional information, shouldn't all personal pronouns map to the same
lemma?
Or is the goal to collapse dimensions of a paradigm which tend to have
common stems? By the common stem criterion it would make sense to give
possessives and accusatives the same lemma, and perhaps "he"/"him"/"his",
but it does not feel intuitive to give "I", "we", "me", and "our" the same
lemma.
From a more semantic/practical perspective, I could see an argument that
number and person are relevant to reference resolution whereas case is
primarily grammatical and is encoded in the syntactic relations.
Finally, one could argue that it's best to avoid worrying about all of
these competing criteria for closed-class POS categories and just keep the
(spelling-normalized) word as the lemma, because the benefits of
lemmatization in dealing with the long tail are not relevant as they are
for open classes. English doesn't have that many distinct pronouns to begin
with, and their commonalities are exposed in morphological features, so
what does lemmatization buy us?
…On Dec 23, 2017 9:32 PM, "Dan Zeman" ***@***.***> wrote:
Case normalization in lemmas is expected in languages where Case plays a
more important role than in English and I would expect it in English as
well.
|
I think different language-specific guidelines differ on this, and it would be good to stay consistent with other corpora in the respective languages, since what 'lemma' means in each language is rather different. We already have a split between UPOS and language-specific tags, I wouldn't want to see 'native vs. UD lemmas' as well if possible... For GUM, we've simply used the behavior of the TreeTagger: PRP gets the nominative form (him -> he), PRP$ get their own form (my -> my, its -> its). The independent forms (mine etc.) technically have their own nominative form (mine is...) so they are lemmatized to themselves (mine -> mine). Basically this corresponds to only lemmatizing across case, and treating the possessive determiners as not a case form of the personal pronoun (which most of them are not, historically). I don't necessarily think this is ideal, but I think it doesn't matter much for personal pronouns, and inventing new standards for this sounds like it would ultimately create more work and complications than benefits... |
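As a rough illustration of the policy described for GUM (lemmatize across case for PRP, leave PRP$ and the independent forms as they are), here is a minimal sketch. It is not TreeTagger's implementation; the names `PRP_NOMINATIVE` and `gum_style_lemma` are hypothetical, and the table is limited to the accusative forms mentioned in this thread.

```python
# Hypothetical table of accusative PRP forms and their nominative lemmas.
PRP_NOMINATIVE = {
    "me": "I",
    "him": "he",
    "her": "she",
    "us": "we",
    "them": "they",
}


def gum_style_lemma(form: str, xpos: str) -> str:
    """Return a lemma for a pronoun under the policy sketched above.

    PRP tokens are normalized across case only; every other token,
    including PRP$ and independent possessives, keeps its own
    (lowercased) form as the lemma.
    """
    key = form.lower()
    if xpos == "PRP" and key in PRP_NOMINATIVE:
        return PRP_NOMINATIVE[key]
    # Keep the conventional capitalization of the first person singular.
    return "I" if key == "i" else key


if __name__ == "__main__":
    print(gum_style_lemma("her", "PRP"))   # accusative pronoun
    print(gum_style_lemma("her", "PRP$"))  # possessive determiner
```

The pair of calls on "her" shows why the XPOS tag is needed: the same form gets a different lemma depending on whether it is tagged PRP or PRP$.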
For future reference, I'm finding many inconsistencies between columns in UD_English that point to tagging, morphology, or parse errors involving pronouns. Some commands:
|
@sebschu do you have an opinion on pronoun lemmatization? |
Interesting discussion of the lemmatisation of pronominals. However, it seems that the programming experts giving their opinions are ignoring the issues involved in automatically analysing particular parts of speech. The analysis of pronouns, for instance, demands a wider understanding of the functions underlying pronouns across the sentences and paragraphs of a text. The deictic element, for example, is mostly absent from the programming of pronoun detection and analysis, as in automatically determining the average of pronoun lemmas, which is of course not a bad idea. A big "however" here is that pronominals (a type of cohesive reference) signal backward and forward references (e.g., anaphoric, cataphoric). Nevertheless, it seems as if NLP tools have deliberately been minimising this important aspect in the analysis of pronouns. Ignoring functional linguistic elements keeps new NLP programmers encountering and replicating the same big mistakes in the analysis of lemmatised pronouns. |
@WaukyJose this is the documentation for Universal Dependencies, a project creating resources with syntactic, rather than semantic analyses. However some datasets do actually contain annotations from other projects, including explicit analysis of anaphora, cataphora, and other forms of coreference. If you're looking for English data covering both UD syntax and coreference, you may want to look at this one: https://github.com/UniversalDependencies/UD_English-GUM You can find coreference indices and entity types in the last column, inside the annotation Entity (e.g. |
This issue has reared its head again in UniversalDependencies/UD_English-EWT#293, with some arguing that a standard for pronoun lemmas across Germanic languages should be attempted. After making corrections for consistency, here is the full set of pronouns in EWT. For the lemma, the ones in italics are normalized to the first item in the row:
Personal pronouns
(Items in parentheses are unattested in EWT.)
☞ Clearly my and its are outliers, as noted at the top of the issue. The least disruptive change would be to replace my => I and its => it. But we should at least make sure that EWT and GUM agree; GUM does not presently lemmatize possessives.
☞ The features do not currently distinguish dependent and independent genitives/possessives. Would it make sense to use
Other pronouns
☞ If personal pronouns are normalized for case, it would make sense to normalize whom => who and whomever => whoever.
☞ If dependent possessive personal pronouns are normalized, it would make sense to replace whose, although technically it is shared between who and what, so semantics would be required to resolve the correct lemma.
☞ No one is currently analyzed as det(one/NOUN, no/DET). Perhaps one should be PRON.
For the remaining groups only plural demonstratives these and those are normalized, which makes sense. N.B. when, wherever, somewhere, etc. are tagged as ADV, not PRON. |
Thanks for writing this up so clearly! For convenience I will repeat what I said in the EWT issue - basically I think case forms like "them" should be lemmatized to the nominative "they", but possessive determiners form a separate paradigm because:
I would like to see this behave as similarly as possible across Germanic languages, though of course not at all costs :) |
Somehow it seems we missed "none" (and, as noted in the PTB tag guidelines, "naught"). Will add these to the PRON table with |
@dan-zeman points out that
@amir-zeldes thoughts on the above list? https://en.wikipedia.org/wiki/Pro-form is useful, though I'm not sure we want to start dealing with "however", "therefore", and so on. |
I think that mostly makes sense; for "either" and "neither", I'm not sure it should apply in determiner/preconj usage, since they're not exactly pronouns there. But this is all mainly useful if other languages implement this as well. For 'therefore' and 'however' in the discourse use I think they are probably no longer perceived as pronominal, even if they are etymologically. |
I would expect PronType to accord with the UPOS. At present preconj "either" and "neither" are tagged CCONJ, so let's not give them a PronType. DETs do receive PronTypes though, as documented previously: https://universaldependencies.org/en/pos/DET.html TBC, I listed "(n)either" above for the ADV uses ("I don't want a sandwich, either"). (I keep having to remind myself that "PronType" is a misnomer, it actually covers all pro-forms.) |
Yeah, I think ProType would have been better! In any case, let me know what you want to do and I'll match it for GU corpora, this all sounds fine to me. |
OK how about these guidelines: https://universaldependencies.org/en/pos/ADV.html |
…cies/docs#517); involves some changes from interrogative to free relative structure (#278)
…d always be adverbs (#132); also apply PronType=Ind to the retagged ones (UniversalDependencies/docs#517)
Implemented in EWT! (modulo some existing |
So we should update (among other changes) |
anything to be done for
|
These are both mainly discourse connectives, so I'm not sure they need a PronType.
there_PRON: for expletive "there" I'm not sure if any of the PronType values would be a good fit. This is documented at https://universaldependencies.org/en/pos/PRON.html#expletive-there
any_ADV: "any" is normally DET. I see "any/ADV longer/ADV" and similar; not sure this is actually correct. Also "it doesn't hurt any/ADV" (= at all). Could these be DET attaching as |
Agreed that the discourse versions are fine w/o. They are not always discourse, though, especially
(those are the only ones I saw for |
Technically you're right, the "however optimistic" ones should be
"however/whenever possible": as "however" is the first item in coordination I suppose it should be the head of the free relative |
(insert satisfied seal meme here) |
Aha, apparently "however" receives a different xpos: RB for the discourse connective use and WRB for the interrogative or relative use! (This is documented in the PTB tagging guidelines.) So we can require PronType conditional on that. |
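A check along these lines might be sketched as follows. The function names are hypothetical and this is not the actual UD validator; it only tests for the presence of a PronType feature, since the thread does not settle which value applies in each case. Flagging RB tokens that do carry a PronType is an assumption based on the earlier remark that the discourse uses are fine without one.

```python
from typing import Optional


def parse_feats(feats: str) -> dict:
    """Parse a CoNLL-U FEATS string such as 'PronType=Int' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_however(xpos: str, feats: str) -> Optional[str]:
    """Return a warning for a 'however' token, or None if it looks fine.

    Assumed rule: WRB (interrogative or relative use) requires a
    PronType feature; RB (discourse connective) should not have one.
    """
    has_prontype = "PronType" in parse_feats(feats)
    if xpos == "WRB" and not has_prontype:
        return "WRB 'however' is missing PronType"
    if xpos == "RB" and has_prontype:
        return "RB 'however' should not carry PronType"
    return None


if __name__ == "__main__":
    print(check_however("WRB", "_"))
    print(check_however("RB", "_"))
```

Keying the rule on XPOS rather than on the dependency relation keeps the check local to a single token line.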
…iated spellings (UniversalDependencies/docs#517 - also fix neaten.py cause of false negative in #532); some typos (including "develope", #526)