Criteria for compound in English nominals #757

Closed
nschneid opened this issue Jan 18, 2021 · 23 comments

@nschneid
Contributor

nschneid commented Jan 18, 2021

Many of the issues we've been wrestling with lately for English revolve around the somewhat vague definition of compound in noun phrases and how to delineate it from relations like amod (though this may be resolved in #756), nmod, appos, and flat.

I want to use this as a meta-thread to try to document the criteria that have been proposed and examples that are difficult to resolve. Let's debate specific details elsewhere.

By current guidelines,

  1. compound requires a distinct head, as opposed to flat and appos which are by definition left-to-right. However, whether an expression has a distinct head is not always obvious.
  2. In English, the head is on the right.
  3. amod vs. compound for terms like "hot dog" #756: consensus seems to be that if the modifier is an adjective, it is amod, even if the expression falls under the broader traditional definition of compound (e.g. "hot dog", pronounced with stress on the first word).
  4. Two Nominals notes the difference between compound and nmod for English:
    • nmod usually has prepositional or possessive case marking: the dog's tail, the tail of the dog; compound does not: the dog tail
    • nmod can be pluralized: the dogs' toy; compound modifiers ordinarily cannot: *the dogs toy
    • a prepositional nmod can have its own determiner: the tail of a dog; but possessive nmod and compound modifiers cannot: *the a dog's tail; *the a dog tail
  5. appos applies for two nominals that are adjacent and reversible modulo punctuation:
    • the French president Emmanuel Macron <--> Emmanuel Macron, the French president
    • Sam, my brother, arrived <--> My brother, Sam, arrived
  6. Current guidelines say flat, though the first part is arguably a modifier of the second (le président Macron and President Trump: flat? #503, compound/flat inconsistency UD_English-EWT#59, Syntax for "you guys" amir-zeldes/gum#71):
    • President Obama, Mr. Obama (no determiner: title/appellation)
      • Finnish calls this compound:nn
    • French actor + Gaspard Ulliel (no determiner: embellishment/false title characteristic of news genre)
  7. These criteria are insufficient for a relation between two nominals where both lack prepositional/possessive marking and one is a pronoun or proper name (which would not be expected to have its own determiner). Problematic examples include:
  8. Phrasal modifiers of nouns: "must see" #753
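
As a rough illustration, criteria 3, 4, and 7 above could be sketched as a small decision procedure. This is only a sketch and not part of the UD guidelines: the `Modifier` fields and the `suggest_relation` function are hypothetical names introduced here, and the result is a suggestion rather than a verdict, since the criteria are known to be insufficient when a pronoun or proper name is involved.

```python
from dataclasses import dataclass


@dataclass
class Modifier:
    """Hypothetical feature bundle for a modifier inside an English nominal."""
    upos: str                               # UPOS tag of the modifier, e.g. "NOUN" or "ADJ"
    has_case_marking: bool = False          # preposition or possessive marking present
    has_own_determiner: bool = False        # e.g. "the tail of a dog"
    is_plural: bool = False                 # e.g. "the dogs' toy"
    involves_name_or_pronoun: bool = False  # either nominal is a proper name or pronoun


def suggest_relation(mod: Modifier) -> str:
    """Suggest a relation label using the criteria listed above (heuristic only)."""
    if mod.upos == "ADJ":
        return "amod"       # criterion 3: adjectival modifiers are amod ("hot dog")
    if mod.has_case_marking or mod.has_own_determiner:
        return "nmod"       # criterion 4: marked or determined modifiers
    if mod.involves_name_or_pronoun:
        return "undecided"  # criterion 7: the tests above do not discriminate
    if mod.is_plural:
        return "undecided"  # unmarked plural modifiers are unusual for compound
    return "compound"       # bare singular noun modifier ("the dog tail")


print(suggest_relation(Modifier("NOUN", has_case_marking=True)))  # the dog's tail
print(suggest_relation(Modifier("NOUN")))                         # the dog tail
```

The order of the checks matters: case marking is tested before the name/pronoun fallback so that a possessive proper name still comes out as nmod.
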
@nschneid
Contributor Author

nschneid commented Jan 18, 2021

Do we want non-pluralization of the modifier to be a firm criterion for compound? Consider that in some of these constructions the modifier agrees with the head in number:

  • Presidents Clinton and Obama
  • French actors Ulliel and Marceau
  • my brothers Sam and Isaac
  • the years 1776 and 1789
  • Forts Bragg and Meade
  • ?Mounts Everest and Fuji (attested, sounds a bit awkward to me)

Consider also

  • the Mississippi and Allegheny Rivers

which is a bit like saying "he nominated the defense and treasury secretaries", where we conclude semantically that there are 2 distinct entities in distributive coordination, whereas "they installed enter and exit signs" could mean 2 varieties of signs but more than 2 individual signs.

@Stormur
Contributor

Stormur commented Jan 20, 2021

As for all points under 7, they seem clear cases of flat to me.

my brother Sam is not really different from President Obama. The key here is that all elements are coreferential, whereas in phone book or dog tail they are not (i.e. there is not any kind of identity between phone and book, or dog and tail).

On the contrary, the king himself sees a det.


I do not see the third and fourth points of 4 as decisive. Something similar happens in (obsolete) German, where the genitive article of a modifier "blocks" the article of the head:

  • der Frau Gesicht 'the lady's face', not *das der Frau Gesicht, but possibly das Gesicht der Frau

It can only be an nmod either way.

As for *the dogs toy: compare Italian

  • coda di cane = dog tail
  • coda del cane = dog's tail

The presence or absence of an article is telling us something different: generic vs. specific. That's why either modifier stays invariable regardless of the number or article of coda ('tail'). This is left more "implicit" in English, but it is the same structure.

@amir-zeldes amir-zeldes mentioned this issue Jan 20, 2021
@amir-zeldes
Contributor

A couple of things that occur to me:

Do we want non-pluralization of the modifier to be a firm criterion for compound?

That would be bad for Semitic languages, in which compound modifiers can be pluralized (with corresponding difference in meaning), while multiple articles are prohibited (so only the head can take a determiner, unlike nmod/obl)

The presence or absence of an article is telling us something different: generic vs. specific

No, I don't think so. Historically the reason for the lack of plural modifiers in Germanic compounds is that the modifier is not a complete word, but just an uninflected stem, similar to the Greek modifiers in -o, Sanskrit in -a, etc. (cf. Greco-Roman, where "Greco" is not an inflectible, full adjective form). In languages that do allow compounding with plural modifiers, genericity is not necessarily implied by pluralization, and vice versa. Here are some examples from Hebrew:

  • sarei ha-memshala - the government ministers (government is singular, and may or may not be a specific government)
  • sarei ha-memshalot be-rusia ve-yapan - the governments ministers in Russia and Japan (government is plural, specific governments of Russia and Japan are mentioned)

These examples are annotated as compounds in UD Hebrew and the same is done in UD Arabic. In those languages, the main distinguishing property of compounding is the use of a single article for the entire nominal construction, which is placed between the head and modifier.

das Gesicht der Frau

These are all indeed nmod, but in German the compound form would be spelled together: "das Frauengesicht", so in German UD almost never needs to use the compound relation for nominal compounds.

@nschneid
Contributor Author

Note the title of this issue mentions English. I think it is inevitable that compound will have different morphosyntactic criteria in different languages. Semitic languages have overt marking/constraints on compounds different from those for English.

@amir-zeldes
Contributor

Oh, yes, good point! So yes, I think ordinarily English noun compounds do not allow pluralization of the modifier, but there are occasional exceptions, usually where you would say the canonical form would have had singular (esp. for irregular plurals, e.g. "mice shit" - from GUM!), for plurale tantum nouns we conventionally tag as NNS (e.g. "data") or cases that are conventionally pluralized ("special ops mission"). But mostly there is a strong tendency for the modifier to appear as 'singular' without necessarily meaning something singular.

@Stormur
Contributor

Stormur commented Jan 20, 2021

A couple of things that occur to me:

Do we want non-pluralization of the modifier to be a firm criterion for compound?

That would be bad for Semitic languages, in which compound modifiers can be pluralized (with corresponding difference in meaning), while multiple articles are prohibited (so only the head can take a determiner, unlike nmod/obl)

The presence or absence of an article is telling us something different: generic vs. specific

No, I don't think so. Historically the reason for the lack of plural modifiers in Germanic compounds is that the modifier is not a complete word, but just an uninflected stem, similar to the Greek modifiers in -o, Sanskrit in -a, etc. (cf. Greco-Roman, where "Greco" is not an inflectible, full adjective form). In languages that do allow compounding with plural modifiers, genericity is not necessarily implied by pluralization, and vice versa. Here are some examples from Hebrew:

  • sarei ha-memshala - the government ministers (government is singular, and may or may not be a specific government)
  • sarei ha-memshalot be-rusia ve-yapan - the governments ministers in Russia and Japan (government is plural, specific governments of Russia and Japan are mentioned)

I will try to explain myself better. In general, I think we could see the gradual loss of morphological markings (including some determiners like articles in a wider sense) as a way to express a shift towards a more generic reference: for example, in Greek, as you mention, we see a different behaviour between ξυλόφωνο 'xylophone', with ξυλ-ο instead of the genitive ξύλου, and Κονσταντινούπολη 'Constantinople', with the "regular" genitive of Κονσταντίνος 'Constantine'. It is clear that the former refers to generic wood or wooden sticks or similar, while the second refers to a very specific Constantine. It is a possible strategy that some languages have and it might be a general tendency, but, as it mixes morphologic with semantic considerations, this might not mean all languages use it the same way or use it at all. The Hebrew sentences might be showing us this. In the end, it is exactly the goal to take into account this variety of approaches that I see as the major point for favouring a transversal *mod representation over a quite limited compound.

For example, I know there has been (is?) discussion about whether (German) compounds such as Frauengesicht should be split or not for syntactic annotation, and I am quite sure something along the lines of splitting has been done for Sanskrit, but I don't know the exact details (note: we could actually take these -o-, -en-, etc. segments as a kind of "compounding inflection", and if I am not mistaken something similar has been done for Sanskrit). But if Frauengesicht were indeed to be split, should it not use nmod, in parallel with the more analytic and equivalent examples we agree upon? And should it not be the same for ξυλόφωνο and Κονσταντινούπολη? In some languages, there is a degree of arbitrariness in whether compounds are written together or separately.

I mean: does compound aim at representing an underlying syntactic structure or rather some kind of formal aspect? If it is the latter case, I don't see a good reason to keep it (at least for noun-noun combinations); if it is the former case, it seems to me that it is already perfectly explicable with the *mod relations. We could probably go on for days examining specific behaviours of compounds 🙂, but probably they are all just limited to single languages' idiosyncrasies (e.g. admitting double articles at the beginning of a noun phrase, admitting "internal" pluralization,...) and don't help us grasp the wider frame.


Sorry for the long posts! And please pardon me if you think I am going overboard.

@nschneid
Contributor Author

I think the rationale for compound as a label separate from nmod in general is a perfectly valid topic to question and discuss, but I would suggest doing so in another issue. :) In the near term I don't imagine it will be removed but maybe in UDv3....and in the meantime we can simply regard compound (for languages that use it, when it is headed by a noun) as a special case of nominal modification.

@Stormur
Contributor

Stormur commented Jan 20, 2021

@nschneid again you are right, sorry for hijacking it! Let's start a new issue! 🙂

@arademaker
Contributor

arademaker commented Mar 6, 2021

Just commenting here about the issue in UniversalDependencies/UD_English-EWT#133

What is the guideline for names of people and organizations that contain function words, like “Universidade Federal do Rio de Janeiro”? I feel like annotating the contraction “de+o” as flat, and not as case and det, confuses parsers! Does that make sense?

But I can understand the annotation of “Roberto da Silva Júnior” as flat(Silva, Roberto) with det(a,Silva) and case(de,Silva)

@nschneid
Contributor Author

nschneid commented Mar 6, 2021

@arademaker in English or in Portuguese? Per https://universaldependencies.org/u/dep/flat.html, foreign names should be flat even if they have internal syntax in the foreign language.

If the text is in Portuguese, I take it “Universidade Federal do Rio de Janeiro” uses regular syntax for the name, so it wouldn't need to be flat at all.

@arademaker
Contributor

Portuguese. But what about names like the one above “Roberto da Silva Júnior”? For the organization name, I can see a complete syntax analysis without flat. But for names like this, I need flat to connect Silva to Roberto. But I don’t like “de“ and “a” as flat ...

@nschneid
Contributor Author

nschneid commented Mar 6, 2021

So I think flat(Roberto, Júnior), nmod(Roberto, Silva), case(Silva, da) would technically work, essentially saying the flat expression is "Roberto Júnior" and it has a modifier "da Silva".

If a non-initial noun in the name had a PP modifier, that would be a problem. Does it arise in Portuguese?

@arademaker
Contributor

arademaker commented Mar 6, 2021

Oh, I just noticed my mistake in the syntax of the examples above: it should be rel(head, dependent), not rel(dependent, head).

Thank you, so if I get it right, because “da” was introducing “da Silva” you make it nmod of “Roberto”. For this particular case, this avoids the problem in issue UniversalDependencies/UD_English-EWT#133, since Roberto is the head of the flat structure. But I can also have names where “da Silva” would modify something other than the first name: “Roberto Paulo da Silva Júnior”

Sorry, this example precisely answers your last question! Yes, we have these cases in Portuguese.

@nschneid
Contributor Author

nschneid commented Mar 6, 2021

So it should be Roberto [Paulo da Silva] Júnior? Yeah if the PP is analyzed I think this would break the idealized notion of a flat structure under the current guidelines. Maybe the rule should be relaxed to say that flat dependents usually do not have any dependents of their own, but this would be an exception mixing linear order flat (Roberto + Paulo + Júnior) with an internal PP modifier.
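
To make the constraint concrete, here is a minimal sketch that flags flat dependents which themselves head another relation. It assumes edges are written as (rel, head, dependent) tuples and that word forms are unique within the name so they can serve as node identifiers; the function name `flat_dependents_with_children` is hypothetical, and the second edge list is just one possible encoding of the bracketing Roberto [Paulo da Silva] Júnior.

```python
def flat_dependents_with_children(edges):
    """Return flat dependents that head some other relation, in input order."""
    # Treat subtypes such as flat:name or flat:foreign as flat.
    flat_deps = [dep for rel, head, dep in edges if rel.split(":")[0] == "flat"]
    heads = {head for rel, head, dep in edges}
    return [dep for dep in flat_deps if dep in heads]


# "Roberto da Silva Júnior": the PP attaches to the head of the flat structure.
simple = [
    ("flat", "Roberto", "Júnior"),
    ("nmod", "Roberto", "Silva"),
    ("case", "Silva", "da"),
]

# "Roberto Paulo da Silva Júnior": assumed encoding where the PP modifies Paulo.
nested = [
    ("flat", "Roberto", "Paulo"),
    ("flat", "Roberto", "Júnior"),
    ("nmod", "Paulo", "Silva"),
    ("case", "Silva", "da"),
]

print(flat_dependents_with_children(simple))  # nothing flagged
print(flat_dependents_with_children(nested))  # Paulo is flagged
```

Under a strict reading of the current rule, only the second analysis would be rejected; a relaxed rule would have to whitelist cases like this one.
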

@sylvainkahane
Contributor

What are the phrases in “Roberto da Silva Júnior”? Clearly "da Silva" is a phrase, so case(Silva, da). But "da Silva" is a family name that works exactly as Rodriguez in "Roberto Rodriguez". If we have flat(Roberto, Rodriguez), we must have flat(Roberto, Silva).
For Júnior, I'm not sure how it works. I suppose that it combines with the family name only (OK da Silva Júnior, but *Roberto Júnior). It seems to be a modifier of the family name. My guess would be nmod(Silva, Júnior).

@nschneid
Contributor Author

nschneid commented Mar 6, 2021

This is a fast-moving discussion. :) Apparently there was a consensus that this restriction on flat is too strict; it should allow internal modification. So we can use flat to link together the parts of the name and case for the preposition.

@nschneid
Contributor Author

nschneid commented Mar 6, 2021

As for Júnior, I would prefer to treat it as some sort of nmod because it seems like a modifier, but right now the guidelines have English "Jr." as flat.

@nschneid
Contributor Author

nschneid commented Mar 6, 2021

Please see #543 for proposed new guidelines on flat.

@sylvainkahane
Contributor

"Ludwig van Beethoven" should be analyzed as "Roberto da Silva". Whatever the language where it appears, "van Beethoven" is analyzed as a family name, so we must have a flat relation between Ludwig and "van Beethoven".
The internal analysis of "van Beethoven" can differ according to the language: In a Germanic language it will be clearly understood as case(Beethoven, van). But in some other languages, it can be analyzed as flat:foreign(van, Beethoven).
In this case we will have a flat dependent with a flat dependent:

flat:name(Ludwig, van)
flat:foreign(van, Beethoven)

This is justified because the internal structure is (Ludwig)(van Beethoven) and both relations are flat.
In conclusion I think that both analyses of "Ludwig van Beethoven" on https://universaldependencies.org/u/dep/flat.html should be changed.

@nschneid
Contributor Author

nschneid commented Mar 6, 2021

@sylvainkahane Linguistically speaking I see the logic there—each subunit in the name would be separate—but since the :name and :foreign subtypes are optional I don't know whether it is practical to have a rule that involves layering of different kinds of flat relations.

@nschneid
Contributor Author

nschneid commented Mar 6, 2021

Even setting aside the subtypes, we have to balance the extra expressivity that would afford a very precise analysis (such as making "van Beethoven" a subunit) vs. a simple and enforceable rule that will prevent errors like a chained flat structure when the bouquet structure is correct.
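
A simple and enforceable rule of this kind could look roughly like the sketch below, which distinguishes a bouquet (no flat dependent is itself the head of a flat relation) from a chained or layered structure. The (rel, head, dependent) tuple format, the function name `flat_shape`, and the use of word forms as node identifiers are assumptions made for illustration only; this is not an existing validator.

```python
def flat_shape(edges):
    """Classify the flat relations in an edge list as none, bouquet, or chained."""
    # Subtypes such as flat:name and flat:foreign count as flat here.
    flat_edges = [(head, dep) for rel, head, dep in edges
                  if rel.split(":")[0] == "flat"]
    if not flat_edges:
        return "none"
    flat_heads = {head for head, _ in flat_edges}
    flat_deps = {dep for _, dep in flat_edges}
    # A word that is both a flat dependent and a flat head means layering.
    if flat_heads & flat_deps:
        return "chained"
    return "bouquet"


# Every part of the name attaches to the first word.
bouquet = [("flat", "Ludwig", "van"), ("flat", "Ludwig", "Beethoven")]

# The layered analysis with "van Beethoven" as a subunit.
layered = [("flat:name", "Ludwig", "van"), ("flat:foreign", "van", "Beethoven")]

print(flat_shape(bouquet))
print(flat_shape(layered))
```

A check like this cannot tell an intended layered analysis from an accidental chain, which is exactly the trade-off between expressivity and enforceability.
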

@arademaker
Contributor

arademaker commented Mar 6, 2021

The only problem that I can anticipate is that different annotators would have different intuitions about the internal structures of the names: “Roberto [Paulo da Silva] Júnior” vs “[Roberto Paulo] da Silva] Júnior” vs ...

and the decision is outside any syntactic theory...

@sylvainkahane
Contributor

Of course “Roberto Paulo da Silva Júnior” is syntactically ambiguous out of context, as are many sentences ("I saw the man with a telescope" and so on). But the annotator must decide which analysis is the most probable given the context.
The fact that it is syntactically ambiguous out of context does not mean that we must adopt an underspecified syntactic analysis.
Anyway, "Ludwig van Beethoven" or "Roberto da Silva" is not ambiguous, and any educated annotator knows which is the first name and which is the family name.
