UD's fundations: functionalism vs distributionalism #1063
Comments
I can give another example. English has (at least) four syntactic constructions where a clause modifies a noun:
It would be justified, from the distributionalist point of view, to distinguish these three constructions. The English treebanks only distinguish the relative clause, with the Anyway, considering only the relative clause (<- a fifth construction: non-finiteness and no omission) could be justified for English, but universal guidelines should consider all the possibilities and propose a more complete terminology. It would avoid many of the inconsistencies we find today in the annotation of adjectival clauses ( |
Thanks for a nice synopsis of these two ways of thinking about dependency relations.
My understanding is that the main/universal relation is meant to follow the functionalist approach, whereas subtypes (if present) are more language-specific and often follow a distributionalist approach. So the main relation and the subtype represent two different levels. On some of the specifics:
|
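To make the two-level reading concrete, here is a minimal sketch that splits a relation label into its universal part and its optional subtype, so that labels can be compared at either level. The helper name `split_deprel` is made up for illustration and is not part of any UD tool; the labels are ones mentioned in this thread.

```python
from collections import defaultdict

def split_deprel(deprel):
    """Split a UD relation label into (universal relation, subtype or None)."""
    universal, _, subtype = deprel.partition(":")
    return universal, (subtype or None)

# Labels mentioned in this thread.
labels = ["nmod", "nmod:poss", "compound", "acl:relcl", "nsubj:pass"]

# Group the subtypes under their universal relation.
by_universal = defaultdict(list)
for label in labels:
    universal, subtype = split_deprel(label)
    by_universal[universal].append(subtype)

for universal, subtypes in by_universal.items():
    print(universal, subtypes)
```

Comparing treebanks at the universal level then amounts to ignoring the second element of each pair.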
@sylvainkahane I agree with the point about However
My sense is that there are too many languages that do not clearly distinguish relative clauses. So while I support strongly recommending the subtype for languages with the distinction, I think making it a major type would make generalizations about adnominal clauses harder. |
I think this discussion is very important, especially looking forward to a potential version 3 of the guidelines. Unfortunately, I don't think it is as simple as universal relations always being based on functional criteria. Functional criteria work for relations that are (part of) constructions in Croft's sense, such as An independent problem with the current way of representing syntactic relations in UD is that the subtyping mechanism is extremely crude and has to do double duty for cross-linguistically prominent subtypes, like "acl:relcl" and "nsubj:pass", as well as for more truly language-specific phenomena. Since subtypes are furthermore both atomic and non-recursive, the expressivity is severely limited, which means that many interesting subtypes cannot be represented at all because some other subtype has been given priority. Another desideratum for v3 is therefore to have a more expressive mechanism for subclassifying syntactic relations, in the same way that we can subclassify morphological categories using features. |
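As a rough illustration of what a feature-like mechanism could look like, here is a sketch in which a relation carries a set of key=value attributes instead of a single atomic subtype. Everything here is hypothetical: the attribute names `Strategy` and `Referential`, their values, and the bracket notation are invented for the example and do not correspond to any existing or proposed UD inventory.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    """A universal relation plus an unordered set of (key, value) attributes."""
    universal: str
    features: frozenset = frozenset()

    def has(self, key, value):
        return (key, value) in self.features

def relation(universal, **features):
    """Build a Relation from keyword attributes (hypothetical names)."""
    return Relation(universal, frozenset(features.items()))

def serialize(rel):
    """Render as universal[Key=Value|...], with keys sorted as in FEATS."""
    if not rel.features:
        return rel.universal
    feats = "|".join(f"{k}={v}" for k, v in sorted(rel.features))
    return f"{rel.universal}[{feats}]"

# Two nominal modifiers that differ in more than one dimension at once,
# which a single atomic subtype could not express.
flagged = relation("nmod", Strategy="Flag", Referential="Yes")
juxtaposed = relation("nmod", Strategy="Juxtaposition", Referential="No")

print(serialize(flagged))
print(serialize(juxtaposed))
```

The point of the sketch is only that attributes combine freely, so giving one of them priority no longer blocks the others.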
@jnivre I agree with you that the functional definition of syntactic relations can only concern the relations between content words in UD due to the particular status given to function words in UD. @amir-zeldes I think that the current definition of |
For UDv3 I think we can consider the possibility that |
@sylvainkahane I totally agree that this is not a sufficient definition, though this is true of many deprels, especially at the universal level. There is a lot of literature on what compounds are and aren't typologically, but I think it really only makes sense as an annotation guideline to consider it on a language-by-language basis. In Semitic languages, there is a tradition of regarding construct states as compounds, even though they are much more flexible than, say, English compounds. By contrast, compounds in German are less flexible than English ones - yet the term compound is still traditionally applied. At the end of the day, as Croft also pointed out as early as Radical Construction Grammar, and earlier, there are no 1:1 correspondences across languages. But I think as a project that serves the linguistic community, UD can still be helpful in labeling some things as
Actually there are some differences, and they relate to whether or not the modifier has the properties of a normal noun in the language. For example, the modifier has no restriction on number in the genitive construction - "the teachers' book" vs. "the teacher's books" or any other combination - either the head or modifier can be pluralized, or both or neither, like in other environments. This is not true for English compounds, suggesting the modifier is not quite a complete noun in itself. And if we say phrases are like words in that they can be pronominalized (at least for nouns/NPs), then that is another criterion by which compound modifiers are not normal noun modifiers or phrases - they cannot generally be referred back to by a pronoun: "the book club read it" cannot mean that the club read the book after which it was named "the book club" - we have to introduce some other book earlier in the discourse as the antecedent. |
Of course, compounds in English have specific properties both distributional and functional. The fact that the dependent noun cannot be inflected and is not referential is very important. Let’s give another example. In French we have a similar contrast between two constructions, depending on whether the dependent noun has a determiner or not: le livreur de la pizza ’the pizza’s boy’ vs le livreur de pizza ’the pizza boy’. In the second case, ‘pizza’ is not referential, it can less easily be modified, and it is almost impossible to add an adjective before de pizza. But there is a difference between French and English, because both constructions use the same preposition de and they merge when we have a proper noun: le théorème de Lebesgue ’The Lebesgue theorem, Lebesgue’s theorem’. Do we think that we must distinguish the two constructions and use @jnivre What is your opinion? (but maybe you are biased because you are also a native speaker of a Germanic language). |
I am probably biased but maybe in a different way than English speakers, because the distinction between compounding and modification is more clear-cut in Swedish, not only because of orthography (compounds are written without internal spaces, at least in normative orthography) but also because of prosody. As you may remember, Swedish is a tone language, and compounds have one of our two word tones, which phrases including modifiers never have. However, to complicate things, the first part of a compound can be referential, as in "Palmemordet" (the Palme murder), which is the normal way of referring to the (still unsolved) murder of our prime minister Olof Palme in 1986, and this compound is basically synonymous with the phrase "mordet på Palme" (lit. the-murder on Palme), and even the Saxon genitive "Palmes mord" (Palme's murder) is marginally possible (although unnatural in most contexts). One way of describing this state of affairs is then that Swedish can use three different morphosyntactic strategies (which Croft would call juxtaposition, flag, and linker, respectively) for one and the same functionally defined construction, nominal modification. And from this point of view, it makes perfect sense to use the First of all, the current UD taxonomy of syntactic relations was not defined from the beginning with the goal of separating functionally defined universal constructions from morphosyntactic strategies, even though parts of the taxonomy are perhaps compatible with such a view. It would therefore be hard to implement this idea for the entire taxonomy, which is why I personally see this discussion as mostly relevant for version 3 of the guidelines, which could involve a revision of this taxonomy. 
Secondly, the main motivation for having the Thirdly, even though compound modification can be referential in Swedish, it doesn't have to be, which means that the strategy of juxtaposing two lexical stems to form a single word can be associated with multiple (functionally defined) constructions. Maybe some or most of these are similar enough to be grouped under |
This is interesting to read, because from a "synchronic" point of view this kind of separation seems (at least to me) to be one of the main goals. Personally, it is an impression that grew stronger and stronger as I annotated data myself, as a kind of necessity. One could also argue that it is not possible to do otherwise if the goal is to achieve comparability. If we, for example, say that the vague notion of There are many layers of annotation in UD, and we do have means (linear position; morphological features; presence of functional elements...) to distinguish all the cases discussed here. This makes for interesting annotations in my opinion, not blurring and conflating these layers. Just my 2 cents on some more specific issues raised by @jnivre : prosody and lexical integrity.
|
I think it is confusing to use one definition of "syntactic word" for purposes of determining the tokenization/units that get dependency relations, and another, more nebulous definition of "word" for purposes of grouping some of those units together via Inevitably, orthographic conventions will dictate the tokenization to some extent. For some terms, N+N spelling preferences can differ within a language community ("tabletop" or "table-top" or "table top"?). Semantically, it is tempting to say that these are all similar and captured by some broad notion of wordhood, such that even if tokenization differs the In addition to the extremely frequent N+N combinations, the term "compound" in English can also apply to complex attributive modifiers written with spaces or hyphens, like "4-legged" or "fire-breathing", as discussed in §4 of our Mischievous Nominals paper. Some of these are productive: considering "fire-breathing" and "church-going" as two examples of one pattern, one could argue there is a morphological process at work rather than a syntactic one, with V+obj or V+obl combinations being repackaged (with the V second) as effectively adjectives. Here, though "fire" and "church" are nouns and dependents of "breathing" and "going" respectively (because the participles better reflect the distribution of the phrase), it is hard to say that "fire" and "church" attach as |
It seems that I did not quite manage to get my points across so let me try to express myself more clearly. Two cornerstones of the UD annotation framework are (a) lexicalism and (b) dependency. Lexicalism means drawing a strict boundary between word-internal structure, handled in the morphological annotation layer, and word-external structure, handled in the syntactic layer. Dependency means analysing syntax in terms of functional relations between words, rather than constituent structure. Neither of these assumptions is perfectly upheld in the current version of UD, and there is a lot to say about dependency as well, but I will focus on lexicalism for now. A consequence of lexicalism is that, if language A uses morphology to encode a phenomenon, while language B uses syntax, then the annotations will look radically different even if the function encoded is (essentially) the same. Thus, if language A uses instrumental case and language B uses a preposition to encode that a nominal is an oblique agent phrase in a passive construction, then this will be captured in the annotation by the presence of a feature Case=Ins on the noun in language A and by the presence of a relation labeled Now, in a perfect world, this would be the only case where annotations look radically different even if the function is essentially the same. Unfortunately, we also have cases where the "words" used as annotation units in a treebank are not true morphosyntactic words. 
Therefore, we have at least three relations that are not true syntactic relations, but rather exist for the purpose of fixing segmentation mismatches, namely For Now, if people don't think that "orange juice" in English is one syntactic word, then I think we should stop using Finally, to address one of @nschneid's comments, I don't think there are two different definitions of "syntactic word", but I think we have not made explicit enough in the guidelines that, because of the inevitable segmentation mismatches due to standard tokenisers, some of our "syntactic" relations are really tools for stitching together syntactic words. Incidentally, this is also why I think it is wrong -- under the current UD guidelines -- to segment compounds written without spaces in Swedish and German, because they are syntactic words (and their constituent parts are not). Despite my best intentions, I may have ended up rambling, so do let me know if anything is still unclear. :) |
Thanks, that helps! I guess what I am trying to say in regard to English compounds is that
So, if we were to go with "syntactic word that happens to contain multiple tokens" as the criterion for |
Thanks, @nschneid. The coordination case occurs in Swedish too, but the standard orthography indicates that it is a case of ellipsis: "köks- eller matsalsbord" = "köksbord eller matsalsbord". In addition to the hyphen, which indicates the missing second part, it is worth noting that "köks" ("kitchen"+s) has the special "s" morph, which only occurs in compound formation. Is an elliptic analysis conceivable in English too? This would treat "kitchen and dining room table" as elliptic for "kitchen table and dining room table". Through promotion, the element "kitchen" would then take the place of the missing compound head as the first conjunct: conj(kitchen, table) |
Good point. I agree that an ellipsis analysis would be plain wrong in this case. Similar examples are marginally possible in Swedish too, but I think most people would use hyphenation to indicate the exceptional status of the coordinated element: "papper-och-penna-test". It seems clear that compounding in English is less constrained than in Swedish and more similar to phrasal modification, but it is not clear whether this is grounds for abandoning the |
Hi again - this is obviously a very complex discussion, but I'd just like to point out that if we think that "Palmemordet" is a compound, or perhaps even "nuncupo", then the relation Whether English noun-noun, or other types of compounds as identified in traditional (non-UD) linguistics should use the |
In the long discussion #1059, @jnivre has defended the fact that syntactic relations must be defined on a functionalist ground and that, for instance, all genitive constructions must be nmod, whether they involve a noun or a pronoun.

On the other hand, if we look at the English treebanks (which is the only set of treebanks that all of us can easily explore and which are consequently our unavoidable references), we see that there are three different syntactic constructions where a noun depends on a noun, which are clearly distinguished by the use of three different syntactic relations:

- nmod for "N of N"
- nmod:poss for "N's N"
- compound for "N N"

It is what I would call a distributionalist approach of the syntactic relations. Syntactic constructions that can be clearly distinguished in the language by distributional/syntactic properties are distinguished.

I think that both approaches, functionalist and distributionalist, are useful. But the UD tagset must be clarified and should not mix both approaches at the same level. For instance, in the case of the three constructions in English where a noun depends on a noun, we should have the relation nmod and an (optional) subrelation indicating the particular constructions. For instance:

- nmod:adp for "N of N"
- nmod:poss for "N's N" (or maybe nmod:det because "N's" are in the same syntactic position as determiners in English)
- nmod:compound for "N N"

I don't think that compound is really justified on a functionalist ground.
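The relabeling suggested above can be sketched as a simple lookup. This is only an illustration under a strong assumption: it treats every nmod and compound edge as one of the three noun-on-noun constructions listed, which real English treebanks do not guarantee (nmod also covers other prepositions, and compound is not limited to N N). The names `PROPOSED`, `relabel`, and `universal` are hypothetical.

```python
# Current English label -> proposed label, for the three constructions above.
PROPOSED = {
    "nmod": "nmod:adp",           # "N of N"
    "nmod:poss": "nmod:poss",     # "N's N"
    "compound": "nmod:compound",  # "N N"
}

def relabel(deprel):
    """Return the proposed label, leaving all other relations untouched."""
    return PROPOSED.get(deprel, deprel)

def universal(deprel):
    """Drop the subtype, keeping only the universal relation."""
    return deprel.split(":", 1)[0]

for current in ["nmod", "nmod:poss", "compound", "amod"]:
    proposed = relabel(current)
    print(current, "->", proposed, "| universal:", universal(proposed))
```

After relabeling, the three constructions share the universal relation nmod and differ only at the subtype level, which is the separation of levels argued for above.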