Spelled-out numbers #198
Comments
What is the problem with using
Looks pretty OK to me. Couldn't something similar be done for Turkish?
This looks good. The reason for my question was because the relation between hundred and two is not the same as the relation between thirty and three. The first one (for me) has a clear head, but the second one does not. But your answer resolves it for me, the second relation looks more like a conjunction than a compound. I was not thinking about
It also captures the fact that it is 3 + 30 + 2*100. Of course, if there is an already existing standard I'd rather follow it. |
I agree with the analysis of |
Choice of the order in '3 + 30' is arbitrary, like in other conjunctions. But it reflects that I want to mark it head final. Although the choice of head between 30 and 3 (or 100) is arbitrary here, if the whole numeric expression is inflected, the suffixes attach to the last number. |
Agree with Jonathan here. My post should have read:
For Turkic, switch head-initial to head-final. |
Since Uppsala has confirmed that the Also I find it strange to analyze compound numerals as coordination, unless there is an overt conjunction as in English. I have tried to analyze all such examples just with the |
Some more data on this: In Turkish, it is very common to use numbers as noted earlier.
An overt conjunction is never used, but "understood". If there is a ve "and" in between two numbers, these must be two different numbers. In spoken language, intonation would also indicate (I think) which parts are to be understood as conjunction (addition) and which parts should be compound (multiplication). Intonation is different for dokuz yüz "9 * 100 = 900" and on dokuz "10 + 9 = 19".
For these reasons, the above proposal sounds quite appealing to me. Furthermore, we get a very clean way to map dependency relations to arithmetic, which wouldn't hurt. The direction of conjunction is not important for the representation of the numbers suggested above. However, for other reasons, such as the suffixes that are added to the last token, I still think the last conjunct should be the head for Turkish and for other languages for which it eases the use of the treebanks (see #236 for more on this discussion).
But there is one more issue with conjoining numbers in Turkish. There is another very common usage where two or more numbers are coordinated without an explicit conjunction. But in this case, it means "or" or indicates a range with a hint of approximation.
For now, I have opted for marking the "or" usage with |
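To make the "dependency relations map to arithmetic" point concrete, here is a minimal sketch in Python. It assumes the reading described above, where a compound dependent multiplies its head and a conj dependent is added to it. The Node class, the evaluate function, and the head-final attachment of the example trees are hypothetical illustrations, not an agreed annotation or any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One numeral token: its form, its own value, and its dependents."""
    form: str
    value: int
    deps: list = field(default_factory=list)  # list of (relation, Node) pairs


def evaluate(node: Node) -> int:
    """Read a numeral tree as arithmetic: compound multiplies, conj adds."""
    for rel, _ in node.deps:
        if rel not in ("compound", "conj"):
            raise ValueError(f"unexpected relation: {rel}")
    total = node.value
    # Multiply first, so that 'iki yüz' becomes 2 * 100 before anything is added.
    for rel, child in node.deps:
        if rel == "compound":
            total *= evaluate(child)
    for rel, child in node.deps:
        if rel == "conj":
            total += evaluate(child)
    return total


# Hypothetical head-final tree for 'iki yüz otuz üç':
# the last number is the head, the earlier additive parts are its conjuncts.
iki = Node("iki", 2)
yuz = Node("yüz", 100, [("compound", iki)])
otuz = Node("otuz", 30)
uc = Node("üç", 3, [("conj", otuz), ("conj", yuz)])

# The two intonation examples: 'dokuz yüz' (compound) and 'on dokuz' (conj).
dokuz_yuz = Node("yüz", 100, [("compound", Node("dokuz", 9))])
on_dokuz = Node("dokuz", 9, [("conj", Node("on", 10))])

print(evaluate(uc))         # 3 + 30 + 2 * 100 = 233
print(evaluate(dokuz_yuz))  # 9 * 100 = 900
print(evaluate(on_dokuz))   # 10 + 9 = 19
```

Because addition is commutative, flipping the conj direction (head-initial instead of head-final) gives the same values, which is why the head choice can be made on other grounds such as where the suffixes attach.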
FWIW, such “or/to” covert coordination of numerals exists in Czech, too. Here I would not object to |
Closing as obsolete. In UD v2, words of a number are connected via |
@dan-zeman this is currently not the case in English-GUM, which was modeled after EWT (both corpora currently use |
I do not know what the current situation is in the individual treebanks. I know that the v1 guidelines recommended |
I don't think flat is a good idea here, since flat mandates left-to-right attachment. In some complex numerals there might be a motivation to prefer one part (e.g. a fraction is added to the non-fraction part of a complex number?), and certainly for dates. Maybe flat happens to work for the example:
1 December 2016
But if it's:
December 1, 2016
I think the head should be the day, and flat removes that option. In the latter case, at least for English, I'm for:
compound(1, December)
This analysis basically says that there are multiple "MONTH 1" dates, and this is the December version:
((December) 1)
Does that make sense?
There is an open issue for dates: #455 (as well as several closed issues, e.g. #113 and #210). |
For spelled-out numerals, such as four thousand, |
Doesn't four modify thousand? If I had to guess, I'd go with the normal Germanic right-headed rule, and it fits nicely with German, where it looks like a normal compound (viertausend).
Well... possibly, yes. You could even say that it is |
I think semantically it is 'counting', but syntactically it's more like a compound, at least if we consider "ten" in "ten year old" to be |
I think there would be no problem in considering all additive components of a spelled out number (e.g. units, tens, hundreds...) as four thousand two hundred thirty three
@Stormur I thought BTW the English guidelines currently say |
Note that in spoken language, where transcripts often just list words with spaces, similar word sequences can mean very different things:
So I think order is important, and I also don't think it lacks a head or hierarchy. For commutative addition we can just follow the normal conj guideline, placing the head on the left by convention. |
From a very practical point of view, it seems that the validator hasn't complained until now, so I would think it is allowed. I do not see it as impossible: the head elements might have no hierarchy, but be phrases by themselves... no?
I would indeed treat these two cases differently! The first one is "additive" (and represents a single number), hence |
I agree with @Stormur’s analysis. We just need to clarify this case for Slavic languages, where thousands and millions are usually analyzed as nouns: four thousands two_hundred thirty three dollars.
We have been attaching the numeral as a If thousands is tagged |
It may be a trivial question, but I could not find a direct answer or example.
In Turkish corpora, I see quite a few examples of numbers that are completely spelled out. For example, otuz üç 'thirty three'. It looks natural to relate the parts with mwe. However, I saw (for example in the English UD treebank) that numbers like "three million" are marked using compound. This is not a good option for the above example, since it does not have a clear structure. But the decision becomes arbitrary, since we also see examples like iki yüz 'two hundred', and it gets difficult with iki yüz otuz üç 'two hundred thirty three'.
As I understand it, in the METU-Sabancı treebank these were joined together during tokenization.
I am inclined to mark all of them with mwe, with a flat, head-final structure, but I am afraid of losing the parallel with the other languages. (The motivation for a head-final structure is the same as those expressed in #189.)