Spelled-out numbers #198

coltekin · 2015-07-02T15:33:27Z

It may be a trivial question, but I could not find a direct answer or example.

In Turkish corpora, I see quite a few examples of numbers that are completely spelled out, for example otuz üç 'thirty three'. It looks natural to relate the parts with mwe. However, I saw (for example in the English UD treebank) that numbers like "three million" are marked using compound. This is not a good option for the above example, since it does not have a clear structure. The decision then becomes arbitrary, since we also see examples like iki yüz 'two hundred', and it gets difficult with iki yüz otuz üç 'two hundred thirty three'.

As I understand it, in the METU-Sabancı treebank these were joined together during tokenization.

I am inclined to mark all of them with mwe in a flat, head-final structure, but I am afraid of losing the parallel with other languages. (The motivation for the head-final structure is the same as the one expressed in #189.)


ftyers · 2015-07-02T16:23:37Z

What is the problem with using compound?

two hundred and thirty three 
compound(hundred, two)
compound(thirty, three)
conj(hundred, thirty)
cc(hundred, and)

Looks pretty OK to me. Couldn't something similar be done for Turkish?

coltekin · 2015-07-02T20:07:13Z

This looks good. The reason for my question was that the relation between hundred and two is not the same as the relation between thirty and three. The first one (for me) has a clear head, but the second one does not. But your answer resolves it for me: the second relation looks more like a conjunction than a compound.

I was not thinking about conj since Turkish does not use an explicit conjunction. With a combination of conj and compound my preference would be:

iki yüz otuz üç
compound(yüz, iki)
conj(üç, otuz)
conj(üç, yüz)

It also captures the fact that it is 3 + 30 + 2*100.

Of course, if there is an already existing standard I'd rather follow it.
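
The mapping from relations to arithmetic can be illustrated with a minimal sketch. It assumes, as proposed above, that compound stands for multiplication and conj for addition. The `WORD_VALUES` table and the `evaluate` function are hypothetical names introduced only for this example; they are not part of any UD tool. Tokens are identified by word form here, so the sketch only works when no word is repeated in the expression.

```python
# Hypothetical lexicon for this example only; not part of any UD resource.
WORD_VALUES = {"iki": 2, "yüz": 100, "otuz": 30, "üç": 3}

# Edges as (relation, head, dependent), following the analysis above.
EDGES = [
    ("compound", "yüz", "iki"),
    ("conj", "üç", "otuz"),
    ("conj", "üç", "yüz"),
]


def evaluate(word, edges, values):
    """Return the number denoted by the subtree rooted at `word`.

    Assumption: compound dependents multiply the head's value,
    conj dependents are added to it.
    """
    total = values[word]
    # Apply multiplication first, so that 2 * 100 is built before any addition.
    for rel, head, dep in edges:
        if head == word and rel == "compound":
            total *= evaluate(dep, edges, values)
    # Then add the values of all conjoined subtrees.
    for rel, head, dep in edges:
        if head == word and rel == "conj":
            total += evaluate(dep, edges, values)
    return total


print(evaluate("üç", EDGES, WORD_VALUES))  # 233
```

Starting from the head-final root üç, the function adds otuz (30) and the subtree of yüz (2 * 100), which gives 3 + 30 + 2*100.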

jonorthwash · 2015-07-03T00:27:28Z

I agree with the analysis of 2*100 + 30 + 3, where + is treated as conj and * is treated as compound or similar. Treating 30+3 (or 3+30, in this case, for some reason?) the same as 2*100 doesn't seem consistent.

coltekin · 2015-07-03T07:24:09Z

Choice of the order in '3 + 30' is arbitrary, like in other conjunctions. But it reflects that I want to mark it head final. Although the choice of head between 30 and 3 (or 100) is arbitrary here, if the whole numeric expression is inflected, the suffixes attach to the last number.

ftyers · 2015-07-03T14:58:02Z

Agree with Jonathan here. My post should have read:

two hundred and thirty three 
compound(hundred, two)
conj(hundred, three)
conj(hundred, thirty)
cc(hundred, and)

For Turkic, switch head-initial to head-final.

dan-zeman · 2015-11-15T21:58:07Z

Since Uppsala has confirmed that the conj relation must go left-to-right, head-final is not an option here (but there is no such restriction for compound).
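
A minimal sketch of what this restriction means for the two candidate analyses of iki yüz otuz üç. The function name `right_to_left_conj` and the edge encoding (0-based token indices) are assumptions made for illustration; this is not how the official validator is implemented.

```python
def right_to_left_conj(tokens, edges):
    """Return the conj edges whose dependent precedes its head.

    tokens: word forms in sentence order
    edges:  (relation, head_index, dependent_index) triples, 0-based
    """
    return [
        (rel, tokens[head], tokens[dep])
        for rel, head, dep in edges
        if rel == "conj" and dep < head
    ]


TOKENS = ["iki", "yüz", "otuz", "üç"]

# Head-final proposal: the last numeral heads the conj relations.
HEAD_FINAL = [("compound", 1, 0), ("conj", 3, 2), ("conj", 3, 1)]

# Left-to-right alternative: the first additive part heads the conj relations.
HEAD_INITIAL = [("compound", 1, 0), ("conj", 1, 2), ("conj", 1, 3)]

print(right_to_left_conj(TOKENS, HEAD_FINAL))    # two offending conj edges
print(right_to_left_conj(TOKENS, HEAD_INITIAL))  # []
```

The compound edge is head-final in both variants and is not reported, since the restriction applies only to conj.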

Also, I find it strange to analyze compound numerals as coordination unless there is an overt conjunction, as in English. I have tried to analyze all such examples just with the compound relation. (But I do not have many examples. The vast majority of numbers in the Czech data are expressed using digits. And sometimes the words for thousand and million are tagged NOUN, which also breaks the compound chains.)

coltekin · 2015-11-19T15:27:58Z

Some more data on this: In Turkish, it is very common to use numbers as noted earlier.

Bin       dokuz  yüz      on   dokuzda
Thousand  nine   hundred  ten  nine-LOC
In 1919

An overt conjunction is never used, but "understood". If there is a ve "and" in between two numbers, these must be two different numbers. In spoken language, intonation would also indicate (I think) which parts are to be understood as conjunction (addition) and which parts should be compound (multiplication). Intonation is different for dokuz yüz "9 * 100 = 900" and on dokuz "10 + 9 = 19". For these reasons, the above proposal sounds quite appealing to me. Furthermore, we get a very clean way to map dependency relations to arithmetic nicely, which wouldn't hurt.

The direction of conjunction is not important for the representation of the numbers suggested above. However, for other reasons, such as the suffixes that are added to the last token, I still think the last conjunct should be the head for Turkish and for other languages for which it eases the use of the treebanks (see #236 for more on this discussion).

But there is one more issue with conjoining numbers in Turkish. There is another very common usage where two or more numbers are coordinated without an explicit conjunction. In this case, it means "or", or indicates a range with a hint of approximation:

Üç      dört     kişi
Three   four     person
Three or/to four people

For now, I have opted for marking the "or" usage with conj, and the numeric combination with conj:num. I am not attached to the labels or the exact solution, but I definitely agree with the "standard needed" tag above.

dan-zeman · 2015-11-19T19:23:11Z

FWIW, such “or/to” covert coordination of numerals exists in Czech, too. Here I would not object to conj, although we don't use it in UD Czech 1.2. The original treebank was not able to represent coordination if there was neither a coordinating conjunction nor punctuation, which is the case here. Therefore both numerals are attached to the same counted noun as nummod.

http://hdl.handle.net/11346/PMLTQ-VAWO

dan-zeman · 2018-04-23T15:51:47Z

Closing as obsolete. In UD v2, words of a number are connected via flat.

amir-zeldes · 2018-04-23T16:16:02Z

@dan-zeman this is currently not the case in English-GUM, which was modeled after EWT (both corpora currently use compound). This is convenient because then SD number can be converted directly to compound, and there's no need to ensure left-to-right. Is there a plan to change to flat in English/other languages? Also @sebschu

dan-zeman · 2018-04-24T07:04:27Z

I do not know what the current situation is in the individual treebanks. I know that the v1 guidelines recommended compound for this. But somehow this sneaked into the v2 guidelines for flat: http://universaldependencies.org/u/dep/flat.html#dates-and-complex-numerals

amir-zeldes · 2018-04-24T17:58:13Z

I don't think flat is a good idea here, as flat mandates left-to-right attachment. In some complex numerals there might be a motivation to prefer one part as the head (e.g. the fraction is added to the non-fraction part of a complex number?), and certainly for dates. Maybe flat happens to work for the example:

1 December 2016

But if it's:

December 1, 2016

I think the head should be the day, and flat removes that option. In the latter case at least for English, I'm for:

compound(1,December)
nmod:tmod(1,2016)

This analysis basically says, there are multiple "MONTH 1" dates, and this is the December version: ((December) 1)

Does that make sense?

martinpopel · 2018-04-24T19:32:25Z

There is an open issue for dates: #455 (as well as several closed issues, e.g. #113 and #210).
I would suggest discussing dates there and keeping this issue for (other) spelled-out numbers.
Of course, we should take into account e.g. the (not only Turkish) spelled-out years, which are in the intersection of both issues.

dan-zeman · 2018-04-24T20:30:09Z

For spelled-out numerals, such as four thousand, flat seems better than compound exactly because it does not attempt to make one part the head. It may not be the preferred approach in all languages but I remember I was struggling with identifying the head when I was using compound in Czech, and I think there is no evidence of headedness in English either.

amir-zeldes · 2018-04-24T20:39:46Z

Doesn't four modify thousand? If I had to guess, I'd go with the normal Germanic right-headed rule, and it fits nicely with German, where it looks like a normal compound (viertausend).

dan-zeman · 2018-04-24T20:45:41Z

Well... possibly, yes. You could even say that it is nummod(thousand, four) (answering "How many thousands?"). But I don't feel strongly about it.

amir-zeldes · 2018-04-25T13:47:08Z

I think semantically it is 'counting', but syntactically it's more like a compound, at least if we consider "ten" in "ten year old" to be compound. This seems to have the same property of not pluralizing the modified number, since it is itself a modifier. It would be nummod for me in "she is ten years old" and "there are four thousands there".

Stormur · 2021-01-22T16:11:09Z

I think there would be no problem in considering all additive components of a spelled-out number (e.g. units, tens, hundreds...) as flat, since their order is largely arbitrary (addition has the commutative property! 🙂 ) and it shows a lot of variation, linguistically and stylistically, while allowing an internal structure for some of these components. So, for example:

four thousand two hundred thirty three

head(thousand)
nummod(thousand,four)
flat(thousand,hundred)
nummod(hundred,two)
flat(thousand,thirty)
flat(thousand,three)

nschneid · 2021-01-22T16:37:14Z

@Stormur I thought flat dependents weren't supposed to have their own dependents? So flat(thousand,hundred) could not be combined with nummod(hundred,two).

BTW the English guidelines currently say compound: https://universaldependencies.org/en/dep/compound.html
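
The conflict can be made explicit with a small sketch that lists flat dependents that have dependents of their own. The function name `flat_dependents_with_children` is hypothetical, the edges are copied from the analysis quoted above, and tokens are identified by word form for brevity. Whether such a configuration should be rejected is exactly the open question here; the sketch only detects it.

```python
# Edges as (relation, head, dependent), from the proposed analysis of
# "four thousand two hundred thirty three".
EDGES = [
    ("nummod", "thousand", "four"),
    ("flat", "thousand", "hundred"),
    ("nummod", "hundred", "two"),
    ("flat", "thousand", "thirty"),
    ("flat", "thousand", "three"),
]


def flat_dependents_with_children(edges):
    """Return the words attached via flat that also head another edge."""
    flat_deps = {dep for rel, _, dep in edges if rel == "flat"}
    return sorted({head for _, head, _ in edges if head in flat_deps})


print(flat_dependents_with_children(EDGES))  # ['hundred']
```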

amir-zeldes · 2021-01-22T16:54:07Z

Note that in spoken language, where transcripts often just list words with spaces, things mean very different things:

"twenty five" = 25
"five twenty" = 5:20

So I think order is important, and I also don't think it lacks a head or hierarchy. For commutative addition we can just follow the normal conj guideline, placing the head on the left by convention.

Stormur · 2021-01-22T17:04:52Z

> @Stormur I thought flat dependents weren't supposed to have their own dependents? So flat(thousand,hundred) could not be combined with nummod(hundred,two).
>
> BTW the English guidelines currently say compound: https://universaldependencies.org/en/dep/compound.html

From a very practical point of view, it seems that the validator hasn't complained until now, so I would think it is allowed. I do not see it as impossible: the head elements might have no hierarchy, but be phrases by themselves... no?

> Note that in spoken language, where transcripts often just list words with spaces, things mean very different things:
>
> * "twenty five" = 25
> * "five twenty" = 5:20
>
> So I think order is important, and I also don't think it lacks a head or hierarchy. For commutative addition we can just follow the normal conj guideline, placing the head on the left by convention.

I would indeed treat these two cases differently! The first one is "additive" (and represents a single number), hence flat, the second is not (and I would probably choose conj, as it represents a combination)!

msklvsk · 2022-07-25T12:33:44Z

I agree with @Stormur’s analysis. I just need to clarify this case for Slavic languages, where thousands and millions are usually analyzed as nouns: four thousands two_hundred thirty three dollars.
There should be an nmod(thousands, dollars), correct? That means thousands will mix flat dependents with an nmod dependent. The validator is silent on this.

dan-zeman · 2022-07-25T20:18:48Z

> There should be an nmod(thousands, dollars), correct? Meaning thousands will mix flat dependents with an nmod dependent. The validator is silent on this.

We have been attaching the numeral as a nummod of the counted noun, not vice versa. For three dollars (as well as three girls/houses/cats...), the analysis is nummod(dollars, three). For thousand dollars the analysis is analogous, provided that thousand is NUM and not NOUN: nummod(dollars, thousand). In Czech and some other languages, the situation is further complicated by the fact that some (but not all) numerals in some (but not all) situations force the counted noun into the genitive case. SUD folks might see this as a reason to make the numeral the head, but in UD we kept nummod and instead added a subtype nummod:gov to preserve the information that the numeral governs the case of the noun: nummod:gov(dolarů, tisíc). (The subtype is currently used in some 10 languages, mostly Slavic, including Ukrainian.)

If thousands is tagged NOUN, the situation is different. The criteria for doing so seem blurry to me, but I believe it should be NUM if it occurs as a part of a longer number.


dan-zeman added standard needed universal labels Nov 15, 2015

dan-zeman added this to the lg-specific v1.3 milestone Nov 15, 2015

dan-zeman modified the milestones: lg-specific v2, lg-specific v1.3 Nov 17, 2016

dan-zeman closed this as completed Apr 23, 2018

dan-zeman reopened this Apr 24, 2018

dan-zeman modified the milestones: lg-specific v2, v2.2 Apr 24, 2018

dan-zeman modified the milestones: v2.2, v2.4 Nov 13, 2018

dan-zeman modified the milestones: v2.4, v2.5 Oct 6, 2019

dan-zeman modified the milestones: v2.5, v2.6 Nov 11, 2019

dan-zeman modified the milestones: v2.6, v2.7 May 14, 2020

dan-zeman modified the milestones: v2.7, v2.8 Nov 14, 2020

dan-zeman modified the milestones: v2.8, v2.9 Jun 17, 2021

dan-zeman modified the milestones: v2.9, v2.11 Jun 13, 2022

msklvsk added a commit to mova-institute/zoloto that referenced this issue Jul 26, 2022

re-annotate large numerals

99b0477

according to UniversalDependencies/docs#198 (comment)

msklvsk added a commit to mova-institute/zoloto that referenced this issue Jul 27, 2022

re-annotate large numerals

f44167b

according to UniversalDependencies/docs#198 (comment)

msklvsk added a commit to UniversalDependencies/UD_Ukrainian-IU that referenced this issue Jul 27, 2022

reanalyze large numerals

dd951f6

according to UniversalDependencies/docs#198 (comment)

dan-zeman modified the milestones: v2.11, v2.13 May 29, 2023

dan-zeman closed this as completed Nov 15, 2023
