Spelled-out numbers #198

Closed
coltekin opened this issue Jul 2, 2015 · 23 comments

Comments

@coltekin
Member

coltekin commented Jul 2, 2015

It may be a trivial question, but I could not find a direct answer or example.

In Turkish corpora, I see quite a few examples of numbers that are completely spelled out. For example, otuz üç 'thirty three'. It looks natural to relate the parts with mwe. However, I saw (for example in the English UD treebank) that numbers like "three million" are marked using compound. This is not a good option for the above example, since it does not have a clear structure. The decision also becomes arbitrary, since we see examples like iki yüz 'two hundred' as well, and it gets difficult with iki yüz otuz üç 'two hundred thirty three'.

As I understand it, in the METU-Sabancı treebank these were joined together during tokenization.

I am inclined to mark all of them with mwe in a flat, head-final structure, but I am afraid of losing the parallel with the other languages. (The motivation for the head-final structure is the same as the ones expressed in #189.)

@ftyers
Contributor

ftyers commented Jul 2, 2015

What is the problem with using compound?

two hundred and thirty three 
compound(hundred, two)
compound(thirty, three)
conj(hundred, thirty)
cc(hundred, and)

Looks pretty OK to me. Couldn't something similar be done for Turkish?

@coltekin
Member Author

coltekin commented Jul 2, 2015

This looks good. The reason for my question was that the relation between hundred and two is not the same as the relation between thirty and three. The first one (for me) has a clear head, but the second one does not. But your answer resolves it for me: the second relation looks more like a conjunction than a compound.

I was not thinking about conj since Turkish does not use an explicit conjunction. With a combination of conj and compound, my preference would be:

iki yüz otuz üç
compound(yüz, iki)
conj(üç, otuz)
conj(üç, yüz)

It also captures the fact that it is 3 + 30 + 2*100.

Of course, if there is an already existing standard I'd rather follow it.
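
A minimal sketch of that arithmetic reading, assuming compound multiplies the head by its dependent and conj adds the dependent to the head. The word values and the function name `numeral_value` are made up for illustration, and words are keyed by form, which only works here because no word occurs twice.

```python
# Hypothetical lexicon: numeric value of each word in the example.
VALUES = {"iki": 2, "yüz": 100, "otuz": 30, "üç": 3}

# The analysis above as (relation, head, dependent) triples.
EDGES = [
    ("compound", "yüz", "iki"),
    ("conj", "üç", "otuz"),
    ("conj", "üç", "yüz"),
]


def numeral_value(head, edges, values):
    """Evaluate a numeral subtree: compound = multiplication, conj = addition."""
    total = values[head]
    deps = [(rel, dep) for rel, h, dep in edges if h == head]
    # Apply multiplication first so that 2 * 100 is built before it is added.
    for rel, dep in deps:
        if rel == "compound":
            total *= numeral_value(dep, edges, values)
    for rel, dep in deps:
        if rel == "conj":
            total += numeral_value(dep, edges, values)
    return total


RESULT = numeral_value("üç", EDGES, VALUES)  # 3 + 30 + 2 * 100
```

Because addition is commutative, the value does not depend on which conjunct is chosen as the head.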

@jonorthwash
Contributor

I agree with the analysis of 2*100 + 30 + 3, where + is treated as conj and * is treated as compound or similar. Treating 30+3 (or 3+30, in this case, for some reason?) the same as 2*100 doesn't seem consistent.

@coltekin
Member Author

coltekin commented Jul 3, 2015

Choice of the order in '3 + 30' is arbitrary, like in other conjunctions. But it reflects that I want to mark it head final. Although the choice of head between 30 and 3 (or 100) is arbitrary here, if the whole numeric expression is inflected, the suffixes attach to the last number.

@ftyers
Contributor

ftyers commented Jul 3, 2015

Agree with Jonathan here. My post should have read:

two hundred and thirty three 
compound(hundred, two)
conj(hundred, three)
conj(hundred, thirty)
cc(hundred, and)

For Turkic, switch head-initial to head-final.

@dan-zeman
Member

Since Uppsala has confirmed that the conj relation must go left-to-right, head-final is not an option here (but there is no such restriction for compound).

Also, I find it strange to analyze compound numerals as coordination unless there is an overt conjunction, as in English. I have tried to analyze all such examples with just the compound relation. (But I do not have many examples. The vast majority of numbers in the Czech data are expressed using digits. And sometimes the word for thousand or million is tagged NOUN, which also breaks the compound chains.)

@coltekin
Member Author

Some more data on this: In Turkish, it is very common to use numbers as noted earlier.

Bin       dokuz yüz     on  dokuzda
Thousand  nine  hundred ten nine-LOC
In 1919

An overt conjunction is never used, but "understood". If there is a ve "and" in between two numbers, these must be two different numbers. In spoken language, intonation would also indicate (I think) which parts are to be understood as conjunction (addition) and which parts as compound (multiplication). Intonation is different for dokuz yüz "9 * 100 = 900" and on dokuz "10 + 9 = 19". For these reasons, the above proposal sounds quite appealing to me. Furthermore, we get a very clean way to map dependency relations to arithmetic, which wouldn't hurt.

The direction of conjunction is not important for the representation of the numbers suggested above. However, for other reasons, such as the suffixes that are added to the last token, I still think the last conjunct should be the head for Turkish and for other languages for which it eases the use of the treebanks (see #236 for more on this discussion).

But there is one more issue with conjoining numbers in Turkish. There is another very common usage where two or more numbers are coordinated without an explicit conjunction. In this case, however, it means "or", or indicates a range with a hint of approximation:

Üç      dört     kişi
Three   four     person
Three or/to four people 

For now, I have opted for marking the "or" usage with conj, and the numeric combination with conj:num. I am not attached to the labels or the exact solution, but I definitely agree with the "standard needed" tag above.
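
A rough sketch of how the two labels could be read off differently, assuming conj:num means addition, compound means multiplication, and plain conj lists alternatives. The head-final trees below are my own guess at how the proposal would apply to the two examples, tokens are keyed by position because dokuz occurs twice, and the helper names `value` and `alternatives` are hypothetical.

```python
# "Bin dokuz yüz on dokuzda": token position -> numeric value of the word.
YEAR_VALUES = {1: 1000, 2: 9, 3: 100, 4: 10, 5: 9}
# Assumed head-final analysis as (relation, head, dependent) triples.
YEAR_EDGES = [
    ("compound", 3, 2),   # dokuz yüz = 9 * 100
    ("conj:num", 5, 4),   # on dokuz = 10 + 9
    ("conj:num", 5, 3),
    ("conj:num", 5, 1),
]

# "Üç dört kişi": only the two numerals are modelled here.
RANGE_VALUES = {1: 3, 2: 4}
RANGE_EDGES = [("conj", 2, 1)]


def value(head, edges, values):
    """Single number: compound multiplies, conj:num adds."""
    total = values[head]
    deps = [(rel, dep) for rel, h, dep in edges if h == head]
    for rel, dep in deps:
        if rel == "compound":
            total *= value(dep, edges, values)
    for rel, dep in deps:
        if rel == "conj:num":
            total += value(dep, edges, values)
    return total


def alternatives(head, edges, values):
    """Plain conj: the conjuncts stay separate numbers (the 'or/to' reading)."""
    conjuncts = [dep for rel, h, dep in edges if h == head and rel == "conj"]
    return sorted(value(tok, edges, values) for tok in [head] + conjuncts)


YEAR = value(5, YEAR_EDGES, YEAR_VALUES)
CHOICES = alternatives(2, RANGE_EDGES, RANGE_VALUES)
```

With a single conj label for both usages, the two readings could not be told apart from the tree alone, which is the reason for the subtype.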

@dan-zeman
Member

FWIW, such “or/to” covert coordination of numerals exists in Czech, too. Here I would not object to conj, although we don't use it in UD Czech 1.2. The original treebank was not able to represent coordination if there was neither a coordinating conjunction nor punctuation, which is the case here. Therefore both numerals are attached to the same counted noun as nummod.

http://hdl.handle.net/11346/PMLTQ-VAWO

@dan-zeman
Member

Closing as obsolete. In UD v2, words of a number are connected via flat.

@amir-zeldes
Contributor

@dan-zeman this is currently not the case in English-GUM, which was modeled after EWT (both corpora currently use compound). This is convenient because then SD number can be converted directly to compound, and there's no need to ensure left-to-right. Is there a plan to change to flat in English/other languages? Also @sebschu

@dan-zeman
Member

I do not know what the current situation is in the individual treebanks. I know that the v1 guidelines recommended compound for this. But somehow this sneaked into the v2 guidelines for flat: http://universaldependencies.org/u/dep/flat.html#dates-and-complex-numerals

@dan-zeman dan-zeman reopened this Apr 24, 2018
@dan-zeman dan-zeman modified the milestones: lg-specific v2, v2.2 Apr 24, 2018
@amir-zeldes
Contributor

I don't think flat is a good idea here, as flat mandates left-to-right. In some complex numerals there might be a motivation to prefer one part (e.g. the fraction is added to the non-fraction part of a complex number?), and certainly for dates. Maybe flat happens to work for the example:

1 December 2016

But if it's:

December 1, 2016

I think the head should be the day, and flat removes that option. In the latter case at least for English, I'm for:

compound(1,December)
nmod:tmod(1,2016)

This analysis basically says, there are multiple "MONTH 1" dates, and this is the December version: ((December) 1)

Does that make sense?

@martinpopel
Member

There is an open issue for dates: #455 (as well as several closed issues, e.g. #113 and #210).
I would suggest discussing dates there and keeping this issue for (other) spelled-out numbers.
Of course, this should take into account e.g. the (not only Turkish) spelled-out years, which are in the intersection of both issues.

@dan-zeman
Member

For spelled-out numerals, such as four thousand, flat seems better than compound exactly because it does not attempt to make one part the head. It may not be the preferred approach in all languages but I remember I was struggling with identifying the head when I was using compound in Czech, and I think there is no evidence of headedness in English either.

@amir-zeldes
Contributor

Doesn't four modify thousand? If I had to guess, I'd go with the normal Germanic right-headed rule, and it fits nicely with German, where it looks like a normal compound (viertausend).

@dan-zeman
Member

Well... possibly, yes. You could even say that it is nummod(thousand, four) (answering "How many thousands?"). But I don't feel strongly about it.

@amir-zeldes
Contributor

I think semantically it is 'counting', but syntactically it's more like a compound, at least if we consider "ten" in "ten year old" to be compound. This seems to have the same property of not pluralizing the modified number, since it is itself a modifier. It would be nummod for me in "she is ten years old" and "there are four thousands there".

@dan-zeman dan-zeman modified the milestones: v2.2, v2.4 Nov 13, 2018
@dan-zeman dan-zeman modified the milestones: v2.4, v2.5 Oct 6, 2019
@dan-zeman dan-zeman modified the milestones: v2.5, v2.6 Nov 11, 2019
@dan-zeman dan-zeman modified the milestones: v2.6, v2.7 May 14, 2020
@dan-zeman dan-zeman modified the milestones: v2.7, v2.8 Nov 14, 2020
@Stormur
Contributor

Stormur commented Jan 22, 2021

I think there would be no problem in considering all additive components of a spelled-out number (e.g. units, tens, hundreds...) as flat, since their order is largely arbitrary (addition has the commutative property! 🙂 ) and it sees a lot of variation, linguistically and stylistically, while allowing an internal structure for some of these components. So, for example:

four thousand two hundred thirty three

head(thousand)
nummod(thousand,four)
flat(thousand,hundred)
nummod(hundred,two)
flat(thousand,thirty)
flat(thousand,three)

@nschneid
Contributor

@Stormur I thought flat dependents weren't supposed to have their own dependents? So flat(thousand,hundred) could not be combined with nummod(hundred,two).

BTW the English guidelines currently say compound: https://universaldependencies.org/en/dep/compound.html
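
The constraint in question is easy to check mechanically. A rough sketch, assuming the proposed analysis is written as (relation, head, dependent) triples keyed by word form; this is not the UD validator, and `flat_dependents_with_children` is a hypothetical helper.

```python
# The proposed analysis of "four thousand two hundred thirty three".
EDGES = [
    ("nummod", "thousand", "four"),
    ("flat", "thousand", "hundred"),
    ("nummod", "hundred", "two"),
    ("flat", "thousand", "thirty"),
    ("flat", "thousand", "three"),
]


def flat_dependents_with_children(edges):
    """Return every edge whose head is itself attached via flat."""
    flat_deps = {dep for rel, _, dep in edges if rel == "flat"}
    return [edge for edge in edges if edge[1] in flat_deps]


VIOLATIONS = flat_dependents_with_children(EDGES)
```

For this tree the only edge flagged is nummod(hundred, two), which is exactly the combination questioned above.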

@amir-zeldes
Contributor

Note that in spoken language, where transcripts often just list words with spaces, things mean very different things:

  • "twenty five" = 25
  • "five twenty" = 5:20

So I think order is important, and I also don't think it lacks a head or hierarchy. For commutative addition we can just follow the normal conj guideline, placing the head on the left by convention.

@Stormur
Contributor

Stormur commented Jan 22, 2021

> @Stormur I thought flat dependents weren't supposed to have their own dependents? So flat(thousand,hundred) could not be combined with nummod(hundred,two).
>
> BTW the English guidelines currently say compound: https://universaldependencies.org/en/dep/compound.html

From a very practical point of view, it seems that the validator hasn't complained until now, so I would think it is allowed. I do not see it as impossible: the head elements might have no hierarchy, but be phrases by themselves... no?

> Note that in spoken language, where transcripts often just list words with spaces, things mean very different things:
>
> • "twenty five" = 25
> • "five twenty" = 5:20
>
> So I think order is important, and I also don't think it lacks a head or hierarchy. For commutative addition we can just follow the normal conj guideline, placing the head on the left by convention.

I would indeed treat these two cases differently! The first one is "additive" (and represents a single number), hence flat, the second is not (and I would probably choose conj, as it represents a combination)!

@dan-zeman dan-zeman modified the milestones: v2.8, v2.9 Jun 17, 2021
@dan-zeman dan-zeman modified the milestones: v2.9, v2.11 Jun 13, 2022
@msklvsk
Member

msklvsk commented Jul 25, 2022

I agree with @Stormur’s analysis. We just need to clarify this case for Slavic languages, where thousands and millions are usually analyzed as nouns: four thousands two_hundred thirty three dollars.
There should be an nmod(thousands, dollars), correct? Meaning thousands will mix flat dependents with an nmod dependent. The validator is silent on this.

@dan-zeman
Member

> There should be an nmod(thousands, dollars), correct? Meaning thousands will mix flat dependents with an nmod dependent. The validator is silent on this.

We have been attaching the numeral as a nummod of the counted noun, not vice versa. For three dollars (as well as three girls/houses/cats...), the analysis is nummod(dollars, three). For thousand dollars the analysis is analogous, provided that thousand is NUM and not NOUN: nummod(dollars, thousand). In Czech and some other languages, the situation is further complicated by the fact that some (but not all) numerals in some (but not all) situations force the counted noun into the genitive case. SUD folks might see this as a reason to make the numeral the head, but in UD we kept nummod and instead added a subtype nummod:gov to preserve the information that the numeral governs the case of the noun: nummod:gov(dolarů, tisíc). (The subtype is currently used in some 10 languages, mostly Slavic, including Ukrainian.)

If thousands is tagged NOUN, the situation is different. The criteria for doing so seem blurry to me, but I believe it should be NUM if it occurs as a part of a longer number.

msklvsk added a commit to mova-institute/zoloto that referenced this issue Jul 26, 2022
msklvsk added a commit to mova-institute/zoloto that referenced this issue Jul 27, 2022
msklvsk added a commit to UniversalDependencies/UD_Ukrainian-IU that referenced this issue Jul 27, 2022
@dan-zeman dan-zeman modified the milestones: v2.11, v2.13 May 29, 2023