ability to mark Russian sentences with stress marks (acute accents) #1872

Open
alanfgh opened this issue Apr 21, 2019 · 8 comments
Labels
enhancement Issue that describes a problem that requires a change in the current functionalities of Tatoeba.

Comments

@alanfgh
Contributor

alanfgh commented Apr 21, 2019

In Russian, the stress patterns are unpredictable for non-native speakers. Therefore, acute accents (U+0301) to indicate the stress are often helpful to Russian learners. For instance, on the Wiktionary page for the word положить, the unmarked form of the word can be seen at the top of the page, while the accented form положи́ть can be seen under the "Verb" section. It would be useful for some members of the Tatoeba community to be able to see stress marks. On the other hand, since they are not used in standard written Russian, they should only be made visible if the user so chooses.

Similarly, native speakers often omit the two dots over the letter ё (pronounced "yo"), writing it like е (pronounced "ye"). However, many Russian learners would like to be able to see the dots. (Note that ё and е have two distinct Unicode code points, namely U+0451 and U+0435, respectively.)

Is there a way to give users the option of seeing acute accents and ё? For instance, could we use the "transcription/script" feature to implement this, even though we're not talking about a complete script?
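To make the request concrete, here is a minimal sketch of the display toggle being asked for. It assumes sentences are stored with the stress marks (as combining U+0301) and with ё, and that the marks are stripped at display time according to user preferences. The function name `render_russian` and both option names are hypothetical, not part of Tatoeba's code.

```python
COMBINING_ACUTE = "\u0301"  # combining acute accent used as a stress mark

def render_russian(text: str, show_stress: bool = False, show_yo: bool = True) -> str:
    """Return the sentence as it would be shown under hypothetical user preferences."""
    if not show_stress:
        # Drop every stress mark; the base letters are left untouched.
        text = text.replace(COMBINING_ACUTE, "")
    if not show_yo:
        # Replace ё (U+0451) with е (U+0435), and Ё (U+0401) with Е (U+0415).
        text = text.replace("\u0451", "\u0435").replace("\u0401", "\u0415")
    return text

stored = "\u043f\u043e\u043b\u043e\u0436\u0438\u0301\u0442\u044c"  # положить with a stress mark on и
print(render_russian(stored))                    # stress mark hidden (the proposed default)
print(render_russian(stored, show_stress=True))  # stress mark shown
```

Hiding works by simple replacement, but the reverse does not: a sentence stored without stress marks or with е in place of ё cannot be restored this way, which is why the sketch stores the fully marked form.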

@jeanm

jeanm commented Apr 22, 2019

Interesting! There's a somewhat similar issue for Ligurian, which has optional diacritical marks that can be added to clarify the pronunciation. For example, the word oxello can be written as öxello to clarify the pronunciation of the first vowel, or even as öxéllo. These diacritics can be added to pretty much every word, and writers will decide whether to use them or not depending on the context (e.g. probably not in a text message, but almost certainly for a novel).

I'm currently writing all of my sentences with the optional diacritics, but I suppose it would be nice to be able to provide the less pedantic transcriptions too.

jiru added the enhancement label Apr 27, 2019
@jiru
Member

jiru commented Apr 27, 2019

> Is there a way to give users the option of seeing acute accents and ё? For instance, could we use the "transcription/script" feature to implement this, even though we're not talking about a complete script?

Maybe. Please refer to the wiki page new transcription request.

@jiru
Member

jiru commented Apr 27, 2019

> I'm currently writing all of my sentences with the optional diacritics, but I suppose it would be nice to be able to provide the less pedantic transcriptions too.

Please be aware that if you do so, nobody will be able to find your sentences when searching for the same words without diacritics. Therefore, I don’t think it’s a good idea.

@jeanm

jeanm commented Apr 29, 2019

> Please be aware that if you do so, nobody will be able to find your sentences when searching for the same words without diacritics. Therefore, I don’t think it’s a good idea.

My rationale was that, since Tatoeba is aimed at language learners, it made sense to have the most "instructive" transcription which informs readers of the pronunciation of every word. But you make a very good point, and I might get rid of the diacritics then.

(If I do that, though, we will have the opposite problem: people searching for words with diacritics will not find the sentences. Both ways of writing are considered equally valid.)

@jiru
Member

jiru commented May 1, 2019

> (If I do that, though, we will have the opposite problem: people searching for words with diacritics will not find the sentences. Both ways of writing are considered equally valid.)

@jeanm Fair enough. I just remembered something: diacritics are actually ignored in searches of Russian sentences (#666). Thus, searching for положить yields the same results as searching for положи́ть, so contributors can write sentences the way they prefer.

However, we do not have anything like that for Ligurian. If you’d like us to implement it, you can open a new issue and let us know which diacritics should be ignored on which letters in Ligurian.
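As an illustration of what "which diacritics should be ignored on which letters" could look like, here is a rough sketch of a per-language search key. The `IGNORED` table, the `search_key` function, and the Ligurian entries are assumptions made for this example only: the entries cover just the öxéllo example from this thread, not the real Ligurian orthography, and this is not how Tatoeba's search engine is actually configured.

```python
import unicodedata

ACUTE, DIAERESIS = "\u0301", "\u0308"  # combining acute accent, combining diaeresis

# Hypothetical rules: language code -> {base letter: marks to ignore on that letter}.
IGNORED = {
    "lij": {"o": {DIAERESIS}, "e": {ACUTE}},
}

def search_key(text: str, lang: str) -> str:
    """Fold ignorable diacritics so that both spellings produce the same key."""
    rules = IGNORED.get(lang, {})
    out = []
    base = ""  # most recent non-combining character
    # NFD splits precomposed letters into base letter + combining marks.
    for ch in unicodedata.normalize("NFD", text.lower()):
        if unicodedata.combining(ch):
            if ch in rules.get(base, ()):
                continue  # ignorable mark on this base letter: drop it
        else:
            base = ch
        out.append(ch)
    # Recompose whatever marks were kept.
    return unicodedata.normalize("NFC", "".join(out))

print(search_key("\u00f6x\u00e9llo", "lij") == search_key("oxello", "lij"))
```

Listing marks per base letter, rather than stripping every combining mark, keeps diacritics that distinguish letters from being folded away by accident.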

@soliloquist-tatoeba

@jiru

> Diacritics are actually ignored in searches of Russian sentences (#666). Thus, searching for положить yields the same results as searching for положи́ть, so contributors can write sentences the way they prefer.

Can you extend this flexibility to letters without diacritics, too?

https://www.fileformat.info/info/unicode/char/0647/index.htm

https://www.fileformat.info/info/unicode/char/06d5/index.htm

Can you make the search engine treat the letter U+06D5 as U+0647? They are actually the same letter used for different purposes. This is important for Ottoman Turkish. U+0647 is much more common, but it sometimes causes problems. To avoid that, people add spaces in the middle of words. U+06D5 overcomes this issue, but it's quite uncommon.

مسئله
مسئلە

Searching for these two should yield the same results, but currently it doesn't.

https://tatoeba.org/eng/sentences/search?query=%D9%85%D8%B3%D8%A6%D9%84%D9%87&from=ota&to=und

https://tatoeba.org/eng/sentences/search?query=%D9%85%D8%B3%D8%A6%D9%84%DB%95&from=ota&to=und
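The folding requested here could be sketched roughly as follows. This is only an illustration of the idea, applied to both the query and the indexed text; the names `OTA_FOLD` and `ota_search_key` are made up for this example and say nothing about how the search engine is actually configured.

```python
# Assumption: for search purposes only, ARABIC LETTER AE (U+06D5)
# is folded into ARABIC LETTER HEH (U+0647). Stored sentences are not modified.
OTA_FOLD = str.maketrans({"\u06d5": "\u0647"})

def ota_search_key(text: str) -> str:
    """Return the form of the text used for matching."""
    return text.translate(OTA_FOLD)

with_heh = "\u0645\u0633\u0626\u0644\u0647"  # the first spelling above, ending in U+0647
with_ae = "\u0645\u0633\u0626\u0644\u06d5"   # the second spelling above, ending in U+06D5

print(with_heh == with_ae)                                  # the raw strings differ
print(ota_search_key(with_heh) == ota_search_key(with_ae))  # the search keys match
```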

@alanfgh
Contributor Author

alanfgh commented May 1, 2019

@jeanm

> My rationale was that, since Tatoeba is aimed at language learners, it made sense to have the most "instructive" transcription which informs readers of the pronunciation of every word.

Even if the presence of extra markup (such as stress marks in Russian) does not interfere with searches, it does interfere with the experience of people using the site unless it can be turned on or off. Markup can be distracting, even for language learners, because it prevents them from being able to exercise their ability to fill in the missing information. But it's also distracting for people who are not in a language-learning mode (for instance, when they are adding sentences in their native language). For that reason, I would never add accent marks to Russian sentences unless I knew that they could be hidden (and hiding them should be the default behavior). With Ligurian, it may be the case that native speakers don't mind seeing the diacritics; I don't know.

@jiru

It was unclear to me from the "new transcription request" page whether a language has to have a transcription autogeneration feature in order to be approved for transcriptions. It was also unclear to me what interface that feature would need to support (or, to express it another way, how it would be called from Tatoeba).

@soliloquist-tatoeba

Yes, you can choose to treat two letters the same for the purposes of search, whether or not they have diacritics. In Hebrew, we do this for the letters that have final and non-final forms.

If you do deal with diacritics, you should consider both the combining and precomposed forms, where appropriate. For instance, á can be represented either as two characters (the letter "a" followed by the combining acute accent, U+0301) or as a single precomposed character (U+00E1). However, that's not the case with the letters you're talking about.
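A small sketch of the two representations, using Python's standard `unicodedata` module. The variable names are just for illustration.

```python
import unicodedata

precomposed = "\u00e1"  # á as a single code point
combining = "a\u0301"   # the letter a followed by the combining acute accent

# The two strings look identical on screen but compare as different.
print(precomposed == combining)

# Normalizing both sides to the same form (NFC or NFD) makes them comparable.
same_nfc = unicodedata.normalize("NFC", combining) == precomposed
same_nfd = unicodedata.normalize("NFD", precomposed) == combining
print(same_nfc, same_nfd)

# A stressed Russian и has no precomposed form, so it stays as
# two code points (и + U+0301) even after NFC normalization.
stressed_i = unicodedata.normalize("NFC", "\u0438\u0301")
print(len(stressed_i))
```

In practice this means a search or display feature should normalize text to one form before comparing strings or stripping marks.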

@soliloquist-tatoeba

@alanfgh

Thanks. I'll open a new issue for this.
