ability to mark Russian sentences with stress marks (acute accents) #1872

alanfgh · 2019-04-21T17:30:48Z

In Russian, the stress patterns are unpredictable for non-native speakers. Therefore, acute accents (U+0301) to indicate the stress are often helpful to Russian learners. For instance, on the Wiktionary page for the word положить, the unmarked form of the word can be seen at the top of the page, while the accented form положи́ть can be seen under the "Verb" section. It would be useful for some members of the Tatoeba community to be able to see stress marks. On the other hand, since they are not used in standard written Russian, they should only be made visible if the user so chooses.

Similarly, native speakers often omit the two dots over the letter ё (pronounced "yo"), writing it like е (pronounced "ye"). However, many Russian learners would like to be able to see the dots. (Note that ё and е have two distinct Unicode code points, namely U+0451 and U+0435, respectively.)

Is there a way to give users the option of seeing acute accents and ё? For instance, could we use the "transcription/script" feature to implement this, even though we're not talking about a complete script?
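As a rough illustration of the display option being requested, here is a minimal sketch in Python. It assumes sentences are stored with stress marks and ё, and that the plain form is derived at display time. The function `render` and its parameters `show_stress` and `show_yo` are hypothetical names, not part of Tatoeba.

```python
COMBINING_ACUTE = "\u0301"  # the stress mark discussed above

def render(annotated: str, show_stress: bool = False, show_yo: bool = False) -> str:
    """Return the sentence as a user would see it (hypothetical helper).

    Both options default to False, so the standard written form
    is what users get unless they opt in.
    """
    text = annotated
    if not show_stress:
        # Drop only the combining acute accent; letters such as й stay intact.
        text = text.replace(COMBINING_ACUTE, "")
    if not show_yo:
        # Map ё (U+0451) to е (U+0435), and Ё (U+0401) to Е (U+0415).
        text = text.replace("\u0451", "\u0435").replace("\u0401", "\u0415")
    return text

plain = render("положи\u0301ть")                       # stress mark hidden
marked = render("положи\u0301ть", show_stress=True)    # stress mark kept
```

Deriving the plain form from the annotated one is the easy direction; going the other way would need a dictionary or a human annotator, because stress cannot be computed from spelling alone.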

jeanm · 2019-04-22T01:27:27Z

Interesting! There's a somewhat similar issue for Ligurian, which has optional diacritical marks that can be added to clarify the pronunciation. For example, the word oxello can be written as öxello to clarify the pronunciation of the first vowel, or even as öxéllo. These diacritics can be added to pretty much every word, and writers will decide whether to use them or not depending on the context (e.g. probably not in a text message, but almost certainly for a novel).

I'm currently writing all of my sentences with the optional diacritics, but I suppose it would be nice to be able to provide the less pedantic transcriptions too.

jiru · 2019-04-27T02:54:32Z

> Is there a way to give users the option of seeing acute accents and ё? For instance, could we use the "transcription/script" feature to implement this, even though we're not talking about a complete script?

Maybe. Please refer to the wiki page "new transcription request".

jiru · 2019-04-27T02:59:07Z

> I'm currently writing all of my sentences with the optional diacritics, but I suppose it would be nice to be able to provide the less pedantic transcriptions too.

Please be aware that if you do so, nobody will be able to find your sentences when searching for the same words without diacritics. Therefore, I don’t think it’s a good idea.

jeanm · 2019-04-29T10:31:34Z

> Please be aware that if you do so, nobody will be able to find your sentences when searching for the same words without diacritics. Therefore, I don’t think it’s a good idea.

My rationale was that, since Tatoeba is aimed at language learners, it made sense to have the most "instructive" transcription which informs readers of the pronunciation of every word. But you make a very good point, and I might get rid of the diacritics then.

(Should I do it though, we would then have the opposite problem: people searching for words with diacritics would not find the sentences. Both ways of writing are considered equally valid.)

jiru · 2019-05-01T10:57:00Z

> (Should I do it though, we would then have the opposite problem: people searching for words with diacritics would not find the sentences. Both ways of writing are considered equally valid.)

@jeanm Fair enough. I just remembered something. Diacritics are actually ignored in searches of Russian sentences (#666). Thus, searching for положить yields the same results as searching for положи́ть, so contributors can write sentences the way they prefer.

However, we do not have anything like that for Ligurian. If you’d like us to implement it, you can open a new issue and let us know which diacritics should be ignored on which letters in Ligurian.
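The diacritic-insensitive matching described above can be sketched in a few lines of Python. This is not how Tatoeba's search engine is configured; it only illustrates the idea of comparing the query and the sentences after the same normalization. The names `search_key` and `search` are hypothetical.

```python
def search_key(text: str) -> str:
    """Normalize text for matching: drop stress marks and ignore case."""
    return text.replace("\u0301", "").casefold()

def search(query: str, sentences: list[str]) -> list[str]:
    """Return the sentences that contain the query, ignoring stress marks."""
    key = search_key(query)
    return [s for s in sentences if key in search_key(s)]

sentences = [
    "Куда\u0301 положи\u0301ть кни\u0301гу?",  # written with stress marks
    "Куда положить книгу?",                     # written without
]

# Both spellings of the query find both sentences.
with_marks = search("положи\u0301ть", sentences)
without_marks = search("положить", sentences)
```

Because the same function is applied to both sides, neither the contributor nor the searcher needs to know which spelling the other one used.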

soliloquist-tatoeba · 2019-05-01T21:31:22Z

@jiru

> Diacritics are actually ignored in searches of Russian sentences (#666). Thus, searching for положить yields the same results as searching for положи́ть, so contributors can write sentences the way they prefer.

Can you extend this flexibility to letters without diacritics, too?

https://www.fileformat.info/info/unicode/char/0647/index.htm

https://www.fileformat.info/info/unicode/char/06d5/index.htm

Can you make the search engine treat the letter U+06D5 as U+0647? They are actually the same letter used for different purposes. This is important for Ottoman Turkish. U+0647 is much more common, but it sometimes causes problems. To avoid that, people add spaces in the middle of words. U+06D5 overcomes this issue, but it's quite uncommon.

مسئله
مسئلە

Searching for these two should yield the same results, but it doesn't.

https://tatoeba.org/eng/sentences/search?query=%D9%85%D8%B3%D8%A6%D9%84%D9%87&from=ota&to=und

https://tatoeba.org/eng/sentences/search?query=%D9%85%D8%B3%D8%A6%D9%84%DB%95&from=ota&to=und
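The requested behavior amounts to folding one code point onto another before comparing. A minimal Python sketch follows; the table `OTTOMAN_FOLD` and the function `fold` are hypothetical names, and the two words are the ones from the search URLs above, spelled out with escapes so the difference in the last letter is visible.

```python
# Map U+06D5 (ARABIC LETTER AE) onto U+0647 (ARABIC LETTER HEH).
OTTOMAN_FOLD = str.maketrans({"\u06d5": "\u0647"})

def fold(text: str) -> str:
    """Replace each letter in the fold table with its search equivalent."""
    return text.translate(OTTOMAN_FOLD)

word_heh = "\u0645\u0633\u0626\u0644\u0647"  # ends in U+0647
word_ae = "\u0645\u0633\u0626\u0644\u06d5"   # ends in U+06D5

same_before = word_heh == word_ae              # the raw strings differ
same_after = fold(word_heh) == fold(word_ae)   # the folded strings match
```

Further pairs could be added to the same table, which keeps the per-language rules in one place.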

alanfgh · 2019-05-01T22:22:40Z

@jeanm

> My rationale was that, since Tatoeba is aimed at language learners, it made sense to have the most "instructive" transcription which informs readers of the pronunciation of every word.

Even if the presence of extra markup (such as stress marks in Russian) does not interfere with searches, it does interfere with the experience of people using the site unless it can be turned on or off. Markup can be distracting, even for language learners, because it prevents them from being able to exercise their ability to fill in the missing information. But it's also distracting for people who are not in a language-learning mode (for instance, when they are adding sentences in their native language). For that reason, I would never add accent marks to Russian sentences unless I knew that they could be hidden (and hiding them should be the default behavior). With Ligurian, it may be the case that native speakers don't mind seeing the diacritics; I don't know.

@jiru

It was unclear to me from the "new transcription request" page whether a language has to have a transcription autogeneration feature in order to be approved for transcriptions. It was also unclear to me what interface that feature would need to support (or, to express it another way, how it would be called from Tatoeba).

@soliloquist-tatoeba

Yes, you can choose to treat two letters the same for the purposes of search, whether or not they have diacritics. In Hebrew, we do this for the letters that have final and non-final forms.

If you do deal with diacritics, you should consider both the combining and precomposed forms, where appropriate. For instance, á can be represented either as two characters (the base letter "a" followed by the combining acute accent, U+0301) or as a single precomposed character (U+00E1). However, that's not the case with the letters you're talking about.
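To make the point about the two forms concrete, here is a small Python sketch using the standard `unicodedata` module. The helper `strip_acute` is a hypothetical name; it decomposes the text first so that the accent is removed regardless of which form the writer used.

```python
import unicodedata

composed = "\u00e1"     # single precomposed character
decomposed = "a\u0301"  # base letter followed by combining acute accent

# The two look identical on screen but are different strings.
differ_raw = composed != decomposed

# Normalization converts between the forms.
to_nfc = unicodedata.normalize("NFC", decomposed)  # one code point
to_nfd = unicodedata.normalize("NFD", composed)    # two code points

def strip_acute(text: str) -> str:
    """Remove acute accents whether they are combining or precomposed."""
    parts = unicodedata.normalize("NFD", text)   # split off all marks
    parts = parts.replace("\u0301", "")          # drop only the acute
    return unicodedata.normalize("NFC", parts)   # recompose what remains
```

Recomposing at the end matters for letters such as й and ё, which also decompose into a base letter plus a mark under NFD; because only U+0301 is dropped, they come back unchanged.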

soliloquist-tatoeba · 2019-05-02T19:20:40Z

@alanfgh

Thanks. I'll open a new issue for this.

jiru added the enhancement label Apr 27, 2019

soliloquist-tatoeba mentioned this issue May 2, 2019

Make the search engine treat the letters U+06D5 as U+0647 and U+06AD as U+0643 when searching in Ottoman Turkish #1880

Closed
