ability to mark Russian sentences with stress marks (acute accents) #1872
Comments
Interesting! There's a somewhat similar issue for Ligurian, which has optional diacritical marks that can be added to clarify the pronunciation. For example, the word oxello can be written as öxello to clarify the pronunciation of the first vowel, or even as öxéllo. These diacritics can be added to pretty much every word, and writers will decide whether to use them or not depending on the context (e.g. probably not in a text message, but almost certainly for a novel). I'm currently writing all of my sentences with the optional diacritics, but I suppose it would be nice to be able to provide the less pedantic transcriptions too.
Maybe. Please refer to the wiki page new transcription request.
Please be aware that if you do so, nobody will be able to find your sentences when searching for the same words without diacritics. Therefore, I don’t think it’s a good idea.
My rationale was that, since Tatoeba is aimed at language learners, it made sense to have the most "instructive" transcription which informs readers of the pronunciation of every word. But you make a very good point, and I might get rid of the diacritics then. (Should I do it though, we would then have the opposite problem: people searching for words with diacritics would not find the sentences. Both ways of writing are considered equally valid.)
@jeanm Fair enough. I just remembered something. Diacritics are actually ignored in searches of Russian sentences (#666), so searching for положить yields the same results as searching for положи́ть, and contributors can write sentences the way they prefer. However, we do not have anything like that for Ligurian. If you’d like us to implement it for Ligurian, you can open a new issue and let us know which diacritics should be ignored on which letters.
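To illustrate the behavior described above, here is a minimal Python sketch of stress-insensitive matching. It is not Tatoeba's actual search code; the names `search_key` and `find` and the sample sentences are made up for this example, and it assumes stress is always written with the combining acute accent (U+0301).

```python
COMBINING_ACUTE = "\u0301"  # the stress mark discussed in this thread


def search_key(text: str) -> str:
    """Drop stress marks and case so marked and unmarked spellings compare equal."""
    return text.replace(COMBINING_ACUTE, "").casefold()


def find(query: str, sentences: list[str]) -> list[str]:
    """Return the sentences containing the query, ignoring stress marks."""
    key = search_key(query)
    return [s for s in sentences if key in search_key(s)]


# Hypothetical sample data: one sentence with a stress mark, one without.
sentences = ["Куда положи\u0301ть книгу?", "Куда положить ключи?"]

# Both queries match both sentences.
print(find("положить", sentences))
print(find("положи\u0301ть", sentences))
```

Because the same key function is applied to the query and to the stored text, it does not matter which side carries the marks.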
Can you extend this flexibility to letters without diacritics, too? https://www.fileformat.info/info/unicode/char/0647/index.htm https://www.fileformat.info/info/unicode/char/06d5/index.htm Can you make the search engine treat the letter U+06D5 the same as U+0647? They are actually the same letter used for different purposes. This is important for Ottoman Turkish. U+0647 is much more common, but it sometimes causes problems. To avoid them, people add spaces in the middle of words. U+06D5 overcomes this issue, but it's quite uncommon. مسئله Searching for these two should yield the same results, but it doesn't. https://tatoeba.org/eng/sentences/search?query=%D9%85%D8%B3%D8%A6%D9%84%D9%87&from=ota&to=und https://tatoeba.org/eng/sentences/search?query=%D9%85%D8%B3%D8%A6%D9%84%DB%95&from=ota&to=und
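The request above amounts to folding one code point onto another before comparing. A rough Python sketch follows, assuming a simple one-to-one mapping is enough; the table `OTA_FOLD` and the function `fold_ota` are hypothetical names, not part of Tatoeba. The two strings are the decoded queries from the two search URLs above.

```python
# Hypothetical folding table: treat ARABIC LETTER AE (U+06D5)
# as ARABIC LETTER HEH (U+0647) for search purposes only.
OTA_FOLD = str.maketrans({"\u06d5": "\u0647"})


def fold_ota(text: str) -> str:
    """Map U+06D5 to U+0647 so both spellings produce the same search key."""
    return text.translate(OTA_FOLD)


# The same word, ending in U+0647 and in U+06D5 respectively.
with_heh = "\u0645\u0633\u0626\u0644\u0647"
with_ae = "\u0645\u0633\u0626\u0644\u06d5"

print(with_heh == with_ae)                      # raw strings differ
print(fold_ota(with_heh) == fold_ota(with_ae))  # folded keys are equal
```

Only the search key is folded here; the stored sentence would keep whichever letter the contributor typed.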
Even if the presence of extra markup (such as stress marks in Russian) does not interfere with searches, it does interfere with the experience of people using the site unless it can be turned on or off. Markup can be distracting, even for language learners, because it prevents them from exercising their ability to fill in the missing information. But it's also distracting for people who are not in a language-learning mode (for instance, when they are adding sentences in their native language). For that reason, I would never add accent marks to Russian sentences unless I knew that they could be hidden (and hiding them should be the default behavior). With Ligurian, it may be the case that native speakers don't mind seeing the diacritics; I don't know.
It was unclear to me from the "new transcription request" page whether a language has to have a transcription autogeneration feature in order to be approved for transcriptions. It was also unclear to me what interface that feature would need to support (or, to express it another way, how it would be called from Tatoeba).
Yes, you can choose to treat two letters the same for the purposes of search, whether or not they have diacritics. In Hebrew, we do this for the letters that have final and non-final forms. If you do deal with diacritics, you should consider both the combining and precomposed forms, where appropriate. For instance, á can be represented either as two characters, the letter "a" followed by a combining acute accent (U+0301), or as a single precomposed character (U+00E1). However, that's not the case with the letters you're talking about.
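The point about combining versus precomposed forms can be shown with Python's standard `unicodedata` module. This is only a sketch: the helper `strip_marks` is a hypothetical name, and it removes every combining mark, which suits the Ligurian example from earlier in the thread but would be too aggressive for Russian (it would also turn й into и and ё into е).

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose the text (NFD), then drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


composed = "\u00e1"     # single precomposed character
decomposed = "a\u0301"  # letter "a" followed by a combining acute accent

print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# Both spellings of the Ligurian word reduce to the same plain form.
print(strip_marks("\u00f6x\u00e9llo"))    # oxello
print(strip_marks("o\u0308xe\u0301llo"))  # oxello
```

Normalizing first means the comparison works no matter which form a contributor's keyboard produced.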
Thanks. I'll open a new issue for this.
In Russian, the stress patterns are unpredictable for non-native speakers. Therefore, acute accents (U+0301) to indicate the stress are often helpful to Russian learners. For instance, on the Wiktionary page for the word положить, the unmarked form of the word can be seen at the top of the page, while the accented form положи́ть can be seen under the "Verb" section. It would be useful for some members of the Tatoeba community to be able to see stress marks. On the other hand, since they are not used in standard written Russian, they should only be made visible if the user so chooses.
Similarly, native speakers often omit the two dots over the letter ё (pronounced "yo"), writing it like е (pronounced "ye"). However, many Russian learners would like to be able to see the dots. (Note that ё and е have two distinct Unicode code points, namely U+0451 and U+0435, respectively.)
Is there a way to give users the option of seeing acute accents and ё? For instance, could we use the "transcription/script" feature to implement this, even though we're not talking about a complete script?
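One way to picture the option being asked for: store the fully marked sentence and derive the plain form at display time. The sketch below assumes exactly that; the function `render` and its parameters `show_stress` and `show_yo` are hypothetical and do not refer to any existing Tatoeba feature. It also assumes ё is stored as the single code point U+0451 (or U+0401 for the capital), not in decomposed form.

```python
COMBINING_ACUTE = "\u0301"
# Map ё/Ё (U+0451/U+0401) to е/Е (U+0435/U+0415).
YO_TO_YE = str.maketrans({"\u0451": "\u0435", "\u0401": "\u0415"})


def render(text: str, show_stress: bool = False, show_yo: bool = False) -> str:
    """Return the text as it would be displayed under the given preferences.

    Both options default to off, so the unmarked form is shown by default.
    """
    if not show_stress:
        text = text.replace(COMBINING_ACUTE, "")
    if not show_yo:
        text = text.translate(YO_TO_YE)
    return text


marked = "Он положи\u0301л ёлку."  # hypothetical stored sentence

print(render(marked))                                  # Он положил елку.
print(render(marked, show_yo=True))                    # Он положил ёлку.
print(render(marked, show_stress=True, show_yo=True))  # unchanged
```

Going in this direction is a plain string operation. The reverse, adding stress marks or restoring ё in unmarked text, would need a dictionary or human annotation, which is why the sketch treats the marked form as the stored one.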