Allow edition of simplified/traditional scripts in Chinese sentences #2189

Closed
wants to merge 3 commits into from

Conversation

jiru
Member

@jiru jiru commented Mar 9, 2020

This is a PR for @Yorwba 😄

It solves #2007 by making the autogenerated simplified/traditional scripts in Chinese sentences editable.

I initially believed that there weren’t substantial errors in the conversion. I wonder if we should mark converted scripts with the warning icon just like the Pinyin has. What’s the error rate of the converted script?

@Yorwba
Contributor

Yorwba commented Mar 9, 2020

Thanks for addressing this problem. I tried it out quickly and I like that you added some basic sanity checks to keep the sentences aligned.

The error rate is overall relatively low, but it gets some characters consistently wrong. 著/着 is one example and probably the most common, but recently I also noticed e.g. 餘, which is turned into 馀 instead of the more likely 余. (Not wrong, but surprising.)

That's why my suggestion in #2007 was to check whether the sentence contains an ambiguous character and only allow editing in that case.

@jiru
Member Author

jiru commented Mar 10, 2020

I see. Checking against a list of ambiguous characters is a good idea; maybe we could do that for Japanese too.

By the way, if you think there is anything you can do to improve the conversion script, please have a look at sinoparserd. It parses Chinese using a rather naive algorithm based on a list of known words.

@Yorwba
Contributor

Yorwba commented Mar 11, 2020

I think it's hard to do much better than the greedy prefix-matching used by sinoparserd. A probabilistic tokenizer like Jieba would be better at dealing with garden-path sentences where the prefix is misleading, but those are rare. The only instance I've noticed on Tatoeba is 不知道, which gets split as 不知/道 instead of 不/知道, but that's the fault of mandarin.xml listing negated forms of some verbs but not others, and it only affects how the pinyin is split into words, not the traditional/simplified transliteration.
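To illustrate the 不知道 case, here is a minimal Python sketch of greedy longest-prefix matching. It is not sinoparserd's actual code; the function name `greedy_tokenize` and the four-entry word list are made up for the example, assuming a dictionary that lists 不知 and 知道 but not 不知道.

```python
def greedy_tokenize(text, words):
    """Split text by always taking the longest dictionary word at the current position."""
    max_len = max(len(w) for w in words) if words else 1
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted so the loop makes progress.
            if length == 1 or candidate in words:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical word list: the negated form 不知 is listed, 不知道 is not.
WORDS = {"不", "不知", "知道", "道"}
print(greedy_tokenize("不知道", WORDS))
```

Because the prefix 不知 is matched first, the remaining 道 is left on its own, which is the garden-path behaviour described above.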

The actual problem here is that some single-character words have multiple traditional/simplified forms depending on what they mean, and the transliteration chosen is not based on which is more likely, but probably implicitly on the order in which they're listed in mandarin.xml instead. I'll check whether reordering the two lines is enough to improve the situation a bit.

While grepping for 着 in mandarin.xml, I noticed a different problem, which is that it's sometimes completely dropped from the pinyin, e.g. 冒着烟 is treated as if it were just 冒烟 and transcribed as "mao4yan1". I'm not sure how the file was created, but I suspect it was from a dictionary that included 冒着烟 as an entry with a "see also" link to 冒烟 and the pinyin was incorrectly taken from that.

The license section of the README doesn't exactly inspire confidence that it will be possible to regenerate the file with updates the original source received in the meantime:

> All the source code is licensed under GPLv3, the xml files are under their own license, it's a "open one" but i need to check which one, certainly CC-BY-SA
> so for the moment I would recommend people to use their own data files for "public usage" and use the provided xml only for "test" purpose.

I want to try using CC-CEDICT as a replacement. It has fewer entries (118121 lines vs. mandarin.xml's 133002), but at the same time it seems to have better coverage of relevant words (e.g. mandarin.xml doesn't contain 週 at all). Do you have any preference for the language the conversion script should be written in? I would instinctively reach for Python, but most of the support scripts for Tatoeba are written in PHP, so maybe that would be better from a maintenance perspective.

@jiru
Member Author

jiru commented Mar 12, 2020

Thanks a million for the extensive research, Yorwba! 😸

About the license of the files used by sinoparserd, you can ask the original author (allan-simon on Github or sysko on Tatoeba).

You are more than welcome to write a replacement for the Chinese parser because nobody is really able to maintain sinoparserd. I don’t know about the quality of CC-CEDICT, but it looks like it’s quite active, so at least it’s a healthy choice. 👍

I’m okay with any language as long as the code has unit tests. 😉 About the execution speed, it should be fast enough so that resetting all the autogenerated Pinyin and simplified/traditional scripts using cake transcriptions autogen cmn on production doesn’t take ages.

As for the language, I think Python is superior because of its richer ecosystem and ease of use. I understand Python to some extent, and I believe there are and will be enough Pythonists out there so that maintenance won’t be so much of a problem. I myself wrote a replacement for the Japanese parser nihongoparserd in Python too.

If you go for Python, please write code that can be executed by both Python 2 and 3. This is because the production server uses Python 2 as the default interpreter (so most of the installed libraries are for Python 2 too), but it will change to Python 3 as soon as we upgrade the OS. For now, as long as it works on Imouto, it should work on production too. Also make sure to check which of the dependencies of your new tool can be installed using apt, or, if they’re not packaged, using pip, so that I can write an ansible role to include them in Imouto.

@jiru
Member Author

jiru commented May 13, 2020

@Yorwba Now that Tatoeba/sinoparserd#2 is merged, do you think it’s still worth allowing manual editing of simplified/traditional scripts? Are there still cases where sinoparserd gets it wrong?

> That's why my suggestion in #2007 was to check whether the sentence contains an ambiguous character and only allow editing in that case.

If we are to go this way, what would be the list?

@Yorwba
Contributor

Yorwba commented May 14, 2020

The new dictionary fixes the most common problems, including the example sentence that made me open issue #2007. However, there are still edge cases like simplified 干, which corresponds to either 幹 gàn if it means "to do" or 乾 gān if it means "dry". (To make matters worse, there's also the surname 乾 Qián, which is the same in simplified and traditional script.) Both usages occur somewhat frequently in the corpus, e.g. 干的好!(Well done!) and 汤姆拧干毛巾挂起来晾干。(Tom wrung out the towel and hung it up to dry.). In the second sentence, 干/乾 occurs twice, but only the second occurrence is transliterated correctly, because it's part of a two-character word that disambiguates it.

So the problem got smaller, but it didn't disappear completely.
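The effect can be reproduced with a toy converter. The following Python sketch is not sinoparserd's implementation; `convert` and `TOY_MAPPING` are hypothetical, and the mapping assumes the dictionary knows the two-character word 晾干 but not 拧干, with 幹 as the fallback for a lone 干.

```python
def convert(text, mapping):
    """Greedy longest-match conversion; characters without an entry pass through unchanged."""
    max_len = max(len(key) for key in mapping)
    out = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in mapping:
                out.append(mapping[chunk])
                i += length
                break
        else:
            # No dictionary entry starts here: copy the character as-is.
            out.append(text[i])
            i += 1
    return "".join(out)


# Hypothetical simplified -> traditional entries, just enough for the two examples.
TOY_MAPPING = {
    "汤姆": "湯姆",
    "拧": "擰",
    "挂": "掛",
    "起来": "起來",
    "晾干": "晾乾",  # the two-character word disambiguates 干
    "干": "幹",      # single-character fallback, wrong for "dry"
}

print(convert("干的好!", TOY_MAPPING))
print(convert("汤姆拧干毛巾挂起来晾干。", TOY_MAPPING))
```

With this mapping the first 干 in the towel sentence falls back to 幹 while the second is resolved through 晾干. Adding an entry for 拧干 would fix this sentence, but not the general case.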

> That's why my suggestion in #2007 was to check whether the sentence contains an ambiguous character and only allow editing in that case.

> If we are to go this way, what would be the list?

I think the Wikipedia article on ambiguities in Chinese character simplification contains the complete list. So that would be

  • the simplified characters 板杯辟表别卜布才彩参冲虫丑仇出村粗酬当党淀吊冬发范丰谷雇刮广哄后伙获几机饥迹奸姜借尽据卷克困夸罗累厘漓梁了霉弥蔑么麽苹仆铺朴签确舍沈胜术松他叹坛你体同涂团喂为纤咸弦绣须熏腌叶佣涌游于余吁郁欲御愿岳云赞脏扎占折征证志制致钟种周注准冢庄涩蚕忏吨赶构柜怀坏极茧家价洁惊腊蜡帘怜岭扑秋千确扰洒晒适听洼网旋踊优症朱荐离卤气圣万与摆虮篱泞恶托咽线曲升苏系尝胡划回汇里历袅向只它并采厂干蒙面复台斗
  • the traditional characters 著兒乾夥藉瞭麼餘摺徵鯰瀋鹼
  • characters which are only ambiguous if it's unknown whether the script is simplified or traditional 苧苎

Since the script detection isn't perfect, it would probably be best to simply check if any of those characters occur in the original text.
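A check along those lines could look roughly like the Python sketch below. The names `AMBIGUOUS_CHARS`, `ambiguous_positions`, and `is_editable` are hypothetical, and the set holds only a small excerpt of the characters listed above; a real implementation would load the full list.

```python
# Excerpt only: a few of the simplified and traditional characters from the list above.
AMBIGUOUS_CHARS = frozenset("干发后里面" "著乾餘")


def ambiguous_positions(text, ambiguous=AMBIGUOUS_CHARS):
    """Return (index, character) pairs for every ambiguous character in text."""
    return [(i, ch) for i, ch in enumerate(text) if ch in ambiguous]


def is_editable(text, ambiguous=AMBIGUOUS_CHARS):
    """Allow editing only if the original text contains an ambiguous character."""
    return bool(ambiguous_positions(text, ambiguous))


print(is_editable("干的好!"))
print(ambiguous_positions("汤姆拧干毛巾挂起来晾干。"))
```

Checking the original text against one combined set avoids depending on script detection, at the cost of occasionally flagging a sentence whose conversion is already correct.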

@jiru
Member Author

jiru commented May 15, 2020

Thanks for the clarification and the list of characters! 👍 I will implement that when I have time.

@wangchou

Did anyone try https://github.com/BYVoid/OpenCC (4.7k stars, Apache-2.0)?

I tried a few cases, it looks great.
It does not only convert characters, it also converts phrases. (ex: 打印机 -> 印表機)

In the last month, there were 9 pull requests.
So it is active, too.

I really wish there were both simplified and traditional Chinese versions of each sentence.

@Yorwba
Contributor

Yorwba commented May 16, 2020

@wangchou Thank you for the recommendation. I tried the two sentences I linked above and converted them using the online demo:

  • 干的好! gets converted to 乾的好! (our current transcription 幹的好! is correct)
  • 汤姆拧干毛巾挂起来晾干。 gets converted to 湯姆擰乾毛巾掛起來晾乾。(our current transcription 湯姆擰幹毛巾掛起來晾乾。 is incorrect)

Based on only those two examples it's not clear which is better, but I guess OpenCC is likely to be better considering how widely it's being used.
However, it doesn't support pinyin, so we won't be able to completely replace our current transcription engine with it. (Which takes care of both pinyin and simplified/traditional conversion at the same time.)

> It does not only convert characters, it also converts phrases. (ex: 打印机 -> 印表機)

I think that's probably not a good idea, since it doesn't just change the script, but also the vocabulary used in the sentence. It would be like having a British English/American English converter that replaces all occurrences of "flat" with "apartment"...

> I really wish there were both simplified and traditional Chinese versions of each sentence.

We already automatically convert simplified Chinese sentences into traditional characters and vice versa. That alternative version is displayed in grey below the original sentence. Does it not show up for you or is there some other problem with it?

@wangchou

wangchou commented May 16, 2020

@Yorwba

Thanks for the quick response.

Sorry, I only tried the offline version (800 MB of all sentences).
I found that simplified and traditional Chinese sentences are mixed under a single cmn tag.

I plan to use some sentences in my app,
and my target audience is people from Taiwan.
Without phrase conversion, readability is awful for my audience.
That's why I found OpenCC yesterday.

I selected a set of 25,577 Chinese sentences (all with jpn and eng translations).
Using OpenCC plus human review, the conversion and checking will take about 100 hours.
And there is no way to contribute it back to Tatoeba, because for some sentences I will only change the phrases.

It's sad. 😭

@Yorwba
Contributor

Yorwba commented May 16, 2020

The alternative script versions of each sentence are contained in the transcriptions.tar.bz2 file on the downloads page. If the original sentence is detected as using traditional script, it will have a transcription in Hans (simplified) script; if the original sentence uses simplified characters, the transcription will be in Hant (traditional) script.
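For someone who wants a traditional-script version of every sentence, that rule could be applied roughly as in this Python sketch. The helper `traditional_text` is hypothetical, and it assumes the transcriptions for one sentence have already been loaded as (script, text) pairs; sentences without a Hant transcription are assumed to be traditional already.

```python
def traditional_text(original, transcriptions):
    """Pick the traditional-script form of a sentence.

    transcriptions: iterable of (script, text) pairs for this sentence,
    e.g. ("Hant", ...) for a script conversion or ("Latn", ...) for pinyin.
    """
    for script, text in transcriptions:
        if script == "Hant":
            # The original was detected as simplified, so use the conversion.
            return text
    # No Hant transcription: assume the original is already in traditional script.
    return original


# The sample sentence is taken from the discussion above.
print(traditional_text("干的好!", [("Latn", "gan4 de5 hao3"), ("Hant", "幹的好!")]))
print(traditional_text("幹的好!", [("Hans", "干的好!")]))
```

Swapping "Hant" for "Hans" gives the simplified counterpart in the same way.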

@jiru
Member Author

jiru commented May 17, 2020

> It does not only convert characters, it also converts phrases. (ex: 打印机 -> 印表機)

While such conversion seems appealing, I don’t think this is something that we want to use in the script conversion tool used on Tatoeba. My knowledge of Chinese is basic, but shouldn’t a sentence using 打印机 be treated as a separate sentence from one using 印表機 (and tagged appropriately)? Isn’t it similar to American English vs. British English?

@wangchou

@jiru The scale is different. We create different phrases by translation.

For example,

  1. Name
    Trump, 特朗普(China), 川普(Taiwan)

  2. Location name
    New Zealand, 新西兰(China), 紐西蘭(Taiwan)
    Qatar, 卡塔尔(China), 卡達(Taiwan)

  3. Tech terms
    Computer, 计算机(China), 電腦(Taiwan)
    Printer, 打印机(China), 印表機(Taiwan)
    Mouse cursor, 鼠标光标(China), 滑鼠游標(Taiwan)

  4. Titles of song, movie, game and book
    The Lord of the Rings, 指环王(China), 魔戒(Taiwan)

Most top research is published in English,
so new scientific terms are the same in American English and British English.
But some of their translations differ between China, Hong Kong, and Taiwan.

@kerrickstaley

Hey @jiru !

A couple of questions about this PR:

  • Is there a way for editors to just say "yes, this automatic transcription is correct" and have that fact recorded (as opposed to "this automatic transcription has not been reviewed")?
  • Can human-reviewed transcriptions be exported to sentences.tar.bz2 on the downloads page?

(I'm not familiar with PHP or the Tatoeba codebase, otherwise I'd try to answer these questions myself, sorry!).

@trang
Member

trang commented May 19, 2020

Hello @kerrickstaley,

> Is there a way for editors to just say "yes, this automatic transcription is correct" and have that fact recorded (as opposed to "this automatic transcription has not been reviewed")?

Yes, there's a way to review transcriptions from the UI, but this applies only to transcriptions that are editable. At the moment, only furigana and pinyin are editable.

> Can human-reviewed transcriptions be exported to sentences.tar.bz2 on the downloads page?

This information can be provided in transcriptions.tar.bz2, not in sentences.tar.bz2. That's because a sentence can have more than one type of transcription. That's the case for Mandarin Chinese, which has pinyin and the simplified/traditional script.

The most straightforward solution would be to add a needsReview field to the transcriptions file. Feel free to open an issue about this if you need this information in our exported files.

@jiru
Member Author

jiru commented May 20, 2020

@wangchou Thank you for the clarification. I think that such differences in writing style should go into different sentences, possibly tagged as "Taiwanese Mandarin", "Standard Mandarin" or similar. That’s the way we typically deal with this on Tatoeba. I realize it could be seen as a bit weird to have a sentence featuring a Taiwanese Mandarin word transcribed to simplified characters. But I think it is still useful for learners or people who can read only simplified or only traditional.

@jiru
Member Author

jiru commented May 20, 2020

@kerrickstaley In addition to Trang’s answer, I’d like to mention that in the file transcriptions.tar.bz2, there is a username field that is empty if the transcription is unreviewed, or contains a username if the transcription has been reviewed.
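As a rough illustration, the reviewed/unreviewed distinction could be read from the extracted file like this in Python. The column order (sentence id, language, script, username, transcription, tab-separated) is an assumption that should be checked against the downloads page, and the sample rows, including the IDs and the username, are invented.

```python
from collections import namedtuple

# Assumed column layout of one tab-separated line in the transcriptions file.
Transcription = namedtuple("Transcription", "sentence_id lang script username text")


def parse_transcription(line):
    """Parse one line; the transcription itself is the last field."""
    sentence_id, lang, script, username, text = line.rstrip("\n").split("\t", 4)
    return Transcription(int(sentence_id), lang, script, username, text)


def is_reviewed(transcription):
    """An empty username means the transcription is autogenerated and unreviewed."""
    return transcription.username != ""


# Invented sample rows, for illustration only.
rows = [
    "1\tcmn\tHant\t\t幹的好!\n",
    "1\tcmn\tLatn\tsome_user\tGàn de hǎo!\n",
]
for row in rows:
    t = parse_transcription(row)
    print(t.script, is_reviewed(t))
```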

@jiru
Member Author

jiru commented Jul 9, 2020

Closing this PR because I don’t have plans to work on it any time soon. I may reopen it if I do. Just don’t delete my branch.

@jiru jiru closed this Jul 9, 2020

5 participants