Allow edition of simplified/traditional scripts in Chinese sentences #2189
Thanks for addressing this problem. I tried it out quickly and I like that you added some basic sanity checks to keep the sentences aligned. The error rate is overall relatively low, but it gets some characters consistently wrong. 著/着 is one example and probably the most common, but recently I also noticed e.g. 餘, which is turned into 馀 instead of the more likely 余. (Not wrong, but surprising.) That's why my suggestion in #2007 was to check whether the sentence contains an ambiguous character and only allow editing in that case.
I see. Checking against a list of ambiguous characters is a good idea; maybe we could do that for Japanese too. By the way, if you think there is anything you can do to improve the conversion script, please have a look at sinoparserd. It parses Chinese using a rather naive algorithm based on a list of known words.
I think it's hard to do much better than the greedy prefix-matching used by sinoparserd. A probabilistic tokenizer like Jieba would be better at dealing with garden-path sentences where the prefix is misleading, but those are rare. The only instance I've noticed on Tatoeba is 不知道, which gets split as 不知/道 instead of 不/知道, but that's the fault of The actual problem here is that some single-character words have multiple traditional/simplified forms depending on what they mean, and the transliteration chosen is not based on which is more likely, but probably implicitly on the order in which they're listed in While grepping for 着 in The license section of the README doesn't exactly inspire confidence that it will be possible to regenerate the file with updates the original source received in the meantime:
I want to try using CC-CEDICT as a replacement. It has fewer entries (118121 lines vs. |
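For illustration, the greedy prefix-matching mentioned above can be sketched as follows. This is a minimal sketch, not sinoparserd's actual code: the function name `greedy_segment` and the tiny word list are hypothetical, chosen only to reproduce the 不知/道 split described in this comment.

```python
def greedy_segment(text, words):
    """Split text by always taking the longest known word at the current position."""
    max_len = max((len(w) for w in words), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback token.
            if length == 1 or candidate in words:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical word list: it contains 不知 and 知道 but not 不知道.
WORDS = {"不", "不知", "知道", "道"}
print(greedy_segment("不知道", WORDS))
```

Because 不知 is the longest match at the start of the string, the matcher commits to it and never gets the chance to consider 不/知道. A probabilistic tokenizer would weigh both segmentations instead of committing to the first prefix.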
Thanks a million for the extensive research, Yorwba! 😸 About the license of the files used by sinoparserd, you can ask the original author (allan-simon on Github or sysko on Tatoeba). You are more than welcome to write a replacement for the Chinese parser because nobody is really able to maintain sinoparserd. I don’t know about the quality of CC-CEDICT, but it looks like it’s quite active, so at least it’s a healthy choice. 👍 I’m okay with any language as long as the code has unit tests. 😉 About the execution speed, it should be fast enough so that resetting all the autogenerated Pinyin and simplified/traditional scripts using As for the language, I think Python is superior because of its richer ecosystem and ease of use. I understand Python to some extent, and I believe there are and will be enough Pythonists out there so that maintenance won’t be much of a problem. I myself wrote a replacement for the Japanese parser nihongoparserd in Python too. If you go for Python, please write code that can be executed by both Python 2 and 3. This is because the production server uses Python 2 as the default interpreter (so most of the installed libraries are in Python 2 too), but it will change to Python 3 as soon as we upgrade the OS. For now, as long as it works on Imouto, it should work on production too. Also make sure to check which of the dependencies of your new tool can be installed using apt, or, if they are not packaged, using pip, so that I can write an ansible role to include them in Imouto.
@Yorwba Now that Tatoeba/sinoparserd#2 is merged, do you think it’s still worth allowing manual edition of simplified/traditional scripts? Are there still cases where sinoparserd gets it wrong?
If we are to go this way, what would be the list?
The new dictionary fixes the most common problems, including the example sentence that made me open issue #2007. However, there are still edge cases like simplified 干, which corresponds to either 幹 gàn if it means "to do" or 乾 gān if it means "dry". (To make matters worse, there's also the surname 乾 Qián, which is the same in simplified and traditional script.) Both usages occur somewhat frequently in the corpus, e.g. 干的好!(Well done!) and 汤姆拧干毛巾挂起来晾干。(Tom wrung out the towel and hung it up to dry.). In the second sentence, 干/乾 occurs twice, but only the second occurrence is transliterated correctly, because it's part of a two-character word that disambiguates it. So the problem got smaller, but it didn't disappear completely.
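To make the 干 example concrete, here is a minimal sketch of word-first conversion with a per-character fallback. The tables `WORDS` and `CHARS` are hypothetical and only cover this one sentence; in particular, the fallback 干 → 幹 is an assumption made for illustration, not a statement about what the real dictionary contains.

```python
# Hypothetical conversion tables for illustration only; not sinoparserd's data.
WORDS = {"晾干": "晾乾"}  # a multi-character word that disambiguates 干
CHARS = {"汤": "湯", "拧": "擰", "挂": "掛", "来": "來", "干": "幹"}  # assumed fallbacks


def to_traditional(text):
    """Convert known multi-character words first, then fall back to single characters."""
    longest = max(len(w) for w in WORDS)
    out = []
    i = 0
    while i < len(text):
        # Look for a known word of at least two characters starting at i.
        for length in range(min(longest, len(text) - i), 1, -1):
            chunk = text[i:i + length]
            if chunk in WORDS:
                out.append(WORDS[chunk])
                i += length
                break
        else:
            # No word matched: convert one character with the fixed fallback.
            out.append(CHARS.get(text[i], text[i]))
            i += 1
    return "".join(out)


print(to_traditional("汤姆拧干毛巾挂起来晾干。"))
```

Under these assumed tables the result is 湯姆擰幹毛巾掛起來晾乾。: the second 干 is covered by the word entry 晾干, while the first one in 拧干 has no word entry and receives the fixed fallback regardless of meaning.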
I think the Wikipedia article on ambiguities in Chinese character simplification contains the complete list. So that would be
Since the script detection isn't perfect, it would probably be best to simply check if any of those characters occur in the original text.
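The check suggested here could look roughly like the sketch below. The set `AMBIGUOUS_CHARS` is a placeholder containing only characters mentioned in this thread; the real list would come from the Wikipedia article referenced above. The function name `needs_manual_review` is likewise hypothetical.

```python
# Placeholder subset taken from characters discussed in this thread;
# the full list would be built from the referenced Wikipedia article.
AMBIGUOUS_CHARS = frozenset("干著着餘")


def needs_manual_review(sentence):
    """Return True if the original text contains any character from the ambiguous set."""
    return any(ch in AMBIGUOUS_CHARS for ch in sentence)


print(needs_manual_review("汤姆拧干毛巾挂起来晾干。"))  # contains 干
print(needs_manual_review("你好。"))  # contains none of the listed characters
```

Testing the original text directly avoids depending on script detection: the same set can hold both simplified and traditional characters, so the check works whichever script the sentence was written in.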
Thanks for the clarification and the list of characters! 👍 I will implement that when I have time.
Did anyone try https://github.com/BYVoid/OpenCC (4.7k stars, Apache-2.0)? I tried a few cases and it looks great. In the last month, there were 9 pull requests. I really wish there were both simplified and traditional Chinese versions of each sentence.
@wangchou Thank you for the recommendation. I tried the two sentences I linked above and converted them using the online demo
Based on only those two examples it's not clear which is better, but I guess OpenCC is likely to be better considering how widely it's being used.
I think that's probably not a good idea, since it doesn't just change the script, but also the vocabulary used in the sentence. It would be like having a British English/American English converter that replaces all occurrences of "flat" with "apartment"...
We already automatically convert simplified Chinese sentences into traditional characters and vice versa. That alternative version is displayed in grey below the original sentence. Does it not show up for you or is there some other problem with it?
Thanks for the quick response. Sorry, I only tried the offline version (800MB of all sentences). I plan to use some sentences in my app. I selected a set of 25,577 Chinese sentences (all with jpn and eng translations). It's sad. 😭
The alternative script versions of each sentence are contained in the transcriptions.tar.bz2 file on the downloads page. If the original sentence is detected as using traditional script, it will have a transcription with
While such conversion seems appealing, I don’t think this is something that we want to use in the script conversion tool used on Tatoeba. My knowledge of Chinese is basic, but shouldn’t a sentence using 打印机 be treated as a separate sentence from one using 印表機 (and tagged appropriately)? Isn’t it similar to American English vs. British English?
@jiru The scale is different. We create different phrases by translation. For example,
Most top research is published in English.
Hey @jiru ! A couple of questions about this PR:
(I'm not familiar with PHP or the Tatoeba codebase, otherwise I'd try to answer these questions myself, sorry!)
Hello @kerrickstaley,
Yes, there's a way to review transcriptions from the UI, but this applies only to transcriptions that are editable. At the moment, only furigana and pinyin are editable.
This information can be provided in the transcriptions.tar.bz2, not in the sentences.tar.bz2. That's because a sentence can have more than one type of transcription. That's the case for Mandarin Chinese, which has pinyin and the simplified/traditional script. The most straightforward solution would be to add the field
@wangchou Thank you for the clarification. I think that such differences in writing style should go into different sentences, possibly tagged as "Taiwanese Mandarin", "Standard Mandarin" or similar. That’s the way we typically deal with this on Tatoeba. I realize it could be seen as a bit weird to have a sentence featuring a Taiwanese Mandarin word transcribed to simplified characters. But I think it is still useful for learners or people who can read only simplified or only traditional.
@kerrickstaley In addition to Trang’s answer, I’d like to mention that in the file
Closing this PR because I don’t have plans to work on it any time soon. I may reopen it if I do. Just don’t delete my branch.
This is a PR for @Yorwba 😄
It solves #2007 by allowing autogenerated simplified/traditional scripts in Chinese sentences to be edited.
I initially believed that there weren’t substantial errors in the conversion. I wonder if we should mark converted scripts with the warning icon just like the Pinyin has. What’s the error rate of the converted script?