Add Python script to update mandarin.xml using data from CC-CEDICT #2

Merged
merged 14 commits into from
May 12, 2020


Yorwba

@Yorwba Yorwba commented Mar 28, 2020

As discussed in Tatoeba/tatoeba2#2189, I wrote a script that formats the dictionary data from CC-CEDICT in the mandarin.xml format sinoparserd expects.

That means there's no change to the parsing algorithm, but only to the words sinoparserd recognizes. It turns out that ambiguous parses are not much of an issue. I generated a list of all overlapping words that actually occur in existing sentences on Tatoeba (e.g. 是日/本 vs. 是/日本) and the list in tools/mandarin/override.py is enough to disambiguate all that appear at least five times.
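
To make the notion of an ambiguous parse concrete, here is a minimal sketch of how overlapping dictionary words could be detected in a sentence. The function names and the tiny `WORDS` set are made up for illustration; this is not the code used to produce tools/mandarin/override.py.

```python
def find_matches(sentence, words):
    """Return (start, end, word) for every dictionary word found in sentence."""
    max_len = max(len(w) for w in words)
    matches = []
    for start in range(len(sentence)):
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 1, last_end + 1):
            if sentence[start:end] in words:
                matches.append((start, end, sentence[start:end]))
    return matches


def overlapping_pairs(sentence, words):
    """Return pairs of multi-character words whose spans partially overlap."""
    matches = [m for m in find_matches(sentence, words) if m[1] - m[0] > 1]
    pairs = []
    for i, (s1, e1, w1) in enumerate(matches):
        for s2, e2, w2 in matches[i + 1:]:
            # The second word starts inside the first one and ends after it.
            if s1 < s2 < e1 < e2:
                pairs.append((w1, w2))
    return pairs


# Hypothetical mini-dictionary, just enough to reproduce the example above.
WORDS = {"是", "是日", "日本", "本", "人"}
print(overlapping_pairs("他是日本人", WORDS))
```

Counting such pairs over all sentences would give a frequency list from which the most common conflicts can be resolved by hand.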

Words or characters that have multiple possible variants/pronunciations are a bit more of a problem. I initially generated tools/mandarin/preference.py based on the frequency of each variant in the corpus and applied some manual corrections afterwards, choosing the more likely option. That means the conversion should be more likely to be correct than not, but there will still be exceptions. E.g. 只 as simultaneously simplified and traditional character vs. 只 as the simplification of traditional 隻 is probably almost equally common.
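
As a rough illustration of the frequency-based starting point, the sketch below picks, for each ambiguous character, the variant that occurs most often in a corpus. The function `build_preferences`, the `candidates` mapping and the three-sentence corpus are assumptions made up for this example; the actual tools/mandarin/preference.py was additionally corrected by hand.

```python
from collections import Counter


def build_preferences(candidates, corpus):
    """For each key, choose the candidate variant seen most often in corpus."""
    all_variants = {v for variants in candidates.values() for v in variants}
    counts = Counter()
    for sentence in corpus:
        for variant in all_variants:
            counts[variant] += sentence.count(variant)
    # max() keeps the first candidate on ties, so list order acts as a fallback.
    return {
        key: max(variants, key=lambda v: counts[v])
        for key, variants in candidates.items()
    }


# Hypothetical data: which traditional form should simplified 只 map to?
candidates = {"只": ["只", "隻"]}
corpus = ["那是一隻有惡意的兔子。", "我只是不知道。", "一隻貓。"]
print(build_preferences(candidates, corpus))
```

With a corpus this small the result is meaningless, of course; the point is only that a frequency count yields the more likely option, not a guaranteed one.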

Nonetheless, when I checked 100 random sentences, I didn't notice any problems with the traditional/simplified conversion, only with the pinyin. Of the changes to the traditional/simplified conversion, 100% were improvements, while the same is true of 77% of the changes to the pinyin. The instances where the pinyin got worse were longer compound words which are now less likely to have spaces separating their components. But that's a minor problem compared to incorrect tones.

I wrote the script to require no dependencies outside the Python standard library and it runs under both Python 2.7 and Python 3.7. But I also committed the output, so you won't need to run it except to get a more current version of CC-CEDICT.

@allan-simon

(is it possible to backport the PR on my repo too ? :) )

@Yorwba
Author

Yorwba commented Apr 3, 2020

@allan-simon I just checked where your repo and this one diverge and it seems like b062dcd is simply a merge commit combining the changes of a37d2d9 and 8d4482d into one. What confuses me about that is that 7d45d33, the other participant in the merge, is already the parent of 8d4482d. I didn't know it's possible to merge a commit with its own ancestor like that. Maybe I'm missing something here, but it looks like "backporting" could be just another merge.

@Yorwba
Author

Yorwba commented Apr 3, 2020

We discussed during the call on Sunday that it would be good to have a tool for collecting statistics about the changes that the new dictionary will cause in transcriptions of sentences on Tatoeba. @jiru mentioned that that was already a consideration when he created a new system for autogenerating Japanese furigana. So it would be good if such a tool could be relatively language-agnostic so that it could be used for evaluating other transcription changes as well.

The way I evaluated the changes in this PR was relatively manual: I downloaded the Mandarin sentences (it's nice that we now have an easy way to download data in a single language) and then fed them through sinoparserd with a simple bash one-liner:

cat cmn_sentences.tsv | cut -f3 | while read sent; do echo "$sent"; curl 'laptop:8080/all?str='"$sent" 2>/dev/null; done > cmn_parsed.txt

That concatenates the original sentences and the resulting XML into one big pile of mud.

see the first 100 lines here
我們試試看!
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
<romanization><![CDATA[wo3men5 shi4shi5 kan4 !]]></romanization>
<alternateScript><![CDATA[我们试试看!]]></alternateScript>
</root>
我该去睡觉了。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[wo3 gai1 qu4 shui4jiao4 le5 .]]></romanization>
<alternateScript><![CDATA[我該去睡覺了。]]></alternateScript>
</root>
你在干什麼啊?
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[ni3 zai4 gan4 shen2 me5 a1 ?]]></romanization>
<alternateScript><![CDATA[你在幹什麼啊?]]></alternateScript>
</root>
這是什麼啊?
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
<romanization><![CDATA[zhe4 shi4 shen2 me5 a1 ?]]></romanization>
<alternateScript><![CDATA[这是什么啊?]]></alternateScript>
</root>
今天是6月18号,也是Muiriel的生日!
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[jin1tian1 shi4 liu4 yue4 yi1 ba1 hao4 , ye3 shi4 Muiriel de5 sheng1ri4 !]]></romanization>
<alternateScript><![CDATA[今天是6月18號,也是Muiriel的生日!]]></alternateScript>
</root>
生日快乐,Muiriel!
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[sheng1ri4 kuai4le4 ,Muiriel!]]></romanization>
<alternateScript><![CDATA[生日快樂,Muiriel!]]></alternateScript>
</root>
Muiriel现在20岁了。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[Muiriel xian4zai4 20 sui4 le5 .]]></romanization>
<alternateScript><![CDATA[Muiriel現在20嵗了。]]></alternateScript>
</root>
密码是"Muiriel"。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[mi4ma3 shi4 "Muiriel".]]></romanization>
<alternateScript><![CDATA[密碼是"Muiriel"。]]></alternateScript>
</root>
我很快就會回來。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
<romanization><![CDATA[wo3 hen3 kuai4 jiu4 hui4 hui2lai5 .]]></romanization>
<alternateScript><![CDATA[我很快就会回来。]]></alternateScript>
</root>
我不知道。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[wo3 bu4zhi1 dao4 .]]></romanization>
<alternateScript><![CDATA[我不知道。]]></alternateScript>
</root>
我不知道應該說什麼才好。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
<romanization><![CDATA[wo3 bu4zhi1 dao4 ying1gai1 shuo1 shen2 me5 cai2 hao3 .]]></romanization>
<alternateScript><![CDATA[我不知道应该说什么才好。]]></alternateScript>
</root>
這個永遠完不了了。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
<romanization><![CDATA[zhe4ge5 yong3yuan3 wan2 bu4 le5 le5 .]]></romanization>
<alternateScript><![CDATA[这个永远完不了了。]]></alternateScript>
</root>
我只是不知道應該說什麼而已……
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
<romanization><![CDATA[wo3 zhi3 shi4 bu4zhi1 dao4 ying1gai1 shuo1 shen2 me5 er2yi3 ……]]></romanization>
<alternateScript><![CDATA[我只是不知道应该说什么而已……]]></alternateScript>
</root>
那是一隻有惡意的兔子。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
<romanization><![CDATA[na4 shi4 yi1 zhi3 you3 e4yi4 de5 tu4zi5 .]]></romanization>
<alternateScript><![CDATA[那是一只有恶意的兔子。]]></alternateScript>
</root>
我以前在山里。
<?xml version="1.0" encoding="UTF-8"?>
Still, it was good enough for manual evaluation of the differences between `cmn_parsed.txt` (with the old dictionary) and `new_cmn_parsed.txt` (after restarting `sinoparserd` with the new dictionary) using `vimdiff` for highlighting.
I'll show the first 100 lines of the output of `git diff --no-index cmn_parsed.txt new_cmn_parsed.txt` here instead:
diff --git a/cmn_parsed.txt b/new_cmn_parsed.txt
index 7577a5d..e6941ec 100644
--- a/cmn_parsed.txt
+++ b/new_cmn_parsed.txt
@@ -2,7 +2,7 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <root>
 <script><![CDATA[traditional_script]]></script>
-<romanization><![CDATA[wo3men5 shi4shi5 kan4 !]]></romanization>
+<romanization><![CDATA[wo3men5 shi4shi4kan4 !]]></romanization>
 <alternateScript><![CDATA[我们试试看!]]></alternateScript>
 </root>
 我该去睡觉了。
@@ -15,29 +15,29 @@
 你在干什麼啊?
 <?xml version="1.0" encoding="UTF-8"?>
 <root>
-<script><![CDATA[simplified_script]]></script>
-<romanization><![CDATA[ni3 zai4 gan4 shen2 me5 a1 ?]]></romanization>
-<alternateScript><![CDATA[你在幹什麼啊?]]></alternateScript>
+<script><![CDATA[traditional_script]]></script>
+<romanization><![CDATA[ni3 zai4 gan4 shen2me5 a1 ?]]></romanization>
+<alternateScript><![CDATA[你在干什么啊?]]></alternateScript>
 </root>
 這是什麼啊?
 <?xml version="1.0" encoding="UTF-8"?>
 <root>
 <script><![CDATA[traditional_script]]></script>
-<romanization><![CDATA[zhe4 shi4 shen2 me5 a1 ?]]></romanization>
+<romanization><![CDATA[zhe4 shi4 shen2me5 a1 ?]]></romanization>
 <alternateScript><![CDATA[这是什么啊?]]></alternateScript>
 </root>
 今天是6月18号,也是Muiriel的生日!
 <?xml version="1.0" encoding="UTF-8"?>
 <root>
 <script><![CDATA[simplified_script]]></script>
-<romanization><![CDATA[jin1tian1 shi4 liu4 yue4 yi1 ba1 hao4 , ye3 shi4 Muiriel de5 sheng1ri4 !]]></romanization>
+<romanization><![CDATA[jin1tian1 shi4 6 yue4 18 hao4 , ye3 shi4 Muiriel de5 sheng1ri4 !]]></romanization>
 <alternateScript><![CDATA[今天是6月18號,也是Muiriel的生日!]]></alternateScript>
 </root>
 生日快乐,Muiriel!
 <?xml version="1.0" encoding="UTF-8"?>
 <root>
 <script><![CDATA[simplified_script]]></script>
-<romanization><![CDATA[sheng1ri4 kuai4le4 ,Muiriel!]]></romanization>
+<romanization><![CDATA[sheng1ri4kuai4le4 ,Muiriel!]]></romanization>
 <alternateScript><![CDATA[生日快樂,Muiriel!]]></alternateScript>
 </root>
 Muiriel现在20岁了。
@@ -45,7 +45,7 @@ Muiriel现在20岁了。
 <root>
 <script><![CDATA[simplified_script]]></script>
 <romanization><![CDATA[Muiriel xian4zai4 20 sui4 le5 .]]></romanization>
-<alternateScript><![CDATA[Muiriel現在20嵗了。]]></alternateScript>
+<alternateScript><![CDATA[Muiriel現在20歲了。]]></alternateScript>
 </root>
 密码是"Muiriel"。
 <?xml version="1.0" encoding="UTF-8"?>
@@ -65,35 +65,35 @@ Muiriel现在20岁了。
 <?xml version="1.0" encoding="UTF-8"?>
 <root>
 <script><![CDATA[simplified_script]]></script>
-<romanization><![CDATA[wo3 bu4zhi1 dao4 .]]></romanization>
+<romanization><![CDATA[wo3 bu4 zhi1dao4 .]]></romanization>
 <alternateScript><![CDATA[我不知道。]]></alternateScript>
 </root>
 我不知道應該說什麼才好。
 <?xml version="1.0" encoding="UTF-8"?>
 <root>
 <script><![CDATA[traditional_script]]></script>
-<romanization><![CDATA[wo3 bu4zhi1 dao4 ying1gai1 shuo1 shen2 me5 cai2 hao3 .]]></romanization>
+<romanization><![CDATA[wo3 bu4 zhi1dao4 ying1gai1 shuo1 shen2me5 cai2 hao3 .]]></romanization>
 <alternateScript><![CDATA[我不知道应该说什么才好。]]></alternateScript>
 </root>
 這個永遠完不了了。
 <?xml version="1.0" encoding="UTF-8"?>
 <root>
 <script><![CDATA[traditional_script]]></script>
-<romanization><![CDATA[zhe4ge5 yong3yuan3 wan2 bu4 le5 le5 .]]></romanization>
+<romanization><![CDATA[zhe4ge5 yong3yuan3 wan2 bu4 liao3liao3 .]]></romanization>
 <alternateScript><![CDATA[这个永远完不了了。]]></alternateScript>
 </root>
 我只是不知道應該說什麼而已……
 <?xml version="1.0" encoding="UTF-8"?>
 <root>
 <script><![CDATA[traditional_script]]></script>
-<romanization><![CDATA[wo3 zhi3 shi4 bu4zhi1 dao4 ying1gai1 shuo1 shen2 me5 er2yi3 ……]]></romanization>
+<romanization><![CDATA[wo3 zhi3shi4 bu4 zhi1dao4 ying1gai1 shuo1 shen2me5 er2yi3 ……]]></romanization>
 <alternateScript><![CDATA[我只是不知道应该说什么而已……]]></alternateScript>
 </root>
 那是一隻有惡意的兔子。
 <?xml version="1.0" encoding="UTF-8"?>
 <root>
 <script><![CDATA[traditional_script]]></script>
-<romanization><![CDATA[na4 shi4 yi1 zhi3 you3 e4yi4 de5 tu4zi5 .]]></romanization>
+<romanization><![CDATA[na4shi5 yi1 zhi1 you3 e4yi4 de5 tu4zi5 .]]></romanization>
 <alternateScript><![CDATA[那是一只有恶意的兔子。]]></alternateScript>
 </root>
 我以前在山里。
@@ -107,35 +107,35 @@ Muiriel现在20岁了。
The diff shows roughly what kind of changes to expect, but it's hard to evaluate how frequent each change is except by gut feeling. To get some accurate statistics, we can focus on the parts of each line that actually change and then count how many times the same change appears. It turns out that Python's standard library provides functions to do exactly that, so I was able to hack together a quick script:
import difflib
from collections import Counter
from pprint import pprint


def changes(a, b):
    """Return (tag, old, new) for every part of line a that differs in line b."""
    opcodes = difflib.SequenceMatcher(None, a, b).get_opcodes()
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in opcodes
        if tag != 'equal'
    ]


# This assumes both files list the same sentences in the same order, so that
# they can be compared line by line.
with open('cmn_parsed.txt') as cmn_parsed, \
        open('new_cmn_parsed.txt') as new_cmn_parsed:
    count = Counter(
        c
        for a, b in zip(cmn_parsed, new_cmn_parsed)
        for c in changes(a, b)
    )

pprint(count.most_common(100))
This outputs the 100 most common changes after a few minutes of computation:
[(('delete', ' ', ''), 27557),
 (('insert', '', ' '), 12508),
 (('replace', '5', '4'), 4087),
 (('replace', 't', 'T'), 2991),
 (('replace', '4', '5'), 2326),
 (('replace', '1', '5'), 2034),
 (('replace', '甚', '什'), 2023),
 (('replace', 'i3', '5'), 981),
 (('replace', 'y', 'Y'), 965),
 (('insert', '', "'"), 902),
 (('replace', 'm', 'M'), 860),
 (('replace', '1', '4'), 759),
 (('replace', '5', '1'), 720),
 (('replace', '2', '5'), 718),
 (('replace', '5', '3'), 693),
 (('replace', 'ao2', 'e5'), 632),
 (('replace', 's', 'S'), 518),
 (('replace', 'r', 'R'), 507),
 (('replace', 'b', 'B'), 449),
 (('replace', '3', '5'), 392),
 (('replace', 'x', 'X'), 378),
 (('replace', '著', '着'), 363),
 (('replace', 'i4', 'e5'), 363),
 (('replace', '5', '2'), 343),
 (('replace', 'u4', 'e5'), 342),
 (('replace', '4', '2'), 341),
 (('replace', 'f', 'F'), 318),
 (('replace', '2', '4'), 308),
 (('replace', 'h', 'H'), 277),
 (('replace', 'i3 ', '5'), 268),
 (('replace', ' ', "'"), 264),
 (('replace', 'a', 'A'), 253),
 (('replace', 'j', 'J'), 240),
 (('replace', '3', '4'), 239),
 (('replace', 'l', 'L'), 227),
 (('replace', 'i', 'e'), 221),
 (('replace', 'd', 'D'), 221),
 (('replace', '4', '1'), 216),
 (('replace', '1', '3'), 201),
 (('delete', 'o', ''), 196),
 (('replace', '3', '1'), 191),
 (('replace', '1 ', '4'), 190),
 (('replace', 'z', 'Z'), 190),
 (('replace', '周', '週'), 183),
 (('replace', '向', '嚮'), 179),
 (('replace', '3', '2'), 162),
 (('replace', '麽', '么'), 152),
 (('replace', 'P', 'p'), 149),
 (('replace', '嵗', '歲'), 147),
 (('replace', 'd', 't'), 147),
 (('replace', 'w', 'W'), 135),
 (('replace', '5 ', '4'), 133),
 (('replace', '週', '周'), 132),
 (('replace', '2', '1'), 123),
 (('replace', '游', '遊'), 117),
 (('replace', 's', 'trad'), 113),
 (('replace', 'mp', 'tiona'), 113),
 (('delete', 'ified', ''), 113),
 (('replace', 'c', 'z'), 107),
 (('replace', '4', '3'), 99),
 (('replace', 'g', 'G'), 96),
 (('replace', '歲', '岁'), 96),
 (('replace', 'n', 'N'), 91),
 (('replace', 'e5', 'i4'), 84),
 (('replace', '2', '4 '), 84),
 (('replace', '瞭', '了'), 83),
 (('replace', '4 ', '5'), 81),
 (('replace', 'k', 'K'), 78),
 (('replace', 'e', 'E'), 77),
 (('replace', 'c', 'C'), 77),
 (('replace', '1', '4 '), 74),
 (('replace', '1', '2'), 74),
 (('replace', 'f', ' F'), 70),
 (('replace', 'q', 'Q'), 70),
 (('replace', '5', '4 '), 69),
 (('delete', 'e', ''), 68),
 (('replace', 'e5', 'uo2'), 67),
 (('replace', 'i3 ', '2'), 64),
 (('replace', '5 ', '3'), 63),
 (('insert', '', 'o'), 57),
 (('replace', '台', '臺'), 56),
 (('replace', '4', '2 '), 56),
 (('replace', '3 ', '4'), 53),
 (('replace', '4', '5 '), 51),
 (('replace', 'B', 'bi1 '), 50),
 (('replace', '遊', '游'), 49),
 (('replace', '5 ', '2'), 48),
 (('replace', '2 ', '4'), 47),
 (('insert', '', ' de5'), 45),
 (('replace', '錶', '表'), 44),
 (('replace', 'o', 'O'), 42),
 (('replace', '乾', '干'), 41),
 (('replace', '歲', 'sui4'), 41),
 (('replace', '歷', '历'), 40),
 (('replace', '於', '于'), 39),
 (('delete', ' e', ''), 39),
 (('replace', 'i', 'u'), 38),
 (('replace', '3', '5 '), 38),
 (('replace', '歷 ', 'li4'), 37),
 (('replace', '週', 'zhou1'), 37)]
That shows that most changes are relatively short (inserting/removing spaces, changing tone numbers, adding simplified variants of some characters, adding pinyin for characters that were unknown previously) but there's not enough context to tell whether each change is justified or not. E.g. there are 4087 changes turning tone 5 into tone 4, but it's not even clear whether those are all the same word or many different words that are affected. It wouldn't be hard to also include some of the surrounding text for each change to provide some context, but exactly how much context is needed likely depends on the particular case. I feel like this would best be solved by an interactive tool that would allow you to quickly adjust how much context you want to see.
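
A minimal sketch of what including the surrounding text could look like, with the amount of context as a parameter. The function name `changes_with_context` and the `context` parameter are made up here; this is a variation on the script above, not something that exists in the PR.

```python
import difflib


def changes_with_context(a, b, context=2):
    """Return (tag, before, old, new, after) for each difference between a and b.

    `before` and `after` are up to `context` characters taken from `a` on
    either side of the changed span.
    """
    matcher = difflib.SequenceMatcher(None, a, b)
    result = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        before = a[max(0, i1 - context):i1]
        after = a[i2:i2 + context]
        result.append((tag, before, a[i1:i2], b[j1:j2], after))
    return result


print(changes_with_context("shen2 me5", "shen2me5", context=3))
```

Counting these tuples instead of the bare changes would split the big "delete a space" bucket by context, at the cost of having to pick a context width up front, which is exactly what an interactive tool would avoid.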

The other issue here is that input is collected in a rather ad-hoc manner and if we wanted to test another transcription method, the files to compare would be generated differently, in a different format. Since Tatoeba already has an abstraction layer providing a unified interface for different transcription methods, maybe that could be extended to export transcriptions in an easily comparable format.
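
As a stopgap for the ad-hoc format, the concatenated sentence/XML file could be turned into tab-separated rows before comparing. The sketch below assumes the exact line layout shown in the excerpt above (sentence, XML declaration, `<root>`, three CDATA elements, `</root>`); `parse_records` and `to_tsv_row` are hypothetical helper names, and the column order is an arbitrary choice.

```python
import re

# Matches one element line such as <script><![CDATA[...]]></script>.
ELEMENT = re.compile(
    r"<(script|romanization|alternateScript)><!\[CDATA\[(.*)\]\]></\1>"
)


def parse_records(lines):
    """Group the sentence line and the following XML fields into dicts."""
    records = []
    current = None
    previous = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<?xml"):
            # The line right before the XML declaration is the sentence.
            current = {"sentence": previous}
        elif line == "</root>" and current is not None:
            records.append(current)
            current = None
        elif current is not None:
            match = ELEMENT.fullmatch(line)
            if match:
                current[match.group(1)] = match.group(2)
        previous = line
    return records


def to_tsv_row(record):
    """Format one record as sentence, script, romanization, alternateScript."""
    keys = ("sentence", "script", "romanization", "alternateScript")
    return "\t".join(record.get(key, "") for key in keys)


sample = [
    "我們試試看!\n",
    '<?xml version="1.0" encoding="UTF-8"?>\n',
    "<root>\n",
    "<script><![CDATA[traditional_script]]></script>\n",
    "<romanization><![CDATA[wo3men5 shi4shi5 kan4 !]]></romanization>\n",
    "<alternateScript><![CDATA[我们试试看!]]></alternateScript>\n",
    "</root>\n",
]
for record in parse_records(sample):
    print(to_tsv_row(record))
```

This only papers over the problem for sinoparserd output; an export built into Tatoeba's transcription layer would make such per-tool parsers unnecessary.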

@jiru
Member

jiru commented Apr 4, 2020

Thank you for all the work you put into this. I feel like we are heading in the right direction with your automatic diff comparison idea.

Another important statistic to consider is how newly-autogenerated transcriptions compare to manually reviewed ones (i.e. how close we brought the tool to reviewed transcriptions). But for this, we first need to solve Tatoeba/tatoeba2#1515. I’m gonna do this now.

I think it’s okay if the tool is not language-agnostic for now. We can start with a tool specialized in Mandarin and expand it to other languages once we are more confident with the process.

@jiru
Member

jiru commented Apr 4, 2020

@Yorwba Early exported transcriptions: https://downloads.tatoeba.org/exports/transcriptions.tar.bz2

@Yorwba
Author

Yorwba commented May 10, 2020

I finally finished my script for comparing the different transcriptions.

Basically, it groups transcriptions for the same sentence ID and script from multiple files together and splits them into blocks that are either identical in all of them or show some difference. The differences are then classified into patterns, e.g. `a b b` if the first one is different, but the other two match. For each difference, I also identify which characters are most likely to appear before and after the differing part and group together examples where that context is the same. The tool generates an HTML report that uses nested `<details><summary></summary></details>` tags to allow drilling down to check a particular difference. The full report is a bit too large to include in this comment, so here's a screenshot:
[Screenshot: Transcription differences]
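
The pattern naming can be sketched as follows, assuming each variant is labeled with the letter of the position where it first appears (which is what makes `a a c` possible rather than `a a b`). The function `classify` is a hypothetical reconstruction for illustration, not the code of the comparison tool.

```python
import string


def classify(variants):
    """Label each variant with the letter of the position it first appears at."""
    first_seen = {}
    letters = []
    for index, variant in enumerate(variants):
        first_seen.setdefault(variant, string.ascii_lowercase[index])
        letters.append(first_seen[variant])
    return " ".join(letters)


# Order of the variants: old automatic, new automatic, manually edited.
print(classify(("shen2 me5", "shen2me5", "shen2me5")))
print(classify(("xi3huan1", "xi3huan1", "xi3huan5")))
```

Feeding the resulting labels into a `collections.Counter` would give per-pattern totals like the ones listed below.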

The counts for each pattern are:

Differences between transcriptions in old_cmn_transcriptions.tsv, new_cmn_transcriptions.tsv, transcriptions.csv
Pattern a b b: 1599 times (49.8%)
Pattern a a c: 779 times (24.3%)
Pattern a b a: 741 times (23.1%)
Pattern a b c: 90 times (2.8%)

In 49.8% of cases where there is a difference, the manually edited transcription matches the newly generated one (pattern a b b), while in 23.1% of cases, it matches the old one instead (pattern a b a). In the remaining two patterns (a a c and a b c, together 27.1%), neither of the automatic transcriptions matches the manually edited one.

So adopting the new dictionary for the automatic transcription would be an overall improvement.
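
For anyone who wants to double-check the arithmetic, the percentages follow directly from the counts listed above. The variable names below are made up for this check.

```python
# Pattern counts copied from the report summary above.
counts = {"a b b": 1599, "a a c": 779, "a b a": 741, "a b c": 90}

total = sum(counts.values())
shares = {
    pattern: round(100.0 * n / total, 1)
    for pattern, n in counts.items()
}
print(total, shares)

# How much more often the manual transcription agrees with the new
# dictionary ("a b b") than with the old one ("a b a").
ratio = counts["a b b"] / float(counts["a b a"])
print(round(ratio, 2))
```

So among the differing blocks, the manually edited transcription agrees with the new output a bit more than twice as often as with the old one.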

@jiru
Member

jiru commented May 12, 2020

That’s brilliant! Thanks a million, Yorwba.

@jiru jiru merged commit c42dff9 into Tatoeba:master May 12, 2020
@jiru
Member

jiru commented May 12, 2020

@Yorwba I installed the new mandarin.xml file on dev.tatoeba.org and I regenerated all the non-manually edited transcriptions for Chinese. I can see that some Pinyin words are now split differently, and that 喜欢 is now transcribed as xǐhuan instead of xǐhuān. 😸

Can you confirm it looks fine?

@Yorwba
Author

Yorwba commented May 12, 2020

Yeah, looks good.

3 participants