Add Python script to update mandarin.xml using data from CC-CEDICT #2
(Is it possible to backport the PR to my repo too? :) )
@allan-simon I just checked where your repo and this one diverge and it seems like b062dcd is simply a merge commit combining the changes of a37d2d9 and 8d4482d into one. What confuses me about that is that 7d45d33, the other participant in the merge, is already the parent of 8d4482d. I didn't know it was possible to merge a commit with its own ancestor like that. Maybe I'm missing something here, but it looks like "backporting" could be just another merge.
We discussed during the call on Sunday that it would be good to have a tool for collecting statistics about the changes that the new dictionary will cause in transcriptions of sentences on Tatoeba. @jiru mentioned that that was already a consideration when he created a new system for autogenerating Japanese furigana. So it would be good if such a tool could be relatively language-agnostic so that it could be used for evaluating other transcription changes as well.
The way I evaluated the changes in this PR was relatively manual: I downloaded the Mandarin sentences (it's nice that we now have a way to download data in a single language easily) and then fed them through
cat cmn_sentences.tsv | cut -f3 | while read sent; do echo "$sent"; curl 'laptop:8080/all?str='"$sent" 2>/dev/null; done > cmn_parsed.txt
That concatenates the original sentences and the resulting XML into one big pile of mud; see the first 100 lines here:
我們試試看!
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
<romanization><![CDATA[wo3men5 shi4shi5 kan4 !]]></romanization>
<alternateScript><![CDATA[我们试试看!]]></alternateScript>
</root>
我该去睡觉了。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[wo3 gai1 qu4 shui4jiao4 le5 .]]></romanization>
<alternateScript><![CDATA[我該去睡覺了。]]></alternateScript>
</root>
你在干什麼啊?
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[ni3 zai4 gan4 shen2 me5 a1 ?]]></romanization>
<alternateScript><![CDATA[你在幹什麼啊?]]></alternateScript>
</root>
這是什麼啊?
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
<romanization><![CDATA[zhe4 shi4 shen2 me5 a1 ?]]></romanization>
<alternateScript><![CDATA[这是什么啊?]]></alternateScript>
</root>
今天是6月18号,也是Muiriel的生日!
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[jin1tian1 shi4 liu4 yue4 yi1 ba1 hao4 , ye3 shi4 Muiriel de5 sheng1ri4 !]]></romanization>
<alternateScript><![CDATA[今天是6月18號,也是Muiriel的生日!]]></alternateScript>
</root>
生日快乐,Muiriel!
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[sheng1ri4 kuai4le4 ,Muiriel!]]></romanization>
<alternateScript><![CDATA[生日快樂,Muiriel!]]></alternateScript>
</root>
Muiriel现在20岁了。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[Muiriel xian4zai4 20 sui4 le5 .]]></romanization>
<alternateScript><![CDATA[Muiriel現在20嵗了。]]></alternateScript>
</root>
密码是"Muiriel"。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[mi4ma3 shi4 "Muiriel".]]></romanization>
<alternateScript><![CDATA[密碼是"Muiriel"。]]></alternateScript>
</root>
我很快就會回來。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
<romanization><![CDATA[wo3 hen3 kuai4 jiu4 hui4 hui2lai5 .]]></romanization>
<alternateScript><![CDATA[我很快就会回来。]]></alternateScript>
</root>
我不知道。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[wo3 bu4zhi1 dao4 .]]></romanization>
<alternateScript><![CDATA[我不知道。]]></alternateScript>
</root>
我不知道應該說什麼才好。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
<romanization><![CDATA[wo3 bu4zhi1 dao4 ying1gai1 shuo1 shen2 me5 cai2 hao3 .]]></romanization>
<alternateScript><![CDATA[我不知道应该说什么才好。]]></alternateScript>
</root>
這個永遠完不了了。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
<romanization><![CDATA[zhe4ge5 yong3yuan3 wan2 bu4 le5 le5 .]]></romanization>
<alternateScript><![CDATA[这个永远完不了了。]]></alternateScript>
</root>
我只是不知道應該說什麼而已……
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
<romanization><![CDATA[wo3 zhi3 shi4 bu4zhi1 dao4 ying1gai1 shuo1 shen2 me5 er2yi3 ……]]></romanization>
<alternateScript><![CDATA[我只是不知道应该说什么而已……]]></alternateScript>
</root>
那是一隻有惡意的兔子。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
<romanization><![CDATA[na4 shi4 yi1 zhi3 you3 e4yi4 de5 tu4zi5 .]]></romanization>
<alternateScript><![CDATA[那是一只有恶意的兔子。]]></alternateScript>
</root>
我以前在山里。
<?xml version="1.0" encoding="UTF-8"?>
I'll show the first 100 lines of the output of `git diff --no-index cmn_parsed.txt new_cmn_parsed.txt` here instead:
diff --git a/cmn_parsed.txt b/new_cmn_parsed.txt
index 7577a5d..e6941ec 100644
--- a/cmn_parsed.txt
+++ b/new_cmn_parsed.txt
@@ -2,7 +2,7 @@
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
-<romanization><![CDATA[wo3men5 shi4shi5 kan4 !]]></romanization>
+<romanization><![CDATA[wo3men5 shi4shi4kan4 !]]></romanization>
<alternateScript><![CDATA[我们试试看!]]></alternateScript>
</root>
我该去睡觉了。
@@ -15,29 +15,29 @@
你在干什麼啊?
<?xml version="1.0" encoding="UTF-8"?>
<root>
-<script><![CDATA[simplified_script]]></script>
-<romanization><![CDATA[ni3 zai4 gan4 shen2 me5 a1 ?]]></romanization>
-<alternateScript><![CDATA[你在幹什麼啊?]]></alternateScript>
+<script><![CDATA[traditional_script]]></script>
+<romanization><![CDATA[ni3 zai4 gan4 shen2me5 a1 ?]]></romanization>
+<alternateScript><![CDATA[你在干什么啊?]]></alternateScript>
</root>
這是什麼啊?
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
-<romanization><![CDATA[zhe4 shi4 shen2 me5 a1 ?]]></romanization>
+<romanization><![CDATA[zhe4 shi4 shen2me5 a1 ?]]></romanization>
<alternateScript><![CDATA[这是什么啊?]]></alternateScript>
</root>
今天是6月18号,也是Muiriel的生日!
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
-<romanization><![CDATA[jin1tian1 shi4 liu4 yue4 yi1 ba1 hao4 , ye3 shi4 Muiriel de5 sheng1ri4 !]]></romanization>
+<romanization><![CDATA[jin1tian1 shi4 6 yue4 18 hao4 , ye3 shi4 Muiriel de5 sheng1ri4 !]]></romanization>
<alternateScript><![CDATA[今天是6月18號,也是Muiriel的生日!]]></alternateScript>
</root>
生日快乐,Muiriel!
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
-<romanization><![CDATA[sheng1ri4 kuai4le4 ,Muiriel!]]></romanization>
+<romanization><![CDATA[sheng1ri4kuai4le4 ,Muiriel!]]></romanization>
<alternateScript><![CDATA[生日快樂,Muiriel!]]></alternateScript>
</root>
Muiriel现在20岁了。
@@ -45,7 +45,7 @@ Muiriel现在20岁了。
<root>
<script><![CDATA[simplified_script]]></script>
<romanization><![CDATA[Muiriel xian4zai4 20 sui4 le5 .]]></romanization>
-<alternateScript><![CDATA[Muiriel現在20嵗了。]]></alternateScript>
+<alternateScript><![CDATA[Muiriel現在20歲了。]]></alternateScript>
</root>
密码是"Muiriel"。
<?xml version="1.0" encoding="UTF-8"?>
@@ -65,35 +65,35 @@ Muiriel现在20岁了。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[simplified_script]]></script>
-<romanization><![CDATA[wo3 bu4zhi1 dao4 .]]></romanization>
+<romanization><![CDATA[wo3 bu4 zhi1dao4 .]]></romanization>
<alternateScript><![CDATA[我不知道。]]></alternateScript>
</root>
我不知道應該說什麼才好。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
-<romanization><![CDATA[wo3 bu4zhi1 dao4 ying1gai1 shuo1 shen2 me5 cai2 hao3 .]]></romanization>
+<romanization><![CDATA[wo3 bu4 zhi1dao4 ying1gai1 shuo1 shen2me5 cai2 hao3 .]]></romanization>
<alternateScript><![CDATA[我不知道应该说什么才好。]]></alternateScript>
</root>
這個永遠完不了了。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
-<romanization><![CDATA[zhe4ge5 yong3yuan3 wan2 bu4 le5 le5 .]]></romanization>
+<romanization><![CDATA[zhe4ge5 yong3yuan3 wan2 bu4 liao3liao3 .]]></romanization>
<alternateScript><![CDATA[这个永远完不了了。]]></alternateScript>
</root>
我只是不知道應該說什麼而已……
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
-<romanization><![CDATA[wo3 zhi3 shi4 bu4zhi1 dao4 ying1gai1 shuo1 shen2 me5 er2yi3 ……]]></romanization>
+<romanization><![CDATA[wo3 zhi3shi4 bu4 zhi1dao4 ying1gai1 shuo1 shen2me5 er2yi3 ……]]></romanization>
<alternateScript><![CDATA[我只是不知道应该说什么而已……]]></alternateScript>
</root>
那是一隻有惡意的兔子。
<?xml version="1.0" encoding="UTF-8"?>
<root>
<script><![CDATA[traditional_script]]></script>
-<romanization><![CDATA[na4 shi4 yi1 zhi3 you3 e4yi4 de5 tu4zi5 .]]></romanization>
+<romanization><![CDATA[na4shi5 yi1 zhi1 you3 e4yi4 de5 tu4zi5 .]]></romanization>
<alternateScript><![CDATA[那是一只有恶意的兔子。]]></alternateScript>
</root>
我以前在山里。
@@ -107,35 +107,35 @@ Muiriel现在20岁了。
import difflib
from collections import Counter
from pprint import pprint

# Return every non-equal opcode between strings a and b as
# (tag, removed_text, inserted_text).
changes = lambda a, b: [
    (tag, a[i1:i2], b[j1:j2])
    for tag, i1, i2, j1, j2
    in difflib.SequenceMatcher(None, a, b).get_opcodes() if tag != 'equal'
]

# Compare the two files line by line and count each distinct change.
with open('cmn_parsed.txt') as cmn_parsed:
    with open('new_cmn_parsed.txt') as new_cmn_parsed:
        count = Counter(
            c
            for a, b in zip(cmn_parsed.readlines(), new_cmn_parsed.readlines())
            for c in changes(a, b)
        )

pprint(count.most_common(100))
which outputs the 100 most common changes after a few minutes of computation:
[(('delete', ' ', ''), 27557),
(('insert', '', ' '), 12508),
(('replace', '5', '4'), 4087),
(('replace', 't', 'T'), 2991),
(('replace', '4', '5'), 2326),
(('replace', '1', '5'), 2034),
(('replace', '甚', '什'), 2023),
(('replace', 'i3', '5'), 981),
(('replace', 'y', 'Y'), 965),
(('insert', '', "'"), 902),
(('replace', 'm', 'M'), 860),
(('replace', '1', '4'), 759),
(('replace', '5', '1'), 720),
(('replace', '2', '5'), 718),
(('replace', '5', '3'), 693),
(('replace', 'ao2', 'e5'), 632),
(('replace', 's', 'S'), 518),
(('replace', 'r', 'R'), 507),
(('replace', 'b', 'B'), 449),
(('replace', '3', '5'), 392),
(('replace', 'x', 'X'), 378),
(('replace', '著', '着'), 363),
(('replace', 'i4', 'e5'), 363),
(('replace', '5', '2'), 343),
(('replace', 'u4', 'e5'), 342),
(('replace', '4', '2'), 341),
(('replace', 'f', 'F'), 318),
(('replace', '2', '4'), 308),
(('replace', 'h', 'H'), 277),
(('replace', 'i3 ', '5'), 268),
(('replace', ' ', "'"), 264),
(('replace', 'a', 'A'), 253),
(('replace', 'j', 'J'), 240),
(('replace', '3', '4'), 239),
(('replace', 'l', 'L'), 227),
(('replace', 'i', 'e'), 221),
(('replace', 'd', 'D'), 221),
(('replace', '4', '1'), 216),
(('replace', '1', '3'), 201),
(('delete', 'o', ''), 196),
(('replace', '3', '1'), 191),
(('replace', '1 ', '4'), 190),
(('replace', 'z', 'Z'), 190),
(('replace', '周', '週'), 183),
(('replace', '向', '嚮'), 179),
(('replace', '3', '2'), 162),
(('replace', '麽', '么'), 152),
(('replace', 'P', 'p'), 149),
(('replace', '嵗', '歲'), 147),
(('replace', 'd', 't'), 147),
(('replace', 'w', 'W'), 135),
(('replace', '5 ', '4'), 133),
(('replace', '週', '周'), 132),
(('replace', '2', '1'), 123),
(('replace', '游', '遊'), 117),
(('replace', 's', 'trad'), 113),
(('replace', 'mp', 'tiona'), 113),
(('delete', 'ified', ''), 113),
(('replace', 'c', 'z'), 107),
(('replace', '4', '3'), 99),
(('replace', 'g', 'G'), 96),
(('replace', '歲', '岁'), 96),
(('replace', 'n', 'N'), 91),
(('replace', 'e5', 'i4'), 84),
(('replace', '2', '4 '), 84),
(('replace', '瞭', '了'), 83),
(('replace', '4 ', '5'), 81),
(('replace', 'k', 'K'), 78),
(('replace', 'e', 'E'), 77),
(('replace', 'c', 'C'), 77),
(('replace', '1', '4 '), 74),
(('replace', '1', '2'), 74),
(('replace', 'f', ' F'), 70),
(('replace', 'q', 'Q'), 70),
(('replace', '5', '4 '), 69),
(('delete', 'e', ''), 68),
(('replace', 'e5', 'uo2'), 67),
(('replace', 'i3 ', '2'), 64),
(('replace', '5 ', '3'), 63),
(('insert', '', 'o'), 57),
(('replace', '台', '臺'), 56),
(('replace', '4', '2 '), 56),
(('replace', '3 ', '4'), 53),
(('replace', '4', '5 '), 51),
(('replace', 'B', 'bi1 '), 50),
(('replace', '遊', '游'), 49),
(('replace', '5 ', '2'), 48),
(('replace', '2 ', '4'), 47),
(('insert', '', ' de5'), 45),
(('replace', '錶', '表'), 44),
(('replace', 'o', 'O'), 42),
(('replace', '乾', '干'), 41),
(('replace', '歲', 'sui4'), 41),
(('replace', '歷', '历'), 40),
(('replace', '於', '于'), 39),
(('delete', ' e', ''), 39),
(('replace', 'i', 'u'), 38),
(('replace', '3', '5 '), 38),
(('replace', '歷 ', 'li4'), 37),
(('replace', '週', 'zhou1'), 37)]
The other issue here is that the input is collected in a rather ad hoc manner, and if we wanted to test another transcription method, the files to compare would be generated differently, in a different format. Since Tatoeba already has an abstraction layer providing a unified interface for different transcription methods, maybe that could be extended to export transcriptions in an easily comparable format.
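One possible shape for a more comparable format, sketched here under assumptions: the record layout (a sentence line, an XML declaration, then a `<root>` element with three CDATA fields) is taken from the sample output above, while the function names `parse_records` and `changed_fields` are made up for illustration. Comparing field by field would also keep noise such as `('replace', 's', 'trad')`, which appears to come from `simplified_script` turning into `traditional_script`, out of the pinyin statistics.

```python
import re
from collections import Counter

# Assumed layout, based on the sample output: each record is a sentence line,
# an XML declaration, and a <root> element holding three CDATA fields.
FIELD = re.compile(
    r"<(script|romanization|alternateScript)><!\[CDATA\[(.*?)\]\]></\1>"
)
FIELDS = ("script", "romanization", "alternateScript")


def parse_records(lines):
    """Yield one dict per sentence with the keys 'sentence' and FIELDS."""
    record = None
    previous = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("<?xml"):
            # The line right before the XML declaration is the sentence.
            record = {"sentence": previous}
        elif line == "</root>" and record is not None:
            yield record
            record = None
        elif record is not None:
            match = FIELD.match(line)
            if match:
                record[match.group(1)] = match.group(2)
        previous = line


def changed_fields(old_lines, new_lines):
    """Count, per field, how many records differ between two dumps."""
    counts = Counter()
    for old, new in zip(parse_records(old_lines), parse_records(new_lines)):
        for field in FIELDS:
            if old.get(field) != new.get(field):
                counts[field] += 1
    return counts
```

With the two files from above, something like `changed_fields(open('cmn_parsed.txt'), open('new_cmn_parsed.txt'))` would give one count per field instead of one mixed list of character edits.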
Thank you for all the work you put into this. I feel like we are heading in the right direction with your automatic diff comparison idea. Another important statistic to consider is how newly autogenerated transcriptions compare to manually reviewed ones (i.e. how close we brought the tool to reviewed transcriptions). But for this, we first need to solve Tatoeba/tatoeba2#1515. I’m gonna do this now. I think it’s okay if the tool is not language-agnostic for now; we can start with a tool specialized in Mandarin and expand it to other languages once we are more confident with the process.
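As a rough illustration of that statistic, the sketch below scores autogenerated transcriptions against reviewed ones with `difflib.SequenceMatcher.ratio()`. It assumes the data is already available as `(generated, reviewed)` string pairs; how those pairs would be read from an export is not shown here, and the names `similarity` and `summarize` are hypothetical.

```python
import difflib


def similarity(generated, reviewed):
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, generated, reviewed).ratio()


def summarize(pairs):
    """Summarize hypothetical (generated, reviewed) transcription pairs."""
    pairs = list(pairs)
    exact = sum(1 for generated, reviewed in pairs if generated == reviewed)
    if pairs:
        mean = sum(similarity(g, r) for g, r in pairs) / len(pairs)
    else:
        mean = 0.0
    return {"total": len(pairs), "exact": exact, "mean_similarity": mean}
```

Running `summarize` once with the old dictionary's output and once with the new one would show whether the share of exact matches and the mean similarity moved towards the reviewed transcriptions.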
@Yorwba Early exported transcriptions: https://downloads.tatoeba.org/exports/transcriptions.tar.bz2
That’s brilliant! Thanks a million, Yorwba.
@Yorwba I installed the new mandarin.xml file on dev.tatoeba.org and I regenerated all the non-manually edited transcriptions for Chinese. I can see that some Pinyin words are now split differently, and that 喜欢 is now transcribed as xǐhuan instead of xǐhuān. 😸 Can you confirm it looks fine?
Yeah, looks good.
As discussed in Tatoeba/tatoeba2#2189, I wrote a script that formats the dictionary data from CC-CEDICT in the mandarin.xml format sinoparserd expects. That means there's no change to the parsing algorithm, but only to the words sinoparserd recognizes.
It turns out that ambiguous parses are not much of an issue. I generated a list of all overlapping words that actually occur in existing sentences on Tatoeba (e.g. 是日/本 vs. 是/日本) and the list in tools/mandarin/override.py is enough to disambiguate all that appear at least five times.
Words or characters that have multiple possible variants/pronunciations are a bit more of a problem. I initially generated tools/mandarin/preference.py based on the frequency of each variant in the corpus and applied some manual corrections afterwards, choosing the more likely option. That means the conversion should be more likely to be correct than not, but there will still be exceptions. E.g. 只 as simultaneously simplified and traditional character vs. 只 as the simplification of traditional 隻 is probably almost equally common.
Nonetheless, when I checked 100 random sentences, I didn't notice any problems with the traditional/simplified conversion, only with the pinyin. Of the changes to the traditional/simplified conversion, 100% were improvements, while the same is true of 77% of the changes to the pinyin. The instances where the pinyin got worse were longer compound words which are now less likely to have spaces separating their components. But that's a minor problem compared to incorrect tones.
I wrote the script to require no dependencies outside the Python standard library and it runs under both Python 2.7 and Python 3.7. But I also committed the output, so you won't need to run it except to get a more current version of CC-CEDICT.
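For readers wondering what "overlapping words" means in practice, here is a minimal sketch of one way to find them. It is not the script from this PR: the word set is a toy example, and the names `find_overlaps` and `count_overlaps` as well as the `max_len` cutoff are assumptions made for illustration.

```python
from collections import Counter


def find_overlaps(sentence, words, max_len=4):
    """Return (left, right) word pairs whose occurrences partially overlap."""
    # Collect every (start, end) span of the sentence that is a known word.
    spans = [
        (start, end)
        for start in range(len(sentence))
        for end in range(start + 1, min(len(sentence), start + max_len) + 1)
        if sentence[start:end] in words
    ]
    return [
        (sentence[s1:e1], sentence[s2:e2])
        for s1, e1 in spans
        for s2, e2 in spans
        if s1 < s2 < e1 < e2  # partial overlap: neither span contains the other
    ]


def count_overlaps(sentences, words):
    """Count how often each overlapping pair occurs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(find_overlaps(sentence, words))
    return counts


# Toy dictionary containing the ambiguous example from the description.
toy_words = {"是", "是日", "日本", "本"}
print(find_overlaps("是日本", toy_words))
```

Counting pairs over the whole corpus and keeping those above a threshold (five occurrences in the description above) would yield the candidates that need a manual override.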