Merge pull request #2 from Yorwba/master

Add Python script to update mandarin.xml using data from CC-CEDICT
Tatoeba · May 12, 2020 · c42dff9
2 parents be4a0ff + ea7f436
Showing 12 changed files with 125,057 additions and 133,003 deletions.
diff --git a/README.md b/README.md
@@ -40,8 +40,57 @@ it provides the following API call, that will return a XML answer
   * [\_script?str=\*](http://localhost:8080/guess\_script?str=\*)
   * [/all?str=\*](http://localhost:8080/all?str=\*)
 
+### Updating Data Files ###
+
+To regenerate `doc/mandarin.xml` with an updated version of CC-CEDICT, run:
+```bash
+python tools/mandarin > doc/mandarin.xml
+```
+
+If any new ambiguous entries have been added to CC-CEDICT, this will fail. In
+that case, add the entries in question to `tools/mandarin/preference.py` to
+specify which variant should be used.
+
+### Evaluating Transcriptions ###
+
+To evaluate changes to the transcription engine, `tools/batch_transcribe.py` and
+`tools/diff` can be used as follows:
+
+1. Get the list of Mandarin sentences from Tatoeba:
+```bash
+wget 'https://downloads.tatoeba.org/exports/per_language/cmn/cmn_sentences.tsv.bz2'
+bunzip2 cmn_sentences.tsv.bz2
+```
+2. Run `sinoparserd` with the old configuration:
+```bash
+sinoparserd -m old_mandarin.xml
+```
+3. Transcribe all sentences:
+```bash
+cat cmn_sentences.tsv | tools/batch_transcribe.py > old_cmn_transcriptions.tsv
+```
+4. Run `sinoparserd` with the new configuration and repeat step 3, writing the output to `new_cmn_transcriptions.tsv`.
+5. Generate a report of the differences:
+```bash
+python tools/diff/ {old,new}_cmn_transcriptions.tsv > report.html
+```
+6. View the generated HTML in a browser.
+7. To compare against manually edited transcriptions, download them from Tatoeba:
+```bash
+wget 'https://downloads.tatoeba.org/exports/transcriptions.tar.bz2'
+tar xf transcriptions.tar.bz2
+```
+8. Include them in the comparison:
+```bash
+python tools/diff/ {old,new}_cmn_transcriptions.tsv transcriptions.csv > report.html
+```
+
 ## License
 
-All the source code is licensed under GPLv3, the xml files are under their own license, it's a "open one" but i need to check which one, certainly CC-BY-SA
-so for the moment I would recommend people to use their own data files for "public usage" and use the provided xml only for "test" purpose.
+All the source code is licensed under GPLv3; the XML files are under their own licenses.
+
+The license for `cantonese.xml` (likely sourced from cantodict) is an open one, but I still need to check which; most likely CC-BY-SA.
+
+The license for `mandarin.xml` (sourced from CC-CEDICT) is CC BY-SA 4.0. See the comment at the beginning of the file for more details.
 
+So for the moment, I recommend using your own data files for public use and the provided XML files only for testing.
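For a quick check before generating the full HTML report, the two transcription files from steps 3 and 4 of the README can be compared directly. The sketch below is not `tools/diff`, whose implementation is not shown in this view. It assumes only the five tab-separated columns that `tools/batch_transcribe.py` prints (sentence id, language, script, username, transcription). The names `load_transcriptions` and `changed_rows` are hypothetical, and the sample rows are invented for illustration.

```python
def load_transcriptions(lines):
    """Map (sentence id, script) -> transcription for one TSV export."""
    table = {}
    for line in lines:
        line = line.rstrip('\n')
        if not line:
            continue  # skip blank lines instead of failing on them
        n, lang, script, user, transcription = line.split('\t', 4)
        table[(n, script)] = transcription
    return table


def changed_rows(old_lines, new_lines):
    """Return (id, script, old, new) tuples where the two runs differ.

    A side that has no row for a given (id, script) is reported as None.
    """
    old = load_transcriptions(old_lines)
    new = load_transcriptions(new_lines)
    changes = []
    for key in sorted(set(old) | set(new)):
        if old.get(key) != new.get(key):
            changes.append(key + (old.get(key), new.get(key)))
    return changes


# Invented sample rows in the assumed five-column format.
old_rows = ['1\tcmn\tLatn\t\tNi hao.\n', '2\tcmn\tLatn\t\tXiexie.\n']
new_rows = ['1\tcmn\tLatn\t\tNi hao.\n', '2\tcmn\tLatn\t\tXie xie.\n']
for row in changed_rows(old_rows, new_rows):
    print('\t'.join(value or '' for value in row))
```

Both functions accept any iterable of lines, so open file objects for `old_cmn_transcriptions.tsv` and `new_cmn_transcriptions.tsv` can be passed in place of the sample lists.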
diff --git a/doc/mandarin.xml b/doc/mandarin.xml
diff --git a/src/Utf8String.cpp b/src/Utf8String.cpp
@@ -137,5 +137,5 @@ std::string Utf8String::substr(size_t start, size_t size) const {
  *
  */
 std::ostream& operator<< (std::ostream& stream, const Utf8String& utf8String) { 
-    stream << utf8String.to_string();
+    return stream << utf8String.to_string();
 }
diff --git a/tools/batch_transcribe.py b/tools/batch_transcribe.py
@@ -0,0 +1,64 @@
+#!/usr/bin/python
+# -*- coding: utf-8 -*-
+
+"""This script uses sinoparserd to generate transcriptions for several sentences
+read from standard input and outputs them in the same format as Tatoeba's
+exported transcription files, in order to make it easy to compare them.
+"""
+
+from __future__ import print_function
+
+import re
+import xml.etree.ElementTree as ET
+
+try:
+    from urllib.request import quote, urlopen
+except ImportError:  # Python 2
+    from urllib2 import quote, urlopen
+
+
+def utf8(text):
+    if not isinstance(text, str):  # Python 2 `unicode` object
+        return text.encode('utf-8')
+    return text
+
+
+def basic_pinyin_cleanup(text):
+    # See tatoeba2/src/Lib/Autotranscription.php: _basic_pinyin_cleanup
+    text = re.sub(r'\s+([!?:;.,])', r'\1', text)
+    text = re.sub(r'"\s*([^"]+)\s*"', r'"\1"', text)
+    text = text[:1].upper() + text[1:]  # [:1] tolerates an empty string
+    return text
+
+
+def transcribe(text):
+    response = urlopen('http://localhost:8080/all?str='+quote(text))
+    xml = ET.fromstring(response.read())
+    data = {child.tag: utf8(child.text) for child in xml}
+    script = {
+        'simplified_script': 'Hans',
+        'traditional_script': 'Hant'
+    }[data['script']]
+    alternate_script = {'Hans': 'Hant', 'Hant': 'Hans'}[script]
+    alternate_script_text = data['alternateScript']
+    romanization = data['romanization']
+    transcriptions = {
+        alternate_script: alternate_script_text,
+        'Latn': basic_pinyin_cleanup(romanization),
+    }
+    return transcriptions
+
+
+def main(argv):
+    from sys import stdin
+    user = '' # automatic transcriptions are marked by an empty username
+    for line in stdin:
+        n, lang, text = line.rstrip('\n').split('\t', 2)
+        transcriptions = transcribe(text)
+        for script, transcription in sorted(transcriptions.items()):
+            print(n, lang, script, user, transcription, sep='\t')
+
+
+if __name__ == '__main__':
+    import sys
+    main(sys.argv)
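The parsing step in `transcribe` can be tried without a running `sinoparserd` by feeding it a hand-written reply. The sketch below assumes, based only on the keys the script reads, that the reply is a flat XML document whose children include `script`, `alternateScript`, and `romanization`. The root element name `root` and the sample values are invented, and `parse_response` is a hypothetical helper that is not part of this commit. It targets Python 3 only.

```python
import re
import xml.etree.ElementTree as ET


def basic_pinyin_cleanup(text):
    # Mirrors the cleanup rules in tools/batch_transcribe.py; slicing with
    # [:1] keeps an empty string from raising IndexError.
    text = re.sub(r'\s+([!?:;.,])', r'\1', text)
    text = re.sub(r'"\s*([^"]+)\s*"', r'"\1"', text)
    return text[:1].upper() + text[1:]


def parse_response(xml_text):
    """Turn one (assumed) /all reply into a {script: transcription} dict."""
    root = ET.fromstring(xml_text)
    # Empty elements have text None; treat them as empty strings.
    data = {child.tag: child.text or '' for child in root}
    script = {
        'simplified_script': 'Hans',
        'traditional_script': 'Hant',
    }[data['script']]
    alternate_script = {'Hans': 'Hant', 'Hant': 'Hans'}[script]
    return {
        alternate_script: data['alternateScript'],
        'Latn': basic_pinyin_cleanup(data['romanization']),
    }


# Invented reply; the real element layout may differ.
SAMPLE = (
    '<root>'
    '<script>simplified_script</script>'
    '<alternateScript>你好嗎?</alternateScript>'
    '<romanization>ni3 hao3 ma5 ?</romanization>'
    '</root>'
)
print(parse_response(SAMPLE))
```

Keeping the parsing separate from the HTTP call makes it possible to check the script mapping and the pinyin cleanup against fixed inputs before running the full batch.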