Merge pull request #2 from Yorwba/master
Add Python script to update mandarin.xml using data from CC-CEDICT
jiru authored May 12, 2020
2 parents be4a0ff + ea7f436 commit c42dff9
Showing 12 changed files with 125,057 additions and 133,003 deletions.
53 changes: 51 additions & 2 deletions README.md
@@ -40,8 +40,57 @@ it provides the following API call, that will return a XML answer
* [/guess\_script?str=\*](http://localhost:8080/guess\_script?str=\*)
* [/all?str=\*](http://localhost:8080/all?str=\*)

### Updating Data Files ###

To regenerate `doc/mandarin.xml` with an updated version of CC-CEDICT, run
```bash
python tools/mandarin > doc/mandarin.xml
```

If any new ambiguous entries have been added to CC-CEDICT, this will fail. In
that case, add the entries in question to `tools/mandarin/preference.py` to
specify which variant should be used.

### Evaluating Transcriptions ###

To evaluate changes to the transcription engine, `tools/batch_transcribe.py` and
`tools/diff` can be used as follows:

1. Get the list of Mandarin sentences from Tatoeba:
```bash
wget 'https://downloads.tatoeba.org/exports/per_language/cmn/cmn_sentences.tsv.bz2'
bunzip2 cmn_sentences.tsv.bz2
```
2. Run `sinoparserd` with the old configuration
```bash
sinoparserd -m old_mandarin.xml
```
3. Transcribe all sentences
```bash
cat cmn_sentences.tsv | tools/batch_transcribe.py > old_cmn_transcriptions.tsv
```
4. Run `sinoparserd` with the new configuration and repeat.
5. Generate a report of the differences
```bash
python tools/diff/ {old,new}_cmn_transcriptions.tsv > report.html
```
6. View the generated HTML in a browser.
7. To compare against manually edited transcriptions, download them from Tatoeba
```bash
wget 'https://downloads.tatoeba.org/exports/transcriptions.tar.bz2'
tar xf transcriptions.tar.bz2
```
8. And include them in the comparison
```bash
python tools/diff/ {old,new}_cmn_transcriptions.tsv transcriptions.csv > report.html
```

## License

-All the source code is licensed under GPLv3, the xml files are under their own license, it's a "open one" but i need to check which one, certainly CC-BY-SA
-so for the moment I would recommend people to use their own data files for "public usage" and use the provided xml only for "test" purpose.
+All the source code is licensed under GPLv3, the xml files are under their own license.
+
+The license for `cantonese.xml` (likely sourced from cantodict) is an "open one" but i need to check which one, certainly CC-BY-SA.
+
+The license for `mandarin.xml` (sourced from CC-CEDICT) is CC BY-SA 4.0. See the comment at the beginning of the file for more details.
+
+So for the moment I would recommend people to use their own data files for "public usage" and use the provided xml only for "test" purpose.
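
The evaluation steps in the README hunk above end with `tools/diff` producing an HTML report. For a quick sanity check before generating that report, the comparison idea can be sketched in a few lines of Python. This is a hypothetical helper, not the code in `tools/diff`; the function names `index_transcriptions` and `changed_rows` are made up for illustration, and the five-column row layout is assumed from the `print` call in `tools/batch_transcribe.py`.

```python
def index_transcriptions(lines):
    """Map (sentence id, script) to the transcription text.

    Assumes tab-separated rows of the form
    id, language, script, user, transcription.
    """
    index = {}
    for line in lines:
        n, lang, script, user, transcription = line.rstrip('\n').split('\t', 4)
        index[(n, script)] = transcription
    return index


def changed_rows(old_lines, new_lines):
    """Return (key, old, new) for every transcription that differs."""
    old = index_transcriptions(old_lines)
    new = index_transcriptions(new_lines)
    return [
        (key, old[key], new[key])
        for key in sorted(old)
        if key in new and old[key] != new[key]
    ]


if __name__ == '__main__':
    # File names follow the README steps; adjust them if yours differ.
    with open('old_cmn_transcriptions.tsv', encoding='utf-8') as f_old, \
            open('new_cmn_transcriptions.tsv', encoding='utf-8') as f_new:
        for key, before, after in changed_rows(f_old, f_new):
            print(key, before, after, sep='\t')
```

Rows that exist in only one of the two files are ignored here, which keeps the sketch short but means it reports fewer differences than a full report would.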
250,896 changes: 117,896 additions & 133,000 deletions doc/mandarin.xml

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion src/Utf8String.cpp
@@ -137,5 +137,5 @@ std::string Utf8String::substr(size_t start, size_t size) const {
*
*/
std::ostream& operator<< (std::ostream& stream, const Utf8String& utf8String) {
-    stream << utf8String.to_string();
+    return stream << utf8String.to_string();
}
64 changes: 64 additions & 0 deletions tools/batch_transcribe.py
@@ -0,0 +1,64 @@
#!/usr/bin/python
# -*- coding: utf-8 -*-

"""This script uses sinoparserd to generate transcriptions for several sentences
read from standard input and outputs them in the same format as Tatoeba's
exported transcription files, in order to make it easy to compare them.
"""

from __future__ import print_function

import re
import xml.etree.ElementTree as ET

try:
    from urllib.request import quote, urlopen
except ImportError:  # Python 2
    from urllib2 import quote, urlopen


def utf8(text):
    if type(text) != str:
        return text.encode('utf-8')
    return text


def basic_pinyin_cleanup(text):
    # See tatoeba2/src/Lib/Autotranscription.php: _basic_pinyin_cleanup
    text = re.sub(r'\s+([!?:;.,])', r'\1', text)
    text = re.sub(r'"\s*([^"]+)\s*"', r'"\1"', text)
    text = text[0].upper() + text[1:]
    return text


def transcribe(text):
    response = urlopen('http://localhost:8080/all?str='+quote(text))
    xml = ET.fromstring(response.read())
    data = {child.tag: utf8(child.text) for child in xml}
    script = {
        'simplified_script': 'Hans',
        'traditional_script': 'Hant'
    }[data['script']]
    alternate_script = {'Hans': 'Hant', 'Hant': 'Hans'}[script]
    alternate_script_text = data['alternateScript']
    romanization = data['romanization']
    transcriptions = {
        alternate_script: alternate_script_text,
        'Latn': basic_pinyin_cleanup(romanization),
    }
    return transcriptions


def main(argv):
    from sys import stdin
    user = ''  # automatic transcriptions are marked by an empty username
    for line in stdin.readlines():
        n, lang, text = line.rstrip('\n').split('\t', 2)
        transcriptions = transcribe(text)
        for script, transcription in sorted(transcriptions.items()):
            print(n, lang, script, user, transcription, sep='\t')


if __name__ == '__main__':
    import sys
    main(sys.argv)
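
To see what the script does to a single line without a running `sinoparserd`, the text handling can be exercised on its own. The sketch below copies `basic_pinyin_cleanup` from the file above and adds two hypothetical helpers, `parse_line` and `format_rows`, that mirror the splitting and printing done in `main`; the sample romanization is made up and does not come from the server.

```python
import re


def basic_pinyin_cleanup(text):
    # Copied from tools/batch_transcribe.py in this commit.
    text = re.sub(r'\s+([!?:;.,])', r'\1', text)
    text = re.sub(r'"\s*([^"]+)\s*"', r'"\1"', text)
    text = text[0].upper() + text[1:]
    return text


def parse_line(line):
    """Split one input row into id, language and sentence text (hypothetical helper)."""
    n, lang, text = line.rstrip('\n').split('\t', 2)
    return n, lang, text


def format_rows(n, lang, transcriptions, user=''):
    """Build the output rows the way main() prints them (hypothetical helper).

    An empty user marks the transcription as automatic.
    """
    return [
        '\t'.join([n, lang, script, user, transcription])
        for script, transcription in sorted(transcriptions.items())
    ]


n, lang, text = parse_line('1\tcmn\t你好!\n')
# Made-up stand-in for what sinoparserd would return for this sentence.
transcriptions = {'Hant': text, 'Latn': basic_pinyin_cleanup('ni3 hao3 !')}
for row in format_rows(n, lang, transcriptions):
    print(row)
```

Note that `text[0]` raises an `IndexError` on an empty romanization, so a caller feeding arbitrary input may want to guard against empty strings.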