This is a TypeScript/JavaScript library for Node.js (not the browser) that weaves together a Japanese NLP (natural language processing) pipeline aimed at Japanese language learners, using the following technologies:
- MeCab, the Japanese morphological parser and part-of-speech tagger;
- J.DepP, the bunsetsu chunker and dependency parser that consumes MeCab output;
- JMdict, the free open-source Japanese-to-many-languages dictionary;
- JMdict-Simplified, JMdict in JSON;
- JMdict-Furigana, mapping JMdict entries to accurate furigana (like 食べ物);
- Kanjidic2, a database of kanji (漢字, i.e., Chinese characters) and their components, affiliated with JMdict;
- Kamiya-Codec, which conjugates and deconjugates Japanese verbs and adjectives based on Taeko Kamiya's textbooks;
- in the same vein, this library wraps the taxonomy of particles from Naoko Chino's "All About Particles: A Handbook of Japanese Function Words".
In practical terms, this library will take a sentence like this:
へましたらリーダーに切られるだけ
and give you the following:
- furigana like へましたらリーダーに切られるだけ;
- morpheme and bunsetsu boundaries (note that the input didn't contain any spaces, so these are inferred):
- へま し たら │ リーダー に │ 切ら れる だけ (spaces are morpheme boundaries, `│` marks bunsetsu boundaries)
- as well as bunsetsu dependencies, allowing you to reconstruct something like this:
へましたら━━┓
リーダーに━━┫
切られるだけ
- a list of conjugated verbs and adjectives, in this case:
- したら = 為る + `Tara` form,
- 切られる = 切る + `ReruRareru` + `Dictionary` form, and
- (much less usefully) れる = れる + `Dictionary` form;
- a list of particles,
- に, and
- だけ, both accompanied by all possible usage patterns in Naoko Chino's book,
- a long list of dictionary "hits" (like search engine "hits"), broken down by the starting morpheme and ending morpheme, so:
- へま
- 「へま」 blunder/bungle/gaffe
- 「へまをやる」 to commit a blunder
- したら
- 下 「した」 below/down/under/younger (e.g. daughter); bottom; beneath/underneath; etc.
- 舌 「した」 tongue; tongue-like object/clapper (of a bell)/talon (of a lock)
- 簧 「した」 reed (of a musical instrument)
- etc.
- し
- 為る 「する」 to do/to carry out/to perform; to cause to become/to make (into)/to turn (into); to serve as/to act as/to work as; etc.
- 成る・為る 「なる」 to become/to get/to grow/to turn/to reach/to attain; to result in/to turn out/to end up/to prove (to be); to consist of/to be composed of/to be made up of; etc.
- 刷る・摺る 「する」 to print; to color or pattern fabric using a wooden mold
- 掏る 「する・スる」 to pickpocket/to steal
- etc.
- たら
- 多 「た」 multi-
- 誰 「だれ・だあれ・たれ・た」 who
- 田 「た」 rice field
- 他 「た」 other (esp. people and abstract matters)
- etc.
- リーダー
- 「リーダー・リーダ」 leader; reader/reading book; reader (i.e. someone who reads); etc.
- リーダー格 「リーダーかく」 leader/leading figure
- リーダー制 「リーダーせい」 leadership organization/leadership organisation
- etc.
- に切ら
- 煮切る 「にきる」 to boil down sake, mirin, etc. to reduce the alcohol content
- に
- 「に」 at (place, time)/in/on/during; to (direction, state)/toward/into; for (purpose); etc.
- 荷 「に」 load/baggage/cargo/freight/goods; burden/responsibility
- 似 「に」 looking like (someone)/taking after (either of one's parents)
- etc.
- 切ら
- れる
- だけ
- kanji and their breakdowns per Kanjidic2, in this case just that 切 is pronounced 「セツ・サイ・き.る・-き.る・き.り・-き.り・-ぎ.り・き.れる・-き.れる・き.れ・-き.れ・-ぎ.れ」, meaning "cut; cutoff; be sharp", used in names きつ・きり・ぎり, and is made up of
- 七 「シチ・なな・なな.つ・なの」 "seven" (names: し・しっ・な・ひち)
- 刀 「トウ・かたな・そり」 "sword; saber; knife" (names: き・ち・と・わき)
All of the above information is returned as a JavaScript object or in JSON (if accessed by the built-in web server).
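To picture what "broken down by the starting morpheme and ending morpheme" means, here is a minimal sketch that lists every contiguous run of morphemes in the example sentence. It is only an illustration: `enumerateRuns`, `MorphemeRun`, and the `maxLength` cap are hypothetical names, not part of Curtiz, and Curtiz's real lookup logic may well differ.

```typescript
// Hypothetical illustration (not Curtiz's actual code): every contiguous run
// of morphemes is a candidate string to look up in the dictionary.
interface MorphemeRun {
  start: number; // index of the first morpheme in the run
  end: number;   // index of the last morpheme in the run (inclusive)
  text: string;  // the morphemes joined back together
}

function enumerateRuns(morphemes: string[], maxLength = 3): MorphemeRun[] {
  const runs: MorphemeRun[] = [];
  for (let start = 0; start < morphemes.length; start++) {
    // Never run past the end of the sentence or past `maxLength` morphemes.
    const lastEnd = Math.min(morphemes.length, start + maxLength) - 1;
    for (let end = start; end <= lastEnd; end++) {
      runs.push({start, end, text: morphemes.slice(start, end + 1).join('')});
    }
  }
  return runs;
}

// Morpheme boundaries taken from the example above.
const morphemes = ['へま', 'し', 'たら', 'リーダー', 'に', '切ら', 'れる', 'だけ'];
const runs = enumerateRuns(morphemes, 2);
console.log(runs.map(run => `${run.start}-${run.end}: ${run.text}`).join('\n'));
```

Runs such as したら (morphemes 1 to 2) and に切ら (morphemes 4 to 5) are the kind of spans under which the hits above are grouped.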
As you can tell from the above, Curtiz gives you a lot of information that might be related to your text but might not be. There are two reasons for this:
- Japanese has a great many homophones, and its writing system allows for considerable ambiguity. Nonetheless, you can imagine a better version of Curtiz that is much smarter about discarding useless information: for example, the dictionary entries listed for `たら` aren't sensible because they're entries for `た`, which brings us to
- Curtiz would much rather provide you with (a lot of) useless information than risk omitting data that is useful to the learner.
First, make sure you have Git and Node.js installed (any recent version is fine).
Then install MeCab, Unidic, and J.DepP. MeCab and Unidic are easy to install on macOS via Homebrew, but J.DepP is a "normal" old-school Unix C++ build (`./configure --with-mecab-dict=UNI && make …`; `./configure --help` is useful and explains what `with-mecab-dict` is doing). If you've never built such a project before, do your best to follow the instructions and open an issue if you need help.
Then, download the following required files (TODO: automatically download these!):
- `jmdict-eng-*.json` from JMdict-Simplified
- `JmdictFurigana.json` from JMdict-Furigana
- `kanjidic2.xml.gz` from Kanjidic
If you already have your own Node.js project, install Curtiz as a dependency:
npm i https://github.com/fasiha/curtiz-japanese-nlp
Drop the three dependency files above into your project and skip to the "API" section below.
If you plan to interact with Curtiz just through a JSON web server, the easiest thing to do is to just set up a mini-Node.js package that'll spin up the server:
- `mkdir CURTIZ` makes a new directory (named `CURTIZ` here, but please change this)
- `cd CURTIZ` enters the new directory
- Put the three dependency files into this directory
- `npm init -y` will initialize an empty Node.js package
- `npm i https://github.com/fasiha/curtiz-japanese-nlp` will install Curtiz as a dependency
- `npx curtiz-annotate-server` will start the web server on http://127.0.0.1:8133 (you can pick another port, for example 8888, via `PORT=8888 npx curtiz-annotate-server`)
- (N.B. If you have multiple copies/versions of `jmdict-simplified`, you can specify the one to use with an environment variable: `JMDICT_SIMPLIFIED_JSON=./jmdict-eng-3.5.0.json npx curtiz-annotate-server`. Environment variables stack, so you can provide both this and the port: `PORT=8888 JMDICT_SIMPLIFIED_JSON=./jmdict-eng-3.5.0.json npx curtiz-annotate-server`)
The first time you run this, it will take several seconds while it builds a Leveldb cache of JMdict.
Now you're ready to hit a REST endpoint. The following will ask `curl` to POST a Japanese sentence in a specific JSON structure to the appropriate endpoint, and save the result to `curtiz.json`:
curl --data '{"sentence": "へましたらリーダーに切られるだけ"}' \
-X POST \
-H "Content-Type: application/json" \
-o curtiz.json \
http://127.0.0.1:8133/api/v1/sentence
As described below, I need to formally describe the structure of this data. In the meantime, please check the tests and the TypeScript interfaces, especially the `v1ResSentenceAnalyzed` type, to see what data is where.
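If you would rather call the endpoint from code than from `curl`, the same request can be sketched in TypeScript. This assumes Node.js 18 or later (for the global `fetch`) and a server already running on the default port. The helper names `buildSentenceRequest` and `annotate` are made up for this example, and the response is typed as `unknown` because its structure is not formally documented yet.

```typescript
// A sketch under stated assumptions, not an official client.
interface SentenceRequest {
  url: string;
  init: {method: 'POST'; headers: Record<string, string>; body: string};
}

// Build the same request that the curl command sends.
function buildSentenceRequest(sentence: string, port = 8133): SentenceRequest {
  return {
    url: `http://127.0.0.1:${port}/api/v1/sentence`,
    init: {
      method: 'POST',
      headers: {'Content-Type': 'application/json'},
      body: JSON.stringify({sentence}),
    },
  };
}

// Send the request and return the parsed JSON body.
async function annotate(sentence: string, port = 8133): Promise<unknown> {
  const {url, init} = buildSentenceRequest(sentence, port);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Curtiz server replied with HTTP ${response.status}`);
  }
  return response.json();
}
```

With the server running, `annotate('へましたらリーダーに切られるだけ').then(result => console.dir(result, {depth: null}))` should print the same data that the `curl` command saves to `curtiz.json`.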
In your Node project, create a new file (either TypeScript `demo.ts` or ESM `demo.mjs`). Put the following code into it to import and exercise the package:
// TypeScript or ESM (e.g., `demo.ts` or `demo.mjs`)
import * as curtiz from 'curtiz-japanese-nlp';
curtiz.handleSentence('それは昨日のことちゃった').then(result => console.dir(result, {depth: null}));
(If you're using TypeScript, (1) make sure you compile this, e.g., with `npx tsc -p .`, and run the resulting `demo.js`. Also (2), you may need to set `"target": "es2019"` or later in your `tsconfig.json`.)
Make sure you have the three dependency files above (JMdict-Furigana, JMdict-Simplified, and Kanjidic) in the top-level directory of your project. The first time you run this, Curtiz will spend several seconds building a Leveldb cache for JMdict and will log its progress.
Note that because a Leveldb database can only be opened by one process at a time, you can't run this if you're also running the web server above 😒. If you see an error like `Error [OpenError]: IO error: lock jmdict-simplified/LOCK: Resource temporarily unavailable`, this is Leveldb complaining that some other process has a lock on the database. I should fix this…
This will print out a lot of text, but it will show you everything that Curtiz has done with the sentence.
More details forthcoming, but please check the tests and the TypeScript interfaces, especially the `v1ResSentenceAnalyzed` type, to see what data is where.
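Because the printed output is long, it can be handier to save it to a file and to turn the Leveldb lock error into a clearer message. The sketch below shows one way to do that. `analyzeToFile` and `explainLockError` are hypothetical helpers, not part of Curtiz, and the analyzer is passed in as a parameter on the assumption that `curtiz.handleSentence` takes a sentence and returns a promise of a JSON-serializable object.

```typescript
import {writeFile} from 'node:fs/promises';

// Any function of this shape will do. In a real project you would pass
// `curtiz.handleSentence` (assumed here to fit this signature).
type Analyzer = (sentence: string) => Promise<unknown>;

// Hypothetical helper: rewrite the Leveldb lock error into a hint,
// and pass every other error through unchanged.
function explainLockError(error: unknown): unknown {
  const message = error instanceof Error ? error.message : String(error);
  if (message.includes('LOCK')) {
    return new Error(
        `Leveldb is locked by another process (is the web server still running?): ${message}`);
  }
  return error;
}

// Hypothetical helper: analyze one sentence and save the result as JSON.
async function analyzeToFile(analyze: Analyzer, sentence: string, path: string): Promise<void> {
  try {
    const result = await analyze(sentence);
    await writeFile(path, JSON.stringify(result, null, 2), 'utf8');
  } catch (error) {
    throw explainLockError(error);
  }
}
```

For example, `analyzeToFile(curtiz.handleSentence, 'それは昨日のことちゃった', 'result.json')` would write the analysis to `result.json` instead of printing it.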
Often it can be very helpful to inspect the output of MeCab-Unidic to understand what this module is doing. This library includes a thin wrapper that translates Unidic parts-of-speech, conjugations, inflections, etc., into English (via tables 1, 2, 3, published by GitHub user @masayu-a citing the work of Dr Irena Srdanovic), and exposes a command-line interface: simply pipe multi-line input into `mecabUnidic.js`, for example `cat text | ./mecabUnidic.js` or equivalently `cat text | node mecabUnidic.js`. A simple example on the command line:
cat <<EOF | ./mecabUnidic.js
「ほら、
あれが小学校だよ。」
EOF
This will print out the following Markdown table:
Literal | Pron. | Lemma Read. | Lemma | PoS | Infl. Type | Infl. |
---|---|---|---|---|---|---|
ほら | ホラ | ホラ | ほら | interjection-general | | |
、 | | | 、 | supplementary_symbol-comma | | |
あれ | アレ | アレ | 彼れ | pronoun | | |
が | ガ | ガ | が | particle-case | | |
小 | ショー | ショウ | 小 | prefix | | |
学校 | ガッコー | ガッコウ | 学校 | noun-common-general | | |
だ | ダ | ダ | だ | auxiliary_verb | auxiliary-da | conclusive-general |
よ | ヨ | ヨ | よ | particle-phrase_final | | |
。 | | | 。 | supplementary_symbol-period | | |
」 | | | 」 | supplementary_symbol-bracket_open | | |
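If you want to post-process this table in a script, a small parser is enough. The following is a sketch that assumes exactly the seven columns shown above and that no cell contains a literal `|`. `MecabRow` and `parseMecabTable` are names invented for this example and are not exported by the library.

```typescript
// Hypothetical record type mirroring the seven table columns.
interface MecabRow {
  literal: string;
  pronunciation: string;
  lemmaReading: string;
  lemma: string;
  partOfSpeech: string;
  inflectionType: string;
  inflection: string;
}

function parseMecabTable(markdown: string): MecabRow[] {
  return markdown
      .split('\n')
      .map(line => line.trim())
      .filter(line => line.includes('|'))
      .slice(2) // skip the header row and the ---|--- separator row
      .map(line => {
        // Drop an optional leading or trailing pipe, then split into cells.
        const cells = line.replace(/^\|/, '').replace(/\|$/, '').split('|').map(cell => cell.trim());
        const [literal = '', pronunciation = '', lemmaReading = '', lemma = '', partOfSpeech = '',
               inflectionType = '', inflection = ''] = cells;
        return {literal, pronunciation, lemmaReading, lemma, partOfSpeech, inflectionType, inflection};
      });
}

const sample = [
  'Literal | Pron. | Lemma Read. | Lemma | PoS | Infl. Type | Infl. |',
  '---|---|---|---|---|---|---|',
  'あれ | アレ | アレ | 彼れ | pronoun | | |',
  'だ | ダ | ダ | だ | auxiliary_verb | auxiliary-da | conclusive-general |',
].join('\n');
const rows = parseMecabTable(sample);
console.log(rows.map(row => `${row.literal} (${row.partOfSpeech}) lemma: ${row.lemma}`).join('\n'));
```

Empty cells come back as empty strings, so uninflected words simply have `inflectionType === ''`.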