mecab-web-worker

Using MeCab for Japanese segmentation in the browser. Inspired by fugashi's API.

npm install mecab-web-worker

Compatibility notice: Uses Module Workers and the Compression Streams API. These are not available in every major browser.

import { MecabWorker } from "mecab-web-worker";

const worker = await MecabWorker.create("/unidic-mecab-2.1.2_bin.zip");
const result = await worker.parse("和布蕪は、ワカメの付着器の上にある");
console.log(result);

const nodes = await worker.parseToNodes("和布蕪は、ワカメの付着器の上にある");
for (const node of nodes) {
  console.log(node);
}

MeCab is compiled to WASM and runs in a background thread via the Web Workers API. You must provide a dictionary as a URL to a zip file. The corresponding files are available from the v0.3.0 release (https://github.com/leyhline/mecab-web-worker/releases/tag/v0.3.0). After the first download, the zip file is persisted in the browser cache (using CacheStorage) to avoid repeated downloads.
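The caching behaviour described above can be sketched roughly as follows. This is not the package's actual code, only an illustration of the cache-first pattern with CacheStorage. The function name `fetchDictionary` and the cache name in the usage note are made up for this example; `cache` is assumed to be anything exposing `match()` and `put()` like a browser `Cache` object.

```javascript
// Illustrative sketch, not the library's implementation.
// `cache` is assumed to behave like a browser Cache (match/put);
// `fetchFn` defaults to the global fetch and can be swapped out for testing.
async function fetchDictionary(url, cache, fetchFn = fetch) {
  // 1. Try the cache first to avoid downloading the zip again.
  const cached = await cache.match(url);
  if (cached) {
    return cached.arrayBuffer();
  }
  // 2. Cache miss: download the dictionary.
  const response = await fetchFn(url);
  if (!response.ok) {
    throw new Error(`Failed to download ${url}: ${response.status}`);
  }
  // 3. A response body can only be read once, so store a clone.
  await cache.put(url, response.clone());
  return response.arrayBuffer();
}
```

In a browser this might be called as `fetchDictionary(url, await caches.open("dictionary-cache"))`, where the cache name is an arbitrary choice.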

Motivation

I want to build an interactive tool for aligning Japanese text and audio. Since interactivity is easier to accomplish in the browser, I wanted to go full JS instead of putting e.g. Python in the mix. And since the segmentation functionality is easy to separate, I decided to create an npm package that's hopefully as easy to use as Python's fugashi (a great wrapper around MeCab; check it out, cite it, and sponsor Paul's work).

My uninformed self also drew on the knowledge he published on his blog, e.g. "An Overview of Japanese Tokenizer Dictionaries", and I use his Unidic distribution. Thanks a lot! I hope to build a better understanding of the theory behind all this at a later date.

Technical Background

MeCab was compiled to WASM using Emscripten without wrapper code in C. See the corresponding GitHub Action for the compiler flags.

However, to access a C struct from MeCab, I had to use pointer arithmetic in JavaScript (see mecab-worker.js:MecabNode), which isn't really elegant. I also wrote a simple unzip function using the Compression Streams API, which works (in Chrome at least) but is not completely correct.
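To give an idea of what "pointer arithmetic in JavaScript" means here, below is a minimal sketch of reading a C struct out of WASM linear memory with a `DataView`. The struct layout, the offsets and the name `readNode` are hypothetical and chosen for illustration; they do not match MeCab's real node struct or the code in mecab-worker.js. The sketch assumes 32-bit pointers and little-endian byte order, as in wasm32.

```javascript
// Hypothetical layout, NOT MeCab's real node struct:
//   struct Node { const char *surface; uint16_t length; uint16_t id; float cost; };
const OFFSET_SURFACE = 0; // 4-byte pointer into linear memory
const OFFSET_LENGTH = 4;  // uint16: byte length of the surface string
const OFFSET_ID = 6;      // uint16
const OFFSET_COST = 8;    // float32

// `memory` is an ArrayBuffer (e.g. the buffer of a WebAssembly.Memory),
// `ptr` is the byte offset of the struct inside that buffer.
function readNode(memory, ptr) {
  const view = new DataView(memory);
  // `true` selects little-endian, the byte order of WASM memory.
  const surfacePtr = view.getUint32(ptr + OFFSET_SURFACE, true);
  const length = view.getUint16(ptr + OFFSET_LENGTH, true);
  const id = view.getUint16(ptr + OFFSET_ID, true);
  const cost = view.getFloat32(ptr + OFFSET_COST, true);
  // Follow the pointer and decode `length` bytes of UTF-8.
  const bytes = new Uint8Array(memory, surfacePtr, length);
  const surface = new TextDecoder("utf-8").decode(bytes);
  return { surface, length, id, cost };
}
```

The drawback of this approach is that the offsets have to be kept in sync with the C header by hand, which is why it does not feel elegant.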

TODO

  • Support different dictionaries; this isn't hard but wasn't a use case for me personally.
  • Wrap more of MeCab's functionality like returning nbest results.
  • Polyfills for APIs that are not widely supported by browsers.