[lipi] Add a stronger implementation

This commit adds support for a variety of schemes as defined by the `indic_transliteration` project. It also adds a minimal test suite as a stronger guarantee on program correctness.

`vidyut-lipi` is still immature compared to transliterators like Aksharamukha or `indic_transliteration`. However, it is on a good trajectory, and I think it will become a compelling transliterator backend over time.

Some notes on design:

- This commit does not use any of the code from @skmnktl's in-progress transliterator, but it does borrow the idea of incorporating `indic_transliteration`'s TOML maps directly into the program source code.

- I liked @skmnktl's idea of using a `Token` enum as an intermediate representation between the input scheme and the output scheme, but pursuing that approach felt cumbersome when mapping between *sequences* of characters (e.g. when working with ITRANS), so I've stayed with the approach used by `indic_transliteration`, i.e. using Devanagari as the intermediate representation.
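
To make the second design note concrete, here is a minimal sketch of the "Devanagari as the intermediate representation" idea. It is not the code in this commit: `convert` is a hypothetical helper, and the two tables are a hand-picked subset in the `(devanagari, raw)` shape that `scripts/create_schemes.py` emits. It uses Devanagari only as a lookup key between two roman schemes and ignores everything a real conversion to or from a Brahmic script needs (implicit vowels, vowel marks, virama).

```rust
use std::collections::HashMap;

// Toy tables in the `(devanagari, raw)` shape that `create_schemes.py` emits.
// The entries are a hand-picked subset for illustration, not the generated data.
const SLP1: &[(&str, &str)] = &[("द", "d"), ("व", "v"), ("अ", "a"), ("ए", "e"), ("औ", "O")];
const IAST: &[(&str, &str)] = &[("द", "d"), ("व", "v"), ("अ", "a"), ("ए", "e"), ("औ", "au")];

/// Hypothetical helper (not the crate's API): maps each source token to its
/// Devanagari key, then emits the target scheme's spelling for that key.
fn convert(input: &str, from: &[(&str, &str)], to: &[(&str, &str)]) -> String {
    let raw_to_deva: HashMap<&str, &str> = from.iter().map(|(d, r)| (*r, *d)).collect();
    let deva_to_raw: HashMap<&str, &str> = to.iter().map(|(d, r)| (*d, *r)).collect();
    // Longest source token, in bytes.
    let max_len = from.iter().map(|(_, r)| r.len()).max().unwrap_or(0);

    let mut out = String::new();
    let mut i = 0;
    while i < input.len() {
        let longest = max_len.min(input.len() - i);
        // Prefer the longest source token so that "au" wins over "a".
        let hit = (1..=longest).rev().find_map(|len| {
            // `get` returns None if the slice would split a character.
            let chunk = input.get(i..i + len)?;
            raw_to_deva.get(chunk).map(|deva| (len, *deva))
        });
        match hit {
            Some((len, deva)) => {
                // Fall back to the Devanagari key if the target has no spelling.
                out.push_str(deva_to_raw.get(deva).copied().unwrap_or(deva));
                i += len;
            }
            None => {
                // Unknown input: copy one character through unchanged.
                let ch = input[i..].chars().next().unwrap();
                out.push(ch);
                i += ch.len_utf8();
            }
        }
    }
    out
}
```

With these toy tables, `convert("devO", SLP1, IAST)` returns `"devau"` and `convert("devau", IAST, SLP1)` returns `"devO"`. The longest-match loop is what handles multi-character *sequences* such as `au`, and it needs no scheme-specific token type.
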
ambuda-org · Dec 26, 2023 · 60f5f95
1 parent 2eb4434
commit 60f5f95
Showing 13 changed files with 3,230 additions and 186 deletions.
diff --git a/Cargo.lock b/Cargo.lock
diff --git a/vidyut-lipi/.gitignore b/vidyut-lipi/.gitignore
@@ -0,0 +1 @@
+www/static/wasm
diff --git a/vidyut-lipi/Cargo.toml b/vidyut-lipi/Cargo.toml
@@ -1,9 +1,23 @@
 [package]
 name = "vidyut-lipi"
 version = "0.1.0"
+authors = ["Arun Prasad <[email protected]>"]
+description = "A Sanskrit transliterator"
+homepage = "https://github.com/ambuda-org/vidyut"
+repository = "https://github.com/ambuda-org/vidyut"
+categories = ["text-processing"]
+keywords = ["sanskrit"]
+license = "MIT"
 edition = "2021"
 
 # See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
 
 [dependencies]
+rustc-hash = "1.1.0"
 clap = { version = "4.0.12", features = ["derive"] }
+wasm-bindgen = "0.2"
+serde-wasm-bindgen = "0.4"
+console_error_panic_hook = "0.1.7"
+
+[lib]
+crate-type = ["cdylib", "rlib"]
diff --git a/vidyut-lipi/Makefile b/vidyut-lipi/Makefile
@@ -0,0 +1,5 @@
+debugger:
+	./scripts/run-debugger.sh
+
+test:
+	cargo nextest run --no-fail-fast --status-level=fail
diff --git a/vidyut-lipi/README.md b/vidyut-lipi/README.md
@@ -1,17 +1,78 @@
-*vidyut-lipi* is a work-in-progress transliterator. It is not ready for public use.
+<div align="center">
+<h1><code>vidyut-lipi</code></h1>
+<p><i>A fast Indic transliterator</i></p>
+</div>
+
+`vidyut-lipi` is an experimental Sanskrit transliteration library that also
+supports many of the scripts used within the Indosphere. Our goal is to provide
+a standard transliterator for the Sanskrit ecosystem that is easy to bind to
+other programming languages.
+
+This [crate][crate] is under active development as part of the [Ambuda][ambuda]
+project. If you enjoy our work and wish to contribute to it, we encourage you
+to [join our Discord server][discord], where you can meet other Sanskrit
+programmers and enthusiasts.
+
+An online demo is available [here][demo].
+
+[crate]: https://doc.rust-lang.org/book/ch07-01-packages-and-crates.html
+[ambuda]: https://ambuda.org
+[discord]: https://discord.gg/7rGdTyWY7Z
+[demo]: https://ambuda-org.github.io/vidyut-lipi/
+
+- [Overview](#overview)
+- [Usage](#usage)
+- [Design](#design)
+
+
+Overview
+--------
+
+Communities around the world write Sanskrit and other Indian languages in
+different scripts in different contexts. For example, a user might type
+Sanskrit in ITRANS, read it in Kannada, and publish it in Devanagari. Such
+communities often rely on a *transliterator*, which converts text from one
+scheme to another.
+
+While various transliterators exist, none are both high-quality and widely
+available in different programming languages. The result is that maintenance
+and feature work is diluted across several different implementations.
+
+`vidyut-lipi` aims to provide a standard transliterator for the Sanskrit
+ecosystem. Our priorities are:
+
+- quality, including a comprehensive test suite.
+- coverage across all of the schemes in common use.
+- ease of use (and reuse) for developers.
+- high performance across various metrics, including runtime, startup time, and
+  file size.
+
+We recommend `vidyut-lipi` if you need a simple and high-quality
+transliteration library, and we encourage you to [file an issue][issue] if
+`vidyut-lipi` does not support your use case. We are especially excited about
+supporting new scripts and new programming languages.
+
+[issue]: https://github.com/ambuda-org/vidyut/issues
+
+If `vidyut-lipi` is not right for your needs, we also strongly recommend
+the [Aksharamukha][aksharamukha] and [indic-transliteration][indic-trans]
+projects, which have each been highly influential in our work on `vidyut-lipi`.
+
+[aksharamukha]: https://github.com/virtualvinodh/aksharamukha/
+[indic-trans]: https://github.com/indic-transliteration
 
 
 Usage
 -----
 
+For simple use cases that aren't very performance-sensitive, we recommend using
+`vidyut-lipi` like so:
+
 ```rust
 use vidyut_lipi::{Scheme, transliterate};
 
-let result = transliterate("devau", Scheme::Iast, Scheme::Slp1);
-assert_eq!(result, "devO");
+let result = transliterate("devO", Scheme::Slp1, Scheme::Iast);
+assert_eq!(result, "devau");
 ```
 
-```shell
-# Run transliteration
-$ cargo run --bin transliterate -- --text rāmau 
-```
+We are still stabilizing our API and will share more examples here soon.
diff --git a/vidyut-lipi/scripts/create_schemes.py b/vidyut-lipi/scripts/create_schemes.py
@@ -0,0 +1,186 @@
+#!/usr/bin/env python3
+"""Creates schemes for vidyut-lipi and writes them to `src/schemes.rs`.
+
+We create these mappings by modifying the data in the `common_maps` dir from
+the indic-transliteration project.
+"""
+
+import tomllib
+import subprocess
+from pathlib import Path
+from glob import glob
+import shutil
+
+CRATE_DIR = Path(__file__).parent.parent
+
+VOWEL_TO_MARK = {
+    "आ": "\u093e",
+    "इ": "\u093f",
+    "ई": "\u0940",
+    "उ": "\u0941",
+    "ऊ": "\u0942",
+    "ऋ": "\u0943",
+    "ॠ": "\u0944",
+    "ऌ": "\u0962",
+    "ॡ": "\u0963",
+    "ऎ": "\u0946",
+    "ए": "\u0947",
+    "ऐ": "\u0948",
+    "ऒ": "\u094a",
+    "ओ": "\u094b",
+    "औ": "\u094c",
+}
+
+ALLOWED = {
+    "BENGALI",
+    "BRAHMI",
+    "DEVANAGARI",
+    "GUJARATI",
+    "GURMUKHI",
+    "GRANTHA",
+    "KANNADA",
+    "MALAYALAM",
+    "ORIYA",
+    "SINHALA",
+    "TAMIL",
+    "TELUGU",
+    "TIBETAN",
+
+    "HK",
+    "IAST",
+    "ITRANS",
+    "SLP1",
+    "VELTHUIS",
+}
+
+
+def _sanitize(s: str) -> str:
+    return s.replace("\\", "\\\\").replace('"', '\\"')
+
+
+def _maybe_override(name: str, deva: str, raw: str) -> str | None:
+    if name == "BRAHMI":
+        if deva == "\u0946":
+            # short e mark
+            return None
+        if deva == "\u094a":
+            # short o mark
+            return None
+    elif name == "HK":
+        if raw == "|":
+            return "."
+        if raw == "||":
+            return ".."
+    elif name == "IAST":
+        if deva == "ळ":
+            return "ḻ"
+        if deva == "ऴ":
+            return None
+        if raw == "|":
+            return "."
+        if raw == "||":
+            return ".."
+    elif name == "VELTHUIS":
+        # These are part of the Velthuis spec but are errors in indic-transliteration.
+        if deva == "ॠ":
+            return ".R"
+        if deva == "ॡ":
+            return ".L"
+    return raw
+
+
+def create_scheme_str(name: str, items: list[tuple[str, str]]) -> str:
+    buf = []
+
+    buf.append(f"pub const {name}: &[(&str, &str)] = &[")
+    for deva, raw in items:
+        deva = _sanitize(deva)
+        raw = _sanitize(raw)
+        buf.append(f'    ("{deva}", "{raw}"),')
+    buf.append("];\n")
+
+    return "\n".join(buf)
+
+
+def main():
+    repo = "https://github.com/indic-transliteration/common_maps.git"
+    common_maps = Path("common_maps")
+    if not common_maps.exists():
+        print("Cloning `common_maps` ...")
+        subprocess.run(["git", "clone", "--depth", "1", repo], check=True)
+
+    print("Creating schemes ...")
+    buf = [
+        "#![allow(unused)]",
+        "",
+        "//! Auto-generated scheme data.",
+        "//!",
+        "//! These schemes were auto-generated from the `common_maps` repository",
+        "//! from the `indic-transliteration` project.",
+        "",
+    ]
+    for path in sorted(glob("common_maps/**/*.toml")):
+        with open(path, "rb") as f:
+            data = tomllib.load(f)
+
+        scheme_name = Path(path).stem.upper()
+        if scheme_name not in ALLOWED:
+            continue
+
+        scheme_type = Path(path).parent.stem
+        assert scheme_type in {"roman", "brahmic"}, scheme_type
+
+        scheme_items = []
+        raw_to_deva = {}
+
+        for category in data:
+            if category.startswith("_"):
+                # Ignore file comments, etc.
+                continue
+
+            if category == "shortcuts":
+                # TODO: support these
+                continue
+
+            if category.endswith("alternates"):
+                for raw_main, alts in data[category].items():
+                    deva = raw_to_deva.get(raw_main)
+                    if deva is None:
+                        continue
+                    for alt in alts:
+                        assert isinstance(deva, str)
+                        assert isinstance(alt, str)
+                        alt = _maybe_override(scheme_name, deva, alt)
+                        if alt is not None:
+                            scheme_items.append((deva, alt))
+            else:
+                for deva, raw in data[category].items():
+                    assert isinstance(deva, str)
+                    assert isinstance(raw, str)
+                    raw = _maybe_override(scheme_name, deva, raw)
+                    if raw is not None:
+                        raw_to_deva[raw] = deva
+                        scheme_items.append((deva, raw))
+
+                if scheme_type == "roman" and category == "vowels":
+                    for vowel, raw in data[category].items():
+                        raw = _maybe_override(scheme_name, vowel, raw)
+                        mark = VOWEL_TO_MARK.get(vowel)
+                        if mark:
+                            assert isinstance(mark, str)
+                            assert isinstance(raw, str)
+                            scheme_items.append((mark, raw))
+
+        buf.append(create_scheme_str(scheme_name, scheme_items))
+
+    with open(CRATE_DIR / "src/schemes.rs", "w") as f:
+        f.write("\n".join(buf))
+
+    print("Cleaning up ...")
+    shutil.rmtree(common_maps)
+
+    print("Done.")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/vidyut-lipi/scripts/run-debugger.sh b/vidyut-lipi/scripts/run-debugger.sh
@@ -0,0 +1,16 @@
+#!/usr/bin/env sh
+if ! command -v wasm-pack >/dev/null 2>&1
+then
+    echo "Our debugger requires wasm-pack. Please install wasm-pack:"
+    echo "https://rustwasm.github.io/wasm-pack/installer/"
+    echo
+    exit 1
+fi
+
+# `cargo` uses the debug build by default, but `wasm-pack` uses the release
+# build by default instead. Creating this release build is slow, but the debug
+# build seems to have issues with enum parsing. So, stick with the release
+# build.
+wasm-pack build --target web --release
+mkdir -p www/static/wasm && cp pkg/* www/static/wasm
+cd www && python3 -m http.server