[lipi] Add a stronger implementation
This commit adds support for a variety of schemes as defined by the
`indic_transliteration` project. It also adds a minimal test suite as a
stronger guarantee on program correctness.

`vidyut-lipi` is still immature compared to transliterators like
Aksharamukha or `indic_transliteration`. However, it is on a good
trajectory, and I think it will become a compelling transliterator
backend over time.

Some notes on design:

- This commit does not use any of the code from @skmnktl's in-progress
  transliterator, but it does borrow the idea of incorporating
  `indic_transliteration`'s TOML maps directly into the program source
  code.

- I liked @skmnktl's idea of using a `Token` enum as an intermediate
  representation between the input scheme and the output scheme, but
  pursuing that approach felt cumbersome when mapping between *sequences*
  of characters (e.g. when working with ITRANS), so I've stayed with the
  approach used by `indic_transliteration`, i.e. using Devanagari as the
  intermediate representation.
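
As a rough illustration of the last note, here is a minimal Python sketch of
transliterating through Devanagari as the intermediate representation. It is
not the code in this commit: the two truncated tables and the helpers
`to_deva`, `from_deva`, and `transliterate_sketch` are hypothetical, and the
sketch covers only a few independent vowels, ignoring vowel marks and virama
handling.

```python
# Hypothetical, heavily truncated scheme tables in (deva, raw) form.
IAST = [("अ", "a"), ("इ", "i"), ("उ", "u"), ("ऐ", "ai"), ("औ", "au")]
SLP1 = [("अ", "a"), ("इ", "i"), ("उ", "u"), ("ऐ", "E"), ("औ", "O")]


def to_deva(text, scheme):
    """Convert `text` from `scheme` to Devanagari by greedy longest match."""
    raw_to_deva = {raw: deva for deva, raw in scheme}
    max_len = max(len(raw) for raw in raw_to_deva)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that "ai" wins over "a" + "i".
        for n in range(max_len, 0, -1):
            chunk = text[i : i + n]
            if chunk in raw_to_deva:
                out.append(raw_to_deva[chunk])
                i += len(chunk)
                break
        else:
            # Pass unknown characters through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def from_deva(text, scheme):
    """Convert Devanagari `text` to `scheme`, one code point at a time."""
    deva_to_raw = {deva: raw for deva, raw in scheme}
    return "".join(deva_to_raw.get(ch, ch) for ch in text)


def transliterate_sketch(text, source, target):
    """Map source -> Devanagari -> target."""
    return from_deva(to_deva(text, source), target)
```

Because the source side matches *sequences* such as `ai` and `au` directly
against Devanagari, no separate token type is needed for multi-character
input.
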
akprasad committed Dec 26, 2023
1 parent 2eb4434 commit 60f5f95
Showing 13 changed files with 3,230 additions and 186 deletions.
4 changes: 4 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions vidyut-lipi/.gitignore
@@ -0,0 +1 @@
www/static/wasm
14 changes: 14 additions & 0 deletions vidyut-lipi/Cargo.toml
@@ -1,9 +1,23 @@
[package]
name = "vidyut-lipi"
version = "0.1.0"
authors = ["Arun Prasad <[email protected]>"]
description = "A Sanskrit transliterator"
homepage = "https://github.com/ambuda-org/vidyut"
repository = "https://github.com/ambuda-org/vidyut"
categories = ["text-processing"]
keywords = ["sanskrit"]
license = "MIT"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
rustc-hash = "1.1.0"
clap = { version = "4.0.12", features = ["derive"] }
wasm-bindgen = "0.2"
serde-wasm-bindgen = "0.4"
console_error_panic_hook = "0.1.7"

[lib]
crate-type = ["cdylib", "rlib"]
5 changes: 5 additions & 0 deletions vidyut-lipi/Makefile
@@ -0,0 +1,5 @@
debugger:
	./scripts/run-debugger.sh

test:
	cargo nextest run --no-fail-fast --status-level=fail
75 changes: 68 additions & 7 deletions vidyut-lipi/README.md
@@ -1,17 +1,78 @@
*vidyut-lipi* is a work-in-progress transliterator. It is not ready for public use.
<div align="center">
<h1><code>vidyut-lipi</code></h1>
<p><i>A fast Indic transliterator</i></p>
</div>

`vidyut-lipi` is an experimental Sanskrit transliteration library that also
supports many of the scripts used within the Indosphere. Our goal is to provide
a standard transliterator for the Sanskrit ecosystem that is easy to bind to
other programming languages.

This [crate][crate] is under active development as part of the [Ambuda][ambuda]
project. If you enjoy our work and wish to contribute to it, we encourage you
to [join our Discord server][discord], where you can meet other Sanskrit
programmers and enthusiasts.

An online demo is available [here][demo].

[crate]: https://doc.rust-lang.org/book/ch07-01-packages-and-crates.html
[ambuda]: https://ambuda.org
[discord]: https://discord.gg/7rGdTyWY7Z
[demo]: https://ambuda-org.github.io/vidyut-lipi/

- [Overview](#overview)
- [Usage](#usage)
- [Design](#design)


Overview
--------

Communities around the world write Sanskrit and other Indian languages in
different scripts in different contexts. For example, a user might type
Sanskrit in ITRANS, read it in Kannada, and publish it in Devanagari. Such
communities often rely on a *transliterator*, which converts text from one
scheme to another.

While various transliterators exist, none are both high-quality and widely
available in different programming languages. The result is that maintenance
and feature work is diluted across several different implementations.

`vidyut-lipi` aims to provide a standard transliterator for the Sanskrit
ecosystem. Our priorities are:

- quality, including a comprehensive test suite.
- coverage across all of the schemes in common use.
- ease of use (and reuse) for developers.
- high performance across various metrics, including runtime, startup time, and
file size.

We recommend `vidyut-lipi` if you need a simple and high-quality
transliteration library, and we encourage you to [file an issue][issue] if
`vidyut-lipi` does not support your use case. We are especially excited about
supporting new scripts and new programming languages.

[issue]: https://github.com/ambuda-org/vidyut/issues

If `vidyut-lipi` is not right for your needs, we also strongly recommend the
[Aksharamukha][aksharamukha] and [indic-transliteration][indic-trans] projects,
which have each been highly influential in our work on `vidyut-lipi`.

[aksharamukha]: https://github.com/virtualvinodh/aksharamukha/
[indic-trans]: https://github.com/indic-transliteration


Usage
-----

For simple use cases that aren't very performance-sensitive, we recommend using
`vidyut-lipi` like so:

```rust
use vidyut_lipi::{Scheme, transliterate};

let result = transliterate("devau", Scheme::Iast, Scheme::Slp1);
assert_eq!(result, "devO");
let result = transliterate("devO", Scheme::Slp1, Scheme::Iast);
assert_eq!(result, "devau");
```

```shell
# Run transliteration
$ cargo run --bin transliterate -- --text rāmau
```
We are still stabilizing our API and will share more examples here soon.
186 changes: 186 additions & 0 deletions vidyut-lipi/scripts/create_schemes.py
@@ -0,0 +1,186 @@
#!/usr/bin/env python3
"""Create schemes for vidyut-lipi and write them to `src/schemes.rs`.
We create these mappings by modifying the data in the `common_maps` dir from
the indic-transliteration project.
"""

import tomllib
import subprocess
from pathlib import Path
from glob import glob
import shutil

CRATE_DIR = Path(__file__).parent.parent

VOWEL_TO_MARK = {
    "आ": "\u093e",
    "इ": "\u093f",
    "ई": "\u0940",
    "उ": "\u0941",
    "ऊ": "\u0942",
    "ऋ": "\u0943",
    "ॠ": "\u0944",
    "ऌ": "\u0962",
    "ॡ": "\u0963",
    "ऎ": "\u0946",
    "ए": "\u0947",
    "ऐ": "\u0948",
    "ऒ": "\u094a",
    "ओ": "\u094b",
    "औ": "\u094c",
}

ALLOWED = {
    "BENGALI",
    "BRAHMI",
    "DEVANAGARI",
    "GUJARATI",
    "GURMUKHI",
    "GRANTHA",
    "KANNADA",
    "MALAYALAM",
    "ORIYA",
    "SINHALA",
    "TAMIL",
    "TELUGU",
    "TIBETAN",

    "HK",
    "IAST",
    "ITRANS",
    "SLP1",
    "VELTHUIS",
}


def _sanitize(s: str) -> str:
    return s.replace("\\", "\\\\").replace('"', '\\"')


def _maybe_override(name: str, deva: str, raw: str) -> str | None:
    if name == "BRAHMI":
        if deva == "\u0946":
            # short e mark
            return None
        if deva == "\u094a":
            # short o mark
            return None
    elif name == "HK":
        if raw == "|":
            return "."
        if raw == "||":
            return ".."
    elif name == "IAST":
        if deva == "ळ":
            return "ḻ"
        if deva == "ऴ":
            return None
        if raw == "|":
            return "."
        if raw == "||":
            return ".."
    elif name == "VELTHUIS":
        # These are part of the Velthuis spec but are errors in indic-transliteration.
        if deva == "ॠ":
            return ".R"
        if deva == "ॡ":
            return ".L"
    return raw


def create_scheme_str(name: str, items: list[tuple[str, str]]) -> str:
    buf = []

    buf.append(f"pub const {name}: &[(&str, &str)] = &[")
    for deva, raw in items:
        deva = _sanitize(deva)
        raw = _sanitize(raw)
        buf.append(f' ("{deva}", "{raw}"),')
    buf.append("];\n")

    return "\n".join(buf)


def main():
    repo = "https://github.com/indic-transliteration/common_maps.git"
    common_maps = Path("common_maps")
    if not common_maps.exists():
        print("Cloning `common_maps` ...")
        subprocess.run(f"git clone --depth 1 {repo}", shell=True)

    print("Creating schemes ...")
    buf = [
        "#![allow(unused)]",
        "",
        "//! Auto-generated scheme data.",
        "//!",
        "//! These schemes were auto-generated from the `common_maps` repository",
        "//! from the `indic-transliteration` project.",
        "",
    ]
    for path in sorted(glob("common_maps/**/*.toml")):
        with open(path, "rb") as f:
            data = tomllib.load(f)

        scheme_name = Path(path).stem.upper()
        if scheme_name not in ALLOWED:
            continue

        scheme_type = Path(path).parent.stem
        assert scheme_type in {"roman", "brahmic"}, scheme_type

        scheme_items = []
        raw_to_deva = {}

        for category in data:
            if category.startswith("_"):
                # Ignore file comments, etc.
                continue

            if category == "shortcuts":
                # TODO: support these
                continue

            if category.endswith("alternates"):
                for raw_main, alts in data[category].items():
                    deva = raw_to_deva.get(raw_main)
                    if deva is None:
                        continue
                    for alt in alts:
                        assert isinstance(deva, str)
                        assert isinstance(alt, str)
                        alt = _maybe_override(scheme_name, deva, alt)
                        if alt is not None:
                            scheme_items.append((deva, alt))
            else:
                for deva, raw in data[category].items():
                    assert isinstance(deva, str)
                    assert isinstance(raw, str)
                    raw = _maybe_override(scheme_name, deva, raw)
                    if raw is not None:
                        raw_to_deva[raw] = deva
                        scheme_items.append((deva, raw))

            if scheme_type == "roman" and category == "vowels":
                for vowel, raw in data[category].items():
                    raw = _maybe_override(scheme_name, vowel, raw)
                    mark = VOWEL_TO_MARK.get(vowel)
                    if mark:
                        assert isinstance(mark, str)
                        assert isinstance(raw, str)
                        scheme_items.append((mark, raw))

        buf.append(create_scheme_str(scheme_name, scheme_items))

    with open(CRATE_DIR / "src/schemes.rs", "w") as f:
        f.write("\n".join(buf))

    print("Cleaning up ...")
    shutil.rmtree(common_maps)

    print("Done.")


if __name__ == "__main__":
    main()
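
The vowel handling in `create_schemes.py` is easier to see in isolation. The
sketch below is a simplified, hypothetical reduction of that step, not the
script itself: `expand_roman_vowels` is a name introduced here, and the mark
table is truncated to two entries. It shows how a roman scheme's `vowels`
table yields one pair for the independent vowel and, where a dependent vowel
mark exists, a second pair that maps the mark to the same roman string.

```python
# Truncated copy of the vowel-to-mark table, for illustration only.
VOWEL_TO_MARK = {
    "आ": "\u093e",
    "इ": "\u093f",
}


def expand_roman_vowels(vowels):
    """Return (deva, raw) pairs for vowels, followed by their vowel marks."""
    items = [(vowel, raw) for vowel, raw in vowels.items()]
    for vowel, raw in vowels.items():
        mark = VOWEL_TO_MARK.get(vowel)
        if mark:
            # The mark shares the roman spelling of its independent vowel.
            items.append((mark, raw))
    return items


pairs = expand_roman_vowels({"अ": "a", "आ": "A", "इ": "i"})
```

Note that `अ` has no entry in the mark table, so it contributes only a single
pair; in Devanagari the short `a` is inherent in a consonant and has no
dependent mark.
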
16 changes: 16 additions & 0 deletions vidyut-lipi/scripts/run-debugger.sh
@@ -0,0 +1,16 @@
#!/usr/bin/env sh
if ! command -v wasm-pack > /dev/null 2>&1
then
    echo "Our debugger requires wasm-pack. Please install wasm-pack:"
    echo "https://rustwasm.github.io/wasm-pack/installer/"
    echo
    exit 1
fi

# `cargo` uses the debug build by default, but `wasm-pack` uses the release
# build by default instead. Creating this release build is slow, but the debug
# build seems to have issues with enum parsing. So, stick with the release
# build.
wasm-pack build --target web --release
mkdir -p www/static/wasm && cp pkg/* www/static/wasm
cd www && python3 -m http.server