Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[lipi] Add a stronger implementation
This commit adds support for a variety of schemes as defined by the `indic_transliteration` project. It also adds a minimal test suite as a stronger guarantee of program correctness.

`vidyut-lipi` is still immature compared to transliterators like Aksharamukha or `indic_transliteration`. However, it is on a good trajectory, and I think it will become a compelling transliterator backend over time.

Some notes on design:

- This commit does not use any of the code from @skmnktl's in-progress transliterator, but it does borrow the idea of incorporating `indic_transliteration`'s TOML maps directly into the program source code.

- I liked @skmnktl's idea of using a `Token` enum as an intermediate representation between the input scheme and the output scheme, but pursuing that approach felt cumbersome when mapping between *sequences* of characters (e.g. when working with ITRANS), so I've stayed with the approach used by `indic_transliteration`, i.e. using Devanagari as the intermediate representation.
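The Devanagari-as-intermediate idea from the design notes can be sketched roughly as follows. This is a simplified illustration, not the code in this commit: the tiny `IAST` and `SLP1` tables are hand-written assumptions in a `(deva, raw)` shape, `transliterate_sketch` and its helpers are hypothetical names, and the sketch treats Devanagari letters as opaque keys, ignoring the vowel marks and virama that a real transliterator has to handle for Brahmic scripts.

```python
# Hypothetical, heavily abbreviated scheme tables of (deva, raw) pairs.
IAST = [("द", "d"), ("व", "v"), ("ए", "e"), ("अ", "a"), ("उ", "u"), ("औ", "au")]
SLP1 = [("द", "d"), ("व", "v"), ("ए", "e"), ("अ", "a"), ("उ", "u"), ("औ", "O")]


def to_deva_keys(text, scheme):
    """Map source text to Devanagari keys with greedy longest-match."""
    raw_to_deva = {raw: deva for deva, raw in scheme}
    max_len = max(len(raw) for raw in raw_to_deva)
    keys = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "au" wins over "a" + "u".
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i : i + n]
            if chunk in raw_to_deva:
                keys.append(raw_to_deva[chunk])
                i += n
                break
        else:
            # Unknown character: pass it through unchanged.
            keys.append(text[i])
            i += 1
    return keys


def from_deva_keys(keys, scheme):
    """Map Devanagari keys to the destination scheme."""
    deva_to_raw = {}
    for deva, raw in scheme:
        # The first entry for a key wins; later entries would be alternates.
        deva_to_raw.setdefault(deva, raw)
    return "".join(deva_to_raw.get(key, key) for key in keys)


def transliterate_sketch(text, source, dest):
    """Two-step conversion: source -> Devanagari keys -> destination."""
    return from_deva_keys(to_deva_keys(text, source), dest)
```

The longest-match loop is what makes *sequences* of characters workable here: a multi-character token such as IAST `au` resolves to a single Devanagari key before any output is produced.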
Showing 13 changed files with 3,230 additions and 186 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
www/static/wasm |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,9 +1,23 @@ | ||
[package] | ||
name = "vidyut-lipi" | ||
version = "0.1.0" | ||
authors = ["Arun Prasad <[email protected]>"] | ||
description = "A Sanskrit transliterator" | ||
homepage = "https://github.com/ambuda-org/vidyut" | ||
repository = "https://github.com/ambuda-org/vidyut" | ||
categories = ["text-processing"] | ||
keywords = ["sanskrit"] | ||
license = "MIT" | ||
edition = "2021" | ||
|
||
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html | ||
|
||
[dependencies] | ||
rustc-hash = "1.1.0" | ||
clap = { version = "4.0.12", features = ["derive"] } | ||
wasm-bindgen = "0.2" | ||
serde-wasm-bindgen = "0.4" | ||
console_error_panic_hook = "0.1.7" | ||
|
||
[lib] | ||
crate-type = ["cdylib", "rlib"] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
debugger: | ||
	./scripts/run-debugger.sh | ||
|
||
test: | ||
	cargo nextest run --no-fail-fast --status-level=fail |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,17 +1,78 @@ | ||
*vidyut-lipi* is a work-in-progress transliterator. It is not ready for public use. | ||
<div align="center"> | ||
<h1><code>vidyut-lipi</code></h1> | ||
<p><i>A fast Indic transliterator</i></p> | ||
</div> | ||
|
||
`vidyut-lipi` is an experimental Sanskrit transliteration library that also | ||
supports many of the scripts used within the Indosphere. Our goal is to provide | ||
a standard transliterator for the Sanskrit ecosystem that is easy to bind to | ||
other programming languages. | ||
|
||
This [crate][crate] is under active development as part of the [Ambuda][ambuda] | ||
project. If you enjoy our work and wish to contribute to it, we encourage you | ||
to [join our Discord server][discord], where you can meet other Sanskrit | ||
programmers and enthusiasts. | ||
|
||
An online demo is available [here][demo]. | ||
|
||
[crate]: https://doc.rust-lang.org/book/ch07-01-packages-and-crates.html | ||
[ambuda]: https://ambuda.org | ||
[discord]: https://discord.gg/7rGdTyWY7Z | ||
[demo]: https://ambuda-org.github.io/vidyut-lipi/ | ||
|
||
- [Overview](#overview) | ||
- [Usage](#usage) | ||
- [Design](#design) | ||
|
||
|
||
Overview | ||
-------- | ||
|
||
Communities around the world write Sanskrit and other Indian languages in | ||
different scripts in different contexts. For example, a user might type | ||
Sanskrit in ITRANS, read it in Kannada, and publish it in Devanagari. Such | ||
communities often rely on a *transliterator*, which converts text from one | ||
scheme to another. | ||
|
||
While various transliterators exist, none are both high-quality and widely | ||
available in different programming languages. The result is that maintenance | ||
and feature work is diluted across several different implementations. | ||
|
||
`vidyut-lipi` aims to provide a standard transliterator for the Sanskrit | ||
ecosystem. Our priorities are: | ||
|
||
- quality, including a comprehensive test suite. | ||
- coverage across all of the schemes in common use. | ||
- ease of use (and reuse) for developers. | ||
- high performance across various metrics, including runtime, startup time, and | ||
file size. | ||
|
||
We recommend `vidyut-lipi` if you need a simple and high-quality | ||
transliteration library, and we encourage you to [file an issue][issue] if | ||
`vidyut-lipi` does not support your use case. We are especially excited about | ||
supporting new scripts and new programming languages. | ||
|
||
[issue]: https://github.com/ambuda-org/vidyut/issues | ||
|
||
If `vidyut-lipi` is not right for your needs, we also strongly recommend | ||
the [Aksharamukha][aksharamukha] and [indic-transliteration][indic-trans] | ||
projects, which have each been highly influential in our work on `vidyut-lipi`. | ||
|
||
[aksharamukha]: https://github.com/virtualvinodh/aksharamukha/ | ||
[indic-trans]: https://github.com/indic-transliteration | ||
|
||
|
||
Usage | ||
----- | ||
|
||
For simple use cases that aren't very performance-sensitive, we recommend using | ||
`vidyut-lipi` like so: | ||
|
||
```rust | ||
use vidyut_lipi::{Scheme, transliterate}; | ||
|
||
let result = transliterate("devau", Scheme::Iast, Scheme::Slp1); | ||
assert_eq!(result, "devO"); | ||
let result = transliterate("devO", Scheme::Slp1, Scheme::Iast); | ||
assert_eq!(result, "devau"); | ||
``` | ||
|
||
```shell | ||
# Run transliteration | ||
$ cargo run --bin transliterate -- --text rāmau | ||
``` | ||
We are still stabilizing our API and will share more examples here soon. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,186 @@ | ||
#!/usr/bin/env python3 | ||
"""Create schemes for vidyut-lipi and write them to `src/schemes.rs`. | ||
We create these mappings by modifying the data in the `common_maps` dir from | ||
the indic-transliteration project. | ||
""" | ||
|
||
import tomllib | ||
import subprocess | ||
from pathlib import Path | ||
from glob import glob | ||
import shutil | ||
|
||
CRATE_DIR = Path(__file__).parent.parent | ||
|
||
VOWEL_TO_MARK = { | ||
    "आ": "\u093e", | ||
    "इ": "\u093f", | ||
    "ई": "\u0940", | ||
    "उ": "\u0941", | ||
    "ऊ": "\u0942", | ||
    "ऋ": "\u0943", | ||
    "ॠ": "\u0944", | ||
    "ऌ": "\u0962", | ||
    "ॡ": "\u0963", | ||
    "ऎ": "\u0946", | ||
    "ए": "\u0947", | ||
    "ऐ": "\u0948", | ||
    "ऒ": "\u094a", | ||
    "ओ": "\u094b", | ||
    "औ": "\u094c", | ||
} | ||
|
||
ALLOWED = { | ||
    "BENGALI", | ||
    "BRAHMI", | ||
    "DEVANAGARI", | ||
    "GUJARATI", | ||
    "GURMUKHI", | ||
    "GRANTHA", | ||
    "KANNADA", | ||
    "MALAYALAM", | ||
    "ORIYA", | ||
    "SINHALA", | ||
    "TAMIL", | ||
    "TELUGU", | ||
    "TIBETAN", | ||
|
||
    "HK", | ||
    "IAST", | ||
    "ITRANS", | ||
    "SLP1", | ||
    "VELTHUIS", | ||
} | ||
|
||
|
||
def _sanitize(s: str) -> str: | ||
    return s.replace("\\", "\\\\").replace('"', '\\"') | ||
|
||
|
||
def _maybe_override(name: str, deva: str, raw: str) -> str | None: | ||
    if name == "BRAHMI": | ||
        if deva == "\u0946": | ||
            # short e mark | ||
            return None | ||
        if deva == "\u094a": | ||
            # short o mark | ||
            return None | ||
    elif name == "HK": | ||
        if raw == "|": | ||
            return "." | ||
        if raw == "||": | ||
            return ".." | ||
    elif name == "IAST": | ||
        if deva == "ळ": | ||
            return "ḻ" | ||
        if deva == "ऴ": | ||
            return None | ||
        if raw == "|": | ||
            return "." | ||
        if raw == "||": | ||
            return ".." | ||
    elif name == "VELTHUIS": | ||
        # These are part of the Velthuis spec but are errors in indic-transliteration. | ||
        if deva == "ॠ": | ||
            return ".R" | ||
        if deva == "ॡ": | ||
            return ".L" | ||
    return raw | ||
|
||
|
||
def create_scheme_str(name: str, items: list[tuple[str, str]]) -> str: | ||
    buf = [] | ||
|
||
    buf.append(f"pub const {name}: &[(&str, &str)] = &[") | ||
    for deva, raw in items: | ||
        deva = _sanitize(deva) | ||
        raw = _sanitize(raw) | ||
        buf.append(f' ("{deva}", "{raw}"),') | ||
    buf.append("];\n") | ||
|
||
    return "\n".join(buf) | ||
|
||
|
||
def main(): | ||
    repo = "https://github.com/indic-transliteration/common_maps.git" | ||
    common_maps = Path("common_maps") | ||
    if not common_maps.exists(): | ||
        print("Cloning `common_maps` ...") | ||
        # Pass arguments as a list (no shell) and fail loudly if the clone fails. | ||
        subprocess.run(["git", "clone", "--depth", "1", repo], check=True) | ||
|
||
    print("Creating schemes ...") | ||
    buf = [ | ||
        "#![allow(unused)]", | ||
        "", | ||
        "//! Auto-generated scheme data.", | ||
        "//!", | ||
        "//! These schemes were auto-generated from the `common_maps` repository", | ||
        "//! from the `indic-transliteration` project.", | ||
        "", | ||
    ] | ||
    for path in sorted(glob("common_maps/**/*.toml")): | ||
        with open(path, "rb") as f: | ||
            data = tomllib.load(f) | ||
|
||
        scheme_name = Path(path).stem.upper() | ||
        if scheme_name not in ALLOWED: | ||
            continue | ||
|
||
        scheme_type = Path(path).parent.stem | ||
        assert scheme_type in {"roman", "brahmic"}, scheme_type | ||
|
||
        scheme_items = [] | ||
        raw_to_deva = {} | ||
|
||
        for category in data: | ||
            if category.startswith("_"): | ||
                # Ignore file comments, etc. | ||
                continue | ||
|
||
            if category == "shortcuts": | ||
                # TODO: support these | ||
                continue | ||
|
||
            if category.endswith("alternates"): | ||
                for raw_main, alts in data[category].items(): | ||
                    deva = raw_to_deva.get(raw_main) | ||
                    if deva is None: | ||
                        continue | ||
                    for alt in alts: | ||
                        assert isinstance(deva, str) | ||
                        assert isinstance(alt, str) | ||
                        alt = _maybe_override(scheme_name, deva, alt) | ||
                        if alt is not None: | ||
                            scheme_items.append((deva, alt)) | ||
            else: | ||
                for deva, raw in data[category].items(): | ||
                    assert isinstance(deva, str) | ||
                    assert isinstance(raw, str) | ||
                    raw = _maybe_override(scheme_name, deva, raw) | ||
                    if raw is not None: | ||
                        raw_to_deva[raw] = deva | ||
                        scheme_items.append((deva, raw)) | ||
|
||
            if scheme_type == "roman" and category == "vowels": | ||
                for vowel, raw in data[category].items(): | ||
                    raw = _maybe_override(scheme_name, vowel, raw) | ||
                    mark = VOWEL_TO_MARK.get(vowel) | ||
                    if mark: | ||
                        assert isinstance(mark, str) | ||
                        assert isinstance(raw, str) | ||
                        scheme_items.append((mark, raw)) | ||
|
||
        buf.append(create_scheme_str(scheme_name, scheme_items)) | ||
|
||
    with open(CRATE_DIR / "src/schemes.rs", "w", encoding="utf-8") as f: | ||
        f.write("\n".join(buf)) | ||
|
||
    print("Cleaning up ...") | ||
    shutil.rmtree(common_maps) | ||
|
||
    print("Done.") | ||
|
||
|
||
if __name__ == "__main__": | ||
    main() |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
#!/usr/bin/env sh | ||
# `[[ ... ]]` is a bash extension, so use a POSIX check under `sh`. | ||
if ! command -v wasm-pack > /dev/null 2>&1 | ||
then | ||
    echo "Our debugger requires wasm-pack. Please install wasm-pack:" | ||
    echo "https://rustwasm.github.io/wasm-pack/installer/" | ||
    echo | ||
    exit 1 | ||
fi | ||
|
||
# `cargo` uses the debug build by default, but `wasm-pack` uses the release | ||
# build by default instead. Creating this release build is slow, but the debug | ||
# build seems to have issues with enum parsing. So, stick with the release | ||
# build. | ||
wasm-pack build --target web --release | ||
mkdir -p www/static/wasm && cp pkg/* www/static/wasm | ||
cd www && python3 -m http.server |