[lipi] Add eight schemes

Schemes:
- Add support for Khmer, Modi, Newa, Saurashtra, Tamil superscript, Thai,
  Tibetan, and Tirhuta.

Features:
- Update `detect` for new schemes.
- Update `unicode_norm` logic for new schemes.

Bug fixes:
- Add missing schemes to `Scheme::iter`.
- Update `unicode_norm` logic for previously missing schemes.
- Slightly improve support for ITRANS.

Code:
- Add `reshape` module to support more complex schemes.
- Combine our two separate transliteration functions into a single
  `transliterate_inner`.
- Create internal `Token` struct to model mappings.
- Avoid extra allocation in main `transliterate` loop.
- Fix various `clippy` warnings.

Documentation:
- Add "Alternatives" section to README and update examples.
- Add extensive comments to core code.
- Add or expand various docstrings.

ambuda-org · Jan 28, 2024 · commit 885a962 · 1 parent a0e7546

Showing 14 changed files with 2,043 additions and 447 deletions.
diff --git a/vidyut-lipi/README.md b/vidyut-lipi/README.md
@@ -28,7 +28,7 @@ An online demo is available [here][demo].
 Overview
 --------
 
-Communities around the world write Sanskrit and other Indian languages in
+Communities around the world write Sanskrit and other Indian languages with
 different scripts in different contexts. For example, a user might type
 Sanskrit in ITRANS, read it in Kannada, and publish it in Devanagari. Such
 communities often rely on a *transliterator*, which converts text from one
@@ -42,24 +42,48 @@ and feature work is diluted across several different implementations.
 ecosystem. Our priorities are:
 
 - quality, including a comprehensive test suite.
-- coverage across all of the schemes in common use.
-- ease of use (and reuse) for developers.
+- test coverage across all of the schemes in common use.
+- a precise and ergonomic API.
+- availability in multiple languages, including Python and WebAssembly.
 - high performance across various metrics, including runtime, startup time, and
   file size.
 
-We recommend `vidyut-lipi` if you need a simple and high-quality
-transliteration library, and we encourage you to [file an issue][issue] if
-`vidyut-lipi` does not support your use case. We are especially excited about
-supporting new scripts and new programming languages.
+We encourage you to [file an issue][issue] if `vidyut-lipi` does not support
+your use case. We are especially excited about supporting new scripts and new
+programming languages.
 
 [issue]: https://github.com/ambuda-org/vidyut/issues
 
-If `vidyut-lipi` is not right for your needs, we also strongly recommend
-the [Aksharamukha][aksharamukha] the [indic-transliteration][indic-trans]
-projects, which have each been highly influential in our work on `vidyut-lipi`.
 
-[aksharamukha]: https://github.com/virtualvinodh/aksharamukha/
-[indic-trans]: https://github.com/indic-transliteration
+Alternatives to `vidyut-lipi`
+-----------------------------
+
+There are two main alternatives to `vidyut-lipi`, both of which have been
+influential on the design of `vidyut-lipi`:
+
+- [Aksharamukha][am] offers high quality and supports more than a hundred
+  different scripts. Aksharamukha offers best-in-class transliteration, but it
+  is available only in Python.
+
+- [indic-transliteration][it] implements the same basic transliterator in
+  multiple programming languages. indic-transliteration supports a large
+  software ecosystem, but its different implementations each have their own
+  quirks and limitations.
+
+[am]: https://github.com/virtualvinodh/aksharamukha/
+[it]: https://github.com/indic-transliteration
+
+Our long-term goal is to combine the quality of Aksharamukha with the
+availability of indic-transliteration. Until then, `vidyut-lipi` provides the
+following short-term benefits:
+
+- High-quality transliteration for Rust and WebAssembly.
+- Smooth support for other programming languages through projects like
+  [pyo3][pyo3] (Python), [magnus][magnus] (Ruby), [cxx][cxx] (C++), etc.
+
+[pyo3]: https://pyo3.rs/v0.20.2/
+[magnus]: https://github.com/matsadler/magnus
+[cxx]: https://cxx.rs/
 
 
 Usage
@@ -102,31 +126,39 @@ for scheme in Scheme::iter() {
 }
 ```
 
-As of 2023-12-29, this code prints the following:
+As of 2024-01-27, this code prints the following:
 
 ```text
 Balinese        ᬲᬂᬲ᭄ᬓᬺᬢᬫ᭄
+BarahaSouth     saMskRutam
 Bengali         সংস্কৃতম্
 Brahmi          𑀲𑀁𑀲𑁆𑀓𑀾𑀢𑀫𑁆
 Burmese         သံသ်ကၖတမ်
 Devanagari      संस्कृतम्
 Grantha         𑌸𑌂𑌸𑍍𑌕𑍃𑌤𑌮𑍍
 Gujarati        સંસ્કૃતમ્
 Gurmukhi        ਸਂਸ੍ਕਤਮ੍
-BarahaSouth     saMskRutam
 HarvardKyoto    saMskRtam
 Iast            saṃskṛtam
+Iso15919        saṁskr̥tam
 Itrans          saMskRRitam
 Javanese        ꦱꦁꦱ꧀ꦏꦽꦠꦩ꧀
 Kannada         ಸಂಸ್ಕೃತಮ್
+Khmer           សំស្ក្ឫតម៑
 Malayalam       സംസ്കൃതമ്
+Modi            𑘭𑘽𑘭𑘿𑘎𑘵𑘝𑘦𑘿
+Newa            𑐳𑑄𑐳𑑂𑐎𑐺𑐟𑐩𑑂
 Odia            ସଂସ୍କୃତମ୍
+Saurashtra      ꢱꢀꢱ꣄ꢒꢺꢡꢪ꣄
 Sharada         𑆱𑆁𑆱𑇀𑆑𑆸𑆠𑆩𑇀
 Siddham         𑖭𑖽𑖭𑖿𑖎𑖴𑖝𑖦𑖿
 Sinhala         සංස්කෘතම්
 Slp1            saMskftam
-Tamil           ஸம்ஸ்க்ரு'தம்
+Tamil           ஸம்ʼஸ்க்ருʼதம்
 Telugu          సంస్కృతమ్
+Thai            สํสฺกฺฤตมฺ
+Tibetan         སཾསྐྲྀཏམ
+Tirhuta         𑒮𑓀𑒮𑓂𑒏𑒵𑒞𑒧𑓂
 Velthuis        sa.msk.rtam
 Wx              saMskqwam
 ```

diff --git a/vidyut-lipi/scripts/create_schemes.py b/vidyut-lipi/scripts/create_schemes.py
@@ -37,20 +37,29 @@
     "BENGALI",
     "BRAHMI",
     "BURMESE",
+    "CHAM",
     "DEVANAGARI",
     "GUJARATI",
     "GURMUKHI",
     "GRANTHA",
     "JAVANESE",
     "KANNADA",
+    "KHMER",
+    "LAO",
     "MALAYALAM",
+    "MODI",
+    "NEWA",
     "ORIYA",
     "SHARADA",
     "SIDDHAM",
     "SINHALA",
-    "TAMIL",
+    # Not yet on indic-transliteration/master
+    "SAURASHTRA",
+    "TAMIL_SUPERSCRIPTED",
     "TELUGU",
+    "THAI",
     "TIBETAN",
+    "TIRHUTA_MAITHILI",
 
     "BARAHA",
     "HK",
@@ -93,7 +102,7 @@ def to_unique(xs: list) -> list:
 def _maybe_override(name: str, deva: str, raw: str) -> str | None:
     overrides = {}
 
-    if name in {"BRAHMI", "BALINESE", "BURMESE", "SIDDHAM", "TIBETAN"}:
+    if name in {"BRAHMI", "BALINESE", "BURMESE", "SIDDHAM"}:
         if deva in {"\u0946", "\u094a", "\u090e", "\u0912"}:
             # - short e mark
             # - short o mark
@@ -110,6 +119,14 @@ def _maybe_override(name: str, deva: str, raw: str) -> str | None:
             "\ua8e2": None,
             "\ua8e3": None,
         }
+    elif name == "CHAM":
+        overrides = {
+            # Short e and o, plus vowel marks
+            "\u0946": None,
+            "\u094a": None,
+            "\u090e": None,
+            "\u0912": None,
+        }
     elif name == "GRANTHA":
         overrides = {
             # vowel sign AU
@@ -124,6 +141,9 @@ def _maybe_override(name: str, deva: str, raw: str) -> str | None:
         overrides = {
             "।": ".",
             "॥": "..",
+            "ख़": "k͟h",
+            # Delete -- common_maps maps this to "ḳ", which we need for aytam.
+            # We'll add a valid mapping for क़: further below.
             "क़": None,
         }
     elif name == "IAST":
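
Review note: the `None` values in these override tables appear to mark mappings that should be deleted, as the "Delete" comment on `क़` suggests. The sketch below illustrates that convention under stated assumptions. `apply_overrides` is a hypothetical helper, not a function in `create_schemes.py`, and the real `_maybe_override` body is only partly visible in this diff.

```python
def apply_overrides(items, overrides):
    """Apply an override table to a list of (deva, raw) pairs.

    Hypothetical helper for illustration only. It assumes that:
    - a key mapped to None drops the pair entirely,
    - a key mapped to a string replaces the raw value,
    - a key absent from the table leaves the pair unchanged.
    """
    result = []
    for deva, raw in items:
        if deva in overrides:
            new_raw = overrides[deva]
            if new_raw is None:
                continue  # deleted mapping
            result.append((deva, new_raw))
        else:
            result.append((deva, raw))
    return result


# Override values taken from the hunk above.
overrides = {
    "\u0964": ".",   # danda
    "\u0965": "..",  # double danda
    "\u0958": None,  # qa: deleted so that "ḳ" stays free for aytam
}

# Illustrative input pairs; the raw values here are placeholders.
items = [("\u0964", "|"), ("\u0958", "ḳ"), ("\u0915", "k")]

print(apply_overrides(items, overrides))
```

In this sketch the danda is remapped, the `क़` pair is dropped, and the untouched `क` pair passes through as is.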
@@ -135,10 +155,64 @@ def _maybe_override(name: str, deva: str, raw: str) -> str | None:
             # candrabindu
             "\u0901": "m̐",
         }
-    elif name == "TAMIL":
+    elif name == "KHMER":
+        overrides = {
+            "।": "។",
+            "॥": "៕",
+        }
+    elif name == "MODI":
+        overrides = {
+            "\u0907": "\U00011602",  # letter i
+            "\u0908": "\U00011603",  # letter ii
+            "\u0909": "\U00011604",  # letter u
+            "\u090a": "\U00011605",  # letter uu
+            "\u090b": "\U00011606",  # letter vocalic r
+            "\u090c": "\U00011608",  # letter vocalic l
+            "\u093f": "\U00011631",  # sign i
+            "\u0940": "\U00011632",  # sign ii
+            "\u0941": "\U00011633",  # sign u
+            "\u0942": "\U00011634",  # sign uu
+            "\u0943": "\U00011635",  # sign vocalic r
+            "\u0944": "\U00011636",  # sign vocalic rr
+            "\u0960": "\U00011607",  # letter vocalic rr
+            "\u0961": "\U00011609",  # letter vocalic ll
+            "\u0962": "\U00011637",  # sign vocalic l
+            "\u0963": "\U00011638",  # sign vocalic ll
+
+            "\u0964": "\U00011641",  # danda
+            "\u0965": "\U00011642",  # double danda
+        }
+
+    elif name == "NEWA":
         overrides = {
-            # Visarga
-            "\u0903": None,
+            "\u0964": "\U0001144b",  # danda
+            "\u0965": "\U0001144c",  # double danda
+        }
+    elif name == "TAMIL_SUPERSCRIPTED":
+        # Use roman digits per Aksharamukha
+        overrides = {
+            "०": "0",
+            "१": "1",
+            "२": "2",
+            "३": "3",
+            "४": "4",
+            "५": "5",
+            "६": "6",
+            "७": "7",
+            "८": "8",
+            "९": "9",
+        }
+    elif name == "TIBETAN":
+        overrides = {
+            # Virama
+            "\u094d": "\u0f84",
+            # Short e and o, plus vowel marks
+            "\u0946": None,
+            "\u094a": None,
+            "\u090e": None,
+            "\u0912": None,
+            # Use distinct "va" character instead of "ba".
+            "व": "\u0f5d",
         }
     elif name == "VELTHUIS":
         # These are part of the Velthuis spec but are errors in indic-transliteration.
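
Review note: the hard-coded code points in the MODI and NEWA overrides above are easy to mistype, so a quick cross-check against Unicode character names may be useful. The sketch below is not part of the commit. `MODI_OVERRIDES`, `NEWA_OVERRIDES`, and `name_mismatches` are hypothetical names, and the check assumes that a correct pair has identical Unicode names once "DEVANAGARI" is swapped for the target script name.

```python
import unicodedata

# Copied from the MODI overrides in the hunk above.
MODI_OVERRIDES = {
    "\u0907": "\U00011602",  # letter i
    "\u0908": "\U00011603",  # letter ii
    "\u0909": "\U00011604",  # letter u
    "\u090a": "\U00011605",  # letter uu
    "\u090b": "\U00011606",  # letter vocalic r
    "\u090c": "\U00011608",  # letter vocalic l
    "\u093f": "\U00011631",  # sign i
    "\u0940": "\U00011632",  # sign ii
    "\u0941": "\U00011633",  # sign u
    "\u0942": "\U00011634",  # sign uu
    "\u0943": "\U00011635",  # sign vocalic r
    "\u0944": "\U00011636",  # sign vocalic rr
    "\u0960": "\U00011607",  # letter vocalic rr
    "\u0961": "\U00011609",  # letter vocalic ll
    "\u0962": "\U00011637",  # sign vocalic l
    "\u0963": "\U00011638",  # sign vocalic ll
    "\u0964": "\U00011641",  # danda
    "\u0965": "\U00011642",  # double danda
}

# Copied from the NEWA overrides in the hunk above.
NEWA_OVERRIDES = {
    "\u0964": "\U0001144b",  # danda
    "\u0965": "\U0001144c",  # double danda
}


def name_mismatches(table, script):
    """Return (deva, target, expected_name, actual_name) for each bad pair.

    The expected name is the Devanagari character's Unicode name with
    "DEVANAGARI" replaced by `script`, e.g. "MODI LETTER I".
    """
    mismatches = []
    for deva, target in table.items():
        expected = unicodedata.name(deva).replace("DEVANAGARI", script)
        # Fall back to a marker if this Python build's Unicode data
        # does not know the target code point.
        actual = unicodedata.name(target, "<unassigned>")
        if expected != actual:
            mismatches.append((deva, target, expected, actual))
    return mismatches


# An empty list means every pair agrees by name.
print(name_mismatches(MODI_OVERRIDES, "MODI"))
print(name_mismatches(NEWA_OVERRIDES, "NEWA"))
```

The result depends on the Unicode version bundled with the interpreter (`unicodedata.unidata_version`). Modi needs Unicode 7.0 or later and Newa needs 9.0 or later.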
@@ -185,7 +259,9 @@ def create_scheme_entry(name: str, items: list[tuple[str, str]]) -> str:
 
 
 def main():
-    repo = "https://github.com/indic-transliteration/common_maps.git"
+    # We're waiting on some changes to be pushed to indic-transliteration, so
+    # use a fork for now.
+    repo = "https://github.com/akprasad/common_maps.git"
     common_maps = Path("common_maps")
     if not common_maps.exists():
         print("Cloning `common_maps` ...")
@@ -333,6 +409,11 @@ def main():
                 # AU (AA + AU length mark)
                 ("\u094c", "\U00011347\U00011357"),
             ])
+        elif scheme_name == "ITRANS":
+            scheme_items.extend([
+                # Vedic anusvara (just render as candrabindu)
+                ("\u0901", "{\\m+}"),
+            ])
         elif scheme_name == "ISO":
             scheme_items.extend([
                 # Aytam
@@ -355,7 +436,7 @@ def main():
                 # Anudatta
                 ("\u0952", "\\"),
             ])
-        elif scheme_name == "TAMIL":
+        elif scheme_name == "TAMIL_SUPERSCRIPTED":
             scheme_items.extend([
                 # Aytam
                 ("\u0b83", "\u0b83"),
@@ -382,6 +463,10 @@ def main():
                 ("\u092b\u093c", "f"),
             ])
 
+        if scheme_name == "TAMIL_SUPERSCRIPTED":
+            scheme_name = "TAMIL"
+        elif scheme_name == "TIRHUTA_MAITHILI":
+            scheme_name = "TIRHUTA"
         buf.append(create_scheme_entry(scheme_name, scheme_items))
 
     with open(CRATE_DIR / "src/autogen_schemes.rs", "w") as f:
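
Review note: the rename step above maps upstream `common_maps` names to the names used in the generated Rust file. If more aliases accumulate, a lookup table may read more clearly than a growing `if`/`elif` chain. This is a sketch of one alternative, not the commit's code, and `OUTPUT_NAMES` and `output_name` are hypothetical names.

```python
# Upstream scheme name -> name emitted into the generated file.
# The two entries mirror the if/elif chain in the hunk above.
OUTPUT_NAMES = {
    "TAMIL_SUPERSCRIPTED": "TAMIL",
    "TIRHUTA_MAITHILI": "TIRHUTA",
}


def output_name(scheme_name: str) -> str:
    """Return the emitted name, defaulting to the upstream name."""
    return OUTPUT_NAMES.get(scheme_name, scheme_name)


for name in ["TAMIL_SUPERSCRIPTED", "TIRHUTA_MAITHILI", "KHMER"]:
    print(name, "->", output_name(name))
```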