[lipi] Add eight schemes
Schemes:
- Add support for Khmer, Modi, Newa, Saurashtra, Tamil superscript,
  Thai, Tibetan, and Tirhuta.

Features:
- Update `detect` for new schemes.
- Update `unicode_norm` logic for new schemes.

Bug fixes:
- Add missing schemes to `Scheme::iter`.
- Update `unicode_norm` logic for previously missing schemes.
- Slightly improve support for ITRANS.

Code:
- Add `reshape` module to support more complex schemes.
- Combine our two separate transliteration functions into a single
  `transliterate_inner`.
- Create internal `Token` struct to model mappings.
- Avoid extra allocation in main `transliterate` loop.
- Fix various `clippy` warnings.

Documentation:
- Add "Alternatives" section to README and update examples.
- Add extensive comments to core code.
- Add or expand various docstrings.
akprasad committed Jan 28, 2024
1 parent a0e7546 commit 885a962
Showing 14 changed files with 2,043 additions and 447 deletions.
62 changes: 47 additions & 15 deletions vidyut-lipi/README.md
@@ -28,7 +28,7 @@ An online demo is available [here][demo].
Overview
--------

Communities around the world write Sanskrit and other Indian languages in
Communities around the world write Sanskrit and other Indian languages with
different scripts in different contexts. For example, a user might type
Sanskrit in ITRANS, read it in Kannada, and publish it in Devanagari. Such
communities often rely on a *transliterator*, which converts text from one
@@ -42,24 +42,48 @@ and feature work is diluted across several different implementations.
ecosystem. Our priorities are:

- quality, including a comprehensive test suite.
- coverage across all of the schemes in common use.
- ease of use (and reuse) for developers.
- test coverage across all of the schemes in common use.
- a precise and ergonomic API.
- availability in multiple languages, including Python and WebAssembly.
- high performance across various metrics, including runtime, startup time, and
file size.

We recommend `vidyut-lipi` if you need a simple and high-quality
transliteration library, and we encourage you to [file an issue][issue] if
`vidyut-lipi` does not support your use case. We are especially excited about
supporting new scripts and new programming languages.
We encourage you to [file an issue][issue] if `vidyut-lipi` does not support
your use case. We are especially excited about supporting new scripts and new
programming languages.

[issue]: https://github.com/ambuda-org/vidyut/issues

If `vidyut-lipi` is not right for your needs, we also strongly recommend
the [Aksharamukha][aksharamukha] the [indic-transliteration][indic-trans]
projects, which have each been highly influential in our work on `vidyut-lipi`.

[aksharamukha]: https://github.com/virtualvinodh/aksharamukha/
[indic-trans]: https://github.com/indic-transliteration
Alternatives to `vidyut-lipi`
-----------------------------

There are two main alternatives to `vidyut-lipi`, both of which have been
influential on the design of `vidyut-lipi`:

- [Aksharamukha][am] offers high quality and supports more than a hundred
different scripts. Aksharamukha offers best-in-class transliteration, but it
is available only in Python.

- [indic-transliteration][it] implements the same basic transliterator in
multiple programming languages. indic-transliteration supports a large
software ecosystem, but its different implementations each have their own
quirks and limitations.

[am]: https://github.com/virtualvinodh/aksharamukha/
[it]: https://github.com/indic-transliteration

Our long-term goal is to combine the quality of Aksharamukha with the
availability of indic-transliteration. Until then, `vidyut-lipi` provides the
following short-term benefits:

- High-quality transliteration for Rust and WebAssembly.
- Smooth support for other programming languages through projects like
[pyo3][pyo3] (Python), [magnus][magnus] (Ruby), [cxx][cxx] (C++), etc.

[pyo3]: https://pyo3.rs/v0.20.2/
[magnus]: https://github.com/matsadler/magnus
[cxx]: https://cxx.rs/


Usage
@@ -102,31 +126,39 @@ for scheme in Scheme::iter() {
}
```

As of 2023-12-29, this code prints the following:
As of 2024-01-27, this code prints the following:

```text
Balinese ᬲᬂᬲ᭄ᬓᬺᬢᬫ᭄
BarahaSouth saMskRutam
Bengali সংস্কৃতম্
Brahmi 𑀲𑀁𑀲𑁆𑀓𑀾𑀢𑀫𑁆
Burmese သံသ်ကၖတမ်
Devanagari संस्कृतम्
Grantha 𑌸𑌂𑌸𑍍𑌕𑍃𑌤𑌮𑍍
Gujarati સંસ્કૃતમ્
Gurmukhi ਸਂਸ੍ਕਤਮ੍
BarahaSouth saMskRutam
HarvardKyoto saMskRtam
Iast saṃskṛtam
Iso15919 saṁskr̥tam
Itrans saMskRRitam
Javanese ꦱꦁꦱ꧀ꦏꦽꦠꦩ꧀
Kannada ಸಂಸ್ಕೃತಮ್
Khmer សំស្ក្ឫតម៑
Malayalam സംസ്കൃതമ്
Modi 𑘭𑘽𑘭𑘿𑘎𑘵𑘝𑘦𑘿
Newa 𑐳𑑄𑐳𑑂𑐎𑐺𑐟𑐩𑑂
Odia ସଂସ୍କୃତମ୍
Saurashtra ꢱꢀꢱ꣄ꢒꢺꢡꢪ꣄
Sharada 𑆱𑆁𑆱𑇀𑆑𑆸𑆠𑆩𑇀
Siddham 𑖭𑖽𑖭𑖿𑖎𑖴𑖝𑖦𑖿
Sinhala සංස්කෘතම්
Slp1 saMskftam
Tamil ஸம்ஸ்க்ரு'தம்
Tamil ஸம்ʼஸ்க்ருʼதம்
Telugu సంస్కృతమ్
Thai สํสฺกฺฤตมฺ
Tibetan སཾསྐྲྀཏམ
Tirhuta 𑒮𑓀𑒮𑓂𑒏𑒵𑒞𑒧𑓂
Velthuis sa.msk.rtam
Wx saMskqwam
```
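
As a rough illustration of what a table-driven transliterator does for two of the rows above (`HarvardKyoto` and `Iast`), here is a toy sketch in Python. It is not how `vidyut-lipi` is implemented and it covers only the characters where the two romanizations differ for simple input. The names `HK_TO_IAST` and `hk_to_iast` are hypothetical and exist only for this example.

```python
# Toy Harvard-Kyoto -> IAST converter using greedy longest-match lookup.
# Assumption: this is an illustrative subset, not the vidyut-lipi algorithm.

HK_TO_IAST = {
    "A": "\u0101",   # long a
    "I": "\u012b",   # long i
    "U": "\u016b",   # long u
    "R": "\u1e5b",   # vocalic r
    "RR": "\u1e5d",  # long vocalic r
    "M": "\u1e43",   # anusvara
    "H": "\u1e25",   # visarga
    "G": "\u1e45",   # velar nasal
    "J": "\u00f1",   # palatal nasal
    "T": "\u1e6d",   # retroflex t
    "D": "\u1e0d",   # retroflex d
    "N": "\u1e47",   # retroflex n
    "z": "\u015b",   # palatal sibilant
    "S": "\u1e63",   # retroflex sibilant
}

# Longest key in the table, so that "RR" is tried before "R".
MAX_KEY_LEN = max(len(key) for key in HK_TO_IAST)


def hk_to_iast(text: str) -> str:
    """Convert a Harvard-Kyoto string to IAST with the toy table above."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(MAX_KEY_LEN, 0, -1):
            chunk = text[i:i + size]
            if chunk in HK_TO_IAST:
                out.append(HK_TO_IAST[chunk])
                i += len(chunk)
                break
        else:
            # No table entry: the character is the same in both schemes.
            out.append(text[i])
            i += 1
    return "".join(out)


print(hk_to_iast("saMskRtam"))  # saṃskṛtam
```

Trying the longest key first is what keeps a two-character token such as `RR` from being split into two separate `R` matches.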
99 changes: 92 additions & 7 deletions vidyut-lipi/scripts/create_schemes.py
@@ -37,20 +37,29 @@
    "BENGALI",
    "BRAHMI",
    "BURMESE",
    "CHAM",
    "DEVANAGARI",
    "GUJARATI",
    "GURMUKHI",
    "GRANTHA",
    "JAVANESE",
    "KANNADA",
    "KHMER",
    "LAO",
    "MALAYALAM",
    "MODI",
    "NEWA",
    "ORIYA",
    "SHARADA",
    "SIDDHAM",
    "SINHALA",
    "TAMIL",
    # Not yet on indic-transliteration/master
    "SAURASHTRA",
    "TAMIL_SUPERSCRIPTED",
    "TELUGU",
    "THAI",
    "TIBETAN",
    "TIRHUTA_MAITHILI",

    "BARAHA",
    "HK",
@@ -93,7 +102,7 @@ def to_unique(xs: list) -> list:
def _maybe_override(name: str, deva: str, raw: str) -> str | None:
    overrides = {}

    if name in {"BRAHMI", "BALINESE", "BURMESE", "SIDDHAM", "TIBETAN"}:
    if name in {"BRAHMI", "BALINESE", "BURMESE", "SIDDHAM"}:
        if deva in {"\u0946", "\u094a", "\u090e", "\u0912"}:
            # - short e mark
            # - short o mark
@@ -110,6 +119,14 @@ def _maybe_override(name: str, deva: str, raw: str) -> str | None:
            "\ua8e2": None,
            "\ua8e3": None,
        }
    elif name == "CHAM":
        overrides = {
            # Short e and o, plus vowel marks
            "\u0946": None,
            "\u094a": None,
            "\u090e": None,
            "\u0912": None,
        }
    elif name == "GRANTHA":
        overrides = {
            # vowel sign AU
@@ -124,6 +141,9 @@ def _maybe_override(name: str, deva: str, raw: str) -> str | None:
        overrides = {
            "।": ".",
            "॥": "..",
            "ख़": "k͟h",
            # Delete -- common_maps maps this to "ḳ", which we need for aytam.
            # We'll add a valid mapping for क़: further below.
            "क़": None,
        }
    elif name == "IAST":
@@ -135,10 +155,64 @@ def _maybe_override(name: str, deva: str, raw: str) -> str | None:
            # candrabindu
            "\u0901": "m̐",
        }
    elif name == "TAMIL":
    elif name == "KHMER":
        overrides = {
            "।": "។",
            "॥": "៕",
        }
    elif name == "MODI":
        overrides = {
            "\u0907": "\U00011602",  # letter i
            "\u0908": "\U00011603",  # letter ii
            "\u0909": "\U00011604",  # letter u
            "\u090a": "\U00011605",  # letter uu
            "\u090b": "\U00011606",  # letter vocalic r
            "\u090c": "\U00011608",  # letter vocalic l
            "\u093f": "\U00011631",  # sign i
            "\u0940": "\U00011632",  # sign ii
            "\u0941": "\U00011633",  # sign u
            "\u0942": "\U00011634",  # sign uu
            "\u0943": "\U00011635",  # sign vocalic r
            "\u0944": "\U00011636",  # sign vocalic rr
            "\u0960": "\U00011607",  # letter vocalic rr
            "\u0961": "\U00011609",  # letter vocalic ll
            "\u0962": "\U00011637",  # sign vocalic l
            "\u0963": "\U00011638",  # sign vocalic ll

            "\u0964": "\U00011641",  # danda
            "\u0965": "\U00011642",  # double danda
        }

    elif name == "NEWA":
        overrides = {
            # Visarga
            "\u0903": None,
            "\u0964": "\U0001144b",  # danda
            "\u0965": "\U0001144c",  # double danda
        }
    elif name == "TAMIL_SUPERSCRIPTED":
        # Use roman digits per Aksharamukha
        overrides = {
            "०": "0",
            "१": "1",
            "२": "2",
            "३": "3",
            "४": "4",
            "५": "5",
            "६": "6",
            "७": "7",
            "८": "8",
            "९": "9",
        }
    elif name == "TIBETAN":
        overrides = {
            # Virama
            "\u094d": "\u0f84",
            # Short e and o, plus vowel marks
            "\u0946": None,
            "\u094a": None,
            "\u090e": None,
            "\u0912": None,
            # Use distinct "va" character instead of "ba".
            "व": "\u0f5d",
        }
    elif name == "VELTHUIS":
        # These are part of the Velthuis spec but are errors in indic-transliteration.
@@ -185,7 +259,9 @@ def create_scheme_entry(name: str, items: list[tuple[str, str]]) -> str:


def main():
    repo = "https://github.com/indic-transliteration/common_maps.git"
    # We're waiting on some changes to be pushed to indic-transliteration, so
    # use a fork for now.
    repo = "https://github.com/akprasad/common_maps.git"
    common_maps = Path("common_maps")
    if not common_maps.exists():
        print("Cloning `common_maps` ...")
@@ -333,6 +409,11 @@ def main():
                # AU (AA + AU length mark)
                ("\u094c", "\U00011347\U00011357"),
            ])
        elif scheme_name == "ITRANS":
            scheme_items.extend([
                # Vedic anusvara (just render as candrabindu)
                ("\u0901", "{\\m+}"),
            ])
        elif scheme_name == "ISO":
            scheme_items.extend([
                # Aytam
@@ -355,7 +436,7 @@ def main():
                # Anudatta
                ("\u0952", "\\"),
            ])
        elif scheme_name == "TAMIL":
        elif scheme_name == "TAMIL_SUPERSCRIPTED":
            scheme_items.extend([
                # Aytam
                ("\u0b83", "\u0b83"),
@@ -382,6 +463,10 @@ def main():
                ("\u092b\u093c", "f"),
            ])

        if scheme_name == "TAMIL_SUPERSCRIPTED":
            scheme_name = "TAMIL"
        elif scheme_name == "TIRHUTA_MAITHILI":
            scheme_name = "TIRHUTA"
        buf.append(create_scheme_entry(scheme_name, scheme_items))

    with open(CRATE_DIR / "src/autogen_schemes.rs", "w") as f:
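
The two ideas this script's diff relies on, per-scheme override tables and renaming an upstream scheme before it is written out, can be sketched in isolation. The snippet below is a minimal illustration rather than the script's actual code: `apply_overrides` and `output_name` are hypothetical helpers, and it assumes (based on the "Delete" comment in the diff) that an override value of `None` means "drop the upstream mapping". The Newa override values and the two renames are copied from the diff. The upstream items are placeholders.

```python
# Sketch of the override-then-rename flow. Helper names are hypothetical.

# Copied from the NEWA branch of the diff.
NEWA_OVERRIDES = {
    "\u0903": None,          # visarga: assumed to mean "drop this mapping"
    "\u0964": "\U0001144b",  # danda
    "\u0965": "\U0001144c",  # double danda
}

# Upstream names that the diff rewrites before emitting a scheme entry.
RENAMES = {
    "TAMIL_SUPERSCRIPTED": "TAMIL",
    "TIRHUTA_MAITHILI": "TIRHUTA",
}


def apply_overrides(items, overrides):
    """Return (deva, raw) pairs with overrides applied.

    A value of None removes the pair; any other value replaces `raw`.
    Pairs whose key is not in `overrides` pass through unchanged.
    """
    result = []
    for deva, raw in items:
        if deva in overrides:
            raw = overrides[deva]
            if raw is None:
                continue
        result.append((deva, raw))
    return result


def output_name(scheme_name):
    """Map an upstream scheme name to the name used in the generated code."""
    return RENAMES.get(scheme_name, scheme_name)


# Placeholder upstream data: the raw values here are made up for the demo.
upstream_items = [("\u0903", "x"), ("\u0964", "y"), ("\u0915", "z")]
print(apply_overrides(upstream_items, NEWA_OVERRIDES))
print(output_name("TIRHUTA_MAITHILI"))  # TIRHUTA
```

Keeping the overrides as plain dictionaries means a new script usually needs only a new table, not new control flow.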