Indic-PersoArabic-Script-Converter

Indo-Pakistani Transliteration

A Python library for converting text between Indian scripts and Pakistani scripts, in either direction.

Currently supported methods

  1. Rule-based conversion
    • Faster, but does not support short vowels
    • Less accurate, especially for Arabic-to-Indic conversion
  2. Sangam Project's online transliteration API
    • Uses an online endpoint for the conversion
    • Produces much better results, but is much slower
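Given the trade-off above, one option is to try the slower online method first and fall back to the rule-based one when the request fails. The helper below is a minimal sketch, not part of the library: `convert_with_fallback` is a hypothetical name, and it assumes that the online function raises an exception on a network or endpoint error, which this README does not confirm.

```python
from typing import Callable

# Both converters documented under Usage take (text, from_script, to_script).
Converter = Callable[[str, str, str], str]


def convert_with_fallback(
    text: str,
    from_script: str,
    to_script: str,
    primary: Converter,
    fallback: Converter,
) -> str:
    """Return primary's result; if primary raises, return fallback's result.

    Assumption: a failed online call surfaces as an exception.
    """
    try:
        return primary(text, from_script, to_script)
    except Exception:
        return fallback(text, from_script, to_script)


# Intended use (requires the library and network access):
# from indo_arabic_transliteration.mapper import script_convert
# from indo_arabic_transliteration.sangam_api import online_transliterate
# convert_with_fallback("हैदराबाद", "hi-IN", "ur-PK",
#                       online_transliterate, script_convert)
```

Passing the converters in as arguments keeps the helper independent of either backend, so it can be exercised offline with stand-in functions.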

Usage

Installation

Pre-requisites:

  • Use Python 3.7+
  • Install the Indic NLP library: pip install git+https://github.com/GokulNC/indic_nlp_library

Then install this library:

pip install indo-arabic-transliteration

Using rule-based conversion

from indo_arabic_transliteration.mapper import script_convert
# Signature: script_convert(text: str, from_script: str, to_script: str)
converted = script_convert(text, from_script, to_script)

Using Sangam API

from indo_arabic_transliteration.sangam_api import online_transliterate
# Signature: online_transliterate(text: str, from_script: str, to_script: str)
converted = online_transliterate(text, from_script, to_script)
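Because the online method is slow, repeated inputs can be memoized on the caller's side. The wrapper below is a sketch under that assumption; `cached_converter` is a hypothetical helper, not something the library provides, and it only relies on the three-argument interface shown above.

```python
from functools import lru_cache
from typing import Callable

Converter = Callable[[str, str, str], str]


def cached_converter(converter: Converter, maxsize: int = 1024) -> Converter:
    """Wrap a converter so identical (text, from, to) calls hit it only once."""

    @lru_cache(maxsize=maxsize)
    def _cached(text: str, from_script: str, to_script: str) -> str:
        return converter(text, from_script, to_script)

    return _cached


# Intended use (requires the library and network access):
# from indo_arabic_transliteration.sangam_api import online_transliterate
# fast_transliterate = cached_converter(online_transliterate)
# fast_transliterate("हैदराबाद", "hi-IN", "ur-PK")
```

The cache lives in memory for the lifetime of the process, so it helps most when the same words recur within one run.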

Languages

We use the standard BCP 47 language tags to refer to the language-script combinations.
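The tag pairs listed in the tables of this section can be checked before calling a converter. The snippet below is an illustrative sketch: `SUPPORTED_PAIRS` and `is_supported_pair` are hypothetical names that mirror only the pairs documented in this README, and the library itself may accept or reject other combinations.

```python
# Language-script pairs documented in this README (order-independent).
SUPPORTED_PAIRS = {
    frozenset(("hi-IN", "ur-PK")),  # Hindi <-> Urdu
    frozenset(("pa-IN", "pa-PK")),  # Gurmukhi <-> Shahmukhi Punjabi
    frozenset(("sd-IN", "sd-PK")),  # Devanagari <-> Perso-Arabic Sindhi
}


def is_supported_pair(from_script: str, to_script: str) -> bool:
    """Return True if the two tags form one of the documented pairs."""
    return frozenset((from_script, to_script)) in SUPPORTED_PAIRS


print(is_supported_pair("ur-PK", "hi-IN"))  # documented pair, reversed order
print(is_supported_pair("hi-IN", "pa-PK"))  # not a documented pair
```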

Hindi-Urdu (Hindustani)

| Language | Script | Code |
|---|---|---|
| Hindi | Devanagari | hi-IN |
| Urdu | Perso-Arabic | ur-PK |

Example:

# Rule-based
script_convert("हैदराबाद‎", 'hi-IN', 'ur-PK') # حیدرآباد
script_convert("حيدرآباد‎", 'ur-PK', 'hi-IN') # हीदराबाद‎

# Online-API
online_transliterate("حيدرآباد‎", 'ur-PK', 'hi-IN') # हैदराबाद‎
online_transliterate("हैदराबाद‎", 'hi-IN', 'ur-PK') # حیدرآباد‎

Notes & Resources:

Panjabi

| Language | Script | Code |
|---|---|---|
| East Punjabi | Gurmukhi | pa-IN |
| West Punjabi | Shahmukhi | pa-PK |

Example:

# Rule-based
script_convert("ਸਿੰਘ", 'pa-IN', 'pa-PK') # سںگھ
script_convert("سںگھ", 'pa-PK', 'pa-IN') # ਸਂਘ

# Online-API
online_transliterate("سنگھ", 'pa-PK', 'pa-IN') # ਸਿੰਘ
online_transliterate("ਸਿੰਘ", 'pa-IN', 'pa-PK') # سِنگھ

Notes & Resources:

Sindhi

| Language | Script | Code |
|---|---|---|
| Indian Sindhi | Devanagari | sd-IN |
| Pakistani Sindhi | Perso-Arabic | sd-PK |

Example:

# Rule-based
script_convert("हैदराबाद‎", 'sd-IN', 'sd-PK') # حیدرآباد
script_convert("حيدرآباد‎", 'sd-PK', 'sd-IN') # हीदराबाद‎

# Online-API
online_transliterate("حيدرآباد‎", 'sd-PK', 'sd-IN') # हैदराबाद‎
online_transliterate("हैदराबाद‎", 'sd-IN', 'sd-PK') # حیدرآباد‎

Notes & Resources:


Other Methods

Machine Learning-based Transliteration

  • Uses the LibIndicTrans library for its models
    • Install it with: pip install git+https://github.com/libindic/indic-trans
  • Currently supports only the Hindi-Urdu pair

API:

from indo_arabic_transliteration.ml_based import ml_transliterate
# Same interface as script_convert()

Indic-to-Arabic with Diacritics

  • Indic scripts are mostly phonetic. Use this converter to retain vowel diacritics in the Perso-Arabic output
    • Currently supports only Hindustani (Hindi to Urdu) and Punjabi (Gurmukhi to Shahmukhi)
    • Uses the Aksharamukha library

API:

from indo_arabic_transliteration.lossless_converter import convert_with_diacritics
# Same interface as script_convert()
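To see what "diacritics" means here, compare the two Shahmukhi spellings from the Punjabi example: سِنگھ carries a short-vowel mark and سنگھ does not. The sketch below uses only the Python standard library to strip such marks; `strip_marks` is a hypothetical helper for illustration and is not how the library itself works.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Drop nonspacing marks (Unicode category Mn), e.g. Arabic zer/zabar/pesh.

    Intended for Perso-Arabic text only: Indic scripts also encode some
    vowel signs as nonspacing marks, which this would remove as well.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


with_zer = "\u0633\u0650\u0646\u06af\u06be"  # سِنگھ (seen + zer + noon + gaf + do-chashmi he)
without_zer = "\u0633\u0646\u06af\u06be"     # سنگھ

print(strip_marks(with_zer) == without_zer)  # True
```

Going from the marked form to the unmarked one is trivial, but the reverse needs knowledge of the word, which is why keeping the marks during Indic-to-Arabic conversion preserves information.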

Support

  • For help in using the library, please use the GitHub Issues section.
  • For script conversion errors from the online API, please write directly to the Sangam team. We are not affiliated with them in any way, and this is not an official library.