Indic / इंडिक - conversion tools

Optilon
Site Admin
Posts: 50
Joined: Mon Aug 31, 2020 8:36 am

Post by Optilon »

Hindi is ranked 3rd, Bengali 6th, Urdu 10th, Punjabi 11th, Marathi 14th and Telugu 17th. So of the top 20 languages, 6 are Indic.
Many thanks to @hurrdudd for sharing his knowledge with us. A summary of his explanations from Reddit: https://www.reddit.com/r/HindiLanguage/ ... favourite/
Optilon question: How prevalent are the following methods of typing Hindi?
a) direct Hindi alphabet (~46 Hindi letters)
b1) Indic IME - conversion of English letters to Hindi
b2) Indic IME - other?
c) writing Hindi words with English letters, no conversion
d) other?

hurrdudd answer: I find phonetic IME layouts (b1) more convenient than InScript et al. (a). In particular, I use the Bolnagri layout, on which I am able to achieve a decent typing rate. This is with physical keyboards. On mobile (or virtual keyboards), I prefer the full Devanagari keyboard (a) over English (Roman) to Hindi conversion.
Optilon question: I've got Hindi text corpora in the Hindi alphabet and need to convert them to English letters. Alternatively, I would need Hindi text corpora that are already available in correct English letters. I used the conversion tool linked here: http://opt-in-layout.org/viewtopic.php?f=13&t=9 but the results were wrong. Are there any Hindi text corpora in English letters, or is there a better conversion tool?

hurrdudd answer: I can vouch for Aksharamukha. Its "Devanagari" to "Roman (Readable)" conversion is pretty accurate in most cases. It is also available as a Python package.



Old: There is a conversion tool for many Indic languages:
http://mylanguages.org/romanization.php
http://mylanguages.org/hindi_romanization.php
One might need to do a text cleanup afterwards:
http://mylanguages.org/romanization_cleanup.php
hurrdudd
Posts: 14
Joined: Tue Sep 08, 2020 4:47 pm

Re: Indic / इंडिक - conversion tools

Post by hurrdudd »

Use Aksharamukha https://aksharamukha.appspot.com/python

With the Python package, the following code should do the job for you:

Code: Select all

# coding: utf-8
from aksharamukha import transliterate

def transliterateRoman(word):
    # Devanagari -> "Roman (Readable)", with Hindi schwa removal as a pre-option.
    return transliterate.process('Devanagari', 'RomanReadable', word, pre_options=['RemoveSchwaHindi'])

print(transliterateRoman("गांधी"))
For me, it produces 'gaandhee'.
Optilon
Site Admin
Posts: 50
Joined: Mon Aug 31, 2020 8:36 am

Re: Indic / इंडिक - conversion tools

Post by Optilon »

@hurrdudd
Thank you so much! :D
Is "Roman (Readable)" the same text that someone would type on a Hindi IME?
hurrdudd
Posts: 14
Joined: Tue Sep 08, 2020 4:47 pm

Re: Indic / इंडिक - conversion tools

Post by hurrdudd »

Optilon wrote: Wed Sep 09, 2020 11:17 am @hurrdudd
Thank you so much! :D
Is the "Roman readable" the same text that someone would type to write on Hindi IME?
Yes, almost the same. However, it is not an exact transliteration, unlike schemes such as ITRANS or Harvard-Kyoto, which have a one-to-one correspondence between the English alphabet and Devanagari symbols.
Optilon
Site Admin
Posts: 50
Joined: Mon Aug 31, 2020 8:36 am

Re: Indic / इंडिक - conversion tools

Post by Optilon »

hurrdudd wrote: Thu Sep 10, 2020 10:18 am Yes, almost the same. Though this is not exact transliteration unlike schemes such as ITRANS or Harvard-Kyoto which have 1-1 correspondence between English alphabet and Devanagari symbols.
Okay. Thank you so much. I will use it to create an optimized Hindi IME keyboard. :D

Edit: I tried to use pip3 on Ubuntu:

Code: Select all

$ sudo pip3 install aksharamukha
Collecting aksharamukha
  Downloading aksharamukha-1.8.1-py3-none-any.whl (159 kB)
     |████████████████████████████████| 159 kB 3.4 MB/s 
Requirement already satisfied: requests in /usr/lib/python3/dist-packages (from aksharamukha) (2.22.0)
Installing collected packages: aksharamukha
Successfully installed aksharamukha-1.8.1
but got:

Code: Select all

$ from aksharamukha import transliterate
from: can't read /var/mail/aksharamukha
So I just did "Convert Files (batch)" which was convenient:
[Attachment: 2020-09-10 22.08.49 aksharamukha.appspot.com bfa70eba09c1.png]
Aksharamukha is really a super great tool 8-)
The big 22 MB file did not finish. Smaller files with up to 30k sentences were transcribed quickly.
hurrdudd
Posts: 14
Joined: Tue Sep 08, 2020 4:47 pm

Re: Indic / इंडिक - conversion tools

Post by hurrdudd »

Optilon wrote: Thu Sep 10, 2020 2:22 pm but got:

Code: Select all

$ from aksharamukha import transliterate
from: can't read /var/mail/aksharamukha
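I am sorry for not being clearer. The snippet I wrote was meant to be saved as a Python file and executed. Typed at the shell prompt, "from" is run as a shell command rather than as a Python statement, hence the /var/mail error.

If you want to batch convert documents, use the following code.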

Code: Select all

#!/usr/bin/python3
# coding: utf-8
import argparse
import sys

from aksharamukha import transliterate


def transliterateRoman(s):
    # Devanagari -> "Roman (Readable)", with Hindi schwa removal as a pre-option.
    return transliterate.process('Devanagari', 'RomanReadable', s, pre_options=['RemoveSchwaHindi'])


def batch_transliterate(dev_file, rr_file):
    # Convert line by line so a large file is never held in memory at once.
    for line in dev_file:
        rr_file.write(transliterateRoman(line))
    return 0


if __name__ == '__main__':
    ap = argparse.ArgumentParser()
    # Both arguments are optional; they default to stdin and stdout.
    # Named files are opened as UTF-8 explicitly, independent of the system locale.
    ap.add_argument('devanagari_file', nargs='?', type=argparse.FileType('r', encoding='utf-8'), default=sys.stdin)
    ap.add_argument('romanreadable_file', nargs='?', type=argparse.FileType('w', encoding='utf-8'), default=sys.stdout)
    args = ap.parse_args()

    sys.exit(batch_transliterate(args.devanagari_file, args.romanreadable_file))
Save it as a Python file, say "batch_convert.py", and then execute it using "python3 batch_convert.py <input_filename> <output_filename>".

Please do not overload the Aksharamukha server like that. The service is offered free of cost for nominal use.
Optilon
Site Admin
Posts: 50
Joined: Mon Aug 31, 2020 8:36 am

Re: Indic / इंडिक - conversion tools

Post by Optilon »

hurrdudd wrote: Fri Sep 11, 2020 8:29 pm Please do not overload Aksharamukha server like that. The service is offered free of cost for nominal use.
Thank you! I will try the script this weekend.
Optilon
Site Admin
Posts: 50
Joined: Mon Aug 31, 2020 8:36 am

Re: Indic / इंडिक - conversion tools

Post by Optilon »

@hurrdudd
Which one is closest to the way one would type?

Roman Readable:
1 unhomne bataayaa ki jeba mem 14 hajaara roopae the aura jeba usa vakta kat'ee jaba ve vahaam para hue eka haadase mem ghaayala yuvaka kaa haalachaala poochhane ke lie apanee gaad'ee se utare.
2 rabee ke lie jaroorata beekaanera ko itanee bijalee bhee naheem mila rahee ki shahara mem kat'autee karake graameena kshetra ko aapoorti kee jaa sake.
3 isee bhaavanaa aura anubhava ke aadhaara para madhya pradesha ke mukhyamantree ke roopa mem maimne aapako raajya yojanaa bord'a kaa sadasya banane ke lie aamantrita kiyaa.
4 haalaanki, kisee bhee saude se pahale usakaa d'yoo d'ilijemsa hotaa hai aura saude kee sameekshaa se isakee shuruaata hotee hai.
5 isa shrenee ke kuchha anya samaachaara Tags: nirvaachana aayoga, matadaataa parichaya patra, heeraka jayantee, national voters day t'ippanee ke saatha tasveera dikhaanaa chaahate haim to chunem eka Gravatar!
Roman Kyoto:
1 unhoMne batAyA ki jeba meM 14 hajAra rUpae the aura jeba usa vakta kaTI jaba ve vahAM para hue eka hAdase meM ghAyala yuvaka kA hAlacAla pUchane ke lie apanI g2ADI se utare.
2 rabI ke lie jarUrata bIkAnera ko itanI bijalI bhI nahIM mila rahI ki zahara meM kaTautI karake grAmINa kSetra ko ApUrti kI jA sake.
3 isI bhAvanA aura anubhava ke AdhAra para madhya pradeza ke mukhyamaMtrI ke rUpa meM maiMne Apako rAjya yojanA borDa kA sadasya banane ke lie AmaMtrita kiyA.
4 hAlAMki, kisI bhI saude se pahale usakA DyU DilijeMsa hotA hai aura saude kI samIkSA se isakI zuruAta hotI hai.
5 isa zreNI ke kucha anya samAcAra Tags: nirvAcana Ayoga, matadAtA paricaya patra, hIraka jayaMtI, national voters day TippaNI ke sAtha tasvIra dikhAnA cAhate haiM to cuneM eka Gravatar!
Roman IAST:
1 unhoṃne batāyā ki jeba meṃ 14 hajāra rūpae the aura jeba usa vakta kaṭī jaba ve vahāṃ para hue eka hādase meṃ ghāyala yuvaka kā hālacāla pūchane ke lie apanī ġāḍī se utare.
2 rabī ke lie jarūrata bīkānera ko itanī bijalī bhī nahīṃ mila rahī ki śahara meṃ kaṭautī karake grāmīṇa kṣetra ko āpūrti kī jā sake.
3 isī bhāvanā aura anubhava ke ādhāra para madhya pradeśa ke mukhyamaṃtrī ke rūpa meṃ maiṃne āpako rājya yojanā borḍa kā sadasya banane ke lie āmaṃtrita kiyā.
4 hālāṃki, kisī bhī saude se pahale usakā ḍyū ḍilijeṃsa hotā hai aura saude kī samīkṣā se isakī śuruāta hotī hai.
5 isa śreṇī ke kucha anya samācāra Tags: nirvācana āyoga, matadātā paricaya patra, hīraka jayaṃtī, national voters day ṭippaṇī ke sātha tasvīra dikhānā cāhate haiṃ to cuneṃ eka Gravatar!
hurrdudd
Posts: 14
Joined: Tue Sep 08, 2020 4:47 pm

Re: Indic / इंडिक - conversion tools

Post by hurrdudd »

Optilon wrote: Sat Sep 12, 2020 9:11 am @hurrdudd
Which one is closest to the way one would type?
For your use (optimizing Hindi keyboard input) both Harvard-Kyoto and IAST would work as they assign a single character to each keystroke. As I mentioned elsewhere, Roman readable does not do that.
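
To illustrate the one-character-per-symbol idea, here is a minimal sketch that maps Harvard-Kyoto text to IAST one character at a time. The table is an assumption built only from the characters visible in the sample sentences quoted in this thread, so it is partial and not the full scheme. It skips multi-character cases such as "g2", and it would also mangle English words with capitals (like "Tags") mixed into the text. The names HK_TO_IAST and hk_to_iast are made up for this example.

```python
# Partial Harvard-Kyoto -> IAST table, limited to characters that appear in
# the sample sentences in this thread. It is NOT a complete scheme.
HK_TO_IAST = str.maketrans({
    "A": "ā",
    "I": "ī",
    "U": "ū",
    "M": "ṃ",
    "T": "ṭ",
    "D": "ḍ",
    "N": "ṇ",
    "z": "ś",
    "S": "ṣ",
})


def hk_to_iast(text: str) -> str:
    """Replace each Harvard-Kyoto character with its IAST counterpart."""
    return text.translate(HK_TO_IAST)


if __name__ == "__main__":
    # A fragment of sample sentence 2 in Harvard-Kyoto.
    print(hk_to_iast("grAmINa kSetra"))
```

Because every entry maps one character to one character, the table could be inverted to go from IAST back to Harvard-Kyoto, which would not be possible in general with Roman readable.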
Optilon
Site Admin
Posts: 50
Joined: Mon Aug 31, 2020 8:36 am

Re: Indic / इंडिक - conversion tools

Post by Optilon »

hurrdudd wrote: Sat Sep 12, 2020 3:35 pm For your use (optimizing Hindi keyboard input) both Harvard-Kyoto and IAST would work as they assign a single character to each keystroke. As I mentioned elsewhere, Roman readable does not do that.
Okay. I now see they are basically the same: Harvard-Kyoto uses capital letters, while IAST uses [_] above the vowels and [.] below consonants. I assume that when typing, one would type as in IAST but without the strokes and dots, correct? I will then create a new letter frequency and bigram chart. Thank you, hurrdudd! Without you, I wouldn't have got this far in Hindi so quickly! :)
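
A rough sketch of how that chart could be produced with only the Python standard library. It assumes the corpus has already been converted to IAST, and that dropping the diacritics really does approximate what people type, which is an unconfirmed assumption in this thread. The function names are made up for this example.

```python
import unicodedata
from collections import Counter


def strip_diacritics(text: str) -> str:
    """Drop combining marks (macrons, dots) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def letter_counts(text: str) -> Counter:
    """Count alphabetic characters, ignoring case."""
    return Counter(ch for ch in text.lower() if ch.isalpha())


def bigram_counts(text: str) -> Counter:
    """Count adjacent letter pairs inside each whitespace-separated word."""
    counts = Counter()
    for word in text.lower().split():
        letters = [ch for ch in word if ch.isalpha()]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    return counts


if __name__ == "__main__":
    # Start of sample sentence 2 in IAST.
    sample = "rabī ke lie jarūrata bīkānera ko itanī bijalī bhī nahīṃ mila rahī"
    plain = strip_diacritics(sample)
    print(plain)
    print(letter_counts(plain).most_common(5))
    print(bigram_counts(plain).most_common(5))
```

Bigrams are counted within words only, so the space key is not included. Lowercasing is harmless for diacritic-stripped IAST, but it would erase the case distinctions that Harvard-Kyoto relies on, so this should not be run on Harvard-Kyoto text directly.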