Hindi Roman IME - हिंदी रोमन IME - text corpora

Optilon
Site Admin
Posts: 50
Joined: Mon Aug 31, 2020 8:36 am


Post by Optilon »

For Hindi Devanagari, I used the following text corpora from uni-leipzig:
https://wortschatz.uni-leipzig.de/en/download/hindi
hin_mixed_2019_30K
hin-in_web_2015_30K
hin_news_2011_10K
hin_newscrawl_2016_10K
hin_newscrawl_2017_10K
hin_wikipedia_2016_10K
---
total: 100K
---
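A minimal sketch of how the six downloads could be merged into one input file for the steps below. It assumes each corpus archive contains a sentences file with lines of the form `id<TAB>sentence`; the file names in the usage part are hypothetical and have to be adapted to the actual downloads.

```python
def extract_sentences(lines):
    """Yield the sentence part of 'id<TAB>sentence' lines; pass other lines through."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        head, sep, tail = line.partition("\t")
        # Only strip the first column if it really looks like a numeric id
        yield tail if sep and head.strip().isdigit() else line


def merge_corpora(paths, out_path):
    """Write all sentences from `paths` into `out_path`; return the sentence count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for sentence in extract_sentences(src):
                    out.write(sentence + "\n")
                    count += 1
    return count


if __name__ == "__main__":
    import sys
    # Hypothetical usage: python3 merge_corpora.py hindi2020.txt corpus1.txt corpus2.txt ...
    if len(sys.argv) > 2:
        print(merge_corpora(sys.argv[2:], sys.argv[1]), "sentences written")
```

If the assumption about the file format holds, the sentence count printed at the end should come out at roughly the 100K total listed above.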
For conversion, multiple steps are necessary:
1. Aksharamukha conversion tool, preferably the Python script. Conversion from Devanagari with schwa deletion to Roman IAST.
2. Conversion script to replace ā|æ|ç|ḍ|ĕ|è|ġ|ḥ|ī|ï|ṃ|ṇ|ṉ|ṅ|ñ|Ñ|Ó|ô|ŏ|ṛ|ṟ|ṝ|ś|ṣ|ṭ|ū|ü|ẏ with plain (unaccented) letters.
3. sed command in the terminal (Linux)
...
1. Aksharamukha conversion tool (source: https://aksharamukha.appspot.com/):

Code:

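#!/usr/bin/python3
# coding: utf-8
"""Transliterate Devanagari text to Roman IAST, with Hindi schwa deletion."""
import argparse
import sys

from aksharamukha import transliterate


def transliterateRoman(s):
    # The 'RemoveSchwaHindi' pre-option applies the schwa deletion mentioned in step 1
    return transliterate.process('Devanagari', 'IAST', s, pre_options=['RemoveSchwaHindi'])


def batch_transliterate(dev_file, rr_file):
    # Convert line by line so that a large corpus does not have to fit in memory
    for line in dev_file:
        rr_file.write(transliterateRoman(line))
    return 0


if __name__ == '__main__':
    ap = argparse.ArgumentParser()
    # Open both files explicitly as UTF-8; stdin/stdout are used if an argument is omitted
    ap.add_argument('devanagari_file', nargs='?', type=argparse.FileType('r', encoding='utf-8'), default=sys.stdin)
    ap.add_argument('romanIAST_file', nargs='?', type=argparse.FileType('w', encoding='utf-8'), default=sys.stdout)
    args = ap.parse_args()

    sys.exit(batch_transliterate(args.devanagari_file, args.romanIAST_file))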
save file as: convert_hindi_roman.py
run script with:

Code:

python3 convert_hindi_roman.py hindi2020.txt hindiromanIAST2020.txt
---
2. Conversion script to replace (ā|æ|ç|ḍ|ĕ|è|ġ|ḥ|ī|ï|ṃ|ṇ|ṉ|ṅ|ñ|Ñ|Ó|ô|ŏ|ṛ|ṟ|ṝ|ś|ṣ|ṭ|ū|ü|ẏ) with normal letters:

Code:

$rules = New-Object System.Collections.Hashtable
$rules.ā  = "a"
$rules.æ  = "ae"
$rules.ç  = "c"
$rules.ḍ  = "d"
$rules.ĕ  = "e"
$rules.è  = "e"
$rules.ġ  = "g"
$rules.ḥ  = "h"
$rules.ī  = "i"
$rules.ï  = "i"
$rules.ṃ  = "m"
$rules.ṇ  = "n"
$rules.ṉ  = "n"
$rules.ṅ  = "n"
$rules.ñ  = "n"
$rules.Ñ  = "n"
$rules.Ó  = "O"
$rules.ô  = "o"
$rules.ŏ  = "o"
$rules.ṛ  = "r"
$rules.ṟ  = "r"
$rules.ṝ  = "r"
$rules.ś  = "s"
$rules.ṣ  = "s"
$rules.ṭ  = "t"
$rules.ū  = "u"
$rules.ü  = "u"
$rules.ẏ  = "y"

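# Read the whole file as one string. Without -Raw, Get-Content returns an array of
# lines, which would be joined with spaces when it is passed to Replace().
$file  = Get-Content -Path $args[0] -Raw
$a   = [regex]'(ā|æ|ç|ḍ|ĕ|è|ġ|ḥ|ī|ï|ṃ|ṇ|ṉ|ṅ|ñ|Ñ|Ó|ô|ŏ|ṛ|ṟ|ṝ|ś|ṣ|ṭ|ū|ü|ẏ)'

# Look up the replacement for the matched character (the pattern has only one group)
$a_cb = {$rules[$args[0].Groups[1].Value]}

Write-Progress -Activity "a" -PercentComplete 4
$file = $a.Replace($file, $a_cb)

# Write next to the input file, e.g. hindiromanIAST2020.txt.converted.txt
Set-Content -Path "$($args[0]).converted.txt" -Value $file -NoNewline
echo "Finished"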

save file as: IASTtoclear.ps1
in terminal run:

Code:

pwsh
then

Code:

./IASTtoclear.ps1 hindiromanIAST2020.txt
then rename hindiromanIAST2020.txt.converted.txt to hindiromanIASTconverted2020.txt
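For anyone without PowerShell, the same replacement table can be sketched in Python. This is an alternative under the assumption that the mapping should be identical to the rules above (including Ó → "O"); the function name `to_plain` is made up for this example.

```python
import unicodedata

# Same rules as in the PowerShell script above
RULES = {
    "ā": "a", "æ": "ae", "ç": "c", "ḍ": "d", "ĕ": "e", "è": "e", "ġ": "g",
    "ḥ": "h", "ī": "i", "ï": "i", "ṃ": "m", "ṇ": "n", "ṉ": "n", "ṅ": "n",
    "ñ": "n", "Ñ": "n", "Ó": "O", "ô": "o", "ŏ": "o", "ṛ": "r", "ṟ": "r",
    "ṝ": "r", "ś": "s", "ṣ": "s", "ṭ": "t", "ū": "u", "ü": "u", "ẏ": "y",
}

# Normalize the keys to NFC so that every key is a single code point,
# which is what str.maketrans requires for a dict argument.
TABLE = str.maketrans({unicodedata.normalize("NFC", k): v for k, v in RULES.items()})


def to_plain(text):
    """Replace the accented IAST letters with plain letters."""
    # Normalize the input as well, so 'a' + combining macron is treated like 'ā'
    return unicodedata.normalize("NFC", text).translate(TABLE)


if __name__ == "__main__":
    import sys
    # Hypothetical usage: python3 iast_to_plain.py < hindiromanIAST2020.txt > hindiromanIASTconverted2020.txt
    # (-X utf8 or a UTF-8 locale may be needed so that stdin/stdout use UTF-8)
    for line in sys.stdin:
        sys.stdout.write(to_plain(line))
```

Because the table is case-sensitive, ñ and Ñ are handled as separate entries, just like in the hashtable above.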
---
3. sed command in the terminal (Linux) to remove some diacritics that cannot be removed with the simple conversion script from step 2 (there is still one diacritic sign I was not able to replace; it looks like [_]):
sed -i 's/̤//g' hindiromanIASTconverted2020.txt
sed -i 's/̐//g' hindiromanIASTconverted2020.txt
sed -i 's/̈//g' hindiromanIASTconverted2020.txt
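The three sed calls each remove one combining mark. A more general sketch in Python is to decompose the text and drop every non-spacing mark, and to list whatever non-ASCII characters are still left. Whether this also catches the remaining sign that looks like [_] is an assumption, since its code point is not known here; the second helper is meant to find that out. Both function names are made up for this example.

```python
import unicodedata
from collections import Counter


def strip_combining(text):
    """Remove all non-spacing combining marks (Unicode category Mn)."""
    # NFD splits precomposed letters into base letter + combining marks
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def leftover_non_ascii(text):
    """Count the characters outside ASCII, to see what still needs a rule."""
    return Counter(ch for ch in text if ord(ch) > 127)


if __name__ == "__main__":
    import sys
    # Hypothetical usage: python3 check_leftovers.py hindiromanIASTconverted2020.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as f:
            cleaned = strip_combining(f.read())
        for ch, n in leftover_non_ascii(cleaned).most_common():
            print(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
```

Note that the decomposition also folds letters such as ā to a, so it overlaps with step 2, but it does not cover letters without a decomposition such as æ.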
---
Letter frequency chart (all alphabetic characters + punctuation characters with >0.1% frequency):
hindi-roman-IAST-characterfrequency-with-punctuation.png
Letter frequency table:
characterfrequency.ods
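A rough sketch of how such a character frequency table can be computed. The assumptions here are that whitespace is not counted and that shares are relative to all remaining characters; the 0.1% cut-off for non-letters follows the chart description above. The actual tool used for the chart may count differently.

```python
from collections import Counter


def char_frequencies(text, min_share=0.001):
    """Return {character: share}, keeping all letters and any other
    character whose share is above `min_share` (0.1% by default)."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    result = {}
    for ch, n in counts.most_common():
        share = n / total
        if ch.isalpha() or share > min_share:
            result[ch] = share
    return result


if __name__ == "__main__":
    import sys
    # Hypothetical usage: python3 charfreq.py hindiromanIASTconverted2020.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as f:
            for ch, share in char_frequencies(f.read().lower()).items():
                print(f"{ch}\t{share:.2%}")
```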
First optimization result for 100% Hindi:
100HINDI.png
And for 50% Hindi and 50% English:
HIN50ENG50.png
And for 100% English (this layout is not optimized for English but is still far better than QWERTY. Funny, isn't it? Dvorak, which has been optimized for English, is, as expected, a little better for 100% English. The 50% Hindi + 50% English layout does beat Dvorak in this setup):
ENG100hindiroman.png

Re: Hindi Roman IME - हिंदी रोमन IME - text corpora

Post by Optilon »

@hurrdudd
The character [a] has a frequency of ~30%, which is quite a lot. Do you think this is correct? There are many double strokes for a. Are they usually written that way?
The most common bigrams are:
821045 a
648070 aa
493560 e
392033 ha
320197 ee
317772 k
313123 ra
264597 ar
211511 ka
205643 an
169322 na
157428 s
157127 sa
144352 m
140563 ke
138969 h
129159 ya
128731 ta
123313 at
116746 ma
113958 m
113888 a
113686 p
111542 ai
110381 pa
107440 he
103996 ah
100374 la
100305 b
100180 am

Frequency chart (Edit: wrong, because a and e were counted far too frequently. The actual numbers are lower; see the updated first / main post):
hindi-roman-characterfrequency-with-punctuation.png
First optimization result:
Screenshot_20200910_225041.png
The OptHin effort is almost half that of QWERTY. Both QWERTY and Dvorak have [a] on the left pinky, which is problematic, as a is the most frequent letter, and much more frequent than in other languages. Hand alternation, the most important factor for writing speed, increases from 51% to 68%. Collisions (same-finger usage) are reduced from 4.8% to 0.7%. The balance between right and left hand is still bad: 57% and 43%.
hurrdudd
Posts: 14
Joined: Tue Sep 08, 2020 4:47 pm

Re: Hindi Roman IME - हिंदी रोमन IME - text corpora

Post by hurrdudd »

Optilon wrote: Thu Sep 10, 2020 8:55 pm @hurrdudd
The character [a] has a frequency of ~30%, which is quite a lot. Do you think this is correct? There are many double strokes for a. Are they usually written that way?
Doesn't look correct to me. I think you are counting a (अ) along with aa (आ) making it 3x the actual frequency. IMHO using Harvard-Kyoto or IAST would be preferable for frequency analysis as they would assign a unique character to each letter (as is the case with a keyboard). Using RomanReadable will never give the true picture.
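The double-counting effect can be shown with a tiny made-up sample (not taken from the corpus). The sketch assumes that the "roman readable" style writes long ā as "aa", while IAST keeps a and ā as two separate characters.

```python
from collections import Counter

LONG_A = "\u0101"  # ā


def letter_share(text, letter):
    """Share of `letter` among all alphabetic characters in `text`."""
    letters = [ch for ch in text if ch.isalpha()]
    return Counter(letters)[letter] / len(letters) if letters else 0.0


# Made-up sample with two short a and two long ā
iast = f"kamal k{LONG_A} n{LONG_A}m"
# Assumed "roman readable" spelling: every long ā is written as aa
readable = iast.replace(LONG_A, "aa")

print("IAST:     a =", letter_share(iast, "a"), " ā =", letter_share(iast, LONG_A))
print("readable: a =", letter_share(readable, "a"))
```

In the readable spelling every ā adds two more a to the count, so the share of a grows well beyond the share of the short vowel alone.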

Re: Hindi Roman IME - हिंदी रोमन IME - text corpora

Post by Optilon »

hurrdudd wrote: Fri Sep 11, 2020 8:37 pm Doesn't look correct to me. I think you are counting a (अ) along with aa (आ) making it 3x the actual frequency. IMHO using Harvard-Kyoto or IAST would be preferable for frequency analysis as they would assign a unique character to each letter (as is the case with a keyboard). Using RomanReadable will never give the true picture.
Okay, I will check again this weekend. It is important to use the text as it would be typed by the majority for easy and convenient input.

Re: Hindi Roman IME - हिंदी रोमन IME - text corpora

Post by Optilon »

hurrdudd wrote: Fri Sep 11, 2020 8:37 pm Doesn't look correct to me. I think you are counting a (अ) along with aa (आ) making it 3x the actual frequency. IMHO using Harvard-Kyoto or IAST would be preferable for frequency analysis as they would assign a unique character to each letter (as is the case with a keyboard). Using RomanReadable will never give the true picture.
The result looks far more realistic now. The data includes 100,000 sentences with more than 7 million letters, so the result should be fairly accurate:
Image
The top bigrams are:
317772 k
283502 e
263264 i
247343 ar
219675 ha
206174 a
172302 m
157428 s
151471 ka
148279 r
138969 h
137266 am
130157 ai
129368 sa
128848 ra
114999 an
113963 m
113888 a
113686 p
102095 hi
101898 ah
100305 b
(single letters do have a space before or after them)
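A minimal sketch of a bigram count that treats the space as a character, which is how entries such as "k" plus a space end up in a list like the one above. It assumes that the text is lowercased, that runs of whitespace count as one space, and that bigrams do not cross line breaks; the tool that produced the numbers above may differ in these details.

```python
from collections import Counter


def count_bigrams(lines):
    """Count all pairs of neighbouring characters, spaces included."""
    counts = Counter()
    for line in lines:
        # Lowercase and collapse whitespace, dropping leading/trailing blanks
        text = " ".join(line.lower().split())
        counts.update(a + b for a, b in zip(text, text[1:]))
    return counts


if __name__ == "__main__":
    import sys
    # Hypothetical usage: python3 bigrams.py hindiromanIASTconverted2020.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8") as f:
            for bigram, n in count_bigrams(f).most_common(30):
                print(n, repr(bigram))
```

Printing with `repr` keeps the space visible, so " k" and "k " can be told apart.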

The high counts of aa and ee are now gone.

The first optimization result looks good so far:
Image
77% hand alternation for OptHin in comparison to 54% for Qwerty.
1.5% same finger usage vs. 8.7%
8% adjacent vs. 16%
9.5% total pinky usage vs. 23.8%
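To make the first two metrics more concrete, here is a simplified sketch that computes hand alternation and same-finger usage from bigram counts for standard QWERTY touch typing. The metric definitions are assumptions (only letter-letter bigrams are counted, and a doubled letter does not count as a same-finger collision), so the output will not reproduce the optimizer's percentages above.

```python
# Standard touch-typing finger assignment for the QWERTY letter keys
QWERTY_FINGERS = {
    "L-pinky": "qaz", "L-ring": "wsx", "L-middle": "edc", "L-index": "rfvtgb",
    "R-index": "yhnujm", "R-middle": "ik", "R-ring": "ol", "R-pinky": "p",
}


def layout_metrics(bigram_counts, fingers):
    """Return hand alternation and same-finger shares for a layout."""
    key_to_finger = {key: finger for finger, keys in fingers.items() for key in keys}
    total = alternating = same_finger = 0
    for bigram, n in bigram_counts.items():
        if len(bigram) != 2:
            continue
        first, second = bigram
        if first not in key_to_finger or second not in key_to_finger:
            continue  # skip bigrams containing spaces or punctuation
        total += n
        f1, f2 = key_to_finger[first], key_to_finger[second]
        if f1[0] != f2[0]:  # "L" vs "R": the hand changes
            alternating += n
        elif f1 == f2 and first != second:  # same finger, different key
            same_finger += n
    if total == 0:
        return {"hand_alternation": 0.0, "same_finger": 0.0}
    return {"hand_alternation": alternating / total, "same_finger": same_finger / total}


if __name__ == "__main__":
    # A few letter-letter bigram counts from the list above, for illustration only
    sample = {"ar": 247343, "ha": 219675, "ka": 151471, "am": 137266, "ai": 130157}
    print(layout_metrics(sample, QWERTY_FINGERS))
```

Another layout can be checked by passing a different finger table with the same structure.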

Re: Hindi Roman IME - हिंदी रोमन IME - text corpora

Post by hurrdudd »

I am just curious how were you able to represent the 44 devanagari letters with 26 lowercase roman letters. Did you encode effect of shift separately? If yes, then why does it not appear in the frequency graph? Is it not accounted for?
Also, I hope you assigned a unique character to each IAST letter. Otherwise you will get the same problem as roman readable representation.

Re: Hindi Roman IME - हिंदी रोमन IME - text corpora

Post by Optilon »

hurrdudd wrote: Tue Sep 15, 2020 7:06 pm I am just curious how were you able to represent the 44 devanagari letters with 26 lowercase roman letters. Did you encode effect of shift separately? If yes, then why does it not appear in the frequency graph? Is it not accounted for?
Also, I hope you assigned a unique character to each IAST letter. Otherwise you will get the same problem as roman readable representation.
No, this is the Roman IME thread. For Devanagari see: viewtopic.php?f=13&t=14
I will soon have to check whether the shifted level for Devanagari is always the same. Then I can run an optimization for Devanagari.
Each IAST letter has a unique character from a-z.

Re: Hindi Roman IME - हिंदी रोमन IME - text corpora

Post by hurrdudd »

Optilon wrote: Tue Sep 15, 2020 7:34 pm No, this is the Roman IME thread. For Devanagari see: viewtopic.php?f=13&t=14
Ah, sorry.

Re: Hindi Roman IME - हिंदी रोमन IME - text corpora

Post by Optilon »

@hurrdudd
Do you know if there is any scientific data / data collection about how often the Roman IME and Devanagari are used in India in comparison?

I am currently learning about Devanagari. Maybe I can include it in the OptIN optimization, but for that I would have to assign the letters a-z and the shifted level A-Z to the characters. I still have to check whether that is possible. Otherwise, I can only optimize it independently.