Bengali Roman IME / বাঙালি রোমান আইএমই - text corpora

Post by Optilon »

For Bengali, I used the following text corpora from the Leipzig Corpora Collection (uni-leipzig):
https://wortschatz.uni-leipzig.de/en/download/bengali
ben_newscrawl_2011_10K
ben_newscrawl_2017_30K
ben_wikipedia_2011_10K
ben_wikipedia_2016_10K
ben-bd_web_2014_10K
ben-bd_web_2017_30K
---
total: 100K sentences
---
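The conversion steps further down read a single input file (bengali2020.txt), so the six downloads have to be merged first. A minimal sketch of that merge, assuming each archive was extracted into a folder of the same name and contains a `<name>-sentences.txt` file with one `id<TAB>sentence` pair per line (the folder layout, the file naming and both function names are assumptions, not something stated in this post):

```python
from pathlib import Path

# Corpus names exactly as listed in the post.
CORPORA = [
    "ben_newscrawl_2011_10K",
    "ben_newscrawl_2017_30K",
    "ben_wikipedia_2011_10K",
    "ben_wikipedia_2016_10K",
    "ben-bd_web_2014_10K",
    "ben-bd_web_2017_30K",
]


def strip_sentence_id(line: str) -> str:
    """Drop a leading numeric 'id<TAB>' column if present; return the sentence only."""
    text = line.rstrip("\n")
    head, sep, rest = text.partition("\t")
    return rest if sep and head.strip().isdigit() else text


def merge_corpora(root: Path, out_path: Path) -> int:
    """Write every sentence of every corpus to out_path; return the sentence count."""
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for name in CORPORA:
            # Assumed location of the extracted sentence file (hypothetical layout).
            src = root / name / f"{name}-sentences.txt"
            with src.open(encoding="utf-8") as fh:
                for line in fh:
                    out.write(strip_sentence_id(line) + "\n")
                    count += 1
    return count
```

Usage would be something like `merge_corpora(Path("."), Path("bengali2020.txt"))`; if the files are laid out differently, only the `src` line needs to change.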
Conversion takes several steps:
1. Aksharamukha conversion tool, preferably via the Python script: conversion from Bengali (with schwa deletion) to Roman IAST.
2. Conversion script to replace ā|æ|ç|ḍ|ĕ|è|ġ|ḥ|ī|ï|ṃ|ṇ|ṉ|ṅ|ñ|Ñ|Ó|ô|ŏ|ṛ|ṟ|ṝ|ś|ṣ|ṭ|ū|ü|ẏ with normal letters.
3. sed commands in the terminal (Linux) to remove the combining diacritics that are left over.
...
1. Aksharamukha conversion tool (source: https://aksharamukha.appspot.com/):

Code:

#!/usr/bin/python3
# coding: utf-8
import argparse
import sys

from aksharamukha import transliterate

def transliterateRoman(s):
    # Bengali script -> IAST; 'SchwaFinalBengali' is the pre-option used here for schwa deletion
    return transliterate.process('Bengali', 'IAST', s, pre_options=['SchwaFinalBengali'])

def batch_transliterate(bengali_file, rr_file):
    # Convert line by line so the whole corpus never has to sit in memory
    for line in bengali_file:
        rr_file.write(transliterateRoman(line))
    return 0

if __name__ == '__main__':
    ap = argparse.ArgumentParser()
    # Read and write UTF-8 explicitly instead of relying on the system locale
    ap.add_argument('bengali_file', nargs='?', type=argparse.FileType('r', encoding='utf-8'), default=sys.stdin)
    ap.add_argument('romanIAST_file', nargs='?', type=argparse.FileType('w', encoding='utf-8'), default=sys.stdout)
    args = ap.parse_args()

    sys.exit(batch_transliterate(args.bengali_file, args.romanIAST_file))
Save the file as convert_bengali_roman.py and run the script with:

Code:

python3 convert_bengali_roman.py bengali2020.txt bengaliroman2020.txt
---
2. Conversion script to replace (ā|æ|ç|ḍ|ĕ|è|ġ|ḥ|ī|ï|ṃ|ṇ|ṉ|ṅ|ñ|Ñ|Ó|ô|ŏ|ṛ|ṟ|ṝ|ś|ṣ|ṭ|ū|ü|ẏ) with normal letters:

Code:

$rules = New-Object System.Collections.Hashtable
$rules.ā  = "a"
$rules.æ  = "ae"
$rules.ç  = "c"
$rules.ḍ  = "d"
$rules.ĕ  = "e"
$rules.è  = "e"
$rules.ġ  = "g"
$rules.ḥ  = "h"
$rules.ī  = "i"
$rules.ï  = "i"
$rules.ṃ  = "m"
$rules.ṇ  = "n"
$rules.ṉ  = "n"
$rules.ṅ  = "n"
$rules.ñ  = "n"
$rules.Ñ  = "n"
$rules.Ó  = "O"
$rules.ô  = "o"
$rules.ŏ  = "o"
$rules.ṛ  = "r"
$rules.ṟ  = "r"
$rules.ṝ  = "r"
$rules.ś  = "s"
$rules.ṣ  = "s"
$rules.ṭ  = "t"
$rules.ū  = "u"
$rules.ü  = "u"
$rules.ẏ  = "y"

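# Read the whole file as one string (-Raw) so line breaks survive the replacement
$file  = Get-Content -Path $args[0] -Raw -Encoding utf8
$a   = [regex]'(ā|æ|ç|ḍ|ĕ|è|ġ|ḥ|ī|ï|ṃ|ṇ|ṉ|ṅ|ñ|Ñ|Ó|ô|ŏ|ṛ|ṟ|ṝ|ś|ṣ|ṭ|ū|ü|ẏ)'

# Callback: look up the matched character in the case-sensitive $rules table
$a_cb = {$rules[$args[0].Groups[1].Value]}

Write-Progress -Activity "a" -PercentComplete 4
$file = $a.Replace($file, $a_cb)

# Write the result to <input name>.converted.txt
Set-Content -Path "$($args[0]).converted.txt" -Value $file -NoNewline -Encoding utf8
echo "Finished"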

Save the file as IASTtoclear.ps1, then start PowerShell in the terminal:

Code:

pwsh
then

Code:

./IASTtoclear.ps1 bengaliroman2020.txt
Then rename bengaliroman2020.txt.converted.txt to bengaliromanconverted2020.txt.
---
3. sed commands in the terminal (Linux) to remove some diacritics that cannot be removed with the simple conversion script from step 2 (there is still one diacritic sign I was not able to replace; it looks like [_]):

Code:

sed -i 's/̤//g' bengaliromanconverted2020.txt
sed -i 's/̐//g' bengaliromanconverted2020.txt
sed -i 's/̈//g' bengaliromanconverted2020.txt
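Each sed call above deletes one specific combining mark. A broader alternative, sketched here in Python under the assumption that every leftover sign is a Unicode combining mark (category Mn), is to drop all of them in one pass. The second helper lists whatever non-ASCII characters remain together with their code points, which may help to identify the one sign I could not replace. Both function names are hypothetical. Only run this on the romanized file; on Bengali script it would also strip vowel signs.

```python
import unicodedata


def strip_combining(text: str) -> str:
    """Remove every nonspacing combining mark (Unicode category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def leftover_non_ascii(text: str) -> dict:
    """Map each non-ASCII character in text to 'U+XXXX NAME' for inspection."""
    return {
        ch: f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
        for ch in sorted(set(text))
        if ord(ch) > 127
    }


if __name__ == "__main__":
    with open("bengaliromanconverted2020.txt", encoding="utf-8") as fh:
        text = fh.read()
    # Inspect first, then strip.
    for ch, label in leftover_non_ascii(text).items():
        print(repr(ch), label)
    with open("bengaliromanconverted2020.txt", "w", encoding="utf-8") as fh:
        fh.write(strip_combining(text))
```

Note that this does not decompose precomposed letters: anything like ā that slipped past step 2 stays as it is and shows up in the `leftover_non_ascii` listing instead.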
---
Letter frequency chart (all alphabetic characters + symbolic characters with >0.1% frequency):
Image: bengali-characterfrequency.png
Image: bengali-roman-characterfrequency.png
Attachment: characterfrequency.ods
Attachment: bengali-characterfrequency.ods
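The counting behind such a chart can be sketched in a few lines of Python. This is an illustration of the stated rule (all alphabetic characters, plus other characters above 0.1%), not the spreadsheet I used; ignoring whitespace and the function name `char_frequencies` are assumptions.

```python
from collections import Counter


def char_frequencies(text: str, min_symbol_share: float = 0.001) -> dict:
    """Return {char: share}, keeping letters and any other char above the threshold."""
    # Assumption: whitespace is not counted towards the total.
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {}
    shares = {ch: n / total for ch, n in counts.items()}
    kept = {
        ch: share
        for ch, share in shares.items()
        # Letters are always kept; everything else must exceed 0.1 %.
        if ch.isalpha() or share > min_symbol_share
    }
    # Most frequent first, as in a frequency chart.
    return dict(sorted(kept.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    with open("bengaliromanconverted2020.txt", encoding="utf-8") as fh:
        for ch, share in char_frequencies(fh.read()).items():
            print(f"{ch}\t{share:.4%}")
```

One caveat for the Bengali-script file: `str.isalpha()` is false for dependent vowel signs and similar marks, so those are kept only through the threshold rule.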
First optimization result for 100% Bengali:
Image: Bengali100.png
And for 50% Bengali and 50% English:
Image: BEN50EN50.png
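How a 50/50 input can be put together is not spelled out above. One plausible way, shown purely as an assumption, is to blend two normalized frequency tables with equal weights, so that corpus size does not decide which language dominates. The function name `mix_frequencies` and the table format `{char: share}` are hypothetical.

```python
def mix_frequencies(freq_a: dict, freq_b: dict, weight_a: float = 0.5) -> dict:
    """Blend two {char: share} tables; weight_a goes to freq_a, the rest to freq_b."""
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be between 0 and 1")
    weight_b = 1.0 - weight_a
    mixed = {}
    # A character missing from one table simply contributes 0 from that side.
    for ch in set(freq_a) | set(freq_b):
        mixed[ch] = weight_a * freq_a.get(ch, 0.0) + weight_b * freq_b.get(ch, 0.0)
    return mixed
```

If both input tables sum to 1, the result does too, so it can be fed to whatever consumes the single-language table.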