https://wortschatz.uni-leipzig.de/en/download/bengali
ben_newscrawl_2011_10K
ben_newscrawl_2017_30K
ben_wikipedia_2011_10K
ben_wikipedia_2016_10K
ben-bd_web_2014_10K
ben-bd_web_2017_30K
---
total: 100K
---
For conversion, multiple steps are necessary:
1. Aksharamukha conversion tool, preferably the Python script: conversion from Bengali (with schwa deletion) to Roman IAST.
2. Conversion script to replace ā|æ|ç|ḍ|ĕ|è|ġ|ḥ|ī|ï|ṃ|ṇ|ṉ|ṅ|ñ|Ñ|Ó|ô|ŏ|ṛ|ṟ|ṝ|ś|ṣ|ṭ|ū|ü|ẏ with plain ASCII letters.
3. sed commands in the terminal (Linux).
...
1. Aksharamukha conversion tool (source: https://aksharamukha.appspot.com/):
Code:
#!/usr/bin/python3
# coding: utf-8
import sys
import argparse

from aksharamukha import transliterate


def transliterateRoman(s):
    # Bengali script -> IAST, using the SchwaFinalBengali pre-option
    return transliterate.process('Bengali', 'IAST', s, pre_options=['SchwaFinalBengali'])


def batch_transliterate(dev_file, rr_file):
    # Convert line by line so a large corpus does not have to fit in memory
    for line in dev_file:
        rr_file.write(transliterateRoman(line))
    return 0


if __name__ == '__main__':
    ap = argparse.ArgumentParser()
    # Both files are opened as UTF-8; stdin/stdout are used when no file is given
    ap.add_argument('devanagari_file', nargs='?', type=argparse.FileType('r', encoding='utf-8'), default=sys.stdin)
    ap.add_argument('romanIAST_file', nargs='?', type=argparse.FileType('w', encoding='utf-8'), default=sys.stdout)
    args = ap.parse_args()
    sys.exit(batch_transliterate(args.devanagari_file, args.romanIAST_file))
Run the script with:
Code:
python3 convert_bengali_roman.py bengali2020.txt bengaliroman2020.txt
2. Conversion script (PowerShell) to replace (ā|æ|ç|ḍ|ĕ|è|ġ|ḥ|ī|ï|ṃ|ṇ|ṉ|ṅ|ñ|Ñ|Ó|ô|ŏ|ṛ|ṟ|ṝ|ś|ṣ|ṭ|ū|ü|ẏ) with plain ASCII letters:
Code:
$rules = New-Object System.Collections.Hashtable
$rules.ā = "a"
$rules.æ = "ae"
$rules.ç = "c"
$rules.ḍ = "d"
$rules.ĕ = "e"
$rules.è = "e"
$rules.ġ = "g"
$rules.ḥ = "h"
$rules.ī = "i"
$rules.ï = "i"
$rules.ṃ = "m"
$rules.ṇ = "n"
$rules.ṉ = "n"
$rules.ṅ = "n"
$rules.ñ = "n"
$rules.Ñ = "n"
$rules.Ó = "O"
$rules.ô = "o"
$rules.ŏ = "o"
$rules.ṛ = "r"
$rules.ṟ = "r"
$rules.ṝ = "r"
$rules.ś = "s"
$rules.ṣ = "s"
$rules.ṭ = "t"
$rules.ū = "u"
$rules.ü = "u"
$rules.ẏ = "y"
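# -Raw reads the file as a single string, so line breaks survive the replacement
$file = Get-Content -Path $args[0] -Raw
$a = [regex]'(ā|æ|ç|ḍ|ĕ|è|ġ|ḥ|ī|ï|ṃ|ṇ|ṉ|ṅ|ñ|Ñ|Ó|ô|ŏ|ṛ|ṟ|ṝ|ś|ṣ|ṭ|ū|ü|ẏ)'
# The callback looks up the matched letter in $rules
$a_cb = {$rules[$args[0].Groups[1].Value]}
Write-Progress -Activity "a" -PercentComplete 4
$file = $a.Replace($file, $a_cb)
# The output is written as <input file>.converted.txt
Set-Content -Path "$args.converted.txt" -Value $file -NoNewline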
echo "Finished"
In the terminal, start PowerShell and run the script:
Code:
pwsh
./IASTtoclear.ps1 bengaliroman2020.txt
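For anyone without PowerShell, the same replacement can probably be done in Python with str.translate. This is only a sketch of the mapping from the $rules table above, not the script that was used here; the names RULES, IAST_TO_ASCII and to_ascii are made up for illustration. Keys and input are normalized to NFC first, on the assumption that the letters may otherwise arrive as base letter plus combining mark.

```python
import unicodedata

# Mapping copied from the PowerShell $rules table in step 2.
RULES = {
    "ā": "a", "æ": "ae", "ç": "c", "ḍ": "d", "ĕ": "e", "è": "e",
    "ġ": "g", "ḥ": "h", "ī": "i", "ï": "i", "ṃ": "m", "ṇ": "n",
    "ṉ": "n", "ṅ": "n", "ñ": "n", "Ñ": "n", "Ó": "O", "ô": "o",
    "ŏ": "o", "ṛ": "r", "ṟ": "r", "ṝ": "r", "ś": "s", "ṣ": "s",
    "ṭ": "t", "ū": "u", "ü": "u", "ẏ": "y",
}

# str.maketrans needs single-code-point keys, so compose each key (NFC) first.
IAST_TO_ASCII = str.maketrans(
    {unicodedata.normalize("NFC", key): value for key, value in RULES.items()}
)


def to_ascii(text):
    """Replace each precomposed IAST letter with its plain counterpart."""
    # Compose the input too, so "a" + combining macron is seen as "ā".
    return unicodedata.normalize("NFC", text).translate(IAST_TO_ASCII)
```

Calling to_ascii on each line of bengaliroman2020.txt and writing the result to a new file should then correspond to what the PowerShell script does.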
---
3. sed commands in the terminal (Linux) to remove some combining diacritics that the simple conversion script from step 2 cannot remove. There is still one diacritic sign I was not able to replace; it looks like [_]. The file name below assumes the output of step 2 was renamed to bengaliromanconverted2020.txt:
Code:
sed -i 's/̤//g' bengaliromanconverted2020.txt
sed -i 's/̐//g' bengaliromanconverted2020.txt
sed -i 's/̈//g' bengaliromanconverted2020.txt
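The leftover signs in step 3 are Unicode combining characters, which is probably why a letter-by-letter table misses them. A generic way to drop all of them at once is to decompose the text (NFD) and discard every code point in the Unicode category Mn (nonspacing mark). The following is a Python sketch of that idea, not what was actually run here, and strip_combining_marks is a hypothetical helper name. It may also catch the remaining sign mentioned above, but I have not verified that.

```python
import unicodedata


def strip_combining_marks(text):
    """Decompose text (NFD) and drop all nonspacing combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers combining diacritics such as macrons and dots below.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```

Letters without a decomposition, such as æ, pass through unchanged, so a small replacement table like the one in step 2 is still needed for those.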
Letter frequency chart (all alphabetic characters + symbolic characters with >0.1% frequency):
First optimization result for 100% Bengali:
And for 50% Bengali and 50% English: