Arabic / اَللُّغَةُ اَلْعَرَبِيَّة - text corpora
Posted: Fri Sep 18, 2020 4:38 pm
For Arabic, I used the following text corpora from uni-leipzig:
https://wortschatz.uni-leipzig.de/en/download/arabic
ara_news_2016_10K
ara_news_2017_10K
ara_wikipedia_2012_10K
ara_wikipedia_2016_10K
ara-ae_web_2017_10K
ara-eg_web_2015_10K
ara-ps_newscrawl_2012_10K
ara-sy_newscrawl_2012_10K
ara-tn_newscrawl_2012_10K
https://wortschatz.uni-leipzig.de/en/do ... ian-arabic:
arz_wikipedia_2016_10K
---
total: 100K
---
Arabic: Arabic 100k - 10 files
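For anyone who wants to rebuild the combined 100K file, here is a minimal sketch of merging the ten downloads. It assumes each archive contains a "-sentences.txt" file whose lines look like "id<TAB>sentence"; that layout, and the names sentence_of and merge_corpora, are my assumptions and not something stated in this post.

```python
def sentence_of(line: str) -> str:
    """Drop a leading numeric id and tab, if the line has one (assumed format)."""
    stripped = line.rstrip("\n")
    head, sep, rest = stripped.partition("\t")
    return rest if sep and head.strip().isdigit() else stripped


def merge_corpora(paths, out_path):
    """Concatenate the sentences of several corpus files into one text file.

    Returns the number of sentences written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    sentence = sentence_of(line)
                    if sentence:  # skip empty lines
                        out.write(sentence + "\n")
                        count += 1
    return count
```

With ten 10K files, merge_corpora should report roughly 100,000 sentences, matching the total above.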
---
Character frequency with symbols: ---
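A character frequency table like this one can be reproduced with a few lines of Python. This is only a sketch of the general idea, not the tool that produced the table; whether whitespace is counted is an assumption exposed through the skip_whitespace parameter.

```python
from collections import Counter


def char_frequencies(text, skip_whitespace=True):
    """Return (character, count, percent) tuples, most frequent first."""
    chars = [c for c in text if not (skip_whitespace and c.isspace())]
    counts = Counter(chars)
    total = sum(counts.values())
    # Sort by descending count, then by character so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(ch, n, 100.0 * n / total) for ch, n in ranked]


if __name__ == "__main__":
    sample = "السلام عليكم"
    for ch, n, pct in char_frequencies(sample):
        print(f"{ch}\t{n}\t{pct:.2f}%")
```

Running it on the merged corpus instead of the short sample gives the ranking that the optimization is based on.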
Arabic optimized with optS1V1.cfg: ---
Maximum possible hand alternation 63.60% (approximate, for 26 most frequent letters): ---
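The post does not say how the 63.60% figure was obtained. As a rough illustration of what "hand alternation" measures, the sketch below counts letter bigrams and reports the share that switches hands for a given left/right split, then tries to improve the split with a simple local search. The function names are hypothetical, the search is a heuristic that may stop at a local optimum, and it ignores how many keys each hand can actually hold.

```python
from collections import Counter


def bigram_counts(text, letters):
    """Count adjacent character pairs where both characters are in `letters`."""
    counts = Counter()
    for a, b in zip(text, text[1:]):
        if a in letters and b in letters:
            counts[(a, b)] += 1
    return counts


def alternation_rate(bigrams, left):
    """Share of bigrams whose two letters are typed by different hands."""
    total = sum(bigrams.values())
    if total == 0:
        return 0.0
    alternating = sum(
        n for (a, b), n in bigrams.items() if (a in left) != (b in left)
    )
    return alternating / total


def greedy_split(bigrams, letters):
    """Move one letter at a time to the other hand while alternation improves."""
    ordered = sorted(letters)
    left = set(ordered[::2])  # arbitrary starting split
    best = alternation_rate(bigrams, left)
    improved = True
    while improved:
        improved = False
        for ch in ordered:
            candidate = left ^ {ch}  # flip this letter to the other hand
            rate = alternation_rate(bigrams, candidate)
            if rate > best:
                left, best, improved = candidate, rate, True
    return left, best
```

To mimic the restriction mentioned above, pass only the 26 most frequent letters as `letters`.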
Transliteration chart for Arabic: (For letters with a green background, I'm very sure that the transliteration is correct. I'm not sure about the ones with an orange background; their transliteration was chosen so that the most frequent Arabic letters correspond to the most frequently used letters in Latin alphabets, with vowel-like letters preferentially assigned to vowels and consonant-like letters to consonants. Letters beginning with a lowercase "s" are typed with Shift + letter.) ---
Transliteration tool for Arabic (ا|ل|ي|م|و|ن|ر|ت|ب|ع|ة|د|ف|ه|س|ق|ك|ح|أ|ج|ى|ش|ط|ص|خ|ض|إ|ز|ذ|ث|ئ|غ|ء|ظ|ؤ): ---
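A one-to-one transliteration of this kind can be sketched with Python's str.translate. The mapping below is a small, hypothetical example covering only a few letters; it is NOT the chart from this post and should be replaced with the real assignments.

```python
# Hypothetical, partial mapping for illustration only (not the chart above).
EXAMPLE_MAP = {
    "ا": "a", "ل": "l", "ي": "i", "م": "m", "و": "u",
    "ن": "n", "ر": "r", "ت": "t", "ب": "b",
}


def transliterate(text, mapping=EXAMPLE_MAP):
    """Replace each mapped Arabic letter; leave everything else unchanged."""
    return text.translate(str.maketrans(mapping))


def unmapped_characters(text, mapping=EXAMPLE_MAP):
    """List characters from the Arabic Unicode block that the mapping misses."""
    return sorted(
        {c for c in text if "\u0600" <= c <= "\u06FF" and c not in mapping}
    )
```

Checking unmapped_characters on the whole corpus is a quick way to see which letters or diacritics still need an entry before running the optimizer on the romanized text.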
Conversion to all lowercase characters: viewtopic.php?f=12&t=8 ---
Arabic romanized with diacritics and symbols: ---
I uploaded the configuration files that I used for the optimization so that the optimization can be reproduced later:
Link: viewtopic.php?f=12&t=20
---
./opt -2 arabic2020.txt -i 20000 -K optS1V1.cfg
./opt -2 arabicroman2020.txt -i 20000 -K optS1V1.cfg
echo ARABIC:;./opt -2 arabicroman2020.txt -r bsptast.txt -K controlS1V1.cfg;
First optimization result:
qwertyuiop■☻asdfghjkl▓█▒░zxcvbnm,. QWERTY
▓,.pyfgcrl■☻aoeuidhtns█▒░qjkxbmwvz DVORAK-EN
wflcgqku,y■☻rsntdoeaih█▒░vmpbzx.j▓ optLAT-S1-V1
wqrdfgockv■☻snlmbeuait█▒░j.zpxh,y▓ optAR-S1-V1
▓pfocqdeg█■☻bmautrlinsv▒zyjw.h,kx░ ar-lulua
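To read these rows, it helps to know that each layout is a string of 34 key positions using the same set of symbols. The sketch below copies the five strings from above, checks that each is a permutation of the QWERTY row, and counts how many positions differ from QWERTY. The helper names are hypothetical, and the meaning of the block symbols (■ ☻ ▓ █ ▒ ░) is not explained in this post, so the code treats them as opaque placeholders.

```python
LAYOUTS = {
    "QWERTY": "qwertyuiop■☻asdfghjkl▓█▒░zxcvbnm,.",
    "DVORAK-EN": "▓,.pyfgcrl■☻aoeuidhtns█▒░qjkxbmwvz",
    "optLAT-S1-V1": "wflcgqku,y■☻rsntdoeaih█▒░vmpbzx.j▓",
    "optAR-S1-V1": "wqrdfgockv■☻snlmbeuait█▒░j.zpxh,y▓",
    "ar-lulua": "▓pfocqdeg█■☻bmautrlinsv▒zyjw.h,kx░",
}


def is_permutation_of(reference, layout):
    """True if both strings contain exactly the same symbols."""
    return sorted(reference) == sorted(layout)


def moved_keys(reference, layout):
    """Number of positions whose symbol differs from the reference layout."""
    return sum(1 for a, b in zip(reference, layout) if a != b)


if __name__ == "__main__":
    qwerty = LAYOUTS["QWERTY"]
    for name, layout in LAYOUTS.items():
        ok = is_permutation_of(qwerty, layout)
        print(f"{name}: permutation={ok}, moved={moved_keys(qwerty, layout)}")
```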