Keyboard Layouts and Optin Keyboard Layout

Posted: **Thu Sep 10, 2020 3:54 pm**

For Hindi in Devanagari script, I used the following text corpora from uni-leipzig:
https://wortschatz.uni-leipzig.de/en/download/hindi
hin_mixed_2019_30K
hin-in_web_2015_30K
hin_news_2011_10K
hin_newscrawl_2016_10K
hin_newscrawl_2017_10K
hin_wikipedia_2016_10K
---
total: 100K
---
Letter frequency chart (all letters with more than 0.1% frequency):

Attachment: hindi-devanagiri-characterfrequency.png (150.56 KiB)

Letter frequency table:

Attachment: hindi-devanagiri-characterfrequency.ods (57.06 KiB)

Posted: **Thu Sep 10, 2020 3:56 pm**

@hurrdudd
I now understand why you said it is a bad idea to put diacritics on a third or fourth layer: 13 of the 54 most frequently used symbols are diacritics. I'm surprised the keyboard optimizer was able to take the text corpora apart and even count the diacritics separately.

This is the frequency table of all characters with usage of >0.01%. The graph includes only characters with >0.1%:
8,52 % ा
6,90 % क
6,42 % र
6,21 % े
4,35 % ्
4,12 % न
4,09 % ी
4,01 % स
3,86 % ि
3,75 % ं
3,73 % ह
3,25 % त
3,23 % म
2,70 % ल
2,69 % ो
2,47 % प
2,32 % य
1,92 % व
1,86 % द
1,71 % ज
1,61 % ब
1,48 % ग
1,42 % ै
1,32 % ु
1,07 % ।
0,89 % श
0,89 % ट
0,85 % ए
0,82 % च
0,79 % अ
0,76 % भ
0,65 % ू
0,62 % ड
0,61 % थ
0,61 % आ
0,58 % इ
0,55 % ,
0,53 % ख
0,53 % उ
0,49 % ध
0,41 % ष
0,40 % फ
0,38 % औ
0,37 % ई
0,36 % .
0,36 % ़
0,29 % ण
0,21 % छ
0,20 % -
0,19 % ौ
0,13 % ठ
0,13 % घ
0,12 % ओ
0,12 % ॉ
0,09 % ृ
0,09 % ढ
0,08 % झ
0,08 % ँ
0,08 % '
0,08 % :
0,07 % )
0,07 % (
0,07 % ड़
0,06 % ऐ
0,04 % "
0,03 % ‘
0,03 % ’
0,03 % ?
0,03 % ञ
0,03 % ऊ
0,03 % ऑ
0,02 % !
0,01 % ढ़
0,01 % ज़
0,01 % ः
0,01 % /
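
The "13 of the 54" figure can be checked directly against the table above. This is only a rough sketch: it assumes that "diacritic" means a character whose Unicode general category is a mark (Mn, Mc or Me), which is my own working definition and not necessarily what the optimizer uses. `TOP_54` and `is_combining_mark` are names made up for the example; the characters are the 54 table entries above 0.1%, in table order.

```python
import unicodedata

# The 54 characters from the table above with a frequency above 0.1 %,
# in table order (space-separated so the combining marks stay readable).
TOP_54 = (
    "ा क र े ् न ी स ि ं ह त म ल ो प य व द ज ब ग ै ु । श ट ए च अ भ ू ड थ "
    "आ इ , ख उ ध ष फ औ ई . ़ ण छ - ौ ठ घ ओ ॉ"
).split()


def is_combining_mark(ch: str) -> bool:
    """Assumption: a 'diacritic' is any character in category Mn, Mc or Me."""
    return unicodedata.category(ch).startswith("M")


marks = [ch for ch in TOP_54 if is_combining_mark(ch)]

print(f"{len(marks)} of {len(TOP_54)} characters are combining marks")
for ch in marks:
    # Print the code point and official name instead of the bare mark,
    # because a lone combining mark is hard to read in a terminal.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

Under that definition the vowel signs, the virama, the anusvara and the nukta all count as marks, while the danda (।) and the ASCII punctuation do not.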

Posted: **Fri Sep 11, 2020 8:46 pm**

Optilon wrote: ↑Thu Sep 10, 2020 3:56 pm
@hurrdudd
I now understand why you said it is a bad idea to put diacritics on a third or fourth layer: 13 of the 54 most frequently used symbols are diacritics. I'm surprised the keyboard optimizer was able to take the text corpora apart and even count the diacritics separately.


This looks reasonably correct. Each diacritic is assigned its own Unicode code point, so the optimizer probably just counted each Unicode character separately.
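
To illustrate what counting per code point looks like, here is a minimal sketch. It is not the optimizer's actual code, just a guess at the simplest approach that would produce a table like the one quoted. The function name `codepoint_frequencies`, the decision to skip whitespace, and the sample string (taken from the thread title) are all assumptions made for the example.

```python
from collections import Counter
import unicodedata


def codepoint_frequencies(text: str) -> list[tuple[str, float]]:
    """Return (character, percentage) pairs, most frequent first.

    Iterating over a Python str yields one Unicode code point at a time,
    so vowel signs and other combining marks are counted on their own.
    Whitespace is skipped (an assumption made for this sketch).
    """
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    return [(ch, 100.0 * n / total) for ch, n in counts.most_common()]


sample = "हिंदी देवनागरी"
for ch, pct in codepoint_frequencies(sample):
    print(f"{pct:5.2f} %  U+{ord(ch):04X}  {unicodedata.name(ch)}")
```

Although the sample reads as two words made of a handful of syllables, it consists of 13 non-space code points, and the vowel signs show up as separate rows just like in the table.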

Hindi Devanagiri - हिंदी देवनागरी - text corpora
