Hindi Devanagari - हिंदी देवनागरी - text corpora

Post by **Optilon** » Thu Sep 10, 2020 3:54 pm

For Hindi Devanagari, I used the following text corpora from the Leipzig Corpora Collection (uni-leipzig):
https://wortschatz.uni-leipzig.de/en/download/hindi
hin_mixed_2019_30K
hin-in_web_2015_30K
hin_news_2011_10K
hin_newscrawl_2016_10K
hin_newscrawl_2017_10K
hin_wikipedia_2016_10K
---
total: 100K
---
Letter frequency chart (all letters with more than 0.1% frequency):

Attachment: hindi-devanagiri-characterfrequency.png

Letter frequency table:

Attachment: hindi-devanagiri-characterfrequency.ods
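
For anyone who wants to reproduce a table like this, here is a minimal sketch of a per-character count. It is not the script actually used for the attachments. It assumes each corpus line has the form `<id><TAB><sentence>` and that whitespace is left out of the count; the function name `char_frequencies` and the `min_percent` parameter are made up for illustration.

```python
from collections import Counter


def char_frequencies(lines, min_percent=0.1):
    """Return (character, percent) pairs, most frequent first.

    Assumption: each line looks like "<id>\t<sentence>". Lines without a
    tab are counted as a whole. Whitespace is not counted.
    """
    counts = Counter()
    for line in lines:
        # Keep only the sentence part after the first tab, if there is one.
        sentence = line.rstrip("\n").split("\t", 1)[-1]
        # Iterating over a str yields one Unicode code point at a time,
        # so vowel signs and other marks are counted on their own.
        counts.update(ch for ch in sentence if not ch.isspace())

    total = sum(counts.values())
    if total == 0:
        return []

    table = []
    for ch, n in counts.most_common():
        percent = 100.0 * n / total
        if percent >= min_percent:
            table.append((ch, percent))
    return table


if __name__ == "__main__":
    sample = ["1\tकाम", "2\tनाम"]
    for ch, percent in char_frequencies(sample):
        print(f"{percent:5.2f} % U+{ord(ch):04X}")
```

To run it on a real corpus, pass an open file (opened with `encoding="utf-8"`) as `lines`; the file name depends on the archive you downloaded.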

Post by **Optilon** » Thu Sep 10, 2020 3:56 pm

@hurrdudd
I now understand why you said it is a bad idea to put diacritics on a third or fourth layer. 13 of the 54 most frequently used symbols are diacritics. I'm surprised the keyboard optimizer was able to break the text corpora down far enough to count the diacritics separately.

This is the frequency table of all characters with a usage of >0.01%. The chart includes only characters with >0.1%:
8,52 % ा
6,90 % क
6,42 % र
6,21 % े
4,35 % ्
4,12 % न
4,09 % ी
4,01 % स
3,86 % ि
3,75 % ं
3,73 % ह
3,25 % त
3,23 % म
2,70 % ल
2,69 % ो
2,47 % प
2,32 % य
1,92 % व
1,86 % द
1,71 % ज
1,61 % ब
1,48 % ग
1,42 % ै
1,32 % ु
1,07 % ।
0,89 % श
0,89 % ट
0,85 % ए
0,82 % च
0,79 % अ
0,76 % भ
0,65 % ू
0,62 % ड
0,61 % थ
0,61 % आ
0,58 % इ
0,55 % ,
0,53 % ख
0,53 % उ
0,49 % ध
0,41 % ष
0,40 % फ
0,38 % औ
0,37 % ई
0,36 % .
0,36 % ़
0,29 % ण
0,21 % छ
0,20 % -
0,19 % ौ
0,13 % ठ
0,13 % घ
0,12 % ओ
0,12 % ॉ
0,09 % ृ
0,09 % ढ
0,08 % झ
0,08 % ँ
0,08 % '
0,08 % :
0,07 % )
0,07 % (
0,07 % ड़
0,06 % ऐ
0,04 % "
0,03 % ‘
0,03 % ’
0,03 % ?
0,03 % ञ
0,03 % ऊ
0,03 % ऑ
0,02 % !
0,01 % ढ़
0,01 % ज़
0,01 % ः
0,01 % /

Post by **hurrdudd** » Fri Sep 11, 2020 8:46 pm

Optilon wrote: ↑Thu Sep 10, 2020 3:56 pm
@hurrdudd
I now understand why you said it is a bad idea to put diacritics on a third or fourth layer. 13 of the 54 most frequently used symbols are diacritics. I'm surprised the keyboard optimizer was able to break the text corpora down far enough to count the diacritics separately.

This looks reasonably correct. Each diacritic is assigned its own Unicode code point, so the optimizer probably just counted each Unicode character separately.
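
To illustrate the point about code points, here is a small sketch using Python's standard `unicodedata` module. It splits the word हिंदी into its code points and flags the combining marks (general categories `Mn` and `Mc`). The helper names `split_code_points` and `is_combining_mark` are made up for this example, and treating `Mn`/`Mc` as "diacritic" is an assumption, not necessarily how the optimizer classifies them.

```python
import unicodedata


def is_combining_mark(ch):
    """True for non-spacing (Mn) and spacing (Mc) combining marks."""
    return unicodedata.category(ch) in ("Mn", "Mc")


def split_code_points(text):
    """Return (code point label, general category, is_mark) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), is_combining_mark(ch))
        for ch in text
    ]


# The word "हिंदी", written with escapes so the code points are explicit.
word = "\u0939\u093f\u0902\u0926\u0940"

if __name__ == "__main__":
    for label, category, mark in split_code_points(word):
        kind = "mark" if mark else "base"
        print(label, category, kind)
```

The word is rendered as two visual clusters but is stored as five code points, three of which are marks. A counter that walks the text one code point at a time will therefore list the vowel signs and the anusvara as separate entries, just as in the table.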
