Mandarin / 官话 - text corpora

Post by Optilon »

For Mandarin Chinese, I used the following text corpora from the Leipzig Corpora Collection (uni-leipzig):
https://wortschatz.uni-leipzig.de/en/download/chinese
zho_news_2007-2009_30K
zho-cn_web_2015_30K
zho-simp-tw_web_2014_30K (appears to be from Taiwan; included anyway, since everything is converted to Pinyin)
zho-mo_web_2016_10K (appears to be from Macao; included as 10% of the total for lack of other corpora)
---
total: 100K
---
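A minimal sketch of how the sentences could be pulled out of the downloaded corpus files before conversion. It assumes each corpus ships a "-sentences.txt" file with one "id<TAB>sentence" entry per line; the function name extract_sentences is hypothetical and not part of any tool used above.

```python
from typing import Iterable, Iterator


def extract_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sentence part of 'id<TAB>sentence' lines.

    Assumption: the corpus file is tab-separated with a numeric id first.
    Lines without a tab are passed through unchanged.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        _, _, sentence = line.partition("\t")
        yield sentence if sentence else line


if __name__ == "__main__":
    sample = ["1\t你好。\n", "2\t再见。\n"]
    for s in extract_sentences(sample):
        print(s)
```

Concatenating the output for all corpora would give one text file per source, ready to paste into the Pinyin converter.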
Chinese simplified:
Chinese Simplified 100k - 5 files
---
Conversion tool:
https://www.fuhaoku.net/tool/pinyin.html
---
Result pinyin:
Chinese Mandarin converted to Pinyin 100k - 5 files
---
Conversion from Pinyin to clear Latin:
Tool (PowerShell): https://drive.google.com/file/d/1cUk4U6 ... sp=sharing
Result (clear Latin): https://drive.google.com/file/d/1ltvH0c ... sp=sharing
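For readers without PowerShell, here is a rough Python sketch of what such a conversion might look like. It is not the linked tool. It assumes "clear Latin" means Pinyin with the tone marks removed, and it also assumes ü is reduced to plain u (the linked tool may treat it differently). The function name strip_tone_marks is hypothetical.

```python
import unicodedata


def strip_tone_marks(text: str) -> str:
    """Remove combining diacritics from Pinyin, keeping the base letters.

    Assumption: ü (with or without a tone mark) becomes plain u.
    """
    # NFD splits each accented letter into base letter + combining marks,
    # e.g. "ǐ" becomes "i" followed by U+030C (caron).
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns a non-zero class for combining marks only.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_tone_marks("nǐ hǎo"))
    print(strip_tone_marks("Zhōngguó"))
```

If ü has to stay distinct from u (for example as v), it would need to be replaced before the combining marks are dropped.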
---
Conversion from clear Latin to all lowercase:
Result: https://drive.google.com/file/d/1NaWMgF ... sp=sharing
Attachment: Chinese_Mandarin_Pinyin_character_frequency_letters.png
Attachment: Chinese_Mandarin_Pinyin_character_frequency_letters+symbols.png
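The lowercase step and the character frequency counts behind the two charts can be approximated like this. This is only a sketch: the function name letter_frequencies and the letters_only switch are hypothetical, and the exact set of symbols counted for the "letters+symbols" chart is an assumption (here: everything except whitespace).

```python
from collections import Counter


def letter_frequencies(text: str, letters_only: bool = True) -> list[tuple[str, float]]:
    """Return (character, relative frequency) pairs, most frequent first."""
    lowered = text.lower()  # the "all lowercase" step
    if letters_only:
        # Count only the 26 basic Latin letters.
        chars = [c for c in lowered if "a" <= c <= "z"]
    else:
        # Assumption: symbols means every non-whitespace character.
        chars = [c for c in lowered if not c.isspace()]
    counts = Counter(chars)
    total = sum(counts.values())
    return [(c, n / total) for c, n in counts.most_common()]


if __name__ == "__main__":
    for char, share in letter_frequencies("Ni hao, ni hao ma?"):
        print(f"{char}: {share:.3f}")
```

Run over the full lowercase result file, the letters_only=True output should be comparable to the first chart and letters_only=False to the second.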

Early optimization result:
Attachment: OptCNv1.png

Re: Mandarin / 官话 - text corpora

Post by Optilon »

Optimization result for S2V1:
Attachment: optMAN-Mandarin.png
Attachment: optMan-Man50En.png
Attachment: optMAN-english.png