For Mandarin Chinese, I used the following text corpora from the Leipzig Corpora Collection (uni-leipzig):
https://wortschatz.uni-leipzig.de/en/download/chinese
zho_news_2007-2009_30K
zho-cn_web_2015_30K
zho-simp-tw_web_2014_30K (seems to be from Taiwan, but still included, since everything is converted to Pinyin anyway)
zho-mo_web_2016_10K (seems to be from Macao, but still included, at 10% of the total, for lack of other corpora)
---
total: 100K
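
A minimal sketch of how the sentences could be pulled out of the downloaded files before merging them. It assumes each line of a sentences file has the form "<id><TAB><sentence>"; that layout and the function name extract_sentences are assumptions for illustration, not something confirmed above.

```python
from typing import Iterable, Iterator


def extract_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Yield the sentence part of each line, dropping the leading ID column.

    Assumed line layout (not confirmed by the post): "<id>\t<sentence>".
    Lines without a tab are passed through unchanged; blank lines are skipped.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        _, sep, sentence = line.partition("\t")
        yield sentence if sep else line


# Example with two made-up lines in the assumed layout.
sample = ["1\t你好。\n", "2\t再见。\n", "\n"]
merged = list(extract_sentences(sample))
```

Running this over each of the four corpora in turn and concatenating the output would give one combined sentence list.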
---
Chinese simplified:
Chinese Simplified 100k - 5 files
---
Conversion tool:
https://www.fuhaoku.net/tool/pinyin.html
---
Result pinyin:
Chinese Mandarin converted to Pinyin 100k - 5 files
---
Conversion from Pinyin to clear Latin (tone marks removed):
Tool (PowerShell): https://drive.google.com/file/d/1cUk4U6 ... sp=sharing
Result clear Latin: https://drive.google.com/file/d/1ltvH0c ... sp=sharing
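
For anyone who cannot run the PowerShell tool, here is a rough Python sketch of the same idea. It is not the linked script, just one common way to do it: decompose each character with Unicode NFD and drop the combining marks. The function name strip_tone_marks is made up for this example. Note that this approach also turns "ü" into "u"; whether the linked tool does the same (or maps it to "v") is not something I can tell from the post.

```python
import unicodedata


def strip_tone_marks(text: str) -> str:
    """Remove diacritics (including Pinyin tone marks) from text.

    Assumption: every accented letter should become its base letter,
    so "ǐ" -> "i" and also "ü" -> "u".
    """
    # NFD splits e.g. "ǎ" into "a" + a combining caron.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for combining marks; keep everything else.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


example = strip_tone_marks("Nǐ hǎo, Zhōngguó")
```

Characters without diacritics, including any Hanzi left in the text, pass through unchanged.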
---
Conversion from clear Latin to all lowercase:
Result: https://drive.google.com/file/d/1NaWMgF ... sp=sharing
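
The lowercase step is simple enough to sketch in a few lines of Python. This is only an illustration, assuming the text is processed line by line; the function name lowercase_lines is hypothetical.

```python
from typing import Iterable, Iterator


def lowercase_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield each input line converted to lowercase."""
    for line in lines:
        yield line.lower()


lowered = list(lowercase_lines(["Ni hao\n", "Zhongguo\n"]))
```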
Early optimization result:
Mandarin / 官话 - text corpora
Re: Mandarin / 官话 - text corpora
Optimization result for S2V1: