Spanish / Español - text corpora


Post by Optilon »

For Spanish, I used the following text corpora from uni-leipzig (https://wortschatz.uni-leipzig.de/en/download/spanish):
spa_news_2011_10K
spa_newscrawl_2015_10K
spa_newscrawl-public_2019_10K
spa_web_2016_10K
spa_wikipedia_2016_10K
spa-ar_web-public_2019_10K
spa-co_web_2015_10K
spa-mx_web_2015_10K
spa-pe_web_2016_10K
spa-ve_web_2016_10K
---
total: 100K sentences (10 files of 10K sentences each)
---
Spanish:
Spanish 100k - 10 files
---
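For anyone rebuilding the combined 100K file, here is a minimal sketch of the merge step. It assumes each download contains a sentences file whose lines look like `<number><TAB><sentence>`; the function names and the output filename are hypothetical and not taken from the post.

```python
def strip_sentence_id(line):
    """Drop a leading '<number><TAB>' prefix if present (assumed file layout)."""
    head, sep, rest = line.partition("\t")
    if sep and head.strip().isdigit():
        return rest
    return line


def merge_corpora(paths, out_path):
    """Concatenate sentence files into one UTF-8 text file, one sentence per line.

    Returns the number of sentences written. Empty lines are skipped.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for raw in src:
                    sentence = strip_sentence_id(raw.rstrip("\n")).strip()
                    if sentence:
                        out.write(sentence + "\n")
                        count += 1
    return count
```

Calling `merge_corpora(list_of_ten_files, "merged_spanish.txt")` should then report roughly 100,000 sentences if every input really holds 10K lines.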
Conversion tool for diacritics (ñ|á|é|í|ó|ú|Ñ|Á|É|Í|Ó|Ú):
https://drive.google.com/drive/folders/ ... sp=sharing
---
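The post does not say what the linked tool turns each diacritic into, so the following is only a sketch of the general idea: replace every character from the list above with a substitute string. The mapping shown (a dead-key-style marker followed by the base letter) is a hypothetical example, not the tool's actual output.

```python
import unicodedata

# Hypothetical mapping for illustration only; swap in whatever the real tool uses.
EXAMPLE_MAP = {
    "á": "´a", "é": "´e", "í": "´i", "ó": "´o", "ú": "´u", "ñ": "~n",
    "Á": "´A", "É": "´E", "Í": "´I", "Ó": "´O", "Ú": "´U", "Ñ": "~N",
}


def convert_diacritics(text, mapping=EXAMPLE_MAP):
    """Replace each mapped character with its substitute string.

    The text is first normalized to NFC so that decomposed forms
    (base letter + combining mark) are matched as well.
    """
    text = unicodedata.normalize("NFC", text)
    return text.translate(str.maketrans(mapping))
```

Characters that are not in the mapping (for example ü) pass through unchanged, which matches the list of twelve characters given above.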
Conversion to all lowercase characters:
Spanish with diacritics:
characterfrequency_spanish_with_diacritics.png
---
Spanish with converted diacritics (ñ|á|é|í|ó|ú|Ñ|Á|É|Í|Ó|Ú):
characterfrequency_spanish_with_converted_diacritics.png
---
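Character frequency charts like the two above can be reproduced with a few lines of Python. This is a minimal sketch, not the script used for the images; it assumes the text is lowercased first and that whitespace is left out of the count, which the post does not confirm.

```python
from collections import Counter


def char_frequencies(text, lowercase=True):
    """Return (character, relative frequency) pairs, most frequent first.

    Whitespace is ignored (an assumption); returns an empty list for
    text that contains no countable characters.
    """
    if lowercase:
        text = text.lower()
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return []
    return [(ch, n / total) for ch, n in counts.most_common()]
```

Running it once on the original corpus and once on the corpus with converted diacritics gives the two tables to compare.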
I uploaded the configuration files used for the optimization so that it can be reproduced later:
Link: viewtopic.php?f=12&t=20

./opt -2 spanish2020.txt -i 20000 -K optS1V1.cfg

./opt -2 spanish2020.txt -i 20000 -K optS2V1.cfg

echo SPANISH:;./opt -2 spanish2020.txt -r bsptast.txt -K controlS1V1.cfg;
SPANISH.png