Portuguese / Português - text corpora


Post by Optilon

For Portuguese, I used the following text corpora from the Leipzig Corpora Collection (uni-leipzig):
https://wortschatz.uni-leipzig.de/en/do ... portuguese:
por-an_web_2015_10K
por-br_newscrawl_2011_10K
por-cv_web_2015_10K
por-mo_newscrawl_2011_10K
por-mo_web_2016_10K
por-mz_web_2016_10K
por_newscrawl_2018_10K
por-pt_newscrawl_2011_10K
por-pt_web_2015_10K
por_wikipedia_2016_10K
---
total: 100K sentences (10 corpora of 10K each)
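
A minimal sketch of how the ten corpora could be merged into one input file. It assumes the Leipzig sentence files use one "id<TAB>sentence" pair per line; the function names and any file paths are hypothetical and not taken from the post.

```python
def read_sentences(path):
    """Yield the sentence text from a file with 'id<TAB>sentence' lines (assumed format)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # skip empty lines
            _, _, sentence = line.partition("\t")
            # Fall back to the whole line if there is no tab-separated id.
            yield sentence if sentence else line


def merge_corpora(paths, out_path):
    """Write all sentences from the given files to out_path; return the sentence count."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            for sentence in read_sentences(path):
                out.write(sentence + "\n")
                count += 1
    return count
```

Called with the ten sentence files listed above, merge_corpora should return roughly 100000 if each file really holds 10K sentences.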
---
Portuguese:
Portuguese Google Drive: the corpora are in the folder "sentences"
---
Conversion tool for diacritics (ñ|á|é|í|ó|ú|Ñ|Á|É|Í|Ó|Ú):
https://drive.google.com/drive/folders/ ... sp=sharing
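
The post does not say what the linked tool turns the diacritics into, so the following is only an illustration of one possible conversion and not the tool itself. It assumes each accented letter is rewritten as a dead-key symbol followed by the base letter; the DEAD_KEYS symbols and the function name are hypothetical.

```python
import unicodedata

# Hypothetical placeholder symbols for dead keys; not taken from the original tool.
DEAD_KEYS = {
    "\u0301": "´",  # combining acute accent (á é í ó ú)
    "\u0303": "~",  # combining tilde (ñ)
}


def convert_diacritics(text):
    """Rewrite accented letters as dead-key symbol + base letter (assumed scheme)."""
    out = []
    for ch in text:
        # NFD splits a precomposed letter into base letter + combining mark(s).
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        if len(marks) == 1 and marks in DEAD_KEYS:
            out.append(DEAD_KEYS[marks] + base)
        else:
            out.append(ch)  # leave everything else unchanged
    return "".join(out)
```

Letters whose marks are not in DEAD_KEYS (for example ç) pass through unchanged, so the table would have to be extended for whatever character set the real tool covers.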
---
Conversion to all lowercase characters:
Spanish with diacritics:
characterfrequency_spanish_with_diacritics.png
---
Spanish with converted diacritics (ñ|á|é|í|ó|ú|Ñ|Á|É|Í|Ó|Ú):
characterfrequency_spanish_with_converted_diacritics.png
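
A rough sketch of how character frequencies like those in the plots above could be computed after lowercasing. How the original plots were actually produced is not stated in the post; the function name is hypothetical, and ignoring whitespace is an assumption.

```python
from collections import Counter


def char_frequencies(text):
    """Return relative frequencies of lowercased, non-whitespace characters,
    ordered from most to least frequent."""
    lowered = text.lower()
    counts = Counter(ch for ch in lowered if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.most_common()}
```

Running this once on the original text and once on the text with converted diacritics would give the two distributions to compare.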
---
I uploaded the configuration files I used so that the optimization can be reproduced later:
Link: https://opt-in-layout.org/viewtopic.php?f=12&t=20

./opt -2 spanish2020.txt -i 20000 -K optS1V1.cfg

./opt -2 spanish2020.txt -i 20000 -K optS2V1.cfg

echo SPANISH:;./opt -2 spanish2020.txt -r bsptast.txt -K controlS1V1.cfg;
SPANISH.png