FinCORE
Keywords: Internet language
This corpus is a sample of the Finnish Internet Parsebank, a collection of texts crawled from the searchable Internet. The texts have been manually annotated by register. The annotation follows the taxonomy presented by Douglas Biber and Jesse Egbert (Biber, D., & Egbert, J. (2018). Register Variation Online. Cambridge University Press), which consists of 8 main registers and 33 subregisters and aims to cover all linguistic variation on the Internet.
The annotated texts have been split into the files train.tsv, dev.tsv and test.tsv in the folder data. In each TSV file, every row holds the register assigned to a text in the first column and the text itself in the second column. In total, the corpus includes 2,226 annotated texts.
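The file layout described above can be read with a few lines of Python. This is a minimal sketch, assuming the files are UTF-8 encoded and have no header row (neither is stated here); the function names read_register_tsv and register_counts are hypothetical. The register field is kept as a raw string, because how two registers are encoded within one field is not specified.

```python
from collections import Counter
from pathlib import Path


def read_register_tsv(path):
    """Read (register, text) pairs from a two-column TSV file."""
    rows = []
    # Assumption: UTF-8 encoding and no header row.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the first tab only, so any tabs inside the text survive.
            register, _, text = line.partition("\t")
            rows.append((register, text))
    return rows


def register_counts(rows):
    """Count how often each raw register field occurs."""
    return Counter(register for register, _ in rows)


if __name__ == "__main__":
    data_dir = Path("data")  # folder name taken from the description above
    for name in ("train.tsv", "dev.tsv", "test.tsv"):
        path = data_dir / name
        if path.exists():
            rows = read_register_tsv(path)
            print(name, len(rows), register_counts(rows).most_common(5))
```

Splitting on the first tab by hand, rather than using a CSV parser, avoids problems with quote characters and very long fields, both of which are common in web text.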
Details about the resource
- Language: Finnish
- Form: written language
- Genre: Internet language
- Dataset size: 2,226 texts
- Annotation: register (each text has been annotated with 1–2 registers)
| Valtteri Skantsi | |
| Roosa Kyllönen | |
| Veronika Laippala | |
| Jesse Egbert | |
| Douglas Biber | |
| Sampo Pyysalo | |
Available at
Contact person
| Veronika Laippala | mavela *at* utu.fi |