FreCORE

Keywords: Internet language

This corpus is a sample of the French-language searchable Internet. The texts have been manually annotated by register, following the taxonomy presented by Douglas Biber and Jesse Egbert (Biber, D., & Egbert, J. (2018). Register Variation Online. Cambridge University Press). The taxonomy consists of 8 main registers and 33 subregisters that aim to cover all linguistic variation on the Internet.

The annotated texts have been split into the files train.tsv, dev.tsv and test.tsv in the folder data/FreCORE. In each TSV file, every row holds the register assigned to a text in the first column and the text itself in the second column. In total, the corpus includes 1,818 annotated texts.
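The two-column layout described above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: the function names read_register_tsv and load_splits are hypothetical, and the code assumes UTF-8 encoding, no header row, and no quoting in the files. The register column is kept as a raw string, because this description does not specify how a second register is encoded when a text carries two.

```python
from pathlib import Path
from typing import Dict, Iterator, List, Tuple


def read_register_tsv(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (register, text) pairs from one two-column TSV file.

    Assumptions (not confirmed by the corpus description):
    UTF-8 encoding, no header row, no CSV-style quoting.
    """
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            # Split on the first tab only, so any tabs inside the text survive.
            register, _, text = line.partition("\t")
            yield register, text


def load_splits(data_dir: str = "data/FreCORE") -> Dict[str, List[Tuple[str, str]]]:
    """Load whichever of train.tsv, dev.tsv and test.tsv exist in data_dir."""
    splits: Dict[str, List[Tuple[str, str]]] = {}
    for name in ("train", "dev", "test"):
        path = Path(data_dir) / f"{name}.tsv"
        if path.exists():
            splits[name] = list(read_register_tsv(str(path)))
    return splits


if __name__ == "__main__":
    for name, rows in load_splits().items():
        print(name, len(rows))
```

Files that are missing from the folder are skipped rather than raising an error, so the same call works on a partial download. If the files turn out to have a header row or a different encoding, the reader function would need to be adjusted accordingly.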

Details about the resource

Content
  • Language: French
  • Form: written language
  • Genre: Internet language
  • Dataset size: 1,818 texts
Annotations
  • register

    Each text has been manually annotated with 1–2 registers.

Authors
Veronika Laippala 
Jesse Egbert 
Douglas Biber 
Sampo Pyysalo 
Saara Hellström 
Anna Salmela 
Liina Repo 
Samuel Rönnqvist 
Miika Oinonen 
Availability

Contact person

Veronika Laippala, mavela *at* utu.fi

Usage licence

CC BY
Referring

Reference instructions

Repo, L., Skantsi, V., Rönnqvist, S., Hellström, S., Oinonen, M., Salmela, A., Biber, D., Egbert, J., Pyysalo, S., & Laippala, V. (2021). Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers. EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Student Research Workshop, 183–191. http://arxiv.org/abs/2102.07396