Register Classified OSCAR

Keywords: Internet language

The register classified OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a multilingual sample of the searchable internet whose documents have been automatically classified into registers (genres). The annotation follows the taxonomy of Douglas Biber and Jesse Egbert (Biber, D., & Egbert, J. (2018). Register Variation Online. Cambridge University Press), which consists of 8 main registers and 33 subregisters intended to cover all linguistic variation on the internet.

This corpus contains approximately 2 TB of data. The languages included are Arabic, English, Spanish, French, Hindi, Portuguese, Swahili, Urdu and Chinese. The files are in JSONL format, with one record per line of the form {id: integer, labels: [label, label], text: document}. "id" is the numeric identifier of the text, "labels" is the list of registers assigned to the text, and "text" contains the text itself.
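As an illustration of the record format described above, the following minimal Python sketch parses JSONL lines and counts how often each register label occurs. It assumes each line is a JSON object with the keys "id", "labels" and "text"; the function names and the label strings in the sample data are hypothetical placeholders, not actual register codes from the corpus.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_labels(records: Iterable[dict]) -> Counter:
    """Count how often each register label occurs across the records."""
    counts = Counter()
    for record in records:
        # "labels" is assumed to be a list of strings, as in the description.
        counts.update(record["labels"])
    return counts


# Two made-up records in the documented shape. The label strings are
# placeholders and do not correspond to real register codes.
sample_lines = [
    '{"id": 1, "labels": ["label_a", "label_b"], "text": "first document"}',
    '{"id": 2, "labels": ["label_a"], "text": "second document"}',
]

label_counts = count_labels(iter_documents(sample_lines))
print(label_counts)
```

Because iter_documents accepts any iterable of lines, an open file handle can be passed in place of sample_lines, so a large file is read one record at a time rather than loaded into memory at once.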

More information about OSCAR can be found at:

https://huggingface.co/datasets/oscar

Details about the resource

Content

Language: Arabic, English, Spanish, French, Hindi, Portuguese, Swahili, Urdu, Chinese
Form: written language
Genre: Internet language
Dataset size: approximately 2 TB

Annotations

register

Authors

Veronika Laippala
Sampo Pyysalo
Miika Oinonen
Samuel Rönnqvist

Availability

Available at

https://huggingface.co/datasets/mhtoin/register_oscar/tree/main

Contact person

Veronika Laippala

mavela *at* utu.fi

Links

UTU-Digilang front page