Register Classified OSCAR

Keywords: Internet language

The register-classified OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a multilingual sample of the searchable Internet in which the documents have been automatically classified into registers (genres). The annotation follows the taxonomy presented by Douglas Biber and Jesse Egbert (Biber, D., & Egbert, J. (2018). Register Variation Online. Cambridge University Press), which consists of 8 main registers and 33 subregisters that aim to cover all linguistic variation on the Internet.

This corpus contains approximately 2 TB of data. The languages included are Arabic, English, Spanish, French, Hindi, Portuguese, Swahili, Urdu and Chinese. The files are in JSONL format, with one document per line, of the form {id: integer, labels: [label, label], text: document}. "id" is the numeric identifier of the text, "labels" is the list of registers assigned to the text, and "text" contains the text itself.
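As an illustration of the record format described above, the following Python sketch reads JSONL records and counts how often each register label occurs. It assumes that each line is a standard JSON object with quoted keys "id", "labels" and "text". The sample records and the labels "A" and "B" are made-up placeholders, not actual corpus content or actual register codes.

```python
import io
import json
from collections import Counter


def iter_documents(lines):
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_labels(lines):
    """Count how often each register label occurs across all documents."""
    counts = Counter()
    for doc in iter_documents(lines):
        counts.update(doc["labels"])
    return counts


# Two made-up records in the described form; "A" and "B" are placeholder labels.
sample = io.StringIO(
    '{"id": 1, "labels": ["A", "B"], "text": "first document"}\n'
    '{"id": 2, "labels": ["A"], "text": "second document"}\n'
)
label_counts = count_labels(sample)
print(label_counts)
```

The functions accept any iterable of lines, so an opened corpus file can be passed in place of the in-memory sample. Because the records are read one line at a time, the whole file does not need to fit in memory.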

More information about OSCAR can be found at:

https://huggingface.co/datasets/oscar

Details about the resource

Content
  • Language: Arabic, English, Spanish, French, Hindi, Portuguese, Swahili, Urdu, Chinese
  • Form: written language
  • Genre: Internet language
  • Dataset size: 2 TB
Annotations
  • register
Authors
Veronika Laippala 
Sampo Pyysalo 
Miika Oinonen 
Samuel Rönnqvist 
Availability

Contact person

Veronika Laippala
mavela *at* utu.fi