Register Classified OSCAR
Keywords: Internet language
The register classified OSCAR (Open Super-large Crawled ALMAnaCH coRpus) is a multilingual sample of the searchable Internet that has been automatically classified into registers (genres). The annotation follows the taxonomy presented by Douglas Biber and Jesse Egbert (Biber, D., & Egbert, J. (2018). Register Variation Online. Cambridge University Press), which consists of 8 main registers and 33 subregisters that aim to cover all linguistic variation on the Internet.
This corpus contains approximately 2 TB of data. The languages included are Arabic, English, Spanish, French, Hindi, Portuguese, Swahili, Urdu and Chinese. The files are in JSONL format, with one document per line, each of the form {id: integer, labels: [label, label], text: document}. Here "id" is the numeric identifier of the text, "labels" is the list of registers assigned to the text, and "text" contains the text itself.
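The record layout described above can be read line by line, which matters for files of this size. The following is a minimal Python sketch, assuming that each line is a valid JSON object with quoted keys "id", "labels" and "text". The function names read_records and count_labels are illustrative, and the labels "A" and "B" in the sample are placeholders, not actual register codes from the corpus.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def read_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSONL line, without loading everything at once."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_labels(records: Iterable[dict]) -> Counter:
    """Count how often each register label occurs across the records."""
    counts = Counter()
    for record in records:
        counts.update(record["labels"])
    return counts


# Two made-up records; "A" and "B" are placeholder labels (assumption),
# not real register codes from the corpus.
sample = [
    '{"id": 1, "labels": ["A", "B"], "text": "first document"}',
    '{"id": 2, "labels": ["A"], "text": "second document"}',
]
label_counts = count_labels(read_records(sample))
print(label_counts)
```

For a real corpus file, an open file handle can be passed in place of the sample list, for example `with open(path, encoding="utf-8") as f: count_labels(read_records(f))`, where `path` is a hypothetical file name. Because both functions work on iterators, only one line is held in memory at a time.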
More information about OSCAR can be found at:
Details about the resource
- Language: Arabic, English, Spanish, French, Hindi, Portuguese, Swahili, Urdu, Chinese
- Form: written language
- Genre: Internet language
- Dataset size: approximately 2 TB
- register
| Veronika Laippala | |
| Sampo Pyysalo | |
| Miika Oinonen | |
| Samuel Rönnqvist | |
Contact person
| Veronika Laippala | mavela *at* utu.fi |