The Corpus of Academic Finnish (LAS1)

Keywords: academic language

A text corpus is a large, systematically compiled collection of texts that contains examples of natural language use. A digital corpus allows linguists to analyse language computationally and to search for vocabulary, grammatical structures and contexts of language use.
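As an illustration of the kind of context search a digital corpus makes possible, the following minimal Python sketch prints keyword-in-context (KWIC) lines for a search word. It is not part of the LAS1 resource or its tools: the function name `kwic`, the naive tokenisation and the two Finnish sample sentences are assumptions made up for this example.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left, keyword, right) tuples with `width` tokens of context on each side."""
    # Naive tokenisation (an assumption): lowercase runs of word characters.
    # Python's \w is Unicode-aware for str, so letters such as ä and ö are kept.
    tokens = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits

# Made-up sample text, not taken from the corpus.
sample = ("Tutkimus käsittelee akateemista kieltä. "
          "Akateemista kieltä opitaan kirjoittamalla.")

for left, kw, right in kwic(sample, "kieltä", width=2):
    print(f"{left:>30} [{kw}] {right}")
```

The sketch matches exact word forms only and lets the context window cross sentence boundaries. A lemmatised and morphologically annotated corpus would instead allow a search to cover all inflected forms of a word.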

The University of Turku has a long tradition of producing text corpora, including corpora of dialects, standard language and Finnish as a second language (LAS2). The Corpus of Academic Finnish, a subproject of the larger Digilang project, aims to create an additional digital corpus composed of two subcorpora: the LAS1 corpus, which consists of Master’s theses by native Finnish writers, and another corpus, which consists of research papers written in Finnish. The aim of these corpora is to offer a large collection of academic Finnish texts, representing all fields of academic research, for use in teaching and research.

For further information about the project, please contact Professor Ilmari Ivaska.

Details about the resource

Content

Language: Finnish
Form: written language
Genre: theses
Dataset size: 22,365 sentences, 317,282 words, 404,933 word tokens

Annotations

lemmatisation
morphology
syntax

The resource is manually annotated using the Syntax Archive's guidelines.

Authors

Elisa Reunanen	project researcher
Markku Nikulin	project researcher
Kirsti Siitonen	founder and member of the steering committee

Availability

The dataset is available on request from the contact persons listed below.

Contact persons

Elisa Reunanen	etreun at utu.fi
Markku Nikulin	marnik at utu.fi
Nobufumi Inaba	ninaba at utu.fi

Citation

Reference instructions

LAS1 = The Corpus of Academic Finnish. University of Turku, School of Languages and Translation Studies, Department of Finnish and Finno-Ugric Languages

Links

UTU-Digilang front page