The Corpus of Academic Finnish (LAS1)

Keywords: academic language

A text corpus is a large, systematically gathered collection of texts that contains examples of natural language. A digital corpus allows linguists to analyse language computationally and to search for vocabulary, grammatical structures and the contexts in which language is used.

The University of Turku has a longstanding tradition of producing text corpora, including corpora of dialects, standard language and Finnish as a second language (LAS2). The Corpus of Academic Finnish, a subproject of the larger Digilang project, aims to create an additional digital corpus composed of two subcorpora: the LAS1 corpus, which consists of Master’s theses by native Finnish writers, and another corpus that consists of research papers written in Finnish. The aim of these corpora is to offer a large collection of academic Finnish texts, representing all fields of academic research, for use in teaching and research.

For further information about the project, please contact Professor Ilmari Ivaska.

Details about the resource

Content
  • Language: Finnish
  • Form: written language
  • Genre: theses
  • Dataset size: 22,365 sentences, 317,282 words, 404,933 word tokens
Annotations
  • lemmatisation
  • morphology
  • syntax

The resource has been manually annotated according to the Syntax Archive's guidelines.

Authors
Elisa Reunanen, project researcher
Markku Nikulin, project researcher
Kirsti Siitonen, founder and member of the steering committee

Availability
The dataset is available on request from the contact persons listed below.

Contact persons

Elisa Reunanen, etreun *at* utu.fi
Markku Nikulin, marnik *at* utu.fi
Nobufumi Inaba, ninaba *at* utu.fi

Referring

Reference instructions

LAS1 = The Corpus of Academic Finnish. University of Turku, School of Languages and Translation Studies, Department of Finnish and Finno-Ugric Languages