MarKo Corpus (Mari texts)

Keywords: literary language, journalistic language, academic language

The MarKo Corpus of Meadow and Hill Mari texts dates back to the late 1980s. Excerpts from various sources were typed manually into files.

The Meadow Mari part of the corpus contains ca. 313,000 word tokens, and the Hill Mari part ca. 75,000.

The Meadow Mari texts are numbered from 1 to 150, and the Hill Mari texts from 301 to 331. The lines of each text are also numbered. A position in the corpus is indicated by combining the text number and the line number, for example 33:67 (the 67th line of text 33).
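
The numbering scheme above can be handled programmatically. The following is a minimal sketch, assuming positions are written exactly as `text:line` with a single colon; the names `Position`, `parse_position` and `variety` are hypothetical and are not part of the corpus or the portal.

```python
from typing import NamedTuple


class Position(NamedTuple):
    """A corpus position: text number and line number (hypothetical helper type)."""
    text: int
    line: int


def parse_position(ref: str) -> Position:
    """Split a reference such as '33:67' into its text and line numbers."""
    text_part, line_part = ref.split(":")
    return Position(int(text_part), int(line_part))


def variety(text_number: int) -> str:
    """Return the language variety for a text number, per the ranges stated above."""
    if 1 <= text_number <= 150:
        return "Meadow Mari"
    if 301 <= text_number <= 331:
        return "Hill Mari"
    raise ValueError(f"text number {text_number} is outside the documented ranges")


pos = parse_position("33:67")
print(pos.text, pos.line, variety(pos.text))
```

Text numbers outside the two documented ranges raise an error here rather than being guessed at.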

Texts 77-79 contain Meadow Mari example sentences from three grammars: Alhoniemi 1984, Sovremennyy mariyskiy yazyk 1961, and Vasikova 1982.

Please note that the texts were first written in a non-standard Latin transliteration and later automatically transliterated into Cyrillic. There may be cases where the Cyrillic text of the corpus does not exactly match the original text. The non-standard Latin transliteration of the original corpus can still be seen in the text names.

The corpus is accessible through the Finno-Ugric Corpora portal.

Details about the resource

Content
  • Language: Meadow Mari, Hill Mari
  • Form: written language
  • Genre: fiction, journalistic texts, scientific texts, folklore
  • Dataset size: ca. 388,000 word tokens
Authors
Jorma Luutonen, coordinator
Availability

Contact persons

Jussi Ylikoski, volgaserver *at* utu.fi