Turku Tatar Corpus (TuTaC)

Keywords: literary language, journalistic language

The Turku Tatar Corpus v1.0 contains a selection of Tatar texts that were collected from the Internet and assembled into a corpus in August 2009. The corpus is accessible through the Finno-Ugric Corpora portal.

The texts were copied from two Internet sites: the Tatar Electronic Library (Татарская электронная библиотека, http://kitap.net.ru/) and the All Tatar Press page (Matbugat.ru, Бөтен татар матбугаты, http://www.matbugat.ru/). At the time of writing (April 2021), the former site can no longer be found.

The project was initiated by Jorma Luutonen, and the texts were collected and edited by Annika Setälä. The compilers of the corpus also wish to thank Mansur Saykhunov for his help.

Contents of the Corpus

The corpus consists of 713 text files containing ca. 1,700,000 word tokens. The texts range in size from whole books to short articles.

Most of the Tatar Electronic Library materials are fictional prose (245 texts, containing 1,161,000 words). There are also 65 poetic works (136,000 words) and 11 scholarly texts (63,000 words).

The All Tatar Press page texts represent the following fields or topics: accidents, advice, cars, countryside, crime, culture, ecology, economy, education, fate, festivities, food, humour, internet, letters, medicine, miracles, nation, politics, press, relationships, religion, show business, society, and sport. The number of texts in each field or topic varies from 5 (ecology) to 24 (culture).

Details about the resource

Content
  • Language: Tatar
  • Form: written language
  • Genre: fiction, journalistic texts, poetry
  • Dataset size: 713 texts, 1,700,000 word tokens
Authors
Jorma Luutonen, coordinator
Availability

Contact person

Jussi Ylikoski, volgaserver *at* utu.fi
Referring

Reference instructions

A reference to the corpus should contain the following parts: 1) name of the corpus; 2) (abbreviated) name of the text; and 3) line number in the text.

The name of the corpus is Turku Tatar Corpus, abbreviated TuTatC.

No fixed abbreviations for the corpus text names are available. You can form your own abbreviations on the basis of the "nice name" of the text (see below).

If you have access to the corpus through the Finno-Ugric Corpora portal, you can find information about a specific text in the following way. When you have made a query, click the text identification code (for example, A6) on the left, in the column "Text". The "nice name" of the text, containing the name of the text/publication and, for some texts, additional information, is shown.

The "nice name" of the All Tatar Press page texts usually also names the original publishing medium. The Tatar Electronic Library texts do not usually contain any information about the original publication or the publishing date.

The original fixed line numbers of the corpus files can be seen inside the query result lines in the form "8:", "9:", etc. They can be used to specify a location in a given text.
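
The three-part reference described above can be assembled mechanically. The following Python sketch is only an illustration, not part of the corpus tooling. It assumes that a query result line has been copied as plain text and that the fixed line number appears in it as digits followed by a colon (as in "8:"). The function names, the comma-separated layout of the reference, and the abbreviation "MyText" are hypothetical choices, since the instructions do not prescribe a fixed citation format.

```python
import re

# Corpus abbreviation as given in the reference instructions.
CORPUS_ABBREVIATION = "TuTatC"

# Assumption: the fixed line number is written as digits followed by a colon,
# e.g. "8:". The exact layout of a query result line may differ in the portal.
LINE_NUMBER_PATTERN = re.compile(r"(?<!\d)(\d+):")


def extract_line_number(result_line):
    """Return the first 'N:' line number found in a result line, or None."""
    match = LINE_NUMBER_PATTERN.search(result_line)
    return int(match.group(1)) if match else None


def format_reference(text_abbreviation, line_number):
    """Join corpus name, text abbreviation and line number into one string.

    The separator and the word 'line' are illustrative choices only.
    """
    return f"{CORPUS_ABBREVIATION}, {text_abbreviation}, line {line_number}"


if __name__ == "__main__":
    # Placeholder result line; real lines contain Tatar text after the number.
    sample = "8: (text of the line)"
    number = extract_line_number(sample)
    print(format_reference("MyText", number))
```

Here "MyText" stands for whatever abbreviation you derive from the "nice name" of the text. Any consistent layout works, as long as the reference contains all three parts.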