The Corpus of Prosodic Variation of Finnish (Prosovar)
Keywords: dialects, spoken language, prosody, elicited recording tasks
The multidisciplinary project The Regional and Social Variation of Finnish Prosody (Prosovar) was conducted by the University of Turku and financed by the Kone Foundation (2013–2015; see also Kurki & al. 2014; Nieminen and Kurki 2017). The objectives of the project included a) compiling a speech corpus designed for the study of Finnish prosody and its regional and social variation (The Corpus of Prosodic Variation in Finnish) and b) developing and testing a method of data collection and analysis for the study of natural spoken language.
As a complement to traditional fieldwork methods for obtaining speech data in dialectology and sociolinguistics, a new, partially crowdsourced method for collecting sociolinguistic and sociophonetic data via the Internet was developed and tested in the Prosovar project. There was a precedent for collecting sociolinguistic data on the Internet (in particular, Dialect Topography by Professor J. K. Chambers; cf. Chambers, 1994), but to our knowledge, Prosovar was one of the first attempts in dialectology, sociolinguistics and sociophonetics (cf. computational linguistics; e.g. Lane & al. 2010; McGraw 2013) to collect speech data over the Internet. Developing the data collection methods in Prosovar required a multidisciplinary approach that drew on expertise in dialectology, sociolinguistics, (socio)phonetics, computer science and the Finnish language.
The idea was to motivate non-linguists to participate in data collection by completing recording tasks with a web application created for the Prosovar project. From the beginning of the project, it was crucial to find ways to attract volunteers willing to record speech samples for linguistic research. Public presentations, newspaper interviews and social media campaigns were used to arouse public interest. The possibility of listening to anonymous speech samples from other participants and the game-like design elements of the application also proved to be good ways of raising participants' interest.
Participants were able to make recordings with their personal computers, (Android) tablet computers and (Android) cellular phones, as long as their device had a microphone and they had created a user account. This also opened a way for them to participate further in the research: once they had made recordings for the database, they were allowed to listen to randomly selected anonymous voice clips from the database and evaluate them in a folk linguistic manner. For example, a participant was asked to listen to a clip and locate the speaker’s dialect on a map, or to describe with a few adjectives what the speaker in the clip sounded like. This information can be investigated from a folk linguistic perspective, by analyzing the language in relation to the respondents, and from a computer science perspective, by applying dialect recognition techniques (e.g. to study how humans and computers perceive sounds differently).
Unregistered guest users were only able to listen to a few selected anonymous samples and to obtain general information about Finnish colloquial speech and the dialect samples collected so far in the project. To access the recording tasks, one had to 1) create a user account and 2) accept the terms and conditions of use; to access the “game”, in which one listened to short audio clips and tried to locate their speakers, one also had to 3) finish at least one recording task. All the data and the background information about the participants were moved to a separate server for privacy and security reasons. By the end of November 2015, there were approximately one thousand registered users, of whom 395 had made recordings for the project, with a total of over 9,300 recorded samples.
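The access conditions described above can be illustrated with a short sketch. It is not the code of the Prosovar web application; the class and function names are hypothetical, and the sketch assumes that the recording tasks required an account and accepted terms of use, while the game additionally required at least one finished recording task.

```python
from dataclasses import dataclass


@dataclass
class User:
    """Hypothetical user record; field names are illustrative only."""
    has_account: bool = False
    accepted_terms: bool = False
    completed_recordings: int = 0


def can_record(user: User) -> bool:
    # Assumption: recording tasks open after steps 1) and 2).
    return user.has_account and user.accepted_terms


def can_play_game(user: User) -> bool:
    # Assumption: the listening game also needs step 3),
    # at least one finished recording task.
    return can_record(user) and user.completed_recordings >= 1


guest = User()
newcomer = User(has_account=True, accepted_terms=True)
contributor = User(has_account=True, accepted_terms=True, completed_recordings=1)

for label, user in [("guest", guest), ("newcomer", newcomer), ("contributor", contributor)]:
    print(label, can_record(user), can_play_game(user))
```

Under these assumptions, a guest can neither record nor play, a newly registered user can record but not yet play, and a user with one finished recording task can do both.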
A selection of the Prosovar audio recordings has been segmented and phonetically annotated in the Digilang project (2018–2021) and is available for linguistic research. In the future, this annotated corpus will be available in the Language Bank of Finland.
Details about the resource
- Language: Finnish
- Form: audio, database with background information about informants
- Genre: colloquial language
- Dataset size: 5,700 audio excerpts
- Timescale: 2014–2016
- prosody
| Tommi Kurki | founder and principal investigator (PI) |
| Tommi Nieminen | founder and member of the steering committee |
| Hamid Behravan | project researcher |
Contact person
| Tommi Kurki | tommi.kurki *at* utu.fi |