Dissertation defence (Computer Science): MSc Jenna Kanerva

Time

15.3.2024 at 11.00 - 15.00
MSc Jenna Kanerva will defend the doctoral dissertation “Understanding the Structure and Meaning of Finnish Texts: From Corpus Creation to Deep Language Modelling” in a public examination at the University of Turku on Friday 15.03.2024 at 11.00 (University of Turku, Educarium, Edu2, Assistentinkatu 5, Turku).

The audience can also attend the defence remotely: https://echo360.org.uk/section/bcc6b487-8fee-445c-a37f-eba6fa28d8a0/public (copy the link into your browser).

Associate Professor Kairit Sirts (University of Tartu, Estonia) will act as the opponent and Professor Tapio Salakoski (University of Turku) as the custos. The event will be held in English. The field of the dissertation is computer science.

The dissertation in the university's publication archive: https://urn.fi/URN:ISBN:978-951-29-9623-0

***

Summary of the doctoral research:

Natural Language Processing (NLP) is a field that aims to develop methods for analysing, understanding or generating human language. The primary aim of this thesis is to advance NLP in Finnish by providing more resources and investigating machine-learning-based practices for their use. While NLP includes various topics involving textual or speech data, this thesis focuses specifically on understanding the structure and meaning of written language. The research concentrates on structural and grammatical analysis (syntactic parsing) and on statements that convey the same meaning but use different words (paraphrase modelling).

The first set of contributions of this thesis centres on the development of a state-of-the-art Finnish parser, a tool for analysing the grammatical structure of Finnish text. The overall outcome of this line of research is a machine-learned tool that approaches human performance in analysing standard written Finnish. Major advances were obtained by using pre-trained neural language models.

The success of large language models in syntactic parsing, as well as in many other tasks, raises the question of whether these models genuinely comprehend language. However, datasets designed to measure semantic comprehension in Finnish have been non-existent or very scarce. To address this limitation, the second part of the thesis shifts its focus to language understanding through paraphrase modelling. The second contribution of the thesis is the creation of a novel, large-scale, manually annotated corpus of Finnish paraphrases, which can be used, for example, to measure the ability of language models to handle variation in expressing similar ideas.
Communications