FIN-CLARIAH
FIN-CLARIAH (Common Language Resources and Technology Infrastructure) is a nationwide research infrastructure for Social Sciences and Humanities that comprises two components: FIN-CLARIN and DARIAH-FI.
The University of Turku advances the goals of FIN-CLARIAH by supporting and sharing resources. The digitisation of the Archives of the School of History, Culture and Arts Studies supports the usage of resources in the archive, and the UTU-Digilang portal highlights the wide variety of corpora and other language resources developed at the School of Languages and Translation Studies. In addition to resource work, the University also has numerous projects advancing digital humanities. The topic is also an important part of education provided by the University.
FIN-CLARIAH work at the University of Turku is led by the TurkuNLP and Turku Data Science Group research groups, which are known for language corpora and tools for their usage, and digital curation of humanities data.
FIN-CLARIAH projects
The project uses machine learning to automatically identify social media text varieties in web datasets. The deliverable includes the following resources:
- A multilingual classifier for labeling web documents by their register (or genre), including social media categories such as blogs and forums.
- Social media subtype classifiers for English, Finnish, and Swedish for identifying thematic groups within social media registers (e.g. travel topics within Narrative Blogs).
- Datasets labeled with register and fine-grained social media subtype metadata.
- A demonstration pipeline and tutorial on Google Colab
- A code repository on GitHub
These tools provide contextual metadata for internet datasets, which expands their usability in the social sciences and humanities, for example in corpus linguistics, digital humanities, and computational social science.
Details as well as all the tools can be found on GitHub
Contributors: Erik Henriksson, Tuomas Lundberg, and Veronika Laippala
This deliverable consists of three parts:
- Text quality data: tools for cleaning web-based data so that unwanted elements such as "click here" and "read more" are removed
- Register annotations for the Oscar dataset: a multilingual collection of text tagged with information about text register
- Toxicity classifier: tools for detecting toxic language in Finnish and labeling it by type
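The cleaning step in the first item can be illustrated with a minimal sketch. It is not the project's tooling: the phrase list, the `normalise` helper, and the exact-match rule are assumptions made for illustration, and only the two quoted phrases come from the description above.

```python
import string

# Hypothetical phrase list; only "click here" and "read more" come from the text.
BOILERPLATE_PHRASES = {"click here", "read more"}

# Characters stripped from both ends of a line before comparison (an assumption).
EDGE_CHARS = string.punctuation + "»›…"


def normalise(line: str) -> str:
    """Lowercase a line and strip surrounding whitespace and punctuation."""
    return line.strip().strip(EDGE_CHARS).strip().lower()


def clean_document(text: str) -> str:
    """Drop lines that consist only of a known boilerplate phrase."""
    kept = [
        line
        for line in text.splitlines()
        if normalise(line) not in BOILERPLATE_PHRASES
    ]
    return "\n".join(kept)


sample = "Parliament met on Tuesday.\nRead more »\nClick here!\nThe debate continued."
print(clean_document(sample))
```

Matching whole lines rather than substrings keeps ordinary sentences that happen to begin with "Click here" intact, at the cost of missing boilerplate embedded in longer lines.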
Contributors: Veronika Laippala, Filip Ginter, Sampo Pyysalo, Anni Eskelinen, and Anna Salmela
The pipeline includes two parts: one which extracts documents that might include question and answer pairs from web-crawled corpora, and one which extracts the QA pairs from documents. Both are available for Finnish and English.
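The two-stage structure can be sketched as follows. This is a heuristic stand-in under stated assumptions, not the project's method: the question-mark rule and the function names `may_contain_qa`, `extract_qa_pairs`, and `run_pipeline` are hypothetical, and the actual tools may work quite differently.

```python
def may_contain_qa(document: str) -> bool:
    """Stage 1: flag documents with at least one line ending in a question mark."""
    return any(line.strip().endswith("?") for line in document.splitlines())


def extract_qa_pairs(document: str) -> list[tuple[str, str]]:
    """Stage 2: pair each question line with the non-question line that follows it."""
    lines = [line.strip() for line in document.splitlines() if line.strip()]
    pairs = []
    for question, answer in zip(lines, lines[1:]):
        if question.endswith("?") and not answer.endswith("?"):
            pairs.append((question, answer))
    return pairs


def run_pipeline(corpus: list[str]) -> list[tuple[str, str]]:
    """Run stage 2 only on the documents that stage 1 lets through."""
    pairs = []
    for document in corpus:
        if may_contain_qa(document):
            pairs.extend(extract_qa_pairs(document))
    return pairs


corpus = [
    "What is a corpus?\nA collection of texts.\n\nWhy annotate it?\nTo support analysis.",
    "This document has no questions.",
]
for question, answer in run_pipeline(corpus):
    print(question, "->", answer)
```

Splitting the work this way means the cheaper document-level filter discards most of a web crawl before the more detailed pair extraction runs.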
In addition to the tools, the project produced annotated corpora of QA pairs in English and Finnish.
Links to the tools and corpora can be found at the Language Bank of Finland.
Contributors: Anni Eskelinen, Veronika Laippala, Amanda Myntti, Erik Henriksson, and Sampo Pyysalo
This project provides tools for the study of political texts. The tools were developed for the FinParl corpus, which contains parliamentary plenary speeches from 1907 to the present day.
Two tools are available on the LAWPOL site:
- KWIC tool for FinParl corpus: This tool provides a user interface to query word embeddings with the KWIC (Key Word In Context) method. The tool offers a simple and intuitive user interface built with R Shiny, with which the user can query key word embeddings of the FinParl corpus of plenary debates of the Finnish parliament (eduskunta) and use the KWIC results to inspect n-grams and to visualise key word embeddings as text networks.
- TNA tool for the analysis of speeches of Finnish MPs: This tool will provide functionalities for vocabulary-based content analysis of political speeches. The user selects an MP and can then study 1) a timeline of the MP’s plenary speeches, 2) a word cloud of up to 500 of the MP’s most used words, as well as 3) a speaker-to-concept network consisting of the 50 most frequently used concepts of the selected MP and of their most similar colleagues (similarity measured as word-based cosine similarity).
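Two ideas named in the list, keyword-in-context lookup and word-based cosine similarity, can be sketched in a few lines. The actual tools are built with R Shiny and work on word embeddings. The plain-Python functions `kwic` and `cosine_similarity` below are hypothetical illustrations on raw word counts and do not reproduce the LAWPOL implementation.

```python
import math
from collections import Counter


def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for every hit of keyword."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits


def cosine_similarity(words_a, words_b):
    """Cosine similarity between the word-count vectors of two speakers."""
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vocabulary is treated as similar to nothing
    return dot / (norm_a * norm_b)


tokens = "the committee discussed the budget and the budget passed".split()
for left, key, right in kwic(tokens, "budget", window=2):
    print(f"{left:>15} [{key}] {right}")
```

Ranking all other speakers by `cosine_similarity` against a selected speaker's word list would give a "most similar colleagues" ordering in the spirit of the description, though the tool's own preprocessing is not specified here.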
Contributors: Kimmo Elo, Veronika Laippala, Otto Tarkka, Pyry Kantanen & Markus Korhonen
LAWPOL has received funding from the European Union – NextGenerationEU instrument and is funded by the Academy of Finland under grant numbers 352827, 353569, and 352828.
Other projects, research groups and networks in digital human sciences
The Centre is one of eleven chosen by the Research Council of Finland for the period 2026–2033. The research groups are at the international cutting edge of research in their fields.
The group researches human diversity and its temporal and locational changes by using internet language, linguistic and cultural resources as well as ancient and modern genetic data. The group studies how humans integrate into new places considering their lifestyles, languages, cultures, and genetics. The main focus is on Finland, but local changes are connected to the history of the Uralic speaker area and with regard to internet language also to global contexts.
The group is led by Professor of Evolutionary Biology Virpi Lummaa, and the researchers are Professor of Digital Linguistics Veronika Laippala, Professor of Evolutionary Genomics Päivi Onkamo, and Assistant Professor of Evolutionary Linguistics Outi Vesakoski.
As the name suggests, TurkuNLP focuses on natural language processing. The international and interdisciplinary group has diverse viewpoints and interests ranging from corpus annotation and analysis to the theory and applications of machine learning.
The main research topics are:
- Syntactic and semantic analysis of Finnish
- Large Language Models for Finnish and other languages
- BioNLP: mining and modeling biological, biomedical, and clinical texts
- Modeling and analysis of registers and genres on the internet
- Linguistic analysis of language use on the internet
Turku Data Science Group focuses on the computational analysis of complex natural and social systems. Key applications include microbiome research, population studies, and computational humanities.
The group blends machine learning, artificial intelligence, statistical and probabilistic programming, complex systems, and data science in order to develop novel, targeted techniques to extract information and insights from rich data streams based on combinations of human and machine intelligence.
Turku Data Science Group website
The work has been supported by the European Union/Horizon, the Research Council of Finland, the Strategic Research Council in Finland, the Kone Foundation, the Alhopuro foundation, the Finnish Cultural Foundation, the Turku University Foundation, Biocity Turku, CIMO/EFUDI, the UTUGS/MATTI and UTUGS/DPT Graduate Schools, and the University of Turku.
DigiHeri focuses on digitised and/or born-digital cultural heritage: its identification, utilisation, enrichment, and critical research. It combines cultural heritage research with cutting-edge study on digitality, language-based AI and natural language processing. The aim is to understand the transformation of cultural heritage and its impact in the digital era. What new forms of cultural heritage has digitalisation produced? How does the division between digital and non-digital heritage affect the way we perceive and study the world? What is the relationship between digitised and born-digital heritage?
DigiHeri enhances multidisciplinary research by bringing together and building on expertise and cooperation in computer science, cultural heritage studies, digital culture, education, history and archaeology, language technology, linguistics, legal studies, media studies, and the digital humanities. The overall goal of DigiHeri is to understand the transformation of cultural heritage, and its ramifications, in the digital age. It has a clear emphasis on social impact, on the visibility of cultural heritage in public life and on its role in education.
The project emphasises research on digital cultural heritage, the development and exploration of open methods and tools, the construction of shared digital infrastructures, and the societal and cultural implications of new forms of heritage.
DigiHeri is funded by the Academy of Finland (Profi 8).
Human Diversity investigates the effects of human contacts and communication networks on material and immaterial culture, genes, disease burden, transgenerational effects, and the evolutionary fitness of people. The approach is multi-disciplinary, and uses datasets including ancient genetic data, parish records detailing historical demographics and life events, language data, as well as archeological finds.
Human Diversity is funded by the Research Council of Finland (Profi7).
BEDLAN is a multi-disciplinary group of researchers studying the evolution of languages. The goal is to connect the linguistic history of the Uralic languages to the history of people in their speaker area. To achieve this goal, the group combines linguistics, geography, archeology, cultural history, and environmental sciences.
The data gathered by the group is shared according to principles of open science. The data includes information about the features of Uralic languages and Finnish dialects, geographical data about language speaker areas, and archeological data.
The group has been supported by the Kone Foundation, the Institute for the Languages of Finland, the Otto A. Malm Foundation as well as the Ella and Georg Ehrnrooth Foundation.
The network connects the methods of digital culture, discourse analysis, and digital humanities. It aims to develop quantitative methods that enable better use of large digital datasets in the human sciences.
Research topics include social media and internet phenomena like online cultures and communities, changing forms of communication, processes and genres, as well as societal and institutional communication on social media. The topics are studied from historical, communicative, cultural, and linguistic perspectives.
The network currently includes researchers from five faculties at the University of Turku.
The research group uses and develops computational methods for studying the past.
Research projects from the Turku Group for Digital History:
- Computational History and the Transformation of Public Discourse in Finland, 1640–1910: The project investigates the scope, nature, development, and transnational connections of Finnish public discourse. In addition, the project has published an open database on text reuse.
- Information Flows across the Baltic Sea: Swedish-language press as a cultural mediator, 1771–1918: The project uses digital newspaper collections to investigate text reuse between Swedish and Finnish newspapers. The project examines how the Swedish-language press acted as a cultural mediator between Finland and Sweden from the end of the 18th century to the beginning of the 20th century.
- Movie Making Finland: Finnish fiction films as audiovisual big data, 1907–2017: The project uses speech recognition, image analysis, and natural language processing to explore Finnish fiction films. The project examines how cinema has imagined and interpreted Finnish modernisation and its discontents. In addition to research findings, it provides methods for the study of historical change in audiovisual cultural heritage.
- The Ancient Finnish Kings: a computational study of pseudohistory, medievalism and history politics in contemporary Finland and Russia: The project studies text reuse to find where pseudohistorical narratives about, for example, mighty warrior kings and ancient Slavic kingdoms arise from, how they spread, and how they acquire new contexts.
- Oceanic Exchanges: Tracing Global Information Networks In Historical Newspaper Repositories, 1840–1914: OcEx brings together leading efforts in computational periodicals research from six countries (Finland, Germany, Mexico, the Netherlands, the United Kingdom, and the United States) to examine the global connections between newspapers across national and linguistic boundaries.
- Propreau – Profiling Premodern Authors: The consortium develops machine learning tools for the examination of authorship in classical and medieval Latin texts. The authorship attribution of anonymous texts requires comparison to a large number of other texts, which computational methods make feasible at a larger scale than traditional methods.
- Romantic Cartographies: Lived and Imagined Space in English and German Romantic Texts, 1790–1840: The project applied named-entity linking to extract toponyms from German- and English-language Romantic fiction and travelogues. The results were visualised as maps, which provide findings about the relationships between the center and periphery and urban and natural areas in Romanticism.
- The Atlas of Finnish Literature 1870–1940: The project uses named entity recognition, linked open data, and geographic information systems to map places mentioned in Finnish-language literature. In addition to research findings, the project created a freely usable web app which allows the user to see places from literature on maps and explore the literature.
- Texts on the Move: Reception of Women’s Writing in Finland and in Russia 1840–2020: The project aims to map the reception of Russian women writers in Finland as well as the reception of Finnish (both Finnish- and Swedish-speaking) women authors in Russia. By studying reception (translations and other transnational movement of texts), the project highlights a wide range of literary, cultural, and social networks. The results of the project will be added to the international SHEWROTE database, which examines the reception of works by female authors prior to the year 1940, as well as the Finnish-Russian and Russian-Finnish parallel corpora of literary texts ParFin and ParRus.
- Viral Culture in Early Nineteenth-Century Europe: The project analyses text reuse to examine what kind of ideas and themes were infectious and how texts circulated in the press.
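Several of the projects above rely on text reuse detection. The listing does not describe the method they use, so the following is only a generic sketch of one common idea: comparing word n-gram "shingles" with Jaccard overlap. The function names `shingles` and `reuse_score` and the choice of n=3 are assumptions, and a simple approach like this would not cope well with the OCR noise typical of digitised newspapers.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (as tuples) in a lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def reuse_score(text_a, text_b, n=3):
    """Jaccard overlap of the two texts' shingle sets, from 0.0 to 1.0."""
    set_a, set_b = shingles(text_a, n), shingles(text_b, n)
    if not set_a or not set_b:
        return 0.0  # texts shorter than n words share no shingles
    return len(set_a & set_b) / len(set_a | set_b)


original = "the ship arrived in the harbour on monday"
reprint = "the ship arrived in the harbour on tuesday"
print(round(reuse_score(original, reprint), 3))
```

Pairs of passages whose score exceeds a chosen threshold would be treated as candidate reuse and inspected further. The threshold itself would have to be tuned on the material at hand.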
The project investigates digitised Finnish-language newspapers and magazines published in North America between 1876 and 1923. New technologies offer new research possibilities and by using them the project is developing methods for the study of transnational Finnish culture in North America.
The aim of the project is to explore what the methods of digital humanities can tell us about the construction of the Finnish press in North America and its relationship to the former and present homelands. A multidisciplinary team will carry out the research using methods of text-reuse detection, textual genre detection, and named entity recognition, enabling both close and distant reading of the substantial data.
The Imagined Homelands project has a strong cultural significance. Finnish newspapers in North America have so far been a relatively under-used resource for researchers and citizens alike. In the future, the newspaper material digitized by the National Library of Finland will be useful for genealogical research, for example, as it will provide invaluable information about the generations who set out in search of a better life on the other side of the Atlantic. During the project, researchers have access to around 350,000 pages of Finnish newspapers and journals, or periodicals, from North America, and the material is already available to everyone to search and read via the digi.kansalliskirjasto.fi service.
The project is funded by the Kone Foundation, 2024–2027, and is based on a cooperation between the Universities of Helsinki and Turku and the National Library of Finland. The PI of the project is Hannu Salmi, University of Turku.
The IDA consortium critically examines digitalisation, data-led media, and privacy as well as their conflicts in modern Finland.
The project examines:
- the impact of data-led culture on social roles and relationships as well as vulnerabilities relating to them
- the usage of intimacy in social relationships and public professions like performing arts or politics
- the just control, sharing, and usage of private data
The project develops innovative tools that combine qualitative and quantitative methods, which can be used to study data culture and data leaks. The purpose of the consortium is to promote just and open principles for the handling of personal data.
The work packages "Digital intimacies", "Politicized intimacies", and "Legalized intimacies" are led from the University of Turku.
The project is funded by the Research Council of Finland.
The consortium project aims to map and analyze 19th-century Finnish fiction published as books, based on bibliographic metadata in the national bibliography Fennica. The project will examine literature that has been given less attention both quantitatively and qualitatively, and which earlier research has not highlighted. The project members are the National Library of Finland, the University of Turku, and the University of Eastern Finland.
Details about Finnish- and Swedish-language fiction will be mapped into an enriched format that permits large-scale statistical analyses and the design of reproducible data science workflows. The focus of the project is on the years 1809–1917, when the Grand Duchy of Finland was an autonomous part of the Russian Empire.
In addition to literary history, the project will produce open research data and data science methods for use by the research community and the public.
The project is funded by the Research Council of Finland and participating universities.
The research project combines translation studies, literary studies, and language and translation technology. The project studies narrativity and its modelling, literary characteristics and problems in their translation, the working methods and technological needs of literary translators, and ethical challenges such as copyright issues and sustainable development that are connected to the use of translation technology in literary translation.
The goal of the project is to develop a prototype for a language technology program which aims to assist human translators in professionally translating literary texts from English into Finnish.
The project is funded by the Research Council of Finland.
Professor of Literature Viola Parente-Čapková is a member of the board of this DARIAH-EU working group, which investigates how female authors were read in the past.
Courses and Degree Programmes related to digital humanities
Digital Language Studies combines linguistics and language technology. Language technology develops computational methods for processing the language produced by humans. Well-known applications include machine translation, chatbots, and various text mining methods, such as identifying opinions from large text corpora.
The advancements of language technology have made possible artificial intelligence models, speech recognition, and text categorisation that are all around us. The importance of language technology and its applications has grown significantly in recent years for both businesses and universities, and so both language and ICT experts need an understanding of language technology.
The studies are organized jointly by the School of Languages and Translation Studies and the Department of Computing. The 25 ECTS of basic studies as well as elective courses provide a thorough understanding of language use in digital environments and the possibilities of automated methods in linguistic research.
The studies provide excellent skills for the management and processing of extensive big data materials for the needs of both language research and text mining. No prior knowledge of programming or computers is required.
In the Language Specialist Degree Programme, students can choose to major in Digital Language Studies.
The purpose of this programme is to educate language experts who can use the methods of language technology for many applications and understand their operating principles and opportunities. After completing this training, the student will have a strong understanding of the usage of language technology tools and quantitative analysis of textual data.
The Master’s Degree Programme in Information and Communication Technology provides versatile and high quality ICT education in selected fields of ICT, with an established reputation in innovative, interdisciplinary, and international education.
The Data Analytics track trains specialists for the effective utilization and communication of data in research, decision-making, and society. The focus of teaching is on understanding and applying the operational principles of the key data analysis methods in practice.
The degree programme contains three disciplines: Digital Culture, Landscape Studies, and Cultural Heritage Studies. Digital Culture studies online communities and social media, game cultures, and the cultural change of technology. Landscape Studies examines the built environment and natural and cultural landscape as materiality, experience, and representation. Cultural Heritage Studies investigates cultural memory, history management, and the use of history.
Studies in Digital Culture offer diverse skills and expertise in the subject matter, theory, and methodology, which can be used in a wide range of professions. The three focal points (online communities and social media, game cultures, and the cultural change of technology) are strongly present in teaching both thematically and through practical examples and applications. Teaching and research go hand in hand: courses are connected to research projects so that teaching is based on recent research and students are able to participate in research.
More information about the programme
The mission of Utuling is to provide PhD training for language specialists in a world of multilingualism and multiculturalism, where encounters between different languages and their speakers are increasingly important.
The doctoral researchers' areas of research belong to linguistic, translation, and literary scholarship and to other humanistic studies, and their approach is often multi- or cross-disciplinary. Options for majors in Utuling include all subjects available at the School of Languages and Translation Studies as well as two additional options.
The course provides the student with an understanding of digital humanities related to literary studies. By reading research articles and theory, the student will become familiar with different branches of digital literary studies and gain a critical understanding of the subject's central discussions.
Contact person
Multidisciplinary themes present in FIN-CLARIAH work