Open access to language data is a fundamental prerequisite for the successful development of language technology. Although several Latvian speech corpora exist and are used to train competitive speech recognition models, they are not openly available. The open speech corpora available at the beginning of 2023 were relatively modest, with the 18-hour Latvian Common Voice 13.0 corpus being the most extensive among them. The situation changed when researchers, technology enthusiasts, and cultural heritage experts joined forces to launch a crowdsourcing initiative called “Balsu talka”.
The first campaign of “Balsu talka” was launched on May 4th, 2023, featuring a new landing page (balsutalka.lv) linked to the Mozilla Common Voice platform, which had already accumulated the initial 18 hours of voice recordings. By the start of December 2023, seven months into the initiative, contributions had reached a total of 186 hours of voice recordings. This achievement can be attributed to the dedicated contributions of the many individuals actively engaged in the campaign. The current outcomes of the initiative highlight its benefits for the accessibility of open speech data and underscore the qualitative enrichment brought by a diverse range of contributors who are aware of the importance of each contribution.
The paper examines the challenges and opportunities of citizen engagement in language technology advancement, situating them within the broader context of citizen science and participatory methodologies in the humanities. By drawing insights from the “Balsu talka” initiative and its communication strategies, and by analyzing their efficiency, the presentation will contribute to the discourse on democratizing access to language resources and research data in general, emphasizing the tangible impact of citizen-driven efforts in enhancing the accessibility of speech data for language technology development.
Sanita Reinsone, PhD, is a senior researcher and head of the DH Research Group at the Institute of Literature, Folklore and Art of the University of Latvia, specializing in digital humanities and folklore studies. Reinsone is a former head of the Digital Archives of Latvian Folklore (garamantas.lv) and a leader of national and international research projects dealing with digital humanities, autobiography heritage, and participatory methods in humanities. She teaches folklore studies and digital humanities courses at the University of Latvia and Riga Technical University. Since February 2022, Reinsone has also been coordinating the international volunteer NGO #ScienceForUkraine.
Normunds Grūzītis, Dr. sc. comp., is a senior researcher at the Institute of Mathematics and Computer Sciences of the University of Latvia and head of the Artificial Intelligence Laboratory. He is also an associate professor at the Faculty of Computing, University of Latvia. Grūzītis is an experienced coordinator of projects on natural language understanding and generation, speech recognition, and the creation of advanced Latvian language resources. In cooperation with industry partners, he has been involved in several language technology-driven innovation projects. His main research interests are in computational linguistics and language technology, combining knowledge-based and machine-learning approaches. He received his PhD in computer science from the University of Latvia in 2011, followed by a postdoc position at the University of Gothenburg, Sweden. He represents Latvia as Technical NAP at ELRC (CEF) and as National Expert at CLARIN ERIC.
Ilze Auziņa, Dr. philol., is a senior researcher at the AiLab, Institute of Mathematics and Computer Sciences of the University of Latvia. She is a Latvian linguist who defended her PhD thesis on computational phonology, investigating the syllable structure, grapheme-phoneme correspondences, and phonotactics of Latvian. She is a co-author of “The Grammar on Modern Latvian”, contributing on phonetics and phonology. Auziņa has extensive research experience in corpus linguistics and in speech data processing and analysis. She has coordinated several research projects on the creation of advanced Latvian language resources and has carried out several specialised corpora development projects as project coordinator and leading researcher.
Baiba Valkovska, Dr. philol., is a senior researcher at the Institute of Mathematics and Computer Sciences of the University of Latvia. She is a Latvian linguist who defended her PhD thesis on word order and information structure in Latvian, and she is a co-author of “The Grammar on Modern Latvian”, contributing on syntax and information structure. Valkovska has over 10 years of experience in the morphological, syntactic and semantic analysis of Latvian. Her research in computational linguistics focuses on the multi-layered, semantically annotated language resources for Latvian that are needed for natural language processing.