Webinars on crowdsourcing and machine transcription

During October and November 2020, Språkbanken Sam is organizing several webinars on crowdsourcing and machine transcription. You are welcome to join!

Many archives and museums are working to convert handwritten texts into machine-readable form. When this succeeds, the benefits are obvious: not only can more people make use of the collections, but it also becomes possible to search the material and ask new questions of it.

For this webinar series we have invited a number of archivists and researchers and asked them to present ongoing crowdsourcing and/or machine transcription projects. What are the ideas behind the projects? How are they structured? What are the results? What could have been done better, and what are the plans for the future? Each webinar will begin with a project presentation of about 30 minutes, followed by a discussion.

Program

13 October, 10-12

Sanita Reinsone, Riga
Crowdsourcing in Practice: Digital Archives of Latvian Folklore (the webinar is in English)

Crowdsourced transcription of manuscripts of the Archives of Latvian Folklore (ALF) has been carried out since 2014, when its digital archive http://garamantas.lv was made open to the public. Two years later, a specialised digital platform, lv100.garamantas.lv, was launched to further promote crowdsourcing of folklore manuscripts. Since 2016, volunteers have spent more than 24,700 hours deciphering ALF's manuscripts in eleven languages, providing invaluable help in making the folklore collections digitally accessible. The presentation will give insight into the different user involvement strategies practised by ALF, reveal the main challenges and problems, and discuss the motivation for participating.

20 October, 10-12

Críostóir Mac Cárthaigh, Dublin
Meitheal Dúchas.ie: Sharing the work of digitizing the National Folklore Collection
(the webinar is in English)

Meitheal Dúchas.ie is a crowdsourcing project established in 2015 to promote the digitization of folklore texts from the National Folklore Collection, University College Dublin, hosted on its digital platform www.duchas.ie. In the intervening years, almost 6,000 people have taken part in the project. They include academic researchers, students, educators, local historians, artists and writers. To date, more than 250,000 pages of archive material have been digitized, a process that has quickened noticeably in recent months as a consequence of the Covid-19 pandemic. The hope is to extend this crowdsourcing model to other elements of the National Folklore Collection, including photographic and audio material.

17 November, 13-15

Karl-Magnus Johansson, Göteborg
Machine Learning and Local Knowledge -- A Presentation of an Ongoing Handwritten Text Recognition and Citizen Science Project at the National Archives in Gothenburg (the webinar is in Swedish)

In early 2020, a Handwritten Text Recognition (HTR) and Citizen Science project was initiated at the Swedish National Archives in collaboration with GPS400 - Centre for Collaborative Visual Research at the University of Gothenburg. The project's archival material consists of police reports from Gothenburg, 1868-1902, comprising more than 22,000 pages of handwritten text. To produce high-quality training data for the HTR model, as well as to raise the quality of the automatically transcribed data, people from civil society were invited to participate in the project. In this presentation, archivist Karl-Magnus Johansson talks about his experiences of the ongoing project in connection with recent studies of the relationship between data and local knowledge.

24 November, 13-15

Erik Magnusson Petzell, Göteborg
Automatic transcription of dialect texts (the webinar is in Swedish)

In this seminar, I will describe my ongoing work with automatic transcription of 19th-century dialect texts, handwritten in a traditional phonetic alphabet that is only marginally used today. Such texts exist in archives all over Scandinavia, and through them, we are granted access to the linguistic subtleties of an era that is too distant to have been caught on audio tape. So far, I have only scratched the surface of this great pile of detailed dialect data.

For practical reasons, I have started with texts from the dialect archive in Gothenburg, where I work. In the presentation, I will describe all the steps involved in converting the image of handwritten text to a digital and fully searchable counterpart, highlighting various difficulties I have encountered along the way. These include transliteration issues (How does one transcribe non-Unicode fonts?), problems with machine learning (How can an HTR model trained on one hand/dialect be extended to more hands/dialects?), and not least challenges relating to output: in order to make the old dialect texts useful for different sorts of linguistic research, the precise phonetic transcription cannot constitute the only resource. In addition, there is a need for several conversions of the original text into different, more or less simplified formats, which, in turn, can also be useful for non-linguists (both other researchers and members of the general public).

How best to build such a multi-layered resource is one of the questions that I look forward to discussing with the webinar attendees. Another concerns crowdsourcing. What would be suitable tasks for members of the public in this project? Layout analysis and metadata extraction only? Or more advanced tasks, such as corrections of machine transcriptions or even manual transcriptions of new hands?