
Building a cross-lingual dataset for medieval manuscript text recognition: challenges and outcomes of CATMuS

Online

Over the past few years, we have developed the CATMuS Medieval dataset, which covers over 200 manuscripts and incunabula in ten languages, dating from the 8th to the 16th century, and comprises more than 160,000 lines of text and 5 million characters. In this presentation, we will discuss the complexities involved in creating a dataset that serves both medievalists and computer vision researchers. We will explore the inherent tensions between the requirements of historical scholarship and the technical demands of machine learning applications. Additionally, we will share insights into the challenges faced during the construction and evaluation of the dataset, and demonstrate the current capabilities enabled by this resource.

The CATMuS dataset offers a uniform framework for annotating medieval manuscripts. Its rich metadata, including century of production, language, genre, and script, makes it possible to evaluate automatic text recognition models across multiple dimensions. It also supports other tasks such as script classification, dating approaches, and exploratory work in computer vision and digital palaeography. Developed through collaboration among various institutions and projects, CATMuS aims to mitigate the challenges arising from diverging transcription standards for medieval manuscripts and to provide a comprehensive benchmark for evaluating handwritten text recognition models on historical sources.

Speaker Info:

Thibault Clérice is a computational humanities (CH) and natural language processing (NLP) researcher specialising in classical philology. He currently works at Inria within the ALMAnaCH project team, focusing on developing resources and models for the analysis and structuring of textual data, particularly for ancient and historical languages. He is involved in projects that promote open access and reproducibility for NLP and CH, such as the COLaF initiative, which focuses on the languages of France.

