
Part-of-Speech Tagging & Lemmatisation in Unedited Greek: Simple Tasks, Complex Challenges?

Bush House, Strand Campus, London

In today’s landscape of language technology, dominated by large language models, tasks like part-of-speech tagging and lemmatisation receive less attention in NLP research. However, these tasks still pose significant challenges, especially for under-resourced, morphologically rich languages like Ancient Greek. Our project focuses on the verbatim transcriptions of Byzantine marginal poetry stored in the Database of Byzantine Book Epigrams (DBBE). Due to the highly interconnected nature of the poems, we aim to eventually perform similarity detection across the corpus. As a first step, we sought to annotate the DBBE with part-of-speech tags, morphological analyses, and lemmas. Although research on these tasks dates back to more straightforward rule-based systems from the 1970s, current taggers struggle with these unedited texts. The inconsistent orthography, largely due to itacism, adds to this complexity. To mitigate these issues, we trained a transformer-based language model encompassing classical, medieval, and modern Greek. Our experiments, however, revealed that fine-tuning the model for each annotation task was not always fruitful. There is a growing tendency to address such challenges with a multi-task head, an approach inspired by cognitive psychology that allows the model to process multiple annotations concurrently. This raises the question: will this more intricate solution outshine the seemingly more transparent methods of the past?

Speaker info:

Colin Swaelens is a PhD student at the Language & Translation Technology Team (LT3) and the Database of Byzantine Book Epigrams (DBBE) at Ghent University, under the supervision of dr. Ilse De Vos (Flanders AI Academy) and prof. Els Lefever (LT3). His PhD research is embedded in the project "Interconnected texts: a graph-based computational approach to Byzantine paratexts as nodes between textual transmission and cultural and linguistic developments". Within this project, he is developing an annotation pipeline to provide all texts in the DBBE with a part-of-speech tag, morphological analysis and lemma. In a later stage, this linguistic information will be used to develop a tool for detecting similar verses in the corpus, serving the other subprojects on manuscript culture and formulaicity.

To receive the link to join, please register here by 1 December 2024.

