Domain Adaptation in Sequence Labelling: A Case Study for Two South African Languages

Tanja Gaustad; Roald Eiselen

doi:10.3384/nejlt.2000-1533.2026.6066

Authors

Tanja Gaustad Centre for Text Technology (CTexT), North-West University https://orcid.org/0000-0002-1455-1941
Roald Eiselen Centre for Text Technology (CTexT), North-West University https://orcid.org/0000-0002-8612-5175

Abstract

In this paper, we investigate domain adaptation for part-of-speech (POS) tagging of two under-resourced South African languages, isiZulu and Sesotho sa Leboa. We study how a change of domain affects POS tagging results and how to predict the quality that can be expected when an existing POS tagger is applied to a new domain. We carry out systematic experiments across six domains (governmental texts, exam texts for grade 12 South African learners, magazines, newspapers, novels, and PhD theses) to determine how POS tagger accuracy deteriorates when switching between domains. To mitigate this deterioration, we test three domain adaptation strategies to determine which is most suitable in highly under-resourced scenarios. The results show that adding even relatively small amounts of annotated data from the target domain delivers higher accuracy on that domain than the other domain adaptation methods. To determine the underlying causes of the accuracy deterioration, we run a forward stepwise linear regression modelling experiment, which shows that a combination of lexical and syntactic divergence accounts for a significant amount of the deterioration and is a good predictor of the deterioration to expect when applying POS tagging to a new domain.
