Study of Language Identification Task on the Token Level for Ukrainian-Russian Code-Mixing Dataset

Olha Kanishcheva; Maria Shvedova; Liudmyla Dyka; Kristina Husenko

doi:10.3384/nejlt.2000-1533.2026.5725

Authors

Olha Kanishcheva: Heidelberg University, SET University. ORCID: https://orcid.org/0000-0002-9035-1765
Maria Shvedova: National Technical University "Kharkiv Polytechnic Institute". ORCID: https://orcid.org/0000-0002-0759-1689
Liudmyla Dyka: National University of Kyiv-Mohyla Academy. ORCID: https://orcid.org/0000-0003-2985-9292
Kristina Husenko: University of Helsinki. ORCID: https://orcid.org/0000-0002-1327-0339


Abstract

This paper presents experiments on token-level language identification for a Ukrainian-Russian code-switching dataset. Code-switching, a common phenomenon in multilingual societies, presents significant challenges for natural language processing. The paper discusses issues encountered during dataset creation, emphasizing the difficulty of accurately annotating code-switched text, and describes cases where identifying the language of individual tokens in sentences that switch between Ukrainian and Russian proves difficult even for human annotators. The relatedness of the two languages and their shared use of Cyrillic script complicate the task: many words are spelled identically in both languages, because clear phonetic differences between them are not reflected in writing. The study explores different models and libraries for language identification on the token level. Experimental results suggest that BERT shows promising performance, and that other models, such as CRFs with n-grams, character-level BiLSTMs, and word-level neural networks, are also worth considering for this task. This research contributes to the development of language processing technologies for multilingual contexts, with potential applications in sentiment analysis, information retrieval, and social media monitoring.
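
As a rough illustration of why identical spellings make token-level identification hard, the following sketch labels a token using only letters that belong to one alphabet and not the other (і, ї, є, ґ for Ukrainian; ы, э, ъ, ё for Russian). It is not one of the models evaluated in this paper; the function name `label_token` and the label set are hypothetical, chosen only for this example. Any Cyrillic token without a distinguishing letter comes out as "ambiguous", which is the kind of case where context-aware models or human judgment are needed.

```python
# Illustrative baseline only (an assumption, not the paper's method):
# label a token by alphabet-specific letters in standard orthography.

# Letters found in the Ukrainian alphabet but not the Russian one.
UKRAINIAN_ONLY = set("\u0456\u0457\u0454\u0491")  # і, ї, є, ґ
# Letters found in the Russian alphabet but not the Ukrainian one.
RUSSIAN_ONLY = set("\u044b\u044d\u044a\u0451")  # ы, э, ъ, ё


def is_cyrillic(ch: str) -> bool:
    """True if the character lies in the basic Cyrillic Unicode block."""
    return "\u0400" <= ch <= "\u04ff"


def label_token(token: str) -> str:
    """Return 'uk', 'ru', 'ambiguous', or 'other' for a single token."""
    chars = set(token.lower())
    if not any(is_cyrillic(c) for c in chars):
        return "other"  # punctuation, digits, Latin-script words, etc.
    has_uk = bool(chars & UKRAINIAN_ONLY)
    has_ru = bool(chars & RUSSIAN_ONLY)
    if has_uk != has_ru:
        return "uk" if has_uk else "ru"
    # Either no distinguishing letter, or letters from both alphabets.
    return "ambiguous"


def label_sentence(tokens: list[str]) -> list[tuple[str, str]]:
    """Pair every token with its label, keeping the original order."""
    return [(tok, label_token(tok)) for tok in tokens]


if __name__ == "__main__":
    for tok, lab in label_sentence(["Київ", "это", "мама", "ok"]):
        print(tok, lab)
```

A rule like this decides only the tokens that happen to contain a distinguishing letter; words written identically in both languages stay unresolved, which is why sequence models that use surrounding context are of interest for this task.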
