Study of Language Identification Task on the Token Level for Ukrainian-Russian Code-Mixing Dataset
DOI:
https://doi.org/10.3384/nejlt.2000-1533.2026.5725
Abstract
This paper presents experiments on token-level language identification for a Ukrainian-Russian code-switching dataset. Code-switching, a common phenomenon in multilingual societies, presents significant challenges for natural language processing. The study discusses issues encountered during dataset creation and emphasizes how difficult it is to annotate code-switched text accurately. It describes cases where identifying the language of individual tokens in sentences that switch between Ukrainian and Russian is hard even for human annotators. The close relatedness of the two languages and their shared use of the Cyrillic script complicate the task: many words are spelled identically in both languages, even though they are pronounced differently, because the phonetic differences are not reflected in writing. The study explores different models and libraries for language identification at the token level. Experimental results suggest that BERT performs well, while other models, such as CRFs with n-grams, character-level BiLSTMs, and word-level neural networks, also appear viable for this task. This research contributes to the development of language processing technologies for multilingual contexts, with potential applications in sentiment analysis, information retrieval, and social media monitoring.
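The ambiguity described in the abstract can be illustrated with a minimal sketch. It is not one of the models evaluated in the paper; it is a hypothetical alphabet-based heuristic that relies only on the fact that a few Cyrillic letters occur in just one of the two alphabets. The function names `tag_token` and `tag_sentence` and the label set are assumptions made for illustration. Tokens without any distinctive letter stay `ambiguous`, which is the case where a trained, context-aware model is needed.

```python
from typing import List, Tuple

# Letters found in only one of the two alphabets (orthographic fact).
UKRAINIAN_ONLY = set("іїєґ")
RUSSIAN_ONLY = set("ыэъё")


def is_cyrillic(ch: str) -> bool:
    """True if the character lies in the basic Cyrillic Unicode block."""
    return "\u0400" <= ch <= "\u04ff"


def tag_token(token: str) -> str:
    """Hypothetical heuristic: return 'uk', 'ru', 'ambiguous', or 'other'."""
    letters = {ch for ch in token.lower() if is_cyrillic(ch)}
    if not letters:
        return "other"  # digits, punctuation, Latin-script tokens
    has_uk = bool(letters & UKRAINIAN_ONLY)
    has_ru = bool(letters & RUSSIAN_ONLY)
    if has_uk and not has_ru:
        return "uk"
    if has_ru and not has_uk:
        return "ru"
    # No distinctive letter (or conflicting ones): spelling alone cannot decide.
    return "ambiguous"


def tag_sentence(sentence: str) -> List[Tuple[str, str]]:
    """Tag each whitespace-separated token independently."""
    return [(tok, tag_token(tok)) for tok in sentence.split()]


if __name__ == "__main__":
    for token, label in tag_sentence("Я люблю їсти это"):
        print(token, label)
```

A heuristic like this resolves only tokens that contain a distinctive letter; identically spelled words such as "мама" remain undecided, which is why token-level identification for this language pair needs the surrounding context.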
License
Copyright (c) 2026 Olha Kanishcheva, Maria Shvedova, Liudmyla Dyka, Kristina Husenko

This work is licensed under a Creative Commons Attribution 4.0 International License.