Low-Resource Active Learning of Morphological Segmentation


  • Stig-Arne Grönroos, Department of Signal Processing and Acoustics, Aalto University, Finland
  • Katri Hiovain, Institute of Behavioural Sciences, University of Helsinki, Finland
  • Peter Smit, Department of Signal Processing and Acoustics, Aalto University, Finland
  • Ilona Rauhala, Institute of Behavioural Sciences, University of Helsinki, Finland
  • Kristiina Jokinen, Institute of Behavioural Sciences, University of Helsinki, Finland
  • Mikko Kurimo, Department of Signal Processing and Acoustics, Aalto University, Finland
  • Sami Virpioja, Department of Computer Science, Aalto University, Finland




Many Uralic languages have a rich morphological structure, but lack the morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word-initial and word-final substrings. For North Sámi, we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.
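The abstract does not spell out how the coverage-based query strategy is computed, so the following is only a minimal illustrative sketch of one way such a strategy could work, not the authors' implementation. It assumes that each word is represented by its word-initial and word-final substrings up to a fixed length, and that words are chosen greedily to cover as much not-yet-covered substring mass as possible. The function names `edge_substrings` and `select_by_coverage` and the `max_len` parameter are hypothetical.

```python
from collections import Counter


def edge_substrings(word, max_len=4):
    """Return the word-initial and word-final substrings of length 1..max_len."""
    feats = set()
    for n in range(1, min(max_len, len(word)) + 1):
        feats.add(("initial", word[:n]))
        feats.add(("final", word[-n:]))
    return feats


def select_by_coverage(pool, budget, max_len=4):
    """Greedily pick words whose edge substrings add the most uncovered weight.

    Assumption: a substring's weight is the number of pool tokens that
    carry it at the same word edge. This weighting is illustrative only.
    """
    weights = Counter()
    for w in pool:
        weights.update(edge_substrings(w, max_len))

    covered = set()
    selected = []
    candidates = sorted(set(pool))  # sorted for deterministic tie-breaking

    for _ in range(min(budget, len(candidates))):
        def gain(w):
            # Total weight of this word's substrings not yet covered.
            return sum(weights[f] for f in edge_substrings(w, max_len) - covered)

        best = max(candidates, key=gain)
        selected.append(best)
        covered |= edge_substrings(best, max_len)
        candidates.remove(best)
    return selected
```

In an active learning loop of the kind described above, the selected words would be sent to an annotator and the resulting segmentations added to the annotated data before retraining the semi-supervised model.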


