Utilizing Language Technology in the Documentation of Endangered Uralic Languages

Authors

  • Ciprian Gerstenberger, UiT – The Arctic University of Norway, Giellatekno – Saami Language Technology
  • Niko Partanen, University of Hamburg, Department of Uralic Studies
  • Michael Rießler, University of Freiburg, Department of Scandinavian Studies
  • Joshua Wilbur, University of Freiburg, Department of Scandinavian Studies

DOI:

https://doi.org/10.3384/nejlt.2000-1533.1643

Abstract

The paper describes work in progress by the Pite Saami, Kola Saami and Izhva Komi language documentation projects, all of which record new spoken language data, digitize available recordings and annotate these multimedia data in order to provide comprehensive language corpora as databases for future research on and for endangered – and under-described – Uralic speech communities. Applying language technology in language documentation helps us to create more systematically annotated corpora, rather than eclectic data collections. Specifically, we describe a script providing interactivity between different morphosyntactic analysis modules implemented as Finite State Transducers and ELAN, a Graphical User Interface tool for annotating and presenting multimodal corpora. Ultimately, the spoken corpora created in our projects will be useful for scientifically significant quantitative investigations on these languages.

Published

2016-03-13