Entry Generation by Analogy – Encoding New Words for Morphological Lexicons


Krister Lindén


Language software applications encounter new words, e.g., acronyms, technical terminology, loan words, names, or compounds of such words. To add a new word to a lexicon, we need to indicate its base form and inflectional paradigm. In this article, we evaluate a combination of corpus-based and lexicon-based methods for assigning the base form and inflectional paradigm to new words in Finnish, Swedish and English finite-state transducer lexicons. The methods have been implemented with the open-source Helsinki Finite-State Technology (Lindén & al., 2009). As an entry generator often produces numerous suggestions, it is important that the best suggestions be among the first few; otherwise, it may become more efficient to create the entries by hand. By combining the probabilities calculated from corpus data and from lexical data, we get a more precise combined model. The combined method has 77–81 % precision and 89–97 % recall, i.e., the first correctly generated entry is on average found as the first or second candidate for the test languages. A further study demonstrated that a native speaker could revise suggestions from the entry generator at a speed of 300–400 entries per hour.
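The idea of ranking generated entries by a combined score can be illustrated with a minimal sketch. It assumes, purely for illustration, that the lexicon-based and corpus-based probabilities are combined by multiplication (summing negative log probabilities, as weights are commonly added along paths in weighted finite-state transducers). The abstract does not state the combination rule, so this is an assumption. The `Candidate` type, the paradigm labels and all probability values are invented, and the code does not use the HFST API.

```python
import math
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """A suggested lexicon entry (hypothetical structure)."""
    base_form: str
    paradigm: str     # invented paradigm label
    p_lexicon: float  # invented probability from a lexicon-based model
    p_corpus: float   # invented probability from a corpus-based model


def combined_weight(c: Candidate, floor: float = 1e-9) -> float:
    """Sum of negative log probabilities; a lower weight is better.

    The floor avoids log(0) when one model assigns zero probability.
    """
    return -math.log(max(c.p_lexicon, floor)) - math.log(max(c.p_corpus, floor))


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates so the most probable combined suggestion comes first."""
    return sorted(candidates, key=combined_weight)


# Invented suggestions for one new word form.
candidates = [
    Candidate("blogi", "P1", p_lexicon=0.50, p_corpus=0.10),
    Candidate("blogi", "P2", p_lexicon=0.30, p_corpus=0.60),
    Candidate("blogi", "P3", p_lexicon=0.20, p_corpus=0.30),
]

ranked = rank(candidates)
for position, c in enumerate(ranked, start=1):
    print(position, c.base_form, c.paradigm, round(combined_weight(c), 3))
```

In this toy data the lexicon-based model alone would put `P1` first, while the combined score moves `P2` to the top. This is the kind of reordering that matters when a human reviser only inspects the first few suggestions.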

References



Allauzen, Cyril, Michael Riley, Johan Schalkwyk, Wojciech Skut and Mehryar Mohri. 2007. OpenFst: A General and Efficient Weighted Finite-State Transducer Library. Lecture Notes in Computer Science, pages 11–23.

Barg, Petra, and James Kilbury. 2000. Incremental Identification of Inflectional Types. In Proceedings of the 18th Conference on Computational Linguistics, pages 49–54. Saarbrücken, Germany.

Barg, Petra, and Markus Walther. 1998. Processing unknown words in HPSG. In ACL-36: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 91–95.

Bod, Rens, Jennifer Hay and Stefanie Jannedy (eds.). 2003. Probabilistic Linguistics. MIT Press.

Baldwin, Timothy. 2005. Bootstrapping Deep Lexical Resources: Resources for Courses. In Proceedings of the ACL-SIGLEX Workshop on Deep Lexical Acquisition, Association for Computational Linguistics, pages 67–76.

Baroni, Marco, Johannes Matiasek and Harald Trost. 2002. Unsupervised discovery of morphologically related words based on orthographic and semantic similarity. In Proceedings of the Workshop on Morphological and Phonological Learning, SIGPHONACL, pages 11–20.

Brown, Peter F., Peter V. deSouza, Robert L. Mercer, Vincent J. Della Pietra and Jenifer C. Lai. 1992. Class-based n-gram models of natural language. Computational Linguistics 18:467–479.

Carlson, Lauri. 2005. Inducing a Morphological Transducer from Inflectional Paradigms. In Inquiries into Words, Constraints and Contexts. Festschrift in the Honour of Kimmo Koskenniemi on his 60th Birthday, CSLI Publications, ISSN 1557–5772, Stanford University, pages 18–24.

Claveau, Vincent, and Marie-Claude L'Homme. 2005. Structuring Terminology using Analogy-Based Machine Learning. In Proceedings of the 7th International Conference on Terminology and Knowledge Engineering, TKE 2005, Copenhagen, Denmark, pages 17–18, August.

Copestake, Ann. 1992. The ACQUILEX LKB: Representation Issues in Semi-Automatic Acquisition of Large Lexicons. In Proceedings of the 3rd Conference on Applied Natural Language Processing (ANLP-92), pages 88–96.

Creutz, Mathias, Teemu Hirsimäki, Mikko Kurimo, Antti Puurula, Janne Pylkkönen, Vesa Siivola, Matti Varjokallio, Ebru Arisoy, Murat Saraçlar, and Andreas Stolcke. 2007. Morph-based Speech Recognition and Modeling of Out-of-Vocabulary Words across Languages. In ACM Transactions on Speech and Language Processing, Vol. 5, No. 1, Article 3, 29 pages.

Creutz, Mathias, Krista Lagus, Krister Lindén and Sami Virpioja. 2005. Morfessor and Hutmegs: Unsupervised Morpheme Segmentation for Highly-Inflecting and Compounding Languages. In Proceedings of the Second Baltic Conference on Human Language Technologies, Tallinn, Estonia.

Daelemans, Walter, Jakub Zavrel, Ko van der Sloot and Antal van den Bosch. 2007. TiMBL: Tilburg Memory-Based Learner, version 6.1, Reference Guide. Technical Report–ILK 07-07, Department of Communication and Information Sciences, Tilburg University, 64 pages.

Eddington, David. 2006. Paradigm Uniformity and Analogy: The Capitalistic versus Militaristic Debate. IJES, International Journal of English Studies, 6 (2): 1–18.

Forsberg, Markus, Harald Hammarström and Aarne Ranta. 2006. Morphological Lexicon Extraction from Raw Text Data. FinTAL 2006, LNCS 4139, pages 488–499.

FreeLing 2.1–An Open Source Suite of Language Analyzers (computer file). 2007.
Available at: http://garraf.epsevg.upc.es/freeling/

Gentner, Dedre, Jeffrey Loewenstein and Leigh Thompson. 2004. Analogical Encoding: Facilitating Knowledge Transfer and Integration. In K. Forbus, D. Gentner, & T. Regier (Eds.), Proceedings of the 26th Meeting of the Cognitive Science Society, pages 452–457.

Gold, E. Mark. 1967. Language Identification in the Limit. Information and Control, 10(5):447–474.

Goldsmith, John. 2007. Morphological Analogy: Only a Beginning.
Available at: http://hum.uchicago.edu/~jagoldsm/Papers/analogy.pdf

Goldsmith, John. 2008. Segmentation and morphology. Departments of Linguistics and Computer Science, The University of Chicago. (To appear in The Handbook of Computational Linguistics).
Available at: http://hum.uchicago.edu/~jagoldsm//Papers/segmentation.pdf

HFST – Helsinki Finite-State Technology (computer file). 2008.
Available at: http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/index.shtml

Hoffman, Robert R. 1995. Monster Analogies. Artificial Intelligence Magazine, 16(3):11–35.

Horning, James Jay. 1969. A Study of Grammatical Inference. PhD Thesis. Stanford University.

Itkonen, Esa, and Jussi Haukioja. 1997. A rehabilitation of analogy in syntax (and elsewhere). In András Kertész (ed.), Metalinguistik im Wandel: die cognitive Wende in Wissenschaftstheorie und Linguistik. Frankfurt am Main. Peter Lang, pages 131–17 .

Keuleers, Emmanuel, Dominiek Sandra, Walter Daelemans, Steven Gillis, Gert Durieux and Evelyn Martens. 2007. Dutch Plural Inflection: The Exception That Proves the Analogy. Cognitive Psychology, 54(4), pages 283–318.

Koskenniemi, Kimmo. 1983. Two-Level Morphology: A General Computational Model of Word-Form Recognition and Production. Publications No 11. University of Helsinki, Department of General Linguistics.

Kuenning, Geoff. 2007. Dictionaries for International Ispell.
Available at: http://www.lasr.cs.ucla.edu/geoff/ispell-dictionaries.html

Kurimo, Mikko, Mathias Creutz and Ville Turunen. 2007. Overview of Morpho Challenge in CLEF 2007. Working Notes of the CLEF 2007 Workshop, pages 19–21.

Kurtz, Kenneth J., and Jeffrey Loewenstein. 2007. Converging on a New Role for Analogy in Problem Solving and Retrieval. In Memory & Cognition 35(2):334–341.

Lepage, Yves. 1998. Solving Analogies on Words: An Algorithm. COLING-ACL, pages 728–734.

Lepage, Yves. 2000. Languages of Analogical Strings. In Proceedings of the 18th Conference on Computational Linguistics, vol. 1, pages 488–494. Saarbrücken, Germany.

Lepage, Yves. 2001. Analogy and Formal Languages. In Proceedings of FG/MOL 2001, pages 373–378.

Lepage, Yves, and Etienne Denoual. 2005. Purest ever Example-based Machine Translation: Detailed Presentation and Assessment. In Journal of Machine Translation 19, pages 251–282.

Lingsoft, Inc. 2007. Demos.
Available at: http://www.lingsoft.fi/?doc_id=107&lang=en

Lindén, Krister. 2006. Multilingual Modeling of Cross-lingual Spelling Variants. In Journal of Information Retrieval, vol 9, pages 295–310.

Lindén, Krister. 2008a. A Probabilistic Model for Guessing Base Forms of New Words by Analogy. In CICling-2008, 9th International Conference on Intelligent Text Processing and Computational Linguistics, Haifa, Israel.

Lindén, Krister. 2008b. Assigning an Inflectional Paradigm using the Longest Matching Affix. In Ei mitään ongelmia. Juhlakirja Juhani Reimanille 50-vuotispäiväksi 23.1.2008 Eds. Matti Wiberg and Antti Koura. Turku 2008.

Lindén, Krister. 2009. Guessers for Finite-State Transducer Lexicons. In CICling-2009, 10th International Conference on Intelligent Text Processing and Computational Linguistics, March 1–7, 2009, Mexico City, Mexico.

Lindén, Krister, and Jussi Tuovila. 2009a. Corpus-based Paradigm Selection for Morphological Entries. In Proceedings of NODALIDA 2009, May, Odense, Denmark.

Lindén, Krister, and Jussi Tuovila. 2009b. Corpus-based Lexeme Ranking for Morphological Guessers. In Proceedings of the Workshop on Systems and Frameworks for Computational Morphology 2009. September, Zürich, Switzerland.

Lindén, Krister, Miikka Silfverberg and Tommi Pirinen. 2009. HFST Tools for Morphology–An Efficient Open-Source Package for Construction of Morphological Analyzers. In Proceedings of the Workshop on Systems and Frameworks for Computational Morphology 2009. September, Zürich, Switzerland.

Listenmaa, Inari. 2009. Combining Word Lists: Nykysuomen sanalista, Joukahainen-sanasto and Käänteissanakirja (in Finnish). Bachelor’s Thesis. Department of Linguistics, University of Helsinki.

Loewenstein, Jeffrey, Leigh Thompson and Dedre Gentner. 2003. Analogical Learning in Negotiation Teams: Comparing Cases Promotes Learning and Transfer. Academy of Management Learning and Education, 2(2):119–127.

Lombardy, Sylvain, Yann Régis-Gianas and Jacques Sakarovitch. 2004. Introducing Vaucanson. Theoretical Computer Science, 328(1–2):77–96.

Mikheev, Andrei. 1996. Unsupervised Learning of Word-Category Guessing Rules. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics (ACL-96), pages 327–334.

Mikheev, Andrei. 1997. Automatic Rule Induction for Unknown-Word Guessing. In Computational Linguistics, 23(3):405–423.

Moreau, Fabienne, Vincent Claveau and Pascale Sebillot. 2007. Automatic Morphological Query Expansion Using Analogy-Based Machine Learning. Advances in Information Retrieval, Lecture Notes in Computer Science, pages 222–233.

Nykysuomen sanalista (computer file). 2007.
Available at: http://kaino.kotus.fi/sanat/nykysuomi/

Oflazer, Kemal, Sergei Nirenburg and Marjorie McShane. 2001. Bootstrapping Morphological Analyzers by Combining Human Elicitation and Machine Learning. In Computational Linguistics 27(1):59–85.

Pirinen, Tommi. 2008. Open Source Morphology for Finnish using Finite-State Methods (in Finnish). Master’s Thesis. Department of Linguistics, University of Helsinki.

Skousen, Royal. 1989. Analogical modeling of language. Dordrecht: Kluwer.

Skousen, Royal. 2003. Analogical Modeling: Exemplars, Rules, and Quantum Computing. Presented at the Berkeley Linguistics Society Conference.

Stroppa, Nicolas, and François Yvon. 2005. An Analogical Learner for Morphological Analysis. In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL), pages 120–127, Ann Arbor, June.

Stroppa, Nicolas, and François Yvon. 2006. Formal models of analogical proportions. Technical Report D008. Télécom Paris D, ISSN: 0751–1345, Telecom ParisTech – École Nationale Supérieure de Télécommunications.

Turney, Peter. 2008. A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), August, Manchester, UK, pages 905–912.

Westerberg, Tom. 2008. Den stora svenska ordlistan (computer file).
Available at: http://www.dsso.se/

Wicentowski, Richard. 2002. Modeling and Learning Multilingual Inflectional Morphology in a Minimally Supervised Framework. PhD Thesis. Baltimore, USA.

Wicentowski, Richard. 2004. Multilingual Noise-Robust Supervised Morphological Analysis using the WordFrame Model. In Proceedings of the Seventh Meeting of the ACL Special Interest Group in Computational Phonology, ACL, pages 70–77.

Yarowsky, David, and Richard Wicentowski. 2000. Minimally Supervised Morphological Analysis by Multimodal Alignment. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics.

Yarowsky, David, Grace Ngai and Richard Wicentowski. 2001. Inducing Multi-lingual Text Analysis Tools via Robust Projection across Aligned Corpora. In HLT '01: Proceedings of the first international conference on Human language technology research. Association for Computational Linguistics, pages 1–8.

Yvon, François. 2003. Finite-State Transducers Solving Analogies on Words, Technical Report D008. In Télécom Paris D, ISSN: 0751–1345. TELECOM ParisTech.