Named Entity Recognition in Bengali: A Multi-Engine Approach

Asif Ekbal
Sivaji Bandyopadhyay

Abstract

This paper reports a multi-engine approach to developing a Named Entity Recognition (NER) system for Bengali that combines Maximum Entropy (ME), Conditional Random Field (CRF) and Support Vector Machine (SVM) classifiers through weighted voting. The training set consists of approximately 272K wordforms, of which 150K have been manually annotated with the four major named entity (NE) tags: Person name, Location name, Organization name and Miscellaneous name. A tag conversion routine has been defined to convert the 122K wordforms of the IJCNLP-08 NER Shared Task on South and South East Asian Languages (NERSSEAL) data into the desired form. Each classifier uses different contextual information about the words, along with a variety of features that help predict the NE classes. Lexical context patterns, generated semi-automatically from an unlabeled corpus of 3 million wordforms, are used as classifier features to improve performance. In addition, we propose a number of techniques for post-processing the output of each classifier to reduce errors and improve performance further. Finally, we use three weighted voting techniques to combine the individual models. Experimental results show the effectiveness of the proposed multi-engine approach, with overall Recall, Precision and F-Score values of 93.98%, 90.63% and 92.28%, respectively. This is an improvement of 14.92% in F-Score over the best-performing baseline, the SVM-based system, and of 18.36% in F-Score over the least-performing baseline, the ME-based system. Comparative evaluation also shows that the proposed system outperforms three other existing Bengali NER systems.
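The abstract does not say how the three weighted voting techniques are defined, so the following is only a minimal sketch of the general idea of weighted voting over per-token tags, not the authors' method. The function name `weighted_vote`, the tag labels (`PER`, `LOC`, `ORG`, `O`) and the weights are all assumptions chosen for illustration; in practice a weight might be derived from each classifier's performance on held-out data.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-token tags from several classifiers by weighted voting.

    predictions: dict mapping classifier name -> list of tags, one per token
    weights:     dict mapping classifier name -> non-negative vote weight
    (Both structures are hypothetical; they are not taken from the paper.)
    """
    lengths = {len(tags) for tags in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all classifiers must tag the same number of tokens")
    n_tokens = lengths.pop()

    combined = []
    for i in range(n_tokens):
        # Sum the weights of the classifiers that voted for each tag.
        scores = defaultdict(float)
        for name, tags in predictions.items():
            scores[tags[i]] += weights[name]
        # Pick the tag with the largest total weight; on a tie, max() keeps
        # the tag that was voted for first in dictionary order.
        combined.append(max(scores, key=lambda tag: scores[tag]))
    return combined


# Illustrative outputs for a three-token sentence (made-up tags and weights).
predictions = {
    "ME": ["PER", "O", "LOC"],
    "CRF": ["PER", "O", "ORG"],
    "SVM": ["O", "O", "ORG"],
}
weights = {"ME": 1.0, "CRF": 2.0, "SVM": 2.5}

print(weighted_vote(predictions, weights))
```

With these illustrative weights, the first token is tagged `PER` because ME and CRF together (3.0) outweigh SVM (2.5), while the third token is tagged `ORG` because CRF and SVM together (4.5) outweigh ME (1.0).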

