SUC-CORE: A Balanced Corpus Annotated with Noun Phrase Coreference

Kristina Nilsson Björkenstam

Abstract

This paper describes SUC-CORE, a subset of the Stockholm Umeå Corpus and the Swedish Treebank annotated with noun phrase coreference. While most coreference-annotated corpora consist of texts of similar types within related domains, SUC-CORE consists of both informative and imaginative prose and covers a wide range of literary genres and domains. This allows for exploration of coreference across different text types, but it also means that there are limited amounts of data within each type. Future work on coreference resolution for Swedish should include making more annotated data available to the research community.
