https://nejlt.ep.liu.se/issue/feed Northern European Journal of Language Technology 2024-03-14T17:17:56+01:00 Northern European Journal of Language Technology nejlt@nejlt.org Open Journal Systems <p>NEJLT is a global journal publishing peer-reviewed research on natural language processing and computational linguistics <strong>for all languages</strong>.</p> https://nejlt.ep.liu.se/article/view/5249 DANSK: Domain Generalization of Danish Named Entity Recognition 2024-03-11T13:54:31+01:00 Kenneth Enevoldsen kenneth.enevoldsen@cas.au.dk Emil Trenckner Jessen emil.tj@hotmail.com Rebekah Baglini rbkh@cc.au.dk <p>Named entity recognition is an important application within Danish NLP, essential within both industry and research. However, Danish NER is inhibited by a lack of coverage across domains and entity types. As a consequence, no current models are capable of fine-grained named entity recognition, nor have they been evaluated for potential generalizability issues across datasets and domains. To alleviate these limitations, this paper introduces: 1) DANSK, a named entity dataset providing high-granularity tagging as well as within-domain evaluation of models across a diverse set of domains; 2) three generalizable models with fine-grained annotation, available in DaCy 2.6.0; and 3) an evaluation of current state-of-the-art models’ ability to generalize across domains. The evaluation of existing and new models revealed notable performance discrepancies across domains, which should be addressed within the field. Shortcomings of the annotation quality of the dataset and its impact on model training and evaluation are also discussed. 
Despite these limitations, we advocate for the use of the new dataset DANSK alongside further work on generalizability within Danish NER.</p> 2024-07-23T00:00:00+02:00 Copyright (c) 2024 Kenneth Enevoldsen, Emil Trenckner Jessen, Rebekah Baglini https://nejlt.ep.liu.se/article/view/5217 Documenting Geographically and Contextually Diverse Language Data Sources 2024-02-15T06:51:57+01:00 Angelina McMillan-Major aymm@uw.edu Francesco De Toni francesco.detoni@uwa.edu.au Zaid Alyafeai alyafey22@gmail.com Stella Biderman stellabiderman@gmail.com Kimbo Chen chentenghung@gmail.com Gérard Dupont ger.dupont@gmail.com Hady Elsahar hadyelsahar@gmail.com Chris Emezue chris.emezue@gmail.com Alham Fikri Aji alham.fikri@mbzuai.ac.ae Suzana Ilić suzana@mltokyo.ai Nurulaqilla Khamis nurulaqilla@utm.my Colin Leong cleong1@udayton.edu Maraim Masoud maraim.elbadri@gmail.com Aitor Soroa a.soroa@ehu.eus Pedro Ortiz Suarez pedro.ortiz@uni-mannheim.de Daniel van Strien daniel.van-strien@bl.uk Zeerak Talat z@zeerak.org Yacine Jernite yacine@huggingface.co <p><span class="fontstyle0">Contemporary large-scale data collection efforts have prioritized the amount of data collected to improve large language models (LLMs). This quantitative approach has resulted in concerns for the rights of data subjects represented in data collections. This concern is exacerbated by a lack of documentation and analysis tools, making it difficult to interrogate these collections. Mindful of these pitfalls, we present a methodology for documentation-first, human-centered data collection. We apply this approach in an effort to train a multilingual LLM. We identify a geographically diverse set of target language groups (Arabic varieties, Basque, Chinese varieties, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. 
We structure this effort by developing an online catalogue in English as a tool for gathering metadata through public hackathons. We present our tool and analyses of the resulting resource metadata, including distributions over languages, regions, and resource types, and discuss our lessons learned.</span> </p> 2024-09-12T00:00:00+02:00 Copyright (c) 2024 Angelina McMillan-Major, Francesco De Toni, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Daniel van Strien, Zeerak Talat, Yacine Jernite https://nejlt.ep.liu.se/article/view/5203 Understanding Counterspeech for Online Harm Mitigation 2024-02-05T11:22:54+01:00 Yi-Ling Chung ychung@turing.ac.uk Gavin Abercrombie g.abercrombie@hw.ac.uk Florence Enock fenock@turing.ac.uk Jonathan Bright jbright@turing.ac.uk Verena Rieser v.t.rieser@hw.ac.uk <p>Counterspeech offers direct rebuttals to hateful speech by challenging perpetrators of hate and showing support to targets of abuse. It provides a promising alternative to more contentious measures, such as content moderation and deplatforming, by contributing a greater amount of positive online speech rather than attempting to mitigate harmful content through removal. Advances in the development of large language models mean that the process of producing counterspeech could be made more efficient by automating its generation, which would enable large-scale online campaigns. However, we currently lack a systematic understanding of several important factors relating to the efficacy of counterspeech for hate mitigation, such as which types of counterspeech are most effective, what the optimal conditions for implementation are, and which specific effects of hate it can best ameliorate. 
This paper aims to fill this gap by systematically reviewing counterspeech research in the social sciences and comparing methodologies and findings with natural language processing (NLP) and computer science efforts in automatic counterspeech generation. By taking this multi-disciplinary view, we identify promising future directions in both fields.</p> 2024-09-04T00:00:00+02:00 Copyright (c) 2024 Yi-Ling Chung, Gavin Abercrombie, Florence Enock, Jonathan Bright, Verena Rieser https://nejlt.ep.liu.se/article/view/5000 On Using Self-Report Studies to Analyze Language Models 2023-11-06T12:03:59+01:00 Matúš Pikuliak matus.pikuliak@gmail.com <p>We are at a curious point in time where our ability to build language models (LMs) has outpaced our ability to analyze them. We do not really know how to reliably determine their capabilities, biases, dangers, knowledge, and so on. The benchmarks we have are often overly specific, do not generalize well, and are susceptible to data leakage. Recently, I have noticed a trend of using self-report studies, such as various polls and questionnaires originally designed for humans, to analyze the properties of LMs. I think that this approach can easily lead to false results, which can be quite dangerous considering the current discussions on AI safety, governance, and regulation. To illustrate my point, I will delve deeper into several papers that employ self-report methodologies and I will try to highlight some of their weaknesses.</p> 2024-09-18T00:00:00+02:00 Copyright (c) 2024 Matúš Pikuliak https://nejlt.ep.liu.se/article/view/4939 QUA-RC: the semi-synthetic dataset of multiple choice questions for assessing reading comprehension in Ukrainian 2023-09-13T11:56:27+02:00 Mariia Zyrianova mariiaz@kth.se Dmytro Kalpakchi dmytroka@kth.se <p>In this article we present the first dataset of multiple choice questions for assessing reading comprehension in Ukrainian. 
The dataset is based on texts from the Ukrainian national tests for reading comprehension, and the MCQs themselves were created semi-automatically in three stages. In the first stage, GPT-3 was used to generate MCQs zero-shot; in the second stage, MCQs of sufficient quality were selected and those with minor errors revised; in the final stage, the dataset was expanded with manually written MCQs. The dataset was created by native speakers of Ukrainian, one of whom is also a language teacher. The resulting corpus has slightly more than 900 MCQs, of which only 43 could be kept exactly as generated by GPT-3.</p> 2023-11-16T00:00:00+01:00 Copyright (c) 2023 Mariia Zyrianova, Dmytro Kalpakchi https://nejlt.ep.liu.se/article/view/4932 Efficient Structured Prediction with Transformer Encoders 2024-02-22T13:05:38+01:00 Ali Basirat alib@hum.ku.dk <p>Finetuning is a useful method for adapting Transformer-based text encoders to new tasks but can be computationally expensive for structured prediction tasks that require tuning at the token level. Furthermore, finetuning is inherently inefficient in updating all base model parameters, which prevents parameter sharing across tasks. To address these issues, we propose a method for efficient task adaptation of frozen Transformer encoders based on the local contribution of their intermediate layers to token representations. Our adapter uses a novel attention mechanism to aggregate intermediate layers and tailor the resulting representations to a target task. Experiments on several structured prediction tasks demonstrate that our method outperforms previous approaches, retaining over 99% of the finetuning performance at a fraction of the training cost. 
Our proposed method offers an efficient solution for adapting frozen Transformer encoders to new tasks, improving performance and enabling parameter sharing across different tasks.</p> 2024-03-14T00:00:00+01:00 Copyright (c) 2024 Ali Basirat https://nejlt.ep.liu.se/article/view/4884 Resource papers as registered reports: a proposal 2023-06-28T22:25:06+02:00 Emiel van Miltenburg C.W.J.vanMiltenburg@tilburguniversity.edu <p>This is a proposal for publishing resource papers as registered reports in the Northern European Journal of Language Technology. The idea is that authors write a data collection plan with a full data statement, to the extent that it can be written before data collection starts. Once the proposal is approved, publication of the final resource paper is guaranteed, as long as the data collection plan is followed (modulo reasonable changes due to unforeseen circumstances). This proposal changes the reviewing process from an antagonistic to a collaborative enterprise, and hopefully encourages NLP researchers to develop and publish more high-quality datasets. The key advantage of this proposal is that it helps to promote <em>responsible resource development</em> (through constructive peer review) and to avoid <em>research waste</em>.</p> 2023-07-13T00:00:00+02:00 Copyright (c) 2023 Emiel van Miltenburg https://nejlt.ep.liu.se/article/view/4855 Unsupervised Text Embedding Space Generation Using Generative Adversarial Networks for Text Synthesis 2023-08-21T19:01:48+02:00 Jun-Min Lee ljm56897@gmail.com Tae-Bin Ha taebinalive@gmail.com <div>The Generative Adversarial Network (GAN) is a model for data synthesis, which creates plausible data through competition between a generator and a discriminator. Although the application of GANs to image synthesis has been studied extensively, GANs have inherent limitations when applied to natural language generation. 
Because natural language is composed of discrete tokens, a generator has difficulty updating its gradient through backpropagation; therefore, most text-GAN studies generate sentences starting with a random token based on a reward system. Thus, the generators of previous studies are pre-trained in an autoregressive way before adversarial training, causing data memorization, in which synthesized sentences reproduce the training data. In this paper, we synthesize sentences using a framework similar to the original GAN. More specifically, we propose Text Embedding Space Generative Adversarial Networks (TESGAN), which generate continuous text embedding spaces instead of discrete tokens to solve the gradient backpropagation problem. Furthermore, TESGAN conducts unsupervised learning, which does not directly refer to the text of the training data, to overcome the data memorization issue. By adopting this novel method, TESGAN can synthesize new sentences, showing the potential of unsupervised learning for text synthesis. 
We expect to see extended research combining large language models with this new perspective of viewing text as a continuous space.</div> 2023-10-24T00:00:00+02:00 Copyright (c) 2023 Jun-Min Lee, Tae-Bin Ha https://nejlt.ep.liu.se/article/view/4725 NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation 2023-03-08T19:39:50+01:00 Kaustubh Dhole kdhole@emory.edu Varun Gangal vgangal@cs.cmu.edu Sebastian Gehrmann gehrmann@google.com Aadesh Gupta aadesh.gupta@ipsoft.com Zhenhao Li zhenhao.li18@imperial.ac.uk Saad Mahamood saad.mahamood@trivago.com Abinaya Mahadiran abinaya.m02@mphasis.com Simon Mille simon.mille@upf.edu Ashish Shrivastava ashish3586@gmail.com Samson Tan samson.tan@salesforce.com Tongshang Wu sherryw@cs.cmu.edu Jascha Sohl-Dickstein jaschasd@google.com Jinho Choi Jinho.Choi@emory.edu Eduard Hovy hovy@cmu.edu Ondřej Dušek odusek@ufal.mff.cuni.cz Sebastian Ruder sebastian@ruder.io Sajant Anand sajant@berkeley.edu Nagender Aneja naneja@gmail.com Rabin Banjade rbnjade1@memphis.edu Lisa Barthe lisa.barthe@inetum.com Hanna Behnke hanna.behnke20@imperial.ac.uk Ian Berlot-Attwell ianberlot@cs.toronto.edu Connor Boyle connor.bo@gmail.com Caroline Brun caroline.brun@naverlabs.com Marco Antonio Sobrevilla Cabezudo msobrevillac@usp.br Samuel Cahyawijaya scahyawijaya@connect.ust.hk Emile Chapuis chapuis.emile@gmail.com Wanxiang Che fuxuanwei@ir.hit.edu.cn Mukund Choudhary mukund.choudhary@research.iiit.ac.in Christian Clauss cclauss@me.com Pierre Colombo colombo.pierre@gmail.com Filip Cornell c.filip.cornell@gmail.com Gautier Dagan gautierdagan@gmail.com Mayukh Das mayukh.das@tu-bs.de Tanay Dixit dixittanay@gmail.com Thomas Dopierre Thomas.Dopierre@univ-st-etienne.fr Paul-Alexis Dray paul.alexis.dray@gmail.com Suchitra Dubey suchitra27288@gmail.com Tatiana Ekeinhor tatiana.ekeinhor@vadesecure.com Marco Di Giovanni marco.digiovanni@polimi.it Tanya Goyal tanyagoyal@utexas.edu Rishabh Gupta rishabh19089@iiitd.ac.in Louanes Hamla 
louanes.hamla@inetum.com Sang Han sanghan@protonmail.com Fabrice Harel-Canada fabricehc@cs.ucla.edu Antoine Honoré antoine.honore@vadesecure.com Ishan Jindal ishan.jindal@ibm.com Przemysław Joniak joniak@g.ecc.u-tokyo.ac.jp Denis Kleyko denis.kleyko@gmail.com Venelin Kovatchev vkovatchev@ub.edu Kalpesh Krishna kalpesh@cs.umass.edu Ashutosh Kumar ashutosh@iisc.ac.in Stefan Langer langer.stefan@siemens.com Seungjae Ryan Lee seungjaeryanlee@gmail.com Corey James Levinson thecoreylevinson@gmail.com Hualou Liang hualou.liang@drexel.edu Kaizhao Liang kl2@illinois.edu Zhexiong Liu zhexiong@cs.pitt.edu Andrey Lukyanenko and-lukyane@yandex.ru Vukosi Marivate vukosi.marivate@cs.up.ac.za Gerard de Melo demelo@uni-potsdam.de Simon Meoni simonmeoni@aol.com Maxine Meyer maxime.meyer@vadesecure.com Afnan Mir afnanmir@utexas.edu Nafise Sadat Moosavi N.S.Moosavi@sheffield.ac.uk Niklas Meunnighoff muennighoff@stu.pku.edu.cn Timothy Sum Hon Mun timothy22000@gmail.com Kenton Murray kenton@jhu.edu Marcin Namysl Marcin.Namysl@iais.fraunhofer.de Maria Obedkova maryobedkova@gmail.com Priti Oli poli@memphis.edu Nivranshu Pasricha pasricha@protonmail.com Jan Pfister pfister@informatik.uni-wuerzburg.de Richard Plant r.plant@napier.ac.uk Vinay Prabhu vinay@unify.id Vasile Pais vasile@racai.ro Libo Qin lbqin@ir.hit.edu.cn Shahab Raji shahab.raji@rutgers.edu Pawan Kumar Rajpoot pawan.rajpoot2411@gmail.com Vikas Raunak viraunak@microsoft.com Roy Rinberg royrinberg@gmail.com Nicholas Roberts nick11roberts@cs.wisc.edu Juan Diego Rodriguez juand-r@utexas.edu Claude Roux claude.roux@naverlabs.com Vasconcellos Samus phsamus@gmail.com Ananya Sai ananya@cse.iitm.ac.in Robin Schmidt rob.schmidt@student.uni-tuebingen.de Thomas Scialom t.scialom@gmail.com Tshephisho Sefara sefaratj@gmail.com Saqib Shamsi shamsi.saqib@gmail.com Xudong Shen xudong.shen@u.nus.edu Yiwen Shi yiwen.shi@drexel.edu Haoyue Shi freda@ttic.edu Anna Shvets anna.shvets@inetum.com Nick Siegel nsiegel@arlut.utexas.edu Damien Sileo 
damien.sileo@kuleuven.be Jamie Simon james.simon@berkeley.edu Chandan Singh chandan_singh@berkeley.edu Roman Sitelew sitelewr@gmail.com Priyank Soni priyanksonigeca7@gmail.com Taylor Sorensen tsor1313@gmail.com William Soto williamsotomartinez@gmail.com Aman Srivastava amanit0812@gmail.com Aditya Srivatsa k.v.aditya@research.iiit.ac.in Tony Sun thetonysun@gmail.com Mukund Varma mukundvarmat@gmail.com A Tabassum atabassum.bee15seecs@seecs.edu.pk Fiona Tan tan.f@u.nus.edu Ryan Teehan rsteehan@gmail.com Mo Tiwari motiwari@stanford.edu Marie Tolkiehn marie.tolkiehn@desy.de Athena Wang wangathena68@yahoo.com Zijian Wang zijwang@hotmail.com Zijie Wang jayw@gatech.edu Gloria Wang gwang1@imsa.edu Fuxuan Wei fuxuanwei@ir.hit.edu.cn Bryan Wilie bryanwilie92@gmail.com Genta Indra Winata giwinata@connect.ust.hk Xinyu Wu xinyiwu.nlp@gmail.com Witold Wydmanski witold.wydmanski@uj.edu.pl Tianbao Xie tianbaoxiexxx@gmail.com Usama Yaseen usama.yaseen@siemens.com Michael Yee mayee@engin.umich.edu Jing Zhang jing.zhang2@emory.edu Yue Zhang yue.zhang@wias.org.cn <div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>Data augmentation is an important method for evaluating the robustness of and enhancing the diversity of training data for natural language processing (NLP) models. In this paper, we present NL-Augmenter, a new participatory Python-based natural language (NL) augmentation framework which supports the creation of transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of NL tasks annotated with noisy descriptive tags. The transformations incorporate noise, intentional and accidental human mistakes, socio-linguistic variation, semantically-valid style, syntax changes, as well as artificial constructs that are unambiguous to humans. 
We demonstrate the efficacy of NL-Augmenter by using its transformations to analyze the robustness of popular language models. We find different models to be differently challenged on different tasks, with quasi-systematic score decreases. The infrastructure, datacards, and robustness evaluation results are publicly available on GitHub for the benefit of researchers working on paraphrase generation, robustness analysis, and low-resource NLP.</p> </div> </div> </div> 2023-04-08T00:00:00+02:00 Copyright (c) 2023 Kaustubh Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahadiran, Simon Mille, Ashish Shrivastava, Samson Tan, Tongshang Wu, Jascha Sohl-Dickstein, Jinho D. Choi, Eduard Hovy, Ondřej Dušek, Sebastian Ruder, Sajant Anand, Nagender Aneja, Rabin Banjade, Lisa Barthe, Hanna Behnke, Ian Berlot-Attwell, Connor Boyle, Caroline Brun, Marco Antonio Sobrevilla Cabezudo, Samuel Cahyawijaya, Emile Chapuis, Wanxiang Che, Mukund Choudhary, Christian Clauss, Pierre Colombo, Filip Cornell, Gautier Dagan, Mayukh Das, Tanay Dixit, Thomas Dopierre, Paul-Alexis Dray, Suchitra Dubey, Tatiana Ekeinhor, Marco Di Giovanni, Tanya Goyal, Rishabh Gupta, Louanes Hamla, Sang Han, Fabrice Harel-Canada, Antoine Honoré, Ishan Jindal, Przemysław K. Joniak, Denis Kleyko, Venelin Kovatchev, Kalpesh Krishna, Ashutosh Kumar, Stefan Langer, Seungjae Ryan Lee, Corey James Levinson, Hualou Liang, Kaizhao Liang, Zhexiong Liu, Andrey Lukyanenko, Vukosi Marivate, Gerard de Melo, Simon Meoni, Maxine Meyer, Afnan Mir, Nafise Sadat Moosavi, Niklas Meunnighoff, Timothy Sum Hon Mun, Kenton Murray, Marcin Namysl, Maria Obedkova, Priti Oli, Nivranshu Pasricha, Jan Pfister, Richard Plant, Vinay Prabhu, Vasile Pais, Libo Qin, Shahab Raji, Pawan Kumar Rajpoot, Vikas Raunak, Roy Rinberg, Nicholas Roberts, Juan Diego Rodriguez, Claude Roux, Vasconcellos P. H. S., Ananya B. Sai, Robin M. Schmidt, Thomas Scialom, Tshephisho Sefara, Saqib N. 
Shamsi, Xudong Shen, Yiwen Shi, Haoyue Shi, Anna Shvets, Nick Siegel, Damien Sileo, Jamie Simon, Chandan Singh, Roman Sitelew, Priyank Soni, Taylor Sorensen, William Soto, Aman Srivastava, KV Aditya Srivatsa, Tony Sun, Mukund Varma T, A Tabassum, Fiona Anting Tan, Ryan Teehan, Mo Tiwari, Marie Tolkiehn, Athena Wang, Zijian Wang, Zijie J. Wang, Gloria Wang, Fuxuan Wei, Bryan Wilie, Genta Indra Winata, Xinyu Wu, Witold Wydmanski, Tianbao Xie, Usama Yaseen, Michael A. Yee, Jing Zhang, Yue Zhang https://nejlt.ep.liu.se/article/view/4617 Foreword to NEJLT Volume 8, 2022 2023-01-12T06:13:18+01:00 Leon Derczynski leon@nejlt.org <p>An introduction to the Northern European Journal of Language Technology in 2022</p> 2022-12-31T00:00:00+01:00 Copyright (c) 2022 Leon Derczynski https://nejlt.ep.liu.se/article/view/4561 Prevention or Promotion? Predicting Author's Regulatory Focus 2023-08-14T20:13:56+02:00 Aswathy Velutharambath aswathy.velutharambath@100worte.de Kai Sassenberg k.sassenberg@iwm-tuebingen.de Roman Klinger roman.klinger@ims.uni-stuttgart.de <div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>People differ fundamentally in what motivates them to pursue a goal and how they approach it. For instance, some people seek growth and show eagerness, whereas others prefer security and are vigilant. The concept of regulatory focus is employed in psychology, to explain and predict this goal-directed behavior of humans underpinned by two unique motivational systems – the promotion and the prevention system. Traditionally, text analysis methods using closed-vocabularies are employed to assess the distinctive linguistic patterns associated with the two systems. From an NLP perspective, automatically detecting the regulatory focus of individuals from text provides valuable insights into the behavioral inclinations of the author, finding its applications in areas like marketing or health communication. 
However, the concept has never made an impactful debut in computational linguistics research. To bridge this gap, we introduce the novel task of regulatory focus classification from text and present two complementary German datasets – (1) experimentally generated event descriptions and (2) manually annotated short social media texts used for evaluating the generalizability of models on real-world data. First, we conduct a correlation analysis to verify whether, and to what extent, the linguistic footprints of regulatory focus reported in psychology studies are observable in our datasets. For automatic classification, we compare closed-vocabulary-based analyses with a state-of-the-art BERT-based text classification model and observe that the latter outperforms lexicon-based approaches on experimental data and is notably better on out-of-domain Twitter data.</p> </div> </div> </div> 2023-09-15T00:00:00+02:00 Copyright (c) 2023 Aswathy Velutharambath, Kai Sassenberg, Roman Klinger https://nejlt.ep.liu.se/article/view/4529 Barriers and enabling factors for error analysis in NLG research 2022-11-23T22:03:01+01:00 Emiel van Miltenburg C.W.J.vanMiltenburg@tilburguniversity.edu Miruna Clinciu miruna.clinciu@gmail.com Ondřej Dušek odusek@ufal.mff.cuni.cz Dimitra Gkatzia D.Gkatzia@napier.ac.uk Stephanie Inglis stephanie.inglis@arria.com Leo Leppänen leo.leppanen@helsinki.fi Saad Mahamood Saad.Mahamood@trivago.com Stephanie Schoch sns2gr@virginia.edu Craig Thomson c.thomson@abdn.ac.uk Luou Wen luouwen97@gmail.com <p>Earlier research has shown that few studies in Natural Language Generation (NLG) evaluate their system outputs using an error analysis, despite known limitations of automatic evaluation metrics and human ratings. This position paper takes the stance that error analyses should be encouraged, and discusses several ways to do so. This paper is based on our shared experience as authors as well as a survey we distributed as a means of public consultation. 
We provide an overview of existing barriers to carrying out error analyses, and propose changes to improve error reporting in the NLG literature.</p> 2023-02-21T00:00:00+01:00 Copyright (c) 2023 Emiel van Miltenburg, Miruna Clinciu, Ondřej Dušek, Dimitra Gkatzia, Stephanie Inglis, Leo Leppänen, Saad Mahamood, Stephanie Schoch, Craig Thomson, Luou Wen https://nejlt.ep.liu.se/article/view/4453 PARSEME Meets Universal Dependencies: Getting on the Same Page in Representing Multiword Expressions 2022-12-01T02:42:43+01:00 Agata Savary agata.savary@universite-paris-saclay.fr Sara Stymne sara.stymne@lingfil.uu.se Verginica Barbu Mititelu vergi@racai.ro Nathan Schneider Nathan.Schneider@georgetown.edu Carlos Ramisch carlos.ramisch@lis-lab.fr Joakim Nivre joakim.nivre@lingfil.uu.se <p>Multiword expressions (MWEs) are challenging and pervasive phenomena whose idiosyncratic properties show notably at the levels of lexicon, morphology, and syntax. Thus, they should best be annotated jointly with morphosyntax. We discuss two multilingual initiatives, Universal Dependencies and PARSEME, addressing these annotation layers in cross-lingually unified ways. We compare the annotation principles of these initiatives with respect to MWEs, and we put forward a roadmap towards their gradual unification. 
The expected outcomes are more consistent treebanking and higher universality in modeling idiosyncrasy.</p> 2023-02-21T00:00:00+01:00 Copyright (c) 2023 Agata Savary, Sara Stymne, Verginica Barbu Mititelu, Nathan Schneider, Carlos Ramisch, Joakim Nivre https://nejlt.ep.liu.se/article/view/4462 Spanish Abstract Meaning Representation: Annotation of a General Corpus 2022-11-09T20:28:20+01:00 Shira Wein sbmw15@gmail.com Lucia Donatelli donatelli@coli.uni-saarland.de Ethan Ricker ear131@georgetown.edu Calvin Engstrom cle41@georgetown.edu Alex Nelson amn106@georgetown.edu Leonie Harter leonie-harter@web.de Nathan Schneider nathan.schneider@georgetown.edu <div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>Abstract Meaning Representation (AMR), originally designed for English, has been adapted to a number of languages to facilitate cross-lingual semantic representation and analysis. We build on previous work and present the first sizable, general annotation project for Spanish AMR. We release a detailed set of annotation guidelines and a corpus of 486 gold-annotated sentences spanning multiple genres from an existing, cross-lingual AMR corpus. Our work constitutes the second largest non-English gold AMR corpus to date. 
Fine-tuning an AMR-to-Spanish generation model with our annotations results in a BERTScore improvement of 8.8%, demonstrating the initial utility of our work.</p> </div> </div> </div> 2022-11-23T00:00:00+01:00 Copyright (c) 2022 Shira Wein, Lucia Donatelli, Ethan Ricker, Calvin Engstrom, Alex Nelson, Leonie Harter, Nathan Schneider https://nejlt.ep.liu.se/article/view/4438 Task-dependent Optimal Weight Combinations for Static Embeddings 2022-08-12T11:05:47+02:00 Nathaniel Robinson nrrobins@cs.cmu.edu Nathaniel Carlson natec18@byu.edu David Mortensen dmortens@cs.cmu.edu Elizabeth Vargas elizag17@byu.edu Thomas Fackrell tfac1997@byu.edu Nancy Fulda nfulda@cs.byu.edu <p>A variety of NLP applications use word2vec skip-gram, GloVe, and fastText word embeddings. These models learn two sets of embedding vectors, but most practitioners use only one of them, or, alternatively, an unweighted sum of both. This is the first study to systematically explore a range of linear combinations between the first and second embedding sets. We evaluate these combinations on a set of six NLP benchmarks including IR, POS-tagging, and sentence similarity. We show that the default embedding combinations are often suboptimal and demonstrate 1.0-8.0% improvements. Notably, GloVe’s default unweighted sum is its least effective combination across tasks. We provide a theoretical basis for weighting one set of embeddings more than the other according to the algorithm and task. 
We apply our findings to improve accuracy in applications of cross-lingual alignment and navigational knowledge by up to 15.2%.</p> 2022-11-14T00:00:00+01:00 Copyright (c) 2022 Nate Robinson, Nate Carlson, David Mortensen, Elizabeth Vargas, Thomas Fackrell, Nancy Fulda https://nejlt.ep.liu.se/article/view/4396 An Empirical Configuration Study of a Common Document Clustering Pipeline 2023-04-04T22:44:40+02:00 Anton Eklund anton.eklund@cs.umu.se Mona Forsman mona.forsman@adlede.com Frank Drewes drewes@cs.umu.se <div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>Document clustering is frequently used in applications of natural language processing, e.g., to classify news articles or to create topic models. In this paper, we study document clustering with the common clustering pipeline that includes vectorization with BERT or Doc2Vec, dimension reduction with PCA or UMAP, and clustering with K-Means or HDBSCAN. We discuss the interactions of the different components in the pipeline, parameter settings, and how to determine an appropriate number of dimensions. The results suggest that BERT embeddings combined with UMAP dimension reduction to no less than 15 dimensions provide a good basis for clustering, regardless of the specific clustering algorithm used. Moreover, while UMAP performed better than PCA in our experiments, tuning the UMAP settings showed little impact on the overall performance. Hence, we recommend configuring UMAP so as to optimize its time efficiency. According to our topic model evaluation, the combination of BERT and UMAP, also used in BERTopic, performs best. 
A topic model based on this pipeline typically benefits from a large number of clusters.</p> </div> </div> </div> 2023-09-15T00:00:00+02:00 Copyright (c) 2023 Anton Eklund, Mona Forsman, Frank Drewes https://nejlt.ep.liu.se/article/view/4361 On the Relationship between Frames and Emotionality in Text 2023-06-21T11:29:05+02:00 Enrica Troiano enrica.troiano@ims.uni-stuttgart.de Roman Klinger roman.klinger@ims.uni-stuttgart.de Sebastian Padó pado@ims.uni-stuttgart.de <div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>Emotions, which are responses to salient events, can be realized in text implicitly, for instance with mere references to facts (e.g., “That was the beginning of a long war”). Interpreting affective meanings thus relies on the readers’ background knowledge, but that is hardly modeled in computational emotion analysis. Much work in the field is focused on the word level and treats individual lexical units as the fundamental emotion cues in written communication. We shift our attention to word relations. We leverage Frame Semantics, a prominent theory for the description of predicate-argument structures, which matches the study of emotions: frames build on a “semantics of understanding” whose assumptions rely precisely on people’s world knowledge. Our overarching question is whether and to what extent the events that are represented by frames possess an emotion meaning. To carry out a large corpus-based correspondence analysis, we automatically annotate texts with emotions as well as with FrameNet frames and roles, and we analyze the correlations between them. Our main finding is that substantial groups of frames have an emotional import. With an extensive qualitative analysis, we show that they capture several properties of emotions that are purported by theories from psychology. 
These observations sharpen insight into the two strands of research that we bring together: emotion analysis can profit from the event-based perspective of frame semantics; in return, frame semantics gains a better grip on its position vis-à-vis emotions, an integral part of word meanings.</p> </div> </div> </div> 2023-09-15T00:00:00+02:00 Copyright (c) 2023 Enrica Troiano, Roman Klinger, Sebastian Padó https://nejlt.ep.liu.se/article/view/4315 Part-of-Speech and Morphological Tagging of Algerian Judeo-Arabic 2022-08-22T16:25:25+02:00 Ofra Tirosh-Becker otirosh@mail.huji.ac.il Michal Kessler michalskessler@gmail.com Oren Becker becker.oren@gmail.com Yonatan Belinkov belinkov@technion.ac.il <p>Most linguistic studies of Judeo-Arabic, the ensemble of dialects spoken and written by Jews in Arab lands, are qualitative in nature and rely on laborious manual annotation work, and are therefore limited in scale. In this work, we develop automatic methods for morpho-syntactic tagging of Algerian Judeo-Arabic texts published by Algerian Jews in the 19th–20th centuries, based on a linguistically tagged corpus. First, we describe our semi-automatic approach for preprocessing these texts. Then, we experiment with both an off-the-shelf morphological tagger and several specially designed neural network taggers. Finally, we perform a real-world evaluation of new texts that were never tagged before in comparison with human expert annotators.
Our experimental results demonstrate that these methods can dramatically speed up and improve the linguistic research pipeline, enabling linguists to study these dialects on a much greater scale.</p> 2022-12-14T00:00:00+01:00 Copyright (c) 2022 Ofra Tirosh-Becker, Michal Kessler, Oren Becker, Yonatan Belinkov https://nejlt.ep.liu.se/article/view/4132 Benchmark for Evaluation of Danish Clinical Word Embeddings 2023-02-22T19:06:50+01:00 Martin Sundahl Laursen msla@mmmi.sdu.dk Jannik Skyttegaard Pedersen jasp@mmmi.sdu.dk Pernille Just Vinholt pernille.vinholt@rsyd.dk Rasmus Søgaard Hansen rasmus.sogaard.hansen@rsyd.dk Thiusius Rajeeth Savarimuthu trs@mmmi.sdu.dk <div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>In natural language processing, benchmarks are used to track progress and identify useful models. Currently, no benchmark for Danish clinical word embeddings exists. This paper describes the development of a Danish benchmark for clinical word embeddings. The clinical benchmark consists of ten datasets: eight intrinsic and two extrinsic. Moreover, we evaluate word embeddings trained on text from the clinical domain, general practitioner domain and general domain on the established benchmark. All the intrinsic tasks of the benchmark are publicly available.</p> </div> </div> </div> 2023-03-01T00:00:00+01:00 Copyright (c) 2023 Martin Sundahl Laursen, Jannik Skyttegaard Pedersen, Pernille Just Vinholt, Rasmus Søgaard Hansen, Thiusius Rajeeth Savarimuthu https://nejlt.ep.liu.se/article/view/4017 Building Analyses from Syntactic Inference in Local Languages: An HPSG Grammar Inference System 2022-04-06T16:42:10+02:00 Kristen Howell kjpiepgrass@gmail.com Emily M. 
Bender ebender@uw.edu <p>We present a grammar inference system that leverages linguistic knowledge recorded in the form of annotations in interlinear glossed text (IGT) and in a meta-grammar engineering system (the LinGO Grammar Matrix customization system) to automatically produce machine-readable HPSG grammars. Building on prior work to handle the inference of lexical classes, stems, affixes and position classes, and preliminary work on inferring case systems and word order, we introduce an integrated grammar inference system that covers a wide range of fundamental linguistic phenomena. System development was guided by 27 genealogically and geographically diverse languages, and we test the system's cross-linguistic generalizability on an additional 5 held-out languages, using datasets provided by field linguists. Our system outperforms three baseline systems, increasing coverage while limiting ambiguity, and produces richer semantic representations than previous work in grammar inference.</p> 2022-07-01T00:00:00+02:00 Copyright (c) 2022 Kristen Howell, Emily M. Bender https://nejlt.ep.liu.se/article/view/3874 6 Questions for Socially Aware Language Technologies 2021-06-28T17:35:35+02:00 Diyi Yang diyi.yang@cc.gatech.edu <p>Over the last few decades, natural language processing (NLP) has dramatically improved performance and produced industrial applications like personal assistants. Despite being sufficient to enable these applications, current NLP systems largely ignore the social part of language. This severely limits the functionality and growth of these applications.
This work discusses 6 questions on how to build socially aware language technologies, with the hope of inspiring more research into Social NLP and pushing our research field to the next level.</p> 2021-07-01T00:00:00+02:00 Copyright (c) 2022 Diyi Yang https://nejlt.ep.liu.se/article/view/3566 Lexical variation in English language podcasts, editorial media, and social media 2022-05-16T18:28:31+02:00 Jussi Karlgren jussi@lingvi.st <div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>The study presented in this paper demonstrates how transcribed podcast material differs with respect to lexical content from other collections of English language data: editorial text, social media, both long form and microblogs, dialogue from movie scripts, and transcribed phone conversations. Most of the recorded differences are as might be expected, reflecting known or assumed differences between spoken and written language, between dialogue and soliloquy, and between scripted formal and unscripted informal language use. Most notably, podcast material, compared to the hitherto typical training sets from editorial media, is characterised by being in the present tense, and with a much higher incidence of pronouns, interjections, and negations. These characteristics are, unsurprisingly, largely shared with social media texts. Where podcast material differs from social media material is in its attitudinal content, with many more amplifiers and much less negative attitude than in blog texts. This variation, besides being of philological interest, has ramifications for computational work. Information access for material which is not primarily topical should be designed to be sensitive to such variation that defines the data set itself and discriminates items within it.
In general, training sets for language models are a non-trivial parameter that is likely to show both expected and unexpected effects when applied to data from other sources, and the characteristics and provenance of the data used to train a model should be listed on the label as a minimal form of downstream consumer protection.</p> </div> </div> </div> 2022-08-11T00:00:00+02:00 Copyright (c) 2022 Jussi Karlgren https://nejlt.ep.liu.se/article/view/3505 Bias Identification and Attribution in NLP Models With Regression and Effect Sizes 2022-06-13T20:36:56+02:00 Erenay Dayanik erenay.dayanik@ims.uni-stuttgart.de Ngoc Thang Vu thang.vu@ims.uni-stuttgart.de Sebastian Padó pado@ims.uni-stuttgart.de <p>In recent years, there has been an increasing awareness that many NLP systems incorporate biases of various types (e.g., regarding gender or race) which can have significant negative consequences. At the same time, the techniques used to statistically analyze such biases are still relatively simple. Typically, studies test for the presence of a significant difference between two levels of a single bias variable (e.g., male vs. female) without attention to potential confounders, and do not quantify the importance of the bias variable. This article proposes to analyze bias in the output of NLP systems using multivariate regression models. They provide a robust and more informative alternative which (a) generalizes to multiple bias variables, (b) can take covariates into account, and (c) can be combined with measures of effect size to quantify the size of bias. Jointly, these properties contribute to a more robust statistical analysis of bias that can be used to diagnose system behavior and extract informative examples.
We demonstrate the benefits of our method by analyzing a range of current NLP models on one regression and one classification task (emotion intensity prediction and coreference resolution, respectively).</p> 2022-08-11T00:00:00+02:00 Copyright (c) 2022 Erenay Dayanik, Thang Vu, Sebastian Padó https://nejlt.ep.liu.se/article/view/3478 Contextualized embeddings for semantic change detection: Lessons learned 2022-02-04T17:07:53+01:00 Andrey Kutuzov andreku@ifi.uio.no Erik Velldal erikve@ifi.uio.no Lilja Øvrelid liljao@ifi.uio.no <p>We present a qualitative analysis of the (potentially erroneous) outputs of contextualized embedding-based methods for detecting diachronic semantic change. First, we introduce an ensemble method outperforming previously described contextualized approaches. This method is used as a basis for an in-depth analysis of the degrees of semantic change predicted for English words across 5 decades. Our findings show that contextualized methods can often predict high change scores for words which are not undergoing any real diachronic semantic shift in the lexicographic sense of the term (or at least the status of these shifts is questionable). Such challenging cases are discussed in detail with examples, and their linguistic categorization is proposed. Our conclusion is that pre-trained contextualized language models are prone to confound changes in lexicographic senses with changes in contextual variance, which naturally stems from their distributional nature but is different from the types of issues observed in methods based on static embeddings. Additionally, they often merge together syntactic and semantic aspects of lexical entities.
We propose a range of possible future solutions to these issues.</p> 2022-08-26T00:00:00+02:00 Copyright (c) 2022 Andrey Kutuzov, Erik Velldal, Lilja Øvrelid https://nejlt.ep.liu.se/article/view/3454 Policy-focused Stance Detection in Parliamentary Debate Speeches 2022-05-05T10:53:56+02:00 Gavin Abercrombie gavin.abercrombie@manchester.ac.uk Riza Batista-Navarro riza.batista@manchester.ac.uk <p>Legislative debate transcripts provide citizens with information about the activities of their elected representatives, but are difficult for people to process. We propose the novel task of policy-focused stance detection, in which both the policy proposals under debate and the position of the speakers towards those proposals are identified. We adapt a previously existing dataset to include manual annotations of policy preferences, an established schema from political science. We evaluate a range of approaches to the automatic classification of policy preferences and speech sentiment polarity, including transformer-based text representations and a multi-task learning paradigm. We find that it is possible to identify the policies under discussion using features derived from the speeches, and that incorporating motion-dependent debate modelling, previously used to classify speech sentiment, also improves performance in the classification of policy preferences.
We analyse the output of the best performing system, finding that discriminating features for the task are highly domain-specific, and that speeches that address policy preferences proposed by members of the same party can be among the most difficult to predict.</p> 2022-07-01T00:00:00+02:00 Copyright (c) 2022 Gavin Abercrombie, Riza Batista-Navarro https://nejlt.ep.liu.se/article/view/3128 Crowdsourcing Relative Rankings of Multi-Word Expressions: Experts versus Non-Experts 2021-05-11T09:51:13+02:00 David Alfter david.alfter@svenska.gu.se Therese Lindström Tiedemann therese.lindstromtiedemann@helsinki.fi Elena Volodina elena.volodina@svenska.gu.se <p>In this study we investigate to which degree experts and non-experts agree on questions of linguistic complexity in a crowdsourcing experiment. We ask non-experts (second language learners of Swedish) and two groups of experts (teachers of Swedish as a second/foreign language and CEFR experts) to rank multi-word expressions. We find that the resulting rankings by all three tested groups correlate to a very high degree, which suggests that judgments produced in a comparative setting are not influenced by professional insights into Swedish as a second language.</p> 2022-07-01T00:00:00+02:00 Copyright (c) 2021 David Alfter, Therese Lindström Tiedemann, Elena Volodina https://nejlt.ep.liu.se/article/view/1665 Special Issue of Selected Contributions from the Seventh Swedish Language Technology Conference (SLTC 2018) 2020-05-10T10:55:52+02:00 Hercules Dalianis hercules@dsv.su.se Robert Östling robert@ling.su.se Rebecka Weegar rebeckaw@dsv.su.se Mats Wirén mats.wiren@ling.su.se <p>This Special Issue contains three papers that are extended versions of abstracts presented at the Seventh Swedish Language Technology Conference (SLTC 2018), held at Stockholm University 8-9 November 2018. SLTC 2018 received 34 submissions, of which 31 were accepted for presentation.
The number of registered participants was 113, including attendees both at SLTC 2018 and at two co-located workshops that took place on 7 November. 32 participants were internationally affiliated, of which 14 were from outside the Nordic countries. Overall participation was thus on a par with previous editions of SLTC, but international participation was higher.</p> 2019-12-20T00:00:00+01:00 Copyright (c) 2019 Hercules Dalianis, Robert Östling, Rebecka Weegar, Mats Wirén https://nejlt.ep.liu.se/article/view/1662 Low-Resource Active Learning of Morphological Segmentation 2020-05-10T10:56:08+02:00 Stig-Arne Grönroos stig-arne.gronroos@aalto.fi Katri Hiovain katri.hiovain@helsinki.fi Peter Smit peter.smit@aalto.fi Ilona Rauhala ilona.rauhala@helsinki.fi Kristiina Jokinen kristiina.jokinen@helsinki.fi Mikko Kurimo mikko.kurimo@aalto.fi Sami Virpioja sami.virpioja@aalto.fi <p>Many Uralic languages have a rich morphological structure, but lack morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small number of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word-initial and word-final substrings. For North Sámi we collect a set of human-annotated data.
With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.</p> 2016-03-13T00:00:00+01:00 Copyright (c) 0 https://nejlt.ep.liu.se/article/view/1660 Utilizing Language Technology in the Documentation of Endangered Uralic Languages 2020-05-10T10:56:10+02:00 Ciprian Gerstenberger ciprian.gerstenberger@uit.no Niko Partanen niko.partanen@uni-hamburg.de Michael Rießler michael.riessler@skandinavistik.uni-freiburg.de Joshua Wilbur joshua.wilbur@skandinavistik.uni-freiburg.de <p>The paper describes work-in-progress by the Pite Saami, Kola Saami and Izhva Komi language documentation projects, all of which record new spoken language data, digitize available recordings and annotate these multimedia data in order to provide comprehensive language corpora as databases for future research on and for endangered – and under-described – Uralic speech communities. Applying language technology in language documentation helps us to create more systematically annotated corpora, rather than eclectic data collections. Specifically, we describe a script providing interactivity between different morphosyntactic analysis modules implemented as Finite State Transducers and ELAN, a Graphical User Interface tool for annotating and presenting multimodal corpora. Ultimately, the spoken corpora created in our projects will be useful for scientifically significant quantitative investigations on these languages in the future.</p> 2016-03-13T00:00:00+01:00 Copyright (c) 2016 Ciprian Gerstenberger, Niko Partanen, Michael Rießler, Joshua Wilbur https://nejlt.ep.liu.se/article/view/1659 A North Saami to South Saami Machine Translation Prototype 2020-05-10T10:56:12+02:00 Lene Antonsen lene.antonsen@uit.no Trond Trosterud trond.trosterud@uit.no Francis M. 
Tyers francis.tyers@uit.no <p>The paper describes a rule-based machine translation (MT) system from North to South Saami. The system is designed for a workflow where North Saami functions as a pivot language in translation from Norwegian or Swedish. We envisage manual translation from Norwegian or Swedish to North Saami, and thereafter MT to South Saami. The system was aimed at a single domain, that of texts for use in school administration. We evaluated the system in terms of the quality of translations for post-editing. Two out of three of the Norwegian to South Saami professional translators found the output of the system to be useful. The evaluation shows that it is possible to make a functioning rule-based system with a small transfer lexicon and a small number of rules and achieve results that are useful for a restricted domain, even if there are substantial differences between the languages.</p> 2016-03-13T00:00:00+01:00 Copyright (c) 2016 Lene Antonsen, Trond Trosterud, Francis M. Tyers https://nejlt.ep.liu.se/article/view/1657 Foreword to the Special Issue on Uralic Languages 2020-05-10T10:58:52+02:00 Tommi A Pirinen tommi.antero.pirinen@uni-hamburg.de Trond Trosterud trond.trosterud@uit.no Francis M. Tyers francis.tyers@uit.no Veronika Vincze vinczev@inf.u-szeged.hu Eszter Simon simon.eszter@nytud.mta.hu Jack Rueter jack.rueter@helsinki.fi <p>In this introduction we have tried to present concisely the history of language technology for Uralic languages up until today, together with some desiderata explaining why we organised this special issue. It is of course not possible to cover everything that has happened in a short introduction like this. We have attempted to cover the beginnings of the (Uralic) language-technology scene in the 1980s as far as it is relevant to much of the current work, including the papers presented in this issue.
We also survey the main languages of the Uralic area with respect to existing resources, forming a systematic overview of what is missing. Finally, we discuss some possible future directions for language technology management at the pan-Uralic level.</p> 2016-03-27T00:00:00+01:00 Copyright (c) 0 https://nejlt.ep.liu.se/article/view/1656 SUC-CORE: A Balanced Corpus Annotated with Noun Phrase Coreference 2020-05-10T10:56:16+02:00 Kristina Nilsson Björkenstam kristina.nilsson@ling.su.se <p>This paper describes SUC-CORE, a subset of the Stockholm Umeå Corpus and the Swedish Treebank annotated with noun phrase coreference. While most coreference annotated corpora consist of texts of similar types within related domains, SUC-CORE consists of both informative and imaginative prose and covers a wide range of literary genres and domains. This allows for exploration of coreference across different text types, but it also means that there are limited amounts of data within each type. Future work on coreference resolution for Swedish should include making more annotated data available for the research community.</p> 2013-09-16T00:00:00+02:00 Copyright (c) 0 https://nejlt.ep.liu.se/article/view/1655 Investigations of Synonym Replacement for Swedish 2020-02-06T14:39:17+01:00 Robin Keskisärkkä robin.keskisarkka@liu.se Arne Jönsson arnjo@ida.liu.se <p>We present results from an investigation on automatic synonym replacement for Swedish. Three different methods for choosing alternative synonyms were evaluated: (1) based on word frequency, (2) based on word length, and (3) based on level of synonymy. These three strategies were evaluated in terms of standardized readability metrics for Swedish, average word length, proportion of long words, and the ratio of errors to replacements.
The results show an improvement in readability for most strategies, but also show that erroneous substitutions are frequent.</p> 2013-12-19T00:00:00+01:00 Copyright (c) 0 https://nejlt.ep.liu.se/article/view/1653 Stagger: an Open-Source Part of Speech Tagger for Swedish 2020-05-10T10:56:18+02:00 Robert Östling robert@ling.su.se <p>This work presents Stagger, a new open-source part of speech tagger for Swedish based on the Averaged Perceptron. By using the SALDO morphological lexicon and semi-supervised learning in the form of Collobert and Weston embeddings, it reaches an accuracy of 96.4% on the standard Stockholm-Umeå Corpus dataset, making it the best single part of speech tagging system reported for Swedish. Accuracy increases to 96.6% on the latest version of the corpus, where the annotation has been revised to increase consistency. Stagger is also evaluated on a new corpus of Swedish blog posts, investigating its out-of-domain performance.</p> 2013-09-16T00:00:00+02:00 Copyright (c) 0 https://nejlt.ep.liu.se/article/view/1651 Transition-Based Techniques for Non-Projective Dependency Parsing 2020-05-10T10:56:21+02:00 Marco Kuhlmann marco.kuhlmann@lingfil.uu.se Joakim Nivre joakim.nivre@lingfil.uu.se <p>We present an empirical evaluation of three methods for the treatment of non-projective structures in transition-based dependency parsing: pseudo-projective parsing, non-adjacent arc transitions, and online reordering. We compare both the theoretical coverage and the empirical performance of these methods using data from Czech, English and German. The results show that although online reordering is the only method with complete theoretical coverage, all three techniques exhibit high precision but somewhat lower recall on non-projective dependencies and can all improve overall parsing accuracy provided that non-projective dependencies are frequent enough.
We also find that the use of non-adjacent arc transitions may lead to a drop in accuracy on projective dependencies in the presence of long-distance non-projective dependencies, an effect that is not found for the two other techniques.</p> 2010-10-01T00:00:00+02:00 Copyright (c) 0 https://nejlt.ep.liu.se/article/view/1650 Named Entity Recognition in Bengali 2020-05-10T10:56:24+02:00 Asif Ekbal ekbal@cl.uni-heidelberg.de Sivaji Bandyopadhyay asif.ekbal@gmail.com <p>This paper reports on a multi-engine approach for the development of a Named Entity Recognition (NER) system in Bengali by combining classifiers such as Maximum Entropy (ME), Conditional Random Field (CRF) and Support Vector Machine (SVM) with the help of weighted voting techniques. The training set consists of approximately 272K wordforms, out of which 150K wordforms have been manually annotated with the four major named entity (NE) tags, namely Person name, Location name, Organization name and Miscellaneous name. An appropriate tag conversion routine has been defined in order to convert the 122K wordforms of the IJCNLP-08 NER Shared Task on South and South East Asian Languages (NERSSEAL) data into the desired forms. The individual classifiers make use of different contextual information about the words along with a variety of features that are helpful to predict the various NE classes. Lexical context patterns, generated from an unlabeled corpus of 3 million wordforms in a semi-automatic way, have been used as features of the classifiers in order to improve their performance. In addition, we propose a number of techniques to post-process the output of each classifier in order to reduce the errors and to improve the performance further. Finally, we use three weighted voting techniques to combine the individual models.
Experimental results show the effectiveness of the proposed multi-engine approach, with overall Recall, Precision and F-Score values of 93.98%, 90.63% and 92.28%, respectively, representing an improvement of 14.92% in F-Score over the best-performing baseline SVM-based system and an improvement of 18.36% in F-Score over the least-performing baseline ME-based system. Comparative evaluation results also show that the proposed system outperforms the three other existing Bengali NER systems.</p> 2010-02-02T00:00:00+01:00 Copyright (c) 0 https://nejlt.ep.liu.se/article/view/1649 Entry Generation by Analogy – Encoding New Words for Morphological Lexicons 2020-05-10T10:56:26+02:00 Krister Lindén krister.linden@helsinki.fi <p>Language software applications encounter new words, e.g., acronyms, technical terminology, loan words, names or compounds of such words. To add new words to a lexicon, we need to indicate their base form and inflectional paradigm. In this article, we evaluate a combination of corpus-based and lexicon-based methods for assigning the base form and inflectional paradigm to new words in Finnish, Swedish and English finite-state transducer lexicons. The methods have been implemented with the open-source Helsinki Finite-State Technology (Lindén &amp; al., 2009). As an entry generator often produces numerous suggestions, it is important that the best suggestions be among the first few, otherwise it may become more efficient to create the entries by hand. By combining the probabilities calculated from corpus data and from lexical data, we get a more precise combined model. The combined method has 77-81% precision and 89-97% recall, i.e. the first correctly generated entry is on average found as the first or second candidate for the test languages.
A further study demonstrated that a native speaker could revise suggestions from the entry generator at a speed of 300-400 entries per hour.</p> 2009-05-18T00:00:00+02:00 Copyright (c) 2009 Krister Lindén https://nejlt.ep.liu.se/article/view/1374 The SweLL Language Learner Corpus 2020-05-10T10:55:55+02:00 Elena Volodina elena.volodina@svenska.gu.se Lena Granstedt lena.granstedt@umu.se Arild Matsson arild.matsson@gu.se Beáta Megyesi beata.megyesi@lingfil.uu.se Ildikó Pilán ildiko.pilan@gmail.com Julia Prentice julia.prentice@svenska.gu.se Dan Rosén dan.rosen@svenska.gu.se Lisa Rudebeck lisa.rudebeck@su.se Carl-Johan Schenström carl-johan.schenstrom@gu.se Gunlög Sundberg gunlog.sundberg@su.se Mats Wirén mats.wiren@ling.su.se <p>The article presents a new language learner corpus for Swedish, SweLL, and its methodology, from collection and pseudonymisation (to protect the personal information of learners) to annotation adapted to second language learning. The main aim is to deliver a well-annotated corpus of essays written by second language learners of Swedish and make it available for research through a browsable environment. To that end, a new annotation tool and a new project management tool have been implemented, both with the main purpose of ensuring the reliability and quality of the final corpus.
In the article we discuss the reasoning behind metadata selection and the principles of gold corpus compilation, and argue for separating normalization from correction annotation.</p> 2019-12-20T00:00:00+01:00 Copyright (c) 2019 Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg, Mats Wirén https://nejlt.ep.liu.se/article/view/1037 The Interplay Between Loss Functions and Structural Constraints in Dependency Parsing 2020-05-10T10:55:58+02:00 Robin Kurtz robin.kurtz@liu.se Marco Kuhlmann marco.kuhlmann@liu.se <p>Dependency parsing can be cast as a combinatorial optimization problem with the objective of finding the highest-scoring graph, where edge scores are learnt from data. Several of the decoding algorithms that have been applied to this task employ structural restrictions on candidate solutions, such as the restriction to projective dependency trees in syntactic parsing, or the restriction to noncrossing graphs in semantic parsing. In this paper we study the interplay between structural restrictions and a common loss function in neural dependency parsing, the structural hinge loss. We show how structural constraints can make networks trained under this loss function diverge and propose a modified loss function that solves this problem. Our experimental evaluation shows that the modified loss function can yield improved parsing accuracy, compared to the unmodified baseline.</p> 2019-12-20T00:00:00+01:00 Copyright (c) 2019 Robin Kurtz, Marco Kuhlmann https://nejlt.ep.liu.se/article/view/1036 The Koala Part-of-Speech Tagset for Written Swedish 2020-05-10T10:56:00+02:00 Yvonne Adesam yvonne.adesam@gu.se Gerlof Bouma gerlof.bouma@gu.se <p>We present the Koala part-of-speech tagset for written Swedish. The categorization takes the Swedish Academy grammar (SAG) as its main starting point, to fit with the current descriptive view on Swedish grammar.
We argue that neither SAG, as is, nor any of the existing part-of-speech tagsets meets our requirements for a broadly applicable categorization. Our proposal is outlined and compared to the other descriptions, and motivations are discussed both for the tagset as a whole and for decisions about individual tags.</p> 2019-12-20T00:00:00+01:00 Copyright (c) 2019 Yvonne Adesam, Gerlof Bouma https://nejlt.ep.liu.se/article/view/218 Part of Speech Tagging: Shallow or Deep Learning? 2020-05-10T10:56:02+02:00 Robert Östling robert@ling.su.se <p>Deep neural networks have advanced the state of the art in numerous fields, but they generally suffer from low computational efficiency, and the level of improvement compared to more efficient machine learning models is not always significant. We perform a thorough PoS tagging evaluation on the Universal Dependencies treebanks, pitting a state-of-the-art neural network approach against UDPipe and our sparse structured perceptron-based tagger, efselab. In terms of computational efficiency, efselab is three orders of magnitude faster than the neural network model, while being more accurate than either of the other systems on 47 of 65 treebanks.</p> 2018-06-19T00:00:00+02:00 Copyright (c) 2018 Robert Östling