https://nejlt.ep.liu.se/issue/feedNorthern European Journal of Language Technology2024-03-14T17:17:56+01:00Northern European Journal of Language Technologynejlt@nejlt.orgOpen Journal Systems<p>NEJLT is a global journal publishing peer-reviewed research on natural language processing and computational linguistics <strong>for all languages</strong>.</p>https://nejlt.ep.liu.se/article/view/5249DANSK: Domain Generalization of Danish Named Entity Recognition2024-03-11T13:54:31+01:00Kenneth Enevoldsenkenneth.enevoldsen@cas.au.dkEmil Trenckner Jessenemil.tj@hotmail.comRebekah Baglinirbkh@cc.au.dk<p>Named entity recognition is an important application within Danish NLP, essential in both industry and research. However, Danish NER is inhibited by a lack of coverage across domains and entity types. As a consequence, no current models are capable of fine-grained named entity recognition, nor have they been evaluated for potential generalizability issues across datasets and domains. To alleviate these limitations, this paper introduces: 1) DANSK, a named entity dataset providing high-granularity tagging as well as within-domain evaluation of models across a diverse set of domains; 2) three generalizable models with fine-grained annotation, available in DaCy 2.6.0; and 3) an evaluation of current state-of-the-art models’ ability to generalize across domains. The evaluation of existing and new models revealed notable performance discrepancies across domains, which should be addressed within the field. Shortcomings of the annotation quality of the dataset and its impact on model training and evaluation are also discussed. 
Despite these limitations, we advocate for the use of the new dataset DANSK alongside further work on generalizability within Danish NER.</p>2024-07-23T00:00:00+02:00Copyright (c) 2024 Kenneth Enevoldsen, Emil Trenckner Jessen, Rebekah Baglinihttps://nejlt.ep.liu.se/article/view/5217Documenting Geographically and Contextually Diverse Language Data Sources2024-02-15T06:51:57+01:00Angelina McMillan-Majoraymm@uw.eduFrancesco De Tonifrancesco.detoni@uwa.edu.auZaid Alyafeaialyafey22@gmail.comStella Bidermanstellabiderman@gmail.comKimbo Chenchentenghung@gmail.comGérard Dupontger.dupont@gmail.comHady Elsaharhadyelsahar@gmail.comChris Emezuechris.emezue@gmail.comAlham Fikri Ajialham.fikri@mbzuai.ac.aeSuzana Ilićsuzana@mltokyo.aiNurulaqilla Khamisnurulaqilla@utm.myColin Leongcleong1@udayton.eduMaraim Masoudmaraim.elbadri@gmail.comAitor Soroaa.soroa@ehu.eusPedro Ortiz Suarezpedro.ortiz@uni-mannheim.deDaniel van Striendaniel.van-strien@bl.ukZeerak Talatz@zeerak.orgYacine Jerniteyacine@huggingface.co<p><span class="fontstyle0">Contemporary large-scale data collection efforts have prioritized the amount of data collected in order to improve large language models (LLMs). This quantitative approach has resulted in concerns about the rights of data subjects represented in data collections. These concerns are exacerbated by a lack of documentation and analysis tools, making it difficult to interrogate these collections. Mindful of these pitfalls, we present a methodology for documentation-first, human-centered data collection. We apply this approach in an effort to train a multilingual LLM. We identify a geographically diverse set of target language groups (Arabic varieties, Basque, Chinese varieties, Catalan, English, French, Indic languages, Indonesian, Niger-Congo languages, Portuguese, Spanish, and Vietnamese, as well as programming languages) for which to collect metadata on potential data sources. 
We structure this effort by developing an online catalogue in English as a tool for gathering metadata through public hackathons. We present our tool and analyses of the resulting resource metadata, including distributions over languages, regions, and resource types, and discuss our lessons learned.</span> </p>2024-09-12T00:00:00+02:00Copyright (c) 2024 Angelina McMillan-Major, Francesco De Toni, Zaid Alyafeai, Stella Biderman, Kimbo Chen, Gérard Dupont, Hady Elsahar, Chris Emezue, Alham Fikri Aji, Suzana Ilić, Nurulaqilla Khamis, Colin Leong, Maraim Masoud, Aitor Soroa, Pedro Ortiz Suarez, Daniel van Strien, Zeerak Talat, Yacine Jernitehttps://nejlt.ep.liu.se/article/view/5203Understanding Counterspeech for Online Harm Mitigation2024-02-05T11:22:54+01:00Yi-Ling Chungychung@turing.ac.ukGavin Abercrombieg.abercrombie@hw.ac.ukFlorence Enockfenock@turing.ac.ukJonathan Brightjbright@turing.ac.ukVerena Rieserv.t.rieser@hw.ac.uk<p>Counterspeech offers direct rebuttals to hateful speech by challenging perpetrators of hate and showing support to targets of abuse. It provides a promising alternative to more contentious measures, such as content moderation and deplatforming, by contributing a greater amount of positive online speech rather than attempting to mitigate harmful content through removal. Advances in the development of large language models mean that the process of producing counterspeech could be made more efficient by automating its generation, which would enable large-scale online campaigns. However, we currently lack a systematic understanding of several important factors relating to the efficacy of counterspeech for hate mitigation, such as which types of counterspeech are most effective, what are the optimal conditions for implementation, and which specific effects of hate it can best ameliorate. 
This paper aims to fill this gap by systematically reviewing counterspeech research in the social sciences and comparing methodologies and findings with natural language processing (NLP) and computer science efforts in automatic counterspeech generation. By taking this multi-disciplinary view, we identify promising future directions in both fields.</p>2024-09-04T00:00:00+02:00Copyright (c) 2024 Yi-Ling Chung, Gavin Abercrombie, Florence Enock, Jonathan Bright, Verena Rieserhttps://nejlt.ep.liu.se/article/view/5000On Using Self-Report Studies to Analyze Language Models2023-11-06T12:03:59+01:00Matúš Pikuliakmatus.pikuliak@gmail.com<p>We are at a curious point in time where our ability to build language models (LMs) has outpaced our ability to analyze them. We do not really know how to reliably determine their capabilities, biases, dangers, knowledge, and so on. The benchmarks we have are often overly specific, do not generalize well, and are susceptible to data leakage. Recently, I have noticed a trend of using self-report studies, such as various polls and questionnaires originally designed for humans, to analyze the properties of LMs. I think that this approach can easily lead to false results, which can be quite dangerous considering the current discussions on AI safety, governance, and regulation. To illustrate my point, I will delve deeper into several papers that employ self-report methodologies and I will try to highlight some of their weaknesses.</p>2024-09-18T00:00:00+02:00Copyright (c) 2024 Matúš Pikuliakhttps://nejlt.ep.liu.se/article/view/4939QUA-RC: the semi-synthetic dataset of multiple choice questions for assessing reading comprehension in Ukrainian2023-09-13T11:56:27+02:00Mariia Zyrianovamariiaz@kth.seDmytro Kalpakchidmytroka@kth.se<p>In this article we present the first dataset of multiple choice questions for assessing reading comprehension in Ukrainian. 
The dataset is based on texts from the Ukrainian national tests for reading comprehension, and the MCQs themselves were created semi-automatically in three stages. The first stage was to use GPT-3 to generate MCQs zero-shot; the second was to select MCQs of sufficient quality and revise those with minor errors; the final stage was to expand the dataset with manually written MCQs. The dataset was created by native speakers of Ukrainian, one of whom is also a language teacher. The resulting corpus contains slightly more than 900 MCQs, of which only 43 could be kept exactly as generated by GPT-3.</p>2023-11-16T00:00:00+01:00Copyright (c) 2023 Mariia Zyrianova, Dmytro Kalpakchihttps://nejlt.ep.liu.se/article/view/4932Efficient Structured Prediction with Transformer Encoders2024-02-22T13:05:38+01:00Ali Basiratalib@hum.ku.dk<p>Finetuning is a useful method for adapting Transformer-based text encoders to new tasks but can be computationally expensive for structured prediction tasks that require tuning at the token level. Furthermore, finetuning is inherently inefficient in that it updates all base model parameters, which prevents parameter sharing across tasks. To address these issues, we propose a method for efficient task adaptation of frozen Transformer encoders based on the local contribution of their intermediate layers to token representations. Our adapter uses a novel attention mechanism to aggregate intermediate layers and tailor the resulting representations to a target task. Experiments on several structured prediction tasks demonstrate that our method outperforms previous approaches, retaining over 99% of the finetuning performance at a fraction of the training cost. 
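The layer-aggregation idea can be illustrated with a toy numpy sketch: a small set of trainable parameters attends over a frozen encoder's intermediate layers to build task-tailored token representations. All sizes, the random stand-in for frozen layer outputs, and the single-query scoring scheme below are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_tokens, dim = 12, 5, 64   # toy sizes; real encoders use e.g. 12 x 768

# Stand-in for a frozen encoder's output: one representation per layer per token.
H = rng.normal(size=(n_layers, n_tokens, dim))

# A tiny trainable adapter: a query vector scores each intermediate layer for
# each token, and the layers are aggregated with softmax attention weights.
q = 0.1 * rng.normal(size=(dim,))      # the only "learnable" parameters here

scores = H @ q                          # (n_layers, n_tokens): per-layer scores
weights = np.exp(scores - scores.max(axis=0))
weights = weights / weights.sum(axis=0)  # softmax over the layer axis

# Task-tailored token representations: weighted sum over layers -> (n_tokens, dim)
Z = np.einsum("lt,ltd->td", weights, H)
print(Z.shape)  # (5, 64)
```

Because only `q` (and whatever task head consumes `Z`) would be trained, the frozen encoder can be shared across tasks.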
Our proposed method offers an efficient solution for adapting frozen Transformer encoders to new tasks, improving performance and enabling parameter sharing across different tasks.</p>2024-03-14T00:00:00+01:00Copyright (c) 2024 Ali Basirathttps://nejlt.ep.liu.se/article/view/4884Resource papers as registered reports: a proposal2023-06-28T22:25:06+02:00Emiel van MiltenburgC.W.J.vanMiltenburg@tilburguniversity.edu<p>This is a proposal for publishing resource papers as registered reports in the Northern European Journal of Language Technology. The idea is that authors write a data collection plan with a full data statement, to the extent that it can be written before data collection starts. Once the proposal is approved, publication of the final resource paper is guaranteed, as long as the data collection plan is followed (modulo reasonable changes due to unforeseen circumstances). This proposal changes the reviewing process from an antagonistic to a collaborative enterprise, and hopefully encourages the NLP community to develop and publish more high-quality datasets. The key advantage of this proposal is that it helps to promote <em>responsible resource development</em> (through constructive peer review) and to avoid <em>research waste</em>.</p>2023-07-13T00:00:00+02:00Copyright (c) 2023 Emiel van Miltenburghttps://nejlt.ep.liu.se/article/view/4855Unsupervised Text Embedding Space Generation Using Generative Adversarial Networks for Text Synthesis2023-08-21T19:01:48+02:00Jun-Min Leeljm56897@gmail.comTae-Bin Hataebinalive@gmail.com<div>A Generative Adversarial Network (GAN) is a model for data synthesis that creates plausible data through competition between a generator and a discriminator. Although the application of GANs to image synthesis has been studied extensively, the approach has inherent limitations when applied to natural language generation. 
Because natural language is composed of discrete tokens, a generator has difficulty receiving gradient updates through backpropagation; therefore, most text-GAN studies generate sentences starting from a random token, guided by a reward system. Thus, the generators of previous studies are pre-trained in an autoregressive way before adversarial training, causing data memorization, whereby synthesized sentences reproduce the training data. In this paper, we synthesize sentences using a framework similar to the original GAN. More specifically, we propose Text Embedding Space Generative Adversarial Networks (TESGAN), which generates continuous text embedding spaces instead of discrete tokens to solve the gradient backpropagation problem. Furthermore, to overcome the data memorization issue, TESGAN conducts unsupervised learning that does not directly refer to the text of the training data. By adopting this novel method, TESGAN can synthesize new sentences, showing the potential of unsupervised learning for text synthesis. 
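The contrast with discrete-token generation can be illustrated with a toy numpy sketch (the five-word vocabulary, embedding sizes, and untrained linear generator below are invented for illustration; TESGAN itself trains a full GAN over text embedding spaces): the generator emits continuous embedding vectors, which remain differentiable, and a nearest-neighbour decoder maps them back to tokens only at synthesis time.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary with one embedding per token (in practice these would come
# from a pretrained language model's embedding table).
vocab = ["the", "cat", "sat", "on", "mat"]
embed = rng.normal(size=(len(vocab), 8))

def generator(z, W):
    # A linear "generator" mapping noise to a sequence of *continuous*
    # embedding vectors; unlike sampling discrete token ids, this output
    # is differentiable, so gradients could flow back through it.
    return np.tanh(z @ W)

def decode(E):
    # Map each continuous vector to the nearest vocabulary embedding
    # (cosine similarity) to recover readable tokens.
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    Vn = embed / np.linalg.norm(embed, axis=1, keepdims=True)
    return [vocab[i] for i in (En @ Vn.T).argmax(axis=1)]

W = 0.5 * rng.normal(size=(16, 8))  # untrained generator weights
z = rng.normal(size=(4, 16))        # noise for a four-token "sentence"
tokens = decode(generator(z, W))
print(tokens)                        # four tokens drawn from the toy vocabulary
```

A real GAN would train `W` against a discriminator operating on the continuous embedding space, sidestepping the non-differentiable `argmax` during training.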
We expect to see extended research combining large language models with this new perspective of viewing text as a continuous space.</div>2023-10-24T00:00:00+02:00Copyright (c) 2023 Jun-Min Lee, Tae-Bin Hahttps://nejlt.ep.liu.se/article/view/4725NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation2023-03-08T19:39:50+01:00Kaustubh Dholekdhole@emory.eduVarun Gangalvgangal@cs.cmu.eduSebastian Gehrmanngehrmann@google.comAadesh Guptaaadesh.gupta@ipsoft.comZhenhao Lizhenhao.li18@imperial.ac.ukSaad Mahamoodsaad.mahamood@trivago.comAbinaya Mahadiranabinaya.m02@mphasis.comSimon Millesimon.mille@upf.eduAshish Shrivastavaashish3586@gmail.comSamson Tansamson.tan@salesforce.comTongshang Wusherryw@cs.cmu.eduJascha Sohl-Dicksteinjaschasd@google.comJinho ChoiJinho.Choi@emory.eduEduard Hovyhovy@cmu.eduOndřej Dušekodusek@ufal.mff.cuni.czSebastian Rudersebastian@ruder.ioSajant Anandsajant@berkeley.eduNagender Anejananeja@gmail.comRabin Banjaderbnjade1@memphis.eduLisa Barthelisa.barthe@inetum.comHanna Behnkehanna.behnke20@imperial.ac.ukIan Berlot-Attwellianberlot@cs.toronto.eduConnor Boyleconnor.bo@gmail.comCaroline Bruncaroline.brun@naverlabs.comMarco Antonio Sobrevilla Cabezudomsobrevillac@usp.brSamuel Cahyawijayascahyawijaya@connect.ust.hkEmile Chapuischapuis.emile@gmail.comWanxiang Chefuxuanwei@ir.hit.edu.cnMukund Choudharymukund.choudhary@research.iiit.ac.inChristian Clausscclauss@me.comPierre Colombocolombo.pierre@gmail.comFilip Cornellc.filip.cornell@gmail.comGautier Dagangautierdagan@gmail.comMayukh Dasmayukh.das@tu-bs.deTanay Dixitdixittanay@gmail.comThomas DopierreThomas.Dopierre@univ-st-etienne.frPaul-Alexis Draypaul.alexis.dray@gmail.comSuchitra Dubeysuchitra27288@gmail.comTatiana Ekeinhortatiana.ekeinhor@vadesecure.comMarco Di Giovannimarco.digiovanni@polimi.itTanya Goyaltanyagoyal@utexas.eduRishabh Guptarishabh19089@iiitd.ac.inLouanes Hamlalouanes.hamla@inetum.comSang Hansanghan@protonmail.comFabrice Harel-Canadafabricehc@cs.ucla.eduAntoine 
Honoréantoine.honore@vadesecure.comIshan Jindalishan.jindal@ibm.comPrzemysław Joniakjoniak@g.ecc.u-tokyo.ac.jpDenis Kleykodenis.kleyko@gmail.comVenelin Kovatchevvkovatchev@ub.eduKalpesh Krishnakalpesh@cs.umass.eduAshutosh Kumarashutosh@iisc.ac.inStefan Langerlanger.stefan@siemens.comSeungjae Ryan Leeseungjaeryanlee@gmail.comCorey James Levinsonthecoreylevinson@gmail.comHualou Lianghualou.liang@drexel.eduKaizhao Liangkl2@illinois.eduZhexiong Liuzhexiong@cs.pitt.eduAndrey Lukyanenkoand-lukyane@yandex.ruVukosi Marivatevukosi.marivate@cs.up.ac.zaGerard de Melodemelo@uni-potsdam.deSimon Meonisimonmeoni@aol.comMaxine Meyermaxime.meyer@vadesecure.comAfnan Mirafnanmir@utexas.eduNafise Sadat MoosaviN.S.Moosavi@sheffield.ac.ukNiklas Meunnighoffmuennighoff@stu.pku.edu.cnTimothy Sum Hon Muntimothy22000@gmail.comKenton Murraykenton@jhu.eduMarcin NamyslMarcin.Namysl@iais.fraunhofer.deMaria Obedkovamaryobedkova@gmail.comPriti Olipoli@memphis.eduNivranshu Pasrichapasricha@protonmail.comJan Pfisterpfister@informatik.uni-wuerzburg.deRichard Plantr.plant@napier.ac.ukVinay Prabhuvinay@unify.idVasile Paisvasile@racai.roLibo Qinlbqin@ir.hit.edu.cnShahab Rajishahab.raji@rutgers.eduPawan Kumar Rajpootpawan.rajpoot2411@gmail.comVikas Raunakviraunak@microsoft.comRoy Rinbergroyrinberg@gmail.comNicholas Robertsnick11roberts@cs.wisc.eduJuan Diego Rodriguezjuand-r@utexas.eduClaude Rouxclaude.roux@naverlabs.comVasconcellos Samusphsamus@gmail.comAnanya Saiananya@cse.iitm.ac.inRobin Schmidtrob.schmidt@student.uni-tuebingen.deThomas Scialomt.scialom@gmail.comTshephisho Sefarasefaratj@gmail.comSaqib Shamsishamsi.saqib@gmail.comXudong Shenxudong.shen@u.nus.eduYiwen Shiyiwen.shi@drexel.eduHaoyue Shifreda@ttic.eduAnna Shvetsanna.shvets@inetum.comNick Siegelnsiegel@arlut.utexas.eduDamien Sileodamien.sileo@kuleuven.beJamie Simonjames.simon@berkeley.eduChandan Singhchandan_singh@berkeley.eduRoman Sitelewsitelewr@gmail.comPriyank Sonipriyanksonigeca7@gmail.comTaylor Sorensentsor1313@gmail.comWilliam 
Sotowilliamsotomartinez@gmail.comAman Srivastavaamanit0812@gmail.comAditya Srivatsak.v.aditya@research.iiit.ac.inTony Sunthetonysun@gmail.comMukund Varmamukundvarmat@gmail.comA Tabassumatabassum.bee15seecs@seecs.edu.pkFiona Tantan.f@u.nus.eduRyan Teehanrsteehan@gmail.comMo Tiwarimotiwari@stanford.eduMarie Tolkiehnmarie.tolkiehn@desy.deAthena Wangwangathena68@yahoo.comZijian Wangzijwang@hotmail.comZijie Wangjayw@gatech.eduGloria Wanggwang1@imsa.eduFuxuan Weifuxuanwei@ir.hit.edu.cnBryan Wiliebryanwilie92@gmail.comGenta Indra Winatagiwinata@connect.ust.hkXinyu Wuxinyiwu.nlp@gmail.comWitold Wydmanskiwitold.wydmanski@uj.edu.plTianbao Xietianbaoxiexxx@gmail.comUsama Yaseenusama.yaseen@siemens.comMichael Yeemayee@engin.umich.eduJing Zhangjing.zhang2@emory.eduYue Zhangyue.zhang@wias.org.cn<div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>Data augmentation is an important method for evaluating the robustness of natural language processing (NLP) models and for enhancing the diversity of their training data. In this paper, we present NL-Augmenter, a new participatory Python-based natural language (NL) augmentation framework which supports the creation of transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of NL tasks annotated with noisy descriptive tags. The transformations incorporate noise, intentional and accidental human mistakes, socio-linguistic variation, semantically-valid style, syntax changes, as well as artificial constructs that are unambiguous to humans. We demonstrate the efficacy of NL-Augmenter by using its transformations to analyze the robustness of popular language models. We find different models to be differently challenged on different tasks, with quasi-systematic score decreases. 
The infrastructure, datacards, and robustness evaluation results are publicly available on GitHub for the benefit of researchers working on paraphrase generation, robustness analysis, and low-resource NLP.</p> <p><span style="font-size: 0.875rem; font-family: 'Noto Sans', 'Noto Kufi Arabic', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;">El aumento de datos es un método importante para evaluar la solidez y mejorar la diversidad del entrenamiento datos para modelos de procesamiento de lenguaje natural (NLP). इस लेख में, हम एनएल-ऑगमेंटर का प्रस्ताव करते हैं - एक नया भागी- दारी पूर्वक, पायथन में बनाया गया, लैंग्वेज (एनएल) ऑग्मेंटेशन फ्रेमवर्क जो ट्रांसफॉर्मेशन (डेटा में बदलाव करना) और फीलटर (फीचर्स के अनुसार डेटा का भाग करना) के नीरमान का समर्थन करता है।. 我们描述了NL-Augmenter框架及其初步包含的117种转换和23个过滤器,并 大致标注分类了一系列可适配的自然语言任务. </span><span style="font-size: 0.875rem; font-family: 'Noto Sans', 'Noto Kufi Arabic', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;">این دگرگونی ها شامل نویز، اشتباهات عمدی و تصادفی انسانی، تنوع اجتماعی-زبانی، سبک معنایی معتبر، تغییرات نحوی و همچنین ساختارهای مصنوعی است که برای انسان ها مبهم است. </span><span style="font-size: 0.875rem; font-family: 'Noto Sans', 'Noto Kufi Arabic', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;">NL-Augmenterpa allin kaynintam qawachiyku, tikrakuyninku- nata servichikuspayku, chaywanmi qawariyku modelos de lenguaje popular nisqapa allin takyasqa kayninta. Kami menemukan model yang berbeda ditantang secara berbeda pada tugas yang berbeda, dengan penurunan skor kuasi-sistematis. 
Infrastruktur, kartu data, dan hasil evaluasi ketahanan dipublikasikan tersedia secara gratis di GitHub untuk kepentingan para peneliti yang </span><span style="font-size: 0.875rem; font-family: 'Noto Sans', 'Noto Kufi Arabic', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, Oxygen-Sans, Ubuntu, Cantarell, 'Helvetica Neue', sans-serif;">mengerjakan pembuatan parafrase, analisis ketahanan, dan NLP sumber daya rendah.</span></p> <p> </p> </div> </div> </div>2023-04-08T00:00:00+02:00Copyright (c) 2023 Kaustubh Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahadiran, Simon Mille, Ashish Shrivastava, Samson Tan, Tongshang Wu, Jascha Sohl-Dickstein, Jinho D. Choi, Eduard Hovy, Ondřej Dušek, Sebastian Ruder, Sajant Anand, Nagender Aneja, Rabin Banjade, Lisa Barthe, Hanna Behnke, Ian Berlot-Attwell, Connor Boyle, Caroline Brun, Marco Antonio Sobrevilla Cabezudo, Samuel Cahyawijaya, Emile Chapuis, Wanxiang Che, Mukund Choudhary, Christian Clauss, Pierre Colombo, Filip Cornell, Gautier Dagan, Mayukh Das, Tanay Dixit, Thomas Dopierre, Paul-Alexis Dray, Suchitra Dubey, Tatiana Ekeinhor, Marco Di Giovanni, Tanya Goyal, Rishabh Gupta, Louanes Hamla, Sang Han, Fabrice Harel-Canada, Antoine Honoré, Ishan Jindal, Przemysław K. Joniak, Denis Kleyko, Venelin Kovatchev, Kalpesh Krishna, Ashutosh Kumar, Stefan Langer, Seungjae Ryan Lee, Corey James Levinson, Hualou Liang, Kaizhao Liang, Zhexiong Liu, Andrey Lukyanenko, Vukosi Marivate, Gerard de Melo, Simon Meoni, Maxine Meyer, Afnan Mir, Nafise Sadat Moosavi, Niklas Meunnighoff, Timothy Sum Hon Mun, Kenton Murray, Marcin Namysl, Maria Obedkova, Priti Oli, Nivranshu Pasricha, Jan Pfister, Richard Plant, Vinay Prabhu, Vasile Pais, Libo Qin, Shahab Raji, Pawan Kumar Rajpoot, Vikas Raunak, Roy Rinberg, Nicholas Roberts, Juan Diego Rodriguez, Claude Roux, Vasconcellos P. H. S., Ananya B. Sai, Robin M. Schmidt, Thomas Scialom, Tshephisho Sefara, Saqib N. 
Shamsi, Xudong Shen, Yiwen Shi, Haoyue Shi, Anna Shvets, Nick Siegel, Damien Sileo, Jamie Simon, Chandan Singh, Roman Sitelew, Priyank Soni, Taylor Sorensen, William Soto, Aman Srivastava, KV Aditya Srivatsa, Tony Sun, Mukund Varma T, A Tabassum, Fiona Anting Tan, Ryan Teehan, Mo Tiwari, Marie Tolkiehn, Athena Wang, Zijian Wang, Zijie J. Wang, Gloria Wang, Fuxuan Wei, Bryan Wilie, Genta Indra Winata, Xinyu Wu, Witold Wydmanski, Tianbao Xie, Usama Yaseen, Michael A. Yee, Jing Zhang, Yue Zhanghttps://nejlt.ep.liu.se/article/view/4617Foreword to NEJLT Volume 8, 20222023-01-12T06:13:18+01:00Leon Derczynskileon@nejlt.org<p>An introduction to the Northern European Journal of Language Technology in 2022</p>2022-12-31T00:00:00+01:00Copyright (c) 2022 Leon Derczynskihttps://nejlt.ep.liu.se/article/view/4561Prevention or Promotion? Predicting Author's Regulatory Focus2023-08-14T20:13:56+02:00Aswathy Velutharambathaswathy.velutharambath@100worte.deKai Sassenbergk.sassenberg@iwm-tuebingen.deRoman Klingerroman.klinger@ims.uni-stuttgart.de<div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>People differ fundamentally in what motivates them to pursue a goal and how they approach it. For instance, some people seek growth and show eagerness, whereas others prefer security and are vigilant. The concept of regulatory focus is employed in psychology, to explain and predict this goal-directed behavior of humans underpinned by two unique motivational systems – the promotion and the prevention system. Traditionally, text analysis methods using closed-vocabularies are employed to assess the distinctive linguistic patterns associated with the two systems. From an NLP perspective, automatically detecting the regulatory focus of individuals from text provides valuable insights into the behavioral inclinations of the author, finding its applications in areas like marketing or health communication. 
However, the concept never made an impactful debut in computational linguistics research. To bridge this gap, we introduce the novel task of regulatory focus classification from text and present two complementary German datasets – (1) experimentally generated event descriptions and (2) manually annotated short social media texts used for evaluating the generalizability of models on real-world data. First, we conduct a correlation analysis to verify whether, and to what extent, the linguistic footprints of regulatory focus reported in psychology studies are observable in our datasets. For automatic classification, we compare closed-vocabulary-based analyses with a state-of-the-art BERT-based text classification model and observe that the latter outperforms lexicon-based approaches on experimental data and is notably better on out-of-domain Twitter data.</p> </div> </div> </div>2023-09-15T00:00:00+02:00Copyright (c) 2023 Aswathy Velutharambath, Kai Sassenberg, Roman Klingerhttps://nejlt.ep.liu.se/article/view/4529Barriers and enabling factors for error analysis in NLG research2022-11-23T22:03:01+01:00Emiel van MiltenburgC.W.J.vanMiltenburg@tilburguniversity.eduMiruna Clinciumiruna.clinciu@gmail.comOndřej Dušekodusek@ufal.mff.cuni.czDimitra GkatziaD.Gkatzia@napier.ac.ukStephanie Inglisstephanie.inglis@arria.comLeo Leppänenleo.leppanen@helsinki.fiSaad MahamoodSaad.Mahamood@trivago.comStephanie Schochsns2gr@virginia.eduCraig Thomsonc.thomson@abdn.ac.ukLuou Wenluouwen97@gmail.com<p>Earlier research has shown that few studies in Natural Language Generation (NLG) evaluate their system outputs using an error analysis, despite known limitations of automatic evaluation metrics and human ratings. This position paper takes the stance that error analyses should be encouraged, and discusses several ways to do so. This paper is based on our shared experience as authors as well as a survey we distributed as a means of public consultation. 
We provide an overview of existing barriers to carrying out error analyses, and propose changes to improve error reporting in the NLG literature.</p>2023-02-21T00:00:00+01:00Copyright (c) 2023 Emiel van Miltenburg, Miruna Clinciu, Ondřej Dušek, Dimitra Gkatzia, Stephanie Inglis, Leo Leppänen, Saad Mahamood, Stephanie Schoch, Craig Thomson, Luou Wenhttps://nejlt.ep.liu.se/article/view/4453PARSEME Meets Universal Dependencies: Getting on the Same Page in Representing Multiword Expressions2022-12-01T02:42:43+01:00Agata Savaryagata.savary@universite-paris-saclay.frSara Stymnesara.stymne@lingfil.uu.seVerginica Barbu Mititeluvergi@racai.roNathan SchneiderNathan.Schneider@georgetown.eduCarlos Ramischcarlos.ramisch@lis-lab.frJoakim Nivrejoakim.nivre@lingfil.uu.se<p>Multiword expressions (MWEs) are challenging and pervasive phenomena whose idiosyncratic properties show notably at the levels of lexicon, morphology, and syntax. Thus, they should best be annotated jointly with morphosyntax. We discuss two multilingual initiatives, Universal Dependencies and PARSEME, addressing these annotation layers in cross-lingually unified ways. We compare the annotation principles of these initiatives with respect to MWEs, and we put forward a roadmap towards their gradual unification. 
The expected outcomes are more consistent treebanking and higher universality in modeling idiosyncrasy.</p>2023-02-21T00:00:00+01:00Copyright (c) 2023 Agata Savary, Sara Stymne, Verginica Barbu Mititelu, Nathan Schneider, Carlos Ramisch, Joakim Nivrehttps://nejlt.ep.liu.se/article/view/4462Spanish Abstract Meaning Representation: Annotation of a General Corpus2022-11-09T20:28:20+01:00Shira Weinsbmw15@gmail.comLucia Donatellidonatelli@coli.uni-saarland.deEthan Rickerear131@georgetown.eduCalvin Engstromcle41@georgetown.eduAlex Nelsonamn106@georgetown.eduLeonie Harterleonie-harter@web.deNathan Schneidernathan.schneider@georgetown.edu<div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>Abstract Meaning Representation (AMR), originally designed for English, has been adapted to a number of languages to facilitate cross-lingual semantic representation and analysis. We build on previous work and present the first sizable, general annotation project for Spanish AMR. We release a detailed set of annotation guidelines and a corpus of 486 gold-annotated sentences spanning multiple genres from an existing, cross-lingual AMR corpus. Our work constitutes the second-largest non-English gold AMR corpus to date. Fine-tuning an AMR-to-Spanish generation model with our annotations results in a BERTScore improvement of 8.8%, demonstrating the initial utility of our work.</p> </div> </div> </div>2022-11-23T00:00:00+01:00Copyright (c) 2022 Shira Wein, Lucia Donatelli, Ethan Ricker, Calvin Engstrom, Alex Nelson, Leonie Harter, Nathan Schneiderhttps://nejlt.ep.liu.se/article/view/4438Task-dependent Optimal Weight Combinations for Static Embeddings2022-08-12T11:05:47+02:00Nathaniel Robinsonnrrobins@cs.cmu.eduNathaniel Carlsonnatec18@byu.eduDavid Mortensendmortens@cs.cmu.eduElizabeth Vargaselizag17@byu.eduThomas Fackrelltfac1997@byu.eduNancy Fuldanfulda@cs.byu.edu<p>A variety of NLP applications use word2vec skip-gram, GloVe, and fastText word embeddings. 
These models learn two sets of embedding vectors, but most practitioners use only one of them, or alternatively an unweighted sum of both. This is the first study to systematically explore a range of linear combinations between the first and second embedding sets. We evaluate these combinations on a set of six NLP benchmarks including IR, POS-tagging, and sentence similarity. We show that the default embedding combinations are often suboptimal and demonstrate 1.0-8.0% improvements. Notably, GloVe’s default unweighted sum is its least effective combination across tasks. We provide a theoretical basis for weighting one set of embeddings more than the other according to the algorithm and task. We apply our findings to improve accuracy in applications of cross-lingual alignment and navigational knowledge by up to 15.2%.</p>2022-11-14T00:00:00+01:00Copyright (c) 2022 Nate Robinson, Nate Carlson, David Mortensen, Elizabeth Vargas, Thomas Fackrell, Nancy Fuldahttps://nejlt.ep.liu.se/article/view/4396An Empirical Configuration Study of a Common Document Clustering Pipeline2023-04-04T22:44:40+02:00Anton Eklundanton.eklund@cs.umu.seMona Forsmanmona.forsman@adlede.comFrank Drewesdrewes@cs.umu.se<div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>Document clustering is frequently used in applications of natural language processing, e.g., to classify news articles or create topic models. In this paper, we study document clustering with the common clustering pipeline that includes vectorization with BERT or Doc2Vec, dimension reduction with PCA or UMAP, and clustering with K-Means or HDBSCAN. We discuss the interactions of the different components in the pipeline, parameter settings, and how to determine an appropriate number of dimensions. The results suggest that BERT embeddings combined with UMAP dimension reduction to no fewer than 15 dimensions provide a good basis for clustering, regardless of the specific clustering algorithm used. 
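A minimal sketch of such a pipeline, using random vectors in place of real BERT document embeddings and PCA-via-SVD as a simple stand-in for UMAP (both substitutions are assumptions made for the sake of a self-contained example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for BERT document embeddings: 60 documents from 3 latent topics.
centers = rng.normal(scale=5.0, size=(3, 768))
docs = np.vstack([c + rng.normal(size=(20, 768)) for c in centers])

# Step 1 - dimension reduction to 15 dimensions (no fewer, per the finding
# above). PCA via SVD is used here as a stand-in for UMAP.
X = docs - docs.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
reduced = X @ Vt[:15].T

# Step 2 - clustering; a minimal K-Means (HDBSCAN is the other option studied).
def kmeans(X, k, iters=50):
    cent = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.linalg.norm(X[:, None] - cent[None], axis=2).argmin(axis=1)
        # Recompute centroids, keeping the old one if a cluster goes empty.
        cent = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                         else cent[j] for j in range(k)])
    return labels

labels = kmeans(reduced, k=3)
print(labels.shape)  # (60,)
```

In practice the random `docs` would be replaced by sentence-transformer or BERT document vectors, and `umap-learn` / `hdbscan` would slot into the same two steps.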
Moreover, while UMAP performed better than PCA in our experiments, tuning the UMAP settings showed little impact on the overall performance. Hence, we recommend configuring UMAP so as to optimize its time efficiency. According to our topic model evaluation, the combination of BERT and UMAP, also used in BERTopic, performs best. A topic model based on this pipeline typically benefits from a large number of clusters.</p> </div> </div> </div>2023-09-15T00:00:00+02:00Copyright (c) 2023 Anton Eklund, Mona Forsman, Frank Dreweshttps://nejlt.ep.liu.se/article/view/4361On the Relationship between Frames and Emotionality in Text2023-06-21T11:29:05+02:00Enrica Troianoenrica.troiano@ims.uni-stuttgart.deRoman Klingerroman.klinger@ims.uni-stuttgart.deSebastian Padópado@ims.uni-stuttgart.de<div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>Emotions, which are responses to salient events, can be realized in text implicitly, for instance with mere references to facts (e.g., “That was the beginning of a long war”). Interpreting affective meanings thus relies on the readers’ background knowledge, but that is hardly modeled in computational emotion analysis. Much work in the field is focused on the word level and treats individual lexical units as the fundamental emotion cues in written communication. We shift our attention to word relations. We leverage Frame Semantics, a prominent theory for the description of predicate-argument structures, which matches the study of emotions: frames build on a “semantics of understanding” whose assumptions rely precisely on people’s world knowledge. Our overarching question is whether and to what extent the events that are represented by frames possess an emotion meaning. To carry out a large corpus-based correspondence analysis, we automatically annotate texts with emotions as well as with FrameNet frames and roles, and we analyze the correlations between them. 
Our main finding is that substantial groups of frames have an emotional import. With an extensive qualitative analysis, we show that they capture several properties of emotions that are posited by theories from psychology. These observations yield insights into the two strands of research that we bring together: emotion analysis can profit from the event-based perspective of frame semantics; in return, frame semantics gains a better grip on its position vis-à-vis emotions, an integral part of word meanings.</p>2023-09-15T00:00:00+02:00Copyright (c) 2023 Enrica Troiano, Roman Klinger, Sebastian Padóhttps://nejlt.ep.liu.se/article/view/4315Part-of-Speech and Morphological Tagging of Algerian Judeo-Arabic2022-08-22T16:25:25+02:00Ofra Tirosh-Beckerotirosh@mail.huji.ac.ilMichal Kesslermichalskessler@gmail.comOren Beckerbecker.oren@gmail.comYonatan Belinkovbelinkov@technion.ac.il<p>Most linguistic studies of Judeo-Arabic, the ensemble of dialects spoken and written by Jews in Arab lands, are qualitative in nature, rely on laborious manual annotation work, and are therefore limited in scale. In this work, we develop automatic methods for morpho-syntactic tagging of Algerian Judeo-Arabic texts published by Algerian Jews in the 19th--20th centuries, based on a linguistically tagged corpus. First, we describe our semi-automatic approach for preprocessing these texts. Then, we experiment with both an off-the-shelf morphological tagger and several specially designed neural network taggers. Finally, we perform a real-world evaluation on new, previously untagged texts, comparing against human expert annotators. 
Our experimental results demonstrate that these methods can dramatically speed up and improve the linguistic research pipeline, enabling linguists to study these dialects on a much greater scale.</p>2022-12-14T00:00:00+01:00Copyright (c) 2022 Ofra Tirosh-Becker, Michal Kessler, Oren Becker, Yonatan Belinkovhttps://nejlt.ep.liu.se/article/view/4132Benchmark for Evaluation of Danish Clinical Word Embeddings2023-02-22T19:06:50+01:00Martin Sundahl Laursenmsla@mmmi.sdu.dkJannik Skyttegaard Pedersenjasp@mmmi.sdu.dkPernille Just Vinholtpernille.vinholt@rsyd.dkRasmus Søgaard Hansenrasmus.sogaard.hansen@rsyd.dkThiusius Rajeeth Savarimuthutrs@mmmi.sdu.dk<div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>In natural language processing, benchmarks are used to track progress and identify useful models. Currently, no benchmark for Danish clinical word embeddings exists. This paper describes the development of a Danish benchmark for clinical word embeddings. The clinical benchmark consists of ten datasets: eight intrinsic and two extrinsic. Moreover, we evaluate word embeddings trained on text from the clinical domain, general practitioner domain and general domain on the established benchmark. All the intrinsic tasks of the benchmark are publicly available.</p> </div> </div> </div>2023-03-01T00:00:00+01:00Copyright (c) 2023 Martin Sundahl Laursen, Jannik Skyttegaard Pedersen, Pernille Just Vinholt, Rasmus Søgaard Hansen, Thiusius Rajeeth Savarimuthuhttps://nejlt.ep.liu.se/article/view/4017Building Analyses from Syntactic Inference in Local Languages: An HPSG Grammar Inference System2022-04-06T16:42:10+02:00Kristen Howellkjpiepgrass@gmail.comEmily M. 
Benderebender@uw.edu<p>We present a grammar inference system that leverages linguistic knowledge recorded in the form of annotations in interlinear glossed text (IGT) and in a meta-grammar engineering system (the LinGO Grammar Matrix customization system) to automatically produce machine-readable HPSG grammars. Building on prior work to handle the inference of lexical classes, stems, affixes and position classes, and preliminary work on inferring case systems and word order, we introduce an integrated grammar inference system that covers a wide range of fundamental linguistic phenomena. System development was guided by 27 genealogically and geographically diverse languages, and we test the system's cross-linguistic generalizability on an additional 5 held-out languages, using datasets provided by field linguists. Our system outperforms three baseline systems, increasing coverage while limiting ambiguity, and produces richer semantic representations than previous work in grammar inference.</p>2022-07-01T00:00:00+02:00Copyright (c) 2022 Kristen Howell, Emily M. Benderhttps://nejlt.ep.liu.se/article/view/38746 Questions for Socially Aware Language Technologies2021-06-28T17:35:35+02:00Diyi Yangdiyi.yang@cc.gatech.edu<p>Over the last few decades, natural language processing (NLP) has dramatically improved performance and produced industrial applications like personal assistants. Despite being sufficient to enable these applications, current NLP systems largely ignore the social part of language. This severely limits the functionality and growth of these applications. 
This work discusses six questions on how to build socially aware language technologies, with the hope of inspiring more research into Social NLP and pushing our research field to the next level.</p>2021-07-01T00:00:00+02:00Copyright (c) 2022 Diyi Yanghttps://nejlt.ep.liu.se/article/view/3566Lexical variation in English language podcasts, editorial media, and social media2022-05-16T18:28:31+02:00Jussi Karlgrenjussi@lingvi.st<div class="page" title="Page 1"> <div class="layoutArea"> <div class="column"> <p>The study presented in this paper demonstrates how transcribed podcast material differs with respect to lexical content from other collections of English language data: editorial text, social media, both long form and microblogs, dialogue from movie scripts, and transcribed phone conversations. Most of the recorded differences are as might be expected, reflecting known or assumed differences between spoken and written language, between dialogue and soliloquy, and between scripted formal and unscripted informal language use. Most notably, podcast material, compared to the hitherto typical training sets from editorial media, is characterised by use of the present tense and a much higher incidence of pronouns, interjections, and negations. These characteristics are, unsurprisingly, largely shared with social media texts. Where podcast material differs from social media material is in its attitudinal content, with many more amplifiers and much less negative attitude than in blog texts. This variation, besides being of philological interest, has ramifications for computational work. Information access for material which is not primarily topical should be designed to be sensitive to such variation, which defines the data set itself and discriminates items within it. 
In general, the training set for a language model is a non-trivial parameter that is likely to produce both expected and unexpected effects when the model is applied to data from other sources; the characteristics and provenance of the data used to train a model should therefore be listed on the label as a minimal form of downstream consumer protection.</p>2022-08-11T00:00:00+02:00Copyright (c) 2022 Jussi Karlgrenhttps://nejlt.ep.liu.se/article/view/3505Bias Identification and Attribution in NLP Models With Regression and Effect Sizes2022-06-13T20:36:56+02:00Erenay Dayanikerenay.dayanik@ims.uni-stuttgart.deNgoc Thang Vuthang.vu@ims.uni-stuttgart.deSebastian Padópado@ims.uni-stuttgart.de<p>In recent years, there has been an increasing awareness that many NLP systems incorporate biases of various types (e.g., regarding gender or race) which can have significant negative consequences. At the same time, the techniques used to statistically analyze such biases are still relatively simple. Typically, studies test for the presence of a significant difference between two levels of a single bias variable (e.g., male vs. female) without attention to potential confounders, and do not quantify the importance of the bias variable. This article proposes to analyze bias in the output of NLP systems using multivariate regression models. They provide a robust and more informative alternative which (a) generalizes to multiple bias variables, (b) can take covariates into account, and (c) can be combined with measures of effect size to quantify the size of bias. Jointly, these properties contribute to a more robust statistical analysis of bias that can be used to diagnose system behavior and extract informative examples. 
We demonstrate the benefits of our method by analyzing a range of current NLP models on one regression and one classification task (emotion intensity prediction and coreference resolution, respectively).</p>2022-08-11T00:00:00+02:00Copyright (c) 2022 Erenay Dayanik, Thang Vu, Sebastian Padóhttps://nejlt.ep.liu.se/article/view/3478Contextualized embeddings for semantic change detection: Lessons learned2022-02-04T17:07:53+01:00Andrey Kutuzovandreku@ifi.uio.noErik Velldalerikve@ifi.uio.noLilja Øvrelidliljao@ifi.uio.no<p>We present a qualitative analysis of the (potentially erroneous) outputs of contextualized embedding-based methods for detecting diachronic semantic change. First, we introduce an ensemble method outperforming previously described contextualized approaches. This method is used as a basis for an in-depth analysis of the degrees of semantic change predicted for English words across 5 decades. Our findings show that contextualized methods can often predict high change scores for words which are not undergoing any real diachronic semantic shift in the lexicographic sense of the term (or at least the status of these shifts is questionable). Such challenging cases are discussed in detail with examples, and their linguistic categorization is proposed. Our conclusion is that pre-trained contextualized language models are prone to confounding changes in lexicographic senses with changes in contextual variance; this behaviour stems naturally from their distributional nature, but differs from the types of issues observed in methods based on static embeddings. Additionally, they often merge together syntactic and semantic aspects of lexical entities. 
We propose a range of possible future solutions to these issues.</p>2022-08-26T00:00:00+02:00Copyright (c) 2022 Andrey Kutuzov, Erik Velldal, Lilja Øvrelidhttps://nejlt.ep.liu.se/article/view/3454Policy-focused Stance Detection in Parliamentary Debate Speeches2022-05-05T10:53:56+02:00Gavin Abercrombiegavin.abercrombie@manchester.ac.ukRiza Batista-Navarroriza.batista@manchester.ac.uk<p>Legislative debate transcripts provide citizens with information about the activities of their elected representatives, but are difficult for people to process. We propose the novel task of policy-focused stance detection, in which both the policy proposals under debate and the position of the speakers towards those proposals are identified. We adapt a previously existing dataset to include manual annotations of policy preferences, an established schema from political science. We evaluate a range of approaches to the automatic classification of policy preferences and speech sentiment polarity, including transformer-based text representations and a multi-task learning paradigm. We find that it is possible to identify the policies under discussion using features derived from the speeches, and that incorporating motion-dependent debate modelling, previously used to classify speech sentiment, also improves performance in the classification of policy preferences. 
We analyse the output of the best-performing system, finding that discriminating features for the task are highly domain-specific, and that speeches that address policy preferences proposed by members of the same party can be among the most difficult to predict.</p>2022-07-01T00:00:00+02:00Copyright (c) 2022 Gavin Abercrombie, Riza Batista-Navarrohttps://nejlt.ep.liu.se/article/view/3128Crowdsourcing Relative Rankings of Multi-Word Expressions: Experts versus Non-Experts2021-05-11T09:51:13+02:00David Alfterdavid.alfter@svenska.gu.seTherese Lindström Tiedemanntherese.lindstromtiedemann@helsinki.fiElena Volodinaelena.volodina@svenska.gu.se<p>In this study we investigate to what degree experts and non-experts agree on questions of linguistic complexity in a crowdsourcing experiment. We ask non-experts (second language learners of Swedish) and two groups of experts (teachers of Swedish as a second/foreign language and CEFR experts) to rank multi-word expressions. We find that the resulting rankings by all three tested groups correlate to a very high degree, which suggests that judgments produced in a comparative setting are not influenced by professional insights into Swedish as a second language.</p>2022-07-01T00:00:00+02:00Copyright (c) 2021 David Alfter, Therese Lindström Tiedemann, Elena Volodinahttps://nejlt.ep.liu.se/article/view/1665Special Issue of Selected Contributions from the Seventh Swedish Language Technology Conference (SLTC 2018)2020-05-10T10:55:52+02:00Hercules Dalianishercules@dsv.su.seRobert Östlingrobert@ling.su.seRebecka Weegarrebeckaw@dsv.su.seMats Wirénmats.wiren@ling.su.se<p>This Special Issue contains three papers that are extended versions of abstracts presented at the Seventh Swedish Language Technology Conference (SLTC 2018), held at Stockholm University 8-9 November 2018. SLTC 2018 received 34 submissions, of which 31 were accepted for presentation. 
The number of registered participants was 113, including attendees at both SLTC 2018 and the two co-located workshops that took place on 7 November. Thirty-two participants were internationally affiliated, of whom 14 were from outside the Nordic countries. Overall participation was thus on a par with previous editions of SLTC, but international participation was higher.</p>2019-12-20T00:00:00+01:00Copyright (c) 2019 Hercules Dalianis, Robert Östling, Rebecka Weegar, Mats Wirénhttps://nejlt.ep.liu.se/article/view/1662Low-Resource Active Learning of Morphological Segmentation2020-05-10T10:56:08+02:00Stig-Arne Grönroosstig-arne.gronroos@aalto.fiKatri Hiovainkatri.hiovain@helsinki.fiPeter Smitpeter.smit@aalto.fiIlona Rauhalailona.rauhala@helsinki.fiKristiina Jokinenkristiina.jokinen@helsinki.fiMikko Kurimomikko.kurimo@aalto.fiSami Virpiojasami.virpioja@aalto.fi<p>Many Uralic languages have a rich morphological structure, but lack the morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small number of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word-initial and word-final substrings. For North Sámi we collect a set of human-annotated data. 
With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.</p>2016-03-13T00:00:00+01:00Copyright (c) 0 https://nejlt.ep.liu.se/article/view/1660Utilizing Language Technology in the Documentation of Endangered Uralic Languages2020-05-10T10:56:10+02:00Ciprian Gerstenbergerciprian.gerstenberger@uit.noNiko Partanenniko.partanen@uni-hamburg.deMichael Rießlermichael.riessler@skandinavistik.uni-freiburg.deJoshua Wilburjoshua.wilbur@skandinavistik.uni-freiburg.de<p>The paper describes work-in-progress by the Pite Saami, Kola Saami and Izhva Komi language documentation projects, all of which record new spoken language data, digitize available recordings and annotate these multimedia data in order to provide comprehensive language corpora as databases for future research on and for endangered – and under-described – Uralic speech communities. Applying language technology in language documentation helps us to create more systematically annotated corpora, rather than eclectic data collections. Specifically, we describe a script providing interactivity between different morphosyntactic analysis modules implemented as Finite State Transducers and ELAN, a Graphical User Interface tool for annotating and presenting multimodal corpora. Ultimately, the spoken corpora created in our projects will be useful for scientifically significant quantitative investigations on these languages in the future.</p>2016-03-13T00:00:00+01:00Copyright (c) 2016 Ciprian Gerstenberger, Niko Partanen, Michael Rießler, Joshua Wilburhttps://nejlt.ep.liu.se/article/view/1659A North Saami to South Saami Machine Translation Prototype2020-05-10T10:56:12+02:00Lene Antonsenlene.antonsen@uit.noTrond Trosterudtrond.trosterud@uit.noFrancis M. Tyersfrancis.tyers@uit.no<p>The paper describes a rule-based machine translation (MT) system from North to South Saami. 
The system is designed for a workflow where North Saami functions as a pivot language in translation from Norwegian or Swedish. We envisage manual translation from Norwegian or Swedish to North Saami, and thereafter MT to South Saami. The system was aimed at a single domain, that of texts for use in school administration. We evaluated the system in terms of the quality of translations for postediting. Two out of three of the Norwegian to South Saami professional translators found the output of the system to be useful. The evaluation shows that it is possible to make a functioning rule-based system with a small transfer lexicon and a small number of rules and achieve results that are useful for a restricted domain, even if there are substantial differences between the languages.</p>2016-03-13T00:00:00+01:00Copyright (c) 2016 Lene Antonsen, Trond Trosterud, Francis M. Tyershttps://nejlt.ep.liu.se/article/view/1657Foreword to the Special Issue on Uralic Languages2020-05-10T10:58:52+02:00Tommi A Pirinentommi.antero.pirinen@uni-hamburg.deTrond Trosterudtrond.trosterud@uit.noFrancis M. Tyersfrancis.tyers@uit.noVeronika Vinczevinczev@inf.u-szeged.huEszter Simonsimon.eszter@nytud.mta.huJack Rueterjack.rueter@helsinki.fi<p>In this introduction we have tried to present concisely the history of language technology for Uralic languages up until today, together with some desiderata explaining why we organised this special issue. It is of course not possible to cover everything that has happened in a short introduction like this. We have attempted to cover the beginnings of the (Uralic) language-technology scene in the 1980s as far as it is relevant to much of the current work, including the work presented in this issue. We also go through the Uralic area by the main languages to survey existing resources, and to form a systematic overview of what is missing. 
Finally, we discuss some possible future directions for language technology management at the pan-Uralic level.</p>2016-03-27T00:00:00+01:00Copyright (c) 0 https://nejlt.ep.liu.se/article/view/1656SUC-CORE: A Balanced Corpus Annotated with Noun Phrase Coreference2020-05-10T10:56:16+02:00Kristina Nilsson Björkenstamkristina.nilsson@ling.su.se<p>This paper describes SUC-CORE, a subset of the Stockholm-Umeå Corpus and the Swedish Treebank annotated with noun phrase coreference. While most coreference-annotated corpora consist of texts of similar types within related domains, SUC-CORE consists of both informative and imaginative prose and covers a wide range of literary genres and domains. This allows for exploration of coreference across different text types, but it also means that there are limited amounts of data within each type. Future work on coreference resolution for Swedish should include making more annotated data available for the research community.</p>2013-09-16T00:00:00+02:00Copyright (c) 0 https://nejlt.ep.liu.se/article/view/1655Investigations of Synonym Replacement for Swedish2020-02-06T14:39:17+01:00Robin Keskisärkkärobin.keskisarkka@liu.seArne Jönssonarnjo@ida.liu.se<p>We present results from an investigation on automatic synonym replacement for Swedish. Three different methods for choosing alternative synonyms were evaluated: (1) based on word frequency, (2) based on word length, and (3) based on level of synonymy. These three strategies were evaluated in terms of standardized readability metrics for Swedish, average word length, proportion of long words, and the ratio of errors to replacements. 
The results show an improvement in readability for most strategies, but also show that erroneous substitutions are frequent.</p>2013-12-19T00:00:00+01:00Copyright (c) 0 https://nejlt.ep.liu.se/article/view/1653Stagger: an Open-Source Part of Speech Tagger for Swedish2020-05-10T10:56:18+02:00Robert Östlingrobert@ling.su.se<p>This work presents Stagger, a new open-source part of speech tagger for Swedish based on the Averaged Perceptron. By using the SALDO morphological lexicon and semi-supervised learning in the form of Collobert and Weston embeddings, it reaches an accuracy of 96.4% on the standard Stockholm-Umeå Corpus dataset, making it the best single part of speech tagging system reported for Swedish. Accuracy increases to 96.6% on the latest version of the corpus, where the annotation has been revised to increase consistency. Stagger is also evaluated on a new corpus of Swedish blog posts, investigating its out-of-domain performance. 
We also find that the use of non-adjacent arc transitions may lead to a drop in accuracy on projective dependencies in the presence of long-distance non-projective dependencies, an effect that is not found for the two other techniques.</p>2010-10-01T00:00:00+02:00Copyright (c) 0 https://nejlt.ep.liu.se/article/view/1650Named Entity Recognition in Bengali 2020-05-10T10:56:24+02:00Asif Ekbalekbal@cl.uni-heidelberg.deSivaji Bandyopadhyayasif.ekbal@gmail.com<p>This paper reports on a multi-engine approach for the development of a Named Entity Recognition (NER) system in Bengali that combines classifiers such as Maximum Entropy (ME), Conditional Random Field (CRF) and Support Vector Machine (SVM) with the help of weighted voting techniques. The training set consists of approximately 272K wordforms, out of which 150K wordforms have been manually annotated with the four major named entity (NE) tags, namely Person name, Location name, Organization name and Miscellaneous name. An appropriate tag conversion routine has been defined in order to convert the 122K wordforms of the IJCNLP-08 NER Shared Task on South and South East Asian Languages (NERSSEAL) data into the desired forms. The individual classifiers make use of different contextual information about the words along with a variety of features that are helpful in predicting the various NE classes. Lexical context patterns, generated from an unlabeled corpus of 3 million wordforms in a semi-automatic way, have been used as features of the classifiers in order to improve their performance. In addition, we propose a number of techniques to post-process the output of each classifier in order to reduce errors and improve performance further. Finally, we use three weighted voting techniques to combine the individual models. 
Experimental results show the effectiveness of the proposed multi-engine approach, with overall Recall, Precision and F-Score values of 93.98%, 90.63% and 92.28%, respectively; this represents an improvement of 14.92% in F-Score over the best-performing baseline SVM-based system and an improvement of 18.36% in F-Score over the least-performing baseline ME-based system. Comparative evaluation results also show that the proposed system outperforms the three other existing Bengali NER systems.</p>2010-02-02T00:00:00+01:00Copyright (c) 0 https://nejlt.ep.liu.se/article/view/1649Entry Generation by Analogy – Encoding New Words for Morphological Lexicons2020-05-10T10:56:26+02:00Krister Lindénkrister.linden@helsinki.fi<p>Language software applications encounter new words, e.g., acronyms, technical terminology, loan words, names or compounds of such words. To add new words to a lexicon, we need to indicate their base form and inflectional paradigm. In this article, we evaluate a combination of corpus-based and lexicon-based methods for assigning the base form and inflectional paradigm to new words in Finnish, Swedish and English finite-state transducer lexicons. The methods have been implemented with the open-source Helsinki Finite-State Technology (Lindén et al., 2009). As an entry generator often produces numerous suggestions, it is important that the best suggestions be among the first few; otherwise it may become more efficient to create the entries by hand. By combining the probabilities calculated from corpus data and from lexical data, we get a more precise combined model. The combined method has 77-81% precision and 89-97% recall, i.e., the first correctly generated entry is on average found as the first or second candidate for the test languages. 
A further study demonstrated that a native speaker could revise suggestions from the entry generator at a speed of 300-400 entries per hour.</p>2009-05-18T00:00:00+02:00Copyright (c) 2009 Krister Lindénhttps://nejlt.ep.liu.se/article/view/1374The SweLL Language Learner Corpus2020-05-10T10:55:55+02:00Elena Volodinaelena.volodina@svenska.gu.seLena Granstedtlena.granstedt@umu.seArild Matssonarild.matsson@gu.seBeáta Megyesibeata.megyesi@lingfil.uu.seIldikó Pilánildiko.pilan@gmail.comJulia Prenticejulia.prentice@svenska.gu.seDan Roséndan.rosen@svenska.gu.seLisa Rudebecklisa.rudebeck@su.seCarl-Johan Schenströmcarl-johan.schenstrom@gu.seGunlög Sundberggunlog.sundberg@su.seMats Wirénmats.wiren@ling.su.se<p>The article presents a new language learner corpus for Swedish, SweLL, and the methodology behind it, from collection and pseudonymisation (to protect the personal information of learners) to annotation adapted to second language learning. The main aim is to deliver a well-annotated corpus of essays written by second language learners of Swedish and make it available for research through a browsable environment. To that end, a new annotation tool and a new project management tool have been implemented, both with the main purpose of ensuring the reliability and quality of the final corpus. 
In the article, we discuss the reasoning behind the metadata selection and the principles of gold corpus compilation, and argue for separating normalization from correction annotation.</p>2019-12-20T00:00:00+01:00Copyright (c) 2019 Elena Volodina, Lena Granstedt, Arild Matsson, Beáta Megyesi, Ildikó Pilán, Julia Prentice, Dan Rosén, Lisa Rudebeck, Carl-Johan Schenström, Gunlög Sundberg, Mats Wirénhttps://nejlt.ep.liu.se/article/view/1037The Interplay Between Loss Functions and Structural Constraints in Dependency Parsing2020-05-10T10:55:58+02:00Robin Kurtzrobin.kurtz@liu.seMarco Kuhlmannmarco.kuhlmann@liu.se<p>Dependency parsing can be cast as a combinatorial optimization problem with the objective of finding the highest-scoring graph, where edge scores are learnt from data. Several of the decoding algorithms that have been applied to this task employ structural restrictions on candidate solutions, such as the restriction to projective dependency trees in syntactic parsing, or the restriction to noncrossing graphs in semantic parsing. In this paper we study the interplay between structural restrictions and a common loss function in neural dependency parsing, the structural hinge loss. We show how structural constraints can make networks trained under this loss function diverge and propose a modified loss function that solves this problem. Our experimental evaluation shows that the modified loss function can yield improved parsing accuracy, compared to the unmodified baseline. 
We argue that neither SAG as is, nor any of the existing part-of-speech tagsets, meets our requirements for a broadly applicable categorization. Our proposal is outlined and compared to the other descriptions, and motivations both for the tagset as a whole and for decisions about individual tags are discussed.</p>2019-12-20T00:00:00+01:00Copyright (c) 2019 Yvonne Adesam, Gerlof Boumahttps://nejlt.ep.liu.se/article/view/218Part of Speech Tagging: Shallow or Deep Learning?2020-05-10T10:56:02+02:00Robert Östlingrobert@ling.su.se<p>Deep neural networks have advanced the state of the art in numerous fields, but they generally suffer from low computational efficiency, and the level of improvement compared to more efficient machine learning models is not always significant. We perform a thorough PoS tagging evaluation on the Universal Dependencies treebanks, pitting a state-of-the-art neural network approach against UDPipe and our sparse structured perceptron-based tagger, efselab. In terms of computational efficiency, efselab is three orders of magnitude faster than the neural network model, while being more accurate than either of the other systems on 47 of 65 treebanks.