Policy-focused Stance Detection in Parliamentary Debate Speeches

Legislative debate transcripts provide citizens with information about the activities of their elected representatives, but are difficult for people to process. We propose the novel task of policy-focused stance detection, in which both the policy proposals under debate and the position of the speakers towards those proposals are identified. We adapt a previously existing dataset to include manual annotations of policy preferences, an established schema from political science. We evaluate a range of approaches to the automatic classification of policy preferences and speech sentiment polarity, including transformer-based text representations and a multi-task learning paradigm. We find that it is possible to identify the policies under discussion using features derived from the speeches, and that incorporating motion-dependent debate modelling, previously used to classify speech sentiment, also improves performance in the classification of policy preferences. We analyse the output of the best performing system, finding that discriminating features for the task are highly domain-specific, and that speeches that address policy preferences proposed by members of the same party can be among the most difficult to predict.


Introduction
Transcripts of legislative debates provide access to information concerning the policies that are publicly supported or opposed by politicians. They are of interest to political scientists, the media, the politicians themselves, and citizens who wish to monitor the activities of their representatives.
However, such documents are complex and difficult for people to process. Transcripts of debates in the United Kingdom (UK) Parliament are so hard for ordinary people to make sense of that the parliamentary monitoring website www.theyworkforyou.com publishes manually annotated versions of the transcripts. These include crowd-sourced explanations of the debated proposals, as well as policy-focused aggregations of the voting records of parliamentarians. The large quantity and esoteric nature of the data in the parliamentary record (known as Hansard) motivate the need for automatic analysis of its contents.
Previous work in the domain of legislative debate transcripts has focused on either (a) sentiment polarity classification (Bhavan et al., 2019; Burfoot et al., 2011; Thomas et al., 2006), or (b) policy identification (Abercrombie and Batista-Navarro, 2018b; Abercrombie et al., 2019) in isolation. As far as we are aware, these two tasks have not previously been combined in this domain, despite the fact that: (1) the information yielded is complementary, and perhaps even necessary, for practical use (i.e., without analysis of debated policies, the target of sentiment in the speeches is unknown); and (2) these two tasks rely on features derived from shared information, which could assist with the learning of parameters for both tasks in a multi-task learning setting.
Borrowing the concept of policy preferences from political science, we compare approaches to automatically determining the policy preference that is under discussion in each debate, and whether each speaker supports or opposes it.
Our contributions Building on the work of Abercrombie et al. (2019); Abercrombie and Batista-Navarro (2020), we combine policy preference identification and speech-level sentiment polarity analysis to formulate the task of policy-focused speech stance detection for the domain of legislative debate speeches, in which the position of each speaker in a debate is identified in relation to the proposal under discussion. Unlike prior work, we thus obtain interpretable analysis of the positions taken by MPs with respect to the policies presented in parliamentary debates.
To this end, we add a set of manually annotated policy preference labels to a large existing English language corpus of UK parliamentary debates, creating the first dataset to be labelled with both topics (policy preferences) and positions (sentiment) in this domain. We make the enhanced corpus available to the research community.
We use this dataset for the evaluation of approaches to the classification of policy-focused speaker stance. We test classification systems comprising combinations of single-and multi-task learning paradigms, different debate structure models, and varying approaches to text representation and machine learning methods. Our results represent initial benchmarks for this task.
Research questions In this paper, we address the following questions: RQ1 To what extent do humans agree on the policy preference labelling task? We compare agreement between our annotations with those reported in previous work in both political science (Lacewell and Werner, 2013; Mikhaylov et al., 2008) and natural language processing (Abercrombie et al., 2019). The latter found that agreement was comparable for labels applied to debate motions and the manifestos for which the scheme was originally designed, a finding which we re-examine on this new dataset. Hypothesis H1: Policy preference labels are as reliable for debate motions as party-political election manifestos.
RQ2 How well do machine learning classifiers perform on the combined task of policy-focused stance detection? We test a number of approaches against a majority class baseline. These include fine-tuning pre-trained contextual word embeddings, which we compare to a simple bag-of-words model, and a multi-task learning approach designed to take advantage of mutually beneficial information, which we compare to tackling the constituent tasks independently. Hypothesis H2a: Classification of policy-focused stance will benefit from use of contextual BERT embeddings. Hypothesis H2b: Classification of policy-focused stance will benefit from concurrent classification of policy preferences and speaker sentiment using a multi-task approach.

Background
House of Commons debates As the superior legislative chamber in the UK Parliament, the House of Commons (HoC) draws the attention of the public, the media, and the academic sector, and was therefore chosen as the focus of this study. Debates in the HoC consist of an opening motion (proposal), the content of which usually does not provide clues to the policy that is proposed (see, for example, Figure 1a). We found 75.8 per cent of debate motions in the corpus to contain insufficient information to manually determine a policy preference.
A number of Members of Parliament (MPs) then respond to the motion, when invited to do so by the Speaker (the chief presiding officer of the House). An individual MP may make multiple utterances during a given debate. Following previous work (Abercrombie and Batista-Navarro, 2020; Salah, 2014; Thomas et al., 2006), we consider a speech to be the concatenation of all their utterances in that debate. In many cases, the motion is voted on by MPs in a division. As in previous work, we use the record of these votes as labels for sentiment and stance polarity classification.

Figure 1: Examples from TheyWorkForYou of (a) a debate motion labelled by an annotator with code 110: European Union: Negative; and two utterances made in response to the motion by speakers who voted (b) aye (support) and (c) no (oppose).
Policy preferences The concept of policy preferences is widely used in political science (Budge et al., 2001) to categorize the positions of politicians. The Manifesto Project (MARPOR: https:// manifestoproject.wzb.eu) have developed a set of policy preference codes organised under seven 'domains'. The current coding scheme comprises 74 policy preference codes, almost all of which are 'positional', encoding a positive or negative position towards a policy issue (Mikhaylov et al., 2008). We use these codes as labels for the policy preferences expressed in the debate motions. In the example in Figure 1a, the policy preference label applied to this debate by annotators (see §4.1) is 110: European Union: Negative.
Sentiment and stance detection While use of terminology varies and overlaps in the literature, stance detection can be viewed as a form of sentiment classification. From this perspective, it consists of determining the sentiment polarity of a piece of text towards a predetermined 'given target of interest' (Mohammad et al., 2016). In the case of parliamentary debates, for each example speech, we seek to determine (1) the nature of its target-the policy preference under debate-and (2) the position or sentiment expressed towards it-support or opposition. We consider the combined policy preference and speech sentiment labels to represent the speaker's stance on a particular policy. For instance, in the example in Figure 1, the stances of speech extracts (b) and (c) are European Union: Negative-support and European Union: Negative-oppose, respectively. Sentiment polarity classification of this kind has previously been applied to transcripts of debates in the US Congress (Burfoot et al., 2011; Thomas et al., 2006), and the UK Parliament (Abercrombie and Batista-Navarro, 2018b, 2020; Bhavan et al., 2019; Salah, 2014; Sawhney et al., 2020). In these works-and in common with ours-speaker sentiment is assumed to be analogous to vote outcome. However, in the task undertaken in these previous works, the nature of the targets-the Bills or motions under debate-is not identified. The related task of stance detection-in which the target of sentiment is (pre-)determined-has been applied to such domains as social media (e.g. Augenstein et al., 2016a,b; Hardalov et al., 2021; Li et al., 2021; Mohammad et al., 2016), online debate forums (e.g. Hardalov et al., 2021; Hasan and Ng, 2013; Somasundaran and Wiebe, 2010; Sridhar et al., 2015), and news articles (Ferreira and Vlachos, 2016; Schiller et al., 2021). For a recent survey, see Küçük and Can (2020).
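As a minimal illustration of how the two label types compose into a single stance label, consider the following sketch (the helper function and vote strings are hypothetical; the label text follows the Figure 1 example):

```python
# Compose a policy-focused stance label from a debate-level policy
# preference and a speech-level vote. The vote values 'aye'/'no'
# mirror HoC divisions; the function name is illustrative only.
def stance_label(policy_preference: str, vote: str) -> str:
    sentiment = {"aye": "support", "no": "oppose"}[vote]
    return f"{policy_preference}-{sentiment}"

print(stance_label("110: European Union: Negative", "aye"))
# -> 110: European Union: Negative-support
```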

Related work
In most of this work the target is pre-chosen by the user or the system. In the political domain, this has been framed as agreement detection, in which two pieces of text are compared (Menini and Tonelli, 2016; Menini et al., 2017), or as classification of support or attack towards pre-defined policies (Menini et al., 2018). While Vamvas and Sennrich (2020) carry out stance detection on the positions expressed by Swiss politicians, they do not perform automatic identification of the policies discussed, only conducting binary in favour/against classification in a similar vein to the sentiment/position classification work discussed above.
More similarly to this work, Bar-Haim et al. (2017) used a supervised approach to identify both the stances of extracts from Wikipedia articles and the targets of those stances from a closed list of 'controversial topics'. However, this labelling scheme does not cover the policy positions proposed in parliament.
A common framework for stance detection is the SDQC (Support-Deny-Query-Comment) annotation scheme of Zubiaga et al. (2016). While potentially suitable for our data (support and deny are equivalent to our support and oppose labels), application of this framework would require manual annotation of each instance in the dataset with the more fine-grained labels. Instead, we follow the majority of work on legislative debates (e.g. Abercrombie and Batista-Navarro, 2018a; Thomas et al., 2006; Salah, 2014) in taking advantage of pre-existing vote-derived binary labels at the speech level, thus only requiring the addition of policy preference labels for each debate.
In most of the reviewed work, stance targets are explicitly selected by the authors of the task (e.g. Donald Trump (Augenstein et al., 2016a,b), Richard Nixon and John F. Kennedy (Menini et al., 2018), or atheism (Mohammad et al., 2016)). Unlike these, we frame target selection as a multiclass topic classification problem, making use of an existing schema validated by political scientists.
Document classification is an active area of research for tasks such as identification of news and Wikipedia categories (Zhang et al., 2015). For classification of HoC debates, Abercrombie and Batista-Navarro (2018b) used 'policy' labels crowdsourced by the parliamentary monitoring website https://www.publicwhip.org.uk/ but found this framework limited as it could not be easily scaled up from the small existing labelled dataset. Abercrombie et al. (2019) created a manually annotated dataset of policy preferences in debate motions, and achieved promising results in classifying debate motions according to the MARPOR coding scheme. However, this corpus is unsuitable for our purposes as: (1) it does not include speeches made in response to the motions; and (2) the motions in this dataset are all substantive-that is, they 'express an opinion about something' (Rogers and Walters, 2015)-and tend to be of a highly partisan nature, leading to debates in which the stance of MPs can be trivially predicted from their party affiliations. For this study, we seek a mixture of motion types, more representative of the Hansard record as a whole. Additionally, while they classified debate motions with policy preference labels using textual features derived from the motions themselves, many of the motions in Hansard-and in the corpus used in this study-contain little in the way of informative textual content (Figure 1a is a typical example). Rather than the motions, we therefore rely on features derived from the response speeches, which we use as input for the classification of both motions and speeches.
Multi-task learning approaches have been applied to many tasks, including part-of-speech tagging, chunking, and named entity recognition (Collobert and Weston, 2008). While such approaches have been applied to sentiment classification of customer reviews (Yu and Jiang, 2016), we are not aware of any uses of multi-task learning in the legislative debate domain. The most common approach to multi-task learning-which we compare with the single-task paradigm-is that of hard parameter sharing, first proposed by Caruana (1993).

Data
ParlVote (Abercrombie and Batista-Navarro, 2020) is a large corpus (34,010 examples) of HoC debate speeches made between 1997 and 2019. Each example speech consists of the concatenated utterances of an individual speaker in a given debate, and is presented with the debate motion to which it responds, as well as the vote of the speaker (in support or opposition to the motion), and metadata associated with the debate and the speakers. We adapted this corpus to include an additional, manually annotated policy preference label for each example. As capitalization can be informative in this domain (for example in the terms of address 'Friend', 'Lady', 'Gentleman'), we did not lowercase the text.
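The speech-construction step described above (concatenating each speaker's utterances within a debate, preserving capitalization) can be sketched as follows; the field names `debate_id`, `speaker`, and `text` are illustrative, not the corpus's actual schema:

```python
from collections import defaultdict

# Group utterances by (debate, speaker) and join them into a single
# "speech" string. Text is NOT lowercased, since capitalized terms of
# address ('Friend', 'Lady', 'Gentleman') can be informative.
def build_speeches(utterances):
    speeches = defaultdict(list)
    for u in utterances:
        speeches[(u["debate_id"], u["speaker"])].append(u["text"])
    return {key: " ".join(parts) for key, parts in speeches.items()}

utts = [
    {"debate_id": "d1", "speaker": "mp_a", "text": "I beg to move."},
    {"debate_id": "d1", "speaker": "mp_b", "text": "I oppose the motion."},
    {"debate_id": "d1", "speaker": "mp_a", "text": "I commend it to the House."},
]
speeches = build_speeches(utts)
```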

Annotation
We adapted the ParlVote annotation guidelines to include the new codes used in the updated MARPOR Coding Scheme version 5 (Werner et al., 2015). We make our guidelines available at https://tinyurl.com/y5twunrm.
The first author of this paper annotated each debate motion following these guidelines. The guidelines included instructions to code examples featuring the following types of motions with the label 000: No meaningful category applies:
• Business of the House motions, Programme motions, other timetabling and procedural motions, and motions to sit in private. Although MPs may use such motions politically, on the face of it they are concerned simply with the running of Parliament, rather than policy.
• Debates with divisions that are not on the motion in question. In many cases the division held at the end of the debate is held on some other point that has been brought up during the debate, such as an amendment introduced by the Speaker.
• Motions that appear to fit several codes, such as Finance Bills, Local Finance Bills, and Bills concerning the budgets of e.g., Police forces. Within the area of budgetary Bills, the exception is motions concerning approval of European Union (EU) Finance Bills, which tend to be positive or negative about the EU.
• Motions concerning constituency boundary changes.
We excluded all examples given this label from the dataset used for the experiments reported below, as they cover a wide range of topics and/or do not fit into any of the Manifesto Project codes. While 56 of the policy preference codes were used as labels by the annotators, we also excluded all examples with policy preference codes that occur fewer than 100 times in the dataset, leaving 34 codes used in the classification experiments. This left 23,181 example speeches given by 1,321 unique MPs in response to 1,215 different debates. Each example has a manually annotated policy preference label and a vote-derived speech stance polarity label. Of these, 305.1: Political Authority: Party Competence is the most common, with 4,926 labelled examples (see Appendix A).
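The filtering described above can be sketched with a small helper; the function and its toy inputs are illustrative (in the experiments, the exclusions are the 000 label and any code with fewer than 100 examples):

```python
from collections import Counter

# Drop examples labelled 000 and any policy code occurring fewer
# than min_count times. Examples are (policy_code, speech) pairs.
def filter_examples(examples, min_count=100):
    counts = Counter(code for code, _ in examples)
    return [
        (code, speech)
        for code, speech in examples
        if code != "000" and counts[code] >= min_count
    ]

# Toy data with a lowered threshold for demonstration.
ex = [("110", "a"), ("110", "b"), ("000", "c"), ("504", "d")]
kept = filter_examples(ex, min_count=2)
```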
Each instance in the corpus also retains its support/oppose label from the original ParlVote corpus, which we use to label the stance taken in each speech towards the policy under debate.

Inter-annotator agreement
In order to validate the new motion policy preference labels, we recruited a second annotator to label a randomly selected subsection of the corpus. After annotation, comparison, and discussion of some initial training examples, she labelled 108 motions (8.9% of the total). On this subset, we calculated a Cohen's kappa agreement score of 0.38, which can be interpreted as representing 'fair' (Landis and Koch, 1977) or 'poor' (Fleiss et al., 1981) agreement. This is comparable to other studies of annotation using the Manifesto Project codes (Lacewell and Werner, 2013; Mikhaylov et al., 2008), and similar to agreement on election manifestos, for which the labelling scheme was originally designed (Abercrombie et al., 2019). The level of agreement highlights that this is a non-trivial task on which agreement between different human annotators is difficult to achieve. Despite this issue of annotation reproducibility, these labels are considered to be valid by political scientists-as evidenced by Volkens et al. (2015), who found 230 articles that use this annotated data in the eight journals they examined. With comparable inter-annotator agreement, we consider them to be the best available labelling scheme for our task.
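For reference, Cohen's kappa compares observed agreement with the agreement expected by chance from each annotator's label distribution. A pure-Python sketch (library implementations such as scikit-learn's `cohen_kappa_score` compute the same quantity):

```python
from collections import Counter

# Cohen's kappa for two annotators' label sequences:
# kappa = (p_observed - p_expected) / (1 - p_expected)
def cohen_kappa(a, b):
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)
```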
We make the adapted dataset, ParlVote+, available for the research community at: https://tinyurl.com/y22rrta7 (this URL links to an anonymised Google Drive folder; a link to a permanent data repository will be provided on acceptance). There, we also provide a full data statement, following the guidelines of Bender and Friedman (2018).

Method
We investigate approaches to determining, for each example in the dataset, (a) the policy preference expressed in the debate motion, and (b) the sentiment (position) expressed in the speech towards that motion: support (positive) or oppose (negative).
We compare the performance of systems comprising combinations of the following:
• Learning paradigms (see Figure 2):
-Single tasks: inputs are processed separately for the two tasks, as in previous work.
-Multi-task learning: we use a 'hard parameter sharing' framework (Ruder, 2017), in which the network shares inputs and parameters in one hidden layer and trains two further task-specific layers separately.
• Debate models:
-Motion-independent: all examples are trained and evaluated together.
-Motion-dependent: Abercrombie and Batista-Navarro (2018a) showed that Government-proposed motions tend to be positive and those tabled by opposing parties negative, and that this could be used as a proxy for the polarity of the motions. We classify examples from debates initiated by members of the governing and opposition parties separately.

• Text representations:
-Bag-of-words (BOW): we used term frequency-inverse document frequency (tf-idf) scores of terms in the dataset to select unigram features, as previous work suggests that the addition of higher n-gram features does not improve performance in this domain (Abercrombie and Batista-Navarro, 2018a).
-BERT: we use Google's BERT tokenizer, pad the texts to the maximum input of 512 tokens, and fine-tune the top three layers of the BERT model. The (fine-tuned) final layer of BERT embeddings is then used as input to one of the following neural classifiers.
• Machine learning classification algorithms: we used neural networks of two hidden layers, with the second of these separated into two task-specific layers in the multi-task learning setting (see Figure 2). We used Adam optimization with a learning rate of 1 × 10^-5, a batch size of 32, and, with the BOW input only, a dropout rate of 0.5 for each layer. For binary (speech sentiment) and multiclass (motion policy preference) classification, we used sigmoid and softmax activation layers, respectively. We used early stopping and tested on the model that performed best on the validation set. Hyperparameters were chosen based on optimisation experiments, the results of which are presented in Appendix B.

Table 1: Macro-averaged F1 scores for classification of policy preference (multiclass), speech sentiment (binary), and policy-focused stance using motion-independent (Ind.) and motion-dependent (Dep.) debate models. Stance scores are reported as both the mean of the policy preference and sentiment scores and the absolute F1 score. The highest F1 scores for each task are highlighted in bold text.
We compared the following classes of network:
-Multi-layer perceptron (MLP): we used a network with hidden layers of 512 nodes and ReLU activation.
-Convolutional neural network (CNN): a network of one-dimensional convolutional layers with 512 filters, convolution windows spanning three tokens, and max pooling.
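The hard-parameter-sharing set-up can be sketched as a forward pass in which one shared hidden layer feeds two task-specific heads (a 34-way softmax for policy preference and a sigmoid for speech sentiment). This is an untrained NumPy illustration with toy dimensions, not the implementation used in the experiments:

```python
import numpy as np

# Random, untrained weights with toy dimensions: input features,
# shared hidden layer, and 34 policy preference classes.
rng = np.random.default_rng(0)
d_in, d_hid, n_policy = 300, 64, 34
W_shared = rng.normal(size=(d_in, d_hid)) * 0.01
W_policy = rng.normal(size=(d_hid, n_policy)) * 0.01  # policy head
W_sent = rng.normal(size=(d_hid, 1)) * 0.01           # sentiment head

def forward(x):
    h = np.maximum(0.0, x @ W_shared)                 # shared ReLU layer
    z = h @ W_policy
    z -= z.max(axis=1, keepdims=True)                 # stable softmax
    policy_probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    sent_prob = 1.0 / (1.0 + np.exp(-(h @ W_sent)))   # sigmoid
    return policy_probs, sent_prob
```

In training, the loss gradients from both heads flow back into the shared layer, which is what allows the two tasks to share information.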
We used a randomly sampled 80/10/10 split of the data. The experiments can be reproduced using our Python notebook, which we make available with all code and data at https://tinyurl.com/y62jrkyt.
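For reference, the tf-idf weighting underlying the BOW representation can be sketched as below; this minimal version omits the smoothing and normalisation that library implementations such as scikit-learn's TfidfVectorizer apply:

```python
import math
from collections import Counter

# Plain tf-idf over whitespace unigrams: tf is the term's relative
# frequency in the document, idf is log(N / document frequency).
def tfidf(docs):
    n = len(docs)
    tokenized = [doc.split() for doc in docs]
    df = Counter(t for toks in tokenized for t in set(toks))
    out = []
    for toks in tokenized:
        tf = Counter(toks)
        out.append({t: tf[t] / len(toks) * math.log(n / df[t]) for t in tf})
    return out

vectors = tfidf(["a b", "a c"])
```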

Results
We evaluated the systems described above against the majority class for each task. Slight differences in these baseline scores between the motion-dependent and motion-independent settings arise from variations in the class distributions of the test sets in these settings. Due to the class imbalances in the dataset, we report the macro-averaged F1 score as the evaluation metric.

Overall results
Results are presented in Table 1. Here, policy-focused stance represents the sentiment polarity of speakers towards the policy preference under debate. We report two measures of this for each system configuration: (1) the mean of the F1 scores for policy preference identification and sentiment classification, and (2) the absolute F1 where only examples for which both predicted labels match the true class labels are considered to be correct.
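The difference between the two stance measures can be illustrated with a toy helper for the 'absolute' case, in which a prediction counts as correct only when both labels match (shown here as accuracy over toy (policy, sentiment) pairs; Table 1 reports macro-averaged F1 rather than accuracy):

```python
# A stance prediction is "absolutely" correct only when BOTH the
# policy preference and the sentiment label match the gold labels.
def absolute_stance_accuracy(gold, pred):
    """gold, pred: equal-length lists of (policy, sentiment) pairs."""
    hits = sum(g == p for g, p in zip(gold, pred))
    return hits / len(gold)

gold = [("110", "support"), ("504", "oppose")]
pred = [("110", "support"), ("504", "support")]  # second sentiment wrong
```

Under the mean-of-scores measure the second example would still contribute to the policy score; under the absolute measure it counts as wrong.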
Most of the tested system configurations outperformed the naive baselines. In most cases, the motion-dependent models performed better than those that did not take into account this aspect of debate structure. Overall, contrary to our hypotheses, neither BERT nor the multi-task learning paradigm improved performance over the BOW and single-task set-ups. BERT-based systems tended to perform poorly on policy preference identification in the motion-dependent setting, perhaps due to the low number of examples per class combined with the loss of information due to BERT's maximum sequence length. The MLP classifier performed better than the CNN in nearly all scenarios. The highest overall F1 score for the combined tasks (67.4 mean, 45.2 absolute) was obtained by using single-task learning with BOW and MLP in the motion-dependent setting. It is notable that the policy preference detection scores (using BOW) are comparable to those obtained by Abercrombie et al. (2019), despite our systems using completely different input texts and having no access to the content of the motions themselves.

Results using shorter input speeches
The poorer performance of BERT text representations in all settings is perhaps due to its 512-token input limit. With the mean number of tokens per speech in the ParlVote corpus over 700, in many cases much potentially important information cannot be included when using this framework. Bearing this in mind, in order to test the potential of BERT for this task, we also ran the single-task MLP classifier on a subset of the data consisting solely of the 13,162 speeches in the dataset that consist of 512 tokens or fewer (calculated using the scikit-learn tokenizer). Results of these experiments are shown in Table 2. F1 scores here are lower than when using the full dataset due to the smaller size of the training set. However, the fact that under these conditions BERT outperforms BOW shows the importance of providing BERT with the full speech, and indicates that, where this is possible, fine-tuning BERT should lead to improved performance over the BOW model.

Table 2: Macro-averaged F1 scores for classification of policy preference (multiclass), speech sentiment (binary), and policy-focused stance (mean of these scores) using BOW and BERT-based text representations in the single-task MLP classification setting on shorter speeches of 512 tokens or fewer.

Results by policy preference class
Examining the performance of one of the best performing system configurations-the single-task-BOW-MLP-motion-dependent system-for each (true) policy preference label (Table 3), scores vary widely for both tasks. Each policy preference class received between four and 21 predicted labels in the classifier output (mean = 10.4). Labels with contrastive pairs did not necessarily seem to be more difficult to predict than individual class labels, with, for example, 104: Military: Positive obtaining one of the highest F1 scores for policy preference detection. Similarly, code 411: Technology and Infrastructure: Positive is in the Economy domain, which contains a number of fairly similar codes. However, this code concerns a well-defined topic, has no directly contrastive partner class, and obtained the highest scores overall. This suggests that the model can struggle to differentiate between closely related, but opposing, policy preference classes.
264 examples (22.1% of errors) were classified incorrectly for both policy preference and stance, 520 (43.6%) for policy preference only, and 410 (34.3%) for stance only. Figure 3 shows the predicted policy preference labels with respect to the true labels assigned by the annotators. Where mis-classifications occur, the classifier does not tend to prefer closely related labels, with more than double the number of out-of-domain (69.9%) to in-domain (31.1%) mis-classifications. This suggests considerable overlap of language use in policy domains such as 4: Economy and 5: Welfare and Quality of Life, where issues relating to both may frequently be discussed in the same debates, and on which the annotators frequently disagreed.

System output analysis
To gain an understanding of the challenges involved in improving classification performance on these tasks, we examined in closer detail the output of the single-task-BOW-MLP-motion-dependent system.

Table 4: Mean sentiment scores for all speeches, supportive (+)/oppositional (-) speeches, replies to Government/opposition party motions, responses to own/other party motions, and all combinations of these three factors.

Figure 3: True policy preference labels and the labels predicted by the classifier.

Features of speech polarity
In these experiments, we found that performance was improved by modelling debate structure in the motion-dependent setting. This supports the findings of Abercrombie and Batista-Navarro (2018a), who observed that the textual features that discriminated between supportive and oppositional speeches were not typically positive or negative when used in other domains.
To investigate how sentiment is manifested in this domain, we first calculated the general-domain sentiment scores of the tokens in each speech example in the test set on a scale of [−1, 1] by looking up the terms in the sentiment lexicon SentiWordNet 3.0 (Baccianella et al., 2010). These scores are shown in Table 4.
The mean sentiment of speeches overall is very slightly negative (-0.01), according to the lexicon. Overall, however, there is little difference between supportive and oppositional speeches in the polarity of language used. This is also the case for speeches given in different scenarios, such as in response to Government/opposition motions, by speakers addressing motions proposed by members with their own or with different party affiliations, or any combinations of these factors. This demonstrates once again that terms used in parliamentary debate speeches do not usually express the same sentiments that they may be expected to in general usage.
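The lexicon-based scoring used here can be sketched as a mean of per-token prior polarities; the toy lexicon below merely stands in for SentiWordNet 3.0 (real scores differ, and unknown tokens score 0):

```python
# Toy stand-in for SentiWordNet 3.0: per-lemma prior polarity
# scores in [-1, 1]. Tokens not in the lexicon contribute 0.
TOY_LEXICON = {"good": 0.75, "support": 0.5, "oppose": -0.5, "disastrous": -0.75}

def speech_sentiment(speech):
    tokens = speech.lower().split()
    return sum(TOY_LEXICON.get(t, 0.0) for t in tokens) / len(tokens)
```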
To examine which terms in the speeches do indicate sentiment, we obtained the permutation importance scores of each unigram in the input vocabulary. That is, for each feature in the feature set, we calculated the permutation feature importance as the difference between performance (in this case, the F1 score) on the original dataset D and on a corrupted version of D in which the values of that feature have been randomly shuffled (Breiman, 2001). We consider features with higher scores to be more important to the model. A sample of the most important features (the top 20) in each setting according to this metric is shown in Table 5. Comparing (the lemmas of) these terms with their SentiWordNet scores (means over all word senses), it seems that the features that are indicative of support or opposition are not those that would typically be used for subjective expression in general English usage. Rather, many are parliamentary terms, such as forms of address, and other proper nouns. This is particularly true for speeches addressing opposition-proposed motions.
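Permutation feature importance can be sketched as follows: shuffle one feature's column, re-score, and average the score drop over repeats (the `score_fn` here is generic; the experiments use the trained classifier's F1 score, and scikit-learn's `permutation_importance` offers the same procedure):

```python
import random

# Importance of `feature` = mean drop in score after randomly
# shuffling that feature's column (Breiman, 2001).
def permutation_importance(X, y, score_fn, feature, seed=0, n_repeats=5):
    base = score_fn(X, y)
    rng = random.Random(seed)
    drops = []
    for _ in range(n_repeats):
        col = [row[feature] for row in X]
        rng.shuffle(col)
        X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, col)]
        drops.append(base - score_fn(X_perm, y))
    return sum(drops) / n_repeats
```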

Party affiliations
As MPs usually vote along party lines, it would be possible to achieve good sentiment classification results by setting a classifier to make predictions on that simple basis alone. On the other hand, we also know that MPs are more free to 'rebel' against their parties in their speeches than in their voting behaviour (Proksch and Slapin, 2015). To investigate how this affects sentiment polarity classification, we compared performance on rebel MPs-those voting against a motion proposed by their own party or in support of one proposed by another party-and loyal MPs. This produced F1 scores of 77% and 66% respectively. The lower performance on loyal voters may suggest that, on occasion, speakers may use language that goes some way towards supporting the position of their opponents, while ultimately voting with their parties, and that these cases may be harder to detect than outright rebellions. The frequency distribution plots in Figure 4 present a closer look at this. They show the predicted probabilities of examples being assigned to the positive class. We compare the probability distributions for correctly and incorrectly predicted test-set examples. These densities are shown in three settings: all predictions, intra-party speeches (made in response to motions proposed by an MP with the same party affiliation), and inter-party responses (replies to a member of another party).
There are a number of clear patterns in the distributions. Overall, the system tends to make more confident predictions for examples that it predicts correctly (that is, it outputs probabilities towards 0.0 for negative and 1.0 for positive examples), and is less confident about examples that it predicts incorrectly (closer to 0.5), as might be expected. In the intra-party setting, the model outputs high probabilities that it assigns to the positive class (correctly, more often than not). Meanwhile, negative predictions (usually incorrect) are made with probabilities that tend towards 0.5 (that is, with low certainty). For inter-party response speeches, this pattern is reversed, albeit not to as dramatic an extent. This may be due to situations in which, for example, multiple opposition parties collaborate against the Government, which introduce some noise into this analysis. Ultimately, the patterns seen here suggest that the language used in the speeches may often say more about the speakers' party affiliations than about the nuances of individual speaker stance.

Input speech length
The length of speeches does not seem to greatly affect classification performance: examples that are classified correctly, partially correctly, and completely incorrectly have similar distributions of token counts (see Table 6).
Some previous work has excluded speeches of fewer than 50 tokens, under the assumption that they are unlikely to contain enough information to express sentiment (Abercrombie and Batista-Navarro, 2018a; Salah, 2014). There are 2,941 such speeches in ParlVote; they are fairly balanced between the positive and negative classes (53/47%) and have a very similar distribution of policy preference labels to the main dataset. In our experiments, 67.8% of these shorter examples were classified correctly for speech sentiment (compared with 69.5% of examples of any length), and 42.6% were classified correctly on both tasks (48.1% for the whole dataset). With both very short speeches (such as the two-word speeches 'Hear hear' and 'Under Labour', both of negative stance) and the longest speeches classified correctly, speech length does not appear to be an important factor in the performance of the BOW-based systems.
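The length-based breakdown above amounts to comparing accuracy on the short-speech subset with overall accuracy. A sketch of such a helper follows; the 50-token threshold comes from the prior work cited above, while the function and argument names are hypothetical.

```python
def accuracy_by_length(n_tokens, correct, threshold=50):
    """Compare accuracy on short speeches (< threshold tokens) with
    accuracy on the full set.

    n_tokens : token count per speech
    correct  : 1 if the speech was classified correctly, else 0
    """
    short = [c for n, c in zip(n_tokens, correct) if n < threshold]
    overall = sum(correct) / len(correct)
    short_acc = sum(short) / len(short) if short else float("nan")
    return {"short": short_acc, "overall": overall}
```

A gap between the two figures would indicate that short speeches are systematically harder; here the gap is small for the sentiment task.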

Discussion and conclusion
Policy-focused stance detection of parliamentary speeches is a challenging task, which we have framed as combined binary and multiclass classification. For this, we enhanced an existing dataset with an additional set of policy preference labels. While inter-annotator agreement on policy preference labels is modest, it is similar to that reported in previous work on both parliamentary debates and election manifestos. To address the issue of low annotator agreement, and the fact that classifiers frequently misclassify speeches across policy domains, future work could take a perspectivist approach to annotator disagreement (Basile et al., 2021a,b), and consider reframing the task as a multiclass, multilabel problem in which more than one policy preference code may be valid per speech. Notwithstanding this issue, despite the large number of classes in the policy prediction task, and despite the fact that our input features were based only on the content of speeches (not the motions or titles, as in previous work (Abercrombie et al., 2019)), we obtained reasonable results, comfortably beating the majority class baselines.
Modelling of the structure of parliamentary debates in the form of motion-dependent classification was seen to improve performance on speech sentiment classification in prior work. In this study, we found that it is not only consistently superior for speech sentiment classification, but also improves the identification of policy preferences, that is, the topics under discussion. We have shown that the differences between supportive and opposing speeches do not derive from generally sentiment-bearing words, but from the relationships between the speaker, the MP who proposes the motion in question, and the party affiliations of both actors.
The application of multi-task learning did not, in most configurations, improve system performance. However, we used a fairly simple framework: a single shared hidden layer, followed by one task-specific hidden layer per classification task. There is therefore plenty of scope for further experimentation with more complex architectures for this approach.
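As a rough illustration, a hard-parameter-sharing setup of the kind described above might look as follows in PyTorch. This is a sketch under stated assumptions, not the configuration used in our experiments: the class name, layer sizes, activation choice, and number of policy classes are placeholders.

```python
import torch
import torch.nn as nn

class SharedStanceModel(nn.Module):
    """One shared hidden layer feeding two task-specific heads:
    binary speech sentiment and multiclass policy preference."""

    def __init__(self, input_dim, hidden_dim, n_policy_classes):
        super().__init__()
        # Shared representation: updated by gradients from both tasks.
        self.shared = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        # One further hidden layer per task, as in the simple framework used here.
        self.sentiment_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2))
        self.policy_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_policy_classes))

    def forward(self, x):
        h = self.shared(x)
        return self.sentiment_head(h), self.policy_head(h)
```

Training would sum a cross-entropy loss per head; more complex alternatives include additional shared layers, task weighting, or soft parameter sharing.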
In these experiments, fine-tuning on BERT embeddings led to considerably worse performance. Considering the widespread success of this approach elsewhere, this also warrants further investigation. With recent work suggesting that, for real-world tasks and datasets, pretraining the embeddings on in-domain data may be necessary (Xia et al., 2020), a more domain-specific approach may be desirable.
While other work on sentiment and stance detection in the domain of parliamentary debates has effectively overlooked the targets of those opinions, we have combined approaches to sentiment and topic detection to formulate a task with potential for real-world application. Although there remains much room for improvement in classification performance, we have shown that the task of policy-focused speech stance detection can be feasibly automated, even with simple features and neural architectures. Although we have focussed our annotation effort and analysis on debates from the UK Parliament, the proposed approach is generalisable to other legislatures, or indeed any political debates that feature proposed motions and supporting and opposing documents.
In future work, we will focus on refining the annotation scheme in order to obtain greater labelling consistency and improved classification performance, as well as adapting the methods for the legislative debate domain.