Evaluating Task-oriented Dialogue Systems: A Systematic Review of Measures, Constructs and their Operationalisations

DOI:

https://doi.org/10.3384/nejlt.2000-1533.2026.5940

Abstract

This review gives an overview of evaluation methods for task-oriented dialogue systems, discussing the constructs, metrics and operationalisations used in previous work and highlighting the challenges specific to dialogue system evaluation. The objective of this review is to encourage a more critical approach when evaluating dialogue systems. To that end, a systematic review of four databases was conducted (ACL, ACM, IEEE and Web of Science), which after screening resulted in 122 studies. Those studies were carefully analysed for the constructs and methods they proposed for evaluation. Four of the most frequently occurring constructs (satisfaction, correctness, quality, and efficiency) are discussed as examples of how constructs are operationalised and measured in research. Additionally, recent developments regarding large language models are discussed with respect to their applicability to dialogue system evaluation. Furthermore, considerations and concerns about validity and reliability are discussed in relation to the constructs and metrics identified. To improve consistency in evaluation approaches, future work should take a critical and systematic approach to the operationalisation and specification of the constructs used. To work towards this aim, this review ends with a research agenda for dialogue system evaluation and a set of outstanding questions.

Published

2026-03-03