Evaluating Task-oriented Dialogue Systems: A Systematic Review of Measures, Constructs and their Operationalisations

DOI:

https://doi.org/10.3384/nejlt.2000-1533.2026.5940

Abstract

This review gives an overview of evaluation methods for task-oriented dialogue systems, discussing the constructs, metrics and operationalisations used in previous work and highlighting the challenges specific to dialogue system evaluation. The objective of this review is to encourage a more critical approach when evaluating dialogue systems. To that end, a systematic review of four databases was conducted (ACL, ACM, IEEE and Web of Science), which after screening resulted in 122 studies. Those studies were carefully analysed for the constructs and methods they proposed for evaluation. Four of the most frequently occurring constructs (satisfaction, correctness, quality, and efficiency) are discussed as examples of how constructs are operationalised and measured in research. Additionally, recent developments regarding large language models are discussed with respect to their applicability to dialogue system evaluation. Furthermore, considerations and concerns about validity and reliability are discussed in relation to the constructs and metrics identified. To improve consistency in evaluation approaches, future work should take a critical and systematic approach to the operationalisation and specification of the constructs used. To work towards this aim, this review ends with a research agenda for dialogue system evaluation and a set of outstanding questions.

Published

2026-03-03