Prompting large language models for translation quality evaluation: A systematic review of strategies, challenges, and future directions
Abstract
Large-scale pre-trained language models are increasingly employed to evaluate machine translation quality, aiming to overcome the limitations of traditional metrics in capturing semantic nuance, discourse coherence, and cultural context. Despite recent advances, large language model-based evaluation systems still face persistent challenges in prompt formulation, scoring consistency, cross-linguistic adaptability, and interpretability. This systematic review analyzes 63 peer-reviewed studies published between 2020 and 2025, mapping the evolution of prompt-based strategies across multilingual and multi-domain translation evaluation tasks. Three core limitations are identified: lack of reproducibility, semantic misalignment, and insufficient cultural adaptability. To address these issues, this paper proposes a conceptual three-dimensional framework for prompt design, comprising semantic attribution for interpretability, cultural mapping for contextual adaptability, and prompt regularization for cross-task robustness. The proposed framework offers a foundation for constructing more transparent, generalizable, and culturally responsive evaluation systems, and supports the advancement of human–artificial intelligence collaboration in translation quality assessment and cross-cultural communication.