TITLE:
A Review of Agent Data Evaluation: Status, Challenges, and Future Prospects as of 2025
AUTHORS:
Shaohan Wang
KEYWORDS:
Agent Evaluation, Large Language Model, Benchmarks, Process-Oriented Evaluation
JOURNAL NAME:
Journal of Software Engineering and Applications, Vol.18 No.9, September 17, 2025
ABSTRACT: With the rapid advancement of large language models (LLMs), agents capable of autonomous perception, decision-making, and action have emerged as a frontier paradigm in artificial intelligence. These agents are moving from academic research into complex real-world applications. However, the rapid iteration of agent capabilities poses severe challenges to evaluation methodologies, particularly in assessing agents' core competencies in data processing and evaluation. As of 2025, the field of agent data evaluation presents a dynamic yet fragmented landscape. Evaluations based on traditional static datasets are no longer sufficient to measure agent performance in open, dynamic environments, and the research community is actively shifting toward more interactive and realistic benchmarking paradigms. Despite the emergence of innovative benchmarks such as ToolBench and MLAgentBench, the field still lacks unified evaluation standards, widely accepted metric systems, and mature methodologies. This paper systematically reviews the state of agent data evaluation in 2025, tracing the evolution from traditional metrics to emerging process-oriented ones. Building on this, we examine the methodology of dataset and benchmark design, with particular attention to key elements of experimental design such as controlled experiments, sample size determination, and statistical analysis. Furthermore, we analyze the core challenges facing the field, including the “realism gap” between evaluation and real-world tasks, the scalability dilemma of automated evaluation, and the increasingly prominent issues of data privacy and security. Our findings indicate that although enabling technologies such as differential privacy and federated learning exist, dedicated privacy-preserving frameworks for agent evaluation remain in their infancy. Finally, this paper outlines future research directions, emphasizing the urgent need to establish unified evaluation frameworks, develop process-oriented evaluation metrics, and formulate standardized privacy and security auditing protocols, with the aim of providing a scientific foundation for building more robust, trustworthy, and responsible agent systems.
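
ILLUSTRATIVE EXAMPLE: As a minimal sketch of one experimental-design element mentioned in the abstract (sample size determination), the Python snippet below estimates how many tasks per agent are needed to detect a difference in task success rates between two agents using a standard two-proportion power calculation. The success rates, significance level, and power shown are assumed placeholder values, not figures from the paper.

# Minimal sketch (assumed values, not from the paper): per-agent sample size
# for detecting a difference in task success rates with a two-sided
# two-proportion z-test at a chosen significance level and power.
import math
from scipy.stats import norm

def sample_size_two_proportions(p1: float, p2: float,
                                alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-group sample size for a two-sided two-proportion z-test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # quantile corresponding to the desired power
    p_bar = (p1 + p2) / 2               # pooled success rate under the null hypothesis
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)

if __name__ == "__main__":
    # Hypothetical scenario: baseline agent succeeds on 60% of tasks, candidate on 70%.
    n = sample_size_two_proportions(0.60, 0.70)
    print(f"Tasks needed per agent: {n}")  # about 356 tasks per agent under these assumptions

Under these assumed numbers, roughly 356 tasks per agent are required, which illustrates why small hand-curated benchmarks can be statistically underpowered for comparing agents whose success rates differ only modestly.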