J A B B Y A I

Loading

A gradient blue to green background features a white flowchart with rectangular boxes connected by arrows, ending in a hexagonal “STOP” sign and a check mark on the right side.

Autonomous AI agents are transforming the way we approach multi-step decision-making processes, streamlining tasks like web browsing, video editing, and file management. By applying advanced machine learning, they automate workflows, optimize performance, and reduce the need for human input. 

However, these systems struggle in complex, dynamic environments. A key challenge lies in balancing exploitation, using known strategies for immediate gains, with exploration, which involves seeking new strategies that could yield long-term benefits. Additionally, they often have difficulty adapting to unpredictable changes in conditions and objectives, as well as generalizing knowledge across contexts, limiting their ability to transfer learned strategies between domains. 

In response, we developed ExACT, an approach for teaching AI agents to explore more effectively, enabling them to intelligently navigate their environments, gather valuable information, evaluate options, and identify optimal decision-making and planning strategies. ExACT combines two key techniques: Reflective-MCTS (R-MCTS) and Exploratory Learning.

R-MCTS builds on the traditional Monte Carlo Tree Search (MCTS) algorithm, introducing features like contrastive reflection and a multi-agent debate function. Through contrastive reflection, the agent refines its decision-making by comparing expected outcomes with actual results, allowing it to learn from both its successes and mistakes. The multi-agent debate function provides various evaluations of a given state, where multiple agents offer contrasting perspectives to help provide a balanced and reliable assessment.

Exploratory Learning trains agents to navigate environments effectively. Together, these techniques show strong computational scalability during both training and testing, as demonstrated on VisualWebArena—a benchmark for evaluating multimodal autonomous language agents (Figure 1). 

Evaluation demonstrates the compute scaling properties of GPT-4o during both training and testing. The assessment includes two scenarios: (1) applying the GPT-4o-based R-MCTS agent to all 234 tasks from the Classifieds category in VisualWebArena (left), and (2) testing fine-tuned GPT-4o on 169 previously unseen tasks from Classifieds without using search algorithms (right).
Figure 1. Evaluation demonstrates the compute scaling properties of GPT-4o during both training and testing. The assessment includes two scenarios: (1) applying the GPT-4o-based R-MCTS agent to all 234 tasks from the Classifieds category in VisualWebArena (left), and (2) testing fine-tuned GPT-4o on 169 previously unseen tasks from Classifieds without using search algorithms (right).

R-MCTS extends the classic MCTS by enabling real-time improvements in decision-making. Shown in Figure 2, an iterative feedback loop allows R-MCTS to learn from past experiences, avoid prior mistakes, and focus on more effective actions in similar contexts.

Overview of the R-MCTS process in ExACT. 
Figure 2. Overview of the R-MCTS process in ExACT. 

Evaluating R-MCTS

R-MCTS demonstrates state-of-the-art performance across all VisualWebArena environments, surpassing the previous best-performing method, Search Agent, with improvements ranging from 6% to 30% (Table 1). Additionally, as of January 2025, it holds the second position on the OSWorld leaderboard and demonstrates state-of-the-art performance in the blind test setting, where there is no prior access to the test environment, reflecting its advanced capabilities (Table 2). 

Rank Model Score
1 GPT-4o + ExACT 33.70
2 GPT-4o + Search 26.40
3 GPT-4o + WebDreamer 23.60
4 GPT-4o + ICAL 23.40
5 GPT-4o 19.78
6 Llama-3-70B + Search 16.70
Table 1. The VisualWebArena leaderboard highlights R-MCTS as achieving state-of-the-art performance as of December 2024. 
Rank Model Blind Test Score
1 learn-by-interact w/ Claude-3.5-sonnet 🗶 22.50
2 ExACT w/ GPT-4o ✔ 16.60
3 GPT-4 ✔ 12.24
4 GPT-4o ✔ 11.36
5 GPT-4 Vision (0409) ✔ 10.82
6 learn-by-interact w/ Gemini-1.5-pro ✔ 10.30
Table 2. The OSWorld leaderboard for the category of A11y tree inputs shows that ExACT with GPT-4o ranks second and demonstrates state-of-the-art performance in the blind test setting, as of December 2024.

How Exploratory Learning works

Exploratory Learning enables agents to dynamically search and adjust their computational resources during testing without depending on MCTS. In contrast to Imitation Learning, which centers on training models using the optimal actions identified through search, Exploratory Learning focuses on cultivating the agent’s ability to navigate its environment by teaching it to evaluate states, explore different pathways, and efficiently backtrack from unpromising paths to identify more favorable alternatives. 

In contrast to Imitation Learning, Exploratory Learning uses the entire search trajectory for training.
Figure 3. In contrast to Imitation Learning, Exploratory Learning uses the entire search trajectory for training.

Evaluating Exploratory Learning

We conducted experiments using GPT-4o fine-tuned with Exploratory Learning in the VisualWebArena environment. Results demonstrate the following key benefits: 

  • Improved performance: GPT-4o achieves performance improvement, comparable with scaling test-time compute with MCTS, even without search.
  • Test-time compute scaling: GPT-4o performs better when given more actions per task, leading to improved decision-making and task completion, which increased from 5% to 12.4%. 
  • Improved generalization on unseen tasks: Exploratory Learning helps fine-tuned GPT-4o handle unseen tasks more effectively than agents trained with Imitation Learning or no additional training.

The following video provides a detailed demonstration of how R-MCTS and Exploratory Learning function.

Continued exploration

Advancing autonomous AI agents is key to enabling them to handle complex, multi-step tasks with greater precision and adaptability. ExACT represents a significant step toward creating agents that can perform complex decision-making before taking action, leading to improved performance, but challenges remain. How can AI agents improve decision-making in real-world scenarios, where they may be constrained by time or resources? How can they learn effectively and efficiently from environmental feedback? We are currently investigating these questions, and we invite you to explore them with us by building on the ExACT framework. Access the ExACT code at our GitHub repository (opens in new tab)

The post ExACT: Improving AI agents’ decision-making via test-time compute scaling appeared first on Microsoft Research.

Leave a Comment