LLM-Mediated Guidance of MARL Systems

Cooperative MARL research has developed techniques to effectively optimize collective return in simulated environments (Rashid et al., 2020; Yuan et al., 2023; Albrecht et al., 2024). This enables the deployment of multi-agent systems (MAS) that can efficiently solve complex tasks, particularly tasks that factorize into parallel subtasks and/or take place in the physical world (e.g., robotics) and can benefit from spatially scattered agents (Calvaresi et al., 2021). However, what if the reward function is misspecified? This can happen because the reward is difficult to define in a way that avoids reward hacking (Skalse et al., 2022). Alternatively, what if the test-time environment or system goals change slightly? In such cases, we would like a user to be able to steer a MARL system towards more desirable behaviour (human-in-the-loop). These are key challenges that arise in real-world domains. In addition, we do not want to assume the user is a MARL expert; ideally, the user could steer the system in an intuitive and simple way. We therefore consider steering a MAS using natural language: the user issues high-level strategies that an LLM then translates into actions communicated to the MAS. While examples of humans intervening in and controlling static programs/interfaces via LLMs are pervasive (Hong et al., 2023), we know of fewer examples of controlling single-agent learning systems, and none of controlling multi-agent learning systems.

Integrating LLMs with RL presents exciting opportunities for enhancing agent performance, particularly in complex MA environments. Instruction-aligned models with advanced reasoning and planning capabilities are well suited to this task. Prompted correctly, these models provide real-time, context-aware strategies, guiding agents through challenges where traditional RL methods struggle, especially in environments with large action/observation spaces or sparse rewards, and particularly during early training. We envision a future where LLM-RL combinations can manage increasingly dynamic environments, with LLMs handling complex interactions and dynamically changing observation and action spaces. Our research explores this potential in MARL. We allow users to quickly 'fine-tune' a base MARL system by guiding the agents using free-form natural language or rule-based interventions in the training process. This adaptation helps the system align more closely with the user's bespoke task requirements, ensuring that agents develop behaviours tailored to the challenges of the environment. We have specifically chosen the Aerial Wildfire Suppression (AWS) environment from the HIVEX suite (Siedler, 2025), as it offers a relevant and intricate problem to solve.

The AWS environment presents dynamic, high-stakes cooperative scenarios in which the unpredictability of wildfire spread creates an evolving challenge. Factors such as wind direction, humidity, terrain slope, and temperature, all hidden from the agents, add layers of complexity. Solving this environment requires seamless collaboration among agents, where strategic coordination is essential to containing fires. With AWS, users engage in a problem that simulates real-world wildfire management. The combination of a physically and visually rich simulation, open-ended scenarios, and varied environmental conditions makes AWS a demanding environment and a great challenge.
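As an illustration, the hidden factors can be thought of as a per-episode latent state that drives fire spread. The following sketch is hypothetical: the field names and value ranges are our assumptions, not taken from HIVEX.

```python
import random
from dataclasses import dataclass


@dataclass
class HiddenConditions:
    """Environment factors hidden from the agents (illustrative names)."""
    wind_direction_deg: float  # direction the wind blows towards
    humidity: float            # relative humidity in [0, 1]
    terrain_slope_deg: float   # local slope steepness
    temperature_c: float       # ambient temperature


def sample_conditions(rng: random.Random) -> HiddenConditions:
    # Resampled each episode, making fire spread unpredictable to the agents.
    return HiddenConditions(
        wind_direction_deg=rng.uniform(0.0, 360.0),
        humidity=rng.uniform(0.1, 0.9),
        terrain_slope_deg=rng.uniform(0.0, 30.0),
        temperature_c=rng.uniform(15.0, 45.0),
    )
```

Because these factors are resampled but never observed, agents must coordinate under partial observability rather than memorize a fixed fire dynamic.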

In this work, we test whether combining current MARL and LLM techniques allows users to steer and guide a MARL system towards more desirable behaviour in the challenging AWS environment. We consider two users: a simple Rule-Based (RB) Controller and a more sophisticated Natural Language (NL) Controller. The NL Controller simulates how humans might interact with the MAS, i.e., in free-form natural language. We compare these against our baseline, a setup with no test-time interventions. We summarize our core contributions as follows:
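Conceptually, both users share the same interface: each maps the current situation to a high-level instruction for the LLM-Mediator. A minimal sketch of this abstraction follows; the class names, observation keys, and example rule are illustrative assumptions, not the paper's actual implementation.

```python
from abc import ABC, abstractmethod


class Controller(ABC):
    """A user (rule-based or human-like) that issues high-level strategies."""

    @abstractmethod
    def intervene(self, observation: dict) -> str:
        """Return a free-form instruction for the LLM-Mediator."""


class RuleBasedController(Controller):
    def intervene(self, observation: dict) -> str:
        # A simple hand-written rule (hypothetical observation key).
        if observation.get("fire_active"):
            return "closest agent: fly to the fire and drop water"
        return "all agents: pick up water"


class NaturalLanguageController(Controller):
    def __init__(self, read_user_input):
        # e.g. a callable that reads the on-screen natural language input field
        self.read_user_input = read_user_input

    def intervene(self, observation: dict) -> str:
        # Free-form instruction typed by a human user.
        return self.read_user_input(observation)
```

The baseline corresponds to running without any `Controller` at all, so the agents act purely from their learned policies.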

  • Rule-Based and Natural Language Controller-Generated Interventions: We implement a novel system in which rule-based and natural-language interventions enhance decision-making and coordination in dynamic settings such as AWS.
  • Adaptive and Dynamic Guidance: Our approach moves beyond static curriculum-based methods, providing real-time, adaptive interventions that respond to the evolving states of agents and environments, improving both long-term strategy and immediate decision-making.
  • AWS Environment: We apply our method to the HIVEX AWS environment, simulating coordinated aerial wildfire suppression, showcasing the effectiveness of LLM-mediated interventions in managing complex and dynamic tasks in a MA environment.
  • Accelerated Learning and Improved Coordination: Our results demonstrate that interventions, especially during early training, accelerate learning, reaching expert-level performance more efficiently.

Aerial Wildfire Suppression Environment

The Aerial Wildfire Suppression environment includes two types of controllers: Natural Language-based and Rule-Based. Controller interventions are passed to the LLM-Mediator, which temporarily supplies actions that overwrite the agents' learned policy actions.
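The translation step can be sketched as a single prompt to the LLM followed by parsing per-agent tasks from its reply. Everything below (prompt wording, task vocabulary, reply format) is an illustrative assumption; `llm` stands in for any text-in, text-out model call.

```python
def mediate(intervention: str, agent_states: list, llm) -> dict:
    """Translate a controller intervention into per-agent tasks."""
    prompt = (
        "You coordinate wildfire-suppression aeroplanes.\n"
        f"Agent states: {agent_states}\n"
        f"Instruction: {intervention}\n"
        "Reply with one line per agent, e.g. '0: go_to_fire'."
    )
    tasks = {}
    for line in llm(prompt).splitlines():
        agent_id, sep, task = line.partition(":")
        if sep and task.strip():
            # Map each agent id to the high-level task the LLM assigned it.
            tasks[int(agent_id)] = task.strip()
    return tasks
```

A production system would validate the reply against the known agent ids and task vocabulary before overriding any policy; this sketch only shows the parsing shape.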

Demo

AWS Process Diagram

The default setup consists of three agents, each controlling an individual aeroplane. Each agent receives both feature-vector and visual observations. Agents' actions include steering left, steering right, and releasing water. Rewards are given for extinguishing burning trees; smaller rewards are given for wetting living trees and for picking up water. A negative reward is given for crossing the environment boundary. The LLM-Mediator interprets RB and NL Controller interventions, assigning a task to any agent for the next 300 steps and overwriting that agent's policy actions.
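The fixed-window override described above can be sketched as a thin wrapper around one agent's learned policy. The class name and the task-to-action table below are illustrative assumptions; only the 300-step window comes from the setup described here.

```python
class InterventionWrapper:
    """Wraps one agent's learned policy; a mediator-assigned task overrides
    the policy's actions for a fixed window (300 steps in this setup)."""

    def __init__(self, policy, horizon: int = 300):
        self.policy = policy      # learned policy: observation -> action
        self.horizon = horizon
        self.task = None
        self.steps_left = 0

    def assign(self, task: str) -> None:
        # Called by the LLM-Mediator when a controller intervenes.
        self.task = task
        self.steps_left = self.horizon

    def act(self, obs):
        if self.steps_left > 0:
            # During the override window, ignore the policy entirely.
            self.steps_left -= 1
            return self._task_to_action(self.task)
        return self.policy(obs)

    @staticmethod
    def _task_to_action(task: str) -> str:
        # Placeholder mapping from a high-level task to a primitive action
        # (steer left/right, release water); purely illustrative.
        return {"go_to_fire": "steer_left",
                "pick_up_water": "steer_right"}.get(task, "release_water")
```

Once the window expires, control falls back to the learned policy automatically, so interventions shape behaviour without permanently replacing it.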

AWS Environment

(1) Water Collection Area, (2) Agent-controlled Wildfire Suppression Aeroplanes, (3) Human Natural Language Controller Input Field, (4) Village. Environment Features: Wind, overcast, temperature and humidity map sample.

Results