Collaborator(s): Ian Gemp
Accepted at: Autonomous Agents and Multiagent Systems (AAMAS) 2025
at The 16th Workshop on Optimization and Learning in Multiagent Systems
Talk(s): AAMAS 2025 OptLearnMAS-25
Paper: link
Repository: link
Cooperative MARL research has developed techniques to effectively optimize collective return in
simulated environments (Rashid et al., 2020; Yuan et al., 2023; Albrecht et al., 2024). This enables
the deployment of multi-agent systems (MAS) that can efficiently solve complex tasks, particularly
in tasks that factorize into parallel subtasks and/or take place in the physical world (e.g., robotics)
and can benefit from spatially-scattered agents (Calvaresi et al., 2021). However, what if the reward
function is misspecified? This can happen because the reward is difficult to define in a way that
avoids reward hacking (Skalse et al., 2022). Alternatively, what if the test-time environment or the system's goals change slightly? In such cases, we would like a user to be able to steer a MARL system towards more desirable behaviour (human-in-the-loop). These are key challenges that arise in real-world domains. In addition, we do not want to assume the user is a MARL expert; ideally, the user could steer
the system in an intuitive and simple way. Therefore, we consider steering a MAS using natural language. The user issues high-level strategies that an LLM then translates into actions to communicate
with the MAS. While examples of humans intervening and controlling static programs/interfaces via
LLMs are pervasive (Hong et al., 2023), we know of fewer examples controlling single-agent learning systems and no examples controlling MA learning systems.
Integrating LLMs with RL presents exciting opportunities for enhancing agent performance, particularly in complex MA environments. Instruction-aligned models with strong reasoning and planning capabilities are well suited to this task. Prompted correctly, they provide real-time, context-aware strategies, guiding agents through settings where traditional RL methods struggle, such as environments with large action/observation spaces or sparse rewards, especially during early training. We envision a future where LLM-RL combinations can manage increasingly
dynamic environments, with LLMs handling complex interactions and dynamically changing observation and action spaces. Our research explores this potential in MARL. We allow users to quickly
'fine-tune' a base MARL system by guiding the agents using free-form natural language or rule-based interventions in the training process. This adaptation helps the system align more closely
with the user’s bespoke task requirements, ensuring that agents develop behaviours tailored to the
challenges of the environment. We have specifically chosen the Aerial Wildfire Suppression (AWS) environment from the HIVEX suite (Siedler, 2025), as it offers a relevant and intricate problem to
solve.
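Guiding agents during training, as described above, amounts to periodically injecting controller suggestions into an otherwise standard MARL loop. The sketch below is a minimal illustration under assumed interfaces (`env`, `policies`, `controller` and their methods are stand-ins, not the paper's code).

```python
# Illustrative sketch: injecting controller interventions (rule-based or
# LLM-generated) into a generic MARL training loop. All interfaces here
# are assumptions for illustration.
def train_with_interventions(env, policies, controller, episodes=10,
                             intervene_every=5):
    """Every `intervene_every` steps, the controller may override
    agents' actions before the environment step and learner update."""
    for _ in range(episodes):
        obs = env.reset()
        done = False
        t = 0
        while not done:
            # Each agent acts from its own policy
            actions = {i: pi.act(obs[i]) for i, pi in policies.items()}
            if t % intervene_every == 0:
                # Controller suggestions, keyed by agent id, take precedence
                for i, override in controller.suggest(obs).items():
                    actions[i] = override
            obs, rewards, done, _ = env.step(actions)
            for i, pi in policies.items():
                pi.update(obs[i], actions[i], rewards[i])
            t += 1
```

Because the interventions happen inside the training loop, the policies learn from the intervened trajectories, which is what allows the user to shape the behaviour the system converges to.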
The AWS environment presents dynamic, high-stakes cooperative scenarios in which the unpredictability of wildfire spread creates an evolving challenge. Factors such as wind direction, humidity, terrain slope, and temperature (hidden from the agents) add layers of complexity. Solving this environment requires seamless collaboration among agents, where strategic coordination is essential to containing fires. With AWS, users engage in a problem that simulates real-world wildfire management. The combination of a physically and visually rich simulation, open-ended scenarios, and varying environmental conditions makes AWS a demanding environment and a great challenge.
In this work, we test whether combining current MARL and LLM techniques can allow users to steer and guide a MARL system towards more desirable behaviour in the challenging AWS environment.
We consider two users: the simple Rule-Based (RB) Controller and a more sophisticated Natural
Language (NL) Controller. The NL Controller simulates how humans might interact with the MAS,
i.e., in free-form natural language. We compare these against our baseline, a setup with no test-time
interventions. We summarize our core contributions as follows:
- Rule-Based and Natural Language Controller Generated Interventions: We implement a novel system in which rule-based and natural-language-based interventions enhance decision-making and coordination in dynamic settings like AWS.
- Adaptive and Dynamic Guidance: Our approach moves beyond static curriculum-based methods, providing real-time, adaptive interventions that respond to the evolving states of agents and environments, improving both long-term strategy and immediate decision-making.
- AWS Environment: We apply our method to the HIVEX AWS environment, simulating coordinated aerial wildfire suppression, showcasing the effectiveness of LLM-mediated interventions in managing complex and dynamic tasks in a MA environment.
- Accelerated Learning and Improved Coordination: Our results demonstrate that interventions, especially during early training, accelerate learning, reaching expert-level performance more efficiently.
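To make the simpler of the two controllers concrete, a rule-based controller of the kind described above can be sketched as a direct mapping from a coarse state summary to per-agent directives. The state keys and priority ordering below are assumptions chosen for illustration, not the paper's actual rules.

```python
# Minimal rule-based controller sketch for an AWS-like task.
# The state schema (keys like "water_source", "fire_near_village")
# is hypothetical, used only to illustrate the idea.
def rule_based_controller(state: dict) -> dict:
    """Map a coarse state summary to (action, target) directives per agent."""
    directives = {}
    for agent_id, agent in state["agents"].items():
        if agent["water"] == 0:
            # An empty aeroplane must refill before it can help
            directives[agent_id] = ("move_to", state["water_source"])
        elif state["fire_near_village"]:
            # Protecting the village takes priority over distant fires
            directives[agent_id] = ("drop_water", state["village"])
        else:
            directives[agent_id] = ("drop_water", state["closest_fire"])
    return directives
```

The NL Controller plays the same role but replaces the fixed rules with an LLM that interprets the user's free-form instructions, so both controllers produce directives in the same format and can be compared against the no-intervention baseline.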
Figure: AWS Process Diagram.
Figure: AWS Environment. (1) Water Collection Area, (2) Agent-controlled Wildfire Suppression Aeroplanes, (3) Human Natural Language Controller Input Field, (4) Village. Environment features: wind, overcast, temperature, and humidity map samples.
Results