Before Humans Join the Team: Diagnosing Coordination Failures in Healthcare Robot Team Simulation

1 Cornell University, 2 University of Maryland, College Park, 3 Imperial College London, 4 New York University
Study Design Pipeline

We designed a controlled test case in a healthcare setting that simulates real-world complexity, serving as a testbed to examine how hierarchical multi-agent robotic systems (MARS) operate under high-stakes conditions. Our exploration goes beyond surfacing coordination patterns: we analyze how three factors shape system-level performance, namely contextual knowledge, communication structure, and model reasoning. Each condition is described by three parameters. κ = 1 indicates that contextual and procedural knowledge is included, while κ = 0 indicates its absence. σ = 1 denotes an enhanced communication structure, and σ = 0 its absence. ω specifies the underlying model, either GPT-4o-2024-08-06 or o3-2025-04-16.
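The three parameters above can be written down as a small configuration record. The following is only a minimal sketch: the names `Condition`, `MODELS`, and `all_conditions` are hypothetical, and the full crossing of κ, σ, and ω is an assumption made for illustration; the two studies may use only a subset of these combinations.

```python
from dataclasses import dataclass
from itertools import product

# Model identifiers as written in the text; everything else here is illustrative.
MODELS = ("GPT-4o-2024-08-06", "o3-2025-04-16")


@dataclass(frozen=True)
class Condition:
    kappa: int  # 1 = contextual and procedural knowledge included, 0 = absent
    sigma: int  # 1 = enhanced communication structure, 0 = absent
    omega: str  # underlying model


def all_conditions():
    """Enumerate every (kappa, sigma, omega) combination.

    Assumption: a full factorial crossing, which the studies may not run.
    """
    return [Condition(k, s, m) for k, s, m in product((0, 1), (0, 1), MODELS)]


if __name__ == "__main__":
    for condition in all_conditions():
        print(condition)
```

Under the full-crossing assumption this yields eight conditions, two values each for κ, σ, and ω.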

Abstract

As humans move toward collaborating with coordinated robot teams, understanding how these teams coordinate and fail is essential for building trust and ensuring safety. However, exposing human collaborators to coordination failures during early-stage development is costly and risky, particularly in high-stakes domains such as healthcare. We adopt an agent-simulation approach in which all team roles, including the supervisory manager, are instantiated as LLM agents, allowing us to diagnose coordination failures before humans join the team. Using a controllable healthcare scenario, we conduct two studies with different hierarchical configurations to analyze coordination behaviors and failure patterns. Our findings reveal that team structure, rather than contextual knowledge or model capability, constitutes the primary bottleneck for coordination, and expose a tension between reasoning autonomy and system stability. By surfacing these failures in simulation, we prepare the groundwork for safe human integration. These findings inform the design of resilient robot teams with implications for process-level evaluation, transparent coordination protocols, and structured human integration.

Study 1: Evaluation
Study 1: Contextual Knowledge

We developed a knowledge base (KB) containing contextual and procedural knowledge as a shared resource, analogous to organizational documentation, to ground MARS team behavior and decision-making. We evaluated the effectiveness of this contextual knowledge on MARS performance at both the manager and subordinate levels across seven dimensions. Our analysis shows that five critical failure modes persist even with a detailed KB. Annotated example traces are provided in the section "Coordination Failure Modes" below. These findings indicate that while sufficient contextual knowledge is necessary, system structure remains the primary bottleneck for achieving robust coordination.

Study 2: Structure and Reasoning
Study 2: Reasoning Behavior Cards

We identify four major themes in MARS coordination patterns, each comprising several sub-themes. To contextualize these sub-themes, we annotate each with ‘✓’ or ‘✗’ to indicate whether its implications are positive or negative within our test scenario. We also report the frequency of each sub-theme across 20 traces for both GPT-4o and o3. We find distinct behavioral profiles that underscore trade-offs between reasoning and non-reasoning models. For each sub-theme, we provide representative examples with accompanying comments (green box: [What went well:], red box: [What went wrong:]) in the section "Reasoning Behavior Cards" below.
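As a rough illustration of the frequency tally described above, the sketch below counts how many traces exhibit each sub-theme. It rests on assumptions: that "frequency" means the number of traces in which a sub-theme appears at least once, and that each trace has already been annotated with a list of sub-theme labels. The function name `subtheme_frequency` and the labels `A1`, `B2`, `C3` are placeholders, not the sub-themes or data from the study.

```python
from collections import Counter


def subtheme_frequency(traces):
    """Count, for each sub-theme label, how many traces contain it at least once.

    `traces` is a list of label lists, one list per annotated trace.
    Assumption: a label repeated within one trace counts only once.
    """
    counts = Counter()
    for labels in traces:
        counts.update(set(labels))  # de-duplicate labels within a trace
    return counts


# Placeholder annotations for three hypothetical traces (not study data).
example_traces = [["A1", "B2"], ["A1", "A1"], ["C3"]]
freq = subtheme_frequency(example_traces)
```

Running the same tally separately on the traces for each model would give per-model counts that can be compared side by side.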

Coordination Failure Modes

Reasoning Behavior Cards

BibTeX

@misc{bai2025masmarscoordinationfailures,
  title={From MAS to MARS: Coordination Failures and Reasoning Trade-offs in Hierarchical Multi-Agent Robotic Systems within a Healthcare Scenario},
  author={Yuanchen Bai and Zijian Ding and Shaoyue Wen and Xiang Chang and Angelique Taylor},
  year={2025},
  eprint={2508.04691},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2508.04691},
}