Publications

You can also find my articles on my Google Scholar profile.

Towards Fair and Efficient Policy Learning in Cooperative Multi-Agent Reinforcement Learning

Published in AAMAS 2025 (Extended Abstract), 2025

In this paper, we consider the problem of learning independent fair policies in cooperative multi-agent reinforcement learning (MARL). Our objective is to design multiple policies simultaneously that optimize a welfare function for fairness. To achieve this objective, we propose a novel Fairness-Aware multi-agent Proximal Policy Optimization (FAPPO) algorithm, which enables each agent to independently learn its policy while optimizing a welfare function. Unlike standard approaches that focus on maximizing a performance metric such as rewards, FAPPO focuses on fairness in an independent learning setting, where each agent estimates its local value function. When inter-agent communication is allowed, we further introduce an attention-based variant of FAPPO (AT-FAPPO), which incorporates a self-attention mechanism to facilitate communication and coordination among agents. This variant allows agents to share relevant information during training, leading to fairer outcomes. To evaluate the effectiveness of the proposed methods, we conduct experiments in various environments and show that our approach outperforms existing methods in terms of both efficiency and equity.

Recommended citation: Siddique, Umer, Peilang Li, and Yongcan Cao. "Towards Fair and Efficient Policy Learning in Cooperative Multi-Agent Reinforcement Learning." AAMAS 2025 (Extended Abstract).
Download Paper

From Explainability to Interpretability: Interpretable Policies in Reinforcement Learning Via Model Explanation

Published in Deployable AI Workshop @ AAAI 2025, 2025

Deep reinforcement learning (RL) has shown remarkable success in complex domains; however, the inherent black-box nature of deep neural network policies raises significant challenges in understanding and trusting the decision-making processes. While existing explainable RL methods provide local insights, they fail to deliver a global understanding of the model, particularly in high-stakes applications. To overcome this limitation, we propose a novel model-agnostic approach that bridges the gap between explainability and interpretability by leveraging Shapley values to transform complex deep RL policies into transparent representations. The proposed approach offers two key contributions: a novel method that employs Shapley values for policy interpretation beyond local explanations, and a general framework applicable to both off-policy and on-policy algorithms. We evaluate our approach with three existing deep RL algorithms and validate its performance in two classic control environments. The results demonstrate that our approach not only preserves the original models’ effectiveness but also generates more stable interpretable policies.
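
For context, the Shapley value that underlies this line of work attributes to each input feature its average marginal contribution over all feature coalitions; the surrogate models and approximations used in the paper are not reproduced here:

$$\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr),$$

where $N$ is the set of input features and $v(S)$ denotes the policy output when only the features in $S$ are present.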

Recommended citation: Li, Peilang, Umer Siddique, and Yongcan Cao. "From Explainability to Interpretability: Interpretable Policies in Reinforcement Learning Via Model Explanation." Deployable AI Workshop @ AAAI. 2025.
Download Paper

Fairness in Traffic Control: Decentralized Multi-agent Reinforcement Learning with Generalized Gini Welfare Functions

Published in MALTA Workshop @ AAAI 2025, 2025

In this paper, we address the issue of learning fair policies in decentralized cooperative multi-agent reinforcement learning (MARL), with a focus on traffic light control systems. We show that standard MARL methods that optimize the expected rewards often lead to unfair treatment across different intersections. To overcome this limitation, we aim to design control policies that optimize a generalized Gini welfare function that explicitly encodes two aspects of fairness: efficiency and equity. Specifically, we propose three novel adaptations of MARL baselines that enable agents to learn decentralized fair policies, where each agent estimates its local value function while contributing to welfare optimization. We validate our approaches through extensive experiments across six traffic control environments with varying complexities and traffic layouts. The results demonstrate that our proposed methods consistently outperform existing MARL approaches in terms of both efficiency and equity.
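
For reference, a generalized Gini welfare function over the per-intersection (per-agent) values $v_1,\dots,v_N$ takes the standard form below; the exact weights and notation may differ from the paper's:

$$\mathrm{GGF}_{\mathbf{w}}(\mathbf{v}) \;=\; \sum_{i=1}^{N} w_i\, v_{\sigma(i)}, \qquad v_{\sigma(1)} \le \dots \le v_{\sigma(N)},\quad w_1 > w_2 > \dots > w_N \ge 0,$$

so that larger weights on the worst-off components encode equity, while summing over all components preserves efficiency.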

Recommended citation: Siddique, Umer, Peilang Li, and Yongcan Cao. "Fairness in Traffic Control: Decentralized Multi-agent Reinforcement Learning with Generalized Gini Welfare Functions." MALTA Workshop @ AAAI. 2025.
Download Paper

From Fair Solutions to Compromise Solutions in Multi-Objective Deep Reinforcement Learning

Published in Neural Computing and Applications (NCAA), 2024

In this paper, we focus on multi-objective reinforcement learning (RL) where the expected vector returns are aggregated with a concave function. For this generic framework, which notably includes fair optimization in the multi-user setting and compromise optimization in the multi-criteria setting, we present several contributions. After a discussion of its theoretical properties (e.g., the need to resort to stochastic policies), we prove a general performance bound that justifies learning a policy using discounted rewards, even if a policy optimal for the average reward is desired. We extend several deep RL algorithms to our problem; notably, our adaptation of DQN can learn stochastic policies. In addition, to illustrate the generality of our framework, we consider in the multi-user setting a novel extension of fair optimization in deep RL where users have different entitlements. Our experimental results validate our propositions and also demonstrate the superiority of our approach over reward engineering in single-objective RL.
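
Schematically, the framework optimizes a concave aggregation $\psi$ of the expected vector return (the notation here is illustrative, not the paper's):

$$\max_{\pi}\; \psi\!\left(\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, \mathbf{r}_t\right]\right),$$

where $\mathbf{r}_t$ is the vector reward; choosing $\psi$ as a fair welfare function recovers the multi-user setting, while other concave choices yield compromise solutions in the multi-criteria setting.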

Recommended citation: Qian, Junqi, Umer Siddique, Guanbao Yu, and Paul Weng. "From Fair Solutions to Compromise Solutions in Multi-Objective Deep Reinforcement Learning." Neural Computing and Applications (NCAA). 2024.
Download Paper

Adaptive Event-triggered Reinforcement Learning Control for Complex Nonlinear Systems

Published in arXiv, 2024

In this paper, we propose an adaptive event-triggered reinforcement learning control approach for continuous-time nonlinear systems subject to bounded uncertainties and characterized by complex interactions. Specifically, the proposed method jointly learns both the control policy and the communication policy, thereby reducing the number of parameters and the computational overhead relative to learning them separately or learning only one of them. By augmenting the state space with accrued rewards that represent performance over the entire trajectory, we show that triggering conditions can be determined accurately and efficiently without the need to learn them explicitly, leading to an adaptive non-stationary policy. Finally, we provide several numerical examples to demonstrate the effectiveness of the proposed approach.
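
As a rough illustration of the state-augmentation idea above, the sketch below appends the accrued (discounted) reward to the observation of a Gymnasium environment; the wrapper name and the discrete-time, Gym-style setting are assumptions made for illustration and differ from the paper's continuous-time formulation.

```python
import numpy as np
import gymnasium as gym


class AccruedRewardWrapper(gym.Wrapper):
    """Hypothetical wrapper: append the discounted reward accrued so far to
    the observation, so a stationary policy over the augmented state can act
    non-stationarily over the original state. Assumes a Box observation space."""

    def __init__(self, env, gamma=0.99):
        super().__init__(env)
        self.gamma = gamma
        low = np.append(env.observation_space.low, -np.inf)
        high = np.append(env.observation_space.high, np.inf)
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float64)

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self.accrued, self.discount = 0.0, 1.0
        return np.append(obs, self.accrued), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self.accrued += self.discount * float(reward)
        self.discount *= self.gamma
        return np.append(obs, self.accrued), reward, terminated, truncated, info
```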

Recommended citation: Siddique, Umer, Abhinav Sinha, and Yongcan Cao. "Adaptive Event-triggered Reinforcement Learning Control for Complex Nonlinear Systems." arXiv preprint arXiv:2409.19769 (2024).
Download Paper

Opponent Transformer: Modeling Opponent Policies as a Sequence Problem

Published in Coordination and Cooperation for Multi-Agent Reinforcement Learning Methods Workshop @ RLC, 2024

The ability of an agent to understand the intentions of others in a multi-agent system, also called opponent modeling, is critical for the design of effective local control policies. One main challenge is the unavailability of other agents’ episodic trajectories at execution. To address this challenge, we propose a new approach that explicitly models the episodic trajectories of others. In particular, we cast the opponent modeling problem as a sequence modeling problem by conditioning a transformer model on the sequence of the agent’s local trajectory and predicting each opponent agent’s trajectory. To evaluate the effectiveness of the proposed approach, we conduct experiments using a set of multi-agent environments that capture both cooperative and competitive payoff structures. The results show that the proposed method provides better opponent modeling capabilities while achieving competitive or superior episodic returns.
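
As a rough illustration of the sequence-modeling formulation above, the PyTorch sketch below conditions a small Transformer encoder on the agent's local trajectory and predicts per-opponent action logits at each step; the module names, shapes, and discrete-action assumption are illustrative rather than the paper's architecture.

```python
import torch
import torch.nn as nn


class OpponentSequenceModel(nn.Module):
    """Illustrative sequence model: encode the agent's own trajectory and
    predict, at every timestep, logits over each opponent's next action."""

    def __init__(self, obs_dim, act_dim, n_opponents, d_model=64, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(obs_dim + act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, n_opponents * act_dim)
        self.n_opponents, self.act_dim = n_opponents, act_dim

    def forward(self, local_traj):
        # local_traj: (batch, T, obs_dim + act_dim), the agent's local history;
        # a causal attention mask is omitted here for brevity.
        h = self.encoder(self.embed(local_traj))   # (batch, T, d_model)
        logits = self.head(h)                      # (batch, T, n_opponents * act_dim)
        return logits.view(*logits.shape[:2], self.n_opponents, self.act_dim)
```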

Recommended citation: Wallace, Conor, Umer Siddique, and Yongcan Cao. "Opponent Transformer: Modeling Opponent Policies as a Sequence Problem." Coordination and Cooperation for Multi-Agent Reinforcement Learning Methods Workshop @ RLC. 2024.
Download Paper

Towards Fair and Equitable Policy Learning in Cooperative Multi-Agent Reinforcement Learning

Published in Coordination and Cooperation for Multi-Agent Reinforcement Learning Methods Workshop @ RLC, 2024

In this paper, we consider the problem of learning independent fair policies in cooperative multi-agent reinforcement learning (MARL). The objective is to design multiple policies simultaneously that optimize a welfare function for fairness. To achieve this objective, we propose a novel Fairness-Aware multi-agent Proximal Policy Optimization (FAPPO) algorithm, which learns individual policies for all agents separately and optimizes a welfare function to ensure fairness among them, in contrast to optimizing the discounted rewards. The proposed approach is shown to learn fair policies in the independent learning setting, where each agent estimates its local value function. When inter-agent communication is allowed, we further introduce an attention-based variant of FAPPO (AT-FAPPO), which incorporates a self-attention mechanism for inter-agent communication. This variant enables agents to communicate and coordinate their actions, potentially leading to fairer solutions by sharing relevant information during training. To evaluate the effectiveness of the proposed methods, we conduct experiments in two environments and show that our approach outperforms previous methods in terms of both efficiency and equity.

Recommended citation: Siddique, Umer, Peilang Li, and Yongcan Cao. "Towards Fair and Equitable Policy Learning in Cooperative Multi-Agent Reinforcement Learning." Coordination and Cooperation for Multi-Agent Reinforcement Learning Methods Workshop @ RLC. 2024.
Download Paper

Offline Reinforcement Learning with Failure Under Sparse Reward Environments

Published in 3rd International Conference on Computing and Machine Intelligence (ICMI), 2024

This paper presents a new reinforcement learning approach that leverages failed experiences in sparse reward environments. Unlike traditional reinforcement learning methods that rely on successful experiences or expert demonstrations, the proposed approach utilizes failed experiences to guide the policy update during learning. The primary objective of this work is to develop a method that can efficiently use failed experiences to guide the search direction, since directional cues from successful experiences may be limited in sparse reward environments. To achieve this objective, we introduce a new objective function that maximizes the dissimilarity between the RL agent’s actions and the actions from failed experiences. This discrepancy serves as a valuable signal to guide the agent’s exploration: by steering the policy away from failed actions, the agent gains a significant opportunity to refine its policy. We further employ hindsight experience replay (HER) to enhance the directional search by creating and achieving potential subgoals that align with the primary objectives. To assess the effectiveness of our method, we conduct experiments on three sparse reward environments. Our findings demonstrate that the proposed approach significantly enhances the agent’s learning efficiency and improves robustness to variations in demonstration quality compared to conventional reinforcement learning techniques.
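
Schematically, the idea of repelling the policy from failed actions can be written as an auxiliary term added to a standard RL objective; the exact dissimilarity measure and weighting are the paper's design choices and are not reproduced here:

$$\max_{\theta}\; J_{\mathrm{RL}}(\theta) \;+\; \lambda\, \mathbb{E}_{(s,\,a_f)\sim \mathcal{D}_{\mathrm{fail}}}\Bigl[ d\bigl(\pi_{\theta}(\cdot \mid s),\, a_f\bigr)\Bigr],$$

where $\mathcal{D}_{\mathrm{fail}}$ is the set of failed experiences and $d$ measures how dissimilar the policy's action distribution at $s$ is from the failed action $a_f$.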

Recommended citation: Wu, Mingkang, Umer Siddique, Abhinav Sinha, and Yongcan Cao. "Offline Reinforcement Learning with Failure Under Sparse Reward Environments." 3rd International Conference on Computing and Machine Intelligence (ICMI). 2024.
Download Paper

On Deep Reinforcement Learning for Target Capture Autonomous Guidance

Published in AIAA Guidance, Navigation, and Control Conference, 2024

This paper explores the prospects of motion planning for autonomous vehicles using deep reinforcement learning (DRL). We are particularly interested in a goal-directed setting where the need is to design an optimal guidance strategy for a pursuing autonomous vehicle (the pursuer, which is also the DRL agent) to capture an adversary (the target). To this end, we first formulate the target capture guidance problem as a Markov Decision Process (MDP), wherein the kinematics of relative motion between the vehicles constitute the MDP and the pursuer’s lateral acceleration (chosen as its steering control to account for turn constraints) is the action of the DRL agent. We show that a multifaceted reward function motivated by the collision conditions is sufficient and effective in designing the reinforcement learning action that enables the pursuer to capture the target regardless of the latter’s motion. We then empirically evaluate the performance of the trained agent in various target capture scenarios.

Recommended citation: Siddique, Umer, Abhinav Sinha, and Yongcan Cao. "On Deep Reinforcement Learning for Target Capture Autonomous Guidance." AIAA SCITECH 2024 Forum. 2024.
Download Paper

Fair Deep Reinforcement Learning with Generalized Gini Welfare Functions

Published in Adaptive Learning Agents Workshop @ AAMAS, 2023

Learning fair policies in reinforcement learning (RL) is important when the RL agent’s actions may impact many users. In this paper, we investigate a generalization of this problem where equity is still desired, but some users may be entitled to preferential treatment. We formalize this more sophisticated fair optimization problem in deep RL, provide some theoretical discussion of its difficulties, and explain how existing deep RL algorithms can be adapted to tackle it. Our algorithmic innovations notably include a state-augmented DQN-based method for learning stochastic policies, which also applies to the usual fair optimization setting without any preferential treatment. We empirically validate our propositions and analyze the experimental results on several application domains. This paper was selected as the best paper at the Adaptive Learning Agents Workshop @ AAMAS 2023.

Recommended citation: Yu, Guanbao, Umer Siddique, and Paul Weng. "Fair Deep Reinforcement Learning with Generalized Gini Welfare Functions." International Conference on Autonomous Agents and Multiagent Systems. Cham: Springer Nature Switzerland, 2023.
Download Paper

Fair deep reinforcement learning with preferential treatment

Published in European Conference on Artificial Intelligence (ECAI), 2023

Learning fair policies in reinforcement learning (RL) is important when the RL agent may impact many users. We investigate a variant of this problem where equity is still desired, but some users may be entitled to preferential treatment. In this paper, we formalize this more sophisticated fair optimization problem in deep RL using generalized fair social welfare functions (SWF), provide a theoretical discussion to justify our approach, explain how deep RL algorithms can be adapted to tackle it, and empirically validate our propositions on several domains. Our contributions are both theoretical and algorithmic, notably: (1) we obtain a general bound on the suboptimality gap, in terms of SWF-optimality under the average reward, of a policy that is SWF-optimal for the discounted reward, which justifies using standard deep RL algorithms even for the average reward; (2) our algorithmic innovations include a state-augmented DQN-based method for learning either deterministic or stochastic policies, which also applies to the usual fair optimization setting without any preferential treatment.

Recommended citation: Yu, Guanbao, Umer Siddique, and Paul Weng. "Fair Deep Reinforcement Learning with Preferential Treatment." ECAI. 2023.
Download Paper

Fairness in Preference-based Reinforcement Learning

Published in MFPL @ International Conference on Machine Learning, 2023

In this paper, we address the issue of fairness in preference-based reinforcement learning (PbRL) in the presence of multiple objectives. The main objective is to design control policies that can optimize multiple objectives while treating each objective fairly. Toward this goal, we design a new fairness-induced preference-based reinforcement learning method, FPbRL. The main idea of FPbRL is to learn vector reward functions associated with the multiple objectives via new welfare-based preferences, rather than the reward-based preferences of standard PbRL, coupled with policy learning via maximizing a generalized Gini welfare function. Finally, we provide experimental studies on three different environments to show that the proposed FPbRL approach can achieve both efficiency and equity for learning effective and fair policies.
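
Schematically, a welfare-based preference compares two trajectory segments through the welfare of their vector returns rather than through scalar reward sums (the notation here is illustrative):

$$\tau^{1} \succ \tau^{2} \;\Longleftrightarrow\; \mathrm{GGF}_{\mathbf{w}}\bigl(\mathbf{z}(\tau^{1})\bigr) \;>\; \mathrm{GGF}_{\mathbf{w}}\bigl(\mathbf{z}(\tau^{2})\bigr),$$

where $\mathbf{z}(\tau)$ is the vector return of segment $\tau$ and $\mathrm{GGF}_{\mathbf{w}}$ is a generalized Gini welfare function.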

Recommended citation: Siddique, Umer, Abhinav Sinha, and Yongcan Cao. "Fairness in Preference-based Reinforcement Learning." ICML 2023 Workshop The Many Facets of Preference-Based Learning. 2023.
Download Paper

Learning fair policies in decentralized cooperative multi-agent reinforcement learning

Published in International Conference on Machine Learning (ICML), 2021

In this paper, we consider the problem of learning fair policies in (deep) cooperative multi-agent reinforcement learning (MARL). We formalize it in a principled way as the problem of optimizing a welfare function that explicitly encodes two important aspects of fairness: efficiency and equity. We provide a theoretical analysis of the convergence of policy gradient for this problem. As a solution method, we propose a novel neural network architecture, which is composed of two sub-networks specifically designed for taking into account these two aspects of fairness. In experiments, we demonstrate the importance of the two sub-networks for fair optimization. Our overall approach is general as it can accommodate any (sub)differentiable welfare function. Therefore, it is compatible with various notions of fairness that have been proposed in the literature (e.g., lexicographic maximin, generalized Gini social welfare function, proportional fairness). Our method is generic and can be implemented in various MARL settings: centralized training and decentralized execution, or fully decentralized. We evaluate our method on a set of fair cooperative MARL benchmarks, where we show that it outperforms the state-of-the-art methods in terms of fairness and performance.

Recommended citation: Zimmer, Matthieu, et al. "Learning fair policies in decentralized cooperative multi-agent reinforcement learning." International Conference on Machine Learning. PMLR, 2021.
Download Paper

Learning fair policies in multi-objective (deep) reinforcement learning with average and discounted rewards

Published in International Conference on Machine Learning (ICML), 2020

As the operations of autonomous systems generally affect several users simultaneously, it is crucial that their designs account for fairness considerations. In contrast to standard (deep) reinforcement learning (RL), we investigate the problem of learning a policy that treats its users equitably. In this paper, we formulate this novel RL problem, in which an objective function encoding a notion of fairness that we formally define is optimized. For this problem, we provide a theoretical discussion where we examine the case of discounted rewards and that of average rewards. During this analysis, we notably derive a new result in the standard RL setting, which is of independent interest: it states a novel bound on the approximation error, with respect to the optimal average reward, of a policy that is optimal for the discounted reward. Since learning with discounted rewards is generally easier, this discussion further justifies finding a fair policy for the average reward by learning a fair policy for the discounted reward. We propose three novel deep RL adaptations to learn fair policies. We evaluate these methods on a set of fair multi-objective deep RL benchmarks, where we show that they outperform the state-of-the-art methods in terms of fairness and performance.

Recommended citation: Siddique, Umer, Paul Weng, and Matthieu Zimmer. "Learning fair policies in multi-objective (deep) reinforcement learning with average and discounted rewards." International Conference on Machine Learning. PMLR, 2020.
Download Paper | Download Slides