
Introduction

MADDPG might be written more intuitively as "MA-DDPG", making it clear that it is the multi-agent version of DDPG. Its core idea can be summarized as "decentralized execution, centralized training." We found no existing open-source implementation on GitHub built on Stable-Baselines3 (a.k.a. SB3), which uses PyTorch as its underlying library. Therefore, we created this project, aiming to implement a robust and adaptable version of MADDPG with SB3.

DDPG

The DDPG we use as the base algorithm is actually TD3: the DDPG implementation recommended in SB3 is essentially TD3 under the hood.

TD3 is a widely used extension of DDPG that applies several engineering tricks on top of the algorithm from the original paper to improve performance.
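For reference, this is roughly how the stock single-agent TD3 from SB3 is used; the environment name and timestep count below are only illustrative.

```python
import gym
from stable_baselines3 import TD3

# Any single-agent environment with a continuous (Box) action space works here;
# "Pendulum-v1" is only an illustration.
env = gym.make("Pendulum-v1")

model = TD3("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)
```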

Multi-agent Concerns

TD3 is a single-agent algorithm. We therefore made the following changes to adapt it to the multi-agent setting:

  1. Customized both the actor and critic networks;

    Each agent has its own actor network based only on its own observation space; each agent also has its own independent critic network based on its own observation space and on global information from all agents.

  2. Expanded Q from shape (Batch Size, 1) to shape (Batch Size, 1, N), where N stands for the number of agents;

  3. Used a wrapper to wrap the multi-agent environment so that it can be trained as if it were a single-agent one;

    The wrapper aggregates the agents' observation spaces and action spaces into a single observation space and a single action space, realizing "centralized training"; and

  4. Customized the replay buffer to store interactions with an extra agent dimension (see the shape sketch below).
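A minimal sketch of the shape changes in items 2 and 4; the tensor names and sizes are illustrative, not the variable names used in the code.

```python
import torch

batch_size, n_agents, obs_dim, act_dim = 32, 2, 10, 1  # illustrative sizes

# Item 2: Q-values keep a per-agent dimension instead of a single scalar per sample.
q_vanilla = torch.zeros(batch_size, 1)                   # TD3:    (Batch Size, 1)
q_multi = torch.zeros(batch_size, 1, n_agents)           # MADDPG: (Batch Size, 1, N)

# Item 4: replay-buffer entries carry an extra agent dimension.
buffer_obs = torch.zeros(batch_size, n_agents, obs_dim)  # (Batch Size, N, obs_dim)
buffer_act = torch.zeros(batch_size, n_agents, act_dim)  # (Batch Size, N, act_dim)
```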

Start Up

We recommend reading the following pages, which explain the implementation and the ideas underlying the project. We hope they are helpful.

  1. TD3 in SB3

    How TD3 is implemented in Stable-Baselines3 (a.k.a. SB3); this helps explain why the changes above are needed.

    Two sections are contained: TD3 itself and its off-policy base class.

  2. Our Implementation

    This explains our implementation to you (and, of course, why we implement it that way). We also include some justification for our code to show that we closely follow the idea of the original paper.

  3. Adapt to Other Environments

    If you want to set up your own multi-agent environment, this page can be your guideline, demonstrating how to construct and adapt your environment.

Collaboration

You can switch the demonstration environment to another self-implemented multi-agent environment. It is also very welcome if you can help us refactor the code, add unit tests, or change the API to make it compatible with SB3.

We have no specific requirements on coding style or documentation. Still, we believe that maintaining a good, consistent style benefits all contributors.

Code structure

  • maddpg.py is the core implementation of the algorithm. It encapsulates a TD3 object and invokes the corresponding TD3 methods for training and evaluation.

  • env_wrapper.py implements the wrapper for multi-agent environments, as well as a vectorized multi-agent environment that supports multiple wrapped environments at the same time.

    The wrapper adjusts the dimensions and shapes of the spaces. On the input side, it aggregates the spaces of the individual agents into one overall space. During interaction, it splits the action provided by TD3 and delivers each piece to its agent, realizing one core element of the paper, "decentralized execution".

  • ma_policy.py implements the actor-critic networks for the multi-agent setting, corresponding to the other core idea of the paper, "centralized training".

  • main.py implements a demo based on the Pong-Duel environment from ma-gym.

  • The remaining files ending with the _test suffix are unit tests, all based on the unittest module.

Dimension issues

1. Prerequisite

  1. For the environments:

    The observation space can be either a list of gym.Box, each element representing one agent's observation space, or a single gym.Box whose 1st dimension indexes the agents.

    Say each agent's observation has shape (O1, O2, O3, ...). The overall observation space shall then be (N, O1, O2, O3, ...), or a list of length N whose elements have shape (O1, O2, O3, ...). See the sketch after this list.

    The same requirement applies to the action space.

  2. The action space can be discrete. In that case, the environment must present a LIST of gym.Discrete, each element representing one agent. Moreover, a mapper MUST be provided, of type Callable[[torch.Tensor], Any]. It maps the output of each actor network to something the agent can accept.

    For example, the Pong-Duel action is a list of integers, each taking the value 0, 1, or 2. However, the actor-network generates a Tensor of shape (3,) for each agent. Therefore, the mapper for this env is:

    lambda actions: torch.round(actions * 1.5 + 1).flatten().tolist()
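As an illustration of the space formats described above, declaring spaces for N agents might look like the following sketch (all concrete sizes are placeholders):

```python
import numpy as np
from gym import spaces

n_agents, obs_dim = 2, 10  # placeholder sizes

# Observation space, option A: one Box per agent, collected in a list of length N.
obs_space_list = [
    spaces.Box(low=-1.0, high=1.0, shape=(obs_dim,), dtype=np.float32)
    for _ in range(n_agents)
]

# Observation space, option B: a single Box whose 1st dimension indexes the agents.
obs_space_stacked = spaces.Box(
    low=-1.0, high=1.0, shape=(n_agents, obs_dim), dtype=np.float32
)

# Discrete actions must be given as a LIST of gym.Discrete, one per agent,
# together with a mapper such as the lambda above.
act_space_discrete = [spaces.Discrete(3) for _ in range(n_agents)]
```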

2. Observation space aggregation

This is implemented inside env_wrapper.py.

N agents can provide either a list of N gym.Box instances or a single gym.Box whose 1st dimension represents the agent indices.

In the second scenario the input Box is treated directly as the aggregated observation space, while in the first the Boxes are merged together to form a Box of the same shape as in the second case.
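A minimal sketch of what that merge amounts to; this is an illustration of the idea, not the actual code inside env_wrapper.py.

```python
import numpy as np
from gym import spaces

def merge_boxes(boxes):
    """Merge N per-agent Boxes into one Box with a leading agent dimension
    (illustration only, not the implementation in env_wrapper.py)."""
    lows = np.stack([b.low for b in boxes])    # shape (N, *obs_shape)
    highs = np.stack([b.high for b in boxes])
    return spaces.Box(low=lows, high=highs, dtype=boxes[0].dtype)

# Example: two agents with observations of shape (10,) -> a Box of shape (2, 10).
per_agent = [spaces.Box(-1.0, 1.0, shape=(10,), dtype=np.float32) for _ in range(2)]
assert merge_boxes(per_agent).shape == (2, 10)
```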

3. Action space aggregation

Three cases are allowed. 

  1. a list of N gym.Discrete
  2. a list of N gym.Box
  3. a single gym.Box whose 1st dimension is the indices of the N agents

The 3rd case is what the first two are converted into. As for the 1st, since TD3 only allows continuous action spaces, the list is converted to a Box of shape (N,) with lower and upper bounds of -1 and 1 respectively. It is the caller's responsibility to provide a mapper from (-1, 1) to the discrete actions.

In the 2nd case, the Boxes are merged into one like the input of case 3: the first dimension is the agent index, while the rest preserves the shape of each agent's action. Thus, we require all agents' actions to be of the same shape.
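A sketch of the conversion in the 1st case; the helper name is illustrative and not the function in env_wrapper.py.

```python
import numpy as np
from gym import spaces

def discrete_list_to_box(discrete_spaces):
    """Convert a list of N gym.Discrete spaces into a continuous Box of shape (N,)
    bounded by [-1, 1] (illustration only)."""
    n_agents = len(discrete_spaces)
    return spaces.Box(low=-1.0, high=1.0, shape=(n_agents,), dtype=np.float32)

# Two agents with 3 discrete actions each -> a continuous Box of shape (2,);
# the caller still has to map values in (-1, 1) back to {0, 1, 2}.
box = discrete_list_to_box([spaces.Discrete(3), spaces.Discrete(3)])
assert box.shape == (2,)
```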

4. Action

The observation aggregated in section 2 is then split along its 1st dimension so that each actor is only exposed to its own observation. Thus, the shape of each actor network's input has one dimension fewer than the aggregated observation space.

The actors' outputs are then stacked together to meet the action space defined in section 3.
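In tensor terms, the split-and-stack step looks roughly like this sketch (the linear layers stand in for the real actor networks):

```python
import torch

n_agents, obs_dim, act_dim = 2, 10, 1  # placeholder sizes

# Aggregated observation with a leading agent dimension (section 2).
agg_obs = torch.zeros(n_agents, obs_dim)

# Split along the 1st dimension: each actor sees only its own (obs_dim,) slice.
per_agent_obs = torch.unbind(agg_obs, dim=0)

# Placeholder actor networks, one per agent.
actors = [torch.nn.Linear(obs_dim, act_dim) for _ in range(n_agents)]
per_agent_act = [actor(obs) for actor, obs in zip(actors, per_agent_obs)]

# Stack the outputs so they match the aggregated action space of section 3.
agg_act = torch.stack(per_agent_act, dim=0)  # shape (N, act_dim)
```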

5. Q, rewards, and dones

To avoid mixing up the critics, Q is designed to be of shape (1, N), so the rewards are aggregated per agent as well. However, the signals representing whether an episode has been completed (a.k.a. dones) do not follow the same idea: as soon as one of the agents completes the episode, every other agent is reset. Thus, the dones are just of shape (1,).
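Shape-wise, a single environment step therefore produces something like the following (illustrative only):

```python
import numpy as np

n_agents = 2  # placeholder

# Per-agent rewards are kept separate, matching the per-agent Q of shape (1, N).
rewards = np.zeros((1, n_agents), dtype=np.float32)

# One done flag for the whole step: if any single agent finishes, everyone resets.
per_agent_done = [False, True]           # as reported by the raw multi-agent env
dones = np.array([any(per_agent_done)])  # shape (1,)
```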

What if I want an agent to be kicked out instead of ending the entire episode?

  1. When an agent is kicked out, instead of marking it as "done", make it stop responding to further interactions, and
  2. Give it a negative reward.