Recent progress on Transformers has led researchers to rethink sequential decision making.
This repo tracks literature and additional online resources on transformers for reinforcement learning and more general sequential decision making problems. We provide a short summary of each paper. Though we have tried our best to include all relevant works, it's possible that we might have missed your work. Please feel free to create an issue if you want your work to be added.
While we were preparing this repo, we noticed the Awesome-Decision-Transformer repo, which also covers the decision transformer literature. That repo does not provide paper summaries, but it lists the experiment environments used in each paper. We believe both repos are helpful for beginners getting started on Transformers for RL. If you find these resources useful, please follow and star both repos!
- Awesome Transformers for Sequential Decision Making
- Papers
- 🆕 ArXiv
- 🆕 NeurIPS'22
- Transformer-based Working Memory for Multiagent Reinforcement Learning with Action Parsing
- Pre-Trained Language Models for Interactive Decision-Making
- Masked Autoencoding for Scalable and Generalizable Decision Making
- UniMASK: Unified Inference in Sequential Decision Problems
- You Can’t Count on Luck: Why Decision Transformers Fail in Stochastic Environments
- Behavior Transformers: Cloning k modes with one stone
- On the Effect of Pre-training for Transformer in Different Modality on Offline Reinforcement Learning
- Multi-Game Decision Transformers
- Bootstrapped Transformer for Offline Reinforcement Learning
- Relational Reasoning via Set Transformers: Provable Efficiency and Applications to MARL
- Previous
- Stabilizing Transformers for Reinforcement Learning
- Representation Matters: Offline Pretraining for Sequential Decision Making
- Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation
- Decision Transformer: Reinforcement Learning via Sequence Modeling
- Offline Reinforcement Learning as One Big Sequence Modeling Problem
- Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation
- Generalized Decision Transformer for Offline Hindsight Information Matching
- Scene Transformer: A unified architecture for predicting multiple agent trajectories
- RvS: what is essential for offline RL via supervised learning?
- Online Decision Transformer
- Prompting Decision Transformer for Few-Shot Policy Generalization
- Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning
- Can Wikipedia help offline reinforcement learning?
- A Generalist Agent
- Towards Flexible Inference in Sequential Decision Problems via Bidirectional Transformers
- Phasic Self-Imitative Reduction for Sparse-Reward Goal-Conditioned Reinforcement Learning
- Deep Reinforcement Learning with Swin Transformer
- Efficient Planning in a Compact Latent Action Space
- Going Beyond Linear Transformers with Recurrent Fast Weight Programmers
- GPT-critic: offline reinforcement learning for end-to-end task-oriented dialogue systems
- Offline pre-trained multi-agent decision transformer: one big sequence model tackles all smac tasks
- Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL
- StARformer: Transformer with State-Action-Reward Representations for Robot Learning
- Switch Trajectory Transformer with Distributional Value Approximation for Multi-Task Reinforcement Learning
- Transfer learning with causal counterfactual reasoning in Decision Transformers
- Transformers are Adaptable Task Planners
- Transformers are Meta-Reinforcement Learners
- Transformers are Sample Efficient World Models
- Hierarchical Decision Transformer
- PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training
- When does return-conditioned supervised learning work for offline reinforcement learning?
- Deep Transformer Q-Networks for Partially Observable Reinforcement Learning
- Contextual Transformer for Offline Meta Reinforcement Learning
- MCTransformer: combining transformers and monte-carlo tree search for offline reinforcement learning
- Pretraining the vision transformer using self-supervised methods for vision based deep reinforcement learning
- Preference Transformer: Modeling Human Preferences using Transformers for RL
- Skill discovery decision transformer
- Decision transformer under random frame dropping
- Token turing machines
- SMART: self-supervised multi-task pretraining with control transformers
- Hyper-decision transformer for efficient online policy adaptation
- Multi-agent multi-game entity transformer
- Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels
- Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
- Behavior Cloned Transformers are Neurosymbolic Reasoners
- Exploiting Transformer in Reinforcement Learning for Interpretable Temporal Logic Motion Planning
- Transformers for One-Shot Visual Imitation
- Self-Attentional Credit Assignment for Transfer in Reinforcement Learning
- In-Context Reinforcement Learning with Algorithm Distillation
- Other Resources
- License
- Papers
# MaskDP
One limitation of DT is its requirement of a reward-labeled dataset. In this paper, the authors borrow ideas from masked language modeling and develop a method called MaskDP that pretrains transformers for sequential decision making without reward-labeled datasets. They show that both goal-reaching and offline RL can be achieved with different masking strategies at inference time. However, offline RL is slightly more complex than simple goal-reaching because the objective is to maximize return, so the authors also add a critic head and an actor head on top of the pretrained transformer backbone.
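As a toy illustration (not the actual MaskDP implementation), the sketch below shows how different inference-time masks over a state-action trajectory can express different tasks: masking everything between the current state and a goal state for goal-reaching, versus masking only the latest action for behavior-cloning-style prediction. All shapes here are made up for the example.

```python
import numpy as np

T, state_dim, action_dim = 6, 3, 2          # toy trajectory length and dims (hypothetical)
states = np.random.randn(T, state_dim)       # s_0 ... s_{T-1}
actions = np.random.randn(T, action_dim)     # a_0 ... a_{T-1}

# A mask entry of 1 means "given to the model", 0 means "to be predicted".
# Goal-reaching: the first state/action and the final (goal) state are given;
# everything in between is masked out and must be in-filled by the model.
goal_reaching_state_mask = np.zeros(T, dtype=int)
goal_reaching_state_mask[[0, T - 1]] = 1
goal_reaching_action_mask = np.zeros(T, dtype=int)
goal_reaching_action_mask[0] = 1

# Behavior-cloning-style mask: all past tokens are given,
# only the latest action is masked and predicted.
bc_state_mask = np.ones(T, dtype=int)
bc_action_mask = np.ones(T, dtype=int)
bc_action_mask[-1] = 0

print(goal_reaching_state_mask, goal_reaching_action_mask)
print(bc_state_mask, bc_action_mask)
```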
# Uni[MASK]
This paper also investigates pretraining for sequential decision making and, similar to MaskDP, the authors point out that many sequential decision making tasks can be achieved by different masking schemes. Together with MaskDP, Uni[MASK] could inspire a new paradigm for sequential decision making.
# ESPER
This paper studies why return-conditioned transformers such as DT fail in stochastic environments. The proposed method learns to cluster trajectories and conditions the policy on average cluster returns rather than on individual, luck-dependent trajectory returns.
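A schematic sketch of the conditioning idea (ESPER learns the clustering rather than using k-means, so treat the features and clustering choice below as placeholders): group trajectories, average returns within each group, and condition on the cluster-average return instead of the raw per-trajectory return.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
num_traj, feat_dim = 100, 8
traj_features = rng.normal(size=(num_traj, feat_dim))   # e.g. pooled trajectory embeddings (hypothetical)
traj_returns = rng.normal(size=num_traj)                 # observed (stochastic) per-trajectory returns

# Group trajectories, then average returns within each group.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(traj_features)
cluster_ids = kmeans.labels_
avg_cluster_return = {c: traj_returns[cluster_ids == c].mean() for c in np.unique(cluster_ids)}

# Condition the sequence model on the cluster-average return instead of the
# raw return of each individual trajectory.
conditioning_targets = np.array([avg_cluster_return[c] for c in cluster_ids])
print(conditioning_targets[:10])
```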
# BeT
The authors propose the Behavior Transformer (BeT) to model unlabeled demonstration data with multiple modes. It discretizes actions and predicts a continuous action-correction offset to model multi-modal continuous actions.
# On the Effect of Pre-training for Transformer in Different Modality on Offline Reinforcement Learning
# Multi-Game Decision Transformers
Similar to GATO, this paper studies applying a single transformer-based RL agent to play multiple games.
# Bootstrapped Transformer
To address the limited offline data, this paper uses a learned dynamics model to generate additional training data, i.e., a data augmentation approach. The Trajectory Transformer is used as the underlying model.
ICML'20 [Paper]
One of the first works to successfully apply transformers in RL settings. This work aims to replace the LSTMs used in online RL with Transformers. The authors observed that training large-scale transformers in RL settings is unstable, so they proposed the Gated Transformer-XL (GTrXL) architecture and showed that it outperforms LSTMs on the DMLab-30 benchmark with good training stability.
ICML'21 [Paper]
NeurIPS'21 [Paper]
NeurIPS'21 [Paper] [Code]
A seminal work that proposes a supervised learning framework based on transformers for sequential decision making, treating RL as a sequence generation task. Given a pre-collected sequential decision making dataset, the Decision Transformer (DT) is trained to generate the action sequence that leads to the desired return-to-go, which is provided as an input to the transformer model.
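A minimal sketch of the data preparation that DT-style methods rely on: compute returns-to-go from rewards with a backward pass and interleave them with states and actions into one token sequence. The dimensions below are illustrative.

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """R_t = r_t + gamma * r_{t+1} + ..., computed by a backward pass."""
    rtg = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

rewards = np.array([0.0, 1.0, 0.0, 2.0])
states = np.random.randn(4, 3)    # toy states (hypothetical dims)
actions = np.random.randn(4, 2)   # toy actions

rtg = returns_to_go(rewards)      # [3., 3., 2., 2.]
# DT consumes interleaved (return-to-go, state, action) tokens and is trained
# to predict each action given the tokens that precede it.
tokens = [x for t in range(4) for x in (rtg[t], states[t], actions[t])]
print(rtg)
```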
NeurIPS'21 [Paper] [Code]
This is another seminal work on applying transformers to RL, concurrent with Decision Transformer. The authors propose the Trajectory Transformer (TT), which combines transformers and beam search as a model-based approach to offline RL.
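For intuition, here is a generic beam search over a toy next-token scorer; TT uses its trained trajectory transformer as the scorer and folds predicted rewards/values into the score, so this is only a sketch of the planning loop, not the paper's exact procedure.

```python
import numpy as np

def toy_log_prob(prefix, token):
    """Stand-in for a trajectory transformer's next-token log-probability."""
    rng = np.random.default_rng(hash((tuple(prefix), token)) % (2**32))
    return float(rng.normal())

def beam_search(horizon=4, vocab_size=5, beam_width=3):
    beams = [([], 0.0)]                      # (token sequence, cumulative score)
    for _ in range(horizon):
        candidates = []
        for seq, score in beams:
            for tok in range(vocab_size):
                candidates.append((seq + [tok], score + toy_log_prob(seq, tok)))
        # Keep only the highest-scoring sequences; TT scores candidates with
        # likelihood plus predicted rewards/values rather than likelihood alone.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

print(beam_search())
```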
ICLR'21 [Paper]
This paper introduces a distillation procedure that transfers learning progress from a large-capacity learner model to a small-capacity actor model. The proposed method reduces the inference latency of the deployed RL agent.
ICLR'22 [Paper] [Code]
The paper derives an RL problem formulation called Hindsight Information Matching (HIM) from many recently proposed RL algorithms that use future trajectory information to accelerate the learning of a conditional policy. The authors discuss three HIM variants: Generalized DT, Categorical DT, and Bi-Directional DT.
ICLR'22 [Paper]
DT solves reinforcement learning through supervised learning, and it was hypothesized that the large model capacity of transformers leads to better policies. The authors of this paper challenge this hypothesis and show that a simple two-layer feedforward MLP achieves performance comparable to transformer-based methods. The findings imply that current designs of transformer-based reinforcement learning algorithms may not fully leverage the potential advantages of transformers.
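A minimal PyTorch sketch of such an outcome-conditioned MLP policy (dimensions and conditioning variable are hypothetical): the policy simply maps the concatenation of state and desired outcome to an action.

```python
import torch
import torch.nn as nn

state_dim, cond_dim, action_dim = 17, 1, 6   # hypothetical dims

# A small feedforward policy conditioned on an outcome (return or goal).
policy = nn.Sequential(
    nn.Linear(state_dim + cond_dim, 256),
    nn.ReLU(),
    nn.Linear(256, action_dim),
)

state = torch.randn(32, state_dim)
outcome = torch.rand(32, cond_dim)            # desired normalized return-to-go
action = policy(torch.cat([state, outcome], dim=-1))
print(action.shape)                           # torch.Size([32, 6])
```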
This work combines offline pretraining and online finetuning.
The authors introduce prompts to DT for few-shot policy learning.
This work combines a VAE and TT for policy learning in stochastic environments.
Training transformers on RL datasets from scratch can lead to slow convergence. This paper studies whether it is possible to transfer knowledge from the vision and language domains to offline RL tasks. The authors show that Wikipedia pretraining can improve convergence by 3-6x.
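A sketch of the general recipe, assuming the Hugging Face transformers library is used: initialize the sequence model from language-pretrained GPT-2 weights and feed projected trajectory embeddings through `inputs_embeds`. The projection layer and dimensions here are illustrative, not the paper's exact setup.

```python
import torch
from transformers import GPT2Model

# Initialize the trajectory model from language-pretrained weights instead of from scratch.
backbone = GPT2Model.from_pretrained("gpt2")          # 768-dim hidden states

# Trajectory tokens (returns/states/actions) are projected into the language
# model's embedding space and passed as inputs_embeds; dims are illustrative.
batch, seq_len, traj_dim = 4, 60, 32
project_in = torch.nn.Linear(traj_dim, backbone.config.n_embd)
traj = torch.randn(batch, seq_len, traj_dim)

hidden = backbone(inputs_embeds=project_in(traj)).last_hidden_state
print(hidden.shape)                                    # (4, 60, 768)
```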
arXiv [Paper]
A transformer-based RL agent (GATO) is trained on multi-modal data to perform robot manipulation, chat, play Atari games, and caption images with a single set of weights. The agent determines what to output based on its context.
ICLR'22 Generalizable Policy Learning in the Physical World Workshop [Paper]
Applied random masking to pretrain transformers for RL.
ICML'22 [Paper]
This work combines online RL and offline SL. The online phase is used for both RL training and data collection; in the offline phase, only successful trajectories are used for SL. The authors show that this approach performs well in sparse-reward settings. They also tested DT for the SL phase and found that it was brittle and performed worse than simple BC, which shows that DT training stability requires more research.
arXiv [Paper]
This paper studies replacing the convolutional neural networks used in online RL with the Swin Transformer and shows that doing so leads to better performance.
arXiv [Paper]
This work combines VQ-VAE with TT to allow efficient planning in the latent space.
arXiv [Paper]
A new transformer architecture is proposed, and RL experiments show large improvements over LSTMs in several Atari games.
ICLR'22 [Paper]
GPT-2 trained in an offline RL manner for dialogue generation.
arXiv [Paper]
The authors study offline pre-training and online finetuning in the MARL setting and show that offline pretraining significantly improves sample efficiency.
Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL
arXiv [Paper]
The original DT relies entirely on supervised learning to learn a return-conditioned behavior policy. By conditioning on a larger return value, DT could in principle obtain returns greater than the maximum return in the offline dataset. However, the RCSL framework does not perform trajectory stitching, i.e., combining sub-trajectories of multiple sub-optimal trajectories into an optimal trajectory. In this paper, the authors combine Q-learning and DT: the estimated Q-values are used to relabel the return-to-go values in the training data.
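The snippet below is a schematic version of this relabeling idea (not the paper's exact rule): during a backward pass over a trajectory, the return-to-go label is allowed to be raised by a learned value estimate, which mimics stitching toward better continuations. The value estimates here are made up for the example.

```python
import numpy as np

def relabel_rtg(rewards, value_estimates):
    """Backward pass that lets a learned value estimate raise the return-to-go
    labels (schematic sketch, not the paper's exact relabeling rule)."""
    T = len(rewards)
    rtg = np.zeros(T)
    future = 0.0
    for t in reversed(range(T)):
        # Either keep following the logged trajectory, or trust the critic's
        # estimate of what is achievable from this state.
        future = max(rewards[t] + future, value_estimates[t])
        rtg[t] = future
    return rtg

rewards = np.array([0.0, 0.0, 1.0])
values = np.array([2.0, 1.5, 1.0])            # hypothetical learned value estimates
print(relabel_rtg(rewards, values))           # [2., 1.5, 1.]
```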
arXiv [Paper]
Proposes a transformer architecture that learns state-action-reward representations for robot learning.
Switch Trajectory Transformer with Distributional Value Approximation for Multi-Task Reinforcement Learning
arXiv [Paper]
This work targets multi-task offline RL and models the value function as a distribution.
arXiv [Paper]
The authors leverage the causal knowledge of a source environment's structure to generate a set of counterfactual environments, improving the agent's adaptability to new environments.
arXiv [Paper]
Prompt-based task planning.
arXiv [Paper]
Applied transformers for meta-RL.
With the goal of improving the sample efficiency of RL methods, the authors build a transformer world model on Atari environments. They borrow ideas from VQGAN and DALL-E to map raw image pixels to a much smaller number of image tokens, which are used as the input to an autoregressive transformer. After training the transformer world model, the RL agent then learns exclusively from the model's imaginations.
arXiv [Paper]
The original DT strongly depends on a carefully chosen return-to-go as the initial input to condition on. To address this challenge, this work proposes predicting subgoals (or options) in place of the return-to-go. Two transformers are trained together: one predicts the subgoals, and the other predicts actions conditioned on the subgoals. Through experiments on D4RL, the authors show that this hierarchical approach can outperform the original DT, especially in tasks that involve long episodes.
arXiv [Paper]
A generative transformer-based architecture for pretraining with robot data in a self-supervised manner.
arXiv [Paper]
Focusing on offline reinforcement learning, the authors study the capabilities and limitations of return-conditioned supervised learning (RCSL). They find that RCSL requires assumptions stronger than those of dynamic programming to return optimal policies; specifically, RCSL requires nearly deterministic dynamics and properly chosen condition values. The authors conclude that RCSL alone is unlikely to be a general solution for offline RL, although it may perform well with high-quality behavior data.
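The nearly-deterministic-dynamics requirement can be seen in a tiny simulation: in a two-armed bandit with a rare high payoff, conditioning on the highest observed return makes a return-conditioned policy pick the lucky arm that has the lower expected reward. The numbers below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Behavior data from a uniform policy in a two-armed bandit:
# arm 0 pays 1 with probability 0.1 (else 0); arm 1 always pays 0.5.
actions = rng.integers(0, 2, size=n)
rewards = np.where(actions == 0, (rng.random(n) < 0.1).astype(float), 0.5)

# Return-conditioned supervised learning: estimate P(action | return).
target_return = 1.0
chosen = actions[rewards == target_return]
print("P(arm 0 | return = 1) =", (chosen == 0).mean())   # 1.0: only arm 0 ever hits 1
print("expected reward of arm 0:", 0.1, " arm 1:", 0.5)
```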
OpenReview Submission to ICLR'23 [Paper]
Recurrent neural networks are often used to encode an agent's history when solving POMDP tasks. This paper proposes replacing the recurrent neural networks with transformers. Results show that transformers can solve POMDPs faster and more stably than methods based on recurrent neural networks.
Foundation Models for Deicsion Making Workshop, NeurIPS'22 [Paper]
This paper proposes an approach for learning context vectors that can be used as prompts for transformers. With these prompts, the authors develop a contextual meta transformer that leverages the prompt as task context to improve performance on unseen tasks.
MCTransformer: combining transformers and monte-carlo tree search for offline reinforcement learning
OpenReview Submission to ICLR'23 [Paper]
The authors combine transformers and MCTS for efficient online finetuning. MCTS is used as an effective approach to balance exploration and exploitation.
Pretraining the vision transformer using self-supervised methods for vision based deep reinforcement learning
OpenReview Submission to ICLR'23 [Paper]
This work replaces the CNNs used in image-based RL agents with pre-trained Vision Transformers. Interestingly, the authors found that Vision Transformers still perform similarly to or worse than CNNs.
OpenReview Submission to ICLR'23 [Paper]
OpenReview Submission to ICLR'23 [Paper]
This work applies unsupervised skill discovery to DT. The skill embedding is used as an input to the DT, which can be thought of as a hierarchical RL approach.
OpenReview Submission to ICLR'23 [Paper]
OpenReview Submission to ICLR'23 [Paper]
OpenReview Submission to ICLR'23 [Paper]
OpenReview Submission to ICLR'23 [Paper]
This work focuses on adapting DT to unseen novel tasks. An adaptation module is added to the DT, with its parameters initialized by a hyper-network. When adapting to a new task, only the parameters of the adaptation module are finetuned. The results show that adapting only this module leads to faster learning than finetuning the full model.
OpenReview Submission to ICLR'23 [Paper]
arXiv'22 [Paper]
The authors compared ViTs and CNNs in image-based DRL tasks. They found that CNNs still perform better than ViTs.
arXiv'22 [Paper]
The authors apply semi-supervised imitation learning to enable agents to learn to act by watching unlabeled online videos.
arXiv'22 [Paper]
arXiv'22 [Paper]
CoRL'21 [Paper]
IJCAI'20 [Paper]
arXiv'22 [Paper]
- Amazon Accessible RL SDK: an open-source Python package for sequential decision making with transformers.
- Stanford CS25: Decision Transformers Lecture
- Benchmark Environments
This repo is released under Apache License 2.0.