Recent progress on Transformers has led researchers to rethink sequential decision making.
This repo tracks literature and additional online resources on transformers for reinforcement learning and more general sequential decision making problems. We provide a short summary of each paper. Though we have tried our best to include all relevant works, it's possible that we might have missed your work. Please feel free to create an issue if you want your work to be added.
While we were preparing this repo, we noticed the Awesome-Decision-Transformer repo, which also covers the decision transformer literature. That repo does not provide paper summaries, but it lists the experiment environments used in each paper. We believe both repos are helpful for beginners getting started on Transformers for RL. If you find these resources useful, please follow and star both repos!
- Awesome Transformers for Sequential Decision Making
- Papers
- 🆕 ArXiv
- 🆕 NeurIPS'22
- Transformer-based Working Memory for Multiagent Reinforcement Learning with Action Parsing
- Pre-Trained Language Models for Interactive Decision-Making
- Masked Autoencoding for Scalable and Generalizable Decision Making
- UniMASK: Unified Inference in Sequential Decision Problems
- You Can’t Count on Luck: Why Decision Transformers Fail in Stochastic Environments
- Behavior Transformers: Cloning k modes with one stone
- On the Effect of Pre-training for Transformer in Different Modality on Offline Reinforcement Learning
- Multi-Game Decision Transformers
- Bootstrapped Transformer for Offline Reinforcement Learning
- Relational Reasoning via Set Transformers: Provable Efficiency and Applications to MARL
- Previous
- Stabilizing Transformers for Reinforcement Learning
- Representation Matters: Offline Pretraining for Sequential Decision Making
- Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation
- Decision Transformer: Reinforcement Learning via Sequence Modeling
- Offline Reinforcement Learning as One Big Sequence Modeling Problem
- Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation
- Generalized Decision Transformer for Offline Hindsight Information Matching
- Scene Transformer: A unified architecture for predicting multiple agent trajectories
- RvS: what is essential for offline RL via supervised learning?
- Online Decision Transformer
- Prompting Decision Transformer for Few-Shot Policy Generalization
- Addressing Optimism Bias in Sequence Modeling for Reinforcement Learning
- Can Wikipedia help offline reinforcement learning?
- A Generalist Agent
- Towards Flexible Inference in Sequential Decision Problems via Bidirectional Transformers
- Phasic Self-Imitative Reduction for Sparse-Reward Goal-Conditioned Reinforcement Learning
- Deep Reinforcement Learning with Swin Transformer
- Efficient Planning in a Compact Latent Action Space
- Going Beyond Linear Transformers with Recurrent Fast Weight Programmers
- GPT-critic: offline reinforcement learning for end-to-end task-oriented dialogue systems
- Offline pre-trained multi-agent decision transformer: one big sequence model tackles all smac tasks
- Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL
- StARformer: Transformer with State-Action-Reward Representations for Robot Learning
- Switch Trajectory Transformer with Distributional Value Approximation for Multi-Task Reinforcement Learning
- Transfer learning with causal counterfactual reasoning in Decision Transformers
- Transformers are Adaptable Task Planners
- Transformers are Meta-Reinforcement Learners
- Transformers are Sample Efficient World Models
- Hierarchical Decision Transformer
- PACT: Perception-Action Causal Transformer for Autoregressive Robotics Pre-Training
- When does return-conditioned supervised learning work for offline reinforcement learning?
- Deep Transformer Q-Networks for Partially Observable Reinforcement Learning
- Contextual Transformer for Offline Meta Reinforcement Learning
- MCTransformer: combining transformers and monte-carlo tree search for offline reinforcement learning
- Pretraining the vision transformer using self-supervised methods for vision based deep reinforcement learning
- Preference Transformer: Modeling Human Preferences using Transformers for RL
- Skill discovery decision transformer
- Decision transformer under random frame dropping
- Token turing machines
- SMART: self-supervised multi-task pretraining with control transformers
- Hyper-decision transformer for efficient online policy adaptation
- Multi-agent multi-game entity transformer
- Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels
- Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
- Behavior Cloned Transformers are Neurosymbolic Reasoners
- Exploiting Transformer in Reinforcement Learning for Interpretable Temporal Logic Motion Planning
- Transformers for One-Shot Visual Imitation
- Self-Attentional Credit Assignment for Transfer in Reinforcement Learning
- In-Context Reinforcement Learning with Algorithm Distillation
- Other Resources
- License
- Papers
# MaskDP
One limitation of DT is its requirement of a reward-labeled dataset. In this paper, the authors borrow ideas from masked language modeling and develop a method called MaskDP that pretrains transformers for sequential decision making without reward-labeled datasets. They show that both goal-reaching and offline RL can be achieved with different masking strategies at inference time. However, offline RL is slightly more complex than simple goal-reaching because the objective is to maximize return, so the authors also add a critic head and an actor head on top of the pretrained transformer backbone.
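As a toy illustration (not the actual MaskDP implementation), the sketch below shows how different inference-time masks over a state-action trajectory can express different tasks: masking everything between the current state and a goal state for goal-reaching, versus masking only the latest action for behavior-cloning-style prediction. All shapes here are made up for the example.

```python
import numpy as np

T, state_dim, action_dim = 6, 3, 2          # toy trajectory length and dims (hypothetical)
states = np.random.randn(T, state_dim)       # s_0 ... s_{T-1}
actions = np.random.randn(T, action_dim)     # a_0 ... a_{T-1}

# A mask entry of 1 means "given to the model", 0 means "to be predicted".
# Goal-reaching: the first state/action and the final (goal) state are given;
# everything in between is masked out and must be in-filled by the model.
goal_reaching_state_mask = np.zeros(T, dtype=int)
goal_reaching_state_mask[[0, T - 1]] = 1
goal_reaching_action_mask = np.zeros(T, dtype=int)
goal_reaching_action_mask[0] = 1

# Behavior-cloning-style mask: all past tokens are given,
# only the latest action is masked and predicted.
bc_state_mask = np.ones(T, dtype=int)
bc_action_mask = np.ones(T, dtype=int)
bc_action_mask[-1] = 0

print(goal_reaching_state_mask, goal_reaching_action_mask)
print(bc_state_mask, bc_action_mask)
```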
# Uni[MASK]
This paper also investigates pretraining for sequential decision making and, similar to MaskDP, the authors point out that many sequential decision making tasks can be achieved by different masking schemes. Together with MaskDP, Uni[MASK] could inspire a new paradigm for sequential decision making.
# ESPER
This paper studies why return-conditioned transformers such as DT fail in stochastic environments. The proposed method learns to cluster trajectories and conditions the policy on average cluster returns rather than on individual, luck-dependent trajectory returns.
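A schematic sketch of the conditioning idea (ESPER learns the clustering rather than using k-means, so treat the features and clustering choice below as placeholders): group trajectories, average returns within each group, and condition on the cluster-average return instead of the raw per-trajectory return.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
num_traj, feat_dim = 100, 8
traj_features = rng.normal(size=(num_traj, feat_dim))   # e.g. pooled trajectory embeddings (hypothetical)
traj_returns = rng.normal(size=num_traj)                 # observed (stochastic) per-trajectory returns

# Group trajectories, then average returns within each group.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(traj_features)
cluster_ids = kmeans.labels_
avg_cluster_return = {c: traj_returns[cluster_ids == c].mean() for c in np.unique(cluster_ids)}

# Condition the sequence model on the cluster-average return instead of the
# raw return of each individual trajectory.
conditioning_targets = np.array([avg_cluster_return[c] for c in cluster_ids])
print(conditioning_targets[:10])
```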
# BeT
The authors propose the Behavior Transformer (BeT) to model unlabeled demonstration data with multiple modes. It discretizes actions and predicts a continuous action-correction offset to model multi-modal continuous actions.
# On the Effect of Pre-training for Transformer in Different Modality on Offline Reinforcement Learning
# Multi-Game Decision Transformers
Similar to GATO, this paper studies applying a single transformer-based RL agent to play multiple games.
# Bootstrapped Transformer
To address the limited offline data, this paper uses a learned dynamics model to generate additional training data, i.e., a data augmentation approach. The Trajectory Transformer is used as the underlying model.
ICML'20 [Paper]
One of the first works to successfully apply transformers in RL settings. This work aims to replace the LSTMs used in online RL with Transformers. The authors observed that training large-scale transformers in RL settings is unstable, so they proposed the Gated Transformer-XL (GTrXL) architecture and showed that it outperforms LSTMs on the DMLab-30 benchmark with good training stability.
ICML'21 [Paper]
NeurIPS'21 [Paper]
NeurIPS'21 [Paper] [Code]
A seminal work that proposes a supervised learning framework based on transformers for sequential decision making, treating RL as a sequence generation task. Given a pre-collected sequential decision making dataset, the Decision Transformer (DT) is trained to generate the action sequence that leads to the desired return-to-go, which is provided as an input to the transformer model.
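A minimal sketch of the data preparation that DT-style methods rely on: compute returns-to-go from rewards with a backward pass and interleave them with states and actions into one token sequence. The dimensions below are illustrative.

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """R_t = r_t + gamma * r_{t+1} + ..., computed by a backward pass."""
    rtg = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

rewards = np.array([0.0, 1.0, 0.0, 2.0])
states = np.random.randn(4, 3)    # toy states (hypothetical dims)
actions = np.random.randn(4, 2)   # toy actions

rtg = returns_to_go(rewards)      # [3., 3., 2., 2.]
# DT consumes interleaved (return-to-go, state, action) tokens and is trained
# to predict each action given the tokens that precede it.
tokens = [x for t in range(4) for x in (rtg[t], states[t], actions[t])]
print(rtg)
```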
NeurIPS'21 [Paper] [Code]
This is another seminal work on applying transformers to RL, concurrent with Decision Transformer. The authors propose the Trajectory Transformer (TT), which combines transformers and beam search as a model-based approach to offline RL.
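For intuition, here is a generic beam search over a toy next-token scorer; TT uses its trained trajectory transformer as the scorer and folds predicted rewards/values into the score, so this is only a sketch of the planning loop, not the paper's exact procedure.

```python
import numpy as np

def toy_log_prob(prefix, token):
    """Stand-in for a trajectory transformer's next-token log-probability."""
    rng = np.random.default_rng(hash((tuple(prefix), token)) % (2**32))
    return float(rng.normal())

def beam_search(horizon=4, vocab_size=5, beam_width=3):
    beams = [([], 0.0)]                      # (token sequence, cumulative score)
    for _ in range(horizon):
        candidates = []
        for seq, score in beams:
            for tok in range(vocab_size):
                candidates.append((seq + [tok], score + toy_log_prob(seq, tok)))
        # Keep only the highest-scoring sequences; TT scores candidates with
        # likelihood plus predicted rewards/values rather than likelihood alone.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]

print(beam_search())
```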
ICLR'21 [Paper]
This paper introduces a distillation procedure that transfers learning progress from a large-capacity learner model to a small-capacity actor model. The proposed method reduces the inference latency of the deployed RL agent.
ICLR'22 [Paper] [Code]
The paper derives an RL problem formulation called Hindsight Information Matching (HIM) from many recently proposed RL algorithms that use future trajectory information to accelerate the learning of a conditional policy. The authors discuss three HIM variants: Generalized DT, Categorical DT, and Bi-Directional DT.
ICLR'22 [Paper]
DT solves reinforcement learning through supervised learning, and it was hypothesized that the large model capacity of transformers leads to better policies. The authors of this paper challenge this hypothesis and show that a simple two-layer feedforward MLP achieves performance comparable to transformer-based methods. The findings imply that current designs of transformer-based reinforcement learning algorithms may not fully leverage the potential advantages of transformers.
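A minimal PyTorch sketch of such an outcome-conditioned MLP policy (dimensions and conditioning variable are hypothetical): the policy simply maps the concatenation of state and desired outcome to an action.

```python
import torch
import torch.nn as nn

state_dim, cond_dim, action_dim = 17, 1, 6   # hypothetical dims

# A small feedforward policy conditioned on an outcome (return or goal).
policy = nn.Sequential(
    nn.Linear(state_dim + cond_dim, 256),
    nn.ReLU(),
    nn.Linear(256, action_dim),
)

state = torch.randn(32, state_dim)
outcome = torch.rand(32, cond_dim)            # desired normalized return-to-go
action = policy(torch.cat([state, outcome], dim=-1))
print(action.shape)                           # torch.Size([32, 6])
```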
This work combines offline pretraining and online finetuning.
The authors introduce prompts to DT for few-shot policy learning.
This work combines a VAE and TT for policy learning in stochastic environments.
Training transformers on RL datasets from scratch can lead to slow convergence. This paper studies whether it is possible to transfer knowledge from the vision and language domains to offline RL tasks. The authors show that Wikipedia pretraining can improve convergence by 3-6x.
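A sketch of the general recipe, assuming the Hugging Face transformers library is used: initialize the sequence model from language-pretrained GPT-2 weights and feed projected trajectory embeddings through `inputs_embeds`. The projection layer and dimensions here are illustrative, not the paper's exact setup.

```python
import torch
from transformers import GPT2Model

# Initialize the trajectory model from language-pretrained weights instead of from scratch.
backbone = GPT2Model.from_pretrained("gpt2")          # 768-dim hidden states

# Trajectory tokens (returns/states/actions) are projected into the language
# model's embedding space and passed as inputs_embeds; dims are illustrative.
batch, seq_len, traj_dim = 4, 60, 32
project_in = torch.nn.Linear(traj_dim, backbone.config.n_embd)
traj = torch.randn(batch, seq_len, traj_dim)

hidden = backbone(inputs_embeds=project_in(traj)).last_hidden_state
print(hidden.shape)                                    # (4, 60, 768)
```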
arXiv [Paper]
A transformer-based RL agent (GATO) is trained on multi-modal data to perform robot manipulation, chat, play Atari games, and caption images with a single set of weights. The agent determines what to output based on its context.
ICLR'22 Generalizable Policy Learning in the Physical World Workshop [Paper]
Applied random masking to pretrain transformers for RL.
ICML'22 [Paper]
This work combines online RL and offline SL. The online phase is used for both RL training and data collection; in the offline phase, only successful trajectories are used for SL. The authors show that this approach performs well in sparse-reward settings. They also tested DT for the SL phase and found that it was brittle and performed worse than simple BC, which shows that DT training stability requires more research.
arXiv [Paper]
This paper studies replacing the convolutional neural networks used in online RL with the Swin Transformer and shows that doing so leads to better performance.
arXiv [Paper]
This work combines VQ-VAE with TT to allow efficient planning in the latent space.
arXiv [Paper]
A new transformer architecture is proposed, and RL experiments show large improvements over LSTMs in several Atari games.
ICLR'22 [Paper]
GPT-2 trained in an offline RL manner for dialogue generation.
arXiv [Paper]
The authors study offline pre-training and online finetuning in the MARL setting and show that offline pretraining significantly improves sample efficiency.
Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL
arXiv [Paper]
The original DT relies entirely on supervised learning to learn a return-conditioned behavior policy. By conditioning on a larger return value, DT could in principle obtain returns greater than the maximum return in the offline dataset. However, the RCSL framework does not perform trajectory stitching, i.e., combining sub-trajectories of multiple sub-optimal trajectories into an optimal trajectory. In this paper, the authors combine Q-learning and DT: the estimated Q-values are used to relabel the return-to-go values in the training data.
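The snippet below is a schematic version of this relabeling idea (not the paper's exact rule): during a backward pass over a trajectory, the return-to-go label is allowed to be raised by a learned value estimate, which mimics stitching toward better continuations. The value estimates here are made up for the example.

```python
import numpy as np

def relabel_rtg(rewards, value_estimates):
    """Backward pass that lets a learned value estimate raise the return-to-go
    labels (schematic sketch, not the paper's exact relabeling rule)."""
    T = len(rewards)
    rtg = np.zeros(T)
    future = 0.0
    for t in reversed(range(T)):
        # Either keep following the logged trajectory, or trust the critic's
        # estimate of what is achievable from this state.
        future = max(rewards[t] + future, value_estimates[t])
        rtg[t] = future
    return rtg

rewards = np.array([0.0, 0.0, 1.0])
values = np.array([2.0, 1.5, 1.0])            # hypothetical learned value estimates
print(relabel_rtg(rewards, values))           # [2., 1.5, 1.]
```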
arXiv [Paper]
Proposes a transformer architecture that learns state-action-reward representations for robot learning.
Switch Trajectory Transformer with Distributional Value Approximation for Multi-Task Reinforcement Learning
arXiv [Paper]
This work targets multi-task offline RL and models the value function as a distribution.
arXiv [Paper]
The authors leverage the causal knowledge of a source environment's structure to generate a set of counterfactual environments, improving the agent's adaptability to new environments.
arXiv [Paper]
Prompt-based task planning.
arXiv [Paper]
Applied transformers for meta-RL.
With the goal of improving the sample efficiency of RL methods, the authors build a transformer world model on Atari environments. They borrow ideas from VQGAN and DALL-E to map raw image pixels to a much smaller number of image tokens, which are used as the input to an autoregressive transformer. After training the transformer world model, the RL agent then learns exclusively from the model's imaginations.
arXiv [Paper]
The original DT strongly depends on a carefully chosen return-to-go as the initial input to condition on. To address this challenge, this work proposes predicting subgoals (or options) in place of the return-to-go. Two transformers are trained together: one predicts the subgoals, and the other predicts actions conditioned on the subgoals. Through experiments on D4RL, the authors show that this hierarchical approach can outperform the original DT, especially in tasks that involve long episodes.
arXiv [Paper]
A generative transformer-based architecture for pretraining with robot data in a self-supervised manner.
arXiv [Paper]
Focusing on offline reinforcement learning, the authors study the capabilities and limitations of return-conditioned supervised learning (RCSL). They find that RCSL requires assumptions stronger than those of dynamic programming to return optimal policies; specifically, RCSL requires nearly deterministic dynamics and properly chosen condition values. The authors conclude that RCSL alone is unlikely to be a general solution for offline RL, although it may perform well with high-quality behavior data.
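The nearly-deterministic-dynamics requirement can be seen in a tiny simulation: in a two-armed bandit with a rare high payoff, conditioning on the highest observed return makes a return-conditioned policy pick the lucky arm that has the lower expected reward. The numbers below are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
# Behavior data from a uniform policy in a two-armed bandit:
# arm 0 pays 1 with probability 0.1 (else 0); arm 1 always pays 0.5.
actions = rng.integers(0, 2, size=n)
rewards = np.where(actions == 0, (rng.random(n) < 0.1).astype(float), 0.5)

# Return-conditioned supervised learning: estimate P(action | return).
target_return = 1.0
chosen = actions[rewards == target_return]
print("P(arm 0 | return = 1) =", (chosen == 0).mean())   # 1.0: only arm 0 ever hits 1
print("expected reward of arm 0:", 0.1, " arm 1:", 0.5)
```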
OpenReview Submission to ICLR'23 [Paper]
Recurrent neural networks are often used to encode an agent's history when solving POMDP tasks. This paper proposes replacing the recurrent neural networks with transformers. Results show that transformers can solve POMDPs faster and more stably than methods based on recurrent neural networks.
Foundation Models for Deicsion Making Workshop, NeurIPS'22 [Paper]
This paper proposes an approach for learning context vectors that can be used as prompts for transformers. With these prompts, the authors develop a contextual meta transformer that leverages the prompt as task context to improve performance on unseen tasks.
MCTransformer: combining transformers and monte-carlo tree search for offline reinforcement learning
OpenReview Submission to ICLR'23 [Paper]
The authors combine transformers and MCTS for efficient online finetuning. MCTS is used as an effective approach to balance exploration and exploitation.
Pretraining the vision transformer using self-supervised methods for vision based deep reinforcement learning
OpenReview Submission to ICLR'23 [Paper]
This work replaces the CNNs used in image-based RL agents with pre-trained Vision Transformers. Interestingly, the authors found that Vision Transformers still perform similarly to or worse than CNNs.
OpenReview Submission to ICLR'23 [Paper]
OpenReview Submission to ICLR'23 [Paper]
This work applies unsupervised skill discovery to DT. The skill embedding is used as an input to the DT, which can be thought of as a hierarchical RL approach.
OpenReview Submission to ICLR'23 [Paper]
OpenReview Submission to ICLR'23 [Paper]
OpenReview Submission to ICLR'23 [Paper]
OpenReview Submission to ICLR'23 [Paper]
This work focuses on adapting DT to unseen novel tasks. An adaptation module is added to the DT, with its parameters initialized by a hyper-network. When adapting to a new task, only the parameters of the adaptation module are finetuned. The results show that adapting only this module leads to faster learning than finetuning the full model.
OpenReview Submission to ICLR'23 [Paper]
arXiv'22 [Paper]
The authors compared ViTs and CNNs in image-based DRL tasks. They found that CNNs still perform better than ViTs.
arXiv'22 [Paper]
The authors apply semi-supervised imitation learning to enable agents to learn to act by watching unlabeled online videos.
arXiv'22 [Paper]
arXiv'22 [Paper]
CoRL'21 [Paper]
IJCAI'20 [Paper]
arXiv'22 [Paper]
- Amazon Accessible RL SDK: an open-source Python package for sequential decision making with transformers.
- Stanford CS25: Decision Transformers Lecture
- Benchmark Environments
This repo is released under Apache License 2.0.