In many RL problems, an informative reward function can significantly speed up learning in a sparse-reward environment, but it is often much harder to define. RL With Trajectory Feedback is an add-on for standard RL problems that uses an observer's feedback on the agent's performance to learn an informative reward function. First, we showed that an informative per-(state, action) reward function can be learned from a sequence-level reward signal, represented by labeled sequences.
In addition, we showed that with only 200 short user-labeled sequences, we could learn a reward function that makes an otherwise unsolvable sparse-reward environment converge.
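To make the idea concrete, here is a minimal sketch of how a per-(state, action) Reward Predictor could be regressed onto sequence-level labels. This is an illustration only: the network architecture, class names, and tensor shapes below are assumptions, not the project's actual implementation.

```python
# Minimal sketch (assumption, not the project's actual code): learn a
# per-(state, action) reward whose sum over a labeled window matches
# the observer's sequence-level label.
import torch
import torch.nn as nn


class RewardPredictor(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # obs: (T, obs_dim), act: (T, act_dim) -> per-step rewards of shape (T,)
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)


def rp_loss(rp, labeled_sequences):
    """labeled_sequences: list of (obs, act, label) tuples, where `label`
    is the observer's score for the whole window."""
    losses = []
    for obs, act, label in labeled_sequences:
        pred = rp(obs, act).sum()            # predicted cumulative reward
        losses.append((pred - label) ** 2)   # regress onto the sequence label
    return torch.stack(losses).mean()
```

During training, such a loss would be minimized with a standard optimizer over the collected labeled windows.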
The full project book can be found here: Project Book
- Python 3.7 (or newer)
- python3.7-tk (or newer), for GUI features (e.g. showing windows during online training)
- Install the requirements:
pip install -r requirements.txt
The main script for experiments on the rooms environment is scripts/run_rooms.py.
The environment is implemented in src/envs/random_rooms.py.
For further information on all the flags of this script, run the following command:
python scripts/run_rooms.py --help
- Train with sparse reward:
python scripts/run_rooms.py --config examples/rooms/sparse.yaml
- Train with dense reward:
python scripts/run_rooms.py --config examples/rooms/dense.yaml
- Train with Online RP training, using "Perfect User" reward:
python scripts/run_rooms.py --config examples/rooms/perfect_user.yaml
- Train with Online RP training, using "Discrete User" (with Noise) reward (both simulated users are sketched right after this list):
python scripts/run_rooms.py --config examples/rooms/discrete_user.yaml
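The "Perfect User" and "Discrete User" configurations refer to simulated observers that replace a human labeler. The sketch below shows one plausible way such labelers could behave; the exact quantization step and noise model are assumptions and are not taken from the project code.

```python
import numpy as np


def perfect_user_label(dense_rewards):
    """'Perfect User' (assumption): returns the exact accumulated dense
    reward of the labeled window."""
    return float(np.sum(dense_rewards))


def discrete_user_label(dense_rewards, step=0.5, noise_std=0.1, rng=None):
    """'Discrete User' with noise (assumption): adds Gaussian noise to the
    accumulated dense reward and rounds it to a coarse grid, imitating a
    human who can only give rough, imperfect scores."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = float(np.sum(dense_rewards)) + rng.normal(0.0, noise_std)
    return round(noisy / step) * step
```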
As described in the Project Book, one of the experiments compares a pure sparse-reward environment vs. a Reward Predictor that was trained online with the accumulated dense reward:
In the following graph we can see that the Reward Predictor converged during the agent's training:
For the manual experiment, we labeled 200 windows and trained the RP with them offline. We then loaded the pre-trained RP at the beginning of the agent's RL training and used it as the intrinsic reward. The following graphs show the success rate of this agent vs. the baseline (a sketch of how the RP is plugged in as an intrinsic reward follows the graphs):
Sparse Reward environment:
Sparse Reward Environment + "Perfect User" Online trained RP:
Sparse Reward Environment + "Discrete User" (+ noise) Online trained RP:
RL With Trajectory Feedback was designed by Uri Gadot and Hagay Michaeli.