Code for "Reinforcement Learning with State Observation Costs in Action-Contingent Noiselessly Observable Markov Decision Processes" (NeurIPS 2021) by Alex Nam/Scott Fleming/Emma Brunskill https://openreview.net/pdf?id=jgze2dDL9y8
Conda packages and versions used to generate the reported results are listed in conda.yml (note that not all of the packages may be needed).
To run the known observation belief encoder:
- cartpole (the observation cost can be adjusted in the config file; a sketch of how this cost enters the environment follows this list):
  python known_obs/code/main_known.py -p with environment.config_file=cartpole_ver3.yaml
- mountain hike (the observation cost can be adjusted in the config file):
  python known_obs/code/main_known.py -p with environment.config_file=mountainHike_ver2.yaml
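The config files above only set the cost value; conceptually, the cost is charged whenever the agent chooses to observe the underlying state. The sketch below illustrates that mechanic, assuming a Gym-style environment; the wrapper name, the (action, observe) action format, and the default cost value are assumptions for illustration, not this repo's actual implementation.

```python
# Illustrative sketch only: how an observation cost can enter an ACNO-MDP-style
# environment. The actual cost used in the experiments is set in the YAML
# config files referenced above; ObsCostWrapper and obs_cost are made-up names.
import gym
import numpy as np


class ObsCostWrapper(gym.Wrapper):
    """Augment each action with an 'observe' flag; observing costs obs_cost."""

    def __init__(self, env, obs_cost=0.1):
        super().__init__(env)
        self.obs_cost = obs_cost

    def step(self, action_and_observe):
        action, observe = action_and_observe
        obs, reward, done, info = self.env.step(action)
        if observe:
            reward -= self.obs_cost      # pay to see the true state
        else:
            obs = np.zeros_like(obs)     # state stays hidden otherwise
        return obs, reward, done, info
```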
To run the default DVRL belief encoder (you also need to manually set env_id in make_env in code/envs.py; see the inline comments):
Source code: https://github.com/maximilianigl/DVRL
- cartpole (set obs_cost in custom_cartpole/envs/AdvancedCartPole.py):
  python ./code/main.py -p with environment.config_file=cartpole_ver2.yaml algorithm.use_particle_filter=True log.filename='temp/'
- mountain hike (set obs_cost in custom_mountain/envs/hike.py):
  python ./code/main.py -p with environment.config_file=mountainHike_ver3.yaml algorithm.use_particle_filter=True log.filename='temp/'
To run Sepsis with POMCP/MCTS:
POMDPy source code: https://github.com/pemami4911/POMDPy
The empirical model built from 1M random interactions is saved in "./POMDPy/examples/sepsis/model_256.obj".
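To inspect this model outside of the POMCP scripts, the sketch below shows one way to load it, assuming the .obj file is a standard Python pickle (the exact serialization format is an assumption; adjust if the repo uses something else).

```python
# Illustrative only: load the saved empirical model for inspection, assuming
# it is a standard Python pickle. The attribute layout of the object is not
# documented here, so print its type/contents to explore it.
import pickle

with open("./POMDPy/examples/sepsis/model_256.obj", "rb") as f:
    empirical_model = pickle.load(f)

print(type(empirical_model))
```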
- Observe-then-Plan (the cost can be set to any value <= 0; init_idx specifies the starting patient state). This setting assumes the transition and reward estimates are learned in advance and uses a copy of the model parameter estimates for planning. The transitions need to be unzipped from transitions.npy.zip inside the acno_mdp/POMDPy/pomdpy directory (see the sketch after this item). Set "observe_then_plan = True" in run_pomcp(self, epoch, eps, temp=None, observe_then_plan=True) at L183 of main/POMDPy/pomdpy/agent.py, then run:
  python pomcp.py --init_idx 256 --cost -0.1 --is_mdp 0
- ACNO-POMCP (observe while planning). This setting starts from a uniform initialization of the transition model and updates the transition estimate for every observed tuple (a count-based sketch follows this item). Set "observe_then_plan = False" in run_pomcp(self, epoch, eps, temp=None, observe_then_plan=True) at L183 of main/POMDPy/pomdpy/agent.py; this sets the transition model parameters to uniform over all possible next states and updates the counts of observed tuples. Then run:
  python pomcp.py --init_idx 256 --cost -0.1 --is_mdp 0
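The count-based update described above can be sketched as follows; the state/action sizes and variable names are placeholders, not the values used in the sepsis simulator.

```python
# Hedged sketch of the ACNO-POMCP transition update: start from a uniform
# transition model and replace each (s, a) row with its empirical estimate
# once (s, a, s') tuples are actually observed. Sizes below are placeholders.
import numpy as np

n_states, n_actions = 10, 4  # placeholder sizes, not the sepsis domain's

counts = np.zeros((n_states, n_actions, n_states))
transition = np.full((n_states, n_actions, n_states), 1.0 / n_states)

def update_transition(s, a, s_next):
    """Record one fully observed transition and refresh that row's estimate."""
    counts[s, a, s_next] += 1
    row = counts[s, a]
    transition[s, a] = row / row.sum()  # maximum-likelihood estimate once data exists
```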
- MCTS (always-observing POMCP, not included in the main paper):
  python pomcp.py --init_idx 256 --cost -0.05 --is_mdp 1
- To run POMCP with the true model parameters (i.e., stepping actions in the true environment instead of imagining them with the learned model parameters), modify L343 in POMDPy/examples/sepsis/sepsis.py to "_true = True" so that actions are executed in a copy of the sepsis environment (the contrast between the two modes is sketched after this item). Otherwise, use the same command as ACNO-POMCP:
  python pomcp.py --init_idx 256 --cost -0.1 --is_mdp 0
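The difference between the two modes comes down to where simulated planning steps come from. The sketch below is purely illustrative: simulate_step, the Gym-style step() signature, and the (learned_T, learned_R) arrays are assumptions, not the repo's actual interfaces.

```python
# Illustrative contrast between planning with learned model parameters and
# stepping a copy of the true environment ("_true = True" in the README).
import copy
import numpy as np

def simulate_step(state, action, learned_T, learned_R, true_env=None, use_true=False):
    if use_true and true_env is not None:
        # Assumes true_env is already positioned at `state` and exposes a
        # Gym-style step(); deep-copy it so the real episode is untouched.
        env_copy = copy.deepcopy(true_env)
        next_state, reward, done, _ = env_copy.step(action)
        return next_state, reward
    # Otherwise imagine the step using the learned transition/reward estimates.
    probs = learned_T[state, action]
    next_state = np.random.choice(len(probs), p=probs)
    return next_state, learned_R[state, action]
```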
- DRQN (a minimal sketch of the recurrent Q-network architecture follows this item)
Source code: https://github.com/Bigpig4396/PyTorch-Deep-Recurrent-Q-Learning-DRQN
python drqn.py
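A minimal PyTorch sketch of the kind of recurrent Q-network DRQN uses: an LSTM over observation embeddings feeding a Q-value head. Layer sizes, dimensions, and the class name are illustrative, not taken from drqn.py.

```python
# Minimal, illustrative DRQN-style network: the LSTM hidden state carries
# information across steps where the true state is not observed.
import torch
import torch.nn as nn


class DRQN(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, hidden=None):
        # obs_seq: (batch, time, obs_dim)
        x = torch.relu(self.encoder(obs_seq))
        x, hidden = self.lstm(x, hidden)
        return self.q_head(x), hidden


# Usage sketch: greedy action from the Q-values at the last time step.
net = DRQN(obs_dim=4, n_actions=2)
q_values, h = net(torch.zeros(1, 1, 4))
action = q_values[:, -1].argmax(dim=-1)
```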
Generating plots for the continuous domains uses the same code as DVRL; plots for sepsis can be replicated with sepsis_res/plot.py.