Code for "Reinforcement Learning with State Observation Costs in Action-Contingent Noiselessly Observable Markov Decision Processes" (NeurIPS 2021) by Alex Nam, Scott Fleming, and Emma Brunskill
The conda packages and versions used to generate the reported results are listed in conda.yml. (Note: not all of the packages may be needed.)
To run the known observation belief encoder
cartpole (the observation cost can be adjusted in the config file): python known_obs/code/ -p with environment.config_file=cartpole_ver3.yaml
mountain hike (the observation cost can be adjusted in the config file): python known_obs/code/ -p with environment.config_file=mountainHike_ver2.yaml
To run the default DVRL belief encoder (you also need to manually set env_id in code/ 'make_env'; see the inline comments)
Source code:
cartpole (obs_cost must be set in custom_cartpole/envs/ obs_cost): python ./code/ -p with environment.config_file=cartpole_ver2.yaml algorithm.use_particle_filter=True log.filename='temp/'
mountain hike (obs_cost must be set in custom_mountain/envs/ obs_cost): python ./code/ -p with environment.config_file=mountainHike_ver3.yaml algorithm.use_particle_filter=True log.filename='temp/'
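The role of obs_cost can be illustrated with a minimal sketch. This is not the repository's implementation: ToyChain, ObservationCostWrapper, and the (control, observe) step signature are hypothetical names chosen for illustration, and the sketch assumes obs_cost is a non-negative number subtracted from the reward whenever the agent chooses to observe.

```python
class ToyChain:
    """Hypothetical 1-D chain environment, used only for this illustration."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, control):
        # control: 0 moves left, 1 moves right; the state is clipped to the chain.
        move = 1 if control == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


class ObservationCostWrapper:
    """Hypothetical wrapper: each action is a (control, observe) pair.

    Assumption: observing reveals the true next state and costs obs_cost;
    not observing returns None in place of the state and costs nothing.
    """

    def __init__(self, env, obs_cost):
        self.env = env
        self.obs_cost = obs_cost

    def reset(self):
        return self.env.reset()

    def step(self, control, observe):
        state, reward, done = self.env.step(control)
        if observe:
            return state, reward - self.obs_cost, done
        return None, reward, done


env = ObservationCostWrapper(ToyChain(length=3), obs_cost=0.1)
env.reset()
blind_obs, blind_reward, _ = env.step(1, observe=False)  # state hidden, no cost
seen_obs, seen_reward, seen_done = env.step(1, observe=True)  # state shown, cost paid
```

In this sketch the underlying state always advances; only what the agent sees and what it pays depend on the observe flag.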
To run Sepsis with POMCP/MCTS
POMDPy source code:
Empirical model built from 1M random interactions is saved in "./POMDPy/examples/sepsis/model_256.obj"
- Observe-then-Plan (cost can be set to any value <= 0; init_idx specifies the starting patient state)
''' Assumes the transition and reward estimates are learned in advance, and uses a copy of the model parameter estimates for planning. The transitions need to be unzipped from inside the acno_mdp/POMDPy/pomdpy directory. '''
Set "observe_then_plan = True" in L183 run_pomcp(self, epoch, eps, temp=None, observe_then_plan=True) in main/POMDPy/pomdpy/
python --init_idx 256 --cost -0.1 --is_mdp 0
- ACNO-POMCP (observe while planning)
''' Starts with a uniform initialization of the transition model and updates the transition estimates for every observed tuple. '''
Set "observe_then_plan = False" in L183 run_pomcp(self, epoch, eps, temp=None, observe_then_plan=True) in main/POMDPy/pomdpy/ This initializes the transition model parameters to uniform over all possible next states and then updates the counts for each observed tuple.
python --init_idx 256 --cost -0.1 --is_mdp 0
- MCTS (POMCP that always observes; not included in the main paper)
python --init_idx 256 --cost -0.05 --is_mdp 1
- To run POMCP with the true model parameters (i.e., stepping actions in the true environment instead of imagining with the learned model parameters), modify L343 in POMDPy/examples/sepsis/ to "_true = True" so that the actions are executed in a copy of the sepsis environment. Otherwise, use the same command as ACNO-POMCP.
python --init_idx 256 --cost -0.1 --is_mdp 0
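The ACNO-POMCP description above (a transition model that starts uniform and is updated from observed tuple counts) can be sketched as a tabular count model. This is a minimal illustration, not the repository's code: the class name CountTransitionModel, its methods, and the pseudo-count of 1.0 per next state are all assumptions made for the example.

```python
import random


class CountTransitionModel:
    """Hypothetical tabular model of P(next_state | state, action) built from counts."""

    def __init__(self, n_states, n_actions, prior_count=1.0):
        self.n_states = n_states
        # Assumption: every next state starts with the same pseudo-count,
        # so the initial estimate is uniform over all possible next states.
        self.counts = [
            [[prior_count] * n_states for _ in range(n_actions)]
            for _ in range(n_states)
        ]

    def update(self, state, action, next_state):
        # Called only when the transition (state, action, next_state) was observed.
        self.counts[state][action][next_state] += 1.0

    def probs(self, state, action):
        row = self.counts[state][action]
        total = sum(row)
        return [count / total for count in row]

    def sample(self, state, action, rng):
        # Draw an imagined next state for planning, weighted by the counts.
        return rng.choices(range(self.n_states), weights=self.counts[state][action], k=1)[0]


model = CountTransitionModel(n_states=4, n_actions=2)
uniform = model.probs(0, 1)            # each of the 4 next states has probability 0.25
for _ in range(3):
    model.update(0, 1, 2)              # three observed transitions 0 --(action 1)--> 2
updated = model.probs(0, 1)            # next state 2 now has probability 4/7
imagined = model.sample(0, 1, random.Random(0))
```

Under this assumption, state-action pairs that are never observed keep their uniform estimate, which is why the planner's imagined rollouts sharpen only where the agent has paid to observe.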
Source code:
Plots for the continuous domains are generated with the same code as DVRL; plots for sepsis can be reproduced using sepsis_res/