
Toy Problem for Sequential Decision Making without sampling and planning.

  • Problem setup: The objective is to maximize the cumulative sum of rewards over a finite horizon. We assume the environment's transition dynamics are parametrized by some parameter theta; once theta is known, the environment is known exactly. The uncertainty in theta makes the problem unsuitable for optimal control methods or dynamic programming as-is. We also ban 'future-sampling' (simulating hypothetical future trajectories), so standard reinforcement learning algorithms are not allowed here. A toy instance of this setup is sketched below.
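
To make the setup concrete, here is a minimal sketch (my own illustration, not the repository's code) of a two-state, two-action environment whose transitions depend on an unknown theta; the class name ToyEnv, the reward structure, and the default horizon are assumptions made purely for illustration.

```python
import numpy as np

# Minimal illustrative environment (not the repository's code): two states,
# two actions, and transitions governed by an unknown parameter theta.
class ToyEnv:
    def __init__(self, theta, horizon=10, seed=0):
        self.theta = theta            # hidden from the agent; only a prior is known
        self.horizon = horizon
        self.rng = np.random.default_rng(seed)
        self.state = 0
        self.t = 0

    def transition_probs(self, state, action):
        # P(next_state = 1 | state, action): action 1 reaches state 1 with
        # probability theta, action 0 with probability 1 - theta.
        p = self.theta if action == 1 else 1.0 - self.theta
        return np.array([1.0 - p, p])

    def step(self, action):
        probs = self.transition_probs(self.state, action)
        next_state = int(self.rng.choice(2, p=probs))
        reward = 1.0 if next_state == 1 else 0.0   # reward for reaching state 1
        self.state, self.t = next_state, self.t + 1
        done = self.t >= self.horizon
        return next_state, reward, done

# Objective: maximize the expected cumulative reward over the horizon,
# without ever simulating ('sampling') hypothetical futures inside the agent.
```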

11/12/2019

  • Dynamic programming (DP) is not applicable since we are not allowed to sample the future.
  • It is not obvious how to derive an analogue of the Bellman equation as in DP or reinforcement learning (RL).
  • To visualize the histogram of rewards for different policies (a sketch follows this list).
  • To study the behavior as the horizon T increases, and the case where theta is deterministic.
  • Visualize the transition between exploration and exploitation: this might point the way to implementing or quantifying the exploration-exploitation trade-off, and eventually to turning it into an objective function.
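
A sketch of the reward-histogram item above, reusing the ToyEnv class from the earlier sketch. The Monte Carlo rollouts here are used only to visualize realized returns of fixed policies, not as part of the decision rule (which is where future-sampling is banned); all function and variable names are illustrative assumptions.

```python
import matplotlib.pyplot as plt

def episode_returns(make_env, policy, n_episodes=200):
    """Cumulative rewards of a fixed (state -> action) policy over many episodes."""
    returns = []
    for ep in range(n_episodes):
        env = make_env(seed=ep)
        state, done, total = 0, False, 0.0
        while not done:
            state, reward, done = env.step(policy[state])
            total += reward
        returns.append(total)
    return returns

# Histogram of returns for every deterministic state -> action policy.
theta_true = 0.7
make_env = lambda seed: ToyEnv(theta=theta_true, horizon=10, seed=seed)
for policy in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    plt.hist(episode_returns(make_env, policy), bins=11, alpha=0.5,
             label=f"policy {policy}")
plt.xlabel("cumulative reward")
plt.ylabel("episodes")
plt.legend()
plt.show()
```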

11/27/2019

  • Brute-force search for the optimal policies.
  • Given a prior on theta, update the posterior belief over theta by Bayes' rule.
  • At some point, the optimal policy starts exploiting more and exploring less!
  • Write an equation analogous to the Bellman equation in dynamic programming.
  • Assuming the equation holds, solve it by DP.
  • The optimal policies obtained by brute-force search and by DP agree (a sketch of these steps follows this list).
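
Below is a minimal sketch of the Bayes-update and brute-force steps, assuming the ToyEnv environment above and a Beta prior over theta. The Beta-Bernoulli belief, the function names, and the restriction to fixed state-to-action policies are illustrative assumptions, not the repository's implementation (the repository's brute-force presumably searches over history- or belief-dependent policies, which is where the exploration-exploitation trade-off appears). The expected return is computed by exhaustive enumeration of trajectories under the posterior-predictive dynamics, so no future is sampled.

```python
from itertools import product

# Illustrative sketch (not the repository's code): a Beta(alpha, beta) belief
# over theta, updated by Bayes' rule, plus brute-force policy evaluation by
# exact enumeration of trajectories (no sampling of the future).

def posterior_update(alpha, beta, action, next_state):
    """Bayes' rule for the Beta-Bernoulli belief over theta in ToyEnv."""
    # Under action 1, P(next_state = 1) = theta; under action 0 it is 1 - theta,
    # so the 'success' event of the Bernoulli(theta) draw flips with the action.
    success = (next_state == 1) if action == 1 else (next_state == 0)
    return (alpha + 1, beta) if success else (alpha, beta + 1)

def expected_return(policy, alpha0=1.0, beta0=1.0, horizon=5):
    """Exact expected cumulative reward of a fixed (state -> action) policy,
    averaged over the Beta prior on theta."""
    def recurse(state, alpha, beta, t):
        if t == horizon:
            return 0.0
        action = policy[state]
        mean_theta = alpha / (alpha + beta)        # posterior-predictive P(success)
        p1 = mean_theta if action == 1 else 1.0 - mean_theta
        value = 0.0
        for next_state, prob in ((1, p1), (0, 1.0 - p1)):
            a, b = posterior_update(alpha, beta, action, next_state)
            reward = 1.0 if next_state == 1 else 0.0
            value += prob * (reward + recurse(next_state, a, b, t + 1))
        return value
    return recurse(0, alpha0, beta0, 0)

# Brute-force search over all deterministic state -> action policies.
best = max(product([0, 1], repeat=2), key=expected_return)
print("best fixed policy:", best, "value:", expected_return(best))
```

As for the Bellman-like equation, one natural candidate (again an assumption here, not necessarily the repository's formulation) is a recursion over the pair of physical state s and belief b,

V_t(s, b) = max_a Σ_{s'} P(s' | s, a, b) [ r(s, a, s') + V_{t+1}(s', b') ],

where P(· | s, a, b) is the posterior-predictive transition under belief b and b' is the belief after the Bayes update on observing (s, a, s'). Solving this recursion backward in time and checking that the resulting policy matches the brute-force one would correspond to 'the optimal policies agree'.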
