MDP #171
Wow, this is great!! I have to spend some time with the code, but the implementation looks outstanding. The notebooks are very helpful too. This is a really big contribution to open source economics.
Great stuff! A few points on the "modified policy iteration" that may or may not be helpful (hopefully they are). I caution that they are all based on my experience with a slightly different implementation (instead of just one Q, my own codes differentiate between endogenous and exogenous states).

From a speed perspective (and you may already do this; I was not sure from the pseudocode and don't write Python myself, so didn't want to go digging): once you calculate the policy indexes sigma, rather than calculating r(sigma) at each of your k iterations in step 3, you can calculate it once as a matrix at the beginning of step 3 and then just reuse it for each of the k iterations. In my experience this provides a non-trivial speed boost.

Using an algorithm similar to, but different from, your current algorithm, it is possible to contaminate everything with -Infs (if your final true v has some -Infs and your initial v0 doesn't). This has happened to me in practice. It is easily avoided by also checking that span(u - v) is finite before you do step 4. I'm not certain this will happen with your implementation, but it might be worth checking.

Another thing I would mention, based on my experience, is an alternative implementation of your modified policy iteration algorithm. While value function iteration is guaranteed to converge if you set iter_max high enough (thanks to the contraction mapping property), this is not true of modified policy iteration if your k is big relative to the grid size (of S and A); this is something I have found by experience. It might therefore be worth adding an internal option so that, after a certain number of tries, if convergence has not yet been reached, the modified policy iteration algorithm automatically switches over to value function iteration (e.g. you might run up to iter_max/2 with modified policy iteration, and from there to iter_max with value function iteration). That way the user can be certain that if they just keep increasing iter_max they will eventually get convergence.

(P.S. I haven't contributed anything to quantecon, but I "watch", and this is my first comment here.)
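A rough NumPy sketch of these three suggestions taken together; the array names, shapes, and overall structure here are illustrative assumptions, not the PR's actual implementation:

```python
import numpy as np

def mpi_sketch(R, Q, beta, v_init, epsilon=1e-8, iter_max=1000, k=20):
    """Modified policy iteration, illustrative only.

    R : (n, m) rewards; Q : (n, m, n) transition probabilities.
    These shapes and names are hypothetical, not the PR's.
    """
    n, m = R.shape
    v = v_init.astype(float).copy()
    sigma = np.zeros(n, dtype=int)
    tol = epsilon * (1 - beta) / beta

    for i in range(iter_max):
        # After half the iteration budget, fall back to plain value function
        # iteration (k_i = 0), which the contraction property guarantees to converge.
        k_i = k if i < iter_max // 2 else 0

        vals = R + beta * (Q @ v)           # one-step lookahead, shape (n, m)
        sigma = vals.argmax(axis=1)         # greedy policy indexes
        u = vals[np.arange(n), sigma]       # u = T v

        diff = u - v
        span = diff.max() - diff.min()
        # Only trust the stopping rule (and the partial evaluation) when the
        # span is finite; -inf rewards can otherwise contaminate everything.
        if np.isfinite(span) and span < tol:
            return u, sigma

        # Compute r(sigma) and Q(sigma) once, then reuse them for all
        # k_i partial-evaluation iterations.
        r_sigma = R[np.arange(n), sigma]
        Q_sigma = Q[np.arange(n), sigma]    # shape (n, n)
        v = u
        if np.isfinite(span):
            for _ in range(k_i):
                v = r_sigma + beta * (Q_sigma @ v)

    return v, sigma
```

With this structure, increasing `iter_max` eventually reduces the algorithm to pure value function iteration, so the contraction argument applies.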
Dear @robertdkirkby: Thank you for your comments, and welcome to quantecon!
Let me explain what I do in the code.
This can happen when the input reward is such that, for some state, the reward is -inf for every action.
Thank you for reminding me of this; I had been forgetting that the proof of convergence of modified policy iteration needs an additional assumption.
I chose a default value for now. Any suggestions are welcome for this and for the other methods as well.
Thanks @oyamad
Thanks @robertdkirkby. The "product formulation" assumes that constraints are expressed by setting the reward to -inf for infeasible actions. I am afraid it would be problematic if we had some state at which every action has reward -inf.

This is not very difficult: it is similar to the proof of convergence of policy iteration, starting from the corresponding inequality for the iterates.
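For reference, the two textbook properties that such convergence arguments usually rest on (stated here in general terms, not as a claim about the exact argument above) are monotonicity and discounting of the Bellman operators:

```latex
\text{Monotonicity: } v \le w \ \Rightarrow\ T_\sigma v \le T_\sigma w
  \ \text{ and }\ T v \le T w ,
\qquad
\text{Discounting: } \lVert T v - T w \rVert_\infty \le \beta\, \lVert v - w \rVert_\infty .
```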
The description of the algorithms in the notebook was misleading on this point.
I did a rebase and force push.
As another demo, I used one more example.
Following the suggestion from @robertdkirkby, I added a method that checks that every state has at least one action with finite reward. It also checks, in the case of the state-action pairs formulation, that for every state there is at least one available action.
Checks:
- every state has at least one action with finite reward
- for the sa-pair formulation, every state has at least one available action
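A minimal sketch of checks along these lines; the function name, argument names, and array shapes are assumptions for illustration, not the PR's API:

```python
import numpy as np

def check_feasibility(R, s_indices=None, num_states=None):
    """Illustrative only; names and shapes are hypothetical."""
    if s_indices is None:
        # Product formulation: R has shape (n, m), with -inf marking
        # infeasible actions.  Every state needs one finite-reward action.
        if not (R > -np.inf).any(axis=1).all():
            raise ValueError("some state has no action with finite reward")
    else:
        # sa-pair formulation: R has shape (L,), with s_indices giving the
        # state of each pair.  Every state must appear at least once ...
        counts = np.bincount(s_indices, minlength=num_states)
        if not counts.all():
            raise ValueError("some state has no available action")
        # ... and at least one of its pairs must have a finite reward.
        finite_counts = np.bincount(s_indices, weights=(R > -np.inf),
                                    minlength=num_states)
        if not (finite_counts > 0).all():
            raise ValueError("some state has no action with finite reward")
```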
Added sparse matrix support, following the sparse matrix support in `MarkovChain`.
Following a discussion with @jstac, I renamed the class.
Everything looks great. Thanks again! I'll merge now and start writing the lecture.
I finally finished coding the Markov Decision Process (MDP) class. There are a couple of issues to discuss.
An `MDP` instance can be created in two ways, in which the transition matrix is indexed either by the product set S × A (the "product formulation") or by the set of feasible state-action pairs (the "sa-pair formulation"), where S is the state space, A the action space, and SA the set of feasible state-action pairs. The former should be straightforward; the latter requires additional information about the correspondence between the indices of SA and the actual state-action pairs. See the following example:
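(Purely for illustration: the reward and transition values below are made up, and the constructor calls are commented out because the exact signature is an assumption, not the PR's actual API.)

```python
import numpy as np
# from quantecon import MDP   # hypothetical import

beta = 0.95

# Product formulation: R is (n, m) = (|S|, |A|), Q is (n, m, n);
# the infeasible pair (s, a) = (1, 1) gets reward -inf.
R = np.array([[5.0, 10.0],
              [-1.0, -np.inf]])
Q = np.empty((2, 2, 2))
Q[0, 0] = [0.5, 0.5]
Q[0, 1] = [0.0, 1.0]
Q[1, 0] = [0.0, 1.0]
Q[1, 1] = [0.5, 0.5]   # arbitrary; never chosen, since its reward is -inf
# mdp = MDP(R, Q, beta)

# State-action pairs formulation: arrays of length L = |SA| = 3, with the
# correspondence to (s, a) pairs given by s_indices and a_indices.
R_sa = np.array([5.0, 10.0, -1.0])
Q_sa = np.array([[0.5, 0.5],
                 [0.0, 1.0],
                 [0.0, 1.0]])
s_indices = np.array([0, 0, 1])
a_indices = np.array([0, 1, 0])
# mdp = MDP(R_sa, Q_sa, beta, s_indices=s_indices, a_indices=a_indices)
```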
Here, S = {0, 1}, A = {0, 1}, and SA = {(0, 0), (0, 1), (1, 0)} (action 1 is infeasible at state 1). The correspondence is represented by `s_indices` and `a_indices`. I wonder whether this is an "optimal" implementation strategy.

Complications are basically kept in `__init__` (so it is very long), to keep the other methods simple and more or less intuitive. Any suggestions for improvement, in particular on variable/method/function names (and of course on other things too), will be appreciated.
I tried to follow the ideas in ENH: make calls to `compute_fixed_point` allocate less memory #40 and ENH: moved the lucas_tree tuple to a new LucasTree class #41. I wonder if I did it right. The method `operator_iteration` may be replaced with `compute_fixed_point`, whereas the latter does not exactly match here.
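For context, a minimal sketch of how `compute_fixed_point` is typically called, with a stand-in Bellman operator (the operator and the arrays here are illustrative, not the PR's code):

```python
import numpy as np
from quantecon import compute_fixed_point

# Stand-in data for the product formulation (hypothetical values).
R = np.array([[5.0, 10.0], [-1.0, -np.inf]])
Q = np.full((2, 2, 2), 0.5)
beta = 0.95

def T(v):
    # Bellman operator: one-step lookahead followed by a max over actions.
    return (R + beta * (Q @ v)).max(axis=1)

v_star = compute_fixed_point(T, np.zeros(2), error_tol=1e-5, max_iter=500)
```

One apparent mismatch is that `compute_fixed_point` returns only the value iterate, while the MDP solution methods also need to track the policy.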
Results of `solve` are stored in `MDPSolveResult`. Is there other important information to store there?

The policy iteration algorithm is usually initialized with a policy function, while the `policy_iteration` method here requires an initial value function, to be consistent with the other solution methods.
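One standard way to reconcile the two, sketched here for the product formulation with hypothetical array shapes (not necessarily what the method does internally), is to start from the policy that is greedy with respect to the initial value function:

```python
import numpy as np

def greedy_policy(R, Q, beta, v):
    # R: (n, m) rewards, Q: (n, m, n) transition probabilities (assumed shapes).
    # One-step lookahead value of each (state, action) pair, then argmax over actions.
    return (R + beta * (Q @ v)).argmax(axis=1)

# Policy iteration can then start from sigma_0 = greedy_policy(R, Q, beta, v_init).
```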
It deals only with the infinite horizon; `beta=1` is not accepted. (Do we want a finite horizon algorithm?)

I also added `random_mdp`. I hope this will be useful for performance tests.

I also added some notebooks on `MDP` to QuantEcon.site. I have some more examples here.