Invalid Action Mask [WIP] #453
stable_baselines/acer/acer_simple.py
```python
if states is not None:
    td_map[self.train_model.states_ph] = states
    td_map[self.train_model.dones_ph] = masks
    td_map[self.polyak_model.states_ph] = states
    td_map[self.polyak_model.dones_ph] = masks
td_map[self.train_model.action_mask_ph] = action_masks
td_map[self.polyak_model.action_mask_ph] = action_masks
```
Not a fan of accessing the tensor like this, but I didn't see another way to do it cleanly besides making a member variable for it, which duplicates data.
```diff
@@ -344,12 +352,17 @@ def __init__(self, nvec, flat):
         """
         self.flat = flat
         self.categoricals = list(map(CategoricalProbabilityDistribution, tf.split(flat, nvec, axis=-1)))
+        self.action_mask_vector = action_mask_vector

     def flatparam(self):
         return self.flat

     def mode(self):
```
I decided to make the action mask a "layer" in the policy, as opposed to doing it in the distribution, which caused those particular tests to fail since they don't invoke a policy. I like this way a lot more, as it's much cleaner IMO.
yes, sounds good
```diff
     def flatparam(self):
         return self.flat

     def mode(self):
-        return tf.cast(tf.stack([p.mode() for p in self.categoricals], axis=-1), tf.int32)
+        modes = []
+        for idx, p in enumerate(self.categoricals):
```
So, I do element-wise multiplication instead of boolean_mask because I don't exactly understand how tf.boolean_mask applies to a batch of actions.
The way I originally had it was tf.boolean_mask(self.logits, self.action_mask_vector[0]), which worked for my env, but I didn't know how it behaved with other envs.
This approach will work if you ReLU the output instead of tanh. If you tanh, and all your valid actions have negative values, the element-wise multiplication will result in an invalid action being selected, because 0 > {all negative-valued valid actions}.
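For illustration, a minimal NumPy sketch of that failure mode (the numbers are made up): a multiplicative 0/1 mask breaks as soon as all the valid logits are negative, while adding a large negative constant to the invalid entries does not.

```python
import numpy as np

logits = np.array([-0.5, -1.2, -0.8, 0.3])   # e.g. tanh output; only actions 0 and 1 are valid
mask = np.array([1.0, 1.0, 0.0, 0.0])

# Multiplicative masking: invalid logits become 0, which is larger than every
# valid (negative) logit, so argmax picks an invalid action.
print(np.argmax(logits * mask))                # -> 2 (invalid)

# Additive masking: invalid logits are pushed far below any valid one.
print(np.argmax(logits + (mask - 1.0) * 1e8))  # -> 0 (valid)
```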
I have to think a bit more for that one...
stable_baselines/common/policies.py
```python
                                          name="action_mask_ph")
elif isinstance(ac_space, spaces.Box):
    self._action_mask_ph = tf.placeholder(dtype=tf.float32, shape=(n_batch, ac_space.shape[0]),
                                          name="action_mask_ph")
```
Pretty self-documenting: different action spaces call for differently dimensioned action masks.
```python
def action_mask_ph(self):
    """tf.Tensor: placeholder for valid actions, shape (self.n_batch, self.ac_space.n)"""
    return self._action_mask_ph
```
Following suit with the rest of the class.
stable_baselines/common/policies.py
```diff
@@ -751,3 +786,23 @@ def register_policy(name, policy):
         raise ValueError("Error: the name {} is alreay registered for a different policy, will not override."
                          .format(name))
     _policy_registry[sub_class][name] = policy
+
+
+def create_dummy_action_mask(ac_space, num_samples):
```
Utility functions that are used throughout to create / reshape the action_mask vectors. Used here, and in the various algorithms' _train_steps.
A number of tests are still failing, and I haven't tested where to apply the mask w.r.t. the softmax for the KL divergence, but this ensures everyone in the discussion is on the same page and knows the complete picture. A lot of my slow progress is due to my apprehension about committing to changes in the dark, without feedback.
Looking back at the problem, it seems very similar to the one encountered in NLP with the "Transformer" module (cf. this good blog). Apparently, something slightly different is done there: …
PS: the Travis failure is due to a …
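For reference, the Transformer-style trick referred to above can be sketched in TF 1.x as follows (the variable names here are illustrative, not the PR's): instead of zeroing out the logits, a very large negative value is added to the masked positions before the softmax, so the corresponding probabilities collapse to (near) zero.

```python
import tensorflow as tf  # TF 1.x API


def softmax_with_mask(logits, action_mask, big_neg=-1e9):
    """logits and action_mask both have shape (n_batch, n_actions); mask is 1 for valid, 0 for invalid."""
    masked_logits = logits + (1.0 - action_mask) * big_neg
    return tf.nn.softmax(masked_logits)
```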
stable_baselines/a2c/a2c.py
```diff
@@ -170,7 +171,7 @@ def setup_model(self):

                 self.summary = tf.summary.merge_all()

-    def _train_step(self, obs, states, rewards, masks, actions, values, update, writer=None):
+    def _train_step(self, obs, states, rewards, masks, actions, values, update, writer=None, action_masks=None):
```
minor: missing entry in the docstring
stable_baselines/a2c/a2c.py
```python
if states is not None:
    td_map[self.train_model.states_ph] = states
    td_map[self.train_model.dones_ph] = masks

if action_masks is None:
    action_masks = create_dummy_action_mask(self.train_model.ac_space, states.shape[0] * self.n_steps)
```
Why not use a placeholder with a default value here?
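For context, this is roughly what a placeholder with a default value looks like in TF 1.x; if the mask is never fed, an all-ones ("everything valid") default is used and broadcast against the batch. The shape and names below are illustrative.

```python
import tensorflow as tf  # TF 1.x API

n_actions = 6  # e.g. a hypothetical Discrete(6) action space

# When this placeholder is not fed, the (1, n_actions) default broadcasts
# against the (n_batch, n_actions) logits in element-wise ops.
default_mask = tf.ones(shape=(1, n_actions), dtype=tf.float32)
action_mask_ph = tf.placeholder_with_default(default_mask,
                                             shape=(None, n_actions),
                                             name="action_mask_ph")
```

With something like this, _train_step could skip building a dummy mask when action_masks is None.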
stable_baselines/a2c/a2c.py
```diff
@@ -319,9 +330,10 @@ def run(self):
         """
         mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [], [], [], [], []
         mb_states = self.states
-        ep_infos = []
+        ep_infos = [[]]
```
Why change ep_infos instead of using a new variable mb_action_masks?
I guess I tunnel-visioned on the fact that the action_mask was in info. I'll make that change, which will clean up some stuff after runner.run and before _train_step.
stable_baselines/a2c/utils.py
```diff
@@ -158,6 +158,11 @@ def linear(input_tensor, scope, n_hidden, *, init_scale=1.0, init_bias=0.0):
         return tf.matmul(input_tensor, weight) + bias


+def action_mask(input_tensor, mask_tensor, scope):
```
I wouldn't call that function action_mask, because this is a bit confusing; maybe apply_mask?
- missing docstring
stable_baselines/common/policies.py
```python
    action_mask = np.reshape(action_mask, (num_samples, ac_space.nvec[0], ac_space.nvec[1]), dtype=np.float32)
elif isinstance(ac_space, spaces.Discrete) or isinstance(ac_space, spaces.MultiBinary):
    action_mask = np.reshape(action_mask, (num_samples, ac_space.n), dtype=np.float32)
elif isinstance(ac_space, spaces.Box):
```
Does masking make sense for continuous actions, or is it only to avoid errors?
It could make sense. Say the agent is controlling an n-dimensional robotic arm; this could be used to prevent collisions with itself.
But yes, it is mainly to prevent errors, as the step methods now require action_mask to be a variable in the graph. I figured, why not allow an action mask for continuous spaces; it doesn't add any additional overhead.
> Say the agent is controlling an n-dimensional robotic arm

Setting the action to zero does not prevent the arm from moving... (if the output is the position of the arm)

> it is to prevent errors as the step methods now require action_mask to be a variable in the graph

Maybe this can be solved using a placeholder with a default value (so you don't have to feed the graph with a value for that tensor).
I removed the … I am still trying to figure out how to replace the elements in input_tensor that correspond to a zero in action_mask. I'm sure there is a way to do this with …
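One possible way to do that replacement (a sketch, not necessarily what ended up in the PR) is tf.where with a tensor of large negative values:

```python
import tensorflow as tf  # TF 1.x API


def apply_mask(input_tensor, mask_tensor, big_neg=-1e8):
    """Replace entries of input_tensor whose mask is 0 with a large negative value."""
    neg_fill = big_neg * tf.ones_like(input_tensor)  # same shape and dtype as the input
    return tf.where(tf.equal(mask_tensor, 0.0), neg_fill, input_tensor)
```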
OK, I think there was a misunderstanding: the Transformer was just a remark, we can continue with the normal masking idea.
There is one stray distribution test that is failing, which is caused by explicit usage of the action_mask vector: …
Is logits the output of …?
I currently use the PPO2 algorithm.
Have your environment return a 'valid_actions' array of 1s and 0s in the info dict in the step method.
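As a usage sketch (the environment and masking rule below are made up purely for illustration), the contract looks roughly like this:

```python
import gym
import numpy as np
from gym import spaces


class MaskedEnv(gym.Env):
    """Toy env with 4 discrete actions; action 3 is only valid on even steps."""

    def __init__(self):
        self.action_space = spaces.Discrete(4)
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.observation_space.sample()

    def step(self, action):
        self.t += 1
        valid_actions = np.ones(self.action_space.n, dtype=np.float32)
        if self.t % 2:                      # mask out action 3 on odd steps
            valid_actions[3] = 0.0
        obs = self.observation_space.sample()
        reward, done = 0.0, self.t >= 100
        return obs, reward, done, {"valid_actions": valid_actions}
```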
Hi @H-Park,
Yes, that is correct.
My action_space is gym.spaces.MultiDiscrete([4]) and my policy network is MlpLnLstmPolicy.
I removed ac_space.nvec[1], but I found that there are still a lot of invalid actions.
Try it now; I confirmed it works for PPO, A2C and DQN.
Invalid actions still appear in my environment. 😭
I wrote a test environment; please try it out.
I'm looking into this now. The reason the out-of-bounds error is happening is that you are using MultiDiscrete with only 1 dimension, which technically is valid. I only handled the case where dim = 2.
The problem is as expected: your valid action values are negative, meaning that when you apply the mask, the invalid actions get set to 0, and 0 > all negative numbers. This is not a problem on your side. I am working on fixing it.
This runs, but diverges quickly because of the neglogp calculation involving -np.inf. I won't have time to figure out a way around that until Sunday.
I get the following error: …
Added support for MultiDiscrete action spaces of any dimension. The models still diverge within a few hundred iterations, though. Thank you for being my tester! I greatly appreciate it.
(which then causes invalid actions to be selected, because all values, valid or not, are NaN/inf)
I am also very grateful to you; I am reading your code and learning from it.
Please let me know if I can improve on anything, or if things don't make sense. I'm on holiday right now. If you have thoughts on how to solve the divergence issue, I'm all ears. I'm going to investigate it in the morning (about 10 hours from now).
I was thinking that if the action mask were added to the observation, it might solve the problem of neglogp blowing up to infinity; in theory, the agent would learn how to read the action mask. Applying the mask to neglogp would only compute the probability distance over our allowed actions, and the masked actions would not be considered.
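To make that concrete, here is a small NumPy sketch (made-up numbers) of how the neglogp of a valid action changes once the probability mass is renormalized over the allowed actions only:

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


logits = np.array([1.0, 0.5, -0.2, 2.0])
mask = np.array([1.0, 1.0, 0.0, 1.0])     # action 2 is invalid
action = 0

neglogp_full = -np.log(softmax(logits)[action])                          # over all actions
neglogp_masked = -np.log(softmax(logits + (mask - 1.0) * 1e8)[action])   # over valid actions only

# The masked neglogp is smaller: the invalid action's probability mass is
# redistributed over the valid actions.
print(neglogp_full, neglogp_masked)
```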
I was thinking that too. If the masking is done explicitly in the policy (#453 (comment)) - which I was doing at iteration 1.5 of this project - these problems might evaporate. With masking in the distributions for neglogp, the KL divergence will be a different value, and in practice will be (much) smaller, meaning that PPO updates the policy more liberally, because the loss will be greater, per: …

Though, I don't believe there is a guarantee that masking in the policy will in fact be our saving grace. I do believe it will at the very least reduce the learning discrepancies between masking and no masking.

Note: the most definitive way to test this would be to have an environment that knows the worst move at every given state, and run two experiments: no masking, and masking the worst move. Does anyone have such an environment, or can think of one?

Quick neglogp test regarding masking: https://gist.github.com/H-Park/b82f23ce8bc31716bff46513ec4ee94c
What do you think of this environment? I wrote a simple mouse that walks a maze. I have not prepared a README in English, but you can use your browser's translation. If I remember correctly, PPO2 uses the clipping method?
Hi guys, did we come to a solution for this neglogp issue? I have been using the invalid action masks in my application and it seems to train well, but it goes into an infinite loop during prediction because it predicts an invalid action, even though the action mask is correctly set to avoid that action. I checked the neglogp value and it is …
I don't have much to add, but I agree that using a hard action mask like this means that if the predicted action is actually invalid, it sometimes leads to unexpected behaviour, especially if the action entropy is low/zero (i.e. one action practically has all the probability of being chosen). I have had success with adding the action mask to the observation space instead, which, while not a hard mask, leads to much more stable training. During evaluation of the agent, I use my own policy where I sample actions using the action probabilities and just select the first valid action. While I suppose this PR should practically do the same thing, I had problems using it (maybe because of only a partial action mask).
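A rough NumPy sketch of that evaluation-time fallback (names are illustrative): walk through the actions in order of decreasing predicted probability and return the first one the mask allows.

```python
import numpy as np


def first_valid_action(action_probs, valid_mask):
    """Return the most probable action that the mask marks as valid (1 = valid)."""
    for action in np.argsort(action_probs)[::-1]:   # descending probability
        if valid_mask[action] > 0:
            return action
    raise ValueError("no valid action available")


probs = np.array([0.5, 0.3, 0.15, 0.05])
mask = np.array([0, 1, 1, 0])            # the most probable action (0) is invalid
print(first_valid_action(probs, mask))   # -> 1
```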
@kudhru That occurs when the policy diverges. Honestly, I don't know how to avoid this. I have been comparing two masking techniques:
1. Masking in the policy by providing the action mask as an input in the graph. This makes the KL divergence only really consider the valid actions (as the invalid ones are at -100, compared to [-1, 1] for valid actions).
2. The method currently done in this PR: masking after the fact in the distributions.py methods. This does not alter the policy's logits, but samples according to the action mask.
Both approaches cause the policy to diverge every so often, so I am not sure where to go from here. It might be better to simply cut our losses and not use PPO for problems that require action masking. My senior capstone this spring and summer is exploring action masking, so hopefully I will have some insights into all of this by the end.
Thanks for the insight, @H-Park!
What do you think I should use instead of PPO? I mean, which algorithms are less probable to diverge (with action mask) as compared to PPO?
Thanks
Dhruv
The working theory here is that the policy diverging is caused by the discrepancy between the raw policy output (logits), the computed neglogp (which in turn influences the KL divergence, and then the policy update), and the action that is actually sampled. A possible short-term solution for you might be to use something simpler (i.e. no KL term), such as deep Q-learning; however, I do not remember the state I left DQN in for this PR.
I took an internship offer where I'll be continuing this work; however, it will be closed source. I'm unsure what the Stable Baselines team wants to do with this PR, as it is way too experimental to be merged. If I flesh out the docs on my fork, would you add a link to it in your docs, and we can archive this PR?
Sounds good to me (a link from the stable-baselines docs to your fork, with a link to this PR as well, since there was a good bunch of fruitful discussion here). At this point it is probably easier to add such a link as a separate PR, and just close this one at that point. With v3 of stable-baselines lurking around, we are not including more features in this one anyway :). Best of luck with your work! Looking forward to seeing what the correct / less-wrong way of doing this is!
Hi all, I hope our paper A Closer Look at Invalid Action Masking in Policy Gradient Algorithms (https://arxiv.org/abs/2006.14171) will bring additional insight into the effect of invalid action masking and how it should be implemented :) And here is the corresponding blog post: https://costa.sh/blog-a-closer-look-at-invalid-action-masking-in-policy-gradient-algorithms.html
I saw the blog post on /r/reinforcementlearning earlier today. For my senior capstone and at work, I've been using a tuple observation space, with one of the observations being the mask; the policy then explicitly uses it as a gate, so the invalid actions are actually invalid from a gradient standpoint: https://github.com/H-Park/yarll/blob/master/test/test_policies.py#L100 I've been meaning to discuss with @Miffyli whether it'd be wanted to add this feature to the rewrite.
Are there any new developments on how to handle entropy drops with this PR or the rewrite? For my environment, at some point the entropy drops to …
@Bensk1 Your best bet would be to use SB3 with a custom policy that masks by passing the action mask in as part of the input, in a tuple or dictionary observation. However, I don't know the current state of SB3 and if: …
@Miffyli ^ Because SB3 is on track to support (if they don't already) …
For the …
@H-Park …
I realize this doesn't answer your exact question - but the reality is that there is no literature that explores and definitively compares masking methods from a policy-update standpoint (to my knowledge at least, though I haven't looked in about a year's time). That is the largest reason this PR was never completed: however we decided to compute entropy had no mathematical or experimental backing.
@Bensk1 If you are looking for some reference implementation to experiment with SB3, feel free to check out our code at ppo_partial_mask.py; we benchmark this script on some of our microrts environments here: Gym-microrts V1 Benchmark. For simplicity we just put the action mask in the env's attribute instead of using a …

```python
import gym
import gym_microrts

env = gym.make("MicrortsMining-v1")
/home/costa/Documents/work/go/src/github.com/vwxyzjn/gym-microrts/microrts/maps/10x10/basesWorkers10x10.xml
[Lai.rewardfunction.RewardFunctionInterface;@1ed6993a

env.observation_space
Out[4]: Box(0, 1, (10, 10, 27), int32)
env.action_space
Out[5]: MultiDiscrete([100   6   4   4   4   4   7 100])

obs = env.reset()
# you need to add the action mask to the env's attribute that has the same shape
# as the flattened action space
env.action_mask.shape
Out[7]: (229,)
sum(env.action_space.nvec)
Out[8]: 229
```

Also a side note: I am not sure it is the best idea to put the action mask inside of the … So we implemented this full action mask with the following code, basically calculating the full action mask via …:

```python
# https://github.com/vwxyzjn/gym-microrts/blob/146e6b9255a9799abe4b85889264ba6e6a791b58/experiments/ppo.py#L231-L244
# 1. select source unit based on source unit mask
source_unit_mask = torch.Tensor(np.array(envs.env_method("get_unit_location_mask", player=1)))
multi_categoricals = [CategoricalMasked(logits=split_logits[0], masks=source_unit_mask)]
action_components = [multi_categoricals[0].sample()]
# 2. select action type and parameter section based on the
# source-unit mask of action type and parameters
source_unit_action_mask = torch.Tensor(
    [envs.env_method("get_unit_action_mask", unit=action_components[0][i], player=1, indices=i)[0]
     for i in range(envs.num_envs)])
split_suam = torch.split(source_unit_action_mask, envs.action_space.nvec.tolist()[1:], dim=1)
multi_categoricals = multi_categoricals + [CategoricalMasked(logits=logits, masks=iam)
                                           for (logits, iam) in zip(split_logits[1:], split_suam)]
invalid_action_masks = torch.cat((source_unit_mask, source_unit_action_mask), 1)
action = torch.stack([categorical.sample() for categorical in multi_categoricals])
```

We actually found this full action mask to be very powerful. Using only a partial action mask, it is very sample-inefficient to learn anything, and we had to use frame-skipping to speed up learning. But using the full action mask, the trained agents can actually perform well against existing bots in the full game, as shown here: …

Maybe in SB3 we should consider something similar to support this kind of conditioned action mask?
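For readers unfamiliar with the CategoricalMasked used above: it is, roughly, a Categorical distribution whose invalid logits are replaced with a large negative value before the distribution is built. The sketch below captures that idea under that assumption (it is not a verbatim copy of the linked repo's class); a fuller implementation would typically also adjust the entropy computation so that masked actions do not contribute.

```python
import torch
from torch.distributions.categorical import Categorical


class CategoricalMasked(Categorical):
    """Categorical distribution that gives ~zero probability to masked-out actions."""

    def __init__(self, logits, masks, mask_value=-1e8):
        self.masks = masks.bool()
        # Push the logits of invalid actions to a very large negative value so
        # they are (almost) never sampled and contribute ~0 probability.
        masked_logits = torch.where(self.masks, logits,
                                    torch.full_like(logits, mask_value))
        super().__init__(logits=masked_logits)
```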
@H-Park after reading through this thread, I found this paper: https://arxiv.org/abs/2006.14171. It empirically demonstrates the dominance of, and theoretically justifies, invalid action masking over invalid action penalties in terms of scaling to larger action spaces with an increasing number of invalid actions. Do you have any plans to continue work on this pull request in view of these findings? I'm highly interested because I want to explore RL for card / board games, where you typically have only a fraction of the actions available to each agent (e.g. the cards in hand).
@nmevenkamp Thanks for the link, nice find!!!
@vwxyzjn I guess I didn't see your edit, nice work!
@Miffyli Given this publication, would an implementation PR of it get merged into SB2, or should it go into SB3?
SB2 now mainly takes bug fixes and small enhancements; bigger ones like this should go to SB3. sb3-contrib is designed for more experimental features, and something like this would be nice, although we need to figure out if it should be a separate algorithm (for now, as dict/tuple spaces are not supported just yet).
Sorry to bump this old issue, but I can't find any place mentioning the action mask in SB3.
Is this implemented in SB3? And if so, is it reserved to discrete and multi-discrete spaces, or can I apply it to Box action spaces as well?
There is an active PR for dictionary observations for stable-baselines3, which will likely be merged in the near future. This feature could then be used to create action-space masking the way described in the quote, but out of the box it would only work for discrete spaces. Continuous spaces need more thinking, but could be included in the contrib package.
@Remideza Once the PR above gets merged, I'll go ahead and make a similar PR for SB3 using dictionary observations and the masking method outlined in the preprint paper above. It will be in sb3-contrib, however, as the feature is still experimental at best (the paper mentioned is not peer reviewed, to my knowledge).
Closing this PR in favor of an SB3/SB3-contrib implementation. Big thanks to all participants in this discussion for the interesting comments and arguments! :)
This is about a month overdue; I'll go through some lines below and add comments.
Right now, a number of tests don't pass, but this is per @araffin's request to do a draft PR.
closes #351