Invalid Action Mask [WIP] #453
stable_baselines/acer/acer_simple.py
```python
if states is not None:
    td_map[self.train_model.states_ph] = states
    td_map[self.train_model.dones_ph] = masks
    td_map[self.polyak_model.states_ph] = states
    td_map[self.polyak_model.dones_ph] = masks
td_map[self.train_model.action_mask_ph] = action_masks
td_map[self.polyak_model.action_mask_ph] = action_masks
```
Not a fan of accessing the tensor like this, but I didn't see another way to do it cleanly besides making a member variable for it, which duplicates data.
```diff
@@ -344,12 +352,17 @@ def __init__(self, nvec, flat):
         """
         self.flat = flat
         self.categoricals = list(map(CategoricalProbabilityDistribution, tf.split(flat, nvec, axis=-1)))
+        self.action_mask_vector = action_mask_vector

     def flatparam(self):
         return self.flat

     def mode(self):
```
I decided to make the action mask a "layer" in the policy, as opposed to doing it in the distribution, which caused those particular tests to fail since they don't invoke a policy. I like this way a lot more, as it's much cleaner IMO.
yes, sounds good
```diff
     def flatparam(self):
         return self.flat

     def mode(self):
-        return tf.cast(tf.stack([p.mode() for p in self.categoricals], axis=-1), tf.int32)
+        modes = []
+        for idx, p in enumerate(self.categoricals):
```
So, I do element-wise multiplication instead of boolean_mask because I don't exactly understand how tf.boolean_mask applies to a batch of actions.
The way I originally had it was tf.boolean_mask(self.logits, self.action_mask_vector[0]), which worked for my env, but I didn't know how it behaved with other envs.
This approach will work if you ReLU the output instead of tanh. If you tanh, and all your valid actions have negative values, the element-wise multiplication will result in an invalid action being selected, because 0 > {all negative-valued valid actions}.
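For illustration, a minimal NumPy sketch of that failure mode (the numbers are made up): a multiplicative 0/1 mask breaks as soon as all the valid logits are negative, while adding a large negative constant to the invalid entries does not.

```python
import numpy as np

logits = np.array([-0.5, -1.2, -0.8, 0.3])   # e.g. tanh output; only actions 0 and 1 are valid
mask = np.array([1.0, 1.0, 0.0, 0.0])

# Multiplicative masking: invalid logits become 0, which is larger than every
# valid (negative) logit, so argmax picks an invalid action.
print(np.argmax(logits * mask))                # -> 2 (invalid)

# Additive masking: invalid logits are pushed far below any valid one.
print(np.argmax(logits + (mask - 1.0) * 1e8))  # -> 0 (valid)
```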
I have to think a bit more for that one...
stable_baselines/common/policies.py
```python
                                          name="action_mask_ph")
elif isinstance(ac_space, spaces.Box):
    self._action_mask_ph = tf.placeholder(dtype=tf.float32, shape=(n_batch, ac_space.shape[0]),
                                          name="action_mask_ph")
```
Pretty self-documenting: different action spaces call for differently dimensioned action masks.
```python
def action_mask_ph(self):
    """tf.Tensor: placeholder for valid actions, shape (self.n_batch, self.ac_space.n)"""
    return self._action_mask_ph
```
Following suit with the rest of the class.
stable_baselines/common/policies.py
```diff
@@ -751,3 +786,23 @@ def register_policy(name, policy):
         raise ValueError("Error: the name {} is alreay registered for a different policy, will not override."
                          .format(name))
     _policy_registry[sub_class][name] = policy
+
+
+def create_dummy_action_mask(ac_space, num_samples):
```
Utility functions that are used throughout to create / reshape the action_mask vectors. Used here, and in the various algorithms' _train_steps.
A number of tests are still failing, and I haven't tested where to apply the mask w.r.t. the softmax for the KL divergence, but this ensures everyone in the discussion is on the same page and knows the complete picture. A lot of my slow progress is due to my apprehension about committing to changes in the dark, without feedback.
Looking back at the problem, it seems very similar to the one encountered in NLP with the "Transformer" module (cf. this good blog). Apparently, something slightly different is done there: …
PS: the Travis failure is due to a …
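For reference, the Transformer-style trick referred to above can be sketched in TF 1.x as follows (the variable names here are illustrative, not the PR's): instead of zeroing out the logits, a very large negative value is added to the masked positions before the softmax, so the corresponding probabilities collapse to (near) zero.

```python
import tensorflow as tf  # TF 1.x API


def softmax_with_mask(logits, action_mask, big_neg=-1e9):
    """logits and action_mask both have shape (n_batch, n_actions); mask is 1 for valid, 0 for invalid."""
    masked_logits = logits + (1.0 - action_mask) * big_neg
    return tf.nn.softmax(masked_logits)
```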
stable_baselines/a2c/a2c.py
```diff
@@ -170,7 +171,7 @@ def setup_model(self):

                 self.summary = tf.summary.merge_all()

-    def _train_step(self, obs, states, rewards, masks, actions, values, update, writer=None):
+    def _train_step(self, obs, states, rewards, masks, actions, values, update, writer=None, action_masks=None):
```
minor: missing entry in the docstring
stable_baselines/a2c/a2c.py
```python
if states is not None:
    td_map[self.train_model.states_ph] = states
    td_map[self.train_model.dones_ph] = masks

if action_masks is None:
    action_masks = create_dummy_action_mask(self.train_model.ac_space, states.shape[0] * self.n_steps)
```
Why not use a placeholder with a default value here?
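For context, this is roughly what a placeholder with a default value looks like in TF 1.x; if the mask is never fed, an all-ones ("everything valid") default is used and broadcast against the batch. The shape and names below are illustrative.

```python
import tensorflow as tf  # TF 1.x API

n_actions = 6  # e.g. a hypothetical Discrete(6) action space

# When this placeholder is not fed, the (1, n_actions) default broadcasts
# against the (n_batch, n_actions) logits in element-wise ops.
default_mask = tf.ones(shape=(1, n_actions), dtype=tf.float32)
action_mask_ph = tf.placeholder_with_default(default_mask,
                                             shape=(None, n_actions),
                                             name="action_mask_ph")
```

With something like this, _train_step could skip building a dummy mask when action_masks is None.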
stable_baselines/a2c/a2c.py
```diff
@@ -319,9 +330,10 @@ def run(self):
         """
         mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [], [], [], [], []
         mb_states = self.states
-        ep_infos = []
+        ep_infos = [[]]
```
Why change ep_infos instead of using a new variable mb_action_masks?
I guess I tunnel-visioned on the fact that the action_mask was in info. I'll make that change, which will clean up some stuff after runner.run and before _train_step.
stable_baselines/a2c/utils.py
```diff
@@ -158,6 +158,11 @@ def linear(input_tensor, scope, n_hidden, *, init_scale=1.0, init_bias=0.0):
         return tf.matmul(input_tensor, weight) + bias


+def action_mask(input_tensor, mask_tensor, scope):
```
I wouldn't call that function action_mask, because this is a bit confusing; maybe apply_mask?
- missing docstring
stable_baselines/common/policies.py
```python
    action_mask = np.reshape(action_mask, (num_samples, ac_space.nvec[0], ac_space.nvec[1]), dtype=np.float32)
elif isinstance(ac_space, spaces.Discrete) or isinstance(ac_space, spaces.MultiBinary):
    action_mask = np.reshape(action_mask, (num_samples, ac_space.n), dtype=np.float32)
elif isinstance(ac_space, spaces.Box):
```
Does masking make sense for continuous actions, or is it only to avoid errors?
It could make sense. Say the agent is controlling an n-dimensional robotic arm; this could be used to prevent collisions with itself.
But yes, it is mainly to prevent errors, as the step methods now require action_mask to be a variable in the graph. I figured, why not allow an action mask for continuous spaces; it doesn't add any additional overhead.
> Say the agent is controlling an n-dimensional robotic arm

Setting the action to zero does not prevent the arm from moving... (if the output is the position of the arm)

> it is to prevent errors as the step methods now require action_mask to be a variable in the graph

Maybe this can be solved using a placeholder with a default value (so you don't have to feed the graph with a value for that tensor).
I removed the … I am still trying to figure out how to replace the elements in input_tensor that correspond to a zero in action_mask. I'm sure there is a way to do this with …
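One possible way to do that replacement (a sketch, not necessarily what ended up in the PR) is tf.where with a tensor of large negative values:

```python
import tensorflow as tf  # TF 1.x API


def apply_mask(input_tensor, mask_tensor, big_neg=-1e8):
    """Replace entries of input_tensor whose mask is 0 with a large negative value."""
    neg_fill = big_neg * tf.ones_like(input_tensor)  # same shape and dtype as the input
    return tf.where(tf.equal(mask_tensor, 0.0), neg_fill, input_tensor)
```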
OK, I think there was a misunderstanding: the Transformer was just a remark, we can continue with the normal masking idea.
There is one stray distribution test that is failing, which is caused by explicit usage of the action_mask vector: …
Is logits the output of …?
I currently use the PPO2 algorithm.
Have your environment return a 'valid_actions' array of 1s and 0s in the info dict in the step method.
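As a usage sketch (the environment and masking rule below are made up purely for illustration), the contract looks roughly like this:

```python
import gym
import numpy as np
from gym import spaces


class MaskedEnv(gym.Env):
    """Toy env with 4 discrete actions; action 3 is only valid on even steps."""

    def __init__(self):
        self.action_space = spaces.Discrete(4)
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.observation_space.sample()

    def step(self, action):
        self.t += 1
        valid_actions = np.ones(self.action_space.n, dtype=np.float32)
        if self.t % 2:                      # mask out action 3 on odd steps
            valid_actions[3] = 0.0
        obs = self.observation_space.sample()
        reward, done = 0.0, self.t >= 100
        return obs, reward, done, {"valid_actions": valid_actions}
```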
Hi @H-Park,
Yes, that is correct.
My action_space is gym.spaces.MultiDiscrete([4]) and my policy network is MlpLnLstmPolicy.
I removed ac_space.nvec[1], but I found that there are still a lot of invalid actions.
Try it now; I confirmed it works for PPO, A2C and DQN.
Invalid actions still appear in my environment. 😭
I wrote a test environment; please try it out.
I'm looking into this now. The reason the out-of-bounds error is happening is that you are using MultiDiscrete with only 1 dimension, which technically is valid. I only handled the case where dim = 2.
The problem is as expected: your valid action values are negative, meaning that when you apply the mask, the invalid actions get set to 0, and 0 > all negative numbers. This is not a problem on your side. I am working on fixing it.
This runs, but diverges quickly because of the neglogp calculation involving -np.inf. I won't have time to figure out a way around that until Sunday.
I get the following error: …
Added support for MultiDiscrete action spaces of any dimension. The models still diverge within a few hundred iterations, though. Thank you for being my tester! I greatly appreciate it.
(which then causes invalid actions to be selected, because all values, valid or not, are NaN/inf)
I am also very grateful to you; I am reading your code and learning from it.
Please let me know if I can improve on anything, or if things don't make sense. I'm on holiday right now. If you have thoughts on how to solve the divergence issue, I'm all ears. I'm going to investigate it in the morning (about 10 hours from now).
I was thinking that if the action mask were added to the observation, it might solve the problem of neglogp blowing up to infinity; in theory, the agent would learn how to read the action mask. Applying the mask to neglogp would only compute the probability distance over our allowed actions, and the masked actions would not be considered.
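To make that concrete, here is a small NumPy sketch (made-up numbers) of how the neglogp of a valid action changes once the probability mass is renormalized over the allowed actions only:

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


logits = np.array([1.0, 0.5, -0.2, 2.0])
mask = np.array([1.0, 1.0, 0.0, 1.0])     # action 2 is invalid
action = 0

neglogp_full = -np.log(softmax(logits)[action])                          # over all actions
neglogp_masked = -np.log(softmax(logits + (mask - 1.0) * 1e8)[action])   # over valid actions only

# The masked neglogp is smaller: the invalid action's probability mass is
# redistributed over the valid actions.
print(neglogp_full, neglogp_masked)
```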
I was thinking that too. If the masking is done explicitly in the policy (#453 (comment)) - which I was doing at iteration 1.5 of this project - these problems might evaporate. With masking in the distributions for neglogp, the KL divergence will be a different value, and in practice will be (much) smaller, meaning that PPO updates the policy more liberally, because the loss will be greater, per: …

Though, I don't believe there is a guarantee that masking in the policy will in fact be our saving grace. I do believe it will at the very least reduce the learning discrepancies between masking and no masking.

Note: the most definitive way to test this would be to have an environment that knows the worst move at every given state, and run two experiments: no masking, and masking the worst move. Does anyone have such an environment, or can think of one?

Quick neglogp test regarding masking: https://gist.github.com/H-Park/b82f23ce8bc31716bff46513ec4ee94c
What do you think of this environment? I wrote a simple mouse that walks a maze. I have not prepared a README in English, but you can use your browser's translation. If I remember correctly, PPO2 uses the clipping method?
Hi guys, did we come to a solution for this neglogp issue? I have been using the invalid action masks in my application and it seems to train well, but it goes into an infinite loop during prediction because it predicts an invalid action, even though the action mask is correctly set to avoid that action. I checked the neglogp value and it is …
I don't have much to add, but I agree that using a hard action mask like this means that if the predicted action is actually invalid, it sometimes leads to unexpected behaviour, especially if the action entropy is low/zero (i.e. one action practically has all the probability of being chosen). I have had success with adding the action mask to the observation space instead, which, while not a hard mask, leads to much more stable training. During evaluation of the agent, I use my own policy where I sample actions using the action probabilities and just select the first valid action. While I suppose this PR should practically do the same thing, I had problems using it (maybe because of only a partial action mask).
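A rough NumPy sketch of that evaluation-time fallback (names are illustrative): walk through the actions in order of decreasing predicted probability and return the first one the mask allows.

```python
import numpy as np


def first_valid_action(action_probs, valid_mask):
    """Return the most probable action that the mask marks as valid (1 = valid)."""
    for action in np.argsort(action_probs)[::-1]:   # descending probability
        if valid_mask[action] > 0:
            return action
    raise ValueError("no valid action available")


probs = np.array([0.5, 0.3, 0.15, 0.05])
mask = np.array([0, 1, 1, 0])            # the most probable action (0) is invalid
print(first_valid_action(probs, mask))   # -> 1
```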
@kudhru That occurs when the policy diverges. Honestly, I don't know how to avoid this. I have been comparing two masking techniques:
1. Masking in the policy by providing the action mask as an input in the graph. This makes the KL divergence only really consider the valid actions (as the invalid ones are at -100, compared to [-1, 1] for valid actions).
2. The method currently done in this PR: masking after the fact in the distributions.py methods. This does not alter the policy's logits, but samples according to the action mask.
Both approaches cause the policy to diverge every so often, so I am not sure where to go from here. It might be better to simply cut our losses and not use PPO for problems that require action masking. My senior capstone this spring and summer is exploring action masking, so hopefully I will have some insights into all of this by the end.
Thanks for the insight, @H-Park!
What do you think I should use instead of PPO? I mean, which algorithms are less probable to diverge (with action mask) as compared to PPO?
Thanks
Dhruv
The working theory here is that the policy diverging is caused by the discrepancy between the raw policy output (logits), the computed neglogp (which in turn influences the KL divergence, and then the policy update), and the action that is actually sampled. A possible short-term solution for you might be to use something simpler (i.e. no KL term), such as deep Q-learning; however, I do not remember the state I left DQN in for this PR.
I took an internship offer where I'll be continuing this work; however, it will be closed source. I'm unsure what the Stable Baselines team wants to do with this PR, as it is way too experimental to be merged. If I flesh out the docs on my fork, would you add a link to it in your docs, and we can archive this PR?
Sounds good to me (a link from the stable-baselines docs to your fork, with a link to this PR as well, since there was a good bunch of fruitful discussion here). At this point it is probably easier to add such a link as a separate PR, and just close this one at that point. With v3 of stable-baselines lurking around, we are not including more features in this one anyway :). Best of luck with your work! Looking forward to seeing what the correct / less-wrong way of doing this is!
Hi all, I hope our paper A Closer Look at Invalid Action Masking in Policy Gradient Algorithms (https://arxiv.org/abs/2006.14171) will bring additional insight into the effect of invalid action masking and how it should be implemented :) And here is the corresponding blog post: https://costa.sh/blog-a-closer-look-at-invalid-action-masking-in-policy-gradient-algorithms.html
I saw the blog post on /r/reinforcementlearning earlier today. For my senior capstone and at work, I've been using a tuple observation space, with one of the observations being the mask; the policy then explicitly uses it as a gate, so the invalid actions are actually invalid from a gradient standpoint: https://github.com/H-Park/yarll/blob/master/test/test_policies.py#L100 I've been meaning to discuss with @Miffyli whether it'd be wanted to add this feature to the rewrite.
Are there any new developments on how to handle entropy drops with this PR or the rewrite? For my environment, at some point the entropy drops to …
@Bensk1 Your best bet would be to use SB3 with a custom policy that masks by passing the action mask in as part of the input, in a tuple or dictionary observation. However, I don't know the current state of SB3 and if: …
@Miffyli ^ Because SB3 is on track to support (if they don't already) …
For the …
@H-Park …
I realize this doesn't answer your exact question - but the reality is that there is no literature that explores and definitively compares masking methods from a policy-update standpoint (to my knowledge at least, though I haven't looked in about a year's time). That is the largest reason this PR was never completed: however we decided to compute entropy had no mathematical or experimental backing.
@Bensk1 If you are looking for some reference implementation to experiment with SB3, feel free to check out our code at ppo_partial_mask.py; we benchmark this script on some of our microrts environments here: Gym-microrts V1 Benchmark. For simplicity we just put the action mask in the env's attribute instead of using a …

```python
import gym
import gym_microrts

env = gym.make("MicrortsMining-v1")
/home/costa/Documents/work/go/src/github.com/vwxyzjn/gym-microrts/microrts/maps/10x10/basesWorkers10x10.xml
[Lai.rewardfunction.RewardFunctionInterface;@1ed6993a

env.observation_space
Out[4]: Box(0, 1, (10, 10, 27), int32)
env.action_space
Out[5]: MultiDiscrete([100   6   4   4   4   4   7 100])

obs = env.reset()
# you need to add the action mask to the env's attribute that has the same shape
# as the flattened action space
env.action_mask.shape
Out[7]: (229,)
sum(env.action_space.nvec)
Out[8]: 229
```

Also a side note: I am not sure it is the best idea to put the action mask inside of the … So we implemented this full action mask with the following code, basically calculating the full action mask via …:

```python
# https://github.com/vwxyzjn/gym-microrts/blob/146e6b9255a9799abe4b85889264ba6e6a791b58/experiments/ppo.py#L231-L244
# 1. select source unit based on source unit mask
source_unit_mask = torch.Tensor(np.array(envs.env_method("get_unit_location_mask", player=1)))
multi_categoricals = [CategoricalMasked(logits=split_logits[0], masks=source_unit_mask)]
action_components = [multi_categoricals[0].sample()]
# 2. select action type and parameter section based on the
# source-unit mask of action type and parameters
source_unit_action_mask = torch.Tensor(
    [envs.env_method("get_unit_action_mask", unit=action_components[0][i], player=1, indices=i)[0]
     for i in range(envs.num_envs)])
split_suam = torch.split(source_unit_action_mask, envs.action_space.nvec.tolist()[1:], dim=1)
multi_categoricals = multi_categoricals + [CategoricalMasked(logits=logits, masks=iam)
                                           for (logits, iam) in zip(split_logits[1:], split_suam)]
invalid_action_masks = torch.cat((source_unit_mask, source_unit_action_mask), 1)
action = torch.stack([categorical.sample() for categorical in multi_categoricals])
```

We actually found this full action mask to be very powerful. Using only a partial action mask, it is very sample-inefficient to learn anything, and we had to use frame-skipping to speed up learning. But using the full action mask, the trained agents can actually perform well against existing bots in the full game, as shown here: …

Maybe in SB3 we should consider something similar to support this kind of conditioned action mask?
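For readers unfamiliar with the CategoricalMasked used above: it is, roughly, a Categorical distribution whose invalid logits are replaced with a large negative value before the distribution is built. The sketch below captures that idea under that assumption (it is not a verbatim copy of the linked repo's class); a fuller implementation would typically also adjust the entropy computation so that masked actions do not contribute.

```python
import torch
from torch.distributions.categorical import Categorical


class CategoricalMasked(Categorical):
    """Categorical distribution that gives ~zero probability to masked-out actions."""

    def __init__(self, logits, masks, mask_value=-1e8):
        self.masks = masks.bool()
        # Push the logits of invalid actions to a very large negative value so
        # they are (almost) never sampled and contribute ~0 probability.
        masked_logits = torch.where(self.masks, logits,
                                    torch.full_like(logits, mask_value))
        super().__init__(logits=masked_logits)
```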
@H-Park after reading through this thread, I found this paper: https://arxiv.org/abs/2006.14171. It empirically demonstrates the dominance of, and theoretically justifies, invalid action masking over invalid action penalties in terms of scaling to larger action spaces with an increasing number of invalid actions. Do you have any plans to continue work on this pull request in view of these findings? I'm highly interested because I want to explore RL for card / board games, where you typically have only a fraction of the actions available to each agent (e.g. the cards in hand).
@nmevenkamp Thanks for the link, nice find!!!
@vwxyzjn I guess I didn't see your edit, nice work!
@Miffyli Given this publication, would an implementation PR of it get merged into SB2, or should it go into SB3?
SB2 now mainly takes bug fixes and small enhancements; bigger ones like this should go to SB3. sb3-contrib is designed for more experimental features, and something like this would be nice, although we need to figure out if it should be a separate algorithm (for now, as dict/tuple spaces are not supported just yet).
Sorry to bump this old issue, but I can't find any place mentioning the action mask in SB3.
Is this implemented in SB3? And if so, is it reserved to discrete and multi-discrete spaces, or can I apply it to Box action spaces as well?
There is an active PR for dictionary observations for stable-baselines3, which will likely be merged in the near future. This feature could then be used to create action-space masking the way described in the quote, but out of the box it would only work for discrete spaces. Continuous spaces need more thinking, but could be included in the contrib package.
@Remideza Once the PR above gets merged, I'll go ahead and make a similar PR for SB3 using dictionary observations and the masking method outlined in the preprint paper above. It will be in sb3-contrib, however, as the feature is still experimental at best (the paper mentioned is not peer reviewed, to my knowledge).
Closing this PR in favor of an SB3/SB3-contrib implementation. Big thanks to all participants in this discussion for the interesting comments and arguments! :)
This is about a month overdue; I'll go through some lines below and add comments.
Right now, a number of tests don't pass, but this is per @araffin's request to do a draft PR.
closes #351