
Invalid Action Mask [WIP] #453

Closed
wants to merge 1 commit into from
Conversation

@H-Park commented Aug 23, 2019

This is about a month overdue; I'll go through some lines below and add comments.

Right now, a number of tests don't pass, but this is per @araffin's request to do a draft PR.

closes #351

if states is not None:
td_map[self.train_model.states_ph] = states
td_map[self.train_model.dones_ph] = masks
td_map[self.polyak_model.states_ph] = states
td_map[self.polyak_model.dones_ph] = masks
td_map[self.train_model.action_mask_ph] = action_masks
td_map[self.polyak_model.action_mask_ph] = action_masks
Author

Not a fan of accessing the tensor like this, but I didn't see another way to do it cleanly besides making a member variable for it, which duplicates data.

@@ -344,12 +352,17 @@ def __init__(self, nvec, flat):
"""
self.flat = flat
self.categoricals = list(map(CategoricalProbabilityDistribution, tf.split(flat, nvec, axis=-1)))
self.action_mask_vector = action_mask_vector

def flatparam(self):
return self.flat

def mode(self):
Author

I decided to make the action mask a "layer" in the policy, as opposed to doing it in the distribution, which caused those particular tests to fail since they don't invoke a policy. I like this way a lot more, as it's much cleaner IMO.

Collaborator

yes, sounds good

@araffin changed the title from "begin action mask" to "Invalid Action Mask [WIP]" on Aug 23, 2019

def flatparam(self):
return self.flat

def mode(self):
return tf.cast(tf.stack([p.mode() for p in self.categoricals], axis=-1), tf.int32)
modes = []
for idx, p in enumerate(self.categoricals):
Author

So, I do element-wise multiplication instead of boolean_mask because I don't exactly understand how tf.boolean_mask applies to a batch of actions.

The way I originally had it was tf.boolean_mask(self.logits, self.action_mask_vector[0]), which worked for my env, but I didn't know how it behaved with other envs.

This approach works if you ReLU the output instead of tanh. If you tanh and all your valid actions have negative values, the element-wise multiplication results in an invalid action being selected, because 0 > {all negative-valued valid actions}.
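(For illustration only, not code from this PR: a minimal NumPy sketch of the failure mode described above, plus the large-negative-offset alternative that comes up later in the thread.)

import numpy as np

logits = np.array([-0.9, -0.2, -0.5])   # tanh-style outputs: every valid action is negative
mask = np.array([1.0, 0.0, 1.0])        # action 1 is invalid

# Multiplicative masking: the invalid logit becomes 0, which beats every negative value.
masked_mult = logits * mask
print(np.argmax(masked_mult))           # 1 -> the invalid action wins

# Additive masking with a large negative offset keeps the ordering of the valid actions.
masked_add = logits + (1.0 - mask) * -1e8
print(np.argmax(masked_add))            # 2 -> the best valid action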

@araffin (Collaborator)

I have to think a bit more for that one...

name="action_mask_ph")
elif isinstance(ac_space, spaces.Box):
self._action_mask_ph = tf.placeholder(dtype=tf.float32, shape=(n_batch, ac_space.shape[0]),
name="action_mask_ph")
Author

Pretty self-documenting. Different action spaces call for differently dimensioned action masks.

def action_mask_ph(self):
"""tf.Tensor: placeholder for valid actions, shape (self.n_batch, self.ac_space.n)"""
return self._action_mask_ph

Author

following suit with the rest of the class

@@ -751,3 +786,23 @@ def register_policy(name, policy):
raise ValueError("Error: the name {} is alreay registered for a different policy, will not override."
.format(name))
_policy_registry[sub_class][name] = policy


def create_dummy_action_mask(ac_space, num_samples):
Author

Utility function used throughout to create / reshape the action_mask vectors. Used here and in the various algorithms' _train_step methods.

@H-Park (Author) commented Aug 23, 2019

A number of tests are still failing, and I haven't tested where to apply the mask w.r.t. the softmax for KL divergence, but this ensures everyone in the discussion is on the same page and knows the complete picture.

A lot of my slow progress is due to my apprehension about committing to changes in the dark, without feedback.

@araffin (Collaborator) commented Aug 27, 2019

Looking back at the problem, it seems very similar to the one encountered in NLP with the "Transformer" module (cf this good blog).

Apparently, something slightly different is done:
"Since we want these elements to be zero after the softmax, we set them to minus infinity"

PS: the travis failure is due to a tf.print somewhere
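(A minimal TF1-style sketch of that Transformer-style masking, using a large negative value rather than a literal -inf, since -inf later causes NaN neglogp in this thread; the tensor names and values are illustrative, not the PR's.)

import tensorflow as tf  # TF1-style API, as used by stable-baselines at the time

logits = tf.constant([[0.2, -0.4, 1.3, 0.1]])
action_mask = tf.constant([[1.0, 0.0, 1.0, 0.0]])  # 1 = valid, 0 = invalid

# Push invalid logits to a very large negative value before the softmax so their
# probability is effectively zero.
masked_logits = tf.where(tf.equal(action_mask, 1.0), logits,
                         tf.fill(tf.shape(logits), -1e8))
probs = tf.nn.softmax(masked_logits)

with tf.Session() as sess:
    print(sess.run(probs))  # probability mass lands only on actions 0 and 2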

@@ -170,7 +171,7 @@ def setup_model(self):

self.summary = tf.summary.merge_all()

def _train_step(self, obs, states, rewards, masks, actions, values, update, writer=None):
def _train_step(self, obs, states, rewards, masks, actions, values, update, writer=None, action_masks=None):
Collaborator

minor: missing entry in the docstring

if states is not None:
td_map[self.train_model.states_ph] = states
td_map[self.train_model.dones_ph] = masks

if action_masks is None:
action_masks = create_dummy_action_mask(self.train_model.ac_space, states.shape[0] * self.n_steps)
Collaborator

why not use a placeholder with a default value here?
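(For context, a sketch of that suggestion using TF1's tf.placeholder_with_default; the shapes and names here are illustrative, not the PR's.)

import tensorflow as tf  # TF1-style API

n_batch, n_actions = 4, 6  # illustrative sizes

# If no mask is fed, the default "all ones" mask marks every action as valid,
# so callers that don't care about masking never need a dummy-mask helper.
action_mask_ph = tf.placeholder_with_default(
    tf.ones((n_batch, n_actions), dtype=tf.float32),
    shape=(n_batch, n_actions),
    name="action_mask_ph",
)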

@@ -319,9 +330,10 @@ def run(self):
"""
mb_obs, mb_rewards, mb_actions, mb_values, mb_dones = [], [], [], [], []
mb_states = self.states
ep_infos = []
ep_infos = [[]]
Collaborator

why change ep_infos instead of using a new variable mb_action_masks?

Author

I guess I tunnel-visioned on the fact that the action_mask was in info. I'll make that change, which will clean up some stuff after runner.run and before _train_step.

@@ -158,6 +158,11 @@ def linear(input_tensor, scope, n_hidden, *, init_scale=1.0, init_bias=0.0):
return tf.matmul(input_tensor, weight) + bias


def action_mask(input_tensor, mask_tensor, scope):
Collaborator

I wouldn't call that function action_mask because this is a bit confusing, maybe apply_mask?

Collaborator

  • missing docstring

stable_baselines/common/distributions.py (outdated, resolved)

action_mask = np.reshape(action_mask, (num_samples, ac_space.nvec[0], ac_space.nvec[1]), dtype=np.float32)
elif isinstance(ac_space, spaces.Discrete) or isinstance(ac_space, spaces.MultiBinary):
action_mask = np.reshape(action_mask, (num_samples, ac_space.n), dtype=np.float32)
elif isinstance(ac_space, spaces.Box):
Collaborator

does masking make sense for continuous actions? or is it only to avoid errors?

Author

It could make sense. Say the agent is controlling an n-dimensional robotic arm. This could be used to prevent collisions with itself.

But it is mostly to prevent errors, as the step methods now require action_mask to be a variable in the graph. I figured, why not allow an action mask for continuous spaces; it doesn't add any additional overhead.

Collaborator

Say the agent is controlling an n-dimensional robotic arm

Setting the action to zero does not prevent the arm from moving... (if the output is the position of the arm)

it is to prevent errors as the step methods now require action_mask to be a variable in the graph

Maybe this can be solved using a placeholder with a default value (so you don't have to feed the graph with a value for that tensor).

@H-Park (Author) commented Aug 28, 2019

I removed the tf.print and it made 50 more tests pass. TRPO and PPO1 are not currently supported, so those tests are still failing as expected.

I am still trying to figure out how to replace the elements in input_tensor that are zero in action_mask. I'm sure there is a way to do this with tf.where and tf.cond, but I haven't figured it out yet.
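(One way to do that replacement with tf.where alone, no tf.cond needed; a sketch, not the code that ended up in the PR.)

import tensorflow as tf  # TF1-style API

input_tensor = tf.constant([[0.3, -0.8, 0.5]])
action_mask = tf.constant([[1.0, 0.0, 1.0]])

# Wherever the mask is zero, replace the element with a large negative value;
# everywhere else, keep the original value.
masked = tf.where(tf.equal(action_mask, 0.0),
                  tf.fill(tf.shape(input_tensor), -1e8),
                  input_tensor)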

@araffin (Collaborator) commented Aug 29, 2019

ok, I think there was a misunderstanding: the Transformer was just a remark, we can continue with the normal masking idea.
I left other comments before (especially the placeholder with default) that look more important to tackle for me ;)

@H-Park (Author) commented Aug 30, 2019

There is one stray distribution test that is failing which is caused by explicit usage of the action_mask vector:

        modes = []
        for idx, p in enumerate(self.categoricals):
            p.logits = tf.multiply(p.logits, self.action_mask_vector[:, idx])
            modes.append(p.mode())
        return tf.cast(tf.stack(modes, axis=-1), tf.int32)

Is logits the output of proba_distribution_from_latent? If so, then the mask operation is being repeated for no reason.

@ChengYen-Tang

I currently use the PPO2 algorithm.
How do I use an action mask in @H-Park's repository version? 🥺

@H-Park (Author) commented Sep 10, 2019

Have your environment return a 'valid_actions' array of 1's and 0's in the info dict in the step method.
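(A minimal gym-style sketch of what that looks like; the environment itself is made up, only the 'valid_actions' info key comes from this thread.)

import gym
import numpy as np
from gym import spaces

class MaskedToyEnv(gym.Env):
    """Toy env that reports which actions are currently valid via the info dict."""

    def __init__(self):
        self.action_space = spaces.Discrete(4)
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(3,), dtype=np.float32)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):
        obs = self.observation_space.sample()
        reward, done = 0.0, False
        # 1 = valid, 0 = invalid; here only actions 0 and 3 are currently allowed.
        info = {"valid_actions": np.array([1, 0, 0, 1], dtype=np.float32)}
        return obs, reward, done, info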

@ChengYen-Tang commented Sep 11, 2019

Hi @H-Park,
Thank you for your reply to my question.
My action_space is gym.spaces.MultiDiscrete([4]).
If my next valid actions are action 0 and action 3, then I have to return 'valid_actions' [[1,0,0,1]] in the info dict in the step method, right?

@H-Park (Author) commented Sep 11, 2019

Yes, that is correct.

@ChengYen-Tang commented Sep 12, 2019

My action_space is gym.spaces.MultiDiscrete([4]) and my policy network is MlpLnLstmPolicy.
I get this error. 😭

Traceback (most recent call last):
  File "train.py", line 180, in <module>
    model = PPO2(MlpLnLstmPolicy, train_env, verbose=0, nminibatches=1, tensorboard_log=tensorboard_folder, **model_params)
  File "/home/selab/stable-baselines/stable_baselines/ppo2/ppo2.py", line 102, in __init__
    self.setup_model()
  File "/home/selab/stable-baselines/stable_baselines/ppo2/ppo2.py", line 135, in setup_model
    n_batch_step, reuse=False, **self.policy_kwargs)
  File "/home/selab/stable-baselines/stable_baselines/common/policies.py", line 738, in __init__
    layer_norm=True, feature_extraction="mlp", **_kwargs)
  File "/home/selab/stable-baselines/stable_baselines/common/policies.py", line 422, in __init__
    scale=(feature_extraction == "cnn"))
  File "/home/selab/stable-baselines/stable_baselines/common/policies.py", line 359, in __init__
    n_batch, reuse=reuse, scale=scale)
  File "/home/selab/stable-baselines/stable_baselines/common/policies.py", line 235, in __init__
    scale=scale)
  File "/home/selab/stable-baselines/stable_baselines/common/policies.py", line 125, in __init__
    ac_space.nvec[1]), name="action_mask_ph")
IndexError: index 1 is out of bounds for axis 0 with size 1

@ChengYen-Tang

I removed ac_space.nvec[1], but I found that there are still a lot of invalid actions.

@H-Park (Author) commented Sep 12, 2019

Try it now; I confirmed it works for PPO, A2C, and DQN.

@ChengYen-Tang

Invalid actions still appear in my environment. 😭
I want to confirm one thing: if the valid_actions I return at the end of step 0 is [1,0,1], then I should not receive action 1 in step 1, right?
You said you can run it on PPO; is that PPO1 or PPO2?

@ChengYen-Tang

I wrote a test environment; please try it out.
https://github.com/ChengYen-Tang/TestENV

@H-Park (Author) commented Sep 12, 2019

I'm looking into this now. The out-of-bounds error is happening because you are using MultiDiscrete with only one dimension, which is technically valid; I only handled the case where dim = 2.

@H-Park (Author) commented Sep 12, 2019

The problem is as expected: your valid action values are negative, meaning that when you apply the mask, the invalid actions get set to 0, and 0 > all negative numbers. This is not a problem on your side. I am working on fixing it.

@H-Park (Author) commented Sep 12, 2019

This runs, but diverges quickly because of the neglogp calculation involving -np.inf. I won't have time to figure out a way around that until Sunday.

@ChengYen-Tang

I get the following error.
Traceback (most recent call last):
  File "run.py", line 10, in <module>
    model.learn(total_timesteps = 10000)
  File "/home/selab/stable-baselines/stable_baselines/ppo2/ppo2.py", line 345, in learn
    = runner.run()
  File "/home/selab/stable-baselines/stable_baselines/ppo2/ppo2.py", line 481, in run
    action_mask=action_mask)
  File "/home/selab/stable-baselines/stable_baselines/common/policies.py", line 523, in step
    action_mask = create_dummy_action_mask(self.ac_space, len(state))
  File "/home/selab/stable-baselines/stable_baselines/common/policies.py", line 794, in create_dummy_action_mask
    action_mask = np.ones((num_samples, ac_space.nvec), dtype=np.float32)
  File "/home/selab/.local/lib/python3.6/site-packages/numpy/core/numeric.py", line 214, in ones
    a = empty(shape, dtype, order)
TypeError: only integer scalar arrays can be converted to a scalar index

@H-Park (Author) commented Sep 13, 2019

Added support for MultiDiscrete spaces of any dimension. The models still diverge within a few hundred iterations, though.

Thank you for being my tester! I greatly appreciate it

@H-Park (Author) commented Sep 13, 2019

(which then causes invalid actions to be selected, because all values, valid or not, are NaN/inf)

@ChengYen-Tang

I am also very grateful to you; I am reading your code and learning from it.

@H-Park (Author) commented Sep 14, 2019

Please let me know if I can improve on anything, or if things don't make sense.

I'm on holiday right now. If you have thoughts on how to solve the diverging issue, I'm all ears. I'm going to investigate it in the morning (about 10 hours from now)

@ChengYen-Tang

I was thinking that, if the action mask is added to the observation, it may solve the problem of neglogp blowing up to infinity. In theory, the agent will learn how to read the action mask.
What do you think?

Applying the mask to neglogp would only compute the probability distance over our allowed actions, and the masked actions would not be considered.
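(A rough NumPy sketch of that second idea, computing neglogp over the allowed actions only; this is an illustration, not the PR's implementation.)

import numpy as np

def masked_neglogp(logits, mask, action):
    # Push invalid logits to a large negative value so the softmax, and hence
    # the neglogp, effectively only considers the allowed actions.
    masked_logits = np.where(mask > 0, logits, -1e8)
    z = masked_logits - masked_logits.max()
    probs = np.exp(z) / np.exp(z).sum()
    return -np.log(probs[action])

print(masked_neglogp(np.array([0.3, -0.1, 0.8]), np.array([1, 0, 1]), action=2))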

@H-Park (Author) commented Dec 6, 2019

I was thinking that too. If the masking is done explicitly in the policy (#453 (comment)), which is what I was doing at iteration 1.5 of this project, these problems might evaporate. With masking in the distributions for neglogp, the KL divergence will be a different value, and in practice will be (much) smaller, meaning that PPO updates the policy more liberally because the loss will be greater, per:

[image attachment: ksnip_20191206-104612]

Though, I don't believe there is a guarantee that masking in the policy will in fact be our saving grace. I do believe it will at the very least reduce the learning discrepancies between masking and no masking.

Note: The most definitive way to test this would be to have an environment that knows the worst move at every given state, and run two experiments. No masking, and mask the worst move. Does anyone have such an environment or can think of one?

Quick neglogp test regarding masking: https://gist.github.com/H-Park/b82f23ce8bc31716bff46513ec4ee94c

@ChengYen-Tang commented Dec 6, 2019

Note: The most definitive way to test this would be to have an environment that knows the worst move at every given state, and run two experiments. No masking, and mask the worst move. Does anyone have such an environment or can think of one?

What do you think of this environment? I wrote a simple mouse-walks-a-maze environment. I have not prepared a README in English, but you can use your browser's translation.

If I remember correctly, PPO2 uses the CLIP method?

@kudhru commented Jan 22, 2020

Hi guys, did we come to a solution for this neglogp issue? I have been using the invalid action masks in my application and it seems to train well, but it goes into an infinite loop during prediction because it predicts an invalid action, even though the action mask is correctly set to avoid that action. I checked the neglogp value and it is NaN. Maybe this is what is causing the infinite loop?

@Jeppe-T-K

Hi Guys, did we come to a solution for this neglogp issue? I have been using the invalid action masks in my application and it seems to train well but goes into an infinite loop while prediction because of predicting an invalid action although the action mask is rightly set to avoid that action. I checked the neglogp value and it is nan. May be this is what is causing the infinite loop?

I don't have much to add, but I agree that using a hard action mask like this means that, if the predicted action is actually invalid, it sometimes leads to unexpected behaviour, especially if the action entropy is low/zero (i.e. one action practically has all the probability of being chosen). I have had success with adding the action mask to the observation space instead, which, while not a hard mask, leads to much more stable training.

During evaluation of the agent, I have chosen to use my own policy, where I sample actions using the action probabilities and select the first valid action. While I suppose this PR should practically do the same thing, I had problems using it (maybe because of only a partial action mask).

@H-Park (Author) commented Jan 29, 2020

@kudhru That occurs when the policy diverges.

Honestly, I don't know how to avoid this. I have been comparing two masking techniques (sketched below):

  1. Masking in the policy by providing the action mask as an input in the graph. This makes the KL divergence only really consider the valid actions (as the invalid ones sit at -100, compared to [-1, 1] for the valid actions).

  2. The method currently in this PR: masking after the fact in the distributions.py methods. This does not alter the policy's logits, but samples according to the action mask.
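(A rough NumPy sketch of the difference between the two approaches, with illustrative values; neither snippet is the PR's code. In the first, the masked distribution is the one used for neglogp/KL; in the second, the update is computed against the unmasked distribution while sampling uses the mask.)

import numpy as np

def softmax(x):
    z = x - x.max()
    return np.exp(z) / np.exp(z).sum()

rng = np.random.default_rng(0)
logits = np.array([0.5, -0.3, 0.1])
mask = np.array([1.0, 0.0, 1.0])

# 1. Mask inside the policy: neglogp / KL are computed from the masked distribution.
policy_probs = softmax(np.where(mask > 0, logits, -100.0))

# 2. Mask only in the distribution at sampling time: the distribution used for the
#    update still assigns probability to invalid actions, so neglogp / KL disagree
#    with the distribution that actions were actually drawn from.
update_probs = softmax(logits)
sample_probs = softmax(np.where(mask > 0, logits, -1e8))
action = rng.choice(len(logits), p=sample_probs)

print(policy_probs, update_probs, sample_probs, action)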

Both approaches cause the policy to diverge every so often, so I am not sure where to go from here. It might be better to simply cut our losses and not use PPO for problems that require action masking.

My senior capstone this spring and summer is exploring action masking, so hopefully I'll have some insights into all of this by the end.

@kudhru commented Jan 29, 2020 via email

@H-Park (Author) commented Jan 29, 2020

The working theory about the policy diverging is that it is caused by the discrepancy between the raw policy output (logits), the computed neglogp (which in turn influences the KL divergence, and then in turn the policy update), and the action that is actually sampled.

A possible short-term solution for you might be to use something simpler (i.e. no KL), such as deep Q-learning; however, I do not remember what state I left DQN in for this PR.

@H-Park (Author) commented Mar 14, 2020

I took an internship offer where I'll be continuing this work; however, it will be closed source. I'm unsure what the Stable Baselines team wants to do with this PR, as it is way too experimental to be merged. If I improve the docs on my fork, would you add a link to it in your docs so we can archive this PR?

@Miffyli (Collaborator) commented Mar 14, 2020

Sounds good to me (a link from the stable-baselines docs to your fork, with a link to this PR as well, since there was a good bunch of fruitful discussion here). At this point it is probably easier to add such a link as a separate PR and just close this one. With v3 of stable-baselines around the corner, we are not adding more features to this one anyway :).

Best of luck with your work! Looking forward to seeing what is the correct / less-wrong way of doing this!

@vwxyzjn commented Jul 1, 2020

Hi all, I hope our paper A Closer Look at Invalid Action Masking in Policy Gradient Algorithms (https://arxiv.org/abs/2006.14171) will bring additional insight into the effect of invalid action masking and how it should be implemented :) And here is the corresponding blog post: https://costa.sh/blog-a-closer-look-at-invalid-action-masking-in-policy-gradient-algorithms.html

@H-Park (Author) commented Jul 1, 2020

I saw the blog post on /r/reinforcementlearning earlier today.

For my senior capstone and at work, I've been using a tuple observation space, with one of the observations being the mask, and then the policy explicitly uses it as a gate, so that way the invalid actions are actually invalid from a gradient standpoint.

https://github.com/H-Park/yarll/blob/master/test/test_policies.py#L100

I've been meaning to discuss with @Miffyli whether this feature would be wanted in the rewrite.

@Bensk1 commented Nov 3, 2020

Are there any new developments on how to handle entropy drops with this PR or the rewrite?

For my environment, at some point the entropy drops to NaN and a masked action is chosen:

-----------------------------------
| approxkl           | nan        |
| clipfrac           | 0.20898438 |
| explained_variance | 0.663      |
| fps                | 84         |
| n_updates          | 430        |
| policy_entropy     | nan        |
| policy_loss        | nan        |
| serial_timesteps   | 55040      |
| time_elapsed       | 666        |
| total_timesteps    | 55040      |
| value_loss         | nan        |
-----------------------------------

@H-Park (Author) commented Nov 3, 2020

@Bensk1 Your best bet would be to use SB3 with a custom policy that masks by passing the action mask in as part of the input, in a tuple or dictionary observation. However, I don't know the current state of SB3 and whether it:

  • allows a custom user-defined policy
  • allows tuple and dictionary observations

@Miffyli ^

Because SB3 is on track to support (if it doesn't already) in-policy action masking by offloading it onto the user, I don't see a reason to "finish" this PR, as it would never be as good / encompassing as a user-supplied mask, so it'd be best to leave action masking as "an exercise left to the reader". In reality it's very simple for the Discrete case:

masked_output = obs["mask"] * raw_output

For the MultiDiscrete case, it gets more complicated, as there are multiple ways to mask; one possibility is sketched below.
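(An illustrative NumPy sketch of one such way, with made-up nvec and values: split the flat mask per sub-space and mask each categorical independently.)

import numpy as np

nvec = [3, 4]  # hypothetical MultiDiscrete([3, 4]) -> flat mask of length 7
flat_logits = np.array([0.2, -0.7, 1.1,  0.4, 0.0, -1.3, 0.9])
flat_mask = np.array([1, 1, 0,  0, 1, 1, 0], dtype=np.float32)

# Split per sub-action-space and mask each categorical independently.
splits = np.cumsum(nvec)[:-1]
actions = []
for logits, mask in zip(np.split(flat_logits, splits), np.split(flat_mask, splits)):
    masked = np.where(mask > 0, logits, -1e8)
    actions.append(int(np.argmax(masked)))
print(actions)  # [0, 1]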

@Miffyli (Collaborator) commented Nov 3, 2020

@H-Park
SB3 supports custom policies to a better extent than SB2 (partially thanks to PyTorch), but no tuple/dictionary observations as of yet. Indeed, this feature should be moved to SB3 if anywhere, and if anybody feels like it, it could first be added to sb3-contrib, although such support likely requires modifications to SB3 itself before it works well enough.

@H-Park (Author) commented Nov 3, 2020

I realize this doesn't answer your exact question, but the reality is that there is no literature that explores and definitively compares masking methods from a policy-update standpoint (to my knowledge at least, though I haven't looked in about a year). That is the largest reason this PR was never completed: however we decided to compute entropy would have had no mathematical or experimental backing.

@vwxyzjn commented Nov 3, 2020

@Bensk1 If you are looking for some reference implementation to experiment with SB3, feel free to check out our code at ppo_partial_mask.py; we benchmark this script on some of our microrts environments here: Gym-microrts V1 Benchmark

For simplicity, we just put the action mask in an attribute of the env instead of using a Tuple observation space, as follows. So if you have a gym environment that implements this attribute, you can just run it with our ppo_partial_mask.py.

import gym
import gym_microrts
env = gym.make("MicrortsMining-v1")
/home/costa/Documents/work/go/src/github.com/vwxyzjn/gym-microrts/microrts/maps/10x10/basesWorkers10x10.xml
[Lai.rewardfunction.RewardFunctionInterface;@1ed6993a
env.observation_space
Out[4]: Box(0, 1, (10, 10, 27), int32)
env.action_space
Out[5]: MultiDiscrete([100   6   4   4   4   4   7 100])
obs = env.reset()
# you need to add the action mask to the env's attribute that has the same shape as the flattened action space
env.action_mask.shape
Out[7]: (229,)
sum(env.action_space.nvec)
Out[8]: 229

Also, a side note: I am not sure it is the best idea to put the action mask inside the Tuple observation space @Miffyli @H-Park, because some parts of the action mask might be conditioned on other parts. In the example of microrts, in order to generate an action, you need to select a unit, select its action type, and select its action parameters. We can compute the action mask for the unit selection given the observation, but there is no way to compute the action mask for the action types and parameters until the selected unit is known. This means it is only viable to provide a partial action mask through the Dict observation space.

So we implemented this full action mask with the following code, basically calculating the full action mask via env.get_unit_action_mask(selected_unit) after the source unit has been selected.

# https://github.com/vwxyzjn/gym-microrts/blob/146e6b9255a9799abe4b85889264ba6e6a791b58/experiments/ppo.py#L231-L244
# 1. select source unit based on source unit mask
source_unit_mask = torch.Tensor(np.array(envs.env_method("get_unit_location_mask", player=1)))
multi_categoricals = [CategoricalMasked(logits=split_logits[0], masks=source_unit_mask)]
action_components = [multi_categoricals[0].sample()]
# 2. select action type and parameter section based on the
#    source-unit mask of action type and parameters
source_unit_action_mask = torch.Tensor(
    [envs.env_method("get_unit_action_mask", unit=action_components[0][i], player=1, indices=i)[0]
     for i in range(envs.num_envs)])
split_suam = torch.split(source_unit_action_mask, envs.action_space.nvec.tolist()[1:], dim=1)
multi_categoricals = multi_categoricals + [CategoricalMasked(logits=logits, masks=iam) for (logits, iam) in zip(split_logits[1:], split_suam)]
invalid_action_masks = torch.cat((source_unit_mask, source_unit_action_mask), 1)
action = torch.stack([categorical.sample() for categorical in multi_categoricals])

We actually found this full action mask to be very powerful. With only a partial action mask, it is very sample-inefficient to learn anything, and we had to use frame-skipping to speed up learning. But with the full action mask, the trained agents can actually perform well against existing bots in the full game, as shown here:

[image: Gym-microrts-s-V2-Benchmark]

Maybe in SB3 we should consider something similar to support this kind of conditioned action mask?
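(For readers unfamiliar with the CategoricalMasked distribution used above, here is a PyTorch sketch of the general idea, based on the public gym-microrts code but not necessarily identical to it: invalid logits are pushed to a large negative value, and entropy is summed over valid actions only.)

import torch
from torch.distributions import Categorical

class CategoricalMasked(Categorical):
    def __init__(self, logits, masks):
        self.masks = masks.bool()
        # Invalid actions get a huge negative logit -> ~zero probability.
        logits = torch.where(self.masks, logits, torch.full_like(logits, -1e8))
        super().__init__(logits=logits)

    def entropy(self):
        # Only let valid actions contribute to the entropy term.
        p_log_p = self.logits * self.probs
        p_log_p = torch.where(self.masks, p_log_p, torch.zeros_like(p_log_p))
        return -p_log_p.sum(-1)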

@H-Park (Author) commented Nov 3, 2020

@vwxyzjn Thanks for the guide! I know PySC2 has the mask exposed as a variable. When it's your own environment and policy, you can mask however you want / however makes sense for your particular setup.

@Bensk1 commented Nov 3, 2020

Thanks @H-Park and @vwxyzjn for your replies and guidance! I will definitely check out your ppo_partial_mask.py some time soon.

@nmevenkamp commented Nov 17, 2020

@H-Park after reading through this thread, I found this paper: https://arxiv.org/abs/2006.14171. It empirically demonstrates the dominance of invalid action masking over invalid action penalties, and theoretically justifies it, in terms of scaling to larger action spaces with an increasing number of invalid actions.

Do you have any plans to continue work on this pull request again in view of these findings?

I'm highly interested because I want to explore RL for card / board games, where you typically have only a fraction of the actions available to each agent (e.g. the cards in hand).

@H-Park (Author) commented Nov 19, 2020

@nmevenkamp Thanks for the link, nice find!!! @vwxyzjn I guess I didn't see your edit, nice work!

@Miffyli Given this publication, would an implementation PR of it get merged in SB2 or should it go in SB3?

@Miffyli (Collaborator) commented Nov 20, 2020

@H-Park

SB2 now mainly takes bug fixes and small enhancements; bigger ones like this should go to SB3. sb3-contrib is designed for more experimental features, and something like this would be nice, although we need to figure out if it should be a separate algorithm (for now, as dict/tuple spaces are not supported just yet).

@Remideza

Sorry to bump this old issue, but I can't find any place mentioning the action mask in SB3.

Have your environment return a 'valid_actions' array of 1's and 0's in the info dict in the step method

Is this implemented in SB3? And if so, is it reserved for discrete and multidiscrete spaces, or can I apply it to box action spaces as well?

@Miffyli (Collaborator) commented Dec 22, 2020

Is this implemented in SB3? And if so, is it reserved for discrete and multidiscrete spaces, or can I apply it to box action spaces as well?

There is an active PR for dictionary observations for stable-baselines3, which will likely be merged in the near future. This feature could then be used to create action-space masking the way described in the quote, but out of the box it would only work for discrete spaces. Continuous spaces need more thinking, but could be included in the contrib package.

@H-Park (Author) commented Dec 24, 2020

@Remideza Once the PR above gets merged, I'll go ahead and make a similar PR for SB3 using dictionary observations and the masking method outlined in the preprint above. It will be in sb3-contrib, however, as the feature is still experimental at best (the paper mentioned is not peer-reviewed, to my knowledge).

@Miffyli (Collaborator) commented Jan 4, 2021

Closing this PR in favor of an SB3/SB3-contrib implementation. Big thanks to all participants in this discussion for the interesting comments and arguments! :)

@Miffyli closed this on Jan 4, 2021
Labels
experimental

Successfully merging this pull request may close these issues.

[Feature Request] Invalid Action Mask