
RFC-0016: Masked reductions and normalizations #27

Open · wants to merge 1 commit into master
Conversation

@cpuhrsch cpuhrsch commented Aug 26, 2021

This RFC discusses the semantics and implementation details of masked reduction and normalization operators.

Rendered


ezyang commented Aug 27, 2021

Feels like it should be prototypable with __torch_function__ (or maybe __torch_dispatch__?)
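For illustration only (not part of the thread): a minimal sketch of how such a prototype could intercept torch.sum through __torch_function__. The MaskedTensor wrapper and its behavior here are assumptions, not an agreed-upon design.

```python
import torch

class MaskedTensor:
    """Toy pairing of a dense tensor with a boolean mask (True = valid)."""
    def __init__(self, data, mask):
        self.data, self.mask = data, mask

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch.sum:
            mt = args[0]
            # Zero out invalid entries so they cannot contribute to the sum.
            filled = mt.data.masked_fill(~mt.mask, 0)
            return torch.sum(filled, *args[1:], **kwargs)
        return NotImplemented

mt = MaskedTensor(torch.tensor([1., float('inf'), 3.]),
                  torch.tensor([True, False, True]))
print(torch.sum(mt))  # tensor(4.)
```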


**Input types:** A mask is a boolean tensor. It accompanies a dense Tensor of the same shape. If an entry is True the corresponding element at the same index in the paired dense Tensor is a "valid" value. If it is False it is not. "valid" here means that this value is meant to be included in the computation and otherwise is meant to be ignored. This matches the semantics of masked_scatter, masked_select and masked_fill.

**Fully masked rows:** If a slice (e.g. row) is fully masked out there is no guarantee the corresponding return values are filled with any specific value such as the operation's identity value. However, given a sparse input with a row entirely zero and masked out the result is likely to be zero to maximize memory savings.


Trying to decide if this lack of guarantee is a problem wrt the MHA NaN gradient issue for fully masked-out rows. Say we output non-zero values in this case: do we put the responsibility on the user to avoid using those values in the loss calculation?

Edit: I realize this applies more to softmax (i.e. normalizations) but it looks like there's a similar statement below.
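For context (not from the thread itself): the failure mode being referenced is that softmax over a fully masked row, i.e. a row of all -inf scores, produces NaN, for example:

```
>>> import torch
>>> scores = torch.full((3,), float('-inf'))  # a fully masked-out attention row
>>> torch.softmax(scores, dim=0)
tensor([nan, nan, nan])
```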


@cpuhrsch cpuhrsch Aug 27, 2021


If an element is fully masked out, its value shouldn't matter. If the user wants to use the values of those elements in subsequent calculations, they can use the mask to fill them with the values they want. But for the purposes of our functions the user has told us that the values of those elements don't matter, so we can change them if we want.

If there are use cases that require those values to be untouched, we can add that as a feature later on. I imagine it's more difficult to write a kernel that does this with the same level of efficiency, at least for some functions.
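As a hedged illustration of that point (masked_sum is the proposed operator; the masked_fill one-liner below is only a stand-in for it): a caller who does care about fully masked slices can always overwrite them afterwards using the mask.

```python
import torch

x = torch.randn(2, 3)
mask = torch.tensor([[True, True, False],
                     [False, False, False]])  # second row is fully masked out

# Stand-in for the proposed masked_sum: fill invalid entries with the identity.
out = x.masked_fill(~mask, 0).sum(dim=1)

# The caller can enforce a specific value for fully masked rows themselves.
out = out.masked_fill(~mask.any(dim=1), 0.)
```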

@cpuhrsch

@ezyang - does this mean you'd prefer to see this released and packaged out of tree first before considering inclusion in the core?

ezyang commented Aug 30, 2021

> does this mean you'd prefer to see this released and packaged out of tree first before considering inclusion in the core?

Not necessarily; I'm referring to this part of the spec:

> Indeed the best way to describe the behavior is to implement it. Please note that this is only meant to describe semantics and is not an actual implementation.

It wouldn't be a long step from there to an executable specification that people can play around with.

@mruberry

Since the nan* reductions, like nansum, are existing masked reductions we should be sure the semantics are equivalent. This proposal just allows the mask to be specified directly rather than by value. Supporting more general value-based masking might be interesting in the future, too.

cc @heitorschueroff


cpuhrsch commented Aug 30, 2021

@ezyang - agreed, I'm wondering whether or when we should create an out-of-tree Python-only prototype for a MaskedTensor.

@mruberry - you can always get a value-based (let's say 4) mask by e.g. masked_sum(input, input != 4), or masked_sum(input, ~(input != input)) for nan.


pearu commented Aug 30, 2021

> masked_sum(input, ~(input != input)) for nan.

Nit: masked_sum(input, input == input) would work for the nan case as well.
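To make the equivalence with the existing nan* reductions concrete (a sketch only; masked_fill stands in for the proposed masked_sum):

```python
import torch

x = torch.tensor([1., float('nan'), 3.])
mask = x == x                                 # True exactly where x is not NaN

masked = x.masked_fill(~mask, 0).sum()        # stand-in for masked_sum(x, mask)
assert torch.equal(masked, torch.nansum(x))   # both give tensor(4.)
```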

ezyang commented Aug 30, 2021

> agreed, I'm wondering whether or when we should create an out-of-tree Python-only prototype for a MaskedTensor.

If it's just one person, probably sticking it in a colab is good enough. Multiple people wanting to work on the semantics ~> put it in GitHub somewhere.


@pearu pearu left a comment


I have two suggestions:

  • re the mask definition, which mismatches the one from numpy.ma
  • re the API of masked operations, to match the requirements of ReductionOpInfo



```
def masked_sum(input, dim, keepdim, dtype, mask):
```


Notice that https://github.com/pytorch/pytorch/blob/72274e2a2fd55019ec860e1743dbdc5b0c5a5624/torch/testing/_internal/common_methods_invocations.py#L860-L865 defines:

    An operator is a reduction operator if it reduces one or more dimensions of
    the input tensor to a single value. Reduction operators must implement the
    following signature:
    - `op(input, *args, *, dim=None, keepdim=False, **kwargs) -> Tensor`

Following this definition, I suggest using

```
def masked_sum(input, mask=None, *, dim=None, keepdim=False, dtype=None): ...
```
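A minimal sketch of what that signature could look like in practice (illustration only; the masked_fill-based body is a stand-in, not a proposed kernel):

```python
import torch

def masked_sum(input, mask=None, *, dim=None, keepdim=False, dtype=None):
    # A missing mask means every element is valid.
    if mask is not None:
        input = input.masked_fill(~mask, 0)
    if dim is None:
        return input.sum(dtype=dtype)
    return input.sum(dim=dim, keepdim=keepdim, dtype=dtype)
```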

@cpuhrsch cpuhrsch (Author)

Yes, absolutely agreed. I skipped implementing all the overloads and default values for brevity in this RFC.


**Operator constraints and general signature - Reductions**

**Input types:** A mask is a boolean tensor. It accompanies a dense Tensor of the same shape. If an entry is True the corresponding element at the same index in the paired dense Tensor is a "valid" value. If it is False it is not. "valid" here means that this value is meant to be included in the computation and otherwise is meant to be ignored. This matches the semantics of masked_scatter, masked_select and masked_fill.

This definition of mask mismatches the definition of numpy.ma mask:

> When an element of the mask is False, the corresponding element of the associated array is valid
> and is said to be unmasked. When an element of the mask is True, the corresponding element of
> the associated array is said to be masked (invalid).

Possible solutions:

  1. Adjust the mask definition to match the one in numpy.ma
  2. Rename "mask" to "valid"
  3. Document the mismatch with numpy.ma

I would vote for 1 because the mismatch with the three non-arithmetic ops (masked_scatter, ...) would not be as bad as the mismatch with the arithmetic ops of the more widely used numpy.ma module, IMHO.
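For reference (not from the thread), the inversion being discussed, with numpy.ma treating True as invalid:

```python
import numpy as np

data = np.array([1., 2., 3.])

# numpy.ma: True marks an element as masked, i.e. invalid.
ma = np.ma.masked_array(data, mask=[False, True, False])
print(ma.sum())                   # 4.0 -- the masked element is ignored

# This RFC: True marks an element as valid, i.e. the logical inverse.
rfc_mask = np.array([True, False, True])
print(data[rfc_mask].sum())       # 4.0 -- same selection, inverted convention
```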

@pearu pearu

I will withdraw my vote for option 1 and suggest defining:

A mask is a boolean tensor of selection.


@cpuhrsch cpuhrsch Sep 8, 2021


I think the current definition of mask adheres to "A mask is a boolean tensor of selection", if I understand you correctly?

Interestingly enough, our MHA implementation uses True to indicate an element is meant to be ignored.

But like you mention, the other ops use the inverse of this definition.

@pearu pearu

> I think the current definition of mask adheres to "A mask is a boolean tensor of selection", if I understand you correctly?

Yes.

> Interestingly enough, our MHA implementation uses True to indicate an element is meant to be ignored.

I guess it boils down to not assuming that "mask" is a uniquely defined concept; the exact meaning of the mask must be documented in each function's docstring.

> But like you mention, the other ops use the inverse of this definition.

In addition, https://numpy.org/doc/stable/reference/generated/numpy.sum.html uses the where argument as a "mask of selection".
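Concretely (an illustration, not from the thread), numpy's where= already follows the "True means include" convention proposed here:

```python
import numpy as np

x = np.array([1., 2., 3.])

# where=: True selects the elements to include, matching this RFC's mask semantics.
print(np.sum(x, where=[True, False, True]))  # 4.0
```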

@cpuhrsch cpuhrsch (Author)

Based on our offline discussion with Ralf, I think it's safe to go with the mask semantics described in this RFC.


```
def masked_sum(input, dim, keepdim, dtype, mask):
    return torch.sum(input * mask, dim, keepdim, dtype=dtype)
```
@cpuhrsch cpuhrsch (Author)

NOTE: This actually requires masked_fill, because we have no guarantee that any masked values of input are valid.

```
>>> torch.tensor([float('inf')]) * torch.tensor([False])
tensor([nan])
>>> torch.tensor([float('inf')]).masked_fill(~torch.tensor([False]), 0)
tensor([0.])
```
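A corrected version of the semantics-defining sketch would therefore use masked_fill instead of multiplication (illustration only, following the note above):

```python
import torch

def masked_sum(input, dim, keepdim, dtype, mask):
    # masked_fill, not multiplication: inf/nan at masked-out positions must not leak.
    return torch.sum(input.masked_fill(~mask, 0), dim, keepdim, dtype=dtype)
```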
