
MaskedComputationLayer is violating the principle that the user should not need to think about rec automatic optimization #769

Closed
albertz opened this issue Nov 25, 2021 · 4 comments · Fixed by #976

Comments

albertz (Member) commented Nov 25, 2021

See RETURNN principles.

When defining a MaskedComputationLayer inside a rec loop, without thinking about rec automatic optimization, the behavior is clear: whenever the mask is False, it copies the previous output (and previous state); otherwise it uses the current output.
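In plain TF, the per-step semantics would be something like this sketch (`masked_step` and `unit` are hypothetical names; the real layer would additionally avoid computing `new_out` at all for masked-out frames when optimized):

```python
import tensorflow as tf

def masked_step(x_t, mask_t, prev_out, prev_state, unit):
    """One step of the rec loop, a sketch of the intended semantics.
    x_t: [B,D_in], mask_t: [B] bool, prev_out: [B,D], prev_state: [B,D]."""
    new_out, new_state = unit(x_t, prev_state)  # the wrapped computation
    # Where the mask is False, keep the previous output and state instead.
    out_t = tf.where(mask_t[:, None], new_out, prev_out)
    state_t = tf.where(mask_t[:, None], new_state, prev_state)
    return out_t, state_t
```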

In this line of thought, there should be no need for UnmaskLayer.

However, the current situation is: when the layer is optimized out of the loop, a tensor of shape [B,D] inside does not just become [T,B,D], where T is the number of rec iterations, but instead becomes [T',B,D] with T' = sum(mask). If some follow-up operation requires the [T,B,D] shape (maybe because it combines the tensor with other tensors from the rec loop), the user must explicitly add an UnmaskLayer at some point after the MaskedComputationLayer.
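For illustration, a sketch of what the user currently has to write (the mask computation and the surrounding layers are made up here; masked_computation and unmask are the actual layer class names):

```python
network = {
    "output": {"class": "rec", "from": "data", "unit": {
        # Some per-frame boolean decision, shape [B] (this particular mask is made up):
        "gate": {"class": "reduce", "mode": "mean", "from": "data:source", "axes": "F"},
        "mask": {"class": "compare", "kind": "greater", "from": "gate", "value": 0.0},
        # Only updated in frames where the mask is True:
        "masked": {
            "class": "masked_computation", "mask": "mask", "from": "data:source",
            "unit": {"class": "rec", "unit": "nativelstm2", "n_out": 512},
        },
        # Inside the loop this is [B,512]; optimized out of the loop it is [T',B,512].
        # Only because of the optimization, an explicit unmask back to [T,B,512] is needed:
        "unmasked": {"class": "unmask", "from": "masked", "mask": "mask"},
        "output": {"class": "copy", "from": "unmasked"},
    }},
}
```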

This violates the RETURNN principle that the user should not need to think about rec automatic optimization.


What are possible solutions?

I don't really have a good idea so far. Whatever the solution is, it should be opaque to the user.

Maybe the new dim tag for T' could carry some special flag marking it as a masked subset of T, and whenever it is going to be combined with the original dim tag, the unmasking would happen automatically.

albertz (Member Author) commented Nov 25, 2021

@Zettelkasten @mmz33 @robin-p-schmitt maybe you have some ideas, or comments?

albertz (Member Author) commented Nov 25, 2021

Note that I already realized this problem some time ago, but it came up again now when thinking about how to specify the output spatial dim (out_spatial_dim, #597) and its exact behavior.

When the layer is not optimized out (e.g. simply with optimize_move_layers_out=False) and you then accumulate the output of the MaskedComputationLayer, you get shape [T,B,D], and never actually the shape [T',B,D].

Should out_spatial_dim then in all cases refer to T', even when T' never actually exists? So basically, out_spatial_dim would be ignored inside the loop, and only be used when the layer is optimized out of the loop?
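For illustration, this is roughly how it might be passed (a sketch following the #597 proposal; SpatialDim is the existing API, while the out_spatial_dim option here is the proposed part):

```python
from returnn.tf.util.data import SpatialDim

masked_time_dim = SpatialDim("masked-time")  # T' = sum(mask)

# Inside the rec unit:
masked_layer = {
    "class": "masked_computation", "from": "data:source", "mask": "mask",
    "out_spatial_dim": masked_time_dim,  # refers to T'; would only matter once optimized out
    "unit": {"class": "linear", "activation": "relu", "n_out": 256},
}
```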

albertz (Member Author) commented Mar 4, 2022

The new dim T' would somehow have a reference to T. derived_from_tag is the obvious candidate, but we need more; specifically, we need the mask, or the indices for unmasking.

It's a bit like when e.g. ConvLayer does downsampling via striding: then T' = T/2 (for stride 2). And this could be formulated as a mask as well.
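E.g. (plain TF sketch):

```python
import tensorflow as tf

time = 10
stride = 2
# The frames a stride-2 downsampling keeps, expressed as a mask over the original T:
mask = tf.equal(tf.range(time) % stride, 0)  # [T] bool: True, False, True, ...
t_prime = tf.reduce_sum(tf.cast(mask, tf.int32))  # T' = ceil(T / stride) = 5
```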

Now we can see the unmasking as a sparse-to-dense operation.

However, this is still ambiguous. In a generic sparse-to-dense operation, we would set 0 for all other frames, but that is rarely what we want here. Specifically, to reproduce the output of a MaskedComputationLayer, we need a very specific behavior, namely that the previous masked frame is copied.

So, when we want automatic unmasking from T' to T whenever such tensors are combined somewhere, e.g. via Data.get_common_data and then Data.copy_compatible_to, we don't just need the mask but also the kind of unmasking, i.e. how to fill the other frames.
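For the MaskedComputationLayer case, this "copy the previous masked frame" unmasking is basically a cumsum + gather, e.g. (plain TF sketch; `unmask_left` is a made-up name, and the handling of frames before the first masked frame is simplified here):

```python
import tensorflow as tf

def unmask_left(packed, mask):
    """packed: [B,T',D] (outputs of the masked frames only), mask: [B,T] bool.
    Returns [B,T,D]: each frame gets the output of the last masked frame at or
    before it. Frames before the first masked frame just repeat the first one
    here; properly they would need the initial output instead."""
    # For each position in T, the index into T' of the most recent masked frame:
    idxs = tf.cumsum(tf.cast(mask, tf.int32), axis=1) - 1  # [B,T], values in [-1, T'-1]
    idxs = tf.maximum(idxs, 0)
    return tf.gather(packed, idxs, batch_dims=1)  # [B,T,D]
```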

albertz (Member Author) commented Mar 4, 2022

We could use the derived_from_op mechanism, via a special op kind "mask", and store the mask in the attribs, along with something like unmask_type="left", to reflect that we want this specific behavior when unmasking.

This could later be extended to be able to represent sparse data via unmask_type="fill" and unmask_fill_value.

The mask itself would be a Data instance. In our case, it would have shape [B,T] with dtype bool. The shape is probably arbitrary in general, although it should contain the original dim T.

Or, as mentioned, instead of a mask, we could also store the indices, i.e. a Data of shape [B,T'] with values pointing into T.
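Putting this together, a sketch with the existing Dim.Op / derived_from_op mechanism (the "mask" op kind, unmask_type and these attribs are all hypothetical; only the mechanism itself exists):

```python
from returnn.tf.util.data import Data, Dim, SpatialDim, batch_dim

time_dim = SpatialDim("time")                # T
masked_time_dim = SpatialDim("masked-time")  # T'
# The mask as discussed above, shape [B,T] with dtype bool:
mask_data = Data("mask", dim_tags=[batch_dim, time_dim], dtype="bool")

# Hypothetical: record how T' was derived from T, incl. everything needed to unmask.
masked_time_dim.derived_from_op = Dim.Op(
    kind="mask", inputs=[time_dim],
    attribs={"mask": mask_data, "unmask_type": "left"})
```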
