Rec design for recurrent definitions / loops #16
Comments
Extending on this, example draft:
Note that we have the principle that the user should not need to think about the automatic optimization (rwth-i6/returnn#573). The axes descriptions and the axes themselves still need to be worked out (see the referenced RETURNN issue above).
We should also work out what masked computation looks like, using the example of a transducer with a SlowRNN and a FastRNN. Again, in this example, what we want:
Example code draft:
This is not so much about
I don't really understand. It sounds like a bug. But anyway, things like this are really irrelevant for the discussion here. We should just fix or extend RETURNN in any way needed.
Exactly like in the pure Python code in the beginning. Which is extremely simple?
How? The user would probably rarely write such code. Most
Well, speaking purely abstractly: RETURNN is generic enough to allow just about any possible construction, or if not, it should be, and should be extended to allow that. So this should never be a concern here in this repo (how it translates to a RETURNN config). The main goal is to have it logical, simple, straight-forward, predictable. How it translates to RETURNN is an implementation detail, to be worked out after we have clarified what design we want. Speaking more specifically now about how we actually do the implementation: I think most aspects were already discussed here (the thread gets too long to easily see that...). There might be some open questions but I think they can be solved easily. I will start working on this once I have finished the generalized self-attention logic in RETURNN.
I just gave you one? This is one such example:
We have. This has multiple state assign calls because you call the same module multiple times, which reuses the same hidden state. Although, maybe this is actually not such a good example. I wonder now. I'm thinking about e.g. the Universal Transformer now, where you would also call the same self-attention layer multiple times. But there you do not want it to use the same hidden state; every call (layer) should still have its own hidden state, just the parameters should be shared. So maybe the idea that a module can have hidden state is not good after all? Again, speaking of PyTorch: PyTorch modules can have that as well (as buffers), but usually it is never done this way. Maybe we should somehow make it more explicit what hidden state is being used? Maybe like this:
Maybe we should also differentiate between buffers and state? Buffers (in general, and also in PyTorch) are intended to not really have influence on the behavior, or at least not on the model behavior. I'm actually not so sure about the typical usage of buffers in PyTorch. But they are definitely not used as hidden state. Hidden state is always passed explicitly.
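For illustration, a minimal PyTorch sketch of that distinction (my own example, not from the discussion above): a buffer is persistent data that is saved and moved with the module but is not trained and is not per-call loop state, while RNN hidden state is passed in and returned explicitly.

```python
import torch
from torch import nn

class WithBuffer(nn.Module):
    def __init__(self):
        super().__init__()
        # A buffer: persistent, not a parameter, not hidden loop state.
        self.register_buffer("running_scale", torch.ones(1))

    def forward(self, x):
        return x * self.running_scale

lstm = nn.LSTM(input_size=8, hidden_size=16)
x = torch.randn(10, 3, 8)        # (time, batch, feature)
out, state = lstm(x)             # hidden state is returned explicitly
out2, state2 = lstm(x, state)    # and passed in explicitly to continue from it
```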
Yes. Actually, in most of the common cases here in returnn-common, it would result in
There are multiple aspects/concepts:
So the model parameters are shared in this example:
In the current proposal, the hidden state is also shared in this example. However, maybe the more common use case is that the user wants to share the parameters but not the hidden state. Again, I'm thinking about the Universal Transformer. But also in most other examples I can think of where parameters are to be shared, you would not want to share the hidden state. I think it should be simple to write code for the common cases, while also allowing for exotic things, while also being straightforward/logical. How would you actually implement this common case, where you want to share parameters but not the hidden state?
I'm not sure if this is still so easy. Or I would rather say no. Note that in PyTorch, this (sharing parameters but not hidden state) would look like:
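As a minimal PyTorch sketch of this case (one nn.LSTM instance, i.e. shared parameters, each application starting from its own fresh state):

```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=8, hidden_size=16)   # one module -> one set of parameters

x1 = torch.randn(10, 3, 8)
x2 = torch.randn(10, 3, 8)

y1, state1 = lstm(x1)   # no state passed -> starts from its own zero state
y2, state2 = lstm(x2)   # same parameters, separate (fresh) hidden state
```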
In PyTorch, sharing parameters and hidden state would look like:
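And a minimal PyTorch sketch of the shared-parameters, shared-hidden-state case (the state returned by the first call is passed into the second call):

```python
import torch
from torch import nn

lstm = nn.LSTM(input_size=8, hidden_size=16)

x1 = torch.randn(10, 3, 8)
x2 = torch.randn(10, 3, 8)

y1, state = lstm(x1)          # first application
y2, state = lstm(x2, state)   # same parameters AND the carried-over hidden state
```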
In both cases, you make the handling of state explicit. So there is no confusion about the behavior of state, because it is always explicit. So I'm wondering if we should also make it always explicit, but still in a generic way. That is why I proposed to make it explicit.

This is all about parameter sharing, i.e. calling the same module multiple times, and how the hidden state should be handled in this case. Because we hide the hidden state away in the current proposal, such a module call has side effects on internal state. Generally speaking, side effects are bad, because they make it more difficult to reason about the behavior of code. Usually you want that a call does not have such hidden side effects.

Btw, this argument is also about module calls in general, not really about the rec loop actually. E.g. this code in the current proposal:
It again would share the parameters (as expected) but also the hidden state (not sure if that is expected, although it follows logically from the current proposal). In PyTorch, the example of sharing the parameters but not the hidden state would look like:
Ok, following from those thoughts, I'm now thinking that we definitely should make it explicit, for all modules which have state (by themselves, or via sub modules). And state is not an attribute of the module but just an explicit parameter and return value of a module call. There can be a reasonable default initial state. The example for sharing parameters but not sharing hidden state would look something like:
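A rough sketch of that idea (illustrative only; Lstm and initial_state are placeholder names, not a settled API): the state is an explicit argument and return value, and a default initial state is available.

```python
lstm = Lstm(n_out=512)                 # one module instance -> shared parameters

# Each usage manages its own state explicitly, so the hidden state is not shared:
y1, s1 = lstm(x1, state=lstm.initial_state())
y2, s2 = lstm(x2, state=lstm.initial_state())

# Sharing the hidden state as well would simply mean passing s1 on:
z, s3 = lstm(x2, state=s1)
```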
Which is very similar to the PyTorch example. From the point of view of the model implementation, it is a bit strange now that there is a conceptual difference between arguments which are state (and thus not normal arguments). So this is a problem. I think the model definition should handle it all the same and just always expect the same kind of values. But then how do we handle this? Maybe more explicitly:
I think this would be very logical and straight-forward now again. And all is explicit and behavior is always clean. Other modules (e.g.
I wonder a bit if this is too complicated now to write. Or maybe like:
This would add a bit of clever handling into the
I was quite busy with all the recent dim tag work (rwth-i6/returnn#577) and generalized self attention (rwth-i6/returnn#391). Generalized self attention is finished now, and the concept of consistent dim tags has been improved and extended a lot. Although there are still on-going discussions on taking this even further (rwth-i6/returnn#597, rwth-i6/returnn#632). Some of these might also be relevant for returnn-common, but we should discuss this independently, maybe in #17.

I lost track a bit of the current state here, and the reasoning. I think the final conclusion on the hidden state was to have it always explicit, as an argument and return value of the module call. I like it that way because there is no ambiguity in the code about how the hidden state is handled. This is all explicit. The initial post description is maybe outdated on this. Do we still need to have the state as an attribute of the module (e.g. ...)? And why do we need this special ...? I can see that the ...

Edit: The last comment just before this actually says:
So as I argued above, we would not have this module attrib anymore (like ...). However, I don't understand this anymore:
Why is there a conceptual difference? Does it need to be? Why? Why can't the ...? Actually, I also addressed this before:
But still, in this comment I keep ... Edit: The first example from above would look like:
The second example from above would look like:
The problem is that this does not exactly correspond to the Python ... But the same problem actually also exists for any other output: anything in the loop which wants to use the value from the previous iteration (see the plain-Python sketch below). This was not really addressed before? Is this not a problem? Or was this solved differently? Edit: Ah, this is actually also in the very first proposal. That is what ... So, to recap:
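To make the "value from the previous iteration" point concrete in plain Python (f, xs and the initial value are placeholders):

```python
ys = []
y = y_initial                 # an initial value is needed, just like explicit state
for x in xs:
    y = f(x, y)               # y on the right-hand side is the previous iteration's y
    ys.append(y)
```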
I still see a problem in catching errors here. When the user writes the code ignoring ..., can we somehow catch such errors to avoid unexpected behavior? Or is this maybe not too much of a problem, as this is actually not too unexpected? Also, do we want that we can also skip the ...? In PyTorch, this is the same behavior though, right? In PyTorch, there is the difference of
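One PyTorch distinction that seems relevant here (my assumption about what is meant): nn.LSTM consumes a whole sequence and handles the per-step state internally, while nn.LSTMCell is a single step, so the user writes the loop and threads the state explicitly.

```python
import torch
from torch import nn

seq = torch.randn(10, 3, 8)                    # (time, batch, feature)

lstm = nn.LSTM(input_size=8, hidden_size=16)
out, (h_n, c_n) = lstm(seq)                    # whole sequence, state handled inside

cell = nn.LSTMCell(input_size=8, hidden_size=16)
h = torch.zeros(3, 16)
c = torch.zeros(3, 16)
outputs = []
for x_t in seq:                                # explicit loop, explicit state
    h, c = cell(x_t, (h, c))
    outputs.append(h)
out2 = torch.stack(outputs)
```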
I'm questioning now whether this explicit code is maybe too complicated for many of the common use cases. On the RETURNN side, the hidden state is hidden and not explicit. So when translating some old-style RETURNN model to this new way, it would take some extra effort (although it should be straightforward). One alternative is to introduce
Then we can use e.g. this code:
This would do param sharing between both. The not-so-nice thing about this is that we clearly differentiate between state and other input now. So it becomes complicated/non-straightforward when the user also wants to pass some custom state (custom
So I just read through everything and I will add my comments. Since it was quite a lot, maybe I also misunderstood something; if so, just correct me.

One of my general questions would be: why do we even do explicit recurrent handling like getting state updates and so on, instead of just "references" to these updates which would be passed to RETURNN for the config? Isn't the actual handling of how the LSTM unit works something that is part of RETURNN, and what we try to achieve in this repo is a nice way of writing it down? From what I understand right now, you are also looking to include additional concepts.

From this I would also conclude my reasoning for deciding between the two variants for the hidden state: I feel like the implicit handling is one of the biggest strengths of RETURNN, even though of course it is a double-edged sword to work with. Not explicitly having to worry about certain details of a layer makes it (at least from my HiWi view) quite a bit easier to start with and to work with. What we should aim for, in my opinion, is an interface which is as simple as possible to "just start", but has the flexibility of allowing stronger configurations once the user is more used to it. So I feel like it would be fine to accept that basic configurations (where in this case you don't do modifications to the internal hidden state logic) are as easy as possible, but if you want to make use of some stronger concepts you would have to go a bit more in depth. The reason why this causes trouble in other cases is when the documentation is not good enough, making users who try to make the transition into more detail feel lost.

Now onto the specific example: I would prefer the second option in the general case. For the more advanced options I would then include an option to get the

But overall I think this is a point where we need to put some thought into it. Maybe we could work out 2 or 3 concrete ways (with 3-4 examples each) and then ask other users about it, because I feel like this is something where user feedback might be meaningful to make a decision. What do you think?
Basically, but not directly. All the discussion here should be seen as independent from RETURNN really, and more about what would be a straightforward design (for people who do not know about RETURNN). It should not be that we adopt some strange thing from RETURNN only because that is how it is now in RETURNN. So, when we think about loops (

The next question is whether we want to allow hidden state, which can be hidden and thus is a separate concept, or whether there should not be a separate concept for hidden state, and it would just be the same as other values from the previous iteration. The cleanliness and straightforwardness of the design is of highest priority here. How this maps to RETURNN in the end is only secondary. We can surely map whatever we come up with, as long as it is well defined. Or if not, we can simply extend RETURNN such that we can. Although for almost everything discussed here, I think that RETURNN already supports it, so no extension or modification on the RETURNN side would be needed.
I don't exactly understand. What do you mean by references to the updates? The argument of explicit vs hidden/implicit is simple: only that way is it straightforward. Hidden/implicit is always problematic. Especially when you want to change it, or have some control over it, it becomes unnatural. As long as you do not want to touch or access the hidden state, it doesn't matter. But as soon as you do, it matters. And there will always be such cases.
We are not changing that. Here we simply discuss how we design the handling of accessing previous values (values from the previous loop iteration) and hidden state, and whether hidden state should be handled differently or just the same as other previous values.
No. No underlying concept is really new. It would still all map to what RETURNN does right now. Just the API is new. This is the whole point here of returnn-common. And for designing the API, we have the freedom to do it as we want. And I think we should try to prioritize cleanliness and straightforwardness.
Many people claim that PyTorch is easier because it is all explicit. Usually nothing is hidden away. When reading other people's code, you rarely would ask yourself what it would actually do, or whether this module has some hidden state, because it is all explicit. Explicitness can result in slightly more code but it is usually still pretty simple and short, and it is easier to follow and reason about because you don't have to think about implicit behavior. Implicit behavior is maybe fine for all the simple cases but once it gets more complex, it can make it really hard to reason about. I spoke with some other people and they all strictly preferred the explicitness.
Yes, simplicity and flexibility are both the main goals of RETURNN, and also here of returnn-common. However, I think you argue for exactly the opposite of what I did before. What does simple mean? Simple does not necessarily mean short code. Simple is about writing code, reading code, and understanding code. It should never be ambiguous, otherwise it is not simple. It should be clear and straightforward. Straightforwardness makes it simple to write and understand. Clearness makes it simple to read. What does flexibility mean? It does not just mean that more complex things are possible. More complex things are always possible. Flexibility also means that more complex things are straightforward to do. Otherwise it is not really flexible, if something is not straightforward or unclear.
You cannot really compensate for a complicated, non-straightforward design by just having better documentation. Treating hidden state as something different from non-hidden state just makes it more complicated, and not straightforward. When you have worked with non-hidden state before, it is not clear or straightforward how to work with hidden state now, when this is a different thing or concept.
I actually asked someone what behavior he would expect from this code:
He expected that the two
Yea, I also thought about getting some more feedback. It's a good idea to prepare some examples. In all cases, what I think is important:
So the different examples could be:
I agree, some of these examples are maybe a bit exotic. But that is my point. It should still be straightforward to do. Otherwise it is not really flexible. In PyTorch, all of these are very simple and straightforward. In current RETURNN (dict-style net def), while all are possible in principle, only the first three are simple, while the others are definitely not, especially not straightforward. I expect and argue that whenever you have it explicit, it becomes straightforward.
Some further thoughts on the handling of state in general in a loop (orthogonal to the discussion of whether hidden state should be a separate concept or not): I'm thinking about the options to simplify that to more canonical, simplified Python, while still not doing too much magic, such that it is still clear what happens.
So, given these options, I tend to prefer
How would the ...? Should we allow usages without defining the initial value or the shape? Maybe. In that case, the code for 2 LSTM layers, sharing params, not sharing hidden state, can look like:
Or for 2 LSTM layers, sharing params, sharing hidden state:
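One possible shape of such code, in the explicit-state style discussed above (Lstm, Loop, unstack and the state keyword are illustrative placeholders, not the drafted API; the first block keeps two separate State objects, the second passes the same one to both calls):

```python
lstm = Lstm(n_out=512)                 # one module -> shared parameters

with Loop() as loop:
    x = loop.unstack(inputs)

    # Sharing params, not sharing hidden state: two separate State objects.
    h1, h2 = State(), State()
    y1 = lstm(x, state=h1)             # lstm reads h1.get() and assigns the update via h1.assign()
    y2 = lstm(y1, state=h2)

    # Sharing params and hidden state: pass the same State object to both calls.
    h = State()
    z1 = lstm(x, state=h)
    z2 = lstm(z1, state=h)

    y = loop.stack(y2)
```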
Or consider this Python code:
Equivalent code here:
How to explicitly specify the initial value, and maybe other things like shape? Maybe this can just be extended, like so:
This would maybe be a bit counterintuitive as
The "current" example at the top already looks quite understandable and straightforward, but I have some comments / questions:
Yes, right. This is basically the discussion here: rwth-i6/returnn#597, rwth-i6/returnn#632 and related. We still did not fully clarify whether we should maybe allow some defaults for cases where it is unique. Or basically, we need to do that anyway for all existing layers to not break backward compatibility. But anyway, this is somewhat orthogonal to the discussion here.
No, I don't think so. It should follow the same principles as everything in RETURNN: it should be as simple as possible, and atomic. You can very easily get this functionality e.g. by putting a causal
Yes, this is #23, but actually, when we have all hidden state explicit now as well, i.e. no distinction anymore, this also becomes pretty straightforward even without any such wrapper. The only reason such a wrapper can be useful is to allow potential further automatic optimizations (as
So, a first version is implemented now. See the test in
Which results in this net dict:
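(Not the actual dict produced by that test, just the general shape of a RETURNN net dict with a rec layer and an explicit subnetwork unit; layer names and options below are only illustrative:)

```python
network = {
    "loop": {
        "class": "rec",
        "from": "data",
        "unit": {
            # "data:source" refers to the current input frame inside the loop
            "linear": {"class": "linear", "from": "data:source",
                       "activation": "tanh", "n_out": 10},
            "output": {"class": "copy", "from": "linear"},
        },
    },
    "output": {"class": "copy", "from": "loop"},
}
```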
(Sorry for the
So I'm closing this now, as we have the initial design implemented. Please open separate issues if something is broken, missing, or whatever.
Just for reference, also
What's still missing is the default interface for all wrapped RETURNN layers with hidden state, which should make the state more explicit, as discussed here. This is #31.
I just found that JAX has quite a similar concept: there you have
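For instance (my example, not necessarily the one meant above), jax.lax.scan makes the loop state an explicit carry that the step function takes and returns:

```python
import jax
import jax.numpy as jnp

def step(carry, x_t):
    # carry is the explicit loop state, threaded through by lax.scan
    new_carry = carry + x_t
    y_t = 2.0 * new_carry
    return new_carry, y_t

xs = jnp.arange(5.0)
final_carry, ys = jax.lax.scan(step, 0.0, xs)
```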
Some update: Instead of using ..., I think this makes it more straight-forward. Example:
This issue is to collect some thoughts on the recurrent loops design, which wraps the RecLayer with an explicit subnetwork in RETURNN.

The main goal is to have this very straight-forward and simple for the user. We can abstract away from the underlying RecLayer if that makes things easier. We can also extend RETURNN itself if needed.

Related is also #6 (rec prev mechanism), and this issue here might fix/resolve #6, although not necessarily.

This also needs some mechanism for unrolling/unstacking, i.e. when we iterate over input x with some time-axis, i.e. to get x[t]. This is rwth-i6/returnn#552.

To define a loop like this pseudo Python code:
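A generic pseudo-Python loop of this kind, with placeholder names f, g and h_initial:

```python
h = h_initial
ys = []
for x_t in xs:            # iterate over the time-axis of the input
    h = f(x_t, h)         # update the (hidden) state
    ys.append(g(h))       # compute an output for this frame
```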
Current design:
There is Loop() which can be used in a with context, which corresponds to the for-loop in the example, or in general to a while-loop. Like:

There is State() which can define hidden state (for any module or any code).

The example above can be written as:

Or with a module as:
For the TF name scopes (and variable scopes), we should follow #25, i.e. make it exactly as the module hierarchy.
The RETURNN layer name of the created RecLayer via Loop does not matter too much. It could be arbitrary, or some clever (but simple) logic to use the first module name or so. The RETURNN layer hierarchy can be independent from the actual TF name scopes (via #25).

Special options for the RecLayer like include_eos can be options for Loop, like Loop(include_eos=True). Or as a method, like loop.set_include_eos(True).

Loop (potential) methods:
- unstack. We need rwth-i6/returnn#552 (RecLayer multiple inputs via explicit unstacking) for this. unstack also implicitly implies that the loop runs over the time-axis of x.
- last
- stack
- idx: to return some layer which wraps RETURNN ':i'

State has methods get and assign. (... See discussion below for more ...)

Current reasonings:
- Why no special base class Rec which derives from Module? We want to easily allow to use any kind of module inside a loop. We think the current API makes this more straight-forward.
- Why is h not an argument of forward, and why State instead? This allows to call other sub modules, which might define their own hidden state. So the root recurrent module does not need to know about all the hidden states of sub modules.
- Why have the hidden state explicit, and not use something closer to self.prev? To make the behavior more straight-forward.
- The current design allows for nested loops and sub modules with hidden state. Only the Loop() call actually introduces a new loop.
- There should not be any special handling needed for the Choice layer. Note that the search flag and train flag logic is a separate thing (#18).
- There should not be any special handling needed whether the input to a rec module call would be inside the current/same loop or not. unstack on some value which is already inside the loop would not make sense, though, and should result in an error. But this would all be covered by RETURNN logic already.
- RETURNN rec automatic optimization should not cause any problems. RETURNN already should guarantee that it is equivalent. From the user's view point, it should never matter whether it is optimized. Otherwise this is rwth-i6/returnn#573. On this returnn-common level, it should not matter.
Example for LSTM for a single step:
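A sketch of how such a single LSTM step could look in the style described above (Module and State with get/assign are from the design above; Linear, concat, split, sigmoid and tanh are assumed helper names, not a fixed API):

```python
class LstmStep(Module):
    def __init__(self, out_dim):
        super().__init__()
        self.ff = Linear(4 * out_dim)   # projects [x, h_prev] to the gate pre-activations
        self.h = State()                # output/hidden state
        self.c = State()                # cell state

    def forward(self, x):
        gates = self.ff(concat(x, self.h.get()))
        i, f, o, g = split(gates, 4)
        c = sigmoid(f) * self.c.get() + sigmoid(i) * tanh(g)
        h = sigmoid(o) * tanh(c)
        self.c.assign(c)
        self.h.assign(h)
        return h
```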