Transformer Modules #55
Conversation
Okay, so for now I decided on the following conventions:
For the next point:
I also added a bunch of documentation to the parameters. Let me know if I missed anything or if you would name variables differently.
Yes, all good.
I don't understand this. Why is it not possible?
Note that when we adopt #17, all the dimensions which are currently of type
Btw, as usual, also see the failing tests.
(Branch updated: 0bfb37c to a30633d)
As relative positional encoding is some property deep inside the Transformer (inside the self attention), and also an important mechanism at least for certain tasks, I think we should not ignore this now, because it might have a big influence on our design. Our design should allow making use of relative pos encoding in an easy way.

Or more generally: the current Transformer design should allow for changes inside the core self attention process in an easy way. E.g. also to replace the self attention by other attention types, like sparse attention, using LSH and all that, Linformer, etc. (@Zettelkasten) I think the current design does not really allow this. The way you would currently do that is probably to copy & paste the current Transformer code and replace the self attention by some own custom thing. Which basically shows that it does not allow for that. (Note that this is maybe not always bad. In some situations, this approach could just be fine.)

However, whatever potential generic API we propose here, we should also be careful that it is not too difficult to follow, because it is too abstract, has too many indirections, etc. Or even worse, having some quite abstract API which allows for rel pos encoding, but where it would turn out later that it is not really generic enough to allow other things, so it is really only useful for this one specific thing, and thus pointless (if we want it to be specific, it could just be a flag).

Maybe this is tricky to get right. Or maybe not really possible. But we should at least think about it.
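To make this a bit more concrete, here is a minimal, self-contained sketch (plain Python stand-ins; the names `EncoderLayer` and `self_att` are hypothetical, not the actual returnn_common API) of a pluggable self attention: the encoder layer takes the attention module as an argument, so rel. pos. encoding, sparse/LSH attention, Linformer-style attention etc. could be swapped in without copying the whole Transformer code.

```python
# Hypothetical sketch only; names are placeholders, not the actual returnn_common interface.
from typing import Callable, List, Optional


def default_self_attention(x: List[float]) -> List[float]:
    """Stand-in for the default dot-product self attention (identity here)."""
    return x


class EncoderLayer:
    def __init__(self, *, self_att: Optional[Callable[[List[float]], List[float]]] = None):
        # Fall back to the default attention if nothing is injected.
        self.self_att = self_att or default_self_attention

    def __call__(self, x: List[float]) -> List[float]:
        # Residual connection around the (injected) attention; feed-forward etc. omitted.
        return [a + b for a, b in zip(x, self.self_att(x))]


# Swapping in some custom attention variant is then a constructor argument, not a copy & paste:
layer = EncoderLayer(self_att=lambda x: [0.5 * v for v in x])
print(layer([1.0, 2.0, 3.0]))  # [1.5, 3.0, 4.5]
```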
Trying to phrase it more generally: the Transformer is a stack of multiple components. Depending on the view and features, this number can be a bit different, but in general we have something like a potential
When I hear "Plug and Play", my association is a flexible set of building blocks which can be plugged together in many ways, so basically what you describe in your second variant. But anyway, this is just terminology. I'm not sure if we really need a We should make this a bit more concrete. How do the building blocks look like? What is different from what we already have right now? Because we already have building blocks right now. They are just designed in a kind of hierarchical way. I don't really have a good solution so far. I have some ideas but I'm not exactly happy with them. I think we should think a bit more on this. In any case, yes, there should be a ready-to-use Transformer as well. But this should then just be based on the building blocks, and should not be another separate implementation. Also, this ready-to-use Transformer should in any case have an easy way to use relative positional encoding. Maybe that's not always needed, but this will be often needed. Maybe that should even be the default. Maybe specifically the Transformer XL variant. |
Maybe we should also not overthink this. It doesn't need to be too generic. What we want is an easy way to replace the default self attention inside the Transformer, as this is probably a frequent thing to change. So maybe
Would you hand over an initialized layer which is just called in forward, or would you init it within the modules? Usually I would prefer within the modules, but then the arguments are fixed.
Why would you init within the modules? Of course, we need the ability here to pass in any module with any options. There are many options, like:
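For illustration, a hedged sketch of a few common patterns for passing in such a module; all class and argument names here are made up, not the actual returnn_common API.

```python
# Illustrative only: three patterns to let the caller provide "any module with any options".
from functools import partial
from typing import Any, Callable, Dict, Optional, Type


class SelfAttention:
    """Dummy stand-in for an attention module."""
    def __init__(self, num_heads: int = 8):
        self.num_heads = num_heads

    def __call__(self, x):
        return x


class EncoderLayerA:
    # (a) Pass an already-initialized module; the layer just calls it in forward.
    def __init__(self, self_att: Callable):
        self.self_att = self_att


class EncoderLayerB:
    # (b) Pass the class plus its options; the layer constructs the module itself.
    def __init__(self, self_att_cls: Type = SelfAttention,
                 self_att_opts: Optional[Dict[str, Any]] = None):
        self.self_att = self_att_cls(**(self_att_opts or {}))


class EncoderLayerC:
    # (c) Pass a factory (e.g. functools.partial): options are fixed up front,
    #     but the actual construction still happens inside the layer.
    def __init__(self, self_att_factory: Callable[[], Callable] = SelfAttention):
        self.self_att = self_att_factory()


layer_a = EncoderLayerA(SelfAttention(num_heads=4))
layer_b = EncoderLayerB(self_att_opts=dict(num_heads=4))
layer_c = EncoderLayerC(partial(SelfAttention, num_heads=4))
```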
Decided on your first option for now. Also changed
So in general, what is "missing" for this to be finished is:
Removed all init parts of search for now; only kept it in forward, since we agreed on that API beforehand.
See also this: pytorch/pytorch#67999
I think we should not try to get this into a perfect and ready state here, but just merge it soon and then iterate on it. @Atticus1806 When do you think would be a good time to merge? Do you want to improve anything further before the merge, or just merge as-is?
So, to be honest, I put the extension of this model on hold for now, since this is a full model which might change a number of times in the concrete implementation, depending on how much returnn_common changes until its first full release. While it still has errors in the tests, I think they are related to things outside of this PR. So we could either merge it now and update it when the time comes and it is ready to use, or we could leave this PR as is and update it then. I would be fine with both; if you want to keep the number of open PRs as small as possible, we could merge now.
I think it's easier to merge it now (or with some cleanup, given my recent comments) to more easily allow for changes. Also, most things are actually ready in returnn_common to test this now, although you are right that some things might still change.
I did not really follow the development of returnn_common the last few weeks, since I was working on something else, so from my view what is open is:
Other things can be extended later. I think search and dimension tags might also be extended later, depending on their state right now.
Yes, I think so (without looking at it now).
There is still some open question on the basic design, but in principle it is available. But let's look at that later. Not needed for the merge.
Everything on dimension tags should be clear for the Transformer. They should consistently be used everywhere. There should not be
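As a toy illustration of "consistently use dim tags everywhere": the `Dim` class below is just a self-contained stand-in for RETURNN's dimension tags, not the real API. The point is that the Transformer interface carries dim tag objects instead of plain ints, so axes stay identifiable.

```python
# Toy sketch only; Dim here is a placeholder for RETURNN's dimension tags.
from dataclasses import dataclass


@dataclass(frozen=True)
class Dim:
    description: str
    dimension: int


model_dim = Dim("transformer-model", 512)
ff_dim = Dim("transformer-ff", 2048)


def make_encoder_layer(*, out_dim: Dim = model_dim, ff_dim: Dim = ff_dim):
    # All sizes are passed around as dim tags; converting to a raw int
    # (e.g. out_dim.dimension) only happens where a plain size is really needed.
    return {"out_dim": out_dim, "ff_dim": ff_dim}


print(make_encoder_layer())
```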
(Branch updated: 229f163 to 5e9fe3e)
Okay, so the commit was a bit larger than expected. Updated the dimension tags (and added defaults in the
Just leave it as it is. I will merge and update it. Where do the defaults come from now? This should be documented.
Updated the documentation, linking the paper and adding a remark for
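For illustration only (the actual signature in the PR may differ), documented defaults with a pointer to their source could look like this; the values are those of the base model in "Attention Is All You Need" (Vaswani et al., 2017).

```python
# Hypothetical signature, only to illustrate documenting where defaults come from.
class TransformerEncoder:
    """
    Transformer encoder.

    Default hyperparameters follow the base model of
    "Attention Is All You Need" (Vaswani et al., 2017), https://arxiv.org/abs/1706.03762.
    """

    def __init__(self, *, num_layers: int = 6, model_dim: int = 512,
                 ff_dim: int = 2048, num_heads: int = 8, dropout: float = 0.1):
        self.num_layers = num_layers  # N = 6 encoder layers in the base model
        self.model_dim = model_dim    # d_model = 512
        self.ff_dim = ff_dim          # d_ff = 2048
        self.num_heads = num_heads    # h = 8 attention heads
        self.dropout = dropout        # P_drop = 0.1
```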
(Branch updated: ea1ba64 to 6d09e19)
So I cleaned up a bit. As discussed, we will just merge this now and can then do further improvements.
I think we should rename
Fix #53.
This is a draft for now, since the Attention Modules from #52 need to be implemented for that. Comments on code style, changes etc. are of course already welcome.
Naming of variables and functions is oriented on PyTorch (see the PyTorch documentation).
Note that PyTorch also has masking logic in almost all modules, which was left out for now, since it is used in the Attention Modules in PyTorch. Depending on the structure of `rc` Attention, this can of course be added again.
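For reference, a small example of the PyTorch masking logic referred to above (standard `torch.nn` API): masks are passed to `forward`, not to `__init__`.

```python
# PyTorch masking convention: key_padding_mask marks padded positions per sequence,
# src_mask is an attention mask over (seq, seq), e.g. for causal masking.
import torch
from torch import nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8)  # expects (seq, batch, feature)
src = torch.randn(10, 2, 512)

# key_padding_mask: shape (batch, seq); True marks padded positions.
key_padding_mask = torch.zeros(2, 10, dtype=torch.bool)
key_padding_mask[1, 7:] = True  # second sequence is only 7 frames long

# src_mask: boolean attention mask of shape (seq, seq); True means "not allowed".
causal_mask = torch.triu(torch.ones(10, 10, dtype=torch.bool), diagonal=1)

out = layer(src, src_mask=causal_mask, src_key_padding_mask=key_padding_mask)
print(out.shape)  # torch.Size([10, 2, 512])
```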