Consistency with the original paper? #6
Could you reproduce the result of language modeling in the original paper, or the translation experiments?
No, I could not. Not sure why. Check out this alternate implementation: https://github.com/buriburisuri/ByteNet, though from the results I don't think it is able to reproduce the paper's results either.
Thanks for the great work! I compared the current implementation with the paper and noticed a few discrepancies.
Can you say which sections of the code you are confident align with the paper, and which parts you are uncertain about?
@msevrens |
If you replace the output embedding targets with word embeddings, your code actually translates semantically very well but syntactically poorly. It also overestimates common tokens in translation. I think you can see this pattern when the output targets are character embeddings too: uncommon collections of characters appear to be easily predicted even though the sequences are long and complex, but random injections of common tokens are interspersed among the correctly learned sequences. I don't know how that information can help others in recreating the paper, but there it is.
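To make that concrete, here is a minimal sketch of what swapping a softmax over characters for word-embedding regression targets could look like. Everything here is hypothetical (a frozen pretrained table `emb`, a learned projection `proj`), not the repo's actual code; it just illustrates one mechanism by which nearest-neighbour decoding in embedding space can over-predict frequent tokens.

```python
import numpy as np

def embedding_target_loss(decoder_states, target_ids, emb, proj):
    """Hypothetical loss: regress decoder states onto pretrained word
    embeddings instead of predicting characters with a softmax.

    decoder_states: (T, d_model) decoder outputs
    target_ids:     (T,) indices into the embedding table
    emb:            (V, d_emb) frozen pretrained word embeddings
    proj:           (d_model, d_emb) learned projection
    Returns the mean cosine distance between predictions and targets.
    """
    pred = decoder_states @ proj                      # (T, d_emb)
    tgt = emb[target_ids]                             # (T, d_emb)
    pn = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    tn = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return 1.0 - float(np.mean(np.sum(pn * tn, axis=1)))

def nearest_word(pred_vec, emb):
    """Decode one position: cosine nearest neighbour over the table.
    Frequent words tend to act as hubs in embedding space, so ties and
    near-misses resolve toward them, which is consistent with the
    'random injections of common tokens' observation above."""
    en = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return int(np.argmax(en @ (pred_vec / np.linalg.norm(pred_vec))))
```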
Why is the target embedding going into the decoder? How does that work at evaluation time with no target to feed? |
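For context on this question: during training, decoders of this kind are typically conditioned on the shifted target embeddings (teacher forcing), with the masked/causal convolutions keeping position t blind to positions >= t. At evaluation time there is no target, so the decoder is fed its own previous predictions instead. A minimal greedy-decoding sketch, assuming a hypothetical `decode_step` that runs the decoder stack over the current prefix:

```python
import numpy as np

def greedy_decode(decode_step, start_id, eos_id, max_len=200):
    """Autoregressive evaluation: with no gold target to feed, each step
    conditions on the tokens the model has already produced.

    decode_step(prefix_ids) -> logits over the vocabulary for the next
    token (a hypothetical stand-in for one pass through the decoder).
    """
    ids = [start_id]
    for _ in range(max_len):
        logits = decode_step(np.array(ids))   # shape: (vocab_size,)
        next_id = int(np.argmax(logits))      # greedy; sampling also works
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids[1:]  # drop the start token
```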
Just an update in case you are still following this. I rewrote almost the complete model. I noticed there was a bug in the encoder for the translation model earlier (in the dilated conv1d implementation). I have updated the ops and training is in progress. It looks good, but it will take time to train since I am not using a powerful GPU. The model is also now neater and faster, and it accepts variable-length sentences, unlike the last version.
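For anyone checking their own ops against the same class of bug: the property the decoder's dilated conv1d must preserve is causality, which comes from left-padding by (kernel_size - 1) * dilation. A NumPy reference sketch (not the repo's actual op) that makes the padding explicit:

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """Reference causal dilated 1-D convolution.

    x: (T, in_channels) input sequence
    w: (kernel_size, in_channels, out_channels) filter
    Left-pads by (kernel_size - 1) * dilation so that output[t] depends
    only on x[:t + 1] -- exactly the detail a conv op can silently get
    wrong.
    """
    k, cin, cout = w.shape
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros((pad, cin)), x], axis=0)
    out = np.zeros((x.shape[0], cout))
    for t in range(out.shape[0]):
        for i in range(k):        # taps at t, t - dilation, ..., t - pad
            out[t] += xp[t + i * dilation] @ w[i]
    return out
```

The encoder's variant, by contrast, is non-causal: it pads symmetrically so each position sees source context on both sides.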