
Fix preprocessing for WMT14 En-De to replicate Scaling NMT paper #203

Merged
merged 1 commit into master from preprocess_wmt_en_de on Jun 28, 2018

Conversation

myleott
Contributor

@myleott myleott commented Jun 28, 2018

  • use newstest2013 for validation instead of splitting the training set
  • apply length filtering before BPE
  • final dataset is ~4.5M sentence pairs
  • confirmed this new dataset gives results on par with the Scaling NMT paper
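The key ordering change above is applying length filtering before BPE. A minimal, hypothetical sketch of that kind of filter (in the spirit of Moses' clean-corpus-n script; the function name, thresholds, and example sentences are illustrative, not taken from this PR):

```python
def filter_pair(src, tgt, min_len=1, max_len=250, max_ratio=1.5):
    """Return True if the (src, tgt) sentence pair passes length filtering.

    Filtering on whitespace tokens *before* BPE avoids dropping pairs
    merely because subword segmentation inflated their token counts.
    """
    slen, tlen = len(src.split()), len(tgt.split())
    # Reject pairs with a sentence outside the allowed length range.
    if not (min_len <= slen <= max_len and min_len <= tlen <= max_len):
        return False
    # Reject pairs whose source/target lengths differ by too large a ratio.
    return slen / tlen <= max_ratio and tlen / slen <= max_ratio

pairs = [
    ("a small example sentence", "ein kleiner Beispielsatz"),
    ("word", "dies ist eine viel zu lange Uebersetzung fuer ein Wort"),
]
# Keep only pairs that pass the filter; the second pair's 1:10 length
# ratio exceeds max_ratio, so it is dropped.
kept = [p for p in pairs if filter_pair(*p)]
```

Filtering would then be followed by learning and applying BPE on the surviving pairs only.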

@myleott myleott requested a review from edunov June 28, 2018 17:04
@myleott myleott merged commit a75c309 into master Jun 28, 2018
@myleott myleott deleted the preprocess_wmt_en_de branch June 28, 2018 18:19
myleott pushed a commit that referenced this pull request Aug 28, 2018
moussaKam pushed a commit to moussaKam/language-adaptive-pretraining that referenced this pull request Sep 29, 2020