Only 204 unique tokens (vocabulary size) in enwik8 (transformer-XL example) #163
Describe the bug
When running the transformer-XL example on enwik8, the log reports only 204 unique tokens (a vocabulary size of 204) in the enwik8 training set.
To Reproduce
Steps to reproduce the behavior:
bash ./scripts/run_enwik8_base.sh train
Expected behavior
I am not sure how many unique tokens (what vocabulary size) enwik8 should have, but I would expect it to be much larger.
Logs
Run training...
Experiment dir : LM-TFM-enwik8/20230706-192048
Producing dataset enwik8...
building vocab with min_freq=0, max_size=None
final vocab size 204 from 204 unique tokens
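To double-check the count independently of the training script, here is a minimal sketch. It assumes the preprocessed split lives at data/enwik8/train.txt and that, as in the original transformer-XL preprocessing, the file stores one whitespace-separated token per character; both the path and the format are assumptions, so adjust them to your setup.

```python
# Sketch: count the unique whitespace-separated tokens in the enwik8
# training split. The path and the one-token-per-character format are
# assumptions about the preprocessing, not confirmed from the repo.
from collections import Counter

counts = Counter()
with open("data/enwik8/train.txt", "r", encoding="utf-8") as f:
    for line in f:
        counts.update(line.split())

print(f"unique tokens in the training split: {len(counts)}")
```

If this also prints a number around 204, the vocab builder is faithfully reflecting the preprocessed data rather than truncating the vocabulary.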
Platform
Additional context
Add any other context about the problem here.
Comments
@xptree any ideas on this?