Commit

In the early training stages, the example generation may produce out-of-bound tokens; make sure tiktoken doesn't see them.
IggShaman committed Aug 26, 2024
1 parent 6104ab1 commit 301de76
Showing 1 changed file with 5 additions and 0 deletions.
5 changes: 5 additions & 0 deletions train_gpt2.py
@@ -473,6 +473,11 @@ def get_lr(it):
 xcol = torch.gather(topk_indices, -1, ix) # (B, 1)
 # append to the sequence
 xgen = torch.cat((xgen, xcol), dim=1)
+
+# The model may generate a token id that is out of bounds, making tiktoken panic
+# in decode(..). A quick fix is to replace them with the EOT.
+xgen[xgen >= enc.max_token_value + 1] = enc.eot_token
+
 # print the generated text
 for i in range(num_return_sequences):
     tokens = xgen[i, :max_length].tolist()
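The change in this commit masks the tensor in place before decoding. As a rough illustration of the same idea, here is a minimal plain-Python sketch that applies the replacement to the token list obtained from `.tolist()`. The helper `clamp_to_vocab` and the two constants are hypothetical names, not part of the commit. The sketch assumes the GPT-2 encoding (50257 tokens, with `<|endoftext|>` at id 50256) and assumes that out-of-range ids appear because the model's output layer is wider than the tokenizer's vocabulary.

```python
# Hypothetical constants for illustration. In train_gpt2.py the equivalent
# values would come from tiktoken: enc.max_token_value + 1 and enc.eot_token.
GPT2_N_VOCAB = 50257  # valid GPT-2 token ids are 0..50256
GPT2_EOT = 50256      # id of <|endoftext|> in the GPT-2 encoding


def clamp_to_vocab(tokens, n_vocab=GPT2_N_VOCAB, eot_token=GPT2_EOT):
    """Return a copy of tokens with every id >= n_vocab replaced by eot_token."""
    return [t if t < n_vocab else eot_token for t in tokens]


# 50257 and 50303 stand in for ids sampled from padded output slots
# (an assumption about where such ids come from).
sample = [15496, 50257, 11, 50303, 50256]
cleaned = clamp_to_vocab(sample)
print(cleaned)
```

Replacing the stray ids with the end-of-text token keeps the sequence length unchanged, so slicing with `:max_length` behaves as before. Dropping the ids instead would also avoid the decode failure, at the cost of shorter samples.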
