update runner doc (#778)
mikekgfb authored and malfet committed Jul 17, 2024
1 parent c43135f commit 001b279
Showing 2 changed files with 12 additions and 0 deletions.
File renamed without changes.
12 changes: 12 additions & 0 deletions parking_lot/unsupported/runner-tokenizer.md
@@ -0,0 +1,12 @@
The SentencePiece tokenizer implementations for Python (developed by
Google) and for C/C++ (developed by Andrej Karpathy) use different
input formats. The Python implementation reads a tokenizer
specification in tokenizer.model format. The C/C++ implementation
reads the tokenizer instructions from a file in tokenizer.bin
format. We include Andrej's SentencePiece converter, which translates a
SentencePiece tokenizer in tokenizer.model format to tokenizer.bin, in
the XXXutilsXXX subdirectory:

```
python3 XXXutilsXXX/tokenizer.py --tokenizer-model=${MODEL_DIR}/tokenizer.model
```
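
To illustrate how a tokenizer.bin file differs from tokenizer.model, here is
a minimal sketch of a binary vocabulary layout. It assumes the layout written
by the llama2.c exporter: a 32-bit maximum token length, followed by one
record per token holding a float32 score, a 32-bit byte length, and the raw
token bytes. The converter shipped here may differ, and the little-endian
byte order and the helper names `write_tokenizer_bin` and
`read_tokenizer_bin` are assumptions made for this example only.

```python
import io
import struct


def write_tokenizer_bin(stream, tokens, scores):
    """Write (token bytes, score) pairs in the ASSUMED tokenizer.bin layout."""
    # Header: length in bytes of the longest token (unsigned 32-bit).
    max_token_length = max(len(token) for token in tokens)
    stream.write(struct.pack("<I", max_token_length))
    for token, score in zip(tokens, scores):
        # Record: float32 score, uint32 byte length, then the token bytes.
        stream.write(struct.pack("<fI", score, len(token)))
        stream.write(token)


def read_tokenizer_bin(stream):
    """Read the ASSUMED layout back until end of file."""
    (max_token_length,) = struct.unpack("<I", stream.read(4))
    vocab = []
    while True:
        header = stream.read(8)
        if not header:
            break  # no more records
        score, length = struct.unpack("<fI", header)
        vocab.append((stream.read(length), score))
    return max_token_length, vocab


if __name__ == "__main__":
    # Round-trip a tiny, made-up vocabulary through an in-memory buffer.
    buffer = io.BytesIO()
    write_tokenizer_bin(buffer, [b"<unk>", b"a", b"ab"], [0.0, -1.0, -2.5])
    buffer.seek(0)
    print(read_tokenizer_bin(buffer))
```

In this sketch the vocabulary size is not stored in the file, so the reader
simply consumes records until the stream ends.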
