feat: Async tokenizer #86
self._cache_dir = cache_dir or _DEFAULT_MODEL_CACHE_DIR

async def __aenter__(self):
    await self._init_tokenizer()
Why are you not returning the tokenizer object from this function?
self._tokenizer = await self._init_tokenizer()
It's not good practice to mutate a variable inside a function and rely on that change being visible outside of it.
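To make the suggestion concrete, here is a minimal sketch of what I have in mind. FakeTokenizer and AsyncTokenizerSketch are hypothetical stand-ins, not the classes in this PR, and the asyncio.sleep(0) is only a placeholder for the real loading work. I also return self from __aenter__ here, which is an assumption about how the context manager is meant to be used.

```python
import asyncio
from typing import List, Optional


class FakeTokenizer:
    """Hypothetical stand-in for the real tokenizer object."""

    def encode(self, text: str) -> List[int]:
        return [ord(ch) for ch in text]


class AsyncTokenizerSketch:
    # Declared on the class so the attribute always exists, even before loading.
    _tokenizer: Optional[FakeTokenizer] = None

    async def _init_tokenizer(self) -> FakeTokenizer:
        # Build and return the object instead of assigning to self in here.
        await asyncio.sleep(0)  # placeholder for the real async loading work
        return FakeTokenizer()

    async def __aenter__(self) -> "AsyncTokenizerSketch":
        # The assignment is now visible at the call site.
        self._tokenizer = await self._init_tokenizer()
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        return None


async def main() -> List[int]:
    async with AsyncTokenizerSketch() as tok:
        assert tok._tokenizer is not None
        return tok._tokenizer.encode("hi")
```

With this shape, a reader of __aenter__ can see where _tokenizer gets its value without opening _init_tokenizer.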
This means you need to declare _tokenizer among the class's attributes, like so:
class AsyncJambaInstructTokenizer(BaseJambaInstructTokenizer, BaseTokenizer):
    _tokenizer: Optional[Tokenizer] = None
    ...
I declared it in BaseJambaInstructTokenizer, because it's a variable that both sync and async classes have
self._id_to_token_map = {i: self._sp.id_to_piece(i) for i in range(self.vocab_size)}
self._token_to_id_map = {self._sp.id_to_piece(i): i for i in range(self.vocab_size)}
self._no_show_tokens = set(
    self._convert_ids_to_tokens([i for i in range(self.vocab_size) if self._sp.IsControl(i)])
)
Aren't all of these lines already executed in the base class?
Nope.
The base class holds all the initializations that don't depend on load_binary, so that both the sync and async classes can use them.
These lines initialize the members that do depend on load_binary (the "heavy" operation, which runs asynchronously in the async class).
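Roughly, the split looks like the sketch below. This is not the PR code: load_binary_stub, BaseSketch, SyncSketch and AsyncSketch are hypothetical names, the vocabulary is made up, and offloading the load with run_in_executor is just one assumed way to keep the heavy call off the event loop.

```python
import asyncio
from typing import Dict, List, Optional


def load_binary_stub(path: str) -> List[str]:
    """Hypothetical stand-in for the heavy load_binary call; returns a vocabulary."""
    return ["<pad>", "a", "b"]


class BaseSketch:
    def __init__(self, model_path: str) -> None:
        # Light initialization only: nothing here needs the loaded model.
        self._model_path = model_path
        self._id_to_token_map: Optional[Dict[int, str]] = None
        self._token_to_id_map: Optional[Dict[str, int]] = None

    def _build_maps(self, vocab: List[str]) -> None:
        # Members that depend on the loaded model are filled in afterwards.
        self._id_to_token_map = dict(enumerate(vocab))
        self._token_to_id_map = {tok: i for i, tok in enumerate(vocab)}


class SyncSketch(BaseSketch):
    def __init__(self, model_path: str) -> None:
        super().__init__(model_path)
        # The sync class can afford to load in the constructor.
        self._build_maps(load_binary_stub(model_path))


class AsyncSketch(BaseSketch):
    async def __aenter__(self) -> "AsyncSketch":
        # Run the blocking load in a worker thread so the event loop stays free.
        loop = asyncio.get_running_loop()
        vocab = await loop.run_in_executor(None, load_binary_stub, self._model_path)
        self._build_maps(vocab)
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        return None


async def demo() -> Optional[Dict[str, int]]:
    async with AsyncSketch("model.bin") as tok:
        return tok._token_to_id_map
```

Both subclasses share the light constructor and the map-building helper; they differ only in when and how the heavy load runs.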
class BaseJurassicTokenizer(ABC):
    _sp: spm.SentencePieceProcessor = None
The type is wrong. If something can be None, then its type is Optional[<object_type>].
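Something along these lines, as a minimal sketch. Processor is a hypothetical stand-in for spm.SentencePieceProcessor so the example runs without the library, and the _require_sp helper is my own illustration, not something that exists in this PR.

```python
from typing import Optional


class Processor:
    """Hypothetical stand-in for spm.SentencePieceProcessor."""


class BaseSketch:
    # None until the model is loaded, so the annotation says so explicitly.
    _sp: Optional[Processor] = None

    def _require_sp(self) -> Processor:
        # Narrow Optional[Processor] to Processor, failing loudly if unset.
        if self._sp is None:
            raise RuntimeError("tokenizer is not initialized")
        return self._sp
```

With the Optional annotation, a type checker will flag any use of _sp that skips the None check.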
LGTM